fileserver crash, potential DOS

Mitch Collinsworth
Sat, 31 Mar 2001 01:35:25 -0500 (EST)

Greetings all,

I'm curious whether anyone else has seen this and can maybe add some
useful info.  Earlier this week one of our fileservers began crashing
periodically and today a 2nd one began doing so.  Both are Linux
2.2.16-3 with afs3.6 2.5.  One is SMP, one is not.  Our other servers,
all AIX, have not (yet) experienced this.

The symptoms are that the fileserver processes die or hang in such
a fashion that they still show up in ps, but are no longer doing
anything useful, and bos no longer believes they're running, as
evidenced by its attempts to start a new fileserver.
After running salvager, bos tries to start a new fileserver, which
promptly dies with a message about not being able to initialize RX.
This is of course because the dead fileservers already have the RX
port open.  At this point bos sees the new fileserver die and runs
salvager again, etc, etc, infinitum loopum.  If an admin steps in
and kills the dead/hung fileserver processes by hand, fileserver
will start and run ok until the next crash/hang.  This last step
has become so frequent that we've scripted it.  (ugh!)
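
For illustration only, a cleanup step of this kind might be sketched as
follows.  This is not the actual script; it assumes a ps that accepts
`-eo pid,comm`, that the process name is exactly "fileserver", and that
it is only run once bos has already given up on the old processes.  It
makes no attempt to tell a hung fileserver from a healthy one.

```python
import os
import signal
import subprocess


def stale_pids(ps_output, name="fileserver"):
    """Return PIDs from `ps -eo pid,comm` output whose command equals name."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        # Skip the header row and anything that does not start with a PID.
        if len(fields) == 2 and fields[0].isdigit() and fields[1].strip() == name:
            pids.append(int(fields[0]))
    return pids


def kill_stale(name="fileserver"):
    """Send SIGKILL to every process called `name`; return the PIDs signalled."""
    out = subprocess.run(
        ["ps", "-eo", "pid,comm"], capture_output=True, text=True, check=True
    ).stdout
    pids = stale_pids(out, name)
    for pid in pids:
        os.kill(pid, signal.SIGKILL)  # the hung processes ignore gentler signals (assumption)
    return pids
```

Keeping the parsing in stale_pids() separate from the kill makes it
possible to dry-run the script against saved ps output first.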

In tracking the cause of the crashes, we increased fileserver's
logging level and found the same two messages in FileLog immediately
prior to each crash:

Fri Mar 30 15:08:05 2001 FindClient: authenticating connection: authClass=0
Fri Mar 30 15:08:05 2001 CB: WhoAreYou failed for af2801xx.22811, error 1

(Address obscured to protect the still-presumed innocent.  The hex
address translates into the IP address of the connecting client if
each octet is converted to decimal and the octet order is reversed.)
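
The conversion just described can be sketched in a few lines.  The
function name is made up for this example, and the addresses in the
comments are generic illustrations, not the client from our logs.  An
obscured address such as the one above will not parse until the real
digits are filled in.

```python
def filelog_hex_to_ip(hex_addr):
    """Turn an 8-digit hex address from FileLog into dotted-quad form.

    Each pair of hex digits is one octet; the octets are emitted in
    reverse order, as described in the text above.
    """
    if len(hex_addr) != 8:
        raise ValueError("expected exactly 8 hex digits, got %r" % hex_addr)
    octets = bytes.fromhex(hex_addr)  # raises ValueError on non-hex digits
    return ".".join(str(octet) for octet in reversed(octets))


# Example with a well-known address: "0100007f" is 127.0.0.1.
print(filelog_hex_to_ip("0100007f"))
```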

In each case the logged address was the exact same client.
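
One way to check this across a whole FileLog is to tally the addresses
in the WhoAreYou failures.  The following is a minimal sketch; the
regular expression is modeled only on the single line quoted above, so
treat the exact message format as an assumption, and the function name
is invented for the example.

```python
import re
from collections import Counter

# Pattern modeled on the one FileLog line quoted in this post; other
# builds may word the message differently (assumption).
WHOAREYOU = re.compile(r"CB: WhoAreYou failed for (\w+)\.\d+, error \d+")


def tally_failed_clients(lines):
    """Count WhoAreYou failures per logged client address (left in raw hex)."""
    counts = Counter()
    for line in lines:
        match = WHOAREYOU.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open FileLog to tally_failed_clients() and looking at
most_common() on the result shows whether one address dominates.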

Conclusion: It is evidently possible for a client to (mis-)craft an fs
request in such a way as to knock over Linux AFS fileservers, at least
those running 3.6 2.5.  Potential DOS.

We contacted an admin at the client site (several states away and with
no known affiliation to us).  His best guess was that someone had installed
an AFS client on an NT box with a fully populated CellServDB and clicked
some button that started it walking the AFS filespace.  His netflow logs
confirmed that this machine was talking to various AFS cells all over
the place.  His best guess at client versions is NT SP4 and AFS 3.5
patchlevel 0.  He firewalled the box for a day but it is now on the
loose again.  At this point we have local blocks (ipfw) on our Linux
fileservers preventing them from receiving packets from this address,
but we're still sitting ducks waiting for the next person to come along
and do this.

We've been trying to light a fire under Transarc to sort this out and
fix it, but so far the lack of a core file from the fileserver that
isn't quite dead enough to core has prevented any significant progress
on that front.  We've not been particularly eager to insist that the
offending client be put down, since it is so far the only known way of
reproducing the problem.

By any chance, has anyone else been seeing any of this at their site, and
if so, have you made any progress in solving it?  Is the
fallout any more/less severe with a 3.5 fileserver or an OpenAFS
fileserver?  ipfw logs show the client is still probing at this time.