fileserver locking up

Stephen Joyce stephen@physics.unc.edu
Thu, 1 Feb 2001 09:20:51 -0500 (EST)


Brent,

Back when we were running AFS 3.5 3.17, we had almost exactly the same
problem.  Vos move commands would fail with "possible communications
failure", especially if several were occurring at once.  We also had
udpInOverflows and udpInCksumErrs as symptoms (revealed by netstat).
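
A rough sketch of how one might watch those counters over time, in case it
is useful. It assumes the "name = value" layout that Solaris netstat -s
prints; the function names and the sample text are made up for
illustration and are not part of AFS or netstat.

```python
import re

# Matches "counterName = 123" pairs; several may appear on one line.
COUNTER_RE = re.compile(r"(\w+)\s*=\s*(\d+)")

# Counters mentioned in this thread as symptoms.
WATCHED = ("udpInOverflows", "udpInCksumErrs", "udpNoPorts")


def parse_counters(netstat_text):
    """Return {counter_name: value} for every 'name = value' pair found."""
    return {name: int(value) for name, value in COUNTER_RE.findall(netstat_text)}


def growth(before_text, after_text, names=WATCHED):
    """Return how much each watched counter grew between two samples.

    Counters missing from either sample are skipped.
    """
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {n: after[n] - before[n] for n in names if n in before and n in after}
```

Feed it two saved copies of netstat -s output taken a few minutes apart
during a batch of vos moves; steadily growing udpInOverflows would suggest
the UDP receive buffer is too small for the load.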

I opened a ticket with Transarc, and the solution for us was to add
-nojumbo -udpsize 262144 to our volserver process (configured via bos)
and restart.

They also directed me to www.transarc.com/Support/afs/news/fstuning.html
for more info (it explains what those options do so you can tweak more if
needed).

FWIW, we've been running AFS 3.6 2.3 without adding those options to the
volserver and without seeing the problems we had before; perhaps it was
fixed in 3.6?

Cheers,
Stephen
--
Stephen Joyce
Systems Administrator                                            P A N I C
Physics & Astronomy Department                         Physics & Astronomy
University of North Carolina at Chapel Hill         Network Infrastructure
voice: (919) 962-7214                                        and Computing
fax: (919) 962-0480                               http://www.panic.unc.edu

On Wed, 31 Jan 2001, Brent Johnson wrote:

> Hello,
> 
> I'm running a cell with fileservers and db servers running solaris 2.6
> and afs v. 3.5 345.  Ever since we upgraded to 3.5 (about nine months
> ago) we've been getting erratic behavior from our fileservers: vos
> commands timeout regularly (every weekday, next to never on weekends or
> after 7pm) with "possible communication failure", at least one (9 total)
> of the fileservers dumps core. Here recently the fileserver and
> volserver processes just hung--wouldn't process any commands,
> rxdebug failed, snoop showed no outbound traffic from volserver or
> fileserver, disk usage was nil, and netstat -s output showed
> udpInOverflows and udpNoPorts increasing.
> 
> This ever happen to anybody else?
> 
> -Brent
> 
>