Problem with Sun machines hanging

Marcus Watts mdw@umich.edu
Wed, 07 Mar 2001 13:41:55 -0500


"Colleen Hayase" <hayase@us.ibm.com> writes:
> Subject: Problem with Sun machines hanging
> To: info-afs@transarc.com
> From: "Colleen Hayase" <hayase@us.ibm.com>
> Date: Wed, 7 Mar 2001 13:11:51 -0500
> 
> This is probably a Solaris problem but I thought I'd throw it out to the
> AFS community in case it  rings a bell...
> 
> We have several Sun machines running either Solaris 2.5.1 or 2.7 that hang
> occasionally when you run "top" or "ps -ef" even though the load average is
> low (< 0.5) and lots of memory is available. After running "ps -ef" the
> process that's listed after the hang is the offending process. Killing the
> process often frees up resources and the system is back to normal. Also,
> once you find the "bad" process running "ls /proc/<PID>" will also hang.
> 
> I've also posted to the comp.unix.solaris newsgroup and a frequent comment
> was that there's a problem with an NFS mounted server. Our machines don't
> have any NFS mounted servers. The only thing that comes close is our
> AFS-mounted cell.
> 
> I don't have detailed statistics but most, if not all, the problem
> processes are applications running/writing from/to directories in AFS. If
> anyone has ideas I'd surely like to try them out. Thanks.
> 
> Colleen Hayase, EDA LAN Support
> IBM East Fishkill, Bldg 334 2L14-423
> hayase@us.ibm.com
> (t/l) 533-8976, 845-894-8976
> 
> 

With crash(1M) and a lot of patience, you should be able to localize
where in the kernel things are hanging.  You might also try kadb(1M).

If it's an AFS problem, cmdebug might tell you something useful
about what the AFS cache manager thinks is happening.  rxdebug may
also be useful.  "fs checkservers" (which can be shortened to
"fs checks") is something else worth trying: if it doesn't come
back quickly, the local machine is having some sort of difficulty
talking to a remote server, and that's certainly worth investigating.
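
A minimal sketch of the "does it come back quickly" check, assuming a
POSIX shell.  run_with_deadline is a hypothetical helper (not part of
AFS), and the limits in the usage comments are arbitrary.  It polls
instead of blocking in wait, since a process stuck in the kernel may
ignore the signal.

```shell
# run_with_deadline SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND in the background and poll it once a
# second.  Exit status is COMMAND's own status if it finishes in time,
# or 124 if it is still running after SECONDS.
run_with_deadline() (
    secs=$1
    shift
    "$@" &
    cmd_pid=$!
    waited=0
    while kill -0 "$cmd_pid" 2>/dev/null; do
        if [ "$waited" -ge "$secs" ]; then
            # Ask it to stop, but do not wait for it: a process blocked
            # inside the kernel may never react to the signal.
            kill -TERM "$cmd_pid" 2>/dev/null
            exit 124
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$cmd_pid"    # collect the real exit status
)

# Possible usage on an AFS client (assumes the AFS binaries are in PATH):
#   run_with_deadline 10 fs checkservers || echo "slow, or reported a problem"
#   run_with_deadline 30 cmdebug localhost -long
#   run_with_deadline 30 rxdebug localhost 7001
```

Port 7001 is the cache manager's callback port; the others are only
suggestions for where to start looking.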

Network health: are you dropping packets?  Do you have anything "weird"
like an FDDI<->Fast Ethernet bridge?  (FDDI allows larger frames than
Ethernet, so such a bridge has an MTU mismatch, which can result in bad
behavior.)
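
One rough way to look for dropped or damaged packets is to scan the
per-interface error counters.  The sketch below assumes Solaris-style
"netstat -i" output, with a header line starting with "Name" and
columns called Ierrs and Oerrs.  iface_errors is a hypothetical helper
name, and on older Solaris releases you may need nawk rather than awk.

```shell
# iface_errors: read "netstat -i" output on stdin and print each
# interface whose input or output error counter is non-zero.
# Column positions are taken from the header line, not hard-coded.
iface_errors() {
    awk '
        $1 == "Name" {
            # Header line: remember which field holds which column.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("Ierrs" in col) && ("Oerrs" in col) && NF >= col["Oerrs"] {
            ie = $(col["Ierrs"])
            oe = $(col["Oerrs"])
            if (ie + 0 > 0 || oe + 0 > 0)
                printf "%s: Ierrs=%s Oerrs=%s\n", $1, ie, oe
        }
    '
}

# Possible usage:
#   netstat -i | iface_errors
```

Counters that keep climbing between two runs matter more than a
non-zero total.  For the MTU question, sending large pings to the file
server (on Solaris, something like "ping -s host 1472 5"; check your
ping man page) can show whether full-size packets get through.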

You say "ls" hangs.  You probably have an alias such as "ls -s" or
"ls -F" that causes ls to stat the objects in /proc.  Does plain
"/bin/ls" hang as well?  What does it hang on?  If plain /bin/ls
doesn't hang, and ls -F hangs on /proc/XXX/cwd, then it's likely an
AFS problem with that directory or volume.
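
To find which entry the stat hangs on, each entry can be probed
separately.  This is only a sketch, assuming a POSIX shell;
probe_entries is a hypothetical helper name and the 5-second default
is arbitrary.  It starts one "/bin/ls -ld" per entry in the
background, waits, and reports the probes that are still running.

```shell
# probe_entries DIR [SECONDS]
# Run "/bin/ls -ld" on every entry of DIR in parallel, wait SECONDS
# (default 5), then print "ok ENTRY" or "HUNG ENTRY" for each one.
# Entry names are assumed to contain no whitespace (true for /proc).
probe_entries() (
    dir=$1
    secs=${2:-5}
    probes=""
    for entry in "$dir"/*; do
        /bin/ls -ld "$entry" >/dev/null 2>&1 &
        probes="$probes $!:$entry"
    done
    sleep "$secs"
    for probe in $probes; do
        pid=${probe%%:*}
        entry=${probe#*:}
        if kill -0 "$pid" 2>/dev/null; then
            # Still running after the grace period: treat it as hung.
            # It may ignore the signal if it is stuck in the kernel.
            kill -TERM "$pid" 2>/dev/null
            echo "HUNG $entry"
        else
            echo "ok $entry"
        fi
    done
)

# Possible usage, with XXX standing for the PID of the bad process:
#   probe_entries /proc/XXX 5
```

A "HUNG" line for cwd would point at the process's working directory,
which is the case described above.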

You should find out something about the directory and volume it's
hanging on.  Is there something unusual about it, like an oversized
directory?  Strange volume locks or other problems from administrative
operations that perhaps got aborted uncleanly in the middle?  What
versions of the fileserver, cache manager, vlserver, and so on are you
running?  Are they the current versions?  Does running different
versions of these change the behavior?  Does running the salvager on
the volume change anything?  How about moving the volume to a different
server (and thus doing effectively a vos dump / vos restore and perhaps
fixing weird problems with the volume)?
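
The questions above map roughly onto a handful of AFS commands.  The
sketch below only prints them for review; it does not run anything.
volume_checklist is a hypothetical helper, and the volume, server, and
partition are whatever "fs examine" and "fs whereis" report for the
directory the process was hanging on.

```shell
# volume_checklist VOLUME SERVER PARTITION
# Print (do not execute) commands worth trying for a suspect volume.
volume_checklist() (
    vol=$1
    server=$2
    part=$3
    cat <<EOF
vos examine $vol
vos listvldb -name $vol
rxdebug $server 7000 -version
rxdebug localhost 7001 -version
bos salvage -server $server -partition $part -volume $vol
EOF
)

# Possible usage:
#   fs examine /afs/your.cell/some/dir     # which volume is this?
#   fs whereis /afs/your.cell/some/dir     # which server holds it?
#   volume_checklist VOLUME SERVER PARTITION
```

"vos examine" and "vos listvldb" show whether the volume is locked or
offline, and the two rxdebug lines ask the fileserver (port 7000) and
the local cache manager (port 7001) for their versions.  Moving the
volume would be a "vos move" with the source and destination server
and partition filled in.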

				-Marcus Watts
				UM ITCS Umich Systems Group