Sun's 'logging' mount option -- our findings -- RUN AWAY

Harald Barth haba@pdc.kth.se
Wed, 21 Feb 2001 11:30:46 +0100


Hi Jeff,

your story is quite similar to what we experienced in the
stacken.kth.se cell. It cost us some nights, and I got quite upset.
Fortunately we had logging enabled on only one file server, so it was
not as severe. I "only" had badly corrupted volumes, but no salvager
segfaults.

Lessons learned: 

1) Do NOT turn on logging on your /vicep* partitions.

2) The salvager does a better job when called on a whole partition
than on a single volume. On the other hand, this requires shutting
down the fileserver.

3) The salvager is not always able to clean out broken backup
volumes. The only way to "fix" these is to "vos zap -force" them (see
the command sketch after this list).

4) vos backup is not able to overwrite such a corrupted backup volume.
You have to check carefully that you really get backup volumes for all
volumes when you run vos backupsys.

5) "vos backup volumename ; vos dump volumename.backup" seems to be a
reasonable check that your volumes are better again after you tried to
cure them with the salvager.
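
For reference, here is roughly the command sequence behind points
1-5. The server name, partition, device, volume name and volume ID
below are placeholders, and the flags are from the vos/bos versions
we run, so check them against yours:

    # 1) In /etc/vfstab, the mount options field (the last one) for
    #    the /vicep* partitions must NOT say "logging"; "-" = defaults:
    /dev/dsk/c0t1d0s0  /dev/rdsk/c0t1d0s0  /vicepa  ufs  2  yes  -

    # 2) Salvage a whole partition (bos stops the fileserver for this):
    bos salvage -server fs1.stacken.kth.se -partition /vicepa

    # 3) Zap a backup volume the salvager cannot fix; vos examine on
    #    the read/write volume shows the numeric ID of the backup clone:
    vos zap -server fs1.stacken.kth.se -partition /vicepa \
        -id 536870915 -force

    # 4) After vos backupsys, count the backup clones per partition:
    vos listvol fs1.stacken.kth.se /vicepa | grep -c BK

    # 5) Recreate the backup volume and dump it to /dev/null as a check:
    vos backup volumename
    vos dump -id volumename.backup -file /dev/null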

Questions still unanswered (same numbering as above):

What does the fileserver assume about the file system that breaks
when you turn on logging? Another way to put it: if you write a
fileserver, you should either implement your own filesystem on the
raw device or use the vendor's file system exactly as the vendor
expects you to. Is the inode-fiddling approach too complicated (there
is no working salvager), and should the files therefore be stored in
another way? Is the files approach of the Linux port more robust? I
also don't understand why the annoying overwriting of log files
hasn't been fixed long ago. We can afford a big /usr/afs/logs. Write
all log files with a date stamp (SalvageLog.20010221.111059.log or
something). Never overwrite log files.
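
Until that is fixed, a crude workaround is to move the log aside
yourself after every salvage run; a minimal sketch, assuming the
usual /usr/afs/logs layout:

    # preserve SalvageLog before the next salvager run overwrites it
    mv /usr/afs/logs/SalvageLog \
       /usr/afs/logs/SalvageLog.`date +%Y%m%d.%H%M%S`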

> IMPORTANT: If your server is running with 'logging' on and you would
>            like to turn it off, DO NOT turn it off in /etc/vfstab and
>            then reboot the machine.  We did this with one of our two
>            problem servers and salvager went INSANE deleting files
>            allll over the place, which is what forced us to do a

I stopped the fileserver, then I tried to unmount the /vicep*
partitions one by one, and the ones I actually managed to unmount
were in better shape later. Of course Solaris did not give me any
good reason why I was not allowed to umount all of them. I guess that
the unmount committed the logging entries to the file system at a
place where the fileserver expected them to be. I really do not want
to know _that_much_ about the Solaris file system.
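
In commands, the procedure was roughly this (a sketch; the server
name is a placeholder, and fuser -c will show what is holding a
partition that refuses to unmount):

    # stop the fileserver processes cleanly first
    bos shutdown fs1.stacken.kth.se fs -wait

    # then unmount each vice partition so UFS replays its log
    for p in /vicep*; do
        umount $p || echo "$p is busy, could not unmount"
    done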

Harald.