[info-afs] VLDB problem - Duplicate entries

William Setzer William_Setzer@ncsu.edu
Fri, 12 Sep 2008 14:30:02 -0400


We've been investigating why our "vos backupsys" processes have been
hanging, and have discovered something disturbing.  After dumping
our VLDB via "vos listvldb > foo", it appears our VLDB has been
corrupted.  We're seeing two entries for a significant fraction
(about 1/4) of our volumes:

    adm.db 
        RWrite: 536899559     Backup: 536899561 
        number of sites -> 1
           server A.ncsu.edu partition /vicepa RW Site 

    adm.db 
        RWrite: 536899559     Backup: 536899561 
        number of sites -> 1
           server A.ncsu.edu partition /vicepa RW Site 


So far we never see more than two entries per volume.  Sometimes
both entries point to the same server, and sometimes they point to
different servers.
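
For anyone wanting to check a dump for the same symptom, here is a
rough sketch of how the duplicates could be counted and split into
the "same place" and "different places" cases.  It assumes the dump
is laid out as in the excerpt above: blocks separated by blank lines,
with the volume name on the first line of each block.  The function
names are made up for illustration, and the script only reads the
text dump; it never touches the VLDB itself.

```python
from collections import defaultdict


def find_duplicate_entries(dump_text):
    """Return {volume name: [entry details, ...]} for names seen more than once.

    Assumes each entry is a blank-line-separated block whose first line
    is the volume name, as in the excerpt above.
    """
    entries = defaultdict(list)
    block = []
    # The extra "" guarantees the final block is flushed.
    for line in dump_text.splitlines() + [""]:
        if line.strip():
            # Collapse whitespace so cosmetic differences do not matter.
            block.append(" ".join(line.split()))
            continue
        # A blank line closes the current block.  One-line blocks
        # (for example a header or a totals line) are not entries.
        if len(block) > 1:
            entries[block[0]].append(tuple(block[1:]))
        block = []
    return {name: seen for name, seen in entries.items() if len(seen) > 1}


def classify(duplicates):
    """Split duplicated names into identical copies and differing copies."""
    same, differ = [], []
    for name, seen in sorted(duplicates.items()):
        (same if len(set(seen)) == 1 else differ).append(name)
    return same, differ
```

Calling find_duplicate_entries() on the contents of "foo" and passing
the result to classify() gives two lists of volume names: those whose
duplicate entries are identical, and those whose entries disagree
(for example, on the server).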

Our first thought is to do a "vos syncvldb"/"vos syncserv", but we
don't know if this will fix the problem, particularly in the case of
duplicate entries pointing to the same place.  Our second thought is
to do it after zeroing out the VLDB, but the downtime we'd suffer
isn't very appealing. :)   Our third thought is that we might have a
more serious corruption, since we had a problem with our VLDB several
months ago (which we thought we had fixed).

Right now, everything appears to be working "normally", except for
"vos backupsys" being very cranky about a large number of non-existent
volumes, but clearly something needs to be done and we're pretty much
out of our depth.

Our current OpenAFS version is 1.2.13; our upgrade to 1.4.7 was in
progress when this problem interrupted it.  (We were starting with
the file servers, so the database servers are still at 1.2.13.)

So what do you think would be the safest and/or best course of action
to take?  Thanks in advance for your advice.


William Setzer
Systems & Hosted Services
Office of Information Technology
NC State University