Backup solutions

Posted on the March 7th, 2008 under Life by Christian

Well, some people apparently *don’t* understand the use of a backup client like dsmc at all; on top of that, they don’t seem to have the slightest clue how to draw up a “clever” backup solution.

Lemme describe the situation for you. We have two Solaris systems at work, housing our mailing system(s). Now apparently, people are unable to install the Tivoli Storage Manager client on Solaris (or to get it working properly, which they blame on the software).

Now, they drew up this insane plan … We have about 900GiB of mail space, which currently lives on our SAN. So people decided they don’t want the backup client on their system, as it’s slow (which I agree with: dsmc is *slow* for large amounts of data, especially if it’s 900GiB in 15MiB parts).

So they think of something like this:

  • Attach a second disk to the mail system
  • The mail server then creates a tar file on the secondary disk (at what interval I can’t say, but from the size of the volume, I’d figure once a day)
  • The mail server exports said disk via NFS
  • Another, semi-independent system then imports said disk via NFS, while also housing the Tivoli Storage Manager client, to back up that big tar file …
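
To put rough numbers on that plan, here is a back-of-the-envelope sketch in Python. Only the 900GiB and 15MiB sizes come from the situation above; the throughput figures and the helper names are pure assumptions for illustration, nothing here was measured.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def transfer_hours(size_bytes, bytes_per_second):
    """Hours needed to move size_bytes at a sustained rate."""
    return size_bytes / bytes_per_second / 3600

mail_bytes = 900 * GIB   # total mail space, from the post
part_bytes = 15 * MIB    # typical part size, from the post

# Number of 15MiB parts the backup client would otherwise walk through.
part_count = mail_bytes // part_bytes

# ASSUMED sustained throughputs per stage of the plan (hypothetical values).
stages = {
    "tar onto the second disk": 80 * MIB,
    "dsmc reading the tar file over NFS": 40 * MIB,
}

total_hours = 0.0
for name, rate in stages.items():
    hours = transfer_hours(mail_bytes, rate)
    total_hours += hours
    print(f"{name}: {hours:.1f} h")

print(f"parts: {part_count}, total: {total_hours:.1f} h per full run")
```

Under these assumed rates, every run copies the full 900GiB twice, and the tar file changes each time, so the backup client has to ship the whole thing again instead of only what changed.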

So much for *well* planned backup solutions ……… :lol:


OCFS2 follow-up

Posted on the March 7th, 2008 under Life by Christian

OK, it turned out that said colleague wasn’t responsible at all. The *real* trigger was me creating a new volume on our SAN, on the same array that houses the OCFS2 volume.

Apparently, while an additional SAN volume is being created, all other SAN volumes in this array are either read-only or delayed, as you can see from the following log:

kernel: (13,3):o2hb_write_timeout:242 ERROR: Heartbeat write timeout to device sdd1 after 12000 milliseconds
kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 4):
kernel: Heartbeat thread stuck at waiting for read completion, stuffing current time into that blocker (index 4)
kernel: Index 5: took 0 ms to do submit_bio for read
kernel: Index 6: took 0 ms to do waiting for read completion
kernel: Index 7: took 0 ms to do bio alloc write
kernel: Index 8: took 0 ms to do bio add page write
kernel: Index 9: took 0 ms to do submit_bio for write
kernel: Index 10: took 0 ms to do checking slots
kernel: Index 11: took 0 ms to do waiting for write completion
kernel: Index 12: took 2002 ms to do msleep
kernel: Index 13: took 0 ms to do allocating bios for read
kernel: Index 14: took 0 ms to do bio alloc read
kernel: Index 15: took 0 ms to do bio add page read
kernel: Index 16: took 0 ms to do submit_bio for read
kernel: Index 17: took 0 ms to do waiting for read completion
kernel: Index 18: took 0 ms to do bio alloc write
kernel: Index 19: took 0 ms to do bio add page write
kernel: Index 20: took 0 ms to do submit_bio for write
kernel: Index 21: took 0 ms to do checking slots
kernel: Index 22: took 0 ms to do waiting for write completion
kernel: Index 23: took 2004 ms to do msleep
kernel: Index 0: took 0 ms to do allocating bios for read
kernel: Index 1: took 0 ms to do bio alloc read
kernel: Index 2: took 0 ms to do bio add page read
kernel: Index 3: took 0 ms to do submit_bio for read
kernel: Index 4: took 9995 ms to do waiting for read completion
kernel: (13,3):o2hb_stop_all_regions:1682 ERROR: stopping heartbeat on all active regions.
kernel: Kernel panic - not syncing: *** ocfs2 is very sorry to be fencing this system by panicing ***
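
Most of those entries are 0 ms noise, so here is a minimal Python sketch that pulls out only the operations that actually took time. The function name, the regular expression, and the `min_ms` parameter are my own illustration, not part of any OCFS2 tooling; the sample lines are copied from the log above.

```python
import re

# Matches lines such as "kernel: Index 4: took 9995 ms to do waiting for read completion"
INDEX_RE = re.compile(r"Index (\d+): took (\d+) ms to do (.+)$")

def slow_operations(log_lines, min_ms=1):
    """Return (index, milliseconds, operation) for entries at or above min_ms."""
    slow = []
    for line in log_lines:
        match = INDEX_RE.search(line)
        if match is None:
            continue  # not a heartbeat timing line
        index, ms, operation = match.groups()
        if int(ms) >= min_ms:
            slow.append((int(index), int(ms), operation.strip()))
    return slow

sample = [
    "kernel: Index 12: took 2002 ms to do msleep",
    "kernel: Index 13: took 0 ms to do allocating bios for read",
    "kernel: Index 23: took 2004 ms to do msleep",
    "kernel: Index 3: took 0 ms to do submit_bio for read",
    "kernel: Index 4: took 9995 ms to do waiting for read completion",
]

for index, ms, operation in slow_operations(sample):
    print(f"index {index}: {ms} ms in {operation}")
```

Run against the log above, this leaves the two msleep entries and the read at index 4, which sat waiting for 9995 ms.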

OCFS2 fun

Posted on the March 6th, 2008 under Life by Christian

Turns out that said colleague had been playing with NFS on one of the web nodes, apparently knocking the remaining nodes offline (or semi-offline).

After all the web nodes hung themselves, we had to hard-reset them; now everything is tingly again .. *yay* for a great first day …
