The usual IT babble
Posts tagged Linux-HA
OCF agent for Tivoli Storage Manager: redux
Jun 5th
Well, I finished my first OCF agent back in October 2008, and we have had it running in production for about ten months now. During that time, we found quite a few points where we’d like to improve how Linux-HA handles TSM:
- Shut down TSM nicely if possible (cancel client sessions, cancel running processes and dismount mounted volumes)
- Better error handling
So, after another week of writing and testing with a small instance, I present the new OCF agent for Tivoli Storage Manager. It still has one or two weak points, but they are negligible. I still need to write the documentation for it, but the script should just work …
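For illustration, the "shut down nicely" part could look roughly like the sketch below. This is not the agent itself: the admin ID, the password and the DSMADMC override variable are placeholders I made up for the example, and dismounting volumes is left out because the volume names would first have to be parsed from the output of query mount.

```shell
#!/bin/sh
# Sketch only, not the real agent. TSM_ADMIN, TSM_PASSWORD and the DSMADMC
# override are hypothetical placeholders.
DSMADMC="${DSMADMC:-dsmadmc}"
TSM_ADMIN="${TSM_ADMIN:-admin}"
TSM_PASSWORD="${TSM_PASSWORD:-secret}"

# Run one administrative command in batch mode, without headers or prompts.
tsm_cmd() {
    "$DSMADMC" -id="$TSM_ADMIN" -password="$TSM_PASSWORD" \
        -dataonly=yes -noconfirm "$@" < /dev/null
}

# Ask the server to drop its work before it gets halted.
tsm_pre_shutdown() {
    # Cancel all client sessions. A non-zero return code (for example when
    # there is nothing to cancel) must not abort the shutdown.
    tsm_cmd "cancel session all" > /dev/null 2>&1 || true

    # Cancel every running server process by its process number.
    tsm_cmd "select process_num from processes" 2> /dev/null |
    while read -r num _rest; do
        case "$num" in
            ''|*[!0-9]*) continue ;;   # skip anything that is not a number
        esac
        tsm_cmd "cancel process $num" > /dev/null 2>&1 || true
    done
    return 0
}
```

The stop action would call tsm_pre_shutdown first and only then halt the server, so that a hanging session or process cannot hold up the failover.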
TSM: Restoring the database/recovery log to a point-in-time
Apr 24th
Well, my co-worker just called on my cell (it’s Friday, 16:00), and asked me which start-up script he needed to change in order to restore the database. My first response was, “ummm, that’s gonna be hard, we’re using heartbeat”.
Okay, so after a bit of asking I got out of him what he wanted to achieve by changing the start-up script. Apparently he did something to crash Tivoli Storage Manager (or rather, crash it repeatedly) and wanted to restore the database. He had talked to one of our systems partners (and I’m happy we have them, most of the time), who in turn told him how to do it, but he forgot it a minute after he hung up the phone.
So, I went digging while he was still telling me how he got Tivoli to kick his own ass … After a bit, I thought “hrrrrrm, shouldn’t this be covered in the Tivoli documentation?”, and surprisingly it actually is.
It’s actually rather simple.
- Stop the dsmserv Linux-HA cluster service (tsm-control ha stop tsm1)
- Set up the environment (since we’re running multiple instances of Tivoli Storage Manager: export DSMSERV_DIR, export DSMSERV_CONFIG)
- Change into the directory of the server instance
- Run dsmserv restore db
- Wait some time (took about half an hour to restore the 95G database and the 10G recovery log)
- Start the dsmserv Linux-HA cluster service (tsm-control ha start tsm1)
- Update the server-to-server communication, since the restore db changes the communication verification token
> tsm-control ha stop tsm1
- tsm1 (dsmserv) -> ha: [ OK ]
> export DSMSERV_DIR=/opt/tivoli/tsm/server/bin
> export DSMSERV_CONFIG=/opt/tivoli/tsm/server/tsm1/dsmserv.opt
> cd /opt/tivoli/tsm/server/tsm1
> /opt/tivoli/tsm/server/bin/dsmserv restore db todate=TODAY totime=08:00:00 source=dbbackup preview=no
.... wait some time ....
> tsm-control ha start tsm1
- tsm1 (dsmserv) -> ha: [ OK ]
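If you have to do this more than once, the same steps can be wrapped into a small script. What follows is just a sketch built from the transcript above: the INSTANCE variable, the run helper and the DRY_RUN switch are my own additions for the example, and the server-to-server update from the last step is not covered.

```shell
#!/bin/sh
# Sketch of the restore steps. INSTANCE, run() and DRY_RUN are hypothetical
# helpers for this example; the paths are the ones from the transcript.
INSTANCE="${INSTANCE:-tsm1}"
DSMSERV_DIR=/opt/tivoli/tsm/server/bin
DSMSERV_CONFIG=/opt/tivoli/tsm/server/$INSTANCE/dsmserv.opt
export DSMSERV_DIR DSMSERV_CONFIG

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: restore_tsm_db TODAY 08:00:00
restore_tsm_db() {
    todate=$1
    totime=$2
    run tsm-control ha stop "$INSTANCE" || return 1
    run cd "$(dirname "$DSMSERV_CONFIG")" || return 1
    run "$DSMSERV_DIR/dsmserv" restore db todate="$todate" \
        totime="$totime" source=dbbackup preview=no || return 1
    # Only bring the cluster service back if the restore succeeded.
    run tsm-control ha start "$INSTANCE"
}
```

Calling it as DRY_RUN=1 restore_tsm_db TODAY 08:00:00 only prints the four commands, which is a cheap way to double-check the paths before touching a production instance. If the restore fails, the service deliberately stays stopped.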
Nagios: Watching Clustered environments (the other way)
Mar 19th
Well, recently I stepped up to watch our cluster environments … Michael has a good howto on how to watch Windows Cluster environments in the NSclient++ wiki.
Now, this has its own perks … Which I stumbled upon when trying to write a Linux-HA OCF resource agent for the Nagios NRPE server. Combining Linux-HA with SLES10 is a good thing generally, but using startproc in that resource agent is not such a good idea.
Apparently Novell (or SuSE GmbH) thought it might be wise to build some additional logic into the wrappers: startproc, checkproc and killproc all check the name of the executable. So if you try to start an additional process with the same name, you need to dig a bit deeper.
For this to work, you need two additional things (quotations directly from man 8 startproc):
-p pid_file
(Former option -f changed due to the LSB specification.) Use an alternate pid file instead of the default (/var/run/<basename>.pid). The pid read from this file is being matched against the pid of running processes that have an executable with specified path of the program. In order to avoid confusion with stale pid files, a not up-to-date pid will be ignored.
Apparently, though, this alone isn’t enough: startproc still refuses to start a second process.
-i ignore_file
The pid found in this file is used as session id of the same binary program which should be ignored by startproc.
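Put together, starting a second NRPE instance next to the one from the init script might look like the sketch below. All paths are made-up examples (check where your nrpe binary and pid files actually live), and the STARTPROC override only exists so the call can be tried without startproc.

```shell
#!/bin/sh
# Sketch only: every path below is a hypothetical example.
STARTPROC="${STARTPROC:-startproc}"   # override is only meant for dry runs
NRPE_BIN=/usr/sbin/nrpe
SYSTEM_PID=/var/run/nrpe.pid          # pid file of the init-script instance
CLUSTER_PID=/var/run/nrpe-ha.pid      # pid file of the clustered instance
CLUSTER_CFG=/etc/nagios/nrpe-ha.cfg   # its pid_file setting must match CLUSTER_PID

start_clustered_nrpe() {
    # -p: track the new instance through its own pid file
    # -i: ignore the already running instance recorded in SYSTEM_PID
    "$STARTPROC" -p "$CLUSTER_PID" -i "$SYSTEM_PID" \
        "$NRPE_BIN" -c "$CLUSTER_CFG" -d
}
```

Running it with STARTPROC=echo prints the resulting command line instead of executing it.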
Linux-HA: Creating a random authkey
Mar 18th
I just looked over the slides of a presentation one of my trainees brought back from Chemnitz, and there was this nifty one-line command with which you can generate a random sha1sum for your authkeys file.
Now, since I’m a bit lazy here’s the full command line to fill /etc/ha.d/authkeys for you:
node2 ~ [0] > printf 'auth 1\n1 sha1 %s\n' "$( dd if=/dev/urandom count=4 2> /dev/null | openssl dgst -sha1 | sed 's/^.*= //' )"
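If you want the file written in one go, with the permissions heartbeat insists on, something along these lines should do. It is a sketch, not part of the slides: the function names are mine, and the sha1sum fallback is my addition for boxes without openssl.

```shell
#!/bin/sh
# Sketch: write a heartbeat authkeys file with a random SHA1 key.
# random_sha1 and write_authkeys are hypothetical helper names.
random_sha1() {
    if command -v openssl > /dev/null 2>&1; then
        # Newer openssl versions prefix the digest with "(stdin)= ",
        # so only the last field is kept.
        dd if=/dev/urandom bs=512 count=4 2> /dev/null |
            openssl dgst -sha1 | awk '{ print $NF }'
    else
        # Fallback (my addition): coreutils sha1sum instead of openssl.
        dd if=/dev/urandom bs=512 count=4 2> /dev/null |
            sha1sum | awk '{ print $1 }'
    fi
}

# Usage: write_authkeys [file]   (defaults to /etc/ha.d/authkeys)
write_authkeys() {
    target="${1:-/etc/ha.d/authkeys}"
    key=$(random_sha1)
    [ -n "$key" ] || return 1
    # The file must not be readable by anyone else, so it is created
    # with a restrictive umask and set to mode 600.
    ( umask 077 && printf 'auth 1\n1 sha1 %s\n' "$key" > "$target" ) || return 1
    chmod 600 "$target"
}
```

Run write_authkeys as root on one node and copy the resulting file to the other node, since both need the same key.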
Linux-HA and Tivoli Storage Manager (Finito!)
Oct 5th
As I previously said, I was writing my own OCF resource agent for IBM’s Tivoli Storage Manager Server. And I just finished it yesterday evening (it took me about two hours to write this post).
Only took me about four work days (roughly four hours each, which weren’t recorded in that subversion repository) plus most of this week at home (which is 10 hours a day) and about one hundred subversion revisions. The good part about it is that it actually just works (I was amazed at how well, actually). Now you’re gonna say, “but Christian, why didn’t you use the included init script and just fix it up, so it is actually compliant with the LSB standard?”
The answer is rather simple: Yeah, I could have done that, but you also know that wouldn’t have been fun. Life is all about learning, and learn something I did (even if I hit my head against the wall from time to time during those few days) … There are still one or two things I might want to add/change in the future (that is, maybe next week), like
- adding support for monitor depth by querying the dsmserv instance via dsmadmc (if you read through the resource agent, I already use it for the shutdown/pre-shutdown stuff)
- I still have to properly test it (like Alan Robertson mentioned in his one-and-a-half-hour talk on Linux-HA 2.0 and on his slides, pages 100-102) in a pre-production environment
- I’ll probably configure the IBM RSA to act as a stonith device (shoot the other node in the head), just in case one of them ever gets stuck in a state where the box is still up but doesn’t react to any requests anymore
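To give an idea of what the deeper monitor from the first point could look like: the sketch below treats the server as healthy only if it answers an administrative query. It is not in the agent yet, and the credentials as well as the DSMADMC override are placeholders for the example.

```shell
#!/bin/sh
# Sketch only. TSM_ADMIN, TSM_PASSWORD and the DSMADMC override are
# hypothetical placeholders; the return codes are the standard OCF ones.
DSMADMC="${DSMADMC:-dsmadmc}"
TSM_ADMIN="${TSM_ADMIN:-admin}"
TSM_PASSWORD="${TSM_PASSWORD:-secret}"
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Deeper monitor: a running dsmserv process is not enough, the server
# also has to answer a simple query through dsmadmc.
tsm_monitor_deep() {
    if "$DSMADMC" -id="$TSM_ADMIN" -password="$TSM_PASSWORD" \
            -dataonly=yes "query status" < /dev/null > /dev/null 2>&1; then
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}
```

This is meant to run after the plain process check has passed, which is why a failure is reported as a generic error rather than as "not running".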
