Since I have the pleasure of watching some Windows boxen with Nagios, I took the Windows Update plugin from Michal Jankowski and implemented it. It took me some time to set up nsclient++ correctly so it just works, but until now the check plugin sometimes reported the usual “Service Check Timed Out”.
Usually I ended up increasing the cscript timeout or the nsclient++ socket timeout, but the error still kept showing up. Since I rely heavily on my surveillance tools, I want as few false positives as possible. So I ended up chasing down this error today, and in hindsight it was quite simple.
In my case, it was neither cscript (that timeout is set to 300 seconds), nor nsclient++ (socket timeout is set to 300 seconds too), nor the nrpe plugin itself (that has 300 seconds as well).
As it turns out, Nagios has an additional setting controlling these things, called service_check_timeout, which defaults to 60 seconds. Sadly the plugin, or rather Windows, needs longer than those 60 seconds to figure out whether or not it needs updating, so Nagios kills the plugin and returns a CRITICAL message.
After increasing the value of service_check_timeout, that will hopefully be fixed.
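To catch this kind of mismatch earlier, the Nagios-side timeout can be compared against the timeouts further down the chain. What follows is only a rough Python sketch, assuming the main config uses plain key=value lines; the helper name, the sample text and the 300-second figure (taken from the values above) are illustrative, not part of any Nagios tooling.

```python
def service_check_timeout(cfg_text, default=60):
    """Return service_check_timeout from nagios.cfg-style text.

    Assumes plain key=value lines; falls back to the 60 second default
    when the setting is absent.
    """
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "service_check_timeout":
            return int(value.strip())
    return default


# Hypothetical excerpt standing in for the real nagios.cfg
sample_cfg = "# main config\nservice_check_timeout=60\n"
downstream_timeout = 300  # cscript / nsclient++ / nrpe, as set in this post

nagios_timeout = service_check_timeout(sample_cfg)
if nagios_timeout < downstream_timeout:
    print(f"Nagios gives up after {nagios_timeout}s, "
          f"before the {downstream_timeout}s downstream timeouts")
```

If Nagios is the first link to give up, raising the timeouts below it changes nothing, which matches what happened here.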
Well, recently I stepped up to watch our cluster environments … Michael has a good howto on how to watch Windows Cluster environments in the NSclient++ wiki.
Now, this has its own quirks … which I stumbled upon when trying to write a Linux-HA OCF resource agent for the Nagios NRPE server. Combining Linux-HA with SLES10 is generally a good thing, but using startproc in that resource agent is not such a good idea.
Apparently Novell (or SuSE GmbH) thought it might be wise to include some additional logic in the wrapper: startproc, checkproc and killproc check for the name of the executable. So if you try to start an additional process with the same name, you need to dig a bit deeper.
For this to work, you need two additional things (quotations directly from man 8 startproc):
-p pid_file: (Former option -f changed due to the LSB specification.) Use an alternate pid file instead of the default (/var/run/<basename>.pid). The pid read from this file is being matched against the pid of running processes that have an executable with specified path of the program. In order to avoid confusion with stale pid files, a not up-to-date pid will be ignored.
Now, apparently this isn’t enough. startproc still refuses to start a second process.
-i ignore_file: The pid found in this file is used as session id of the same binary program which should be ignored by startproc.
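Put together, the second instance gets started with both options: -p (the alternate pid file) and -i (the ignore file), as documented in startproc(8) on SLES. Here is a small Python sketch that only assembles the argument list; every path in it is a placeholder made up for illustration, and the helper name is hypothetical.

```python
def startproc_argv(binary, pid_file, ignore_file, extra_args=()):
    """Build a startproc call for a second instance of the same binary.

    pid_file    -> passed with -p, the pid file of the instance to start
    ignore_file -> passed with -i, the pid file of the already running
                   instance that startproc should ignore
    """
    return ["startproc", "-p", pid_file, "-i", ignore_file, binary, *extra_args]


# All paths are placeholders, not the ones from the actual resource agent
argv = startproc_argv(
    binary="/usr/bin/nrpe",
    pid_file="/var/run/nrpe-cluster.pid",
    ignore_file="/var/run/nrpe.pid",
    extra_args=["-c", "/etc/nagios/nrpe-cluster.cfg", "-d"],
)
print(" ".join(argv))
```

Inside a resource agent the same list could be handed to subprocess.run(argv), or written out as a plain shell line.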
Well, most of you already know that I’m a Nagios fanatic. I like to watch as many aspects as I possibly can. So, yesterday I started figuring out ways to watch our different cluster groups (housing a bunch of file shares, try above 20,000).
Now, my first tries failed horribly. I brought down a complete cluster group, resulting in a major annoyance. Today I went at it a bit smarter :P I cloned two VMs off my Windows Server 2003 Enterprise R2 template and created a new cluster.
After that, I tried it on the test cluster again, with the same result. The resource is created successfully, but once I try to take it online, it breaks and moves the whole cluster group to the other node (as in cycling between the cluster nodes with no end).
After that, I figured something had to be wrong with the command I was trying to use, the one given in the NSClient++ wiki. I then tried the command on the command line, and as soon as I hit <TAB> (oooold bash habit :P ), it completed the path but put quotes around it … Don’t ask me.
If I try the path without the quotes, no joy at all. Once you put quotes around it, everything becomes hunky-dory and the resource comes online without the slightest trouble!
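The quoting is presumably the usual spaces-in-the-path story: an unquoted path with a space gets split into two arguments. Here is a short Python illustration using subprocess.list2cmdline, which applies the Windows quoting rules; the install path and the switch are assumptions for the example, not the actual command from the wiki.

```python
import subprocess

# Assumed install path and a made-up switch, purely for illustration
exe = r"C:\Program Files\NSClient++\NSClient++.exe"
switch = "/hypothetical-switch"

# list2cmdline wraps any argument containing whitespace in double quotes
quoted = subprocess.list2cmdline([exe, switch])
print(quoted)

# Without quotes, a naive split on whitespace tears the path apart
unquoted = f"{exe} {switch}"
print(unquoted.split(" ")[0])  # only the part of the path before the space
```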
Hint to self: When creating a NSClient++ cluster resource (or any application resource using a command that needs switches for that matter), use a quoted command line along the lines of this:
For people who are as point-and-click lazy as me, here is how you restart the service without using the service management applet.
net stop "NSClientpp (Nagios) 0.3.5.2 2008-09-24 w32"
net start "NSClientpp (Nagios) 0.3.5.2 2008-09-24 w32"
Since we started utilizing Nagios’s power two months ago, I finally came up with a C-based RAM plugin for Nagios. The biggest problem I had with the Python- and Perl-based plugins was that some distributions (yes, SLES and Debian) don’t install either Python or Perl.
Since I wanted a manageable setup (as in a unified code base across all distributions), I wanted it to work without installing too much. So I took the swap plugin, basically removed what wasn’t necessary, and voilà!
Here we go, yay ME!
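For readers who just want the idea behind such a check: the plugin itself is written in C, and the following is not its code, only a Python sketch of the general logic under some assumptions. It parses /proc/meminfo-style text, computes the used percentage as MemTotal minus MemFree (a simplification that counts buffers and cache as used), and maps it to the standard Nagios exit codes. The thresholds and function names are made up for the example.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def check_ram(meminfo, warn_pct=80, crit_pct=90):
    """Return (exit_code, message) for the used-RAM percentage."""
    total = meminfo["MemTotal"]
    used = total - meminfo["MemFree"]  # simplification: cache counts as used
    pct = 100.0 * used / total
    if pct >= crit_pct:
        return CRITICAL, f"RAM CRITICAL - {pct:.0f}% used"
    if pct >= warn_pct:
        return WARNING, f"RAM WARNING - {pct:.0f}% used"
    return OK, f"RAM OK - {pct:.0f}% used"


# Sample data instead of reading /proc/meminfo, so this runs anywhere
sample = "MemTotal:        1000000 kB\nMemFree:          150000 kB\n"
code, message = check_ram(parse_meminfo(sample))
print(code, message)
```

A real plugin would read /proc/meminfo, print the message and exit with the code, so Nagios can pick up the state.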
The only thing I need to finish sometime soon is getting NSClient++ to work on my Windows boxen (of which I have quite a few: the domain controllers, nas-cluster, …)