System monitoring is both an art and a pain. It is nice to have pretty graphs that show what's going on with a server or a service, as well as something that properly notifies you of current or potential issues, but on the other side there is also a lot of pain and (boring) work involved in getting this up and running properly.
I'm quite a fan of proper and detailed monitoring of systems, and after the latest issues with tribble I took a stab at improving the monitoring of that box. Tribble, however, is running FreeBSD, and hardware-related monitoring (vs. checking for things in the OS) is often more difficult there for various reasons.
The first thing I wanted to get monitored is the hardware itself - modern servers usually carry some sort of BMC (Baseboard Management Controller) or an even more sophisticated solution (RSAII, iLO - just to name a few); these are basically small independent computers on the mainboard.
Accessing the data those BMCs provide is often done through complex, binary-only drivers available only for Microsoft Windows and a limited number of commercially supported Linux distributions (and some of them are even bloated Java-based GUI things). In the last few years, however, a standards-based solution for this kind of task has appeared: the Intelligent Platform Management Interface (IPMI).
IPMI provides a standardized interface to manage and monitor servers even in the absence(!) of an operating system. It is a cool idea, though in practice it bears a lot of similarity to ACPI in the sense that every vendor implements it a bit differently, and early implementations in particular are buggy as hell.
Luckily for us tribble is running FreeBSD 6.2, which is the first FreeBSD release to support ipmi(4), despite the fact that the man page claims it got added in 7.0 ...
For integration into the postgresql.org monitoring infrastructure I hacked up a small Nagios check script which simply calls ipmitool and looks for interesting output.
Sample output for tribble of that script looks like:
[stefan@tribble ~]$ sudo /usr/local/libexec/nagios/check_ipmi
OK - IPMI: (Ambient_Temp = 23 degrees C, CPU_1_Temp = 34 degrees C, CPU_2_Temp = 34 degrees
C, DASD_Temp = 31 degrees C, Fan_10_Presence = 0x02, Fan_10_Tach = 1830 RPM, Fan_11_Presence
= 0x02, Fan_11_Tach = 1800 RPM, Fan_12_Presence = 0x02, Fan_12_Tach = 1740 RPM,
Fan_1_Presence = 0x02, Fan_1_Tach = 1710 RPM, Fan_2_Presence = 0x02, Fan_2_Tach = 1650 RPM,
Fan_3_Presence = 0x02, Fan_3_Tach = 1830 RPM, Fan_4_Presence = 0x02, Fan_4_Tach = 1830 RPM,
Fan_5_Presence = 0x02, Fan_5_Tach = 1890 RPM, Fan_6_Presence = 0x02, Fan_6_Tach = 1680 RPM,
Fan_7_Presence = 0x02, Fan_7_Tach = 1680 RPM, Fan_8_Presence = 0x02, Fan_8_Tach = 1680 RPM,
Fan_9_Presence = 0x02, Fan_9_Tach = 1800 RPM, PS_1_Fan_Fault = 0x01, PS_1_Status = 0x01,
PS_2_Fan_Fault = 0x01, PS_2_Status = 0x01)
which is a bit verbose but I will work on that later ;-)
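To give an idea of what such a check involves, here is a minimal sketch in Python. It is not the actual script: it assumes the readings come from `ipmitool sdr`, that each line has the form `name | reading | status`, and that any status other than `ok` or `ns` (no reading) should be treated as a problem. The function names are made up for illustration.

```python
import subprocess

# Nagios plugin exit codes
OK, CRITICAL, UNKNOWN = 0, 2, 3


def parse_sdr(text):
    """Parse 'name | reading | status' lines into (name, reading, status) tuples."""
    sensors = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3 or not fields[0]:
            continue  # skip blank lines and anything not in the assumed format
        name, reading, status = fields
        sensors.append((name.replace(" ", "_"), reading, status))
    return sensors


def evaluate(sensors):
    """Turn parsed sensors into a Nagios-style (exit_code, message) pair."""
    if not sensors:
        return UNKNOWN, "UNKNOWN - IPMI: no sensor data"
    # Assumption: 'ok' is healthy and 'ns' means the sensor has no reading.
    bad = [s for s in sensors if s[2] not in ("ok", "ns")]
    shown = bad if bad else sensors  # on failure, list only the offenders
    detail = ", ".join(f"{name} = {reading}" for name, reading, _ in sorted(shown))
    if bad:
        return CRITICAL, f"CRITICAL - IPMI: ({detail})"
    return OK, f"OK - IPMI: ({detail})"


def run_check():
    """Call ipmitool and evaluate its output; needs root and a working BMC."""
    try:
        result = subprocess.run(
            ["ipmitool", "sdr"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        return UNKNOWN, f"UNKNOWN - IPMI: {exc}"
    return evaluate(parse_sdr(result.stdout))
```

A wrapper would print the message returned by `run_check()` and pass the code to `sys.exit()`, since Nagios reads the status from the exit code and the text from the first line of output. Listing only the failing sensors when something is wrong is one way to keep the output shorter.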
The script will also check the System Event Log (SEL) - which is basically a small NVRAM-backed memory on the BMC holding all kinds of hardware monitoring events - and will return a warning if it finds any entries (in this case there are none).
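The SEL part could look roughly like the following sketch. It assumes the log is read with `ipmitool sel list` and that every logged event shows up as a pipe-separated line, so anything without a `|` is ignored; the function name and message wording are invented for this example.

```python
# Nagios plugin exit codes used here
OK, WARNING = 0, 1


def check_sel(text):
    """Return a Nagios-style (exit_code, message) pair for SEL listing text."""
    # Assumption: each event is one pipe-separated line; other lines are notices.
    entries = [line.strip() for line in text.splitlines() if "|" in line]
    if entries:
        return WARNING, (
            f"WARNING - IPMI SEL: {len(entries)} event(s) logged, "
            f"latest: {entries[-1]}"
        )
    return OK, "OK - IPMI SEL: no entries"
```

Returning only a warning, never a critical, follows the behaviour described above: an entry in the log means someone should look at it, not necessarily that the box is on fire.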
OK, now that we have the basic hardware covered, only one major thing is left: monitoring the integrated IBM ServeRAID 7k adapter, which has two arrays (a 2-disk RAID 1 for the OS and related data and a 4-disk RAID 10 for the VMs).
Monitoring hardware RAID is a delicate thing on most BSDs (though OpenBSD made some promising progress on that front lately) - the lack of vendor support often results in rudimentary drivers at best, and useful tools to check the array status or even initiate rebuilds are often simply not available.
A bit of research turned up the following post on the freebsd-scsi mailing list.
Once compiled, this tool indeed gives basic information about the status of ips(4)-based RAID controllers on FreeBSD. Wrapping it once again into a Nagios-compatible check script results in:
[stefan@tribble ~]$ sudo /usr/local/libexec/nagios/check_raid
OK: /dev/ips0 - Volume: 0, ArrayState: OK; Volume: 1, ArrayState: OK;
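The wrapper logic behind a status line like this can be sketched as below. The output format of the tool from the mailing list is not shown here, so this example starts from already-parsed (volume, state) pairs; the function name is hypothetical, and treating every state other than `OK` as critical is an assumption.

```python
# Nagios plugin exit codes
OK, CRITICAL, UNKNOWN = 0, 2, 3


def raid_status(device, volumes):
    """Build a Nagios-style (exit_code, message) pair from (volume, state) pairs."""
    if not volumes:
        return UNKNOWN, f"UNKNOWN: {device} - no volumes reported"
    detail = " ".join(f"Volume: {vol}, ArrayState: {state};" for vol, state in volumes)
    # Assumption: "OK" is the only healthy state; anything else is critical.
    if all(state == "OK" for _, state in volumes):
        return OK, f"OK: {device} - {detail}"
    return CRITICAL, f"CRITICAL: {device} - {detail}"
```

For example, `raid_status("/dev/ips0", [(0, "OK"), (1, "OK")])` yields exit code 0 and the same line as shown above.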
So at the end of the day we have nice hardware monitoring for at least one of the project's servers - but there is still a lot to do in the future ...