HOWTO Watchdog Timer
From Gentoo Linux Wiki
Introduction
A watchdog automatically reboots a computer when something goes wrong, for example when the kernel hangs or a program monopolizes the CPU. The idea is that a device, /dev/watchdog, must be written to at regular intervals (at least once a minute). If your computer fails to do so, it is rebooted.
The main difference between hardware and software watchdogs is who provides /dev/watchdog. If it is backed by a piece of hardware, the timer runs independently of your kernel and the other software on your computer. If you only set up a software watchdog, the device is provided by the kernel itself, so if the kernel locks up nothing is left to reboot the computer.
The program that writes to the device is called watchdog (sys-apps/watchdog). It can also monitor other parts of your computer, such as whether chosen processes are running or whether your network interface is still receiving data.
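As a rough illustration of the keepalive idea, the following shell sketch writes to a watchdog device at a fixed interval. The function name feed_watchdog and its parameters are made up for this example and are not part of sys-apps/watchdog. It assumes the driver honours the kernel's "magic close" convention, where writing the character V before closing the device disarms the timer. Opening a real /dev/watchdog arms the timer, so try it against an ordinary file first.

```shell
#!/bin/sh
# feed_watchdog DEVICE TICKS INTERVAL
# Hypothetical helper: keeps DEVICE open, writes one byte every INTERVAL
# seconds for TICKS rounds, then writes 'V' (magic close) before closing.
feed_watchdog() {
    dev=$1
    ticks=$2
    interval=$3
    {
        i=0
        while [ "$i" -lt "$ticks" ]; do
            printf '.' >&3        # any write resets the countdown
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V' >&3            # ask the driver to disarm on close
    } 3>"$dev"
}

# Dry run against a plain file instead of the real device:
#   feed_watchdog /tmp/fake-watchdog 3 1
```

In normal operation the watchdog daemon does this job, so a loop like this is only useful for understanding or testing the device.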
Set up a software or hardware watchdog device
Tools/Skills needed
- Ability to compile a kernel
- For hardware watchdogs: a hardware watchdog card
- For hardware watchdogs: knowledge of the hardware specs of the watchdog and/or motherboard
Configure the kernel (2.6 series)
First you need to find out what kind of watchdog card you have. See your kernel's configuration menu for the supported cards. If you don't have a hardware watchdog card, select Software watchdog.
Linux Kernel Configuration: Watchdog
Device Drivers ->
  Character Devices ->
    Watchdog Cards ->
      [*] Watchdog Timer Support
      [ ] Disable watchdog shutdown on close
      --- Watchdog Device Drivers
      <*> Your Watchdog card or chip
To begin with, do not enable Disable watchdog shutdown on close: with this option set, you cannot stop the watchdog once it has been started without the computer being rebooted.
When you have finished the configuration (for hardware watchdogs see also below), compile your kernel and reboot.
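To double-check the result, a small sketch like the following can list the relevant options in a kernel configuration file. It assumes the usual symbol names CONFIG_WATCHDOG, CONFIG_WATCHDOG_NOWAYOUT (the "Disable watchdog shutdown on close" option) and CONFIG_SOFT_WATCHDOG. The function name is invented for this example, and a hardware driver has its own symbol that you would need to add.

```shell
#!/bin/sh
# check_watchdog_config FILE
# Hypothetical helper: reports whether each watchdog-related option is
# built in (=y) or built as a module (=m) in the given kernel config.
check_watchdog_config() {
    cfg=$1
    for opt in CONFIG_WATCHDOG CONFIG_WATCHDOG_NOWAYOUT CONFIG_SOFT_WATCHDOG; do
        if grep -q "^${opt}=[ym]" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: not set"
        fi
    done
}

# Example, assuming the kernel sources live in /usr/src/linux:
#   check_watchdog_config /usr/src/linux/.config
```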
Additional kernel configuration for hardware watchdogs
If you have a watchdog add-on card, you may need PCI or ISA support enabled in your kernel. If your watchdog is part of your motherboard, you need to enable support for whatever controls it, for example i2c support, SMBus, or maybe just support for a chipset. You may also need to turn on the SMBus controller in your BIOS.
This HOWTO assumes that you already have this set up, or that you know how to do it. The websites of your motherboard and/or watchdog manufacturers are a good starting point for finding out what is needed. You may also need to emerge companion programs, for example the program i2c, which can be used to access on-board chips. Generally, companion programs are not needed to get watchdog functionality.
Check for the watchdog device entry
If your kernel configuration is correct and the watchdog card has been recognized by the kernel (or is simulated by the software watchdog), you should have a watchdog device node (/dev/watchdog):
# ls -l /dev/wa*
crw-rw---- 1 root root 10, 130 /dev/watchdog
Normally this device is created automatically, but if you're using an older device filesystem (devfs, ...) you may need to create the watchdog device manually:
# mknod -m 660 /dev/watchdog c 10 130
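The same check can be scripted. The sketch below only inspects the type of the node and deliberately never opens it, because opening a real watchdog device can arm the timer. The function name is hypothetical.

```shell
#!/bin/sh
# check_watchdog_node [DEVICE]
# Hypothetical helper: reports whether DEVICE (default /dev/watchdog)
# exists and is a character device. It does not open the device.
check_watchdog_node() {
    dev=${1:-/dev/watchdog}
    if [ ! -e "$dev" ]; then
        echo "$dev: missing"
        return 1
    elif [ ! -c "$dev" ]; then
        echo "$dev: not a character device"
        return 1
    fi
    echo "$dev: character device present"
}

# Example: check_watchdog_node || echo "create the node with mknod first"
```

The major and minor numbers (10, 130) are not verified here; compare them by eye in the ls -l output.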
Install and configure control software
emerge sys-apps/watchdog
This program writes to the watchdog device to tell it that everything is OK. There may be other tools that do the same, but this HOWTO uses sys-apps/watchdog. It is essential for both software and hardware watchdogs!
# emerge watchdog
Edit /etc/watchdog.conf
The default watchdog configuration doesn't monitor your system. If you just want your computer rebooted when the system locks up and watchdog can no longer write to /dev/watchdog, this is perfectly fine and you don't need to change this configuration file.
If you want to monitor other system functionality you can do it with watchdog, but be aware that there are many bugs that may cause false alarms/reboots. You may want to set up a repair binary for some of these options, because otherwise watchdog solves any issue by rebooting the system (which is not always the best solution). Also use --no-action and -v as startup options to test everything (see below).
watchdog-device specifies the device name of your watchdog. This should always be /dev/watchdog. If you don't specify the device, your software or hardware watchdog will not be activated and cannot reboot your computer. The default is unset (NULL), so you need to set it explicitly.
watchdog-device = /dev/watchdog
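One way to confirm that an option is really set, and not merely present as a comment, is to extract it with a small helper. The conf_value function below is an illustration only and not something shipped with watchdog. It assumes the "key = value" layout shown in this HOWTO.

```shell
#!/bin/sh
# conf_value FILE KEY
# Hypothetical helper: prints the value of the last uncommented
# "KEY = value" line in FILE, or nothing if KEY is not set.
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Example:
#   conf_value /etc/watchdog.conf watchdog-device
```

An empty result for watchdog-device means the daemon would run without an actual watchdog device behind it.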
pidfile monitors program pidfiles. watchdog checks whether the corresponding process is still running. For example:
pidfile = /var/run/metalog.pid
pidfile = /var/run/apache2.pid
pidfile = /var/run/authdaemon.pid
pidfile = /var/run/imapd.pid
pidfile = /var/run/sshd.pid
pidfile = /var/run/svscan.pid
interface checks whether there was any traffic on the interface between two watchdog intervals. If not, the watchdog software assumes that the network is unreachable and calls a repair binary (or just reboots the computer). ATTENTION: This function is broken and cannot handle interfaces that have seen more than 2.1 GB of traffic (see the Known Bugs and Patches section below).
interface = eth0
min-memory checks if enough memory is available. It's not measured in bytes but in pages.
min-memory = 1
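Because the value is given in pages, it helps to convert from a more familiar unit. The following sketch converts megabytes to pages using the page size reported by getconf. The function name and its optional second argument are assumptions made for this example.

```shell
#!/bin/sh
# mb_to_pages MEGABYTES [PAGESIZE_BYTES]
# Hypothetical helper: converts a size in MB to a number of memory pages.
# The page size defaults to what "getconf PAGESIZE" reports.
mb_to_pages() {
    pagesize=${2:-$(getconf PAGESIZE)}
    echo $(( $1 * 1024 * 1024 / $pagesize ))
}

# Example: pages that correspond to 64 MB on this machine
#   mb_to_pages 64
```

With 4096-byte pages, 64 MB corresponds to 16384 pages, while the min-memory = 1 shown above amounts to a single page.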
The max-load options monitor the system load averages over 1, 5 and 15 minutes. If a limit is exceeded, the computer is rebooted.
max-load-1 = 24
max-load-5 = 18
max-load-15 = 12
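To get a feeling for how close your machine comes to these limits before enabling them, you can compare the current load averages against candidate values by hand. The sketch below only mimics the comparison; it is not how the watchdog daemon implements the check, and the function name is invented.

```shell
#!/bin/sh
# load_too_high "LOADAVG_LINE" MAX1 MAX5 MAX15
# Hypothetical helper: succeeds (exit status 0) if any of the first three
# fields of a /proc/loadavg style line exceeds its corresponding limit.
load_too_high() {
    echo "$1" | awk -v m1="$2" -v m5="$3" -v m15="$4" \
        '{ exit !($1 > m1 || $2 > m5 || $3 > m15) }'
}

# Example with the limits used above:
#   load_too_high "$(cat /proc/loadavg)" 24 18 12 && echo "would trigger"
```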
ping pings a host and assumes the network is unreachable if the host doesn't reply. ATTENTION: The implementation of ping is broken and you will get a lot of false alarms (see the Known Bugs and Patches section below). For now it is better not to use ping at all!
ping = 172.26.1.255
file monitors a file for changes. If, for example, a process is supposed to write to a logfile and stops doing so, you can let watchdog repair this or reboot the system. Use change to specify how often the file has to change. The value is counted in watchdog intervals (which are normally 10 seconds long).
file = /var/log/everything/syslog
change = 20
Edit startup options in /etc/conf.d/watchdog
Now you can test your watchdog configuration. If you use watchdog to monitor system functionality (traffic on interfaces, processes, ...) you should start with:
WATCHDOG_OPTS="-v --no-action"
Using -v activates verbose output to your syslog. As you will notice, this is not very useful in the long term because you get a lot of syslog messages (watchdog writes its status every 10 seconds by default). In the beginning you should also use --no-action, as some of the watchdog monitor functions are broken and trigger false alarms. You don't want your computer to be rebooted without a reason (e.g. because you had more than 2.1 GB of traffic on eth0).
For long-term testing add -f and choose a higher value for interval in watchdog.conf (see above). This allows you to extend the interval to e.g. 300 seconds, which means far fewer messages in your syslog. But NEVER use -f without --no-action, because otherwise your watchdog will reboot your computer after 60 seconds. (By the way: don't expect logtick to work. It's broken.)
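Put together, a long-term dry run could be configured roughly as follows. This is only a sketch of the combination described above; the interval value of 300 is the example figure from the text, not a recommendation.

```shell
# /etc/conf.d/watchdog (sketch for a long-running dry run)
# -v           verbose logging to syslog
# --no-action  run the checks but never reboot the machine
# -f           force use of the interval from watchdog.conf,
#              even if it is longer than the watchdog timeout
WATCHDOG_OPTS="-v --no-action -f"

# Matching line in /etc/watchdog.conf (not shell, shown for reference):
#   interval = 300
```

Remove -f again before you remove --no-action.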
If you're using metalog you can monitor your logfiles using
# tail -f /var/log/everything/current | grep watchdog
Once you see that the basic system monitoring works as you expect, you can and should remove all startup options again. I recommend removing -v as well, because the logtick option (see above) is broken and you don't want to see the watchdog status in your logfiles every 10 seconds. As soon as logtick is fixed (e.g. after you apply the unofficial patch), -v will be fine.
WATCHDOG_OPTS=""
Please note that as long as you use --no-action you are not really testing your watchdog device. With --no-action, watchdog doesn't open /dev/watchdog and doesn't write to it at all.
Set up a repair binary
Some repair binaries are shipped with watchdog, mainly shell scripts. You can modify them to handle some issues without rebooting the system. You need to specify the repair binary with the repair-binary option in /etc/watchdog.conf.
You can skip this part for now if you want and set it up later.
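A repair script could look roughly like the sketch below. It assumes, as the watchdog documentation describes, that the daemon passes the error code as the first argument and treats a non-zero exit status as a failed repair. The handled codes 100 and 101 are the Linux errno values ENETDOWN and ENETUNREACH. The RESTART_NET_CMD variable and the choice of net.eth0 are assumptions made for this example, so adapt them to your system.

```shell
#!/bin/sh
# Sketch of a repair script. Assumed calling convention: $1 is the
# error code reported by watchdog.
# RESTART_NET_CMD is a hypothetical override, mainly useful for testing.
repair() {
    err=$1
    case "$err" in
        100|101)
            # ENETDOWN / ENETUNREACH: try to bring the network back up
            ${RESTART_NET_CMD:-/etc/init.d/net.eth0 restart} && return 0
            ;;
    esac
    # Unknown or unrepaired problem: hand the error back to watchdog
    return "${err:-1}"
}

# When installed as a standalone script, finish with:
#   repair "$1"; exit $?
```

Test such a script by hand with different error codes before pointing watchdog at it.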
Start Watchdog
Run watchdog using startup script
The recommended way to manually start watchdog is
# /etc/init.d/watchdog start
Make sure you have configured it correctly, and don't add it to your boot runlevel before you know that everything works!
Add it to your boot runlevel
Once you have tested your installation for a while and are sure that no false alarms will trigger a reboot, you may add it to your startup scripts.
# rc-update add watchdog boot
Known Bugs and Patches
- Monitoring a network interface fails after 2147483647 bytes (2^31 - 1) have been received (see Bug 123404, resolved upstream?)
- The logtick option is broken (see Bug 120037)
- Bugreports and Patches
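To see whether an interface is already past the byte count named in the first bug, you can read its receive counter from /proc/net/dev. The helpers below are a sketch with invented names. The optional FILE argument exists only so the parsing can be tried on a saved copy.

```shell
#!/bin/sh
# rx_bytes IFACE [FILE]
# Hypothetical helper: prints the received-bytes counter for IFACE,
# reading /proc/net/dev unless another FILE is given.
rx_bytes() {
    sed 's/:/ /' "${2:-/proc/net/dev}" | awk -v ifc="$1" '$1 == ifc { print $2 }'
}

# past_limit IFACE [FILE]
# Succeeds if the counter for IFACE exceeds 2147483647 (2^31 - 1).
past_limit() {
    rx_bytes "$@" | awk 'BEGIN { rc = 1 } $1 > 2147483647 { rc = 0 } END { exit rc }'
}

# Example:
#   past_limit eth0 && echo "eth0 is beyond the limit from Bug 123404"
```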
Comments, my experience with this procedure
On my machine, a dual Opteron, I was not able to get the program to work correctly when it was added to the boot runlevel. I ended up running it from /etc/conf.d/local.start. Not the best, but it works for me now. At first the program didn't seem to log at the correct intervals. Now, however, it does, and I don't know why. Maybe it is just not happy on the amd64 arch. If I set it to ping my default gateway, it sometimes reboots for no apparent reason. I don't know why a ping couldn't get through. I have since guessed that my provider's gateway may not have responded to a ping, although I can't imagine why.
I also have the same problem with my dual Opteron.
