IPMI hardware watchdog with RHEL 7 / CentOS 7
introduction
My new server has IPMI which includes a watchdog timer.
Since I regularly get told that hardware watchdogs are no fun to set up, I’ll record my setup steps for an IPMI watchdog here. This document is not about watchdogs in general, but rather specifically on how to use an IPMI watchdog on a recent Red Hat based distribution.
setup
first, determine your watchdog type
This command will only work if the watchdog service is not running.
[root@epyc ~]# wd_identify
IPMI
If your watchdog is of any other kind, then this post is not what you are looking for.
be careful stopping the service
Depending on your settings (see below), a systemctl stop watchdog.service is either perfectly safe to run or will reboot your box. If you do stop the service, be sure to check on the IPMI watchdog shortly after stopping it.
If stopping the service does not change the state to Watchdog Timer Actions: No action, you will want to issue a systemctl start watchdog.service before the Present Countdown: reaches 0!
[root@epyc ~]# ipmitool mc watchdog get ; systemctl stop watchdog.service ; echo ; sleep 10 ; ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 59 sec
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 45 sec
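The stop-and-check sequence above could be wrapped in a small helper. This is only a sketch: the function names wd_action and safe_stop_watchdog are made up for illustration, and the 5 second wait is an assumption that must stay well below your Initial Countdown.

```shell
#!/bin/sh
# Sketch only: wd_action and safe_stop_watchdog are hypothetical helper names.

# Read "ipmitool mc watchdog get" output on stdin and print the action name,
# e.g. "Hard Reset" or "No action", without the trailing hex code.
wd_action() {
    awk -F': *' '/^Watchdog Timer Actions:/ {
        sub(/ *\(0x[0-9a-fA-F]+\) *$/, "", $2)
        print $2
    }'
}

# Stop the daemon, then restart it if the BMC still has an action armed.
safe_stop_watchdog() {
    systemctl stop watchdog.service || return 1
    sleep 5   # assumption: much shorter than the Initial Countdown
    action=$(ipmitool mc watchdog get | wd_action)
    if [ "$action" != "No action" ]; then
        echo "watchdog action is still '$action', restarting watchdog.service" >&2
        systemctl start watchdog.service
        return 1
    fi
    echo "watchdog disarmed: $action"
}
```

After sourcing the file as root, safe_stop_watchdog either reports that the watchdog is disarmed or restarts the service before the countdown can reach 0.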
second, ensure you load the kernel module at boot
[root@epyc ~]# cat /etc/modules-load.d/ipmi_watchdog.conf
# This loads the necessary kernel module to use an IPMI hardware watchdog
ipmi_watchdog
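If you prefer to create that file from a script, something like the following would do. It is a minimal sketch; write_ipmi_watchdog_conf is a hypothetical helper name, and the optional directory argument exists only so the function can be tried out against a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: write the modules-load.d snippet into directory $1
# (default /etc/modules-load.d).
write_ipmi_watchdog_conf() {
    dir=${1:-/etc/modules-load.d}
    printf '%s\n' \
        '# This loads the necessary kernel module to use an IPMI hardware watchdog' \
        'ipmi_watchdog' > "$dir/ipmi_watchdog.conf"
}
```

Run write_ipmi_watchdog_conf as root with no argument to write the real file.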
third, reboot and check the module got loaded automatically
The kernel module should be loaded after boot and you should see a /dev/watchdog. The modprobe error below is the expected result; it confirms the module is already loaded.
[root@epyc ~]# modprobe --first-time ipmi_watchdog
modprobe: ERROR: could not insert 'ipmi_watchdog': Module already in kernel
[root@epyc ~]# ls -l /dev/watchdog
crw-------. 1 root root 10, 130 Aug 22 21:04 /dev/watchdog
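The same two checks can be scripted, for example for a post-boot sanity check. This is a sketch with hypothetical helper names (module_listed, is_char_device); it only looks at lsmod output and at the device node.

```shell
#!/bin/sh
# Hypothetical helpers for a post-boot sanity check.

# Succeeds if module $1 appears in the first column of lsmod output on stdin.
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Succeeds if $1 (default /dev/watchdog) exists and is a character device.
is_char_device() {
    [ -c "${1:-/dev/watchdog}" ]
}
```

A possible invocation is lsmod | module_listed ipmi_watchdog && is_char_device && echo OK.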
fourth, ensure you installed, configured and activated the daemon
I do this with Ansible tasks close to those below, but you can just as well do it manually:
- install with yum install watchdog
- adjust your /etc/watchdog.conf to set the watchdog-device
- enable the service with systemctl enable watchdog.service
- start the service with systemctl start watchdog.service
- name: "WATCHDOG | ensure packages for using HW watchdog are installed"
  yum:
    name:
      - watchdog
    state: present

- name: "WATCHDOG | ensure config file uses /dev/watchdog"
  lineinfile:
    group: root
    line: "watchdog-device = /dev/watchdog"
    mode: "0644"
    owner: root
    path: /etc/watchdog.conf
    state: present

- name: "WATCHDOG | Ensure watchdog.service is started and enabled"
  systemd:
    name: watchdog.service
    state: started
    enabled: true
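For the manual route, the config edit can be made repeatable in the same spirit as the lineinfile task. The helper below is a rough sketch, not what Ansible does internally; ensure_line is a hypothetical name. Note that it only looks for an exact, whole-line match, so a commented-out or differently spaced entry is left untouched.

```shell
#!/bin/sh
# Hypothetical helper: append line $2 to file $1 unless an identical line
# already exists, so repeated runs do not add duplicates.
ensure_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

As root, ensure_line /etc/watchdog.conf 'watchdog-device = /dev/watchdog' would then set the device, and running it a second time changes nothing.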
NOTE
Loading the kernel module and starting the service will activate the watchdog, even though it is off in the BIOS.
As long as you cleanly systemctl stop watchdog.service you should not get surprise reboots.
If you do, be sure to check your hardware’s system event log (SEL).
verify jumpers on your motherboard
I have a jumper, JWD1, to override the hard reset function. I left it on its default of Reset; the other choices are NMI (non-maskable interrupt) and Disable.
Check your server / motherboard manual for details in case your server does not reset when you see the assertion in the SEL.
lastly, decide if you want to enable watchdog in UEFI / BIOS
Since the above four steps will give you a functional watchdog and the ability to not use it by never starting the service (maybe while you run a rescue environment or if you decide to run an operating system installer), I recommend you leave the setting in UEFI / BIOS disabled, provided you test (see below) that the watchdog functions as you intend.
testing watchdog works as expected
Obviously, some of these tests will Hard Reset (0x01) your box, so save all work and kick off all users while you test.
n.b.: In UEFI (aka BIOS) the watchdog is off for all of this post!
checking IPMI watchdog status
When watchdog.service is active, use this IPMI command to check on the watchdog:
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 59 sec
If you use the default interval = 1 in /etc/watchdog.conf, you should never see a Present Countdown: lower than Initial Countdown minus 1 second. Read the man pages watchdog(8) and watchdog.conf(5) before you decide to adjust the interval.
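To check that rule without doing the subtraction by eye, the two countdown values can be pulled out of the ipmitool output. This is a minimal sketch; wd_countdown_lag is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read "ipmitool mc watchdog get" output on stdin and
# print Initial Countdown minus Present Countdown, in seconds.
wd_countdown_lag() {
    awk -F': *' '
        /^Initial Countdown:/ { initial = $2 + 0 }
        /^Present Countdown:/ { present = $2 + 0 }
        END { print initial - present }
    '
}
```

Run it as ipmitool mc watchdog get | wd_countdown_lag. With the daemon running and interval = 1, the printed value should be 0 or 1; a steadily growing value means nothing is feeding the watchdog.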
reading the SEL from the command line
Since you will reboot your server a few times during testing, you might prefer to read the system event log (SEL) from the comfort of a shell on a remote Linux workstation. Simply use ipmitool over the network.
pcfe@karhu ~ $ ipmitool -H supermicro-bmc -U ADMIN -f ~/.ipmi-supermicro-bmc -I lanplus sel list
1 | 08/23/2018 | 16:20:36 | Watchdog2 #0xca | Timer interrupt () | Asserted
2 | 08/23/2018 | 16:20:37 | Watchdog2 #0xca | Hard reset () | Asserted
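On a busy SEL you may only care about the watchdog records. The filter below is a sketch that assumes the pipe-separated layout shown above; sel_watchdog_events is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: keep only Watchdog2 records from "ipmitool sel list"
# output on stdin and print "date time event" for each of them.
sel_watchdog_events() {
    awk -F' *[|] *' '$4 ~ /^Watchdog2/ { print $2, $3, $5 }'
}
```

Pipe the ipmitool ... sel list command from above into sel_watchdog_events to get one short line per watchdog event.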
stopping watchdog.service cleanly
A clean stop of the service should set your Watchdog Timer Actions to No action (0x00). The timer will expire but no reboot will happen.
[root@epyc ~]# systemctl stop watchdog.service
[root@epyc ~]# date
Mi 22. Aug 22:03:52 CEST 2018
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 51 sec
When the counter reaches 0, I see this in the SEL of the server’s BMC web UI:
EID | Time Stamp | Sensor Type | Description |
---|---|---|---|
53 | 2018/08/22 20:04:47 | Watchdog 2 | Out of Spec Definition (0x08) - Assertion |
54 | 2018/08/22 20:04:48 | Watchdog 2 | Timer Expired (Status Only) - Assertion |
but the box does not reset, as announced by Watchdog Timer Actions: No action (0x00).
simulating a hard system crash
To simulate a hard system crash, you can do one of two things:
kill -9 the watchdog process
kill -9 ... the running watchdog daemon. When the counter hits 0, you should see an assertion in the SEL and the box should reboot.
EID | Time Stamp | Sensor Type | Description |
---|---|---|---|
57 | 2018/08/22 20:25:15 | Watchdog 2 | Out of Spec Definition (0x08) - Assertion |
58 | 2018/08/22 20:25:16 | Watchdog 2 | Hard Reset - Assertion |
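The kill needs the daemon's PID, which the daemon records in /var/run/watchdog.pid (the path used in my test log further down). A small sketch for reading it safely; wd_pid is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the PID stored in pidfile $1 (default
# /var/run/watchdog.pid); fail if the file is unreadable or not numeric.
wd_pid() {
    pidfile=${1:-/var/run/watchdog.pid}
    [ -r "$pidfile" ] || return 1
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$pid"
}
```

The test itself then becomes pid=$(wd_pid) && date && kill -9 "$pid". Only run that when you really want the box to reset.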
SysRq c
Alternatively, really crash the box with SysRq c or by writing c to /proc/sysrq-trigger.
[root@epyc ~]# date ; echo c > /proc/sysrq-trigger
Do 23. Aug 18:16:21 CEST 2018
note
On my Super Micro H11DSi-NT, this only rebooted the box after the BIOS default timeout, not after the watchdog timeout of my IPMI driver. The watchdog feature is off in the BIOS, but its default timeout of 300 seconds still counts. (I also needed to fix my BMC’s clock, which was off by a few seconds. Now it’s on NTP, as it should be.)
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 60 sec
[root@epyc ~]# systemctl status watchdog.service -l
● watchdog.service - watchdog daemon
Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
Active: active (running) since Do 2018-08-23 18:15:44 CEST; 27s ago
Process: 1358 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
Main PID: 1365 (watchdog)
Tasks: 1
CGroup: /system.slice/watchdog.service
└─1365 /usr/sbin/watchdog
Aug 23 18:15:44 epyc.internal.pcfe.net systemd[1]: Starting watchdog daemon...
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: starting daemon (5.13):
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: int=1s realtime=yes sync=no soft=no mla=0 mem=0
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: ping: no machine to check
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: file: no file to check
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: pidfile: no server process to check
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: interface: no interface to check
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: hardware watchdog identity: IPMI
Aug 23 18:15:44 epyc.internal.pcfe.net systemd[1]: Started watchdog daemon.
[root@epyc ~]# date ; echo c > /proc/sysrq-trigger
Do 23. Aug 18:16:21 CEST 2018
EID | Time Stamp | Sensor Type | Description |
---|---|---|---|
1 | 2018/08/23 16:20:36 | Watchdog 2 | Out of Spec Definition (0x08) - Assertion |
2 | 2018/08/23 16:20:37 | Watchdog 2 | Hard Reset - Assertion |
end of tests to perform
If the above tests worked as expected for you, then you are done setting up your IPMI watchdog.
my test log
The rest of this post is pretty much just a note to self and I might remove it in a future update. But for now I want the tests I did recorded somewhere.
fresh boot, BIOS off, watchdog.service off
Right after boot, with the watchdog off in the BIOS, there is no timer action set (as expected) and the counter does not decrement.
[root@epyc ~]# systemctl status watchdog.service
● watchdog.service - watchdog daemon
Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Aug 22 22:27:37 epyc.internal.pcfe.net systemd[1]: [/usr/lib/systemd/system/watchdog.service:9] Unknown lvalue '...ice'
Hint: Some lines were ellipsized, use -l to show in full.
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x04)
Watchdog Timer Is: Stopped
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 10 sec
Present Countdown: 10 sec
manually starting from previous state
I start the service; this sets the timer action to Hard Reset and the countdown to 60 seconds.
[root@epyc ~]# systemctl start watchdog.service
[root@epyc ~]# systemctl status watchdog.service
● watchdog.service - watchdog daemon
Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled; vendor preset: disabled)
Active: active (running) since Mi 2018-08-22 22:28:59 CEST; 2s ago
Process: 2704 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
Main PID: 2706 (watchdog)
Tasks: 1
CGroup: /system.slice/watchdog.service
└─2706 /usr/sbin/watchdog
Aug 22 22:28:59 epyc.internal.pcfe.net systemd[1]: Starting watchdog daemon...
Aug 22 22:28:59 epyc.internal.pcfe.net systemd[1]: Started watchdog daemon.
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: starting daemon (5.13):
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: int=1s realtime=yes sync=no soft=no mla=0 mem=0
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: ping: no machine to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: file: no file to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: pidfile: no server process to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: interface: no interface to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat...t=no
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: hardware watchdog identity: IPMI
Hint: Some lines were ellipsized, use -l to show in full.
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 59 sec
cleanly stopping watchdog.service, expect no reboot
I stopped the service cleanly with systemctl stop watchdog.service and got, after a minute, the expected timer expiry but no reboot, as Watchdog Timer Actions got set to No action (0x00).
EID | Time Stamp | Sensor Type | Description |
---|---|---|---|
59 | 2018/08/22 20:30:44 | Watchdog 2 | Out of Spec Definition (0x08) - Assertion |
60 | 2018/08/22 20:30:45 | Watchdog 2 | Timer Expired (Status Only) - Assertion |
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x04)
Watchdog Timer Is: Stopped
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 0 sec
[root@epyc ~]# uptime
22:32:59 up 6 min, 1 user, load average: 0,00, 0,01, 0,02
[root@epyc ~]# systemctl status watchdog.service
● watchdog.service - watchdog daemon
Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: pidfile: no server process to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: interface: no interface to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat...t=no
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: hardware watchdog identity: IPMI
Aug 22 22:29:44 epyc.internal.pcfe.net systemd[1]: Stopping watchdog daemon...
Aug 22 22:29:44 epyc.internal.pcfe.net watchdog[2706]: stopping daemon (5.13)
Aug 22 22:29:49 epyc.internal.pcfe.net systemd[1]: Stopped watchdog daemon.
Aug 22 22:29:49 epyc.internal.pcfe.net systemd[1]: [/usr/lib/systemd/system/watchdog.service:9] Unknown lvalue '...ice'
Aug 22 22:29:49 epyc.internal.pcfe.net systemd[1]: [/usr/lib/systemd/system/watchdog.service:9] Unknown lvalue '...ice'
Aug 22 22:34:14 epyc.internal.pcfe.net systemd[1]: [/usr/lib/systemd/system/watchdog.service:9] Unknown lvalue '...ice'
Hint: Some lines were ellipsized, use -l to show in full.
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x04)
Watchdog Timer Is: Stopped
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 0 sec
kill the watchdog process, expect reboot
If the service dies, the timer hits 0 and the box reboots as expected.
[root@epyc ~]# cat /var/run/watchdog.pid
4054
[root@epyc ~]# date ; kill -9 4054 ; sleep 10 ; ipmitool mc watchdog get
Mi 22. Aug 22:38:37 CEST 2018
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 50 sec
EID | Time Stamp | Sensor Type | Description |
---|---|---|---|
61 | 2018/08/22 20:39:37 | Watchdog 2 | Out of Spec Definition (0x08) - Assertion |
62 | 2018/08/22 20:39:38 | Watchdog 2 | Hard Reset - Assertion |
test that the box does not reboot randomly in the next few hours
[root@epyc ~]# uptime
22:51:31 up 2 min, 1 user, load average: 0,17, 0,12, 0,05
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 60 sec
Present Countdown: 60 sec
[root@epyc ~]# systemctl status watchdog.service -l
● watchdog.service - watchdog daemon
Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
Active: active (running) since Mi 2018-08-22 22:48:45 CEST; 2min 54s ago
Process: 1352 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
Main PID: 1359 (watchdog)
Tasks: 1
CGroup: /system.slice/watchdog.service
└─1359 /usr/sbin/watchdog
Aug 22 22:48:45 epyc.internal.pcfe.net systemd[1]: Starting watchdog daemon...
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: starting daemon (5.13):
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: int=1s realtime=yes sync=no soft=no mla=0 mem=0
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: ping: no machine to check
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: file: no file to check
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: pidfile: no server process to check
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: interface: no interface to check
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: hardware watchdog identity: IPMI
Aug 22 22:48:45 epyc.internal.pcfe.net systemd[1]: Started watchdog daemon.
This worked fine; the box had 16 hours of uptime the following afternoon.