IPMI hardware watchdog with RHEL 7 / CentOS 7

introduction

My new server has IPMI which includes a watchdog timer.

Since I regularly get told that hardware watchdogs are no fun to set up, I’ll record my setup steps for an IPMI watchdog here. This document is not about watchdogs in general, but rather specifically on how to use an IPMI watchdog on a recent Red Hat based distribution.

setup

first, determine your watchdog type

This command will only work if the watchdog service is not running.

[root@epyc ~]# wd_identify
IPMI

If your watchdog is of any other kind, then this post is not what you are looking for.

be careful stopping the service

Depending on your settings (see below), running systemctl stop watchdog.service is either perfectly safe or will reboot your box. If you do stop the service, be sure to check the IPMI watchdog shortly afterwards.

If stopping the service does not change the output to Watchdog Timer Actions: No action, issue a systemctl start watchdog.service before Present Countdown: reaches 0!

[root@epyc ~]# ipmitool mc watchdog get ; systemctl stop watchdog.service ; echo ; sleep 10 ; ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      59 sec

Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      45 sec
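
The check described above can be scripted so that you do not have to read the output by eye while the countdown is running. This is only a minimal sketch and not part of my actual setup: the function names wd_action and wd_stop_is_safe are made up for illustration, and they only understand the text format of ipmitool mc watchdog get shown above.

```shell
# Hypothetical helper: print the value of the "Watchdog Timer Actions"
# field from `ipmitool mc watchdog get` output read on stdin.
wd_action() {
    # Strip the field label and print what remains of that line.
    sed -n 's/^Watchdog Timer Actions:[[:space:]]*//p'
}

# Hypothetical helper: succeed only if the BMC will take no action
# when the countdown reaches 0.
wd_stop_is_safe() {
    case "$(wd_action)" in
        "No action"*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Intended use on the server (commented out, as it needs a BMC):
#   systemctl stop watchdog.service
#   ipmitool mc watchdog get | wd_stop_is_safe || systemctl start watchdog.service
```

Restarting the service on failure is the same recovery step as described in the text; whether you want that automated is a matter of taste.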

second, ensure you load the kernel module at boot

[root@epyc ~]# cat /etc/modules-load.d/ipmi_watchdog.conf
# This loads the necessary kernel module to use an IPMI hardware watchdog
ipmi_watchdog
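
If you prefer to create that file from a script, something like the following should do. It is a sketch under assumptions: the function name write_ipmi_watchdog_conf is hypothetical, and the optional directory argument exists only so the function can be tried out somewhere other than /etc/modules-load.d.

```shell
# Hypothetical helper: write the modules-load.d snippet shown above into
# the given directory (default: /etc/modules-load.d).
write_ipmi_watchdog_conf() {
    dir="${1:-/etc/modules-load.d}"
    printf '%s\n' \
        '# This loads the necessary kernel module to use an IPMI hardware watchdog' \
        'ipmi_watchdog' > "$dir/ipmi_watchdog.conf"
}

# Intended use on the server, as root (commented out):
#   write_ipmi_watchdog_conf
```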

third, reboot and check the module got loaded automatically

The kernel module should be loaded after boot and you should see a /dev/watchdog.

[root@epyc ~]# modprobe --first-time ipmi_watchdog
modprobe: ERROR: could not insert 'ipmi_watchdog': Module already in kernel
[root@epyc ~]# ls -l /dev/watchdog
crw-------. 1 root root 10, 130 Aug 22 21:04 /dev/watchdog
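
The same check can be done without provoking a modprobe error by looking at /proc/modules. This is a rough sketch: module_loaded is a made-up function name, the second argument is only there so the function can be tried against a saved copy of the file, and it assumes ipmi_watchdog is built as a loadable module (a driver compiled into the kernel would not be listed).

```shell
# Hypothetical helper: succeed if the named module is listed in a
# /proc/modules style file (default: /proc/modules itself).
module_loaded() {
    name="$1"
    list="${2:-/proc/modules}"
    # Each line of /proc/modules starts with the module name and a space.
    grep -q "^${name} " "$list"
}

# Intended use on the server (commented out):
#   module_loaded ipmi_watchdog && [ -c /dev/watchdog ] && echo "watchdog device ready"
```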

fourth, ensure you installed, configured and activated the daemon

I do this with Ansible tasks close to those below, but you can also do it manually:

  • install with yum install watchdog
  • adjust your /etc/watchdog.conf to set the watchdog-device
  • enable the service with systemctl enable watchdog.service
  • start the service with systemctl start watchdog.service

- name: "WATCHDOG | ensure packages for using HW watchdog are installed"
  yum:
    name:
      - watchdog
    state: present
- name: "WATCHDOG | ensure config file uses /dev/watchdog"
  lineinfile:
    group: root
    line: "watchdog-device = /dev/watchdog"
    mode: "0644"  # quoted, as the Ansible documentation recommends for file modes
    owner: root
    path: /etc/watchdog.conf
    state: present
- name: "WATCHDOG | ensure watchdog.service is started and enabled"
  systemd:
    name: watchdog.service
    state: started
    enabled: true
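
For the manual route, the config file step can be made repeatable in plain shell. The following is a sketch of roughly what the lineinfile task does, not a replacement for it: ensure_watchdog_device is a hypothetical name, and the optional file argument is only there so it can be tried on a copy of /etc/watchdog.conf first.

```shell
# Hypothetical helper: append the watchdog-device line to the given config
# file (default: /etc/watchdog.conf) unless that exact line is already there.
ensure_watchdog_device() {
    conf="${1:-/etc/watchdog.conf}"
    line='watchdog-device = /dev/watchdog'
    # -x matches whole lines only, -F treats the line as a fixed string,
    # so a commented-out "#watchdog-device = ..." does not count.
    if grep -qxF "$line" "$conf"; then
        return 0
    fi
    printf '%s\n' "$line" >> "$conf"
}

# Intended use on the server, as root (commented out):
#   ensure_watchdog_device && systemctl enable watchdog.service && systemctl start watchdog.service
```

Running it a second time changes nothing, which is the property that makes the Ansible task safe to re-run as well.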

NOTE

Loading the kernel module and starting the service will activate the watchdog, even though it is off in the BIOS. As long as you cleanly systemctl stop watchdog.service, you should not get surprise reboots. If you do, be sure to check your hardware’s system event log (SEL).

verify jumpers on your motherboard

I have a jumper JWD1 to override the hard reset function. I left it on its default of Reset; the other choices are NMI (non-maskable interrupt) and Disable.

Check your server / motherboard manual for details in case your server does not reset when you see the assertion in the SEL.

lastly, decide if you want to enable watchdog in UEFI / BIOS

The above four steps give you a functional watchdog that you can leave unused simply by never starting the service (for example while you run a rescue environment or an operating system installer). I therefore recommend leaving the setting in UEFI / BIOS disabled, provided you test (see below) that the watchdog functions as you intend.

testing watchdog works as expected

Obviously, some of these tests will Hard Reset (0x01) your box, so save all work and kick off all users while you test.

n.b.: In UEFI (aka BIOS) the watchdog is off for all of this post!

checking IPMI watchdog status

When watchdog.service is active, use this IPMI command to check on the watchdog:

[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      59 sec

If you use the default interval = 1 in /etc/watchdog.conf, you should never see a Present Countdown: lower than Initial Countdown minus 1 second. Read the man pages watchdog(8) and watchdog.conf(5) before you decide to adjust the interval.
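
That rule of thumb is easy to turn into a check. The sketch below is illustrative only: wd_countdown and wd_countdown_ok are made-up names, the slack argument stands in for the interval from /etc/watchdog.conf, and the parsing assumes the output format shown above.

```shell
# Hypothetical helper: print the seconds of one countdown field
# ("Initial" or "Present") from `ipmitool mc watchdog get` output on stdin.
wd_countdown() {
    sed -n "s/^$1 Countdown:[[:space:]]*\([0-9][0-9]*\) sec.*/\1/p"
}

# Hypothetical helper: succeed if Present Countdown is no lower than
# Initial Countdown minus the given slack in seconds (default: 1).
wd_countdown_ok() {
    slack="${1:-1}"
    out=$(cat)
    initial=$(printf '%s\n' "$out" | wd_countdown Initial)
    present=$(printf '%s\n' "$out" | wd_countdown Present)
    [ "$present" -ge $((initial - slack)) ]
}

# Intended use on the server (commented out, as it needs a BMC):
#   ipmitool mc watchdog get | wd_countdown_ok 1 || echo "watchdog is not being fed"
```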

reading the SEL from commandline

Since you will reboot your server a few times during testing, you might prefer to read the system event log (SEL) from the comfort of a shell on a remote Linux workstation. Simply use ipmitool.

pcfe@karhu ~ $ ipmitool -H supermicro-bmc -U ADMIN -f ~/.ipmi-supermicro-bmc -I lanplus sel list
   1 | 08/23/2018 | 16:20:36 | Watchdog2 #0xca | Timer interrupt () | Asserted
   2 | 08/23/2018 | 16:20:37 | Watchdog2 #0xca | Hard reset () | Asserted
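
On a BMC with a long event log you may want to see only the watchdog entries. Here is a small sketch for that, assuming the pipe-separated sel list layout shown above; sel_watchdog_events is a hypothetical name, not an ipmitool feature.

```shell
# Hypothetical helper: from `ipmitool ... sel list` output on stdin, print
# "date time event" for every entry whose sensor column mentions Watchdog2.
sel_watchdog_events() {
    awk -F'|' '$4 ~ /Watchdog2/ {
        # Trim the padding around the date, time, sensor and event columns.
        for (i = 2; i <= 5; i++) gsub(/^ +| +$/, "", $i)
        print $2, $3, $5
    }'
}

# Intended use from the workstation (commented out, as it needs a BMC):
#   ipmitool -H supermicro-bmc -U ADMIN -f ~/.ipmi-supermicro-bmc -I lanplus sel list | sel_watchdog_events
```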

stopping watchdog.service cleanly

A clean stop of the service should set your Watchdog Timer Actions to No action (0x00). The timer will expire but no reboot will happen.

[root@epyc ~]# systemctl stop watchdog.service
[root@epyc ~]# date
Mi 22. Aug 22:03:52 CEST 2018
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      51 sec

When the counter reaches 0, I see the following in the SEL of the server’s BMC web UI:

EID | Time Stamp          | Sensor Type | Description
53  | 2018/08/22 20:04:47 | Watchdog 2  | Out of Spec Definition (0x08) - Assertion
54  | 2018/08/22 20:04:48 | Watchdog 2  | Timer Expired (Status Only) - Assertion

but the box does not reset, as stated by Watchdog Timer Actions: No action (0x00).

simulating a hard system crash

To simulate a hard system crash, you can do one of two things:

kill -9 the watchdog process

Run kill -9 ... on the running watchdog daemon. When the counter hits 0, you should see an assertion in the SEL and the box should reboot.

EID | Time Stamp          | Sensor Type | Description
57  | 2018/08/22 20:25:15 | Watchdog 2  | Out of Spec Definition (0x08) - Assertion
58  | 2018/08/22 20:25:16 | Watchdog 2  | Hard Reset - Assertion
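
To avoid killing the wrong process by mistyping a PID, the PID can be read from the daemon's pidfile, /var/run/watchdog.pid on my box. A minimal sketch, with wd_pid being a made-up name and the optional argument only there for trying it on another file:

```shell
# Hypothetical helper: print the PID stored in the watchdog pidfile
# (default: /var/run/watchdog.pid); fail if it is missing or not a number.
wd_pid() {
    pidfile="${1:-/var/run/watchdog.pid}"
    pid=$(cat "$pidfile" 2>/dev/null) || return 1
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$pid"
}

# Intended use on the server (commented out, as it WILL reset the box):
#   kill -9 "$(wd_pid)" && sleep 10 && ipmitool mc watchdog get
```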

SysRq c

Alternatively, really crash the box with SysRq c or by writing c to /proc/sysrq-trigger.

[root@epyc ~]# date ; echo c > /proc/sysrq-trigger
Do 23. Aug 18:16:21 CEST 2018

NOTE

On my Super Micro H11DSi-NT, this only rebooted the box after the BIOS default timeout, not after the watchdog timeout of my IPMI driver. The watchdog feature is off in the BIOS, but its default timeout of 300 seconds still counts. (I also needed to fix my BMC’s clock, which was off by a few seconds. Now it’s on NTP, as it should be.)

[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      60 sec
[root@epyc ~]# systemctl status watchdog.service -l
● watchdog.service - watchdog daemon
   Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
   Active: active (running) since Do 2018-08-23 18:15:44 CEST; 27s ago
  Process: 1358 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
 Main PID: 1365 (watchdog)
    Tasks: 1
   CGroup: /system.slice/watchdog.service
           └─1365 /usr/sbin/watchdog

Aug 23 18:15:44 epyc.internal.pcfe.net systemd[1]: Starting watchdog daemon...
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: starting daemon (5.13):
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: int=1s realtime=yes sync=no soft=no mla=0 mem=0
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: ping: no machine to check
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: file: no file to check
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: pidfile: no server process to check
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: interface: no interface to check
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
Aug 23 18:15:44 epyc.internal.pcfe.net watchdog[1365]: hardware watchdog identity: IPMI
Aug 23 18:15:44 epyc.internal.pcfe.net systemd[1]: Started watchdog daemon.
[root@epyc ~]# date ; echo c > /proc/sysrq-trigger
Do 23. Aug 18:16:21 CEST 2018

EID | Time Stamp          | Sensor Type | Description
1   | 2018/08/23 16:20:36 | Watchdog 2  | Out of Spec Definition (0x08) - Assertion
2   | 2018/08/23 16:20:37 | Watchdog 2  | Hard Reset - Assertion
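
How long the box took to reset can be worked out from the two timestamps above. The date command printed 18:16:21 CEST, which is 16:16:21 UTC, and the SEL logged the hard reset at 16:20:37. The sketch below assumes the BMC clock runs on UTC and that both times fall on the same day; elapsed_seconds is a made-up helper name.

```shell
# Hypothetical helper: seconds between two HH:MM:SS times on the same day.
elapsed_seconds() {
    printf '%s %s\n' "$1" "$2" | awk '{
        # Convert both times to seconds since midnight and subtract.
        split($1, a, ":"); split($2, b, ":")
        print (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
    }'
}

elapsed_seconds 16:16:21 16:20:37   # prints 256
```

That is a little over four minutes, far beyond the 60 second Initial Countdown set by the driver.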

end of tests to perform

If the above tests worked as expected for you, then you are done setting up your IPMI watchdog.

my test log

The rest of this post is pretty much just a note to self and I might remove it in a future update. But for now I want the tests I did recorded somewhere.

fresh boot, BIOS off, watchdog.service off

Right after boot, with the watchdog off in the BIOS, there is no timer action set (as expected) and the counter does not decrement.

[root@epyc ~]# systemctl status watchdog.service
● watchdog.service - watchdog daemon
   Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Aug 22 22:27:37 epyc.internal.pcfe.net systemd[1]: [/usr/lib/systemd/system/watchdog.service:9] Unknown lvalue '...ice'
Hint: Some lines were ellipsized, use -l to show in full.
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x04)
Watchdog Timer Is:      Stopped
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      10 sec

manually starting from previous state

I start the service, which sets the timer action and a countdown of 60 seconds.

[root@epyc ~]# systemctl start watchdog.service
[root@epyc ~]# systemctl status watchdog.service
● watchdog.service - watchdog daemon
   Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled; vendor preset: disabled)
   Active: active (running) since Mi 2018-08-22 22:28:59 CEST; 2s ago
  Process: 2704 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
 Main PID: 2706 (watchdog)
    Tasks: 1
   CGroup: /system.slice/watchdog.service
           └─2706 /usr/sbin/watchdog

Aug 22 22:28:59 epyc.internal.pcfe.net systemd[1]: Starting watchdog daemon...
Aug 22 22:28:59 epyc.internal.pcfe.net systemd[1]: Started watchdog daemon.
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: starting daemon (5.13):
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: int=1s realtime=yes sync=no soft=no mla=0 mem=0
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: ping: no machine to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: file: no file to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: pidfile: no server process to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: interface: no interface to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat...t=no
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: hardware watchdog identity: IPMI
Hint: Some lines were ellipsized, use -l to show in full.
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      59 sec

cleanly stopping watchdog.service, expect no reboot

I stopped the service cleanly with systemctl stop watchdog.service and got, after a minute, the expected timer expiry but no reboot as Watchdog Timer Actions got set to No action (0x00).

EID | Time Stamp          | Sensor Type | Description
59  | 2018/08/22 20:30:44 | Watchdog 2  | Out of Spec Definition (0x08) - Assertion
60  | 2018/08/22 20:30:45 | Watchdog 2  | Timer Expired (Status Only) - Assertion

[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x04)
Watchdog Timer Is:      Stopped
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      0 sec
[root@epyc ~]# uptime
 22:32:59 up 6 min,  1 user,  load average: 0,00, 0,01, 0,02
[root@epyc ~]# systemctl status watchdog.service
● watchdog.service - watchdog daemon
   Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: pidfile: no server process to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: interface: no interface to check
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat...t=no
Aug 22 22:28:59 epyc.internal.pcfe.net watchdog[2706]: hardware watchdog identity: IPMI
Aug 22 22:29:44 epyc.internal.pcfe.net systemd[1]: Stopping watchdog daemon...
Aug 22 22:29:44 epyc.internal.pcfe.net watchdog[2706]: stopping daemon (5.13)
Aug 22 22:29:49 epyc.internal.pcfe.net systemd[1]: Stopped watchdog daemon.
Aug 22 22:29:49 epyc.internal.pcfe.net systemd[1]: [/usr/lib/systemd/system/watchdog.service:9] Unknown lvalue '...ice'
Aug 22 22:29:49 epyc.internal.pcfe.net systemd[1]: [/usr/lib/systemd/system/watchdog.service:9] Unknown lvalue '...ice'
Aug 22 22:34:14 epyc.internal.pcfe.net systemd[1]: [/usr/lib/systemd/system/watchdog.service:9] Unknown lvalue '...ice'
Hint: Some lines were ellipsized, use -l to show in full.
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x04)
Watchdog Timer Is:      Stopped
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      0 sec

kill the watchdog process, expect reboot

If the service dies, the timer hits 0 and the box reboots as expected.

[root@epyc ~]# cat /var/run/watchdog.pid
4054
[root@epyc ~]# date ; kill -9 4054 ; sleep 10 ; ipmitool mc watchdog get
Mi 22. Aug 22:38:37 CEST 2018
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      50 sec

EID | Time Stamp          | Sensor Type | Description
61  | 2018/08/22 20:39:37 | Watchdog 2  | Out of Spec Definition (0x08) - Assertion
62  | 2018/08/22 20:39:38 | Watchdog 2  | Hard Reset - Assertion

check that the box does not reboot randomly in the next few hours

[root@epyc ~]# uptime
 22:51:31 up 2 min,  1 user,  load average: 0,17, 0,12, 0,05
[root@epyc ~]# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      60 sec
[root@epyc ~]# systemctl status watchdog.service -l
● watchdog.service - watchdog daemon
   Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
   Active: active (running) since Mi 2018-08-22 22:48:45 CEST; 2min 54s ago
  Process: 1352 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
 Main PID: 1359 (watchdog)
    Tasks: 1
   CGroup: /system.slice/watchdog.service
           └─1359 /usr/sbin/watchdog

Aug 22 22:48:45 epyc.internal.pcfe.net systemd[1]: Starting watchdog daemon...
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: starting daemon (5.13):
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: int=1s realtime=yes sync=no soft=no mla=0 mem=0
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: ping: no machine to check
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: file: no file to check
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: pidfile: no server process to check
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: interface: no interface to check
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: test=none(0) repair=none(0) alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
Aug 22 22:48:45 epyc.internal.pcfe.net watchdog[1359]: hardware watchdog identity: IPMI
Aug 22 22:48:45 epyc.internal.pcfe.net systemd[1]: Started watchdog daemon.

This worked fine; the box had 16 hours of uptime on the following afternoon.