Supermicro BMC watchdog-caused reboots


Eventually I found a somewhat strange solution: leave the watchdog jumper (JWD1) open, with neither NMI nor hard reset selected. The watchdog remains enabled in the BIOS settings.

In this case the watchdog works as expected: the system stayed up for 25 minutes with bmc-watchdog running, and rebooted after the watchdog program was terminated.


Author: Alexander Sergeyev

Updated on September 18, 2022

Comments

  • Alexander Sergeyev almost 2 years

    I've recently acquired a SuperMicro X10SLL-F motherboard, which has a built-in BMC (Aspeed AST2400 chip). I want to use the built-in watchdog controller when running Linux on the server (Gentoo Hardened).

    I enabled the watchdog function in the BIOS, then switched the motherboard jumper from hard reset to NMI (the watchdog timeout action; I chose NMI for testing, to avoid reboots). On the software side, I installed the watchdog program (sys-apps/watchdog) and added it to the default runlevel. It is configured to ping the watchdog device (/dev/watchdog, which is present) every 10 seconds. The watchdog timeout is set to 250 seconds.
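
For readers unfamiliar with what "pinging" /dev/watchdog means, here is a minimal Python sketch of the idea, not the actual sys-apps/watchdog implementation. The function names `pet_watchdog` and `disarm` are made up for illustration. The sketch relies on two documented properties of the Linux watchdog device: any write resets the countdown, and writing the character `V` just before closing asks the driver to stop the timer (only if the driver supports "magic close" and was not built with nowayout).

```python
import time


def pet_watchdog(dev, interval_s=10, pings=3, sleep=time.sleep):
    """Write one byte to `dev` every `interval_s` seconds, `pings` times.

    `dev` is any writable binary file object. For real use it would be
    open("/dev/watchdog", "wb", buffering=0), which also arms the timer.
    """
    for i in range(pings):
        dev.write(b"\0")  # any write resets the countdown
        dev.flush()
        if i < pings - 1:
            sleep(interval_s)


def disarm(dev):
    """Magic close: send 'V' immediately before closing the device."""
    dev.write(b"V")
    dev.flush()
    dev.close()
```

If the process dies without the magic close, the countdown keeps running and the configured timeout action fires, which is the behaviour a watchdog is meant to provide.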

    The tools apparently see the watchdog hardware. ipmitool (with OpenIPMI enabled):

    # ipmitool mc watchdog get
    Watchdog Timer Use:     SMS/OS (0x44)
    Watchdog Timer Is:      Started/Running
    Watchdog Timer Actions: Hard Reset (0x01)
    Pre-timeout interval:   0 seconds
    Timer Expiration Flags: 0x10
    Initial Countdown:      254 sec
    Present Countdown:      253 sec
    

    FreeIPMI:

    # bmc-watchdog --get
    Timer Use:                   SMS/OS
    Timer:                       Running
    Logging:                     Enabled
    Timeout Action:              Hard Reset
    Pre-Timeout Interrupt:       None
    Pre-Timeout Interval:        0 seconds
    Timer Use BIOS FRB2 Flag:    Clear
    Timer Use BIOS POST Flag:    Clear
    Timer Use BIOS OS Load Flag: Clear
    Timer Use BIOS SMS/OS Flag:  Set
    Timer Use BIOS OEM Flag:     Clear
    Initial Countdown:           254 seconds
    Current Countdown:           253 seconds
    

    However, after a certain amount of time I get the following (while the programs above still report sane "current countdown" values):

    [  294.107534] Uhhuh. NMI received for unknown reason 21 on CPU 0.
    [  294.107998] Do you have a strange power saving mode enabled?
    [  294.108437] Dazed and confused, but trying to continue
    

    This is an NMI, apparently caused by a watchdog timeout. A little less than a minute later, the machine is hard-reset.

    Where is the problem, and in which direction should I dig?

    EDIT: kernel messages related to ipmi:

    [    0.353090] ipmi message handler version 39.2
    [    0.353353] ipmi device interface
    [    0.353623] IPMI System Interface driver.
    [    0.353898] ipmi_si: probing via ACPI
    [    0.354172] ipmi_si 00:08: [io  0x0ca2] regsize 1 spacing 1 irq 0
    [    0.354444] ipmi_si: Adding ACPI-specified kcs state machine
    [    0.354790] ipmi_si: probing via SMBIOS
    [    0.355051] ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
    [    0.355317] ipmi_si: Adding SMBIOS-specified kcs state machine duplicate interface
    [    0.355836] ipmi_si: probing via SPMI
    [    0.356095] ipmi_si: SPMI: io 0xca2 regsize 1 spacing 1 irq 0
    [    0.356362] ipmi_si: Adding SPMI-specified kcs state machine duplicate interface
    [    0.356906] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
    [    0.390536] ipmi_si: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
    [    0.418476] ipmi_si 00:08: Found new BMC (man_id: 0x002a7c, prod_id: 0x0801, dev_id: 0x20)
    [    0.419004] ipmi_si 00:08: IPMI kcs interface initialized
    [    0.419272] IPMI SSIF Interface driver
    [    0.420350] IPMI Watchdog: driver initialized
    [    0.420635] Copyright (C) 2004 MontaVista Software - IPMI Powerdown via sys_reboot.
    [    0.421444] IPMI poweroff: ATCA Detect mfg 0x2A7C prod 0x801
    [    0.421710] IPMI poweroff: Found a chassis style poweroff function
    

    EDIT: I tried bmc-watchdog with the configuration "-u 4 -p 2 -a 0 -F -P -L -O -i 300 -e 10". So only the SMS/OS timer is in use, the pre-timeout interrupt is set to NMI, and the timeout action is set to None:

    # bmc-watchdog --get
    Timer Use:                   SMS/OS
    Timer:                       Running
    Logging:                     Enabled
    Timeout Action:              None
    Pre-Timeout Interrupt:       NMI / Diagnostic Interrupt
    Pre-Timeout Interval:        0 seconds
    Timer Use BIOS FRB2 Flag:    Clear
    Timer Use BIOS POST Flag:    Clear
    Timer Use BIOS OS Load Flag: Clear
    Timer Use BIOS SMS/OS Flag:  Set
    Timer Use BIOS OEM Flag:     Clear
    Initial Countdown:           300 seconds
    Current Countdown:           290 seconds
    

    But this changed nothing.
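
When comparing watchdog settings between runs, it can help to turn the "Key: value" status output into a dictionary instead of reading it by eye. The following is a small sketch under the assumption that the output keeps the layout shown above. The helper names `parse_watchdog_status` and `leading_seconds` are hypothetical, and `SAMPLE` is a subset of the second `bmc-watchdog --get` listing.

```python
def parse_watchdog_status(text):
    """Turn 'Key: value' lines into a dict, splitting on the first colon only."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skip prompts and blank lines
            status[key.strip()] = value.strip()
    return status


def leading_seconds(value):
    """Return the leading integer of strings like '300 seconds' or '254 sec'."""
    return int(value.split()[0])


# Subset of the status output quoted above.
SAMPLE = """\
Timer Use:                   SMS/OS
Timer:                       Running
Timeout Action:              None
Pre-Timeout Interrupt:       NMI / Diagnostic Interrupt
Pre-Timeout Interval:        0 seconds
Initial Countdown:           300 seconds
Current Countdown:           290 seconds
"""

status = parse_watchdog_status(SAMPLE)
# Seconds elapsed since the timer was last reset.
elapsed = (leading_seconds(status["Initial Countdown"])
           - leading_seconds(status["Current Countdown"]))
```

The same parser should also accept the ipmitool listing, since it uses the same "Key: value" shape, although the key names differ between the two tools.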

    EDIT: Also, when I start the watchdog timer by echoing \0x00 to /dev/watchdog and then leave it untouched, the system is correctly rebooted after the default 10-second timeout. So the watchdog itself works, but the system still reboots at exactly 350 seconds after startup.

    EDIT: I checked the BMC system event log (SEL) and found this after the reboot:

    Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
    Sensor #202 | Watchdog 2 | Assertion Event | Timer expired, status only ; Timer use at expiration = SMS/OS ; Interrupt type = none
    

    What is interesting here is that the event is marked as "status only", and yet the system is rebooted. When I trigger a watchdog timeout intentionally, the log is different:

    Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
    Sensor #202 | Watchdog 2 | Assertion Event | Hard Reset ; Timer use at expiration = SMS/OS ; Interrupt type = none
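
The difference between the two SEL excerpts is easier to spot once each line is split into fields. Below is a minimal Python sketch, assuming the SEL lines keep exactly the " | " and " ; " separated layout quoted above. The function name `parse_sel_line` and the dictionary keys are hypothetical.

```python
def parse_sel_line(line):
    """Split one SEL line into sensor, name, event kind and detail fields."""
    sensor, name, kind, details = [p.strip() for p in line.split("|", 3)]
    fields = [d.strip() for d in details.split(";")]
    record = {"sensor": sensor, "name": name, "kind": kind, "event": fields[0]}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:  # e.g. "Interrupt type = none"
            record[key.strip()] = value.strip()
    return record


# Final SEL entry of each scenario, copied from the excerpts above.
UNEXPECTED = ("Sensor #202 | Watchdog 2 | Assertion Event | "
              "Timer expired, status only ; "
              "Timer use at expiration = SMS/OS ; Interrupt type = none")
INTENTIONAL = ("Sensor #202 | Watchdog 2 | Assertion Event | "
               "Hard Reset ; "
               "Timer use at expiration = SMS/OS ; Interrupt type = none")

events = [parse_sel_line(line)["event"] for line in (UNEXPECTED, INTENTIONAL)]
```

Here `events` holds the action the BMC logged in each case, so the unexpected reboot ("Timer expired, status only") can be told apart from the intentional one ("Hard Reset") programmatically.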