To determining if your VMware vSphere HA cluster has experienced a host failure
Steps:
- Collect one of the ESXi VMkernel coredumps
- Use Notpad++ to open the fdm log and search "dead"
Output:
Line 507: 2022-04-12T02:47:10.049Z verbose fdm[2815311] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Waited 15 seconds for disk heartbeat for host host-2075951 - declaring dead
Line 669: 2022-04-12T02:47:10.053Z info fdm[2815305] [Originator@6876 sub=Invt opID=SWI-7f9b698e] Host host-2075951 changed state: Dead
Line 752: --> host-2075951: Dead
Line 827: --> host-2075951 state=Dead reservedMem=398505017344 reservedCpu=115750 unreservedMem=197486706688 unreservedCpu=17840 inMaintenanceMode=false
Line 3545: --> state=Dead electionId=1507429880550
Line 7400: --> host-2075951: Dead
Line 7475: --> host-2075951 state=Dead reservedMem=398505017344 reservedCpu=115750 unreservedMem=197486706688 unreservedCpu=17840 inMaintenanceMode=false
Line 10139: --> state=Dead electionId=1507429880550
Line 10572: 2022-04-12T02:53:50.548Z verbose fdm[2815311] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Dead slave host-2075951 responded to icmp ping - going to FDMUnreachable
Note: In the log analysis result, one of the host in the cluster dead or manually pull-out to the Chasis or in the power cable
To Justify this. Compare the suspected host logs timeframe ( Base on the log analysis below the host dead causing the HA to restart the VM servers )
- To verify this. Check vmksummary log and sysboot log
Output of the vmksummary log - We can see here that the host booted "2022-04-12T02:55:43Z bootstop: Host has booted"
Output of the sysboot log - We can see here the reboot time and up time of the host
Summary: So the fdm log of the other host timestamp is sync to the vmksummary log and sysboot log timestamp of the dead/faulty host.
- 2022-04-12T02:47 > Host dead
- 2022-04-12T02:51 > Host rebooted
- 2022-04-12T02:54 > Host is fully booted up
Reference VMware KB: https://kb.vmware.com/s/article/2036544
Comments