SCOM alerts are only as good as the SCOM agents behind them. From time to time agents go “dark”, and repairing or reinstalling the affected agent may help.
Unfortunately, this tends to happen with the agent on the SCOM RMS/MS server too. One way to cause it is to restart the SQL server that hosts the OperationsManager (SCOM) database. Keep this dependency in mind during patching runs as well: service and reboot the SCOM SQL server first, and only then the SCOM RMS/MS server.
So what do we do if the server responsible for monitoring stops working without anyone noticing? Interestingly, when this happens the SCOM console may continue to work, at least well enough to launch and navigate.
Some folks set up multiple monitoring servers with the intention of having them monitor each other. In some cases, though, it may be simpler to schedule a small program or script that checks all SCOM agents and takes some action if any of them is not healthy.
Checking Agent Status
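```vb
' Requires these Imports at the top of the file:
'   Imports System.Collections.ObjectModel
'   Imports System.Diagnostics
'   Imports Microsoft.EnterpriseManagement
'   Imports Microsoft.EnterpriseManagement.Administration
Dim agentCriteria As AgentManagedComputerCriteria
Dim agents As ReadOnlyCollection(Of AgentManagedComputer)

Try
    ' Connect to the management group; Using releases the connection on exit.
    ' (mg is declared by the Using statement itself; a separate Dim mg would not compile.)
    Using mg As New ManagementGroup(My.Settings.ManagementGroup)
        ' Match every agent modified since 1 Jan 2000, i.e. effectively all of them
        agentCriteria = New AgentManagedComputerCriteria("LastModified >= '" & New DateTime(2000, 1, 1).ToString("G") & "'")
        agents = mg.Administration.GetAgentManagedComputers(agentCriteria)

        For Each agent As AgentManagedComputer In agents
            If agent.HealthState > 1 Then
                Log(agent.Name & " is monitored but not healthy. Status: " & agent.HealthState, EventLogEntryType.Warning, 390)
                'do something to notify about the issue
            End If
        Next
    End Using
Catch ex As Exception
    'handle any unexpected errors here
Finally
    agents = Nothing
    agentCriteria = Nothing
End Try
```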
The snippet is fairly self-explanatory. The HealthState property of AgentManagedComputer can take the following values:
- Agent health state 0 = not monitored
- Agent health state 1 = healthy
- Agent health state 3 = critical
I suspect that state 2 would be the “dark” agent state, though I have not seen that condition since putting this code in place. (In the SDK's HealthState enumeration, the value 2 is named Warning.) Either way, anything with a status greater than 1 is being monitored and is not healthy, and is therefore worth a notification.
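To make the meaning of the numbers explicit, the values could be wrapped in a small helper. This is only a sketch: the module and function names are made up for illustration, and the label for state 2 assumes the SDK's HealthState enumeration, where 2 is named Warning. It works on plain integers, so it can be tried without a SCOM connection.

```vb
Imports System

Module AgentHealth
    ' Describes a numeric health state. Labels for 0, 1 and 3 follow the list
    ' in this post; the label for 2 is an assumption based on the SDK enum.
    Public Function DescribeHealthState(state As Integer) As String
        Select Case state
            Case 0 : Return "not monitored"
            Case 1 : Return "healthy"
            Case 2 : Return "warning"
            Case 3 : Return "critical"
            Case Else : Return "unknown (" & state.ToString() & ")"
        End Select
    End Function

    ' Same rule as the check in the snippet: anything above 1 needs attention.
    Public Function NeedsAttention(state As Integer) As Boolean
        Return state > 1
    End Function

    Sub Main()
        For state As Integer = 0 To 3
            Console.WriteLine(state & " = " & DescribeHealthState(state) &
                              ", notify: " & NeedsAttention(state))
        Next
    End Sub
End Module
```

Inside the loop over agents, something like `DescribeHealthState(CInt(agent.HealthState))` would then give a readable status for the log message.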
The “Log” function in the code above should be replaced with whatever action you want the script or program to take. In my case, Log calls a function that writes an event to the Event Log; I left the line in to show the usage of the agent.Name and agent.HealthState properties.
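For completeness, here is a minimal sketch of what such a Log function might look like. It is not the exact function from my program: the signature is simply inferred from the call in the snippet, and the event source name “ScomAgentCheck” is a hypothetical placeholder. It uses the standard System.Diagnostics.EventLog class and writes to the Application log.

```vb
Imports System.Diagnostics

Module EventLogHelper
    ' Hypothetical event source name; choose one that suits your environment.
    Public Const SourceName As String = "ScomAgentCheck"

    ' Writes a single entry to the Windows Application log.
    Public Sub Log(message As String, entryType As EventLogEntryType, eventId As Integer)
        ' Registering a new source requires administrative rights, so in
        ' practice this is usually done once, at install time.
        If Not EventLog.SourceExists(SourceName) Then
            EventLog.CreateEventSource(SourceName, "Application")
        End If
        EventLog.WriteEntry(SourceName, message, entryType, eventId)
    End Sub
End Module
```

Since the event ends up in the Windows Event Log, any tool that already watches that log can pick it up, which is handy when SCOM itself is the thing that is unhealthy.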
The “Using” block opens a new connection to the management group. The “My.Settings.ManagementGroup” setting comes from the VB.NET project settings; note that the ManagementGroup constructor expects the name of a management server to connect to, so that is what the setting should hold. “Using” automatically releases the resources held by the connection when execution leaves the block, which is quite convenient.
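The clean-up behaviour of “Using” can be seen in isolation with a stand-in object. The FakeConnection class below is purely hypothetical and has nothing to do with the SCOM SDK; it only records whether Dispose was called, so the sketch runs without a management group.

```vb
Imports System

' Hypothetical stand-in for a connection object; not part of the SCOM SDK.
Class FakeConnection
    Implements IDisposable

    Public Property IsDisposed As Boolean = False

    Public Sub Dispose() Implements IDisposable.Dispose
        IsDisposed = True
        Console.WriteLine("Connection released")
    End Sub
End Class

Module UsingDemo
    Sub Main()
        Dim observed As FakeConnection = Nothing
        Try
            Using conn As New FakeConnection()
                observed = conn
                ' Simulate a failure in the middle of the work
                Throw New InvalidOperationException("simulated failure")
            End Using
        Catch ex As InvalidOperationException
            ' Dispose has already run by the time control reaches this handler
            Console.WriteLine("Disposed after exception: " & observed.IsDisposed)
        End Try
    End Sub
End Module
```

The point is that Dispose runs even when the block is left through an exception, which is why the agent check does not need its own clean-up code for the connection.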
The other properties and methods of the AgentManagedComputer class are documented on MSDN.