Hey everyone, thanks in advance for your help.
Issue Description:
I have an ArcGIS Enterprise 10.9.1 infrastructure with four nodes configured to be highly available on Windows Server 2016. For the past three weeks, the nodes have been experiencing spontaneous crashes where the number of ArcSOC processes drops to about half or even zero (as observed in ArcGIS Monitor). When this happens, the services become intermittent, and some incoming requests to the affected server fail.
We have observed several behaviors:
Image 1. A server failing and losing ArcSOC instances
Image 2. Servers failing (red marks). The yellow one tries to synchronize, but it stops all ArcSOC processes, then tries to create them again and fails. This image shows how I went from three nodes working well to just one in a couple of hours.
After the failure, other behaviors occur where the node remains blocked indefinitely or sometimes tries to recover autonomously. When it does this, the node stops all instances and starts creating them again. Sometimes, it works, and the node returns to its usual number of instances. Other times, the number of instances remains halved, and the service intermittency continues.
Actions Taken:
Actions Suggested by Esri Support (Applied Without Success):
Additional Information:
@GEOKID, were you and support able to isolate the performance degradation to any specific services? Were logs set to DEBUG level during the timeframe in which support was unable to locate any meaningful information? It can be helpful to incrementally (and temporarily) disable services to isolate the cause(s). It's especially helpful if a timeline can be established, ideally one that includes environmental changes.
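If it helps with building that timeline, here is a minimal sketch for sampling the ArcSOC process count on a Windows node and flagging sudden drops. It assumes the worker processes show up as `ArcSOC.exe` in the built-in `tasklist` command; the function names and the 50% drop threshold are my own placeholders, not anything from Esri tooling.

```python
import csv
import io
import subprocess
from datetime import datetime


def count_arcsoc(tasklist_csv, image_name="ArcSOC.exe"):
    """Count rows of `tasklist /FO CSV /NH` output whose image name matches."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return sum(1 for row in rows if row and row[0].lower() == image_name.lower())


def sample_local_node():
    """Take one (timestamp, count) sample on the local Windows node."""
    out = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq ArcSOC.exe", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return datetime.now(), count_arcsoc(out)


def find_drops(samples, ratio=0.5):
    """Return (timestamp, previous, current) for each sample where the
    count fell to ratio * previous or lower (assumed threshold: half)."""
    drops = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if prev > 0 and cur <= prev * ratio:
            drops.append((ts, prev, cur))
    return drops
```

Calling `sample_local_node()` from a scheduled task every few minutes on each node and appending the result to a file would give a per-node series to feed into `find_drops`, which can then be lined up against patch dates, backups, or other environmental changes.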
Thank you for the response, @JustinS. To address the issue, I halted the services experiencing the highest traffic and stopped or republished those that failed. I sent the debug mode logs to our Premium Support and carried out all the recommended solutions, such as installing patches, decreasing the number of instances, and confirming consistent configuration across all nodes in language, region, pagination, and permissions. There's no apparent pattern related to time, traffic, or any specific service that might be causing the failure. The servers, hosted on VMware machines, have been checked for connection, resources, and operating system integrity.
@GEOKID, we have experienced similar behavior. We also have a 10.9.1 Enterprise install and run a standalone server on Linux. I'm intrigued by your comment about spontaneous crashes around midnight. We applied three patches in early August, and the Server service has since failed three times over six weeks. All failures have occurred between midnight and the early morning. Our staff haven't had a chance to look into the issue in detail, so I unfortunately don't have much to offer except solidarity! Patches applied: ArcGIS Server Map and Feature Service Security 2023 Update 1 Patch (esri.com), ArcGIS Server Hosted Services Restart Patch (esri.com), ArcGIS Server 10.9.1 Utility Network and Data Management Patch 6 (esri.com)
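For anyone wanting to check whether their own failures cluster around a time of day, a rough sketch like the one below buckets failure timestamps by hour. The function name and timestamp format are assumptions for illustration; the timestamps would come from your own server logs or monitoring alerts.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Map hour of day (0-23) to the number of failures seen in that hour."""
    hours = Counter(datetime.strptime(t, fmt).hour for t in timestamps)
    return dict(sorted(hours.items()))
```

If most of the counts land in the same one or two hours, that points toward something scheduled (backups, antivirus scans, snapshots, recycling) rather than traffic.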