
Help me Please 😭 ArcGIS Server Crashing

4 weeks ago
GEOKID
by
Emerging Contributor

Hey everyone, thanks in advance for your help.

Issue Description:

I have an ArcGIS Enterprise 10.9.1 infrastructure with four nodes configured to be highly available on Windows Server 2016. For the past three weeks, the nodes have been experiencing spontaneous crashes where the number of ArcSOC processes drops to about half or even zero (as observed in ArcGIS Monitor). When this happens, the services become intermittent, and some incoming requests to the affected server fail.

We have observed several behaviors:

  • At times, a single node may fail.
  • Occasionally, two nodes fail at the same time.
  • Once, all nodes failed 🥲.
  • Sometimes, when attempting to restore one node, it enters synchronization mode and causes another node to fail, which can result in both nodes being down.

GEOKID_0-1726265756015.png 

Image 1. A failing server losing ArcSOC instances

GEOKID_1-1726267147011.png

Image 2. Servers failing (red mark). The yellow one tries to synchronize, stops all of its ArcSOC processes, then tries to create them again and fails. This image shows how I went from 3 nodes working well to just 1 in a couple of hours.

After the failure, other behaviors occur where the node remains blocked indefinitely or sometimes tries to recover autonomously. When it does this, the node stops all instances and starts creating them again. Sometimes, it works, and the node returns to its usual number of instances. Other times, the number of instances remains halved, and the service intermittency continues.
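For anyone who wants to watch the instance count without ArcGIS Monitor, here is a minimal sketch, assuming a Windows node where the service instances show up as `ArcSOC.exe` in `tasklist /FO CSV /NH` output. The function names, the sample rows, and the 50% threshold are made up for illustration and are not part of any Esri tooling.

```python
import csv
import io
import subprocess


def count_arcsoc(tasklist_csv: str, image_name: str = "ArcSOC.exe") -> int:
    """Count rows of header-less tasklist CSV whose image name matches."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    wanted = image_name.lower()
    return sum(1 for row in rows if row and row[0].lower() == wanted)


def instances_dropped(current: int, baseline: int, ratio: float = 0.5) -> bool:
    """True when the live count has fallen to `ratio` of the baseline or lower."""
    return current <= baseline * ratio


def read_tasklist() -> str:
    """Windows only: return the current process list as header-less CSV."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical sample rows so the sketch also runs off the server.
    sample = (
        '"ArcSOC.exe","4120","Services","0","310,000 K"\n'
        '"ArcSOC.exe","4188","Services","0","295,512 K"\n'
        '"javaw.exe","2044","Services","0","1,204,100 K"\n'
    )
    live = count_arcsoc(sample)
    print(live, instances_dropped(live, baseline=100))
```

On a real node, `count_arcsoc(read_tasklist())` could be run from a scheduled task and the result written to a file with a timestamp, which gives a record of when the count halves.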

Actions Taken:

  1. I restarted the ArcGIS Server service on the affected node. Sometimes this works, but other times, the node returns with an unusual number of instances, causing service issues.
  2. I restarted the server, with the same behavior as restarting the service.
  3. The most effective procedure is to put the affected node in maintenance mode to stop services from being affected. Sometimes, this process involves closing and recreating all instances while in maintenance mode. When manual synchronization is activated and maintenance mode is removed, it works correctly most of the time.

Actions Suggested by ESRI Support (Applied Without Success):

  1. Reduced the number of instances from an average of 180 to 100.
  2. Increased machine resources (CPU and RAM); they are now less than half utilized.
  3. Removed orphan jobs.
  4. We ensured no firewall or antivirus blocks.
  5. ESRI support found nothing in the logs.
  6. Used the repair program option on the ArcGIS Server installer.
  7. Uninstalled and reinstalled ArcGIS Server.
  8. Tested by shutting down the most robust services.
  9. Ensured permissions on the file share and the ArcGIS Server local folders.
  10. Ensured all nodes have the same configuration: region, languages, TLS, hardware, settings, and permissions.
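The node-by-node comparison in item 10 can be scripted once the settings are collected. This is only a sketch under assumptions: the node names, setting keys, and values below are hypothetical, and gathering them from each machine is left out.

```python
from collections import defaultdict


def find_config_drift(nodes: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Return {setting: {value: [node names]}} for settings that differ.

    A setting that is absent on a node is reported with the value "<missing>".
    """
    all_keys = set()
    for settings in nodes.values():
        all_keys.update(settings)

    by_setting = defaultdict(lambda: defaultdict(list))
    for name, settings in nodes.items():
        for key in all_keys:
            by_setting[key][settings.get(key, "<missing>")].append(name)

    # Keep only settings that have more than one distinct value.
    return {k: dict(v) for k, v in by_setting.items() if len(v) > 1}


if __name__ == "__main__":
    # Hypothetical values, not taken from the real environment.
    nodes = {
        "node1": {"region": "en-US", "tls": "1.2", "language": "en"},
        "node2": {"region": "en-US", "tls": "1.2", "language": "en"},
        "node3": {"region": "es-CO", "tls": "1.2"},
    }
    for setting, values in find_config_drift(nodes).items():
        print(setting, values)
```

Settings that match on every node are dropped from the result, so an empty dictionary means no drift was found among the keys that were collected.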

Additional Information:

  • The services running on these nodes include map services (feature access, map server, and WMS Capabilities) and geoprocessing services.
  • There is no specific pattern in the timing or conditions under which the failures occur. It happened at midnight without traffic.
  • No specific error messages appear in the ArcGIS Server logs when the failure occurs other than "Server 1 can not connect to server 2".
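One way to test the "no pattern in the timing" impression is to count the connection messages per hour of day. A minimal sketch, assuming the log entries have already been parsed into (timestamp, message) pairs; the parsing step, the function name, and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(entries, needle="can not connect"):
    """Count messages containing `needle`, grouped by hour of day (0-23)."""
    hits = Counter()
    for timestamp, message in entries:
        if needle.lower() in message.lower():
            hits[timestamp.hour] += 1
    return dict(sorted(hits.items()))


if __name__ == "__main__":
    # Hypothetical entries; only the message text comes from the logs described above.
    entries = [
        (datetime(2024, 9, 10, 0, 12), "Server 1 can not connect to server 2"),
        (datetime(2024, 9, 11, 0, 47), "Server 1 can not connect to server 2"),
        (datetime(2024, 9, 11, 14, 5), "Service started"),
        (datetime(2024, 9, 12, 3, 30), "Server 1 can not connect to server 2"),
    ]
    print(failures_by_hour(entries))
```

If most of the counts land in a narrow band of hours, scheduled work in that window (backups, VM snapshots, antivirus scans) becomes worth checking.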
3 Replies
JustinS
Esri Contributor

@GEOKID, were you and support able to isolate the performance degradation to any specific services? Were logs set to DEBUG level during the timeframe in which support was unable to locate any meaningful information? It can be helpful to incrementally (and temporarily) disable services to isolate the cause(s). It's especially helpful if a timeline can be established, ideally one that includes environmental changes.
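As a rough illustration of such a timeline, the sketch below pairs each failure with the most recent environmental change that preceded it. The function name, dates, and change labels are hypothetical examples, not data from this environment.

```python
from bisect import bisect_right
from datetime import datetime


def last_change_before(failures, changes):
    """Pair each failure time with the latest change at or before it.

    `changes` is a list of (datetime, label) tuples. Returns a list of
    (failure_time, label_or_None, elapsed_or_None) in chronological order.
    """
    ordered = sorted(changes)
    change_times = [when for when, _ in ordered]
    timeline = []
    for failure in sorted(failures):
        index = bisect_right(change_times, failure)
        if index == 0:
            # No recorded change precedes this failure.
            timeline.append((failure, None, None))
        else:
            when, label = ordered[index - 1]
            timeline.append((failure, label, failure - when))
    return timeline


if __name__ == "__main__":
    # Hypothetical dates and labels for illustration only.
    changes = [
        (datetime(2024, 8, 5, 9, 0), "patch installed"),
        (datetime(2024, 8, 20, 22, 0), "instance count reduced"),
    ]
    failures = [datetime(2024, 8, 12, 0, 30), datetime(2024, 8, 25, 1, 15)]
    for failure, label, elapsed in last_change_before(failures, changes):
        print(failure, label, elapsed)
```

A change that repeatedly shows up shortly before failures is a candidate to roll back first; this only shows proximity in time, not cause.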

GEOKID
by
Emerging Contributor

Thank you for the response, @JustinS. To address the issue, I halted the services experiencing the highest traffic and stopped or republished those that failed. I sent the debug mode logs to our Premium Support and carried out all the recommended solutions, such as installing patches, decreasing the number of instances, and confirming consistent configuration across all nodes in language, region, pagination, and permissions. There's no apparent pattern related to time, traffic, or any specific service that might be causing the failure. The servers, hosted on VMware machines, have been checked for connection, resources, and operating system integrity.

AdamMesser1
Regular Contributor

@GEOKID, we have experienced similar behavior. We also have a 10.9.1 Enterprise install, and run a standalone server on a Linux OS. I'm intrigued by your comment about spontaneous crashes around midnight. We applied 3 patches in early August and the Server service has since failed 3 times over the 6 weeks. All failures have occurred between midnight and early AM. Our staff haven't had a chance to look into the issue in detail so I unfortunately don't have much to offer except solidarity! Patches applied: ArcGIS Server Map and Feature Service Security 2023 Update 1 Patch (esri.com), ArcGIS Server Hosted Services Restart Patch (esri.com), ArcGIS Server 10.9.1 Utility Network and Data Management Patch 6 (esri.com)

 
