10.6.1 HA ArcGIS Server Site Crashing

Justin_Greco · ‎01-30-2019

Since upgrading to 10.6.1, we have seen machines in our ArcGIS Server site crash (it's not always the same machine). After digging into the logs and ArcGIS Monitor, we are seeing spikes in available memory (going from 5GB available to almost the full 32GB). Usually these are just spikes, but on occasion the available memory flatlines until we go in and restart the service on the server. I have correlated these spikes with warnings that the machine is synchronizing with the site. When this happens, the machine typically stops all the services and brings them all back up, which appears to be what synchronizing with the site does. The problem seems to arise when one of the SOC processes hangs; killing that process (or restarting the ArcGIS Server service) brings the machine back online. Unfortunately, it is never the same map service that hangs.

I do have a premium support ticket open for this, but wanted to see if anyone else has experienced this. Premium support's recommendation was to increase the number of cores on both servers, which was done, but we continue to see the crashing.

Since we are coming from 10.3.1, I am not familiar with the new optimized app server architecture that was introduced at 10.6. I suspect the "synchronizing with site" function is part of it. I have not found any documentation, or anything in the logs, explaining when or why the server needs to synchronize with the site.

Some details on the site:

10.6.1 ArcGIS Server running on 7.5 RHEL

32GB RAM and 6 CPU cores

Federated with Portal as the hosting server

Running about 52 map image services (currently no hosted feature services yet)

There are no reports of synchronizing during the night or weekends, so it is definitely happening when there is heavy traffic.

MoginrajMohandas · ‎08-27-2019

Hi Justin and others,

We are looking into this as a possible bug in the software, and it is currently under investigation. To give you an understanding of what is happening, here is a synopsis. When there is any publishing or administrative activity, such as stopping or starting a service, a number of internal operations run on all machines in the site (on an HA Server site). These include clearing the REST and SOAP caches. The SOAP cache clear is what is failing in all or most of these cases. Since this operation fails on all machines in the site, all machines are flagged to be synchronized. This is what causes all services to stop and start (as each machine tries to synchronize).

A few questions:

1. Do you have cached (tiled) map services on the site? We have observed that this is most likely related to those kinds of services.

2. Have you tried moving to a single-machine site (non-HA) for a period of time to see if the behavior persists? Our testing indicates that it does not happen on a single-machine site, because we made an update in 10.6 so that a machine does not synchronize if it is the only machine in the site.

Temporary solutions include (1) going with a single-machine site until a fix is in, or (2) periodically clearing the SOAP handler using an internal API that tech support can help with.

JonathanQuinn · ‎01-30-2019

Are you using multiple clusters, or a single cluster? If you only use one cluster, is the site configured with Single Cluster Mode (see "About single cluster mode" in the ArcGIS Server Administration (Linux) documentation)? Are your directories on an NFS share? Do you mount them on the Server machines, or are they dynamically mounted (/net/share)?

I'm not sure how much load the machines are handling, but is it possible to stop one of the machines for a day and see if the problem is still reproducible? Do you see any errors before the "synchronizing with the site" messages? I'd be surprised if that was the only thing you'd see.

Justin_Greco · ‎01-30-2019

Hi Jonathan,

Thanks for the quick response. We are only using one cluster and the single cluster mode is set to true. I passed your question about the NFS share over to our Linux admin and his response was:

"That volume is an NFS mount provided by our NetApp storage appliance, it is mounted on both systems through an entry in each system’s /etc/fstab file."

Support also recommended taking one of the servers out of the site completely to see whether the issues persist.

When I see the warning about the machine synchronizing with the site, I don't see anything indicating why it needed to synchronize. Since going to 10.6.1, we have consistently been seeing "Response already committed. Cannot forward to error page." and "This exception was through after the response was committed. Access to this resource is not allowed". Support has said these aren't something to be alarmed about, and there is an enhancement request to stop logging them as severe messages.

One warning I did just notice is the following:

"Setting the synchronization flag on the machine '[servername]'. Failed to clear Soap handler cache. Could not connect to the ArcGIS component at URL 'http://[servername]:6080/arcgis/services/esriAdmin/cache/clear'. The ArcGIS component on that machine may not be running or the machine may not be reachable at this time.Error:"

JonathanQuinn · ‎01-31-2019

NFS can cause some issues if caching is not disabled and oplocks are enabled on the share. For example, when I test with NFS shares, I can't use the dynamically mounted share (/net/machine) or mount the directories using the default caching settings (mount machine:<share> <mounted path>). Since the default for attribute and directory caching is 30 seconds, that causes consistency and synchronization issues with ArcGIS Server in multi-machine sites. To work around this, I mount the directories with the noac or actimeo=0 options:

https://linux.die.net/man/5/nfs

I don't think this should be specific to 10.6.1 as it should be a problem at any version of Server, but it's something to consider.

Justin_Greco · ‎02-01-2019

Thanks for the recommendation, I have passed this over to our Linux admin and he sent me info about how we have it mounted:

xx.xxx.xxx.x:/GIS_arcgisentshared_prd /u01/arcgisserver nfs rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768,actimeo=600 0 0

He has actimeo set to 600, which is our standard for all NFS mounts. He said there's no reason we can't change that. Would your recommendation be to set the actimeo to 0? Are there any other changes you would recommend? I will send this over to premium support as well.

Justin_Greco · ‎02-01-2019

We did make the change, first to noac and then to actimeo=0. The arcsoc processes all started up, but we could not access anything through REST calls. We have rolled back to actimeo=600. Do you see anything else we would need to change in our settings?

JonathanQuinn · ‎02-01-2019

Hm, I don't see why disabling attribute caching would affect REST calls. Would the request time out, or return an error?

The thing that concerns me about those settings is that while the timeo value is in deciseconds (so 600 equates to 60 seconds), the actimeo value is in seconds:

acregmin=n

The minimum time (in seconds) that the NFS client caches attributes of a regular file before it requests fresh attribute information from a server. If this option is not specified, the NFS client uses a 3-second minimum.

acregmax=n

The maximum time (in seconds) that the NFS client caches attributes of a regular file before it requests fresh attribute information from a server. If this option is not specified, the NFS client uses a 60-second maximum.

acdirmin=n

The minimum time (in seconds) that the NFS client caches attributes of a directory before it requests fresh attribute information from a server. If this option is not specified, the NFS client uses a 30-second minimum.

acdirmax=n

The maximum time (in seconds) that the NFS client caches attributes of a directory before it requests fresh attribute information from a server. If this option is not specified, the NFS client uses a 60-second maximum.

actimeo=n

Using actimeo sets all of acregmin, acregmax, acdirmin, and acdirmax to the same value. If this option is not specified, the NFS client uses the defaults for each of these options listed above.

https://linux.die.net/man/5/nfs

This means that the Server machines will cache the directory and attribute information for files and folders for 600 seconds. If the Server retrieves information from the config-store and the cache returns information that is up to 600 seconds out of date, then you should be seeing considerable problems. If these machines are behind a load balancer, take one of the machines out of the rotation and make the change on that machine specifically. That way, you don't need to worry about requests hitting a machine that may not respond, while still being able to troubleshoot as if it were up and active.
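To make the effect of a given option string concrete, here is a minimal Python sketch that works out the effective attribute-cache timeouts from the defaults in the man page excerpt above. The function name is made up for illustration, and treating noac as "all timeouts zero" is a simplification; it is a reading aid, not a replacement for checking the live mount options on the machine.

```python
# Defaults quoted from the nfs(5) excerpt above, in seconds.
DEFAULTS = {"acregmin": 3, "acregmax": 60, "acdirmin": 30, "acdirmax": 60}


def attribute_cache_seconds(mount_options):
    """Return the effective attribute-cache timeouts (in seconds) for a
    comma-separated NFS mount option string, applying options left to right.

    Hypothetical helper for illustration only.
    """
    effective = dict(DEFAULTS)
    for option in mount_options.split(","):
        name, _, value = option.strip().partition("=")
        if name == "noac":
            # Simplification: treat noac as "no attribute caching at all".
            effective = dict.fromkeys(effective, 0)
        elif name == "actimeo":
            # actimeo sets all four timeouts to the same value.
            effective = dict.fromkeys(effective, int(value))
        elif name in effective:
            effective[name] = int(value)
        # Anything else (rw, hard, timeo=..., etc.) does not affect caching.
    return effective
```

Calling it with the option string from the fstab entry earlier in the thread gives 600 seconds for all four timeouts, while "defaults" leaves the man page values untouched.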

Justin_Greco · ‎02-01-2019

Turns out one of the mounts was set with just “defaults” while the other had actimeo set at 600. Our 10.3.1 site was mounted with defaults, so we changed all the mounts to that for now. We will see how it goes next week. Do you suggest taking both servers down to make the actimeo=0 change? I am wondering if the issue is that we tried one at a time.

Justin_Greco · ‎04-04-2019

We continue to have this issue, and we have concluded that it is related to the SOAP handler cache clear failing, as indicated in the warning message logged when the synchronization flag is set. The majority (if not all) of SOAP requests come from ArcMap users connecting, who are mostly internal staff. This would explain why the issue only occurs during business hours. As a test, we blocked SOAP requests through IIS for a day and did not see the synchronization occur. I have also tested posting to /arcgis/services/esriAdmin/cache/clear in Postman. This either returns {'success': true} or it fails. When it fails, I can get it to succeed again by reloading the /arcgis/services handler in the admin API.
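For anyone who wants to repeat the Postman test from a script, a rough Python sketch follows. The URL comes from the warning message; everything else is an assumption: the function names are hypothetical, the request is sent with an empty body and no token, and the response is assumed to be JSON with a "success" field. This is an internal endpoint, so check with tech support before relying on it.

```python
import json
import urllib.request


def build_clear_url(machine, port=6080):
    """Build the URL seen in the warning message; machine/port are placeholders."""
    return f"http://{machine}:{port}/arcgis/services/esriAdmin/cache/clear"


def is_success(body):
    """True only if the body parses as JSON with 'success' set to true.

    The response shape is an assumption based on the result reported above.
    """
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False
    return isinstance(payload, dict) and payload.get("success") is True


def clear_soap_cache(machine, post=None, timeout=30):
    """POST to the clear endpoint and report whether it appeared to succeed.

    `post` is an optional callable taking a URL and returning the response
    text, which makes it possible to exercise the logic without a server.
    """
    url = build_clear_url(machine)
    if post is None:
        def post(target):
            # Assumption: an empty POST body with no authentication.
            request = urllib.request.Request(target, data=b"", method="POST")
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
    try:
        return is_success(post(url))
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False
```

Running clear_soap_cache("ServerA") on a schedule and logging the False results would show whether the failures line up with the synchronization warnings.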

Do you have any thoughts why the SOAP handler would be causing these issues?

Anonymous User · ‎04-10-2019

Hi Justin, I have the exact same issue and have been trying to resolve it with ESRI for the past 3 months. It began around early January. After starting or stopping any service within Server Manager, I get the same log error as you do. This causes a cascading restart of all my ArcGIS Servers and disrupts the entire enterprise.

Setting the synchronization flag on the machine 'ServerA'. Failed to clear Soap handler cache. Could not connect to the ArcGIS component at URL 'http://ServerA:6080/arcgis/services/esriAdmin/cache/clear'. The ArcGIS component on that machine may not be running or the machine may not be reachable at this time.Error:
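Since the same warning shows up on more than one site, a small Python sketch for tallying it may help when comparing notes. It matches only the message text quoted in this thread and assumes nothing about the rest of the log layout; the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern built from the warning text quoted in this thread. The surrounding
# log layout (timestamps, wrappers, etc.) is deliberately not assumed.
SYNC_FLAG = re.compile(
    r"Setting the synchronization flag on the machine '([^']+)'\."
    r"\s*(Failed to clear Soap handler cache)?"
)


def count_sync_flags(lines):
    """Return (warnings per machine, number that mention the SOAP cache).

    `lines` is any iterable of log message strings.
    """
    per_machine = Counter()
    soap_related = 0
    for line in lines:
        match = SYNC_FLAG.search(line)
        if match is None:
            continue
        per_machine[match.group(1)] += 1
        if match.group(2):
            soap_related += 1
    return per_machine, soap_related
```

Feeding it the lines of an exported log (for example, by iterating over an open file) gives a quick count of how often each machine was flagged and how many of those flags came from the SOAP cache clear.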

My site:

Three 10.6.1 ArcGIS Servers running on Windows Server 2008 R2 on VMs

32GB RAM and 4 CPU cores on each

Federated with Portal as the hosting server

Separate Portal Server - Separate PostgreSQL (Managed) Server - Separate Directories/Config Store/Cache File Server

Running about 147 server services

Outside firewall IIS 7

We have tried a variety of troubleshooting workflows:

The "oplocks" on the shares

Symantec Anti-virus scanning exceptions https://support.esri.com/en/technical-article/000012517

The SOC account(arcgis local account) added to Administrators group

Re-applied all user and file permissions and re-entered the credentials for the arcgis user

Verified single cluster mode and reduced cluster to "one" machine

Provided all ArcGIS Server, OS and Tomcat logs/updates to ESRI Development team - no answer

Repaired all 10.6.1 installs

Set system server properties AppServer to "Optimized"

Verified all OS and ArcGIS service packs

Verified SOAP on all machines https://<<machine name>>:6443/arcgis/services?wsdl

Resetting the handlers and caches within Admin tools

https://<ServerA>:6443/arcgis/admin/local/manageHandler

Any progress or work-around would be appreciated.