10.6.1 HA ArcGIS Server Site Crashing

01-30-2019 08:30 AM
Justin_Greco
Occasional Contributor III

Since upgrading to 10.6.1, we have seen machines in our ArcGIS Server site crash (it's not always the same machine). After some deep diving into the logs and ArcGIS Monitor, we are seeing spikes in available memory (going from 5GB available to almost the full 32GB). Usually these are just spikes, but on occasion the available memory flatlines until we go in and restart the service on the server. I have correlated these spikes with warnings that the machine is synchronizing with the site. When this happens, the machine typically stops all the services and brings them all back up, which appears to be what synchronizing with the site does. The problem seems to occur when one of the SOC processes hangs; killing that process (or restarting the ArcGIS Server service) brings the machine back online. Unfortunately, it is never the same map service that hangs.
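For anyone wanting to flag this pattern automatically, here is a minimal Python sketch that separates a short spike in available memory from a sustained flatline. Everything in it is an assumption for illustration: the function name, the 5GB baseline, the 20GB jump threshold, and the five-sample flatline length are hypothetical and not ArcGIS Monitor settings. The samples are assumed to be available-memory readings in GB exported from whatever monitoring tool you use.

```python
from typing import Sequence


def classify_memory_event(
    available_gb: Sequence[float],
    baseline_gb: float = 5.0,     # assumed normal available memory
    jump_gb: float = 20.0,        # assumed rise that counts as "services dropped"
    flatline_samples: int = 5,    # assumed run length that counts as a flatline
) -> str:
    """Return 'normal', 'spike', or 'flatline' for a series of samples.

    A sample is 'elevated' when available memory exceeds the baseline by at
    least jump_gb. A long unbroken run of elevated samples suggests the
    services never came back (flatline); a short run suggests a restart that
    recovered on its own (spike).
    """
    longest_run = 0
    current_run = 0
    for value in available_gb:
        if value - baseline_gb >= jump_gb:
            current_run += 1
            longest_run = max(longest_run, current_run)
        else:
            current_run = 0

    if longest_run == 0:
        return "normal"
    if longest_run >= flatline_samples:
        return "flatline"
    return "spike"
```

A "flatline" result would be the case where someone needs to look for a hung SOC process or restart the ArcGIS Server service; a "spike" would just be logged.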

I do have a premium support ticket open for this, but wanted to see if anyone else has experienced this.  Premium support's recommendation was to increase the number of cores on both servers, which was done, but we continue to see the crashing.

Since we are coming from 10.3.1, I am not familiar with the new optimized app server architecture that was introduced at 10.6. I suspect the synchronize-with-site function is part of it. I have not found any documentation, or anything in the logs, explaining when or why a server needs to synchronize with the site.

Some details on the site:

ArcGIS Server 10.6.1 running on RHEL 7.5

32GB RAM and 6 CPU cores

Federated with Portal as the hosting server

Running about 52 map image services (currently no hosted feature services yet)

There are no reports of synchronizing during the night or weekends, so it is definitely happening when there is heavy traffic.

54 Replies
MarkChilcott
Occasional Contributor III

Hi Moginraj,

We have a number of cached map services. You can see our site here:

LISTmap

We noted a lot of failed SOAP requests to our services in the logs, as we are in the process of deprecating ImageServers and moving to MapServers.

We will look at the cron job to clear the cache. 

Any additional information on what the SOAP cache is and what it does?

With the load on the server, it may not be possible to move to single machine site.  However we are considering Single-machine high-availability (active-active) deployment.

Any thoughts on this configuration?

Another consideration is 10.7.1. I did hear mention that this release may change the way the servers communicate with each other, which may address this. Thoughts?

Cheers,

Mark

MoginrajMohandas
New Contributor III

Mark Chilcott

Any additional information on what the SOAP cache is and what it does? - The SOAP and REST caches hold response information in memory so you get better performance for frequently used operations on services.

With the load on the server, it may not be possible to move to single machine site.  However we are considering Single-machine high-availability (active-active) deployment.

Any thoughts on this configuration? - If you have a set number of services, and that number does not change often, this configuration may be an option. Is that the case?

Another consideration is 10.7.1.  I did hear mention there may be changes in this release in the manner the servers communicate with each other that may address this.  Thoughts?

- We have made some improvements in 10.7/10.7.1 for sites with multiple machines, which add reliability and stability to the site. Please check the "What's new in ArcGIS Server 10.7.1" documentation for ArcGIS Enterprise, under the sections titled "Management" and "Monitoring". As for the issue you are seeing, it is a bug which we are actively investigating.

MarkChilcott
Occasional Contributor III

Hi Moginraj,

Just as an aside, we are also seeing that when a machine in the site starts the synchronization process and attempts to redeploy all the services while under load, it fails to start. The only way to start the machine successfully is to block traffic to it at the proxy/load balancer, let the machine start, and re-open traffic once it is up. It has required a restart of ArcGIS Server on a number of occasions.

We have seen this sort of behaviour in the past, but not to this degree.

PS:  we have not implemented the cron job yet to clear caches.  Looking into that now.

Cheers,

Mark

MoginrajMohandas
New Contributor III

Synchronization happening under heavy load can potentially cause this behavior. It is more drastic in your case because synchronization is happening often, and it is a very destructive operation. Ideally, a synchronization is only meant for a machine that was temporarily down/unavailable and then comes back.

10.7.1 does not have a fix for the issue; we are working on the fix. However, 10.7/10.7.1 has a feature called "Under Maintenance" that can be set on any machine, which makes that machine return a failed healthcheck. If your proxy/LB uses ArcGIS Server's healthcheck URL to check its health (this is per machine, and the URL is https://machineName.domain.com:6443/arcgis/rest/info/healthcheck ), then you can use this to stop your proxy from sending traffic to a specific machine, and then revert the "Under Maintenance" flag when you think the machine is ready.
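As a rough illustration of how a proxy-side script might poll that per-machine healthcheck URL, here is a minimal Python sketch using only the standard library. The URL path comes from the post above; the `f=json` parameter, the `{"success": true}` response shape, the helper names, and the injectable `opener` argument are assumptions for illustration, so verify the actual response on your own site before relying on it.

```python
import json
import urllib.error
import urllib.request


def healthcheck_url(host: str, port: int = 6443) -> str:
    """Build the per-machine healthcheck URL (f=json is an assumption)."""
    return f"https://{host}:{port}/arcgis/rest/info/healthcheck?f=json"


def is_healthy(url: str, timeout: float = 5.0,
               opener=urllib.request.urlopen) -> bool:
    """Return True only if the machine answers 200 with {"success": true}.

    Any network error, non-200 status, or unparseable body is treated as
    unhealthy, so the caller can keep traffic away from the machine.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            body = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # URLError/HTTPError: request failed; ValueError: body was not JSON.
        return False
    # Assumed response shape: {"success": true} when the machine is healthy.
    return isinstance(body, dict) and body.get("success") is True


if __name__ == "__main__":
    # Hypothetical machine name; replace with a real machine in your site.
    url = healthcheck_url("machineName.domain.com")
    print(url, "healthy" if is_healthy(url) else "unhealthy")
```

A machine flagged "Under Maintenance" should then show up as unhealthy here, which is the signal the proxy/LB would use to hold traffic until the flag is reverted.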

MarkChilcott
Occasional Contributor III

Hi Moginraj,

We have fire season in 8 weeks time. 

Last year we had fires that burned for 3 months and burnt through 210,000 hectares, putting communities and lives at risk between December 2018 and March this year. A total of 6 per cent of our world heritage area was burnt, covering 95,430 hectares, including 2,300ha of different threatened native vegetation communities. Approximately 14 per cent of Tasmania's very tall forests were burned. We spent $40 million on fire fighting aircraft. We also lost houses, infrastructure and businesses. These were the second largest fires in our history, and this year is not looking any better.

Underpinning all the decision making during the campaign - our mapping.

We can't afford to have this unresolved.  We are looking to implement the Single-machine high-availability (active-active) deployment in the near future.  We will report back how it works out.

Single-machine high-availability (active-active) deployment

Cheers,

Mark

MoginrajMohandas
New Contributor III

Mark and others,  

This behavior is being tracked as BUG-000124827 and a fix is currently being built. Once a fix is available, we plan on releasing patches with the fix for both ArcGIS Server 10.6.1 and 10.7.1. I can't give a specific date for when those patches will be ready, but I hope sometime within the next 5-6 weeks if not sooner.

MarkChilcott
Occasional Contributor III

Hi Moginraj,

Thanks - appreciate all the effort that is going into resolving this.

Cheers,

Mark

MarkChilcott
Occasional Contributor III

Hi Moginraj,

I noted today that the status of BUG-000124827 has changed to implemented in 10.8.

Any idea of the time frame for the 10.6.1 patch?

Cheers,

Mark

MarkChilcott
Occasional Contributor III

Hi Peoples,

Esri have released the ArcGIS Server Unintended Service Restart Patch to resolve this issue.

Anyone tried it yet?

Cheers,

Mark

Justin_Greco
Occasional Contributor III

Funny you ask, I was in the middle of applying it in production when I saw this come up in my inbox. I applied it to test last week and put it through some load tests and did not see any issues. Will let you know how it goes once we get actual production traffic.
