10.6.1 HA ArcGIS Server Site Crashing

Justin_Greco · ‎01-30-2019

Since upgrading to 10.6.1, we have seen instances of machines in our ArcGIS Server site crashing (it's not always the same machine). After doing some deep diving into the logs and ArcGIS Monitor, we are seeing spikes in the available memory (going from 5GB available to almost the full 32GB). Usually these are just spikes, but on occasion we see the available memory flatline until we go in and restart the service on the server. I have correlated these spikes with warnings of the machine synchronizing with the site. When this happens, the machine typically stops all of its services and brings them all back up, which appears to be what synchronizing with the site does. The issue seems to occur when one of the SOC processes hangs; killing this process (or restarting the ArcGIS Server service) brings the machine back online. Unfortunately it is never the same map service that is hanging.

I do have a premium support ticket open for this, but wanted to see if anyone else has experienced this. Premium support's recommendation was to increase the number of cores on both servers, which was done, but we continue to see the crashing.

Since we are coming from 10.3.1, I am not familiar with the new optimized app server architecture that was introduced at 10.6. I am thinking the synchronizing with site function was part of this. I have not found any documentation, or anything in the logs, as to when or why the server needs to synchronize with the site.

Some details on the site:

ArcGIS Server 10.6.1 running on RHEL 7.5

32GB RAM and 6 CPU cores

Federated with Portal as the hosting server

Running about 52 map image services (currently no hosted feature services yet)

There are no reports of synchronizing during the night or weekends, so it is definitely happening when there is heavy traffic.

Anonymous User · ‎08-19-2019

Hi Stuart, nope, no resolution. I've tried everything but fix the kitchen sink. ESRI development never gave me an answer - other than - rebuild my entire system.

MarkChilcott · ‎08-15-2019

Hi Peoples,

We are running two machines in the one site, on CentOS Linux. Just witnessed server crash with error message:

Setting the synchronization flag on the machine 'SERVER Name'. Failed to clear Soap handler cache. Could not connect to the ArcGIS component at URL 'http://cradle20o.thelist.tas.gov.au:6080/arcgis/services/esriAdmin/cache/clear'. The ArcGIS component on that machine may not be running or the machine may not be reachable at this time.Error:

Anyone know if this is resolved?

Cheers,

Mark

Anonymous User · ‎08-19-2019

Hi Mark, I know this is happening to at least 4 or 5 organizations. No resolution to its cause. I would try to work it out through tech support/development . . . this has to ring some alarm bells. Indiscriminate "synchronization" OR restarting all your 10.6.1 server services is bad, very bad. Let us know if you contact them and get anywhere.

Justin_Greco · ‎08-19-2019

I would recommend looking at your network load balancer (NLB) configuration. We have the same issue, but believe Esri has been able to identify that we need to have an internal NLB to handle the communication between the servers in the site and also with the Portal servers.

If you look at some of the diagrams on this page, you will see an NLB behind the firewall labeled "lb2" and arrows to and from s1, s2, p1, and p2.

https://enterprise.arcgis.com/en/portal/latest/administer/windows/ha-scenarios-web-gis.htm

The direct link to the image of the diagram I am referring to is:

https://enterprise.arcgis.com/en/portal/latest/administer/windows/GUID-EF93D4A2-1A2F-45E4-818D-E2870...

Again, we have not resolved the issue since we have not implemented this additional NLB, but are hoping to put this in place in the next few weeks. Will let you know if we have any luck.

MarkChilcott · ‎08-19-2019

Hi Justin / Dan,

Not good.

We saw this on Friday after we upgraded from 10.3.1 to 10.6.1 and moved our Test environment into Production. When we upgraded, we built the sites from scratch. New Linux virtual machines, new OS, new install of ArcGIS Server, new SSL certificates, new Gluster bricks, re-published everything, tested all the apps. Did not see this one coming.

On Friday it was interesting. Today when I published, while the site did not crash, it did trigger the synchronization, which in turn triggers the redeploy of all the services and blocks admin operations - which effectively takes the REST points out for 10 minutes, and leaves one wondering if the site will come back up. Since Friday I have moved from being mildly interested and somewhat concerned - to fairly worried.

It looks like the synchronize was introduced with 10.6, and can be initiated manually:

Synchronize With Site

As Justin points out: the new optimized app server architecture that was introduced at 10.6.

It would appear to me the optimized architecture can be overridden at 10.6.1 in order to use legacy architecture. Anyone tried this?

Server Properties - AppServer

Another silly question - anyone seriously considering 10.7.1 at this point?

We have logged this with support - but if you guys can't get this sorted with premium support, we probably have Buckleys chance of resolving it.

Cheers,

Mark

Justin_Greco · ‎08-19-2019

Did you have Portal at 10.3.1 or did you just add Portal at 10.6.1? That was the case for us. Also we never saw this issue coming until we went live with it, since it seems to occur only when the site is exposed to production load.

You said it only happens to one of the multi-machine sites. Is the site where you are not seeing this behavior federated with Portal? If they are both federated with Portal, is the site where you are not seeing this receive less traffic than the site you are having trouble with?

Also are you using a load balancer?

Also tech support had us try switching to legacy mode early on and that did not make a difference.

Justin_Greco · ‎08-26-2019

This was just a best practice recommended to us, NOT related to the synchronization.

MarkChilcott · ‎08-19-2019

Hi Justin,

We don't have Portal. Pure ArcGIS Server Advanced. We don't have Web Adapter either. Third party load balancer / Proxy servers / reverse proxy.

It is now occurring on both machines in the site. We are in the process of setting up Test (yeah - long story there). We might have seen the synchronization error on a single machine site on Test - which I would not have thought possible. Looking into it.

Legacy mode not working - Bugger.

10.7.1 anyone?

Cheers,

Mark

Justin_Greco · ‎08-26-2019

One thing to look at is if you are able to post to:

http://cradle20o.thelist.tas.gov.au:6080/arcgis/services/esriAdmin/cache/clear

I used Postman for this, just need to make sure you get a token. The normal behavior is a {"success": true} response. However, when I am seeing the synchronization warnings, I get a 400 error. If you go to .../arcgis/admin/local/manageHandler and reload the /arcgis/services (SOAP) handler, you should get {"success": true} again and not see any messages about the server not being reachable (I validate the federation to verify that warning message does not appear).

Not a fix, just a symptom I have noticed.
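The check described above could be scripted instead of run by hand in Postman. The following is a minimal sketch, not a supported tool: the host name in `CLEAR_URL` is a placeholder, the function names `classify` and `probe` are made up for illustration, and passing the token as a form field named `token` (with `f=json`) is an assumption, since the post does not say how the token was supplied. Obtain the token however you normally do for your site.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Placeholder host: substitute the machine you want to check.
CLEAR_URL = "http://server.example.com:6080/arcgis/services/esriAdmin/cache/clear"


def classify(status, body):
    """Map an HTTP status and response body to a short diagnosis."""
    if status == 200:
        try:
            payload = json.loads(body)
        except ValueError:
            return "unexpected"
        # The post reports {"success": true} as the normal response.
        if isinstance(payload, dict) and payload.get("success") is True:
            return "ok"
        return "unexpected"
    if status == 400:
        # The post reports a 400 while synchronization warnings are appearing.
        return "soap-handler-failing"
    return "unexpected"


def probe(url, token):
    """POST to the cache-clear URL and return classify()'s verdict."""
    # Assumption: the token is accepted as a form field named "token".
    data = urllib.parse.urlencode({"token": token, "f": "json"}).encode()
    request = urllib.request.Request(url, data=data, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read().decode("utf-8", "replace")
            return classify(response.status, body)
    except urllib.error.HTTPError as err:
        return classify(err.code, err.read().decode("utf-8", "replace"))
    except urllib.error.URLError:
        return "unreachable"
```

Calling `probe(CLEAR_URL, token)` against each machine in the site on a schedule would show which machine starts returning 400 and when, which may help line the symptom up with the synchronization warnings in the logs.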

MoginrajMohandas · ‎08-27-2019

Hi Justin and others,

We are looking into this as a possible bug in the software, and it is currently under investigation. To give you an understanding of what is happening, here is a synopsis. When there is any publishing activity or administrative activity such as stopping or starting a service, a number of internal operations happen on all machines in the site (on an HA Server site). These include clearing the REST and SOAP caches. The SOAP cache clear is what is failing in all or most of these cases. Since this operation fails on all machines in the site, all machines are flagged to be synchronized. This is what causes the stop and start of all services (when each of the machines tries to synchronize).

A few questions:

1. Do you have cached map services/tiled map services on the site? We have observed that this is most likely related to those kinds of services.

2. Have you tried moving to a single machine site (non-HA) for a period of time to see if the behavior persists? Our testing indicates that it does not happen on a single machine site, because we made an update in 10.6 so that a machine does not synchronize if it is the only machine in the site.

Temporary solutions include (1) going with a single machine site until a fix is in, or (2) periodically clearing the SOAP handler using an internal API that tech support can help with.
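To make the chain of events in the synopsis easier to follow, here is a toy model of it. This is only an illustration of the behavior as described in this post, not Esri's implementation: the `Machine` and `Site` classes and all of their fields are hypothetical names invented for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    soap_clear_ok: bool = True   # does the SOAP cache clear succeed here?
    needs_sync: bool = False     # synchronization flag
    restarts: int = 0            # times all services were stopped and started


@dataclass
class Site:
    machines: list = field(default_factory=list)

    def admin_operation(self):
        """Publishing or stop/start: caches are cleared on every machine."""
        for m in self.machines:
            # A failed SOAP clear flags the machine for synchronization,
            # except when it is the only machine in the site.
            if not m.soap_clear_ok and len(self.machines) > 1:
                m.needs_sync = True

    def synchronize(self):
        """Each flagged machine stops and restarts all of its services."""
        for m in self.machines:
            if m.needs_sync:
                m.restarts += 1
                m.needs_sync = False


ha_site = Site([Machine("s1", soap_clear_ok=False),
                Machine("s2", soap_clear_ok=False)])
ha_site.admin_operation()
ha_site.synchronize()

single_site = Site([Machine("s1", soap_clear_ok=False)])
single_site.admin_operation()
single_site.synchronize()
```

In this model, one publish on the two-machine site leaves both machines with a restart of all services, while the same failure on the single machine site causes none, which matches the first temporary solution listed above.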