10.6.1 HA ArcGIS Server Site Crashing

01-30-2019 08:30 AM
Justin_Greco
Frequent Contributor

Since upgrading to 10.6.1, we have seen machines in our ArcGIS Server site crash (it's not always the same machine).  After digging into the logs and ArcGIS Monitor, we are seeing spikes in available memory (going from 5GB available to almost the full 32GB).  Usually these are just spikes, but on occasion the available memory flatlines until we go in and restart the service on the server.  I have correlated these spikes with warnings that the machine is synchronizing with the site.  When this happens, the machine typically stops all of its services and brings them back up, which appears to be what synchronizing with the site does.  The problem seems to occur when one of the SOC processes hangs; killing that process (or restarting the ArcGIS Server service) brings the machine back online.  Unfortunately, it is never the same map service that hangs.
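For anyone who wants to check for the same pattern on their own machines, here is a minimal sketch of how the available-memory spikes could be spotted outside of ArcGIS Monitor. It assumes a Linux host where `/proc/meminfo` reports `MemAvailable` (as RHEL 7 does). The function names and the 15 GB jump threshold are hypothetical choices for illustration, not anything from Esri tooling.

```python
import re


def mem_available_kb(meminfo_text):
    """Return the MemAvailable value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in input")
    return int(match.group(1))


def find_spikes(samples_gb, jump_gb=15.0):
    """Return the indexes where available memory rose by at least jump_gb
    compared with the previous sample (assumed threshold; tune per site)."""
    return [
        i
        for i in range(1, len(samples_gb))
        if samples_gb[i] - samples_gb[i - 1] >= jump_gb
    ]
```

On the server, the text passed to `mem_available_kb` would come from reading `/proc/meminfo` at a fixed interval. The timestamps of the samples flagged by `find_spikes` can then be compared by hand with the synchronization warnings in the ArcGIS Server logs.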

I do have a premium support ticket open for this, but wanted to see if anyone else has experienced this.  Premium support's recommendation was to increase the number of cores on both servers, which was done, but we continue to see the crashing.

Since we are coming from 10.3.1, I am not familiar with the new optimized app server architecture that was introduced at 10.6.  I suspect the synchronize-with-site function is part of it.  I have not found any documentation, or anything in the logs, explaining when or why a server needs to synchronize with the site.

Some details on the site:

ArcGIS Server 10.6.1 running on RHEL 7.5

32GB RAM and 6 CPU cores

Federated with Portal as the hosting server

Running about 52 map image services (currently no hosted feature services yet)

There are no reports of synchronizing during the night or weekends, so it is definitely happening when there is heavy traffic.

54 Replies
Justin_Greco
Frequent Contributor

Mark,

So far things are looking great since applying the patch.  We have gone through two full days without seeing the synchronization.  Will let you know if it happens at all next week.

by Anonymous User
Not applicable

Hi Justin, is the patch still working? I wanted to try it today and have vSphere ready.

MarkChilcott
Frequent Contributor

Hi Justin,

Thanks - helpful to know.

Cheers,

Mark

MoginrajMohandas
Occasional Contributor

Hello Mark and others, I hope you were able to try out the 10.6.1 and 10.7.1 patches. It would be good to verify that this resolved the synchronization and service restart issues you all were observing. Thanks!!

by Anonymous User
Not applicable

Hi Moginraj, after applying the 10.6.1 patch on three servers, it appears to have worked on two of them.  If I remove the third machine from the cluster and run on only two, the site does not attempt to synchronize and the issue goes away.  I have tried numerous things to fix that third machine (remove/add to cluster, repair the installation, and managehandlers), but no luck.

MoginrajMohandas
Occasional Contributor

Hi Dan, are you still seeing this?

MarkChilcott
Frequent Contributor

Hi Moginraj,

Rolled out a new set of servers with the patches this morning.  Two machines in the one site.

So far, so good.  Publishing has not caused the synchronization and restart of services. 

Our load averages and other stats for the servers are looking good.  Log files are not showing anything out of the ordinary thus far.

Cheers,

Mark

MarkChilcott
Frequent Contributor

Hi Peoples,

The upgrade is still looking good from what we are seeing.  Patch appears to have fixed the issue.

Cheers,

Mark

by Anonymous User
Not applicable

We recently applied the patch and are now about a week in without any synchronization issues. Our primary trigger for the issue was publishing services. The workaround was to bring down one of the servers in the site to publish and then bring it back up once publishing completed. We no longer have to do that.

Justin_Greco
Frequent Contributor

Mark,

I have been told that upgrading to 10.7.1 did not resolve this issue for another customer, so I would not recommend upgrading just to fix this.  We have been on a single-server site for the past two months and have been able to handle our typical production load with no problems; it has definitely been much more stable.  Since they have identified the bug, I am going to sit tight and wait for the bug fix.
