10.6.1 HA ArcGIS Server Site Crashing

01-30-2019 08:30 AM
Justin_Greco
Occasional Contributor III

Since upgrading to 10.6.1, we have seen instances of machines in our ArcGIS Server site crashing (it's not always the same machine). After doing some deep diving into the logs and ArcGIS Monitor, we are seeing spikes in the available memory (going from 5GB available to almost the full 32GB). Usually these are just spikes, but on occasion we see the available memory flatline until we go in and restart the service on the server. I have correlated these spikes with warnings of the machine synchronizing with the site. When this happens, the machine typically stops all the services and brings them all back up, which appears to be what synchronizing with the site does. The problem seems to arise when one of the SOC processes hangs; killing this process (or restarting the ArcGIS Server service) brings the machine back online. Unfortunately, it is never the same map service that is hanging.

I do have a premium support ticket open for this, but wanted to see if anyone else has experienced this.  Premium support's recommendation was to increase the number of cores on both servers, which was done, but we continue to see the crashing.

Since we are coming from 10.3.1, I am not familiar with the new optimized app server architecture that was introduced at 10.6. I am thinking the synchronizing with site function was part of this. I have not found any documentation or anything in the logs as to when or why the server needs to synchronize with the site.

Some details on the site:

ArcGIS Server 10.6.1 running on RHEL 7.5

32GB RAM and 6 CPU cores

Federated with Portal as the hosting server

Running about 52 map image services (currently no hosted feature services yet)

There are no reports of synchronizing during the night or weekends, so it is definitely happening when there is heavy traffic.
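
One way to check the business-hours pattern is to bucket the synchronization warnings by weekday and hour. The following is only a minimal sketch: it assumes the log entries have already been exported as (ISO timestamp, message) pairs, and the `sync_histogram` name, the keyword match, and the sample messages are all made up for illustration rather than taken from real ArcGIS Server logs.

```python
from collections import Counter
from datetime import datetime


def sync_histogram(entries, keyword="synchroniz"):
    """Count entries whose message contains `keyword`, keyed by (weekday, hour)."""
    counts = Counter()
    for timestamp, message in entries:
        if keyword in message.lower():
            when = datetime.fromisoformat(timestamp)
            counts[(when.strftime("%a"), when.hour)] += 1
    return counts


# Made-up sample entries; real ones would come from the server logs.
sample = [
    ("2019-01-28T09:15:00", "Machine is synchronizing with the site"),
    ("2019-01-28T09:45:00", "Machine is synchronizing with the site"),
    ("2019-01-28T23:10:00", "Service started"),
    ("2019-01-29T14:30:00", "Machine is synchronizing with the site"),
]

hist = sync_histogram(sample)
for (day, hour), n in sorted(hist.items()):
    print(f"{day} {hour:02d}:00  {n}")
```

If every bucket falls inside working hours on weekdays, that would support the idea that the synchronization is tied to user traffic rather than to a fixed schedule.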

54 Replies
Justin_Greco
Occasional Contributor III

Hi Dan,

I think I can put checks next to everything you listed. The only difference is that we are running Red Hat Linux, which in my opinion rules out an OS issue. A few questions I have for you:

1) Is there a pattern for when the synchronization occurs? We have seen a pattern of it occurring in 15-minute intervals (30 minutes per server).

2) Does the synchronization seem to only occur during business hours? We only see it occur between 8am and 5pm, never at night, never on weekends, and never on holidays. That fits, since the only thing I know of in our environment that uses SOAP rather than REST is ArcMap, which is used by internal staff.

I am waiting for some info from premium support this week. Since I put in the workaround I mentioned above (reloading the SOAP cache handler every few minutes), we went from having 20 synchronizations a day to no more than 1. This is by no means a solution; it's really just a band-aid.
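
For anyone wanting to try the same band-aid, a scheduled reload could be sketched roughly as follows. This is an illustration under assumptions, not the script actually in use: `run_periodically` and `reload_soap_handler` are hypothetical names, and the admin request itself is deliberately left as a placeholder because its exact path and parameters are not given here.

```python
import logging
import time

log = logging.getLogger("handler_reload")


def run_periodically(action, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call `action` repeatedly, pausing between calls; return the failure count.

    With max_runs=None the loop keeps going until the process is stopped.
    """
    failures = 0
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            action()
            log.info("reload attempt %d succeeded", runs + 1)
        except Exception as exc:  # keep looping even if one attempt fails
            failures += 1
            log.warning("reload attempt %d failed: %s", runs + 1, exc)
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return failures


def reload_soap_handler():
    """Hypothetical placeholder: replace with the site-specific admin request."""
    raise NotImplementedError("fill in the admin request used to reload the handler")


# Example (runs until stopped): run_periodically(reload_soap_handler, 5 * 60)
```

Logging every attempt makes it easier to tell whether a later synchronization followed a failed reload or happened despite a successful one. A cron job calling a one-shot script would work just as well as a long-running loop.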

by Anonymous User
Not applicable

Hi - thanks for responding so quickly - this is one tough issue.

1 - No pattern. When I start or stop a service via Server Manager, the logs show that the machine will synchronize. It may also happen when anyone in our organization publishes a service (start/stop), I think.

2 - We are 24/7 and I can publish from home after hours. I can trigger it after hours. So, I guess it has to be triggered by me or staff doing "something".

To me, it appears intermittent, but I wonder what happens every 30 minutes on your servers. Is that the expiration of a token or a certificate?

I have a scheduled Python script to reset this handler, but it doesn't appear to reset the handler every time.

I can sometimes stop/start a service a couple of times after this and then it comes back.

local_server_url = 'https://'server_sub_domain'.macomb.county:6443/arcgis/admin/local/manageHandler'

What is the exact admin API path you are using to "reload" the SOAP handler?

My ESRI Solutions (Chris) told me about your GeoNet article - maybe our premium support needs to work together.

Justin_Greco
Occasional Contributor III

So I did hear from our support analyst that someone was seeing the synchronizations during a start/stop of services, which we were not able to reproduce on our system. Ours just happens without any human interaction.

Another question I have: when the synchronization happens, does the machine ever stay in a stopped state? What we have seen is that the machine that is synchronizing never fully restarts; instead, we see a single arcsoc process hanging. We can either kill the process or just restart the arcgisserver service to get it started up again. This only happens once in a while.
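
To spot a leftover arcsoc process on Linux, one option is to list the arcsoc processes with their elapsed run time and look for one that predates the synchronization. The sketch below assumes that a hung process is one that has been running longer than the time since the last synchronization, which is my inference and not something confirmed by support. The function names, the "arcsoc" substring match, and the sample paths are all made up for illustration.

```python
import subprocess


def list_processes():
    """Return raw `ps` output with PID, elapsed seconds, and full command line."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etimes,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ps(ps_output, name_fragment="arcsoc"):
    """Return (pid, elapsed_seconds, command) for commands containing name_fragment."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, elapsed, command = parts
        if name_fragment in command.lower():
            matches.append((int(pid), int(elapsed), command))
    return matches


def older_than(processes, seconds):
    """Keep only processes that have been running longer than `seconds`."""
    return [p for p in processes if p[1] > seconds]


# Made-up sample output; on a real server use parse_ps(list_processes()).
sample = (
    "  PID ELAPSED COMMAND\n"
    " 1201    7200 /path/to/arcsoc.exe serviceA\n"
    " 1305      45 /path/to/arcsoc.exe serviceB\n"
    "  900   90000 /usr/sbin/sshd -D\n"
)
suspects = older_than(parse_ps(sample), 600)
print(suspects)
```

A long-running arcsoc process is not necessarily hung, so this only narrows down which PID to inspect before killing anything.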

by Anonymous User
Not applicable

Hi Justin, still troubleshooting. Resetting the handlers every 5 minutes appears to work as a "band aid" - thanks. One more question for you: Do you have your config-store on the same machine as your ArcGIS Server, and not on a network share? I am assuming you have a single-server site.

Justin_Greco
Occasional Contributor III

No, this is a multi-machine site. Is yours a single-server site? We never see this issue if one of the servers is stopped or removed from the site. Our config-store is on a network share since the site is multi-machine.

by Anonymous User
Not applicable

Mine is multi-machine (3), also. When I remove 1 or 2, I still see the SOAP handler cache error and it gets flagged for "re-synchronization" (restarts ArcGISServer.exe shortly after).

The next troubleshooting suggestion from ESRI is moving the config-store local (same machine) to ONE of the ArcGIS servers and seeing what happens. Have you tried this already?

Justin_Greco
Occasional Contributor III

Do you not see the synchronization when you have 3 machines in the site? Do you only see it when you have 2 machines? I haven't tried moving the config-store local to the server; it's a bit different in Linux.

by Anonymous User
Not applicable

Yes, the synchronization happens with 3 machines in the cluster OR 2 machines OR 1 machine.

by Anonymous User
Not applicable

No, it never hangs during the ArcGIS Server Service(SOC manager?) recycle - for us.

Did you test restarting a service from Server Manager and then immediately looking at your logs? It takes a minute or so before it restarts for me.

StuartPidgeon1
New Contributor

Hi Justin (and Dan)

Did you ever find a longer-term solution for this? We are experiencing similar issues with one of our two-machine AGS sites after upgrading to 10.6.1. The issue mostly arises whenever someone publishes a service but can also occur simply when a service is stopped. Interestingly, the issue only affects one of our three sites; the other two sites, set up in much the same way, are not affected. I was interested to see this thread, but after a detailed discussion it stops rather abruptly. If there's any further information you're able to offer, that would be greatly appreciated.

Regards,

Stuart
