ArcGIS (10.1 SP1) Site and Web Adapter randomly crash and stop responding

Anonymous User · ‎05-21-2013

Original User: btelliot

We are struggling with two main problems since moving from ArcServer 10.0 to 10.1.

1. Poor Performance
2. Constant Server downtime / general site instability.

Our ideal server architecture would be to have a multiple virtual machine site, with 2 clusters, and a single web adapter running only using SSL. See attached image for configuration.
[ATTACH=CONFIG]24566[/ATTACH]

We currently have about 350 services running on our site.

~300 of which are configured with a minimum instance of 0 (should turn themselves off) and a max instance of 2.

~20-30 are cached

All running in High-Isolation

Licensed for 4 cores per machine, plus an additional staging license (12 cores total).

16 GB RAM per machine.

The web adapter runs on 1 core.

Performance

Our main performance issue comes from administering and publishing services. Since we have multiple machines, we need to reference the config store through a UNC path, and slow administration in that setup is a known bug that should be fixed in SP2. (Why they haven't released a hotfix for this is beyond me.) For more details see thread: http://forums.arcgis.com/threads/66388-Slow-performance-administering-services-in-ArcCatalog-and-Arc...

However, we also have performance issues on the web client side of things. These issues are intermittent and difficult to replicate. We can measure this latency using the Network tab of the Developer Tools in Google Chrome. It will sometimes take 3-5 minutes for the server to return the data to the web browser, even on cached services that are already running.
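To catch these intermittent delays without keeping a browser open, the same timing can be scripted. This is only a minimal sketch, not something from our actual setup: the function names, the 5-second threshold, and the injectable `opener` argument are all assumptions for illustration.

```python
import time
import urllib.request


def time_request(url, opener=urllib.request.urlopen, timeout=300):
    """Return (elapsed_seconds, ok) for a single GET of `url`.

    `opener` is a hypothetical hook so the probe can be exercised
    without a live server; by default it is urllib's urlopen.
    """
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()  # include the download time, as the Network tab does
        ok = True
    except OSError:  # URLError, HTTPError and socket timeouts are all OSError
        ok = False
    return time.perf_counter() - start, ok


def slow_samples(samples, threshold_s=5.0):
    """Keep only the (label, seconds) samples at or above the threshold."""
    return [(label, secs) for label, secs in samples if secs >= threshold_s]
```

Calling `time_request` on a schedule against a cached service's REST URL and logging the results would show whether the multi-minute responses cluster at particular times.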

Depending on our configuration and the complexity of the MXD, publishing a service usually takes around 5 minutes at the best of times. At the worst, republishing an existing map document can take up to 30 minutes. If anyone else has experienced any of these issues please let me know!

We have monitored our system resources on the virtual machines, and we rarely hit upwards of 30% CPU usage, unless caching or restarting the machines.

Stability

Since moving to 10.1, we have gone at most 1 week without a server outage or issue. As we are growing as a company, more people are relying on our services in their workflows, and downtime becomes less and less bearable. In theory, a multiple machine site should be more stable: if one server goes offline, the web adapter recognizes this and redirects the traffic to a different server.

Main Issue:

We have noticed that ArcServer running on one of the machines will periodically crash and stop working.

We don't see a spike in system resources, or any other telltale signs on the VMs. It just stops responding as it should.

We will experience this at least once a day.

Our normal fix is:

Check to see if the web adapter is responding; if not, restart the VM

Check to see if each individual machine is responding (try to log in to ArcGIS Server Manager); if not, restart the VM

Reboot whichever server is crapping out; if that doesn't work, try the other one(s).

If it still doesn't work, try stopping the machine from https://[machinename]:6443/arcgis/admin, and then starting it again.

If anyone has some insight into what may be causing this issue, please let us know.

Thank you for taking the time to read this!

Brett

TL;DR: ArcServer 10.1 SP1 is still very buggy. ArcServer will randomly stop working, and we will need to reboot the virtual machine it is running on. We have to do this A LOT.

AndrewSchumpert · ‎01-15-2015

I'll add that I too have these issues and concerns. Our 10.0 ArcGIS Server was running with about 500 services. Then, after our transition to 10.2.2, it began to get unstable at around 200 services.

I found that setting the service recycle time to a random value, so that all the services do not restart at the same time (00:00), helps. I also found that setting the service start-up timeout to a higher value (currently 15 minutes is working) seems to ensure the services start up. Otherwise some will time out and fail to start, and then my guess is the config-store gets confused, and managing these particular services sometimes fails.
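A rough sketch of how the recycle times could be spread out, in case it helps anyone. The function names and the 00:00 to 05:59 window are my own assumptions, and `recycleStartTime` is the property name as I understand it from the service JSON, so check it against your own service definitions before relying on it.

```python
import random


def random_recycle_times(service_names, first_hour=0, last_hour=5, seed=None):
    """Map each service name to a random "HH:MM" start time.

    The hour window is an illustrative assumption (overnight hours);
    pass a seed only if you want repeatable assignments.
    """
    rng = random.Random(seed)
    return {
        name: "{:02d}:{:02d}".format(
            rng.randint(first_hour, last_hour), rng.randrange(60)
        )
        for name in service_names
    }


def with_recycle_time(service_def, hhmm):
    """Return a copy of a service definition with a new recycle start time.

    Assumes the definition uses the "recycleStartTime" key.
    """
    updated = dict(service_def)
    updated["recycleStartTime"] = hhmm
    return updated
```

Each updated definition would still have to be saved back to the server, which this sketch does not do.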

Hernando_CountyProperty_Apprai · ‎02-09-2015

"Also found that setting the service start-up timeout to a higher value (currently 15 minutes is working) seems to ensure the services start-up. "

Hello Andrew. Can you tell me where this setting is? Are you referring to The maximum time a client will wait to get a service?

ErinBrimhall · ‎02-09-2015

Priscilla,

See the maxStartupTime property of a service. You can set this value through the ArcGIS REST API.
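Here is a minimal sketch of what that edit could look like from Python. It assumes the admin API's service `edit` operation, which takes the full service JSON in a `service` form field plus a token, and it assumes `maxStartupTime` is in seconds. The helper names, host name, and service path below are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request


def with_startup_timeout(service_def, seconds):
    """Return a copy of the service JSON with maxStartupTime changed."""
    updated = dict(service_def)
    updated["maxStartupTime"] = int(seconds)  # assumed to be in seconds
    return updated


def build_edit_request(admin_url, service_path, service_def, token):
    """Build (but do not send) the POST for the service's edit operation.

    admin_url is e.g. https://[machinename]:6443/arcgis/admin and
    service_path is e.g. "Parcels.MapServer" (a made-up service).
    """
    url = "{}/services/{}/edit".format(admin_url.rstrip("/"), service_path)
    body = urllib.parse.urlencode({
        "service": json.dumps(service_def),
        "token": token,
        "f": "json",
    }).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")
```

The usual flow would be to fetch the current service JSON from the admin API, pass it through `with_startup_timeout(service_def, 900)` for a 15 minute timeout, and send the resulting request with `urllib.request.urlopen`.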

Hernando_CountyProperty_Apprai · ‎02-09-2015

Thank you, Erin. So this is not something that you can set through Manager, correct? Is there any way to change the default?

ErinBrimhall · ‎02-09-2015

Correct, it doesn't appear that maxStartupTime is exposed through either the Desktop or Manager UIs. It's likely one of those properties that the vast majority of users never need to change, so it's omitted for brevity.

And, to my knowledge, the default values for the various service "timeout" properties are not configurable, or at least the configuration is not officially documented by Esri.

JasonHarris2 · ‎02-19-2015

I seem to have become more stable over time - probably because I reboot the servers once a week now over the weekend.

I have one lingering issue that's driving me nuts. Intermittently, my REST responsiveness will drop dramatically. It looks like it happens about every 10 minutes, and only for a couple of seconds. But any request made during that time hangs real bad. I have a script that pings my web adapter, port 6080 on the server, and a web adapter on the server. Generally speaking, they all experience the slowdown at the same time, so I don't think it's a web adapter specific issue, but something at the AGS root level. Check out the perfectly acceptable response times, then Boom!

"Information","ajp-bio-8014-exec-2","02/19/15","13:38:15",,"119 ms"

"Information","ajp-bio-8014-exec-2","02/19/15","13:38:30",,"116 ms"

"Information","ajp-bio-8014-exec-3","02/19/15","13:38:45",,"119 ms"

"Information","ajp-bio-8014-exec-2","02/19/15","13:39:13",,"13462 ms"

"Information","ajp-bio-8014-exec-3","02/19/15","13:39:17",,"2984 ms"

"Information","ajp-bio-8014-exec-3","02/19/15","13:39:30",,"184 ms"

"Information","ajp-bio-8014-exec-2","02/19/15","13:39:45",,"102 ms"

"Information","ajp-bio-8014-exec-2","02/19/15","13:40:00",,"95 ms"

I believe I have ruled out network issues. I've done lots and lots of ping and connectivity tests, and everything seems to check out there. It's just like the box 'chokes' every ten minutes for a couple of seconds. Nothing in the logs corresponds with these 'choke' times that I can see. I thought maybe disk I/O on the data store, but that seems to be fine as well. I'm at a loss.
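For anyone who wants to scan a longer stretch of a log like the excerpt above for these spikes, here is a small sketch. It assumes every line has the same comma-separated layout as the excerpt (date in the third field, time in the fourth, "N ms" in the last); the function names and the 1000 ms threshold are just illustrative choices.

```python
import csv
from datetime import datetime


def parse_samples(lines):
    """Turn log lines into a list of (timestamp, milliseconds) tuples."""
    samples = []
    # Blank lines are skipped; csv handles the quoted fields.
    for row in csv.reader(line for line in lines if line.strip()):
        when = datetime.strptime(row[2] + " " + row[3], "%m/%d/%y %H:%M:%S")
        millis = int(row[-1].split()[0])  # "13462 ms" -> 13462
        samples.append((when, millis))
    return samples


def spikes(samples, threshold_ms=1000):
    """Keep only the samples at or above the threshold."""
    return [(when, ms) for when, ms in samples if ms >= threshold_ms]
```

Lining up the spike timestamps from several hours of log would show whether the interval really is a steady ten minutes, which might point at a scheduled job rather than load.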