We have about 600 map services in a multi-machine AGS site. To keep the number of SOCs low, we have the minimum instances set to 0 and the maximum set to 2 for most of these services. When web maps containing layers from services with 0 minimum instances are opened in the map viewer, the layers fail to load and eventually time out.
The AGS server logs simply state that it failed to create an instance, even though validation of the registered database passes. If we set the minimum instances to 1, the services load without any issues. I am starting to think the server is failing to spin up new ArcSOCs when it has a large number of services published.
Interestingly, these services performed better in our 10.8.1 server setup, although we had to keep validating the database every few hours to make it work there. Doing the same doesn't seem to help in 11.1.
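For a rough sense of scale, here is a minimal sketch of the arithmetic behind those settings. It assumes every one of the 600 services uses the same dedicated-instance limits and that the limits apply per machine; the function name and the single-machine default are illustrative only, since the post does not give the machine count.

```python
def arcsoc_bounds(service_count, min_per_machine, max_per_machine, machines=1):
    """Return (lowest, highest) dedicated ArcSOC count across the site.

    Assumes all services share the same min/max settings, which is a
    simplification: the post says "most", not all, services use 0 and 2.
    """
    lowest = service_count * min_per_machine * machines
    highest = service_count * max_per_machine * machines
    return lowest, highest


# Minimum 0, maximum 2: nothing is running while idle.
print(arcsoc_bounds(600, 0, 2))
# Minimum 1, maximum 2: one ArcSOC per service stays up on each machine.
print(arcsoc_bounds(600, 1, 2))
```

Raising the minimum from 0 to 1 trades on-demand startup for a permanent baseline of one process per service per machine, which is worth weighing against available RAM.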
Any inputs from community members and Esri staff would be helpful.
We have tried using shared instances, and didn't have much luck with that either. The "Initialization failed." error happens for that too.
Some things in here might help:
Yeah, we tried increasing the Windows heap size to 1280 by editing the registry on all participating machines. It didn't seem to help.
We had this TCPKEEPALIVE and firewall issue years ago; not sure if it is still relevant:
https://webhelp.esri.com/arcims/9.3/General/topics/stability_firewall.htm
Sounds like you are hitting your server limits. How many ArcGIS Servers do you have participating in your site? I'm a bit surprised that shared instances aren't working for you. I would recommend everything on shared instances except your mission critical services.
I've seen the same services consume more RAM on my server when set to dedicated instances than when set to shared instances. This is with the service idle; there are still things behind the scenes consuming resources.
How are CPU and RAM when you experience these errors? It's a dumb approach, but you may want to turn off all your services, then slowly turn them back on in batches of 10 or so to see when the performance hit begins. You can also use tools like JMeter for load testing: Performance Testing with Apache JMeter (An Introdu... - Esri Community
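The batch approach above could be scripted along these lines. This is only a sketch: `start_service` and `after_batch` are hypothetical callbacks you would wire to whatever you use to start a service and to record CPU/RAM, and the service names in the demo are made up.

```python
import time


def batches(items, size=10):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def start_in_batches(service_names, start_service, size=10,
                     pause_seconds=0, after_batch=None):
    """Start services group by group, pausing between groups.

    start_service: hypothetical callback that starts one service by name.
    after_batch:   optional hypothetical callback, called with the group just
                   started and the running total, e.g. to log CPU and RAM.
    """
    started = []
    for group in batches(service_names, size):
        for name in group:
            start_service(name)
            started.append(name)
        if after_batch is not None:
            after_batch(group, len(started))
        time.sleep(pause_seconds)  # give the machines time to settle
    return started


# Demo with made-up service names and a stub that only records the call.
demo_names = [f"Folder/Service{i}.MapServer" for i in range(25)]
calls = []
start_in_batches(
    demo_names,
    start_service=calls.append,
    after_batch=lambda group, total: print(f"started {total} services so far"),
)
```

Noting the running total at which "failed to create an instance" first appears would give a rough ceiling for the site as configured.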
In addition to this, how many ArcSOC processes do you have running on each server within your site when these errors are occurring? The article below contains commands you can use to obtain the ArcSOC count
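Separately from the article's commands, here is one possible way to get the count on a Windows machine, as a sketch only. It assumes the process image name is `ArcSOC.exe` and parses the CSV output of the standard Windows `tasklist` command; the sample text in the demo is made up to show the format.

```python
import csv
import io
import subprocess


def count_arcsoc(tasklist_csv):
    """Count rows in `tasklist /FO CSV /NH` output whose image is ArcSOC.exe."""
    reader = csv.reader(io.StringIO(tasklist_csv))
    return sum(1 for row in reader if row and row[0].lower() == "arcsoc.exe")


def arcsoc_count_on_this_machine():
    """Run tasklist locally (Windows only) and count ArcSOC.exe processes."""
    output = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH", "/FI", "IMAGENAME eq ArcSOC.exe"],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_arcsoc(output)


# Made-up sample in tasklist's CSV layout: name, PID, session, session#, memory.
sample = (
    '"ArcSOC.exe","4120","Services","0","212,340 K"\n'
    '"ArcSOC.exe","5288","Services","0","198,764 K"\n'
    '"javaw.exe","3300","Services","0","1,024,000 K"\n'
)
print(count_arcsoc(sample))
```

Run `arcsoc_count_on_this_machine()` on each machine in the site while the errors are occurring, so the counts can be compared against CPU and RAM at that moment.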
