We have an existing ArcGIS Server (AGS) 10.0 solution that is hosting close to 1,000 mapping services. We have been working on upgrading this environment to 10.2.1 for a few months now, and we are having a hard time getting it stable. These services see light use, and our program requirements call for an environment that can host a large number of services with little use each. In AGS 10.0 we set all services to 'low' isolation with 8 threads/instance. We also had 90% of our services set to 0 min instances/node to save memory. Below is a summary of our approaches and where we are today. I'm posting this to the community for information, and I am really interested in feedback and/or recommendations to help our organization move forward.
Background on deployment:
We have a few ArcGIS Server deployments that look just like this, and all are running fairly stably with decent performance.
Approach 1: Try to mirror (as close as possible) our 10.0 deployment methodology 1:1
The first problem we ran into was publishing services with 0 instances/node. Esri confirmed 2 'bugs':
#NIM100965 GLOCK files in arcgisserver\config-store\lock folder become frozen when stop/start a service from admin with 0 minimum instances and refreshing the wsdl site
#NIM100306 : In ArcGIS Server 10.2.1, service with 'Minimum Instances' parameter set to 0 gets published with errors on a non-Default cluster
So... that required us to publish all of our services with at least 1 min instance per node. At 1,000 services, that means we needed 100-125GB of RAM just for the idle ArcSOC.exe processes, with no room for future growth....
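The memory figure above can be reproduced with some quick arithmetic. This is only a rough sketch: the 100-125 MB per idle ArcSOC.exe process is inferred from the totals quoted above (it is not a measured value), and the function name is made up for illustration.

```python
def idle_soc_memory_gb(services, min_instances_per_node, mb_per_soc):
    """Estimate RAM (in GB, using 1 GB = 1000 MB) held by idle ArcSOC.exe processes.

    mb_per_soc is an assumed per-process footprint; measure your own
    services before relying on it.
    """
    return services * min_instances_per_node * mb_per_soc / 1000


# 1,000 services, 1 min instance each, at an assumed 100 MB and 125 MB per process
low_estimate = idle_soc_memory_gb(1000, 1, 100)
high_estimate = idle_soc_memory_gb(1000, 1, 125)
print(low_estimate, high_estimate)
```

With 0 min instances the same formula gives 0 GB at idle, which is why the 10.0 configuration was so much cheaper on memory.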
Approach 2: Double the RAM on the AGS Nodes
The file-server crash was clearly caused by publishing a large number of services to this new ArcGIS Server environment. We caused our clustered file servers to crash 3 separate times, all during this publishing workflow. We had no choice but to move the config-store and directories to an alternate location. We moved them to a small web server to see if we could reproduce the crashes there while continuing to move forward. So far that server has not crashed.
During boot-up, with the AGS node hosting all the services, the service startup time was consistently between 20 and 25 minutes. We found a startup timeout setting on each service that was set to 300 seconds (5 minutes) by default. We set that to 1800 seconds (30 minutes) to try to get these machines to start up properly. What was happening is that the ArcSOC.exe processes would build and build until, at some point, they would all start disappearing.
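For anyone scripting the same change across hundreds of services, here is a minimal sketch of the edit itself. It only manipulates a service definition that has already been loaded as a Python dict; fetching and posting the JSON through the Admin API is left out. The property name `maxStartupTime` is an assumption on my part, so check it against your own service JSON before using this.

```python
import copy


def with_startup_timeout(service_def, seconds=1800):
    """Return a copy of a service definition with a longer startup timeout.

    'maxStartupTime' is the assumed name of the startup timeout property
    (in seconds); the original dict is left unmodified.
    """
    updated = copy.deepcopy(service_def)
    updated["maxStartupTime"] = seconds
    return updated


# Hypothetical, trimmed-down service definition for illustration only
service = {"serviceName": "Example", "type": "MapServer", "maxStartupTime": 300}
patched = with_startup_timeout(service)
print(patched["maxStartupTime"])
```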
In the meantime, we also reviewed the ArcGIS 10.2.2 Issues Addressed List which indicated:
NIM099289 Performance degradation in ArcGIS Server when the location of the configuration store is set to a network shared location (UNC).
We asked our Esri contacts for more information regarding this bug fix and basically got this:
…our product lead did provide the following as to what updates we made to address the following areas of concern listed in NIM099289:
- 1. The Services Directory
- 2. Server Manager
- 3. Publishing/restarting services
- 4. Desktop
- 5. Diagnostics
ArcGIS Server was slow generating a list of services in multiple places in the software. Before this change, ArcGIS Server would read all services in a folder from disk every time a list of services was needed - this happened in the services directory, the manager, ArcCatalog, etc. This is normally not that bad, but if you have many, many services in a folder, a high number of requests, and a UNC/network that is not the fastest, then this can become very slow. Instead, we now remember the services in a folder and only update our memory when they have changed.
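To make the described change concrete, here is a small illustrative sketch of the general idea (cache the folder listing, re-read only after a change). It is not Esri's implementation; the class, method names, and the `loader` callback are all hypothetical.

```python
class ServiceListCache:
    """Remember each folder's service list and re-read only after a change."""

    def __init__(self, loader):
        # loader(folder) stands in for the expensive read from the config-store
        self._loader = loader
        self._cache = {}

    def list_services(self, folder):
        # Only hit the (slow) store when nothing is remembered for this folder
        if folder not in self._cache:
            self._cache[folder] = list(self._loader(folder))
        return list(self._cache[folder])

    def mark_changed(self, folder):
        # Called when a service is published, deleted, or edited in the folder
        self._cache.pop(folder, None)


reads = []

def slow_loader(folder):
    reads.append(folder)  # count how often the store is actually read
    return ["ServiceA.MapServer", "ServiceB.MapServer"]

cache = ServiceListCache(slow_loader)
cache.list_services("FolderX")
cache.list_services("FolderX")   # served from memory, no second read
cache.mark_changed("FolderX")
cache.list_services("FolderX")   # re-read after the change
print(len(reads))
```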
Approach 3: Upgrade to 10.2.2 and add 3 more servers
This is the closest we have gotten. At least all services are published. Unfortunately, it is not very stable. We continually receive a lot of errors; here is a brief summary:
Level | Message | Source | Code | Process | Thread
SEVERE | Instance of the service '<FOLDER>/<SERVICE>.MapServer' crashed. Please see if an error report was generated in 'C:\arcgisserver\logs\SERVERNAME.DOMAINNAME\errorreports'. To send an error report to Esri, compose an e-mail to ArcGISErrorReport@esri.com and attach the error report file. | Server | 8252 | 440 | 1
SEVERE | The primary site administrator '<PSA NAME>' exceeded the maximum number of failed login attempts allowed by ArcGIS Server and has been locked out of the system. | Admin | 7123 | 3720 | 1
SEVERE | ServiceCatalog failed to process request. AutomationException: 0xc00cee3a - | Server | 8259 | 3136 | 3373
SEVERE | Error while processing catalog request. AutomationException: null | Server | 7802 | 3568 | 17
SEVERE | Failed to return security configuration. Another administrative operation is currently accessing the store. Please try again later. | Admin | 6618 | 3812 | 56
SEVERE | Failed to compute the privilege for the user 'f7h/12VDDd0QS2ZGGBFLFmTCK1pvuUP1ezvgfUMOPgY='. Another administrative operation is currently accessing the store. Please try again later. | Admin | 6617 | 3248 | 1
SEVERE | Unable to instantiate class for xml schema type: CIMDEGeographicFeatureLayer | <FOLDER>/<SERVICE>.MapServer | 50000 | 49344 | 29764
SEVERE | Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat | <FOLDER>/<SERVICE>.MapServer | 50001 | 49344 | 29764
SEVERE | Unable to instantiate class for xml schema type: CIMGISProject | <FOLDER>/<SERVICE>.MapServer | 50000 | 49344 | 29764
SEVERE | Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat | <FOLDER>/<SERVICE>.MapServer | 50001 | 49344 | 29764
SEVERE | Unable to instantiate class for xml schema type: CIMDocumentInfo | <FOLDER>/<SERVICE>.MapServer | 50000 | 49344 | 29764
SEVERE | Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat | <FOLDER>/<SERVICE>.MapServer | 50001 | 49344 | 29764
SEVERE | Failed to initialize server object '<FOLDER>/<SERVICE>': 0x80043007: | Server | 8003 | 30832 | 17
Error: Error exporting map
Options for the future
We have an existing site with the ArcGIS SOM instance name of 'arcgis'. These 1,000 services have been running in that 10.0 site for the past few years. Users have interacted with it using a URL like: http://www.example.com/arcgis/rest/services/<FOLDER>/<MapService>/MapServer
We are trying to host all these same services so that users accessing this URL will be unaffected. If we cannot, we will switch to 1 server in 1 cluster in 1 site (and instead have 7 sites). We would then re-publish all our content to individual sites, but the services would have different URLs:
We would have an extensive amount of work to either (or both) communicate all the new URLs to our end users (and update all metadata, products, documentation, and content management systems to point to the new URLs) and/or build URL redirect (or URL rewrite) rules for all the legacy services. Neither of these two options is ideal, but right now we seem to have exhausted all other options.
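As a rough illustration of what the redirect rules would have to do, here is a sketch that maps a legacy URL to a per-site URL based on the service folder. The new site names (`site3` and so on) and the folder-to-site table are hypothetical, since the real new URLs are not decided here; an actual deployment would express this as web server rewrite rules rather than Python.

```python
from urllib.parse import urlsplit, urlunsplit

LEGACY_PREFIX = "/arcgis/rest/services/"


def redirect_target(legacy_url, folder_to_site):
    """Map a legacy 'arcgis' site URL to the site now hosting that folder.

    folder_to_site is a hypothetical lookup of service folder -> new site
    name. Returns None when the URL is not a legacy service URL or the
    folder has no known new home.
    """
    parts = urlsplit(legacy_url)
    if not parts.path.startswith(LEGACY_PREFIX):
        return None
    remainder = parts.path[len(LEGACY_PREFIX):]
    folder = remainder.split("/", 1)[0]
    site = folder_to_site.get(folder)
    if site is None:
        return None
    new_path = "/" + site + "/rest/services/" + remainder
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


# Hypothetical folder and site names, for illustration only
mapping = {"Parks": "site3"}
print(redirect_target(
    "http://www.example.com/arcgis/rest/services/Parks/Trails/MapServer?f=json",
    mapping,
))
```

Keeping the lookup keyed on folder means one rule per folder rather than one per service, which matters at 1,000 services.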
Hopefully this will help other users while they troubleshoot their ArcGIS Server deployments. Any ideas on how to improve our strategy are greatly appreciated. Thanks!
I should also have asked about the ArcGIS Server heap size settings. We noticed these settings available under the following URL: https://servername.domain/agspub/admin/machines/machinename.domain?f=pjson
"platform": "Windows Server 2008 R2-amd64-6.1",
I am specifically interested in the 'appServerMaxHeapSize' and 'socMaxHeapSize' settings. Can anyone provide more insight into those settings?
We briefly bumped those up (doubled them at one point), but that did not seem to help the situation. We doubled them again (and even a third time) to see if it helped with stability or performance. The highest we went was:
It ended up causing a crash on one of the AGS nodes, which logged the following Windows events:
Log Name: System
Date: 10/7/2014 8:47:13 AM
Event ID: 2004
Task Category: Resource Exhaustion Diagnosis Events
Keywords: Events related to exhaustion of system commit limit (virtual memory).
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: javaw.exe (56032) consumed 954195968 bytes, ArcGISServer.exe (41556) consumed 927776768 bytes, and ArcSOC.exe (20524) consumed 407146496 bytes.
We subsequently set those settings back to the defaults.
Sounds like you have your hands full. First, we also have a large, distributed site running 10.2.2. In all, the site consists of 3 app servers hosting AGS 10.2.2. All app servers are Windows Server 2008 R2 machines; each application server has 32GB RAM and dual 4-core CPUs with hyperthreading enabled. Two machines participate in a mapCluster, with the 3rd in the gpCluster that handles async tasks, caching, data extract tasks, etc.
The web server hosting our Web Adaptor is a virtual machine with 8GB RAM. The file server hosting our directories and stores has 16GB and the same CPU setup, on a 10-1 RAID array. The file server was previously a SAN gateway, but we removed it due to performance issues.
The database server hosting SDE 10.2.2 on SQL Server 2008 R2 (SqlGeometry storage) also has 32GB RAM and dual 4-core CPUs, but NO hyperthreading enabled.
We are stable in terms of performance and availability with some 70 services. The mapCluster is usually running some 8GB under load with 32-35 SOC procs balanced between each.
A couple of observations. Why are you running one server per cluster? To us, that seemed to defeat the whole purpose of the Web Adaptor's load balancing. I would place 3 of your servers in one mapCluster and place your fourth in a gpCluster. Also, if you are using full web-tier and Windows domain authentication, why are you using the built-in roles and not AD roles? We found that using our AD without using AD roles led to a lot of instability. Lastly, for us, we found that reverting back to GIS-tier authentication with a GIS-user-defined store and internal roles greatly enhanced our control and scalability.
Hope this helps.
Hi David, thanks for the response!
We are running 1 server/cluster to scale the number of services that can be hosted rather than the number of requests that can be fulfilled. When 2 servers are added to 1 cluster, it scales the number of requests a service can handle rather than how many services can be hosted. The 'minInstancesPerNode' and 'maxInstancesPerNode' settings specify how many instances per node the publishers would like available for their services.
Assume we set 1 'min instances' and 2 'max instances' on each service (the default). Assume we had 4 servers, each spec'd to host 250 total instance executables...
Scenario 1: Scale the amount of services to be hosted
If we have 1 server/cluster and 4 total clusters, then we can host 1,000 total services when idle (4 servers * 250 instances/server), assuming each service only needs 1 instance per node. Obviously this leaves no room for growth in terms of hosting more services or having more instances available for requests.
Scenario 2: Scale the amount of requests that need to be fulfilled
If we had a 3-node cluster, then while idle ArcGIS Server would build 1 instance on each node in that cluster for each service (so that 3 executables can handle requests rather than the 1 executable in the scenario above). In that scenario, I can only host (at most) 250 total services, and only if none of them ever needs a second instance per node spun up. Obviously this leaves no room for growth in terms of hosting more services.
Based on our business requirements (for this project).... this ArcGIS Server platform needs to provide a large number of services that see little use (at this point in time). Isolating the servers into individual clusters meets that need and also provides some isolation from each other. Monitoring usage in the future may change this publishing model (we may set up a multi-node cluster and move services that become popular to it). Make sense?
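The two scenarios boil down to simple arithmetic, sketched below. The function name is made up for illustration, and it assumes every service in a cluster starts its minimum instances on every node of that cluster, which is the behavior described above.

```python
def max_hostable_services(total_servers, nodes_per_cluster,
                          instances_per_server, min_instances_per_node=1):
    """Upper bound on idle services a set of equal-sized clusters can host.

    Each service runs min_instances_per_node on EVERY node of its cluster,
    so a cluster's capacity equals one node's capacity no matter how many
    nodes the cluster has.
    """
    clusters = total_servers // nodes_per_cluster
    services_per_cluster = instances_per_server // min_instances_per_node
    return clusters * services_per_cluster


# Scenario 1: 4 servers, 1 server per cluster, 250 instances per server
print(max_hostable_services(4, 1, 250))
# Scenario 2: a single 3-node cluster, 250 instances per server
print(max_hostable_services(3, 3, 250))
```

Adding nodes to a cluster raises request capacity per service, while adding clusters raises the number of services that fit.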
As for the AD Roles... we had some pretty serious performance issues at ArcGIS Server 10.1 and have not revisited since. I'm not sure if those have been worked out or not in 10.2.2.
I am interested in the QFE you referenced. We have been pushing support for a hot fix on both of these issues without much success. Do you have more information about this QFE: what was fixed, how to get hold of it, and how to deploy it? Thanks!