Large ArcGIS Server 'Site': stability issues

10-14-2014 09:54 AM
by PF1, Occasional Contributor II

We have an existing ArcGIS Server (AGS) 10.0 solution that is hosting close to 1,000 mapping services.  We have been working on an upgrade of this environment to 10.2.1 for a few months now, and we are having a hard time getting a stable environment.  These services have light use, and our program requirements call for an environment that can handle a large number of services with little use.  In the AGS 10.0 space we would set all services to 'low' isolation with 8 threads/instance.  We also had 90% of our services set to 0 min instances/node to save on memory.  Below is a summary of our approaches and where we are today.  I'm posting this to the community for information, and I am really interested in feedback and/or recommendations to move this forward for our organization.
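For readers less familiar with the Admin API, the 10.0-era settings described above (low isolation, 8 threads/instance, 0 minimum instances) correspond to fields in each service's JSON definition. A minimal sketch of building such an edit (field names as we understand them from the ArcGIS Server Administrator API; the actual POST to the service's `.../edit` endpoint is omitted):

```python
def low_isolation_edit(service_json, threads_per_process=8, min_instances=0):
    """Return a copy of an Admin API service definition tuned for many
    lightly-used services: LOW isolation (several threads share one
    ArcSOC.exe) and 0 minimum instances so idle services hold no RAM."""
    edited = dict(service_json)
    edited["isolationLevel"] = "LOW"                  # share a process across threads
    edited["instancesPerContainer"] = threads_per_process
    edited["minInstancesPerNode"] = min_instances
    return edited

# Example: take an existing definition and produce the edited payload
payload = low_isolation_edit({"serviceName": "Roads"})
```

The payload would then be submitted to `.../admin/services/<folder>/<service>.MapServer/edit` along with a token; that request is environment-specific and not shown here.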

Background on deployment:

  • Targeted ArcGIS Server 10.2.1
  • Config-store/directories are hosted on a clustered file server (active/passive) and presented as a share: \\servername\share
  • Web-tier authentication
    • 1 web-adaptor with anonymous access
    • 1 web-adaptor with authenticated access (Integrated Windows Authentication with Kerberos and/or NTLM providers)
    • 1 web-adaptor 'internally' with authenticated access and administrative access enabled (use this for publishing)
  • User-store: Windows Domain
  • Role-store: Built-In

We have a few ArcGIS Server deployments that look just like this, and all are running fairly stably with decent performance. 

Approach 1: Try to mirror (as close as possible) our 10.0 deployment methodology 1:1

  • Build 4 AGS 10.2.1 nodes (virtual machines).    
  • Build 4 individual clusters & add 1 machine to each cluster
  • Deploy 25% of the services to each cluster.  The AGS Nodes were initially spec'd with 4 CPU cores and 16GB of RAM.
    • Each ArcSOC.exe seems to consume anywhere from 100-125MB of RAM (sometimes up to 150 or as low as 70). 
  • Publishing 10% of the services with 1 min instance (and the other 90% with 0 min instances) would leave around 25 ArcSOC.exe processes on each server when idle. 
    • The 16GB of RAM could host a total of 100-125 total instances leaving some room for services to startup instances when needed and scale slightly when in use.

Our first problem was publishing services with 0 instances/node.  Esri confirmed 2 'bugs':

#NIM100965 GLOCK files in arcgisserver\config-store\lock folder become frozen when stop/start a service from admin with 0 minimum instances and refreshing the wsdl site

#NIM100306 : In ArcGIS Server 10.2.1, service with 'Minimum Instances' parameter set to 0 gets published with errors on a non-Default cluster

So... that required us to publish all of our services with at least 1 min instance per node.  At 1,000 services, that means we needed 100-125GB of RAM just for the running ArcSOC.exe processes, with no room for future growth....
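The arithmetic behind that RAM figure follows directly from the per-process footprint we observed; a quick back-of-the-envelope helper (illustrative only):

```python
def min_instance_ram_gb(services, min_instances_per_node=1, mb_per_soc=(100, 125)):
    """Estimate the idle RAM (in GB) needed to keep min_instances_per_node
    ArcSOC.exe processes warm for every service, given an observed
    per-process footprint range in MB."""
    low_mb, high_mb = mb_per_soc
    processes = services * min_instances_per_node
    return (processes * low_mb / 1024, processes * high_mb / 1024)

# 1,000 services, 1 min instance each, 100-125 MB per ArcSOC.exe:
low_gb, high_gb = min_instance_ram_gb(1000)   # roughly 98-122 GB before any headroom
```

This is why 0-min-instance publishing mattered so much to our design: without it, idle footprint scales linearly with the service count.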

Approach 2: Double the RAM on the AGS Nodes

  • We added an additional 16GB of RAM to each AGS node (they now have 32GB of RAM), which should host 200-250 ArcSOC.exe processes (still tight for hosting all 1,000 services). 
  • We published about half of the services (around 500) and started seeing some major stability issues. 
    • During our publishing workflow... the clustered file server would crash. 
      • This file server hosts the config-store/directories for about 4 different *PRODUCTION* arcgis server sites. 
      • It also hosts our Citrix users' workspaces and about 13TB of raster data. 
      • During a crash, it would fail-over to the passive file server and after about 5 minutes the secondary file server would crash. 
      • This is considered a major outage!
    • On the last crash, some of the config-store was corrupted.  While trying to log in to the 'admin' or 'manager' end-points, we received an error that had some sort of parsing issue.  I cannot find the exact error message.  We had disabled the primary site administrator account, so we went in to re-enable it, but the super.json file was EMPTY!  We had our backup team restore the entire config-store from the previous day and copied over the file.  I'm not sure what else was corrupted.  After restoring that file we were able to log in again with our AD accounts. 

The file-server crash was clearly caused by publishing a large number of services to this new ArcGIS Server environment.  We caused our clustered file servers to crash 3 separate times, all during this publishing workflow.  We had no choice but to isolate the config-store/directories to an alternate location.  We moved them to a small web server to see if we could reproduce the crashes there and continue moving forward.  So far it has not crashed that server. 

During bootups, with the AGS nodes hosting all the services, the service startup time was consistently between 20 and 25 minutes.  We were able to find a startup timeout setting on each service that was set to 300 seconds (5 minutes) by default.  We set that to 1800 seconds (30 minutes) to try to get these machines to start up properly.  What was happening is that the ArcSOC.exe processes would build and build until, at some point, they would all start disappearing. 
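The timeout we changed is a per-service property, so raising it for 1,000 services has to be scripted. A sketch of the bulk edit (we are assuming the `maxStartupTime` key in the Admin API service JSON, applied service-by-service via each `.../edit` endpoint):

```python
def raise_startup_timeouts(service_defs, seconds=1800):
    """Return new service definitions with the per-service startup
    timeout raised from the 300 s default, for slow bulk boots.
    Each result would then be POSTed back to its .../edit endpoint."""
    return [{**svc, "maxStartupTime": seconds} for svc in service_defs]

edited = raise_startup_timeouts([{"serviceName": "A"},
                                 {"serviceName": "B", "maxStartupTime": 300}])
```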

In the meantime, we also reviewed the ArcGIS 10.2.2 Issues Addressed List which indicated:

NIM099289 Performance degradation in ArcGIS Server when the location of the configuration store is set to a network shared location (UNC).

We asked our Esri contacts for more information regarding this bug fix and basically got this:

…our product lead did provide the following as to what updates we made to address the following areas of concern listed in NIM099289:

  1. The Services Directory
  2. Server Manager
  3. Publishing/restarting services
  4. Desktop
  5. Diagnostics

ArcGIS Server was slow generating a list of services in multiple places in the software.  Before this change, ArcGIS Server would read all services in a folder from disk every time a list of services was needed - this happened in the Services Directory, Manager, ArcCatalog, etc.  This is normally not that bad, but if you have many, many services in a folder, a high number of requests, and a UNC/network that is not the fastest, then this can become very slow.  Instead, we now remember the services in a folder and only update our memory when they have changed.

Approach 3: Upgrade to 10.2.2 and add 3 more servers

  • We added 3 more servers to the 'site' (all 4 CPU, 32GB RAM) and upgraded everything to 10.2.2.  We actually re-built all the machines from scratch again.
  • We threw away our existing config-store and directories, since we knew at least 1 file was corrupt.  We essentially started from square one again. 
  • All AGS nodes got a fresh install of 10.2.2 (confirmed that refreshing folders from the REST page was much faster). 
  • Config-store still hosted on the web server.
  • We mapped our config-store to a DFS location so that we could move it around later.
  • Published all 1,000-ish services successfully across 7 separate 'clusters'.
  • Changed all isolation back to 'high' for the time being. 

This is the closest we have gotten; at least all services are published.  Unfortunately, it is not very stable.  We continually receive a lot of errors; here is a brief summary:

    

Level | Message | Source | Code | Process | Thread
SEVERE | Instance of the service '<FOLDER>/<SERVICE>.MapServer' crashed. Please see if an error report was generated in 'C:\arcgisserver\logs\SERVERNAME.DOMAINNAME\errorreports'. To send an error report to Esri, compose an e-mail to ArcGISErrorReport@esri.com and attach the error report file. | Server | 8252 | 440 | 1
SEVERE | The primary site administrator '<PSA NAME>' exceeded the maximum number of failed login attempts allowed by ArcGIS Server and has been locked out of the system. | Admin | 7123 | 3720 | 1
SEVERE | ServiceCatalog failed to process request. AutomationException: 0xc00cee3a - | Server | 8259 | 3136 | 3373
SEVERE | Error while processing catalog request. AutomationException: null | Server | 7802 | 3568 | 17
SEVERE | Failed to return security configuration. Another administrative operation is currently accessing the store. Please try again later. | Admin | 6618 | 3812 | 56
SEVERE | Failed to compute the privilege for the user 'f7h/12VDDd0QS2ZGGBFLFmTCK1pvuUP1ezvgfUMOPgY='. Another administrative operation is currently accessing the store. Please try again later. | Admin | 6617 | 3248 | 1
SEVERE | Unable to instantiate class for xml schema type: CIMDEGeographicFeatureLayer | <FOLDER>/<SERVICE>.MapServer | 50000 | 49344 | 29764
SEVERE | Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat | <FOLDER>/<SERVICE>.MapServer | 50001 | 49344 | 29764
SEVERE | Unable to instantiate class for xml schema type: CIMGISProject | <FOLDER>/<SERVICE>.MapServer | 50000 | 49344 | 29764
SEVERE | Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat | <FOLDER>/<SERVICE>.MapServer | 50001 | 49344 | 29764
SEVERE | Unable to instantiate class for xml schema type: CIMDocumentInfo | <FOLDER>/<SERVICE>.MapServer | 50000 | 49344 | 29764
SEVERE | Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat | <FOLDER>/<SERVICE>.MapServer | 50001 | 49344 | 29764
SEVERE | Failed to initialize server object '<FOLDER>/<SERVICE>': 0x80043007 | Server | 8003 | 30832 | 17

Other observations:

  • Each AGS node makes 1 connection (session) to the file-server containing the config-store/directories
  • During idle times, only 35-55 files are actually open from that session.
  • During bootups (and bulk administrative operations), the number of open files jumps consistently between 1,000 and 2,000 per session
  • The 'system' process on the file server spikes especially during bulk administrative processes. 
  • The AGS nodes are consistently in communication with the file server (even when the site is idle).  CPU/Memory and Network monitor on that looks like this:
    fileserver_idle.png
    fileserver_idle_network.png
  • AGS nodes look similar.  It seems there is a lot of 'chatter' when sitting idle. 
  • Requests to a service succeed 90% of the time but 10% of the time we receive HTTP 500 errors:

Error: Error exporting map

Code: 500

Options for the future

We have an existing site with the ArcGIS SOM instance name 'arcgis'.  These 1,000 services have been running in that 10.0 site for the past few years.  Users have interacted with it using a URL like: http://www.example.com/arcgis/rest/services/<FOLDER>/<MapService>/MapServer

We are trying to host all these same services so that users accessing this URL will be unaffected.  If we cannot, we will switch to 1 server in 1 cluster in 1 site (and instead have 7 sites).  We would then re-publish all our content to individual sites, but with different URLs:

http://www.example.com/arcgis1/rest/services/<FOLDER>/<MapService>/MapServer

http://www.example.com/arcgis2/rest/services/<FOLDER>/<MapService>/MapServer

...

...

http://www.example.com/arcgisN/rest/services/<FOLDER>/<MapService>/MapServer

We would have an extensive amount of work to either (or both) communicate all the new URLs to our end users (and update all metadata, products, documentation, and content management systems to point to the new URLs) and/or build URL redirect (or URL rewrite) rules for all the legacy services.  Neither option is ideal, but right now we seem to have exhausted all other options. 

Hopefully this will help other users while they troubleshoot their ArcGIS Server deployments.  Any ideas on making our strategy better are greatly appreciated.  Thanks!

51 Replies
PF1, Occasional Contributor II

Hi Michael,

Unfortunately, a cloud-hosted solution (Amazon IaaS or AGOL SaaS) is not an option for us at this time.  I work for a government agency that requires FISMA accreditation/compliance and a signed Authority To Operate (ATO).  We are working on that with a few cloud vendors, but we are not quite ready for it as an agency.  As for hardware, we have plenty we can throw at our on-premises deployment, but that does not seem to be our bottleneck. 

As for the question regarding the number of mapping services and potential re-architecture: these are contract deliverables that represent a static view of these project areas.  The current 1,000 services are our *legacy* footprint, which has been frozen.  Basically, there are some potential enhancements for future services, but these legacy ones are frozen in time, and we in the IT department are on the hook for providing a hosting platform that meets our agency's business needs. 

thanks!

PaulDavidson1
Occasional Contributor III

Did not AGOL receive FISMA certification in summer 2014?

It was announced at UC, at least that's what I recall.

Doesn't fix your ATO and other issues however.

I also wonder about 1000 services.

That is huge.  Of course, so is the area the BLM covers...

Especially if these are static, it would seem you could put multiple layers onto one mxd and publish like that.

In fact, since they're static, you never have to touch the mxds again, once the deliverable (an mxd?) is received.

It sounds like your end users are basically demanding one service per layer?

I have learned one thing about being the IT side of the GIS equation.

Sometimes the end users don't understand the tech difficulties and have to be flexible with their work flows....

Sometimes IT has to say, it can't be done that way but we can do it this way....

Or, we can do it like that but the time and cost is X and doing it this other way costs Y and X >10Y

That is something management usually pays attention to.

Publishing a rarely used, single static layer in a map service just seems like a huge waste of resources.

I am quite curious to know if 10.3 fixes the 0 instance errors.

George_Thompson
Esri Frequent Contributor

Here is some more information on the AGOL Compliance certifications:

Compliance—Trust ArcGIS | ArcGIS

Trust | ArcGIS

Hope this helps!

-George

--- George T.
AaronKreag
Occasional Contributor

I am having the exact same issue now with 10.2.2.  Would love to chat about it sometime.  Did you switch to 10.3?  I haven't seen the same issues mentioned online with 10.3, and I am wondering if it is a more stable version of AGS.  My email is akreag at bisconsultants dot com.  In fact, anyone in this thread... please send me a note.  Thanks!!

DanHuber
New Contributor III

Aaron,

While we did switch to 10.3, the only resolution for this matter was to break the site up into smaller sites.  This is also how Esri now recommends implementing this type of solution.  So rather than 7 servers participating in a single site, we have 4 new sites with no more than two servers each, each in their own cluster.  

Contact me directly if you want to discuss further, and I hope this helps.

Dan

PaulDavidson1
Occasional Contributor III

Dan:

Is 2 servers per cluster now considered best practice? 

In which case why bother with the cluster and instead create lots of individual sites.

But then don't we lose out on the concept of letting ArcGIS Server allocate for us?

I have been told 2 VMs of 2 cores each is better than 1 with 4 cores (depending on load)

Interestingly, this coincides with the software direction Esri started pushing of smaller directed apps vs. the one big do it all map.

Tx

-Paul

AaronKreag
Occasional Contributor

As a follow-up: in consultation with ESRI, we found a series of XML-type errors in the logs:

Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat - 88

Unable to instantiate class for xml schema type: CIMDEGeographicFeatureLayer - 29

Unable to instantiate class for xml schema type: CIMDocumentInfo - 30

Unable to instantiate class for xml schema type: CIMGISProject - 29

In the AGS logs.

\logs\servername\services\MXDNAME.MapServer

We verified that CPU and RAM were not over-utilized, and ruled out the firewall, antivirus, and network.  It looks like at 10.1 and newer, not only does ESRI recommend a minimum of 4GB RAM per core in a virtual server, but from the sound of it, the virtual memory settings in Windows are a bigger issue.  We adjusted the virtual memory to system-managed, rebooted the server, and have now gone 4 days with no issues.

According to ESRI....they are thinking that virtual memory and the "committed memory" have a bigger impact than actual RAM now.  As a side note, they also mentioned adjusting windows HEAP and stated that they recommend no more than 200 map services per site (max).  Hope this helps someone.

Aaron

PaulDavidson1
Occasional Contributor III

Are you using VMware?  I know in VMware Workstation (Edit > Preferences, before firing off any VM box) there is a setting that tells the VM boxes to: fit all virtual machine memory into reserved host RAM; allow some virtual machine memory to swap; or allow most virtual machine memory to swap.

I assume the VMware ESXi  has a similar setting.  And can ask our systems guys if needed.

It's been shown that in virtual boxes, swapping into virtual space is quite slow and degrades performance.

Locking them into actual RAM is noticeably faster.  Of course, in your case, that is probably not a good idea, because you're locking away RAM that would rarely be used.


The file system you are using matters as well.  E.g., don't use shared folders (the ones you access via VMware Tools).

Dog slow... only use to transfer between guest and host.  I'd doubt any are in use in an ESXi sphere.

I think this is more related to local development VMware Workstation boxes.

We have a server with ~220 services and it has always been problematic, runs right on the edge with 24GB of RAM.  I've put it onto a standard reboot service due to memory leaks in javaw and the pkill process.

I inherited this and I'm looking for ways to drop the # of services.  I was hoping to go to 0 instances, but now I'm gun shy.

You might find the following worth a read:

Comparing Filesystem Performance in Virtual Machines

I hope you continue to post your findings and solutions.

Overcoming your issues with such a huge installation will really help those of us with less complicated but not trivial setups 😉

AaronKreag
Occasional Contributor

Paul,

We run Hyper-V here and the systems guys have it tuned to the max.  It's now been a week and we are still running without an issue on the stability side.  I have found that a weekly reboot, at a minimum, does help.  I have never had very good luck running VMware with my enterprise GIS deployments.

Aaron

PaulDavidson1
Occasional Contributor III

Well, that’s concerning… 

and perplexing given my experience.

What sort of issues have you had with VMware?

Was this with Workstation or ESXi/Sphere?

Was it back with early versions or recent incarnations?

I've been at two enterprise shops that have gone VMware w. NetApp and we've had great success and very few issues.

We've taken a number of serious enterprise servers into the virtual world with maybe an afternoon of downtime.

Run a virtual converter and publish the box and we're done.

Since I'm in the process of designing a new 10.3 server farm, this thread is quite pertinent and I appreciate your responses.