tad.hammer_Greeley

Server Sizing for a "large" number of services

Discussion created by tad.hammer_Greeley on Apr 29, 2020

A discussion similar to this one is here:  Related discussion

 

I started this discussion in the hope of both getting and giving help to those who hit a similar issue, which I suspect can manifest itself in different ways: the errors in the related article were slightly different, with different service and XML parse issues occurring on different tools. What follows came out of a 3.5+ hour tech support call and research through a lot of GeoNet posts, documentation and other supporting information. Sometimes I find the problem isn't a lack of shared knowledge so much as a lack of hashtag usage or topic indexing, so hopefully I've entered enough tags to help this reach more people.

 

I have a new implementation of Enterprise GIS 10.7.1, upgraded from 10.3.1, running on Windows Server 2016. Both servers in our ArcServer site are configured with 4 CPUs and 64 GB of RAM.

 

After publishing about 300 services through a Python script against ArcMap 10.7.1 .MXD documents, the server would refuse to publish any more, or would get to various stages of publishing (it wasn't consistent between SDDraft creation, staging or uploading), throw some error(s) to the log (see below) and then hang.  I thought my server was just slow, so I waited.. and waited.. and waited.. but a system running over capacity doesn't appear to raise an error and stop the processing.
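Because the failure moved around between stages, it helps to record which stage failed for each document. Below is a minimal sketch of that idea, not a full publishing script. `publish_all`, `summarize` and the stage callables are hypothetical names; in an ArcMap 10.x script the three callables would presumably wrap `arcpy.mapping.CreateMapSDDraft`, `arcpy.StageService_server` and `arcpy.UploadServiceDefinition_server`.

```python
from collections import Counter

# The three publishing stages, in the order they run.
STAGES = ("sddraft", "stage", "upload")


def publish_all(mxd_paths, stage_funcs):
    """Run each document through the stages in order.

    stage_funcs maps a stage name to a callable that takes the .mxd path
    (hypothetical wrappers around the arcpy calls). Returns a dict of
    path -> "ok", or the name of the first stage that raised.
    """
    results = {}
    for path in mxd_paths:
        results[path] = "ok"
        for stage in STAGES:
            try:
                stage_funcs[stage](path)
            except Exception as exc:
                print("{}: failed at {}: {}".format(path, stage, exc))
                results[path] = stage
                break  # skip the remaining stages for this document
    return results


def summarize(results):
    """Count how many documents ended in each outcome."""
    return Counter(results.values())
```

This only catches stages that raise an error. It will not rescue a stage that hangs, which is what the overloaded server did, but the per-stage tally shows quickly whether failures cluster in one stage or wander between them.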

 

This is one of the errors it threw:

The server was unable to create a GIS service instance. The exact cause is unknown. Please contact Esri Support for further assistance and provide the following information. 0x80040111 - ClassFactory cannot supply requested class at com.esri.arcgis.gisclient.AGSServerConnectionFactory.open(Unknown Source) at com.esri.arcgis.discovery.servicelib.impl.ConfigFactoryImpl$e.run(ConfigFactoryImpl$e.java:546)

 

Also, intermittently, the server would refuse to start the Publishing Tools (system services).

 

Core server call to create service failed (/admin/createService).

InitToolbox Failed

The toolbox did not load: C:\Program Files\ArcGIS\Server\ArcToolBox\Services\Publishing Tools.tbx

 

 

I also got the following errors, which weren't consistent between publishing attempts either, but did recur frequently in my troubleshooting efforts:

Failed to construct instance of service 'Data_Services_TEST/test3.MapServer'.

Failed to initialize server object 'Data_Services_TEST/test3': 0x80043007:

 

Unable to instantiate class for xml schema type: CIMDEMap

Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat

 

Unable to instantiate class for xml schema type: CIMGISProject

Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat

 

Unable to instantiate class for xml schema type: CIMDocumentInfo

Invalid xml registry file: c:\program files\arcgis\server\bin\XmlSupport.dat

 

Failed to construct instance of service 'Data_Services_TEST/test3.MapServer'. XML Parser initialization failed.

Failed to initialize server object 'Data_Services_TEST/test3': 0x80040111: XML Parser initialization failed.

 

Failed to construct instance of service 'System/PublishingTools.GPServer'. The toolbox did not load: C:\Program Files\ArcGIS\Server\ArcToolBox\Services\Publishing Tools.tbx

Failed to initialize server object 'System/PublishingTools': 0x80040111: The toolbox did not load: C:\Program Files\ArcGIS\Server\ArcToolBox\Services\Publishing Tools.tbx

 

If you research XmlSupport.dat, you find an article (among a couple of others) from version 9.3 about that particular file being corrupt.  I compared the file downloaded from the article against my 10.7.1 version.  The first thing I noticed was that my file had not been modified in more than a year, so I made a quick assumption that, since nothing had changed recently, the file had not been corrupted by something I had done (I'm pretty sure it could be corrupt in some cases, but I couldn't think of any off the top of my head).  I opened the file for editing and the content looked OK, with nothing garbled (this is a very readable file).  When new functionality is added to ArcMap/Server, information about the new services/functionality is stored here with references to esri.com (I haven't researched what all this entails), so the old and new files are vastly different.  I decided to keep looking.

 

Also, at one point, I received: ERROR 001369: Failed to create the service.  While this is noted as a generic error, it was an indicator that only part of the publishing process could complete.  Apparently, in my case, the server reached its maximum processing capacity and just stopped.

 

Some history on our upgrade:  one of the reasons we decided to "re-architect" our structure, upgrade to 10.7.1 and add a second server to the ArcServer site was to avoid these issues, since we had hit the same "halt" in publishing (with different errors) back in ArcServer 10.3.1.  It only occurred to me to check the number of services I had built when I saw the server consistently running at >95% capacity in Task Manager.  From the related article (top of this discussion), you can read about some of the things that Windows Server will do to "assist" you when you hit the wall (memory swapping, for instance), but these activities still don't prevent the server from becoming "unstable" when run at that capacity.

 

Something important I learned:  adding servers to a site only spreads out the processing (the work is distributed between the machines in the Server Site AFTER the service gets published), aka horizontal scaling or scaling out.  It will NOT scale the system vertically (up) to allow more services to exist on your system at a given time.  I found this Esri User Conference PowerPoint presentation helpful in understanding scaling and tuning more clearly.  I'm sure it's pretty close to the regular Esri documentation, but sometimes things presented pictorially work better for me.
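To make the scale-out point concrete, here is a rough back-of-the-envelope sketch. It assumes dedicated instances and a default single-cluster site, where every machine runs every started service and so starts at least each service's minimum instance count as ArcSOC processes. The function name and the sample numbers (300 services, minimum of 1 instance each) are made up for illustration.

```python
def min_socs_per_machine(services):
    """Sum the minimum dedicated instances over all started services."""
    return sum(s["min_instances"] for s in services if s["started"])


# Illustrative inventory: 300 started services, one minimum instance each.
services = [
    {"name": "svc{}".format(i), "min_instances": 1, "started": True}
    for i in range(300)
]

for machines in (1, 2):
    per_machine = min_socs_per_machine(services)
    print("{} machine(s): {} per machine, {} site-wide".format(
        machines, per_machine, per_machine * machines))
# 1 machine(s): 300 per machine, 300 site-wide
# 2 machine(s): 300 per machine, 600 site-wide
```

Under those assumptions the idle load on each machine is the same with one machine or two, which is why adding the second server did not raise the ceiling on how many services could exist.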

 

I'm still working on how we want to resolve this, but I know that reducing the number of active services (by deleting or stopping them) will help alleviate the issue.  The problem in troubleshooting?  Since all server (hardware) configurations are different, it's hard to say when ArcServer will hit the wall.  You must monitor the server and determine the need for more capacity (memory or CPU) or a different configuration (if possible).  I used Task Manager to see how "hot" the server was running, and I'm sure more whiz-bang System Admin tools will work too.  Historically (I've been around a while), when servers run at or above 85% of capacity, you start noticing unusual behavior from software running on the server, and apparently that's still about the same.  Meaning, if the server software is using more than 80% of the hardware's CPU or memory, it's time to investigate increasing capacity (usually through a request to your organization's System Admin), adjusting the configuration, or both.
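The rule of thumb above can be turned into a simple check over utilization samples, however they are collected (Task Manager readings, Performance Monitor exports, and so on). This is only a sketch: `sustained_over` is a hypothetical name, the 80% default comes from the rule of thumb, and the requirement of five consecutive samples is an arbitrary assumption to avoid reacting to short spikes.

```python
def sustained_over(samples, threshold=80.0, window=5):
    """Return True if `window` consecutive samples exceed `threshold` percent."""
    run = 0  # length of the current streak of samples above the threshold
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= window:
            return True
    return False


# CPU percentages sampled at a fixed interval (illustrative values).
cpu = [72, 85, 91, 96, 97, 95, 88]
if sustained_over(cpu):
    print("CPU has been above 80% for 5 samples in a row; time to investigate")
```

The same function works for memory percentages. A single dip below the threshold resets the streak, so brief peaks during publishing do not trigger it.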

 

I'll be requesting additional CPU and memory, but I would like a more software-controlled (ArcServer configuration) approach that would allow me to mitigate the issue.  I thought I had found such a configuration in the expanded functionality of ArcServer 10.7: shared (vs. dedicated) service instances, which directly affect the number of server processes (ArcSOC) that ArcServer spawns at system startup.  Unfortunately, we are still using ArcMap, and as the documentation for the shared/dedicated pools points out (Configure shared Services):

 

            Services published from ArcMap cannot use shared instances

 

This was a bummer.  We'll have to upgrade to ArcPro in order to use this functionality, but even then there are restrictions (see the documentation).  Unfortunately, our user base is not ready for this.

 

I'm currently reviewing the tuning and configuration documentation to learn what I can do to tune the server to allow for more services to run before having to increase server capacity.   I'll update this discussion when I can resolve my issue without further system scaling.  Any suggestions for how to do this would be appreciated!!!
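In the meantime, the stopgap of stopping services can be scripted against the ArcGIS Server Administrator REST API, whose stop operation is a POST to `.../admin/services/<folder>/<name>.<type>/stop`. The sketch below (Python 3) only builds the request; `build_stop_request` is a hypothetical helper, the host name is a placeholder, and it assumes you have already obtained an admin token.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_stop_request(admin_url, service, token):
    """Build the POST request that stops one service.

    admin_url: e.g. "https://gis.example.com:6443/arcgis/admin" (placeholder)
    service:   "Folder/Name.MapServer", or "Name.MapServer" in the root folder
    token:     an admin token obtained beforehand (assumed)
    """
    url = "{}/services/{}/stop".format(admin_url.rstrip("/"), quote(service))
    data = urlencode({"f": "json", "token": token}).encode("ascii")
    return Request(url, data=data)  # supplying data makes this a POST


req = build_stop_request(
    "https://gis.example.com:6443/arcgis/admin",
    "Data_Services_TEST/test3.MapServer",
    "<token>",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req), then check the JSON reply.
```

Looping this over a list of rarely used services is one way to bring the number of running services down before a batch publish, then start them again afterwards with the matching `start` operation.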

 

Here are the links to the doc I received from Tech support after I pointed out that reducing the number of running services allowed me to publish again:

 

 

If anyone sees any flaws in the discussion, I'd be happy to learn more about them and update it accordingly. 

Learning is good
