Hi @RyanSutcliffe. I feel your pain with trying to get useful system metrics re. shared instances - I have yet to find a way to do this. In the ArcGIS Monitor presentation at the 2021 UC, I think I remember them talking about the difficulties of getting the ArcSOC Optimizer to work with shared instances for this very reason - no useful metrics to describe how they’re working.
However, re. dedicated instances, have you looked at System Log Parser? This returns a heap of useful information that you can use to tune your services. I'm sure you'd get something from it if you aren't already using it.
Thanks @David_Horsey_old, yes I am currently looking at System Log Parser within our dev environment, especially at what information it relies on and the assumptions it makes. Still need to do some work there. We generally collect and pipe logs/metrics to AWS CloudWatch, so I'm trying to reproduce the analysis it provides in Excel over there instead. I'm also a bit leery of turning up ArcGIS Server log levels to INFO on PROD, but that's probably another topic.
I had a think on this and chatted with an ESRI Canada support person, and I realized that the best way to monitor service instance capacity problems is to watch response timings from both the proxy/web server side and ArcGIS Server's own usage statistics. You look at the time ArcGIS Server spent processing a request for a service and compare that to the total response time reported by the web server. If there is suddenly a growing discrepancy between these two, you likely have a bottleneck waiting for available service instances (arcsoc processes). With shared instances it is a bit trickier to work out what is causing the bottleneck/capacity issue, but otherwise the process is the same.
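To make that comparison concrete, here is a minimal sketch in Python. It assumes you already have the proxy's total response time and ArcGIS Server's reported processing time for the same request (or the same time window); the 500 ms threshold is just an illustrative number, not a recommendation.

```python
# A minimal sketch of the proxy-vs-ArcGIS-Server timing comparison described above.
# proxy_total_ms: total response time measured at the web proxy
# server_elapsed_ms: processing time reported by ArcGIS Server usage statistics
# The 500 ms threshold is an arbitrary example, not a recommended value.

def instance_wait_estimate(proxy_total_ms: float, server_elapsed_ms: float) -> float:
    """Rough estimate of time spent outside the arcsoc (queueing, network, SSL, etc.)."""
    return max(proxy_total_ms - server_elapsed_ms, 0.0)

def looks_like_instance_bottleneck(proxy_total_ms: float, server_elapsed_ms: float,
                                   threshold_ms: float = 500.0) -> bool:
    """Flag requests where the gap between proxy time and server time grows large."""
    return instance_wait_estimate(proxy_total_ms, server_elapsed_ms) > threshold_ms

# Example: ArcGIS Server says it spent 300 ms, but the proxy saw 2.1 s end to end --
# most of the time was spent waiting, likely for a free service instance.
print(looks_like_instance_bottleneck(2100, 300))   # True
```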
In my case, some tricks I've used to test this are:
- set the usage statistics aggregation (sampling) interval to 1-minute detail
- write a Python script that pulls summary statistics for each service every minute (see the sketch after this list)
- use a tool like k6 to write load tests and see the capacity of services under different load scenarios (k6 happens to provide really nice result stats showing the details of request timing: TTFB, download time, response time, etc.)
- add request time information to the access logging for our web proxy (piped to AWS CloudWatch for easy analysis)
- pipe ArcGIS Server's raw usage statistics logs from the machine to AWS CloudWatch as well; they are stored in plain text in .dat_u files
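For the per-minute pull (second bullet above), here's a minimal sketch of the kind of script I mean. It assumes a usage report already exists (created via admin/usagereports/add, as discussed further down); the server URL, report name, credentials, and the "ArcGIS/Usage" CloudWatch namespace are placeholders, and the response shape (time-slices aligned with each report-data series) is my reading of the Administrator API docs, so verify it against your own server's JSON.

```python
# A minimal sketch, not production code: pull a pre-created usage report every minute
# and forward the values to CloudWatch. SERVER_URL, REPORT_NAME, credentials and the
# "ArcGIS/Usage" namespace are placeholders; the JSON shape assumed here reflects my
# reading of the Administrator API docs.
import json
import time
from datetime import datetime, timezone

import boto3
import requests

SERVER_URL = "https://gisserver.example.com:6443/arcgis/admin"   # hypothetical
REPORT_NAME = "per-minute-summary"                               # hypothetical

def get_token(username: str, password: str) -> str:
    r = requests.post(f"{SERVER_URL}/generateToken",
                      data={"username": username, "password": password,
                            "client": "requestip", "f": "json"})
    return r.json()["token"]

def fetch_report(token: str) -> dict:
    r = requests.post(f"{SERVER_URL}/usagereports/{REPORT_NAME}/data",
                      data={"filter": json.dumps({"machines": "*"}),
                            "f": "json", "token": token})
    return r.json()["report"]

def push_to_cloudwatch(report: dict, cw) -> None:
    slices = report.get("time-slices", [])           # epoch milliseconds
    for series in report.get("report-data", [[]])[0]:
        service = series.get("resourceURI", "unknown")
        metric = series.get("metric-type", "unknown")
        for ts, value in zip(slices, series.get("data", [])):
            if value is None:
                continue                             # empty slice, nothing recorded
            cw.put_metric_data(
                Namespace="ArcGIS/Usage",            # hypothetical namespace
                MetricData=[{
                    "MetricName": metric,
                    "Dimensions": [{"Name": "Service", "Value": service}],
                    "Timestamp": datetime.fromtimestamp(ts / 1000.0, tz=timezone.utc),
                    "Value": float(value),
                }])

if __name__ == "__main__":
    cw = boto3.client("cloudwatch")
    token = get_token("siteadmin", "********")       # token expires; refresh as needed
    while True:
        push_to_cloudwatch(fetch_report(token), cw)
        time.sleep(60)                               # matches the 1-minute interval
```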
The last bit, with the .dat_u files, is the final outstanding challenge. We can run a scheduled task to extract usage statistics report data from our ArcGIS Servers and pipe it into our monitoring app (CloudWatch), but that is a bit of a chore. We would prefer to pull the complete raw logs so that we can analyze and summarize them directly with our monitoring tools ourselves, rather than relying on static snapshots from ArcGIS Server.
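For that raw-log approach, the rough sketch below is the sort of thing I mean: treat each .dat_u line as an opaque message and forward it to CloudWatch Logs. The statistics directory path, log group, and stream names are placeholders (and since the format is undocumented, no parsing is attempted); in production the CloudWatch agent is the more robust way to tail files like this.

```python
# A rough sketch: forward raw .dat_u lines to CloudWatch Logs without interpreting them.
# STATS_DIR, LOG_GROUP and LOG_STREAM are placeholders -- point STATS_DIR at your site's
# statistics directory. A real shipper (or the CloudWatch agent) would track file offsets
# instead of re-reading whole files.
import time
from pathlib import Path

import boto3

STATS_DIR = Path(r"C:\arcgisserver\directories")   # placeholder; use your statistics dir
LOG_GROUP = "/gis/arcgis-usage-raw"                # hypothetical
LOG_STREAM = "dat_u"                               # hypothetical

logs = boto3.client("logs")

def ensure_destination() -> None:
    for call, kwargs in ((logs.create_log_group, {"logGroupName": LOG_GROUP}),
                         (logs.create_log_stream, {"logGroupName": LOG_GROUP,
                                                   "logStreamName": LOG_STREAM})):
        try:
            call(**kwargs)
        except logs.exceptions.ResourceAlreadyExistsException:
            pass

def ship_all() -> None:
    now_ms = int(time.time() * 1000)
    events = []
    for dat in STATS_DIR.rglob("*.dat_u"):
        for line in dat.read_text(errors="replace").splitlines():
            if line.strip():
                events.append({"timestamp": now_ms, "message": f"{dat.name} {line}"})
    if events:
        logs.put_log_events(logGroupName=LOG_GROUP, logStreamName=LOG_STREAM,
                            logEvents=events[:10000])   # API batch limit

if __name__ == "__main__":
    ensure_destination()
    ship_all()
```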
But when I asked for details on the structure of the .dat_u files (what each column/field represents), this was the response I was given:
"The data within the -u and -i files was meant to be used by the system and was not designed for user interaction. The recommended alternative solution is to use the Administration API to create requests in the admin/usagereports/add resource. Details are provided in the admin API documentation. Once created, they can use the admin/usagereports/<reportName>/data resource to get the data for their report (# of responses for a service for example). The responses are available in JSON, CSV, and HTML formats. This data can then be used in custom applications."
(source: ENH-000099004 - Provide documentation for the ArcGIS Server data logged in the dat-i and dat-u files in the statistics directories)
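For reference, here's roughly what that recommended route looks like in Python (token acquisition as in the earlier sketch). The endpoints themselves (admin/usagereports/add and admin/usagereports/&lt;reportName&gt;/data) are the ones named in the response above; the report name, service URI, and metric names in the query are my own assumptions.

```python
# A sketch of the workflow Esri suggests: create a usage report with
# admin/usagereports/add, then read it back from admin/usagereports/<reportName>/data.
# Report name, service URI and metric names below are assumptions; check the
# Administrator API reference for the supported metrics and report options.
import json
import requests

ADMIN = "https://gisserver.example.com:6443/arcgis/admin"   # hypothetical

def create_report(token: str) -> dict:
    usagereport = {
        "reportname": "instance-tuning",                    # hypothetical
        "since": "LAST_DAY",
        "queries": [{
            "resourceURIs": ["services/Roads.MapServer"],   # hypothetical service
            "metrics": ["RequestCount", "RequestsFailed",
                        "RequestMaxResponseTime", "RequestAvgResponseTime"],
        }],
    }
    r = requests.post(f"{ADMIN}/usagereports/add",
                      data={"usagereport": json.dumps(usagereport),
                            "f": "json", "token": token})
    return r.json()

def read_report_csv(token: str) -> str:
    # The quote above notes JSON, CSV and HTML outputs; CSV is handy for spreadsheets.
    r = requests.post(f"{ADMIN}/usagereports/instance-tuning/data",
                      data={"filter": json.dumps({"machines": "*"}),
                            "f": "csv", "token": token})
    return r.text
```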
I'm still waiting to get a bit more information to understand this better. Does ESRI not want us reading the .dat_u files because there is some stability risk, or because they are not written as simple log files? Is it that they don't want to provide any documentation on their structure because it is subject to change, or is there something else going on?
I think it's pretty easy to work out which of the fields is max/avg response time and which is request count, but the other values are less clear so far.
I've been using these anyway, and so far, by piping them along with our proxy web server's response time stats into AWS CloudWatch, we've been able to watch for divergences between the response times reported by ArcGIS Server and by our web server for our services -- even setting alarms to warn us when they diverge beyond normal thresholds.
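For the alarm piece, a CloudWatch metric-math alarm on the gap is one way to do it. Everything named below (namespaces, metric names, dimensions, the SNS topic, the 1000 ms threshold) is a placeholder for whatever your own pipeline actually publishes.

```python
# A sketch of a CloudWatch metric-math alarm on the proxy-vs-server gap. Namespaces,
# metric names, dimensions, the SNS topic, and the 1000 ms threshold are all assumed
# placeholders for whatever your own pipeline publishes.
import boto3

cw = boto3.client("cloudwatch")
cw.put_metric_alarm(
    AlarmName="roads-mapserver-instance-wait",                       # hypothetical
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=3,
    Threshold=1000.0,                                                # ms, example only
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:gis-alerts"],  # hypothetical
    Metrics=[
        {"Id": "proxy",
         "MetricStat": {"Metric": {"Namespace": "GIS/Proxy",
                                   "MetricName": "ResponseTime",
                                   "Dimensions": [{"Name": "Service",
                                                   "Value": "Roads.MapServer"}]},
                        "Period": 60, "Stat": "Average"},
         "ReturnData": False},
        {"Id": "server",
         "MetricStat": {"Metric": {"Namespace": "ArcGIS/Usage",
                                   "MetricName": "RequestAvgResponseTime",
                                   "Dimensions": [{"Name": "Service",
                                                   "Value": "Roads.MapServer"}]},
                        "Period": 60, "Stat": "Average"},
         "ReturnData": False},
        {"Id": "gap",
         "Expression": "proxy - server",
         "Label": "Estimated instance wait (ms)",
         "ReturnData": True},
    ],
)
```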
Will post back here with any info I hear about why ESRI doesn't want us reading the .dat_u files. Also hoping someone from ESRI Inc. might chime in on this thread with more details/background.
Overdue followup: ESRI did respond with confirmation of **most** of the fields (some even they were unsure about) and suggested that there is no harm in using the .dat files this way, with the usual disclaimer that things could change in any future release without notice.