Hi @RyanSutcliffe. I feel your pain with trying to get useful system metrics re. shared instances - I have yet to find a way to do this. In the ArcGIS Monitor presentation at the 2021 UC, I think I remember them talking about the difficulties of getting the ArcSOC Optimizer to work with shared instances for this very reason - no useful metrics to describe how they’re working.
However, re. dedicated instances, have you looked at System Log Parser? This returns a heap of useful information that you can use to tune your services. I'm sure you'd get something from it if you aren't already using it.
Thanks @David_Horsey_old, yes I am currently looking at System Log Parser within our dev environment, especially at what information it relies on and the assumptions it makes. Still need to do some work there. We generally collect and pipe logs/metrics to AWS CloudWatch, so I'm trying to reproduce the analysis it provides in Excel over there instead. I'm also a bit leery of turning up ArcGIS Server log levels to INFO on PROD, but that's probably another topic.
I had a think on this and chatted with an ESRI Canada support person, and I realized that the best way to monitor service instance capacity problems is to watch response timings from both the proxy/web server side and ArcGIS Server's own usage statistics. You look at the time ArcGIS Server spent processing a request for a service and compare that to the total response time reported by the web server. If there is suddenly a growing discrepancy between these two, you likely have a bottleneck waiting for available service instances (arcsoc processes). With shared instances it is a bit trickier to work out what is causing the bottleneck/capacity issue, but otherwise the process is the same.
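To make that comparison concrete, here is a minimal sketch in Python. It assumes you already have the proxy's total response time and ArcGIS Server's reported processing time for the same request (or the same time window); the 500 ms threshold is just an illustrative number, not a recommendation.

```python
# A minimal sketch of the proxy-vs-ArcGIS-Server timing comparison described above.
# proxy_total_ms: total response time measured at the web proxy
# server_elapsed_ms: processing time reported by ArcGIS Server usage statistics
# The 500 ms threshold is an arbitrary example, not a recommended value.

def instance_wait_estimate(proxy_total_ms: float, server_elapsed_ms: float) -> float:
    """Rough estimate of time spent outside the arcsoc (queueing, network, SSL, etc.)."""
    return max(proxy_total_ms - server_elapsed_ms, 0.0)

def looks_like_instance_bottleneck(proxy_total_ms: float, server_elapsed_ms: float,
                                   threshold_ms: float = 500.0) -> bool:
    """Flag requests where the gap between proxy time and server time grows large."""
    return instance_wait_estimate(proxy_total_ms, server_elapsed_ms) > threshold_ms

# Example: ArcGIS Server says it spent 300 ms, but the proxy saw 2.1 s end to end --
# most of the time was spent waiting, likely for a free service instance.
print(looks_like_instance_bottleneck(2100, 300))   # True
```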
In my case, some tricks I've used to test this are:
- set the usage statistics aggregation (sampling) interval to 1-minute detail
- write a Python script that pulls summary statistics for each service every minute (see the sketch after this list)
- use a tool like k6 to write load tests and see the capacity of services under different load scenarios (k6 happens to provide really nice result stats showing the details of request timing: TTFB, download time, response time, etc.)
- add request time information to the access logging for our web proxy (piped to AWS CloudWatch for easy analysis)
- pipe ArcGIS Server's raw usage statistics logs from the machine to AWS CloudWatch as well; they are stored in plain text in .dat_u files
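For the per-minute pull (second bullet above), here's a minimal sketch of the kind of script I mean. It assumes a usage report already exists (created via admin/usagereports/add, as discussed further down); the server URL, report name, credentials, and the "ArcGIS/Usage" CloudWatch namespace are placeholders, and the response shape (time-slices aligned with each report-data series) is my reading of the Administrator API docs, so verify it against your own server's JSON.

```python
# A minimal sketch, not production code: pull a pre-created usage report every minute
# and forward the values to CloudWatch. SERVER_URL, REPORT_NAME, credentials and the
# "ArcGIS/Usage" namespace are placeholders; the JSON shape assumed here reflects my
# reading of the Administrator API docs.
import json
import time
from datetime import datetime, timezone

import boto3
import requests

SERVER_URL = "https://gisserver.example.com:6443/arcgis/admin"   # hypothetical
REPORT_NAME = "per-minute-summary"                               # hypothetical

def get_token(username: str, password: str) -> str:
    r = requests.post(f"{SERVER_URL}/generateToken",
                      data={"username": username, "password": password,
                            "client": "requestip", "f": "json"})
    return r.json()["token"]

def fetch_report(token: str) -> dict:
    r = requests.post(f"{SERVER_URL}/usagereports/{REPORT_NAME}/data",
                      data={"filter": json.dumps({"machines": "*"}),
                            "f": "json", "token": token})
    return r.json()["report"]

def push_to_cloudwatch(report: dict, cw) -> None:
    slices = report.get("time-slices", [])           # epoch milliseconds
    for series in report.get("report-data", [[]])[0]:
        service = series.get("resourceURI", "unknown")
        metric = series.get("metric-type", "unknown")
        for ts, value in zip(slices, series.get("data", [])):
            if value is None:
                continue                             # empty slice, nothing recorded
            cw.put_metric_data(
                Namespace="ArcGIS/Usage",            # hypothetical namespace
                MetricData=[{
                    "MetricName": metric,
                    "Dimensions": [{"Name": "Service", "Value": service}],
                    "Timestamp": datetime.fromtimestamp(ts / 1000.0, tz=timezone.utc),
                    "Value": float(value),
                }])

if __name__ == "__main__":
    cw = boto3.client("cloudwatch")
    token = get_token("siteadmin", "********")       # token expires; refresh as needed
    while True:
        push_to_cloudwatch(fetch_report(token), cw)
        time.sleep(60)                               # matches the 1-minute interval
```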
The last bit, with the .dat_u files, is the final outstanding challenge. We can run a scheduled task to extract usage statistics report data from our ArcGIS Servers and pipe it into our monitoring app (CloudWatch), but that is a bit of a chore. We would prefer to pull the complete raw logs so that we can analyze and summarize them directly with our monitoring tools ourselves, rather than relying on static snapshots from ArcGIS Server.
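For that raw-log approach, the rough sketch below is the sort of thing I mean: treat each .dat_u line as an opaque message and forward it to CloudWatch Logs. The statistics directory path, log group, and stream names are placeholders (and since the format is undocumented, no parsing is attempted); in production the CloudWatch agent is the more robust way to tail files like this.

```python
# A rough sketch: forward raw .dat_u lines to CloudWatch Logs without interpreting them.
# STATS_DIR, LOG_GROUP and LOG_STREAM are placeholders -- point STATS_DIR at your site's
# statistics directory. A real shipper (or the CloudWatch agent) would track file offsets
# instead of re-reading whole files.
import time
from pathlib import Path

import boto3

STATS_DIR = Path(r"C:\arcgisserver\directories")   # placeholder; use your statistics dir
LOG_GROUP = "/gis/arcgis-usage-raw"                # hypothetical
LOG_STREAM = "dat_u"                               # hypothetical

logs = boto3.client("logs")

def ensure_destination() -> None:
    for call, kwargs in ((logs.create_log_group, {"logGroupName": LOG_GROUP}),
                         (logs.create_log_stream, {"logGroupName": LOG_GROUP,
                                                   "logStreamName": LOG_STREAM})):
        try:
            call(**kwargs)
        except logs.exceptions.ResourceAlreadyExistsException:
            pass

def ship_all() -> None:
    now_ms = int(time.time() * 1000)
    events = []
    for dat in STATS_DIR.rglob("*.dat_u"):
        for line in dat.read_text(errors="replace").splitlines():
            if line.strip():
                events.append({"timestamp": now_ms, "message": f"{dat.name} {line}"})
    if events:
        logs.put_log_events(logGroupName=LOG_GROUP, logStreamName=LOG_STREAM,
                            logEvents=events[:10000])   # API batch limit

if __name__ == "__main__":
    ensure_destination()
    ship_all()
```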
But when I asked for details on the structure of the .dat_u files (what each column/field represents), this was the response I was given:
"The data within the -u and -i files was meant to be used by the system and was not designed for user interaction. The recommended alternative solution is to use the Administration API to create requests in the admin/usagereports/add resource. Details are provided in the admin API documentation. Once created, they can use the admin/usagereports/<reportName>/data resource to get the data for their report (# of responses for a service for example). The responses are available in JSON, CSV, and HTML formats. This data can then be used in custom applications."
(source: ENH-000099004 - Provide documentation for the ArcGIS Server data logged in the dat-i and dat-u files in the statistics directories)
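For reference, here's roughly what that recommended route looks like in Python (token acquisition as in the earlier sketch). The endpoints themselves (admin/usagereports/add and admin/usagereports/&lt;reportName&gt;/data) are the ones named in the response above; the report name, service URI, and metric names in the query are my own assumptions.

```python
# A sketch of the workflow Esri suggests: create a usage report with
# admin/usagereports/add, then read it back from admin/usagereports/<reportName>/data.
# Report name, service URI and metric names below are assumptions; check the
# Administrator API reference for the supported metrics and report options.
import json
import requests

ADMIN = "https://gisserver.example.com:6443/arcgis/admin"   # hypothetical

def create_report(token: str) -> dict:
    usagereport = {
        "reportname": "instance-tuning",                    # hypothetical
        "since": "LAST_DAY",
        "queries": [{
            "resourceURIs": ["services/Roads.MapServer"],   # hypothetical service
            "metrics": ["RequestCount", "RequestsFailed",
                        "RequestMaxResponseTime", "RequestAvgResponseTime"],
        }],
    }
    r = requests.post(f"{ADMIN}/usagereports/add",
                      data={"usagereport": json.dumps(usagereport),
                            "f": "json", "token": token})
    return r.json()

def read_report_csv(token: str) -> str:
    # The quote above notes JSON, CSV and HTML outputs; CSV is handy for spreadsheets.
    r = requests.post(f"{ADMIN}/usagereports/instance-tuning/data",
                      data={"filter": json.dumps({"machines": "*"}),
                            "f": "csv", "token": token})
    return r.text
```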
I'm still waiting to get a bit more information to understand this better. Does ESRI not want us reading the .dat_u files because there is some stability risk, or because they are not written as simple log files? Is it that they don't want to provide any documentation on their structure because it is subject to change, or is there something else going on?
I think it's pretty easy to work out which of the fields is max/avg response time and which is request count, but the other values are less clear so far.
I've been using these anyway, and so far, by piping them along with our proxy web server's response time stats into AWS CloudWatch, we've been able to watch for divergences between the response times reported by ArcGIS Server and by our web server for our services -- even setting alarms to warn us when they diverge beyond normal thresholds.
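For the alarm piece, a CloudWatch metric-math alarm on the gap is one way to do it. Everything named below (namespaces, metric names, dimensions, the SNS topic, the 1000 ms threshold) is a placeholder for whatever your own pipeline actually publishes.

```python
# A sketch of a CloudWatch metric-math alarm on the proxy-vs-server gap. Namespaces,
# metric names, dimensions, the SNS topic, and the 1000 ms threshold are all assumed
# placeholders for whatever your own pipeline publishes.
import boto3

cw = boto3.client("cloudwatch")
cw.put_metric_alarm(
    AlarmName="roads-mapserver-instance-wait",                       # hypothetical
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=3,
    Threshold=1000.0,                                                # ms, example only
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:gis-alerts"],  # hypothetical
    Metrics=[
        {"Id": "proxy",
         "MetricStat": {"Metric": {"Namespace": "GIS/Proxy",
                                   "MetricName": "ResponseTime",
                                   "Dimensions": [{"Name": "Service",
                                                   "Value": "Roads.MapServer"}]},
                        "Period": 60, "Stat": "Average"},
         "ReturnData": False},
        {"Id": "server",
         "MetricStat": {"Metric": {"Namespace": "ArcGIS/Usage",
                                   "MetricName": "RequestAvgResponseTime",
                                   "Dimensions": [{"Name": "Service",
                                                   "Value": "Roads.MapServer"}]},
                        "Period": 60, "Stat": "Average"},
         "ReturnData": False},
        {"Id": "gap",
         "Expression": "proxy - server",
         "Label": "Estimated instance wait (ms)",
         "ReturnData": True},
    ],
)
```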
Will post back here with any info I hear about why ESRI doesn't want us reading the .dat_u files. Also hoping someone from ESRI Inc. might chime in on this thread with more details/background.
Overdue followup: ESRI did respond with confirmation of **most** of the fields (some even they were unsure about) and suggested that there is no harm in using the .dat files this way, with the usual disclaimer that things could change in any future release without notice.