Ever-slowing Service Response Times

GregCarlino2 · 10-25-2022

Greetings Enterprise Community,

Our organization's ArcGIS Enterprise system has a recurring issue with creeping service response times. After about two weeks of use, the web services start experiencing intermittent but increasingly frequent freeze-ups, as does ArcGIS Server Manager. Web maps and services will be fine one minute, completely frozen the next, and then fine again. These 'frozen' intervals keep lengthening over the next 3-4 days until the system becomes unusable and a server reboot is our only option.

After the reboot, services immediately start running quickly and consistently again... for another two weeks or so. This problem has persisted through several Enterprise version upgrades.

We are currently running Enterprise 10.9. We have also migrated our geodatabase (SQL Server 2016, 64-bit) onto its own machine, so that it does not compete with ArcGIS Server for system resources. I've also switched all viable services to shared instances (currently running about 40 ArcSOC processes). Portal for ArcGIS is also on its own machine. Our Server machine meets or exceeds all recommended system requirements (CPU @ 3.4 GHz, 4 processors; 50 GB RAM; 400 GB disk space). We've had several ongoing Esri helpdesk tickets, but are still unable to ascertain a cause.

The Server Manager statistics graphs consistently show this creeping response time (the discrepancy between 'Max' and 'Avg' response times reflects the intermittent nature of the freezing). I've also noticed that the OpenJDK Platform Binary process starts at about 470 MB of memory immediately after reboot and balloons to about 3.5 GB within two weeks. Is this expected behavior? Has anyone else encountered this and been able to determine a cause?
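To put a number on the memory growth described above, here is a minimal Python sketch that fits a straight line through logged memory readings. The function name `growth_rate_mb_per_day`, the sample dates, and the reading of 3.5 GB as 3500 MB are assumptions made for illustration; only the two memory figures come from the post.

```python
from datetime import datetime


def growth_rate_mb_per_day(samples):
    """Least-squares slope of memory (MB) against elapsed time (days).

    samples: list of (datetime, memory_mb) tuples covering at least
    two distinct points in time.
    """
    t0 = samples[0][0]
    xs = [(t - t0).total_seconds() / 86400 for t, _ in samples]
    ys = [mb for _, mb in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need samples at two or more distinct times")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# The two readings from the post; the dates themselves are hypothetical.
samples = [
    (datetime(2022, 10, 1), 470.0),    # just after reboot
    (datetime(2022, 10, 15), 3500.0),  # two weeks later (3.5 GB taken as 3500 MB)
]
rate = growth_rate_mb_per_day(samples)
print(f"{rate:.0f} MB/day")
```

With more frequent readings (for example, one per hour copied from Task Manager or a performance counter log), the same function would show whether the growth is steady or happens in bursts.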

We are beginning to consider a complete Enterprise reinstall from scratch, at great expense, but are obviously hoping to avoid this, if possible. Any insight would be greatly appreciated! Thank you.

Scott_Tansley · 10-25-2022

What types of services have you published, and how many of each?

It sounds like you're using one server for the hosting role and as a general-purpose mapping server, but have you got other services like imagery, print, geoprocessing, and network analysis on there?

The competing workloads of each of those can have the sort of impact that you're discussing here.

Scott Tansley
https://www.linkedin.com/in/scotttansley/

GregCarlino2 · 10-26-2022

Hi @Scott_Tansley,

Yes, we only have one server, with a pretty good mix of services. Specifically: 44 map services (37 reference SQL Server data, 7 are aerials stored directly on the machine), 6 feature layers (Data Store), 26 geoprocessing services, 1 geometry service, and 3 image services.

Scott_Tansley · 10-26-2022

It sounds to me like you may have some contention between services, whereby they're fighting for resources. If you imagine that each CPU is like a production line, then a request comes into the server and it fires a request off to the data source. The process is 'sort of' locked until the data comes back, then the map can be drawn and the request fulfilled. The response then goes back to the user.

Map services and hosted feature services tend to be pretty snappy and cause few problems. However, if you put imagery or geoprocessing services on the same machine, their requests may take a long time to 'fulfill', widening the gap between request and response. Other requests then 'queue up' behind these big requests, which increases the wait times.
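The queuing effect described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: a fixed pool of interchangeable workers serving requests strictly first-come, first-served. It is not a model of how ArcGIS Server actually schedules ArcSOC instances, and the function name `simulate_waits` and all timings are made up for illustration.

```python
import heapq


def simulate_waits(requests, workers):
    """Return the queue wait (ms) of each request under FIFO scheduling.

    requests: list of (arrival_ms, service_ms), sorted by arrival time.
    workers:  number of requests that can be processed at the same time.
    """
    # Min-heap holding the time at which each worker next becomes free.
    free_at = [0] * workers
    heapq.heapify(free_at)
    waits = []
    for arrival, service in requests:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service)
    return waits


# Hypothetical load: one fast (200 ms) request every 100 ms for 5 seconds.
fast = [(i * 100, 200) for i in range(50)]

# Same load plus three slow (30 s) requests arriving at t = 1 s,
# standing in for heavy imagery or geoprocessing calls.
slow = [(1000, 30000)] * 3
mixed = sorted(fast + slow)

fast_only_waits = simulate_waits(fast, workers=4)
mixed_waits = simulate_waits(mixed, workers=4)

print("max wait, fast only:", max(fast_only_waits), "ms")
print("max wait, mixed:    ", max(mixed_waits), "ms")
```

With four workers, the fast requests alone never wait. Once three workers are tied up by slow requests, the remaining worker cannot keep pace with the arrivals, so each new fast request waits longer than the one before it.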

Actually, the issue is sort of discussed here. Most of my clients have dedicated hosting and general-purpose servers. The use of imagery services is minimized because image data is 'BIG', and if they need lots of it then they have a third server. Geoprocessing can be big or small, depending on what it is. Some GP services take milliseconds; others can take hours.

My guess is that you may be asking too much of a 4-core machine. The memory and disk sound more than reasonable, but the small number of cores may be somewhat limiting. Separating out the imagery/GP services may be a good call. There are multiple monitoring tools, like ArcGIS Monitor, that would give you the insight needed to see what is happening at a more granular level and to plan your deployment appropriately.

Scott Tansley
https://www.linkedin.com/in/scotttansley/

GregCarlino2 · 11-04-2022

Hi Scott,

I understand what you're saying about services queuing up behind big requests, but does that explain why the delays are seemingly cumulative? Once a big request is fulfilled, shouldn't subsequent requests resume a 'normal' speed? CPU utilization can vary greatly but, for the most part, hangs out around 4-5%, even when response times are lagging.

ArcGIS Monitor seems very useful (and we apparently already paid for the license), so I will likely get it implemented as soon as I can.

One other question: Have you noticed significant improvements in clients' system performance after splitting workload into separate Hosting and General-purpose servers?

Thank you,

Greg