We are experiencing service interruptions in ArcGIS Enterprise 10.6.1 (Windows Server 2016, with the geodatabase on an AWS RDS PostgreSQL 9.6 instance). The issue occurs randomly every day at different times: suddenly we cannot access Portal Home, ArcGIS Server Manager or the services (on either port 7443 or 6443), and requests time out. If we restart the ArcGIS Server Windows service, everything starts working again.
We don’t see any related errors in the Server logs or Event Viewer, and hardware resources look fine at the time of failure. ArcGIS Enterprise is deployed on an r5.xlarge instance (4 vCPU and 32 GB RAM); RAM usage is around 40% and we sometimes get sporadic CPU spikes. We scaled up to an r5.xlarge instance (8 vCPU and 32 GB RAM), but the issue is still happening.
Looking at the ArcGIS Server statistics, we saw that some services are heavily used; in particular, one feature service receives more than 30,000 requests on some days.
This service is used for editing (online and offline) by 17 users through Collector for ArcGIS, so we increased its min and max instances, and those of System/SyncTools as well, but we are still having downtime. The problem appears to happen when this kind of feature service is heavily used.
We would really appreciate any help on this matter.
Right now there are 56 instances for 30 services, and when the system is under heavy load probably around 70-75, as we increased the max instances of the most-used services. RAM is around 50% and CPU 10-20%, with some spikes above 60-70%. It appears that, for some reason, when those feature services are used (online/offline editing), Portal and Server become unavailable for 3-4 minutes or even longer; only restarting the ArcGIS Server Windows service brings everything back to life.
If I understand correctly, the feature service being edited in Collector is registered with your PostgreSQL 9.6 database. Have you taken a look at resource usage on the database machine?
When the feature service is under load, is it using all the available instances? If you have the option, I would also recommend testing with a hosted feature service, as this may help confirm or eliminate any database-related issues.
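One way to answer the question about instance usage is to poll the service's statistics resource in the ArcGIS Server Administrator API while the service is under load. The sketch below is only that: the host, service path and token are hypothetical placeholders, and the `summary` fields (`max`, `busy`, `free`) are my assumption about the response shape, so please check them against what your own server returns.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def statistics_url(admin_root, service_path, token):
    """Build the Admin API URL for one service's statistics resource."""
    query = urlencode({"f": "json", "token": token})
    return f"{admin_root}/services/{service_path}/statistics?{query}"


def saturation(stats):
    """Return the fraction of the service's maximum instances that are busy."""
    summary = stats["summary"]  # assumed response shape, verify on your server
    if summary["max"] == 0:
        return 0.0
    return summary["busy"] / summary["max"]


def fetch_statistics(url, timeout=10):
    """Download and decode the statistics JSON.

    Needs network access; a self-signed certificate on port 6443 would
    require extra SSL handling that is not shown here.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Canned response in the assumed shape, so the sketch runs offline.
    sample = {"summary": {"max": 4, "busy": 4, "free": 0}}
    print(f"busy fraction: {saturation(sample):.2f}")
```

A busy fraction that stays at or near 1.0 during an outage would suggest the service is running out of instances; a low value would point somewhere else.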
The geodatabase is implemented on an AWS PostgreSQL RDS instance, and monitoring shows that CPU is never above 15%, with enough RAM. This instance is a db.m4.large (2 vCPU and 8 GB RAM), while the ArcGIS Enterprise instance is an r4.xlarge (4 vCPU and 32 GB RAM). Max instances of the heavily used services are set to 4, and within a short time 4 ArcSOC.exe processes are running, so maybe there is a bottleneck if the RDS instance has only 2 vCPU. The first thing I will try is to increase the RDS instance to 4 vCPU.
In any case, do you think it is normal that one or two services can affect the availability of the whole system even when hardware resources are OK?
One more thing: I was also wondering whether setting the cacheControlMaxAge property on those feature services could help.
Thanks again for your help.
In case it helps, the users edit the feature services through Collector for ArcGIS (Android) and also upload photos quite often (image resolution isn’t at maximum). It seems this kind of request could be the root of the issue.
It's strange that you don't see any error messages in the ArcGIS Server logs regarding the timeouts. If you haven't already, I would recommend reviewing the ArcGIS Server logs at debug level.
I would recommend filtering for error messages relating to the services that time out and to the feature services being edited in Collector. Is your virtual memory paging file size set to Windows managed?
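The log filtering suggested above can also be scripted against the `logs/query` operation of the ArcGIS Server Administrator API instead of using Manager. This is a rough sketch under assumptions: the service name is a hypothetical placeholder, the parameter names and the `logMessages`/`type` fields are as I understand the API, and sending the request (with a token) is left out.

```python
import json
from urllib.parse import urlencode


def build_log_query(services, newest_ms, oldest_ms, level="DEBUG", page_size=1000):
    """Return form-encoded parameters for a logs/query request.

    Times are epoch milliseconds. As I read the API documentation,
    startTime is the most recent bound and endTime the oldest one.
    """
    log_filter = {"server": "*", "services": list(services), "machines": "*"}
    return urlencode({
        "startTime": newest_ms,
        "endTime": oldest_ms,
        "level": level,
        "filterType": "json",
        "filter": json.dumps(log_filter),
        "pageSize": page_size,
        "f": "json",
    })


def keep_problems(response, levels=("SEVERE", "WARNING")):
    """Keep only the messages whose type is one of the given levels."""
    messages = response.get("logMessages", [])
    return [m for m in messages if m.get("type") in levels]


if __name__ == "__main__":
    # Hypothetical service name and time window, for illustration only.
    body = build_log_query(["Editing/Inspections.FeatureServer"],
                           newest_ms=1_600_000_300_000,
                           oldest_ms=1_600_000_000_000)
    print(body)
```

Querying at DEBUG for a narrow window around an outage and then keeping only the SEVERE and WARNING entries should make it easier to spot whatever precedes the timeouts.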
ArcGIS Server can sometimes use all the available virtual memory before maxing out RAM usage. However, you would usually see error messages in the Windows Event Viewer application logs.
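If it is easier than clicking through the System Properties dialog, the paging file setting can be read from a script. The sketch below assumes PowerShell is available on the server and reads the `AutomaticManagedPagefile` property of the `Win32_ComputerSystem` WMI class; the function names are my own, made up for illustration.

```python
import platform
import subprocess

# PowerShell expression that prints True when Windows manages the paging file.
PS_COMMAND = "(Get-CimInstance Win32_ComputerSystem).AutomaticManagedPagefile"


def parse_managed_flag(output):
    """Convert PowerShell's 'True'/'False' text output into a bool."""
    text = output.strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"unexpected PowerShell output: {output!r}")


def pagefile_is_windows_managed():
    """Ask PowerShell for the setting (only works on Windows)."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    )
    return parse_managed_flag(result.stdout)


if __name__ == "__main__":
    if platform.system() == "Windows":
        print("Windows-managed paging file:", pagefile_is_windows_managed())
    else:
        print("Run this on the ArcGIS Server machine (Windows).")
```

The same PowerShell expression can of course be run directly in a PowerShell prompt on the EC2 instance without the Python wrapper.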
Are your users editing with Collector encountering any issues syncing edits? If the feature services used with Collector are regularly maxed out under load, you may want to test increasing the maximum available instances.
Please find attached the debug log of a recent service interruption that started at 11:38 AM. We restarted ArcGIS Server at 11:43.
How can I check whether the virtual memory paging file size is set to Windows managed? I am using an AWS ArcGIS Enterprise AMI on an EC2 instance.
Regarding max instances, those services are set to 12 on the 8 vCPU and 32 GB RAM instance.
Hi again Thomas,
My virtual memory paging file size is set to Windows managed.
Right now the ArcGIS Enterprise instance has 4 vCPU and 32 GB RAM, because I moved the heavily used feature service to another, dedicated ArcGIS Server. But the ArcGIS Server service interruption is still happening on the ArcGIS Enterprise machine.
Also, I've seen these max response times for the printing tools, and they seem to be quite high.
To be honest, I don't have any new ideas for solving this issue, and it's critical.
Thank you anyway