For several weeks or more, the map services in our ArcGIS Server 10.4.1 setup have occasionally failed to draw when we browse to their export REST endpoints.
We have a web adaptor, 2 machines on the default cluster, and a SQL Server SDE, all using HTTPS. Currently we only have 41 services running with 1 to 2 instances, with fewer than 200 layers between them all. Our server machines are VMs with two cores and 16 GB of RAM, of which 11 to 14 GB is currently in use under our load. CPU usage is fine, ranging from 4% to 60% on average. Our McAfee enterprise setup has already been configured to exclude ESRI's recommended folders.
When browsing to a service's REST endpoint on each machine (https://[machine-1]:6443/arcgis/rest/services/[service]/MapServer/export?bbox=[default-bbox]) we get the expected HTML output, but the image is blank. The other machine in the cluster might disp...
It seems like it's mostly the same few services which fail, but that's probably just because they're the ones we use/see most often in our Geocortex Essentials site (though the number does seem to be slowly increasing). These services are pretty much all Map Services with dynamic layers turned on, nothing special. We do have a few cached maps, and so far (fingers crossed) these have not been problematic at all.
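For anyone who wants to watch for the blank-image symptom without opening each export page by hand, here is a rough sketch in Python. It requests the export with `f=image` and `format=png32` (standard export parameters) and checks whether every pixel in the returned PNG is identical. The function names, the 400x400 size and the "all pixels equal means blank" rule are my own assumptions, not anything from ESRI, and the small PNG decoder only handles 8-bit non-interlaced images.

```python
import struct
import zlib
from urllib.parse import urlencode
from urllib.request import urlopen

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Bytes per pixel for 8-bit PNG colour types (grey, RGB, palette, grey+alpha, RGBA).
BYTES_PER_PIXEL = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}


def build_export_url(base_url, service, bbox, size="400,400"):
    """Build an export URL that returns the image itself, not the HTML page."""
    query = urlencode({"bbox": bbox, "size": size, "format": "png32",
                       "transparent": "true", "f": "image"})
    return f"{base_url}/rest/services/{service}/MapServer/export?{query}"


def decode_png_rows(png):
    """Decode an 8-bit, non-interlaced PNG into unfiltered scanlines."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("response is not a PNG image")
    pos, idat, header = 8, bytearray(), None
    while pos + 8 <= len(png):
        length, chunk_type = struct.unpack(">I4s", png[pos:pos + 8])
        body = png[pos + 8:pos + 8 + length]
        if chunk_type == b"IHDR":
            header = struct.unpack(">IIBBBBB", body)
        elif chunk_type == b"IDAT":
            idat += body
        pos += 12 + length  # 4 length + 4 type + body + 4 CRC
    if header is None:
        raise ValueError("PNG has no IHDR chunk")
    width, height, depth, colour_type, _, _, interlace = header
    if depth != 8 or interlace:
        raise ValueError("only 8-bit non-interlaced PNGs are handled here")
    bpp = BYTES_PER_PIXEL[colour_type]
    stride = width * bpp
    raw = zlib.decompress(bytes(idat))
    rows, prev = [], bytearray(stride)
    for y in range(height):
        start = y * (stride + 1)
        filter_type = raw[start]
        line = bytearray(raw[start + 1:start + 1 + stride])
        for i in range(stride):
            a = line[i - bpp] if i >= bpp else 0  # byte to the left
            b = prev[i]                           # byte above
            c = prev[i - bpp] if i >= bpp else 0  # byte above-left
            if filter_type == 1:      # Sub
                pred = a
            elif filter_type == 2:    # Up
                pred = b
            elif filter_type == 3:    # Average
                pred = (a + b) // 2
            elif filter_type == 4:    # Paeth
                p = a + b - c
                pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
                pred = a if pa <= pb and pa <= pc else (b if pb <= pc else c)
            else:                     # None
                pred = 0
            line[i] = (line[i] + pred) & 0xFF
        rows.append(line)
        prev = line
    return rows, bpp


def is_blank(png):
    """True when every pixel equals the first pixel (my assumed meaning of 'blank')."""
    rows, bpp = decode_png_rows(png)
    first_pixel = bytes(rows[0][:bpp])
    return all(bytes(row) == first_pixel * (len(row) // bpp) for row in rows)


def export_is_blank(base_url, service, bbox):
    """Fetch one export image and report whether it looks blank."""
    with urlopen(build_export_url(base_url, service, bbox), timeout=30) as resp:
        return is_blank(resp.read())
```

Calling `export_is_blank("https://[machine-1]:6443/arcgis", "[service]", "[default-bbox]")` against each machine on a schedule would give a timestamped record of when a service goes blank and when it recovers. If the machines use self-signed certificates, `urlopen` will likely need an SSL context passed in, and an error response that isn't a PNG will raise `ValueError` rather than return a result.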
Details / Investigation:
Restarting the bad service in Server Manager or ArcCatalog will temporarily restore functionality for a day or so. Restarting the ArcGIS Server Windows service or the server VM itself will solve it for a bit longer. OR, if you wait five minutes, it might return to normal without needing to do anything, but might start acting up so...
We've tried publishing specific services with a registered File Geodatabase on each machine instead of using the SDE, but that fails as well. We even rebuilt the Map Document for the most problematic service from scratch, only importing symbology & label fonts, and it still seems affected.
It seems like when one service starts failing to export (after a server restart for example), the others are not necessarily more likely to start failing as well.
We've already increased our App Heap Size to 512 MB, and our SOC Heap Size to 256 MB, but this hasn't changed anything.
The ArcGIS Server logs don't give any errors, even at DEBUG level. When a service starts failing, the logs show:
|MapServer.ExportMapImage||Begining of preparation.|
|MapServer.ExportMapImage||End of preparation.|
|Map.Draw||Beginning of group layer draw: Infrastructure (x2)|
|Map.Draw||End of group layer draw: Infrastructure (x2)|
|/export||REST request successfully processed. Response size is 429 characters.|
When it is working, the logs will show the service going through each layer in the group layer Infrastructure, with log records at each step for:
There is no indication as to what changes when a service "recovers" automatically (e.g. service shutting down & restarting, recycling of data connection, etc.).
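Since the only visible difference in the logs is that a failing export has the group layer "begin" record followed straight away by its "end" record, a small script can flag those cases in exported log text. This is only a sketch: it assumes the log lines are pipe-delimited exactly as pasted above, with the message in the last non-empty field, and the function names are made up for illustration.

```python
BEGIN = "Beginning of group layer draw: "
END = "End of group layer draw: "


def message_of(line):
    """Return the last non-empty pipe-delimited field of a pasted log line."""
    fields = [field for field in line.strip().split("|") if field]
    return fields[-1] if fields else ""


def find_empty_group_draws(lines):
    """List group layers whose 'begin' record is directly followed by its 'end'."""
    empty, previous = [], ""
    for line in lines:
        message = message_of(line)
        if not message:
            continue  # skip empty lines
        if (message.startswith(END) and previous.startswith(BEGIN)
                and message[len(END):] == previous[len(BEGIN):]):
            empty.append(message[len(END):])
        previous = message
    return empty


failing = [
    "|MapServer.ExportMapImage||Begining of preparation.|",
    "|MapServer.ExportMapImage||End of preparation.|",
    "|Map.Draw||Beginning of group layer draw: Infrastructure (x2)|",
    "|Map.Draw||End of group layer draw: Infrastructure (x2)|",
    "|/export||REST request successfully processed. Response size is 429 characters.|",
]
print(find_empty_group_draws(failing))  # prints ['Infrastructure (x2)']
```

Run over a day's worth of log output, this would at least show which group layers are skipped and roughly when, which may help line the failures up against recycling times or other scheduled events.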
Has anyone heard of or seen something similar to this? I've been in contact with ESRI support and they haven't come back with anything yet.
In your post you say "Restarting the bad service in Server Manager or ArcCatalog will temporarily restore functionality for a day or so. Restarting the ArcGIS Server Windows service or the server VM itself will solve it for a bit longer. OR, if you wait five minutes, it might return to normal without needing to do anything, but might start acting up so...
Does this mean your services are set to recycle only once a day, at midnight? If so, you might consider changing the recycling to once every hour. You indicate that recycling fixes the issue temporarily, so more frequent recycling could really minimize it. I've had recycling set to once per hour for my org's services for many years, and it works well.
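If there are too many services to change by hand in Server Manager, the recycling settings can also be adjusted in the service's JSON definition. The sketch below only shows the dictionary change. It assumes the definition uses the `recycleInterval` (hours) and `recycleStartTime` properties, which is how I understand the ArcGIS Server Administrator API to expose them; please check a real service definition from your own server before relying on this. The function name is hypothetical, and sending the edited JSON back to the server is deliberately left out.

```python
import copy


def with_hourly_recycling(service_definition, interval_hours=1, start_time="00:00"):
    """Return a copy of a service definition with its recycling interval changed.

    Assumes the 'recycleInterval' and 'recycleStartTime' property names;
    verify them against your own server's service JSON.
    """
    updated = copy.deepcopy(service_definition)
    updated["recycleInterval"] = interval_hours
    updated["recycleStartTime"] = start_time
    return updated


# Illustrative definition fragment, not copied from a real server.
current = {"serviceName": "Example", "type": "MapServer",
           "recycleInterval": 24, "recycleStartTime": "00:00"}
hourly = with_hourly_recycling(current)
```

The original dictionary is left untouched, so the before and after definitions can be compared or logged before anything is sent to the server.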
If you don't mind me asking, what does your org mainly use your map/feature services for (i.e. what is the main data sink)? Ours is Geocortex Essentials, which can make many export requests per service per second from a client's browser, depending on how they manipulate the map.
Yes, we have been facing the same trouble for many months. We have tried troubleshooting in exactly the same way as described in the initial post, with no luck. We have also observed that, in our case, this happens especially with group layers.
We discussed this issue with Esri consultants at the conference recently. Any help identifying the cause and finding a resolution would be appreciated, as we don't know how long it might take us to identify the cause and resolve it.
It would also be good to know whether anyone else is facing the same issue.
Thanks for the reply. We've had services without group layers fail in this way as well. However, we've never seen cached services fail, nor services with only point features.
We still haven't been able to pinpoint the reason behind the failure, or what seems to cause it (service load, port connection limits, temporary network outages, etc.).
What version are you using? Do you have an SDE? How many services do you have running? Are your ArcGIS Servers running on VMs? Knowing if our installs are similar might be helpful.
Our setup is in the cloud, but we haven't seen such behaviour at any time in the past with the cloud setup. We used AGS version 10.5.1. We also tested load, port connection limits, whether a temporary network outage might be the cause, etc. We have not had any positive outcome so far.
We are using SDE. We have approximately 30-50 services running, but we have observed this behaviour with particular services only (other services may be behaving in a similar manner, but we haven't observed anything).
What is the setup on your side?
Our setup is on locally-hosted VMware servers. We use SDE 10.4.1 along with Server 10.4.1, but are also experiencing the failures with services that were published with data copied to the server, so both SDE and local file GDBs are affected. We run a similar number of services, some with group layers, others without, and the failure seems to affect either configuration.
This behaviour is pretty new as well. Our system has been in production since January 2018, and the symptoms have only been appearing since around early January 2019 (though it's possible it was happening before and we hadn't noticed it). It shows up in the same services frequently, but appears to be slowly "spreading" to other services as time passes.
We've sent our list of recent Windows Updates going back to early December to ESRI and they haven't been able to dig up anything.