For at least several weeks, the map services in our ArcGIS Server 10.4.1 setup have intermittently failed to draw when we browse to their export REST endpoints.
We have a Web Adaptor and two machines in the default cluster, plus a SQL Server SDE, all using HTTPS. We currently run only 41 services, each with 1 to 2 instances, and fewer than 200 layers between them all. Our server machines are VMs with two cores and 16 GB of RAM, of which 11 to 14 GB is in use under our load. CPU usage is fine, averaging between 4% and 60%. Our McAfee enterprise setup has already been configured to exclude ESRI's recommended folders.
When browsing to a service's REST endpoint on each machine (https://[machine-1]:6443/arcgis/rest/services/[service]/MapServer/export?bbox=[default-bbox]) we get the expected HTML output, but the image is blank. The other machine in the cluster might display the expected output with a valid image for the same bounding box, or it might be blank as well.
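For anyone who wants to reproduce the check without clicking through the HTML pages, here is a minimal Python sketch that requests the same export from each machine with `f=image` and guesses whether the returned PNG is blank. The machine names, the 400x400 size and the 0.01 bytes-per-pixel threshold are assumptions for illustration, and the "blank" test is only a heuristic: a uniform image compresses to almost nothing, so very little IDAT data per pixel suggests nothing was drawn. If the machines use self-signed certificates, `urlopen` will also need a suitable SSL context.

```python
import struct
import urllib.parse
import urllib.request

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def export_url(machine, service, bbox, size=(400, 400), port=6443):
    """Build an export URL that returns the PNG itself (f=image)."""
    query = urllib.parse.urlencode(
        {
            "bbox": ",".join(str(v) for v in bbox),
            "size": f"{size[0]},{size[1]}",
            "format": "png",
            "transparent": "true",
            "f": "image",
        },
        safe=",",  # keep the commas readable in the query string
    )
    return (f"https://{machine}:{port}/arcgis/rest/services/"
            f"{service}/MapServer/export?{query}")


def idat_bytes_per_pixel(png):
    """Return compressed image data size divided by the pixel count."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("response is not a PNG")
    pos, width, height, idat = len(PNG_SIGNATURE), 0, 0, 0
    while pos + 8 <= len(png):
        # Each chunk: 4-byte length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", png[pos:pos + 8])
        if chunk_type == b"IHDR":
            width, height = struct.unpack(">II", png[pos + 8:pos + 16])
        elif chunk_type == b"IDAT":
            idat += length
        pos += 12 + length
    if width == 0 or height == 0:
        raise ValueError("PNG has no usable IHDR chunk")
    return idat / (width * height)


def looks_blank(png, threshold=0.01):
    """Heuristic only: the threshold is an assumed value, tune it locally."""
    return idat_bytes_per_pixel(png) < threshold


def check_machines(machines, service, bbox):
    """Map each machine name to True if its export image looks blank."""
    results = {}
    for machine in machines:
        url = export_url(machine, service, bbox)
        with urllib.request.urlopen(url, timeout=30) as resp:
            results[machine] = looks_blank(resp.read())
    return results
```

Running `check_machines` on a schedule for both machines and logging the result with a timestamp would show when a service flips between drawing and not drawing, and whether both machines flip together.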
It seems like it's mostly the same few services which fail, but that's probably just because they're the ones we use/see most often in our Geocortex Essentials site (though the number does seem to be slowly increasing). These services are pretty much all Map Services with dynamic layers turned on, nothing special. We do have a few cached maps, and so far (fingers crossed) these have not been problematic at all.
Details / Investigation:
Restarting the bad service in Server Manager or ArcCatalog temporarily restores functionality for a day or so. Restarting the ArcGIS Server Windows service or the server VM itself solves it for a bit longer. Alternatively, if we wait five minutes or so, the service might return to normal without us doing anything, but it might then start acting up again sooner. Some services do not recover at all until they're recycled at midnight.
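The manual restart can also be scripted as a stopgap. The sketch below is a rough illustration only, assuming the ArcGIS Server Administrator API's `stop` and `start` operations on the service resource and an admin token obtained separately; the machine and service names passed in are placeholders.

```python
import json
import urllib.parse
import urllib.request


def admin_service_url(machine, service, operation, port=6443):
    """URL of a stop/start operation for a map service in the admin API."""
    return (f"https://{machine}:{port}/arcgis/admin/services/"
            f"{service}.MapServer/{operation}")


def restart_service(machine, service, token):
    """Stop and then start one map service; raise if either step fails."""
    for operation in ("stop", "start"):
        body = urllib.parse.urlencode({"token": token, "f": "json"}).encode()
        url = admin_service_url(machine, service, operation)
        # Passing data makes urlopen send a POST request.
        with urllib.request.urlopen(url, data=body, timeout=120) as resp:
            reply = json.load(resp)
        if reply.get("status") != "success":
            raise RuntimeError(f"{operation} failed for {service}: {reply}")
```

This only automates the workaround; it does nothing to explain why the service stopped drawing.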
We've tried publishing specific services with a registered File Geodatabase on each machine instead of using the SDE, but that fails as well. We even rebuilt the Map Document for the most problematic service from scratch, only importing symbology & label fonts, and it still seems affected.
When one service starts failing to export (after a server restart, for example), the others do not seem any more likely to start failing as well.
We've already increased our App Heap Size to 512 MB, and our SOC Heap Size to 256 MB, but this hasn't changed anything.
The ArcGIS Server logs don't give any errors, even at DEBUG level. When a service starts failing, the logs show:
|MapServer.ExportMapImage||Begining of preparation.|
|MapServer.ExportMapImage||End of preparation.|
|Map.Draw||Beginning of group layer draw: Infrastructure (x2)|
|Map.Draw||End of group layer draw: Infrastructure (x2)|
|/export||REST request successfully processed. Response size is 429 characters.|
When it is working, the logs will show the service going through each layer in the group layer Infrastructure, with log records at each step for:
- Number of features drawn: [x]
- Symbol Drawing
- Data Access
- Execute Query
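Since the only visible difference is the missing per-layer records, the logs themselves can be scanned for export requests that completed without drawing anything. This is a minimal sketch, assuming the log lines are available as pipe-delimited text in the same shape as the excerpts quoted above (source in the second field, message in the fourth) and that requests are not interleaved; with several instances writing at once, the lines would first need grouping by process or thread.

```python
def feature_draw_counts(lines):
    """Count 'Number of features drawn' records per completed /export request."""
    counts, current = [], 0
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) < 4:
            continue  # not in the assumed |source||message| shape
        source, message = parts[1], parts[3]
        if message.startswith("Number of features drawn"):
            current += 1
        elif source == "/export" and "successfully processed" in message:
            # The request is finished: record its tally and start a new one.
            counts.append(current)
            current = 0
    return counts


def empty_exports(lines):
    """Indexes of export requests that finished with no feature draw records."""
    return [i for i, n in enumerate(feature_draw_counts(lines)) if n == 0]
```

A request with a count of zero matches the failing pattern shown earlier, where the group layer draw begins and ends with nothing in between.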
There is no indication as to what changes when a service "recovers" automatically (e.g. service shutting down & restarting, recycling of data connection, etc.).
Has anyone heard of or seen something similar to this? I've been in contact with ESRI support and they haven't come back with anything yet.