FYI only... posting to community in case others run into the issue...
Every once in a while (every month or two) we get reports from users that some layers in some services are missing. Our monitoring does not catch this condition since it's very sporadic. Here is an example of the available layers in a service (normally):
The layers as they looked today during this condition:
Bottom line... Clearing the rest-cache in the admin API resolved our issue:
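For anyone who would rather script this than click through the admin directory, here is a rough sketch in Python (standard library only). It assumes the clear operation lives at `/system/handlers/rest/cache/clear` under the admin URL and that you already have an admin token (for example from the admin `generateToken` operation). The function names and the host in the usage note are made up for illustration; check the path against the admin directory of your own version before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Assumed admin path for the "clear REST cache" operation; verify it in the
# admin directory of your own ArcGIS Server version.
CLEAR_CACHE_PATH = "/system/handlers/rest/cache/clear"


def build_clear_cache_request(admin_url, token):
    """Return (url, body) for the POST that clears the REST cache."""
    url = admin_url.rstrip("/") + CLEAR_CACHE_PATH
    body = urllib.parse.urlencode({"f": "json", "token": token}).encode("utf-8")
    return url, body


def clear_rest_cache(admin_url, token, timeout=30):
    """Send the clear request and return the parsed JSON response."""
    url, body = build_clear_cache_request(admin_url, token)
    request = urllib.request.Request(url, data=body)  # passing data makes it a POST
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look something like `clear_rest_cache("https://gisserver.example.com:6443/arcgis/admin", token)`, where the host is hypothetical. If port 6443 is using a self-signed certificate, `urlopen` will probably reject it unless you pass an SSL context that trusts that certificate.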
We have quite a few ArcGIS Server (AGS) deployments, with a mix of web-tier and token-based security... This problem has occurred on both (confirmed on v10.3.0 using token and v10.3.1 using web-tier). All AGS deployments are virtual machines (VMs) running Windows Server 2008 R2 and configured with access from Web Adaptors hosted in IIS. Not federated with Portal (stand-alone).
We have our config-store/directories on a dedicated file server (running multi-machine sites)... and configure ArcGIS Server to use a DFS path (we have had to move it around a few times... and this makes it really easy for us to move it).
We normally publish web-services with the source data in a File Geodatabase (FGDB) residing on disk, mapped with a DFS path as well (but those are usually on a different file server).
User-store is configured to use "Windows Domain" and role-store is built-in.
We are doing basic host-based monitoring (every 60 seconds) using the Ipswitch WhatsUp Gold (WUG) product, based on standard PING, Windows services (ArcGIS Server) and some HTTP content monitors (for some high-use services and the root of the REST endpoint over port 6443).
I really do not like relying on server/service reboots to solve our issues... Our servers are already rebooted too often (mostly for patches being applied) and I think that is what ultimately caused this issue. I would prefer to find the degraded 'component' and fix that individually without an aggressive reboot. The site impacted today has 2 back-end AGS hosts, both of which went offline for ~6 min around 3:30am:
The back-end file servers (both config-store/directories and FGDB hosting) were still online during this time period:
Before clearing the REST cache... I stopped both of the machines from the AGS Manager (web-based). I watched all the ArcSOC.exe's go away, and it left a handful of .exe's behind, including 1 javaw.exe. Starting both machines back up brought back all the ArcSOC.exe's, but testing from multiple machines still showed the missing layers. I did not restart the Windows service, and I suspect that would have resolved the issue (since all .exe's would have disappeared).
If this continues to be a problem we will most likely script a REST cache clear for all the ArcGIS Server deployments and schedule it to run in the early morning. This is a hard condition to identify since we have so many services hosted and IT does not know what layers are in what services (that is managed by the GIS users).
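One idea for catching the condition without IT having to know the layers by heart: record the layer names of each service while it is healthy, then periodically compare the service's `?f=json` description against that baseline. Below is a minimal Python sketch of that comparison. It assumes the map service description carries a `layers` array with a `name` per layer; the function names and the baseline list are hypothetical, and this is not something we have in production.

```python
import json
import urllib.parse
import urllib.request


def layer_names(service_json):
    """Collect the layer names from a map service's ?f=json description."""
    return {layer["name"] for layer in service_json.get("layers", [])}


def missing_layers(baseline_names, service_json):
    """Return the baseline layer names absent from the description, sorted."""
    return sorted(set(baseline_names) - layer_names(service_json))


def fetch_service_json(service_url, token=None, timeout=30):
    """Fetch a service description as JSON (token is optional)."""
    params = {"f": "json"}
    if token:
        params["token"] = token
    url = service_url.rstrip("/") + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A scheduled job could call `missing_layers(baseline, fetch_service_json(url))` for each service and raise an alert (or trigger the cache clear) whenever the result is non-empty. Since the GIS users manage the layers, the baseline would need refreshing whenever they republish.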