Since upgrading to 10.6.1, we have seen machines in our ArcGIS Server site crash (it's not always the same machine). After digging into the logs and ArcGIS Monitor, we are seeing spikes in available memory (going from 5GB available to almost the full 32GB). Usually these are just spikes, but on occasion the available memory flatlines until we go in and restart the service on the server. I have correlated these spikes with warnings about the machine synchronizing with the site. When this happens, the machine typically stops all of its services and brings them back up, which appears to be what synchronizing with the site does. The problem seems to occur when one of the SOC processes hangs; killing that process (or restarting the ArcGIS Server service) brings the machine back online. Unfortunately, it is never the same map service that hangs.
I do have a premium support ticket open for this, but I wanted to see if anyone else has experienced it. Premium support's recommendation was to increase the number of cores on both servers, which we did, but we continue to see crashes.
Since we are coming from 10.3.1, I am not familiar with the new optimized app server architecture that was introduced at 10.6. I suspect the synchronize-with-site behavior is part of it. I have not found any documentation, or anything in the logs, that explains when or why a machine needs to synchronize with the site.
Some details on the site:
ArcGIS Server 10.6.1 running on RHEL 7.5
32GB RAM and 6 CPU cores
Federated with Portal as the hosting server
Running about 52 map image services (no hosted feature services yet)
There are no reports of synchronizing at night or on weekends, so it is definitely happening when there is heavy traffic.
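
For anyone who wants to check the same correlation on their own site, here is a minimal Python sketch of one way it could be done. It assumes the available-memory readings and the timestamps of the synchronization warnings have already been exported from the monitoring tool and the server logs into plain Python lists. The function names, the 10 GB jump threshold, and the 5-minute matching window are hypothetical placeholders, not part of ArcGIS Monitor or ArcGIS Server.

```python
from datetime import datetime, timedelta


def find_memory_spikes(samples, jump_gb=10.0):
    """Return timestamps where available memory rose by at least jump_gb
    between two consecutive samples.

    samples: list of (datetime, available_gb) tuples, sorted by time.
    jump_gb: hypothetical threshold; tune it to the machine's RAM size.
    """
    spikes = []
    for (_, prev_gb), (ts, cur_gb) in zip(samples, samples[1:]):
        if cur_gb - prev_gb >= jump_gb:
            spikes.append(ts)
    return spikes


def match_spikes_to_warnings(spikes, warnings, window=timedelta(minutes=5)):
    """Map each spike timestamp to the warning timestamps that fall
    within `window` before or after it (empty list if none do)."""
    return {
        spike: [w for w in warnings if abs(w - spike) <= window]
        for spike in spikes
    }


if __name__ == "__main__":
    # Made-up values for illustration only, not readings from the site above.
    t0 = datetime(2020, 1, 1, 10, 0)
    samples = [
        (t0, 5.0),
        (t0 + timedelta(minutes=1), 30.0),
        (t0 + timedelta(minutes=2), 6.0),
    ]
    warnings = [t0 + timedelta(minutes=2)]
    for spike, hits in match_spikes_to_warnings(
        find_memory_spikes(samples), warnings
    ).items():
        print(spike, "->", len(hits), "warning(s) nearby")
```

A spike with no nearby warning, or a warning with no nearby spike, would suggest the two are less tightly linked than they appear in the charts.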