We're running two clusters of ArcGIS for Server Enterprise (Windows) 10.3.1 on VMware vSphere 5 (ESXi 5.5), with the VMs running Windows Server 2008 R2 SP1 (64-bit). For the past three days we've been seeing repeated crashes, with all servers going down together. From the VM console we can see the CPUs spike to 100% and stay there until the VMs become unresponsive. The server logs are empty. The only thing the servers in these two clusters have in common is a shared config-store. On the ArcGIS side we haven't changed any configuration in recent days, and on the VM side we're told nothing has changed either. Can a corrupted config-store cause servers to fail? Can someone identify which process within ArcGIS controls communication with the config-store?
Where is your config store? I've had similar performance issues, still not completely resolved, that I've narrowed down to disk performance on the NAS that the VMs are running on. If you have a bunch of VMs (defined as more than two) sharing disk spindles, I/O performance drops to "equivalent to a 3.5" floppy drive", as reported by the Performance Analysis of Logs (PAL) tool. ArcGIS Server is very sensitive to disk performance where the config store is concerned. As for the communication protocol, it uses SMB, and a massive amount of it according to our network engineer, who identified a high number of failing SMB packets going between our servers and the shared config store. The fix would be a SAN and network upgrade that is far outside our budget. Low-hanging fruit: are your IT folks enforcing SMB encryption at the OS level, by any chance? That's a 15-30% performance hit right there. Perhaps a GPO update is causing this?
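If it helps, here is a rough sketch of how you could compare storage locations yourself by timing small create/write/read/delete cycles against a directory. This is only an illustration, not an Esri tool: the function names `probe_share` and `summarize` are made up for this example, the 4 KB payload and 20 rounds are arbitrary, and the script defaults to the local temp directory, so you would substitute the path of your own config-store share.

```python
import os
import statistics
import tempfile
import time
import uuid


def probe_share(path, rounds=20, payload=b"x" * 4096):
    """Time small directory/file cycles under `path`; return latencies in ms.

    Hypothetical helper: each round creates a directory, writes and syncs a
    small file, reads it back, then deletes both.
    """
    timings = []
    for _ in range(rounds):
        work_dir = os.path.join(path, "probe-" + uuid.uuid4().hex)
        start = time.perf_counter()
        os.mkdir(work_dir)
        file_path = os.path.join(work_dir, "probe.bin")
        with open(file_path, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push the write through to storage
        with open(file_path, "rb") as fh:
            data = fh.read()
        os.remove(file_path)
        os.rmdir(work_dir)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if data != payload:
            raise IOError("read-back mismatch in " + work_dir)
        timings.append(elapsed_ms)
    return timings


def summarize(timings):
    """Reduce a list of per-round latencies to a few headline numbers."""
    return {
        "rounds": len(timings),
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }


if __name__ == "__main__":
    # Assumption: replace this with the path to your config-store share.
    target = tempfile.gettempdir()
    print(summarize(probe_share(target)))
```

Running it once against the shared location and once against a local disk on one of the servers should show whether the share is noticeably slower. The absolute numbers mean little on their own; the comparison between the two locations is the useful part.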
Our config store is on a dedicated GIS application file share that we've used for over two years without incident. It's only in the last two weeks that we've seen a rash of error messages such as "Failed to return all folder configurations. Could not create folder "/" in the config store" or "an error was encountered while synchronizing with the config store. Could not create directory path ...". We think something must have changed to create a bottleneck, but our IT staff have been unable to identify any change that might be the culprit. We've temporarily moved the config stores to a server in each cluster, and for now things appear to have stabilized, but we'd like to isolate the problem and find a resolution.
After further investigation, we discovered that our IT department had moved our GIS application file share from block storage to Isilon. Once they moved the file share back to higher-performance storage, the issues were resolved.