I'm having an issue with one of my ArcGIS Enterprise 11.1 deployments where a machine in the two-machine, active-active server site randomly re-synchronizes with the site, which causes a partial outage.
Specifically, I get these error messages in the logs:
The server machine 'machine name' is synchronizing with the site. This will take a few minutes and during this operation all administrative operations will be blocked.
Resetting the synchronization flag on the machine 'machine name'. Resetting the synchronize flag as synchronization is now running.
Failed to update the security configuration. Exception Could not connect to the ArcGIS component at URL 'https://machinename:6443/arcgis/admin/local/manageHandler'. The ArcGIS component on that machine may not be running or the machine may not be reachable at this time.Error: Connection refused: connect
There were a couple of other variations of the error message above. After a few minutes, it finishes and everything goes back to normal. It doesn't happen on both machines at the same time, but it has happened on each of them, so it's not isolated to one machine.
Is there any reason why this is randomly happening? There are no other errors in the server logs, no errors in Windows Event Viewer, and the ArcGIS Server service doesn't get restarted or anything. Technically, we don't experience an outage because the other ArcGIS Server machine handles the load until the affected machine comes back online. But it's also not ideal for one of our servers to essentially be inoperable for 20+ minutes while this is happening.
Sounds like there may be a network issue happening, or maybe Windows updates?
We're good with Windows Updates and we're not noticing any network issues. We have two different ArcGIS Enterprise deployments and both are configured the same way. Only one of them is having this issue, and we're not seeing any network-related issues.
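For anyone who wants to double-check the network side, here is a minimal sketch of a probe that could run on one machine and watch the other machine's port 6443 (the port in the "Connection refused" message above). The function names, the 30-second interval, and the number of checks are my own assumptions, not anything from Esri. It only tells you when the port stops accepting TCP connections, not why.

```python
import socket
import time
from datetime import datetime


def port_is_open(host, port=6443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


def watch_port(host, port=6443, interval=30.0, checks=10):
    """Probe the port `checks` times and return timestamped state changes."""
    changes = []
    last_state = None
    for i in range(checks):
        state = port_is_open(host, port)
        if state != last_state:
            # Record only transitions (open -> closed or closed -> open).
            stamp = datetime.now().isoformat(timespec="seconds")
            changes.append((stamp, state))
            last_state = state
        if i < checks - 1:
            time.sleep(interval)
    return changes
```

Calling something like `watch_port("machinename", checks=2880)` would cover roughly a day at 30-second intervals. Comparing the recorded timestamps with the synchronization messages in the server logs should show whether the port was actually unreachable before the sync started or only during it.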
One of the approaches I've taken is to put the servers into Read-Only mode during core business hours. It means you can only make changes out of hours, but that's actually a good practice to have. Performance and reliability tend to improve with Read-Only mode. It can be an issue, though, if you're using the same servers in both the Hosting and General-Purpose roles at the same time.
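If you want to script the switch instead of doing it by hand, here is a rough sketch of what that could look like in Python. It assumes the Administrator API's `mode/update` operation with the `siteMode` parameter (`READ_ONLY` or `EDITABLE`), which is how I understand it to work; check the REST API documentation for your version before relying on it. The function names are mine, and you would need to obtain an admin token separately.

```python
import json
import urllib.parse
import urllib.request


def build_mode_request(admin_url, token, read_only):
    """Build the URL and POST body for switching the site mode."""
    url = admin_url.rstrip("/") + "/mode/update"
    params = {
        "siteMode": "READ_ONLY" if read_only else "EDITABLE",
        "runAsync": "false",  # assumption: wait for the change to complete
        "f": "json",
        "token": token,
    }
    return url, urllib.parse.urlencode(params).encode("utf-8")


def set_site_mode(admin_url, token, read_only):
    """Send the request and return the parsed JSON response."""
    url, body = build_mode_request(admin_url, token, read_only)
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

Two scheduled tasks could then call `set_site_mode("https://machinename:6443/arcgis/admin", token, True)` at the start of business hours and the same call with `False` in the evening.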
@RyanUthoff I am running into a similar issue. I have a large number of services (400+, most of them with the minimum number of instances set to 0), and one thing I noticed is that it happens when multiple ArcSOCs are trying to start at the same time. This happens to me in both 10.8.1 and 11.1 setups.
Hmm, that's interesting. We have two ArcGIS Enterprise deployments, one on 10.9.1 and one on 11.1, and I'm only having issues with 11.1. Our 10.9.1 environment actually has higher demand, while our 11.1 environment has 100 ArcSOCs running at any given time. But the services we have are pretty large.
I'm just curious, how are you determining this happens when ArcSOCs are initiating at the same time? Like what tools do you use to determine that?
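One low-tech option (not necessarily what was used above) is to sample the number of `ArcSOC.exe` processes with the built-in Windows `tasklist` command and log any sudden jumps. This is only a sketch: the function names and the spike threshold are my own assumptions, and it counts running processes rather than measuring service startup directly.

```python
import csv
import io
import subprocess


def count_processes(tasklist_csv, image_name="ArcSOC.exe"):
    """Count rows in `tasklist /FO CSV /NH` output matching image_name."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    target = image_name.lower()
    # The first CSV column is the image name; non-matching lines are skipped.
    return sum(1 for row in rows if row and row[0].lower() == target)


def sample_arcsoc_count():
    """Run tasklist (Windows only) and return the current ArcSOC.exe count."""
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq ArcSOC.exe", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_processes(result.stdout)


def is_spike(previous, current, threshold=5):
    """Return True if the count grew by at least `threshold` between samples."""
    return current - previous >= threshold
```

Running `sample_arcsoc_count()` every 10 seconds or so from a scheduled script and writing a timestamp whenever `is_spike` returns `True` would give you something to line up against the synchronization messages in the server logs.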
I'm looking to upgrade to 11.2 in the next couple of months, and if that doesn't fix it, I'll open a support ticket. It's just difficult because I can't reproduce the issue on demand.