I'm having an issue with one of my ArcGIS Enterprise 11.1 deployments where the two-machine, active-active server site randomly re-synchronizes with the site, which causes a partial outage.
Specifically, I get these error messages in the logs:
The server machine 'machine name' is synchronizing with the site. This will take a few minutes and during this operation all administrative operations will be blocked.
Resetting the synchronization flag on the machine 'machine name'. Resetting the synchronize flag as synchronization is now running.
Failed to update the security configuration. Exception Could not connect to the ArcGIS component at URL 'https://machinename:6443/arcgis/admin/local/manageHandler'. The ArcGIS component on that machine may not be running or the machine may not be reachable at this time.Error: Connection refused: connect
There were a couple of other variations of the same error message. But after a few minutes, it finishes and everything goes back to normal. And while it doesn't happen at the same time, this has happened on both machines, so it's not isolated to one machine.
Is there any reason why this is randomly happening? There are no other errors in the server logs and no errors in Windows Event Viewer, and the ArcGIS Server service doesn't get restarted or anything. Technically, we don't experience an outage because the other ArcGIS Server machine handles the load until the affected machine comes back online. But it's also not ideal for one of our servers to essentially be inoperable for 20+ minutes while this is happening.
Sounds like there may be a network issue happening? Or maybe Windows updates?
We're good with Windows Updates and we're not noticing any network issues. We have two different ArcGIS Enterprise deployments and both are configured the same way. Only one of them is having this issue and we're not seeing any network related issues.
One of the approaches I've taken is to put the servers into Read-Only mode during core business hours. It means you can only make changes out of hours, but that's actually a good practice to have. Performance and reliability tend to improve in Read-Only mode. It can be an issue, though, if you're using the servers in Hosted as well as General-Purpose roles at the same time.
@RyanUthoff I am running into a similar issue. I do have a large number of services (400+ most of them with min. instance set to 0) and one thing I noticed was it happens when multiple ArcSOCs are trying to get initiated at the same time. This happens to me in both 10.8.1 and 11.1 setups.
Hmm, that's interesting. We have two ArcGIS Enterprise deployments. One on 10.9.1 and one on 11.1, and I'm only having issues with 11.1. Our 10.9.1 environment actually has higher demand, with our 11.1 environment having 100 ArcSOCs running at any given time. But, the services we have are pretty large.
I'm just curious, how are you determining this happens when ArcSOCs are initiating at the same time? Like what tools do you use to determine that?
I'm looking to upgrade it to 11.2 in the next couple of months, and if that doesn't fix it, I'll open a support ticket. It's just difficult because I can't reproduce the issue on demand.
Hi @RyanUthoff,
Were you able to get this resolved? I suspect we are facing a similar issue on an 11.4 HA environment. For the past two days our ArcGIS Server has been down for 20-30 minutes, starting almost exactly at 7:30 AM CST, and then it recovers on its own. The only thing we did during the outage was restart the server several times, but the cause seems to be something else.
We also have 2 machines in the ArcGIS Server site, and we have 2 different environments (QAT and PROD). The issue is happening on PROD. We have deployed Server on AWS machines. Nothing has changed recently, and we're not sure what is causing this.
We would appreciate it if you could share any insights.
Yeah, the root cause of the issue for us is that the two ArcGIS Servers were running on two different EC2 instance types. I forgot what they were, but for example, one was a general computing type, and the other was a compute optimized type. Basically, all ArcGIS Server machines in the site need to have the exact same resources (EC2 instance type, CPU, RAM, etc.).
I forgot exactly how it was explained to me, but basically, the ArcGIS Server would be processing requests at different speeds, and because one was slower than the other, it would think one of the machines was temporarily offline which would result in the server having to sync with the site again.
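If you want to double-check that on AWS, here is a minimal sketch in Python. It assumes boto3 is installed and AWS credentials are configured, and the function names are just ones I made up for illustration. It groups the instances by EC2 instance type so a mismatch stands out. Note it only compares the instance type, not anything inside the OS.

```python
from collections import defaultdict


def group_by_instance_type(describe_instances_response):
    """Map each EC2 instance type to the instance IDs that use it.

    Expects the dict shape returned by boto3's EC2 describe_instances call.
    """
    groups = defaultdict(list)
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            groups[instance["InstanceType"]].append(instance["InstanceId"])
    return dict(groups)


def fetch_instances(instance_ids):
    """Ask AWS about the given instance IDs (needs boto3 and credentials)."""
    import boto3  # imported lazily so the grouping helper works without it

    return boto3.client("ec2").describe_instances(InstanceIds=instance_ids)


def report_mismatch(instance_ids):
    """Print the instance types in use and flag it if there is more than one."""
    groups = group_by_instance_type(fetch_instances(instance_ids))
    for instance_type, ids in groups.items():
        print(f"{instance_type}: {', '.join(ids)}")
    if len(groups) > 1:
        print("Instance types differ between the ArcGIS Server machines.")
```

You would call report_mismatch() with the instance IDs of the ArcGIS Server machines in the site. If it prints more than one instance type, the machines are not on matching hardware.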
In your case however, it is odd that it is happening at the exact same time every day. It would happen at random times during the day for us, mainly during high usage. If you have the exact same EC2 instance specs for both ArcGIS Server machines, I'd try looking at the ArcGIS Server logs right before the sync happens to see if you find any suspicious messages in the logs that might help narrow down the issue.
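To pull out what was logged right before each sync, something like this rough Python sketch could work. The function name and the default of 20 context lines are arbitrary choices of mine. It only relies on the wording of the sync message quoted earlier in this thread.

```python
from collections import deque

# Fragment of the sync message quoted earlier in this thread.
SYNC_TEXT = "is synchronizing with the site"


def lines_before_sync(lines, context=20):
    """Yield (preceding_lines, sync_line) for every sync message found.

    `lines` can be any iterable of strings, such as an open log file.
    """
    recent = deque(maxlen=context)  # rolling window of the latest lines
    for line in lines:
        if SYNC_TEXT in line:
            yield list(recent), line
        recent.append(line)


def print_context(log_path, context=20):
    """Print the lines that came just before each sync message in one file."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for before, sync_line in lines_before_sync(handle, context):
            print("".join(before), end="")
            print(">>> " + sync_line.rstrip())
            print("-" * 40)
```

Call print_context() with the path to one of the text log files on the server machine and read through whatever shows up above each ">>>" line.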
Thank you so much @RyanUthoff, we really appreciate you taking time to respond.
I checked the machines and they have the exact same configuration.
You mentioned checking the logs right before the sync. Could you help us understand how to tell what time the sync takes place?
Also, to share an update: we did not encounter this issue yesterday. Not sure what's happening.
Besides all this, we are seeing continuous ArcGIS Data Store replication failures from the standby data store. Could this by any chance trigger an outage of 30-40 minutes?
You'll know what time the sync takes place by looking in the logs. It'll be easiest to search through the text log files on the machine itself instead of through Server Manager. Just do a search for "synchronizing" in the text file and you should be able to find it. You said this was happening around 7:30AM, so I'd say that would be a good place to start looking. The full log message is: "The server machine 'machine name' is synchronizing with the site. This will take a few minutes and during this operation all administrative operations will be blocked."
If you're not able to find that message in the logs when the issue is happening, then you have a different issue.
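If you'd rather not search by hand, here is a small Python sketch of the same search. It makes a few assumptions: that the log files end in .log, that each entry sits on one line, and that the timestamp appears as a time="..." attribute on that line. If your logs look different, the timestamp will just come back as None.

```python
import re
from pathlib import Path

# Fragment of the sync message quoted above.
SYNC_TEXT = "is synchronizing with the site"
# Assumption: each log entry carries its timestamp as time="...".
TIME_ATTR = re.compile(r'time="([^"]+)"')


def find_sync_events(log_dir):
    """Return (file name, line number, timestamp or None) per sync message."""
    events = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if SYNC_TEXT in line:
                    match = TIME_ATTR.search(line)
                    timestamp = match.group(1) if match else None
                    events.append((path.name, number, timestamp))
    return events
```

Point find_sync_events() at the ArcGIS Server log directory on each machine and compare the timestamps it returns against the 7:30 AM window.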
Regarding the Data Store, I don't think that should cause an ArcGIS Server outage. That should only cause an issue with the Data Store itself.