Servers Randomly Synchronize with the Server Site, Causing Partial Outage

01-30-2024 01:17 PM
RyanUthoff
Regular Contributor

I'm having an issue with one of my ArcGIS Enterprise 11.1 deployments where the two-machine, active-active server site randomly re-synchronizes with the site, which causes a partial outage.

Specifically, I get these error messages in the logs:

The server machine 'machine name' is synchronizing with the site. This will take a few minutes and during this operation all administrative operations will be blocked.

Resetting the synchronization flag on the machine 'machine name'. Resetting the synchronize flag as synchronization is now running.

Failed to update the security configuration. Exception Could not connect to the ArcGIS component at URL 'https://machinename:6443/arcgis/admin/local/manageHandler'. The ArcGIS component on that machine may not be running or the machine may not be reachable at this time.Error: Connection refused: connect

There were a couple of other variations of the error message above. After a few minutes, it finishes and everything goes back to normal. It doesn't happen on both machines at the same time, but it has happened on each of them, so it's not isolated to one machine.
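To line these events up against other activity on the machines (scheduled tasks, backups, publishing), it can help to group the matching log entries into incidents with a start time, an end time, and a count. The following is only a minimal sketch: it assumes the server logs have already been exported to plain text with one entry per line in the form `ISO timestamp<TAB>message`, in chronological order. That export format, the function name `group_incidents`, and the 30-minute gap are assumptions for illustration, and the sample timestamps are made up.

```python
from datetime import datetime, timedelta

# Substrings taken from the log messages quoted above.
PATTERNS = (
    "is synchronizing with the site",
    "Resetting the synchronization flag",
    "Failed to update the security configuration",
)


def group_incidents(lines, max_gap=timedelta(minutes=30)):
    """Group matching entries into incidents separated by more than max_gap.

    Assumes each line is '<ISO timestamp>\\t<message>' in chronological order
    (a hypothetical export format, not the native ArcGIS Server log format).
    """
    incidents = []
    for line in lines:
        stamp, sep, message = line.rstrip("\n").partition("\t")
        if not sep or not any(p in message for p in PATTERNS):
            continue  # not a tab-separated entry, or not a message of interest
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip entries whose timestamp cannot be parsed
        if incidents and when - incidents[-1]["end"] <= max_gap:
            # Close enough to the previous match: same incident.
            incidents[-1]["end"] = when
            incidents[-1]["count"] += 1
        else:
            incidents.append({"start": when, "end": when, "count": 1})
    return incidents


# Made-up sample entries, for illustration only.
sample = [
    "2024-01-30T09:00:00\tThe server machine 'A' is synchronizing with the site.",
    "2024-01-30T09:00:05\tResetting the synchronization flag on the machine 'A'.",
    "2024-01-30T09:10:00\tSome unrelated message",
    "2024-01-30T09:12:00\tFailed to update the security configuration.",
    "2024-01-30T15:00:00\tThe server machine 'B' is synchronizing with the site.",
]

for incident in group_incidents(sample):
    duration = incident["end"] - incident["start"]
    print(incident["start"], duration, incident["count"])
```

The start and end times of each incident can then be compared with whatever else was running on the machine in that window.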

Is there any reason why this is randomly happening? There are no other errors in the server logs and no errors in Windows Event Viewer, and the ArcGIS Server service doesn't get restarted or anything. Technically, we don't experience an outage because the other machine in the site handles the load until the affected machine comes back online. But it's also not ideal for one of our servers to essentially be inoperable for 20+ minutes while this is happening.

5 Replies
Dan_Brumm
Occasional Contributor II

Sounds like there may be a network issue happening, or maybe Windows updates?

Daniel Brumm
GIS Nerd
RyanUthoff
Regular Contributor

We're good with Windows Updates and we're not noticing any network issues. We have two different ArcGIS Enterprise deployments and both are configured the same way. Only one of them is having this issue, and we're not seeing any network-related issues.

Scott_Tansley
MVP Regular Contributor

One of the approaches I've taken is to put the servers into Read-Only mode during core business hours. It means you can only make changes out of hours, but that's actually a good practice to have. Performance and reliability tend to be improved by using Read-Only mode. It can be an issue if you're using the servers in Hosted as well as General-Purpose roles at the same time.

Scott Tansley
https://www.linkedin.com/in/scotttansley/
CodeSpatial
New Contributor II

@RyanUthoff I am running into a similar issue. I have a large number of services (400+, most of them with min. instances set to 0), and one thing I noticed is that it happens when multiple ArcSOCs are trying to start at the same time. This happens to me in both 10.8.1 and 11.1 setups.
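One rough way to watch for bursts of ArcSOC starts is to sample the process list on the server machine and count how many new ArcSOC.exe PIDs appear between samples. This is only a sketch under assumptions: it relies on the Windows `tasklist` command with CSV output, and the function names (`arcsoc_pids`, `new_starts`, `poll`), the sample count, and the polling interval are hypothetical choices for illustration.

```python
import csv
import io
import subprocess
import sys
import time
from datetime import datetime


def arcsoc_pids(tasklist_csv):
    """Return the set of ArcSOC.exe PIDs found in `tasklist /FO CSV /NH` output."""
    pids = set()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Expected columns: image name, PID, session name, session number, memory.
        if len(row) >= 2 and row[0].lower() == "arcsoc.exe" and row[1].isdigit():
            pids.add(int(row[1]))
    return pids


def new_starts(previous, current):
    """PIDs present in the current sample but not in the previous one."""
    return current - previous


def poll(samples=5, interval=2.0):
    """Print timestamp, running ArcSOC count, and newly started count per sample."""
    previous = None
    for _ in range(samples):
        output = subprocess.run(
            ["tasklist", "/FO", "CSV", "/NH", "/FI", "IMAGENAME eq ArcSOC.exe"],
            capture_output=True, text=True, check=False,
        ).stdout
        current = arcsoc_pids(output)
        started = set() if previous is None else new_starts(previous, current)
        print(datetime.now().isoformat(timespec="seconds"), len(current), len(started))
        previous = current
        time.sleep(interval)


# Only poll when run directly on a Windows machine.
if __name__ == "__main__" and sys.platform == "win32":
    poll()
```

A sample in which the newly started count jumps by several at once, shortly before the synchronization messages appear in the logs, would be consistent with the pattern described here, although it would not prove a cause.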

RyanUthoff
Regular Contributor

Hmm, that's interesting. We have two ArcGIS Enterprise deployments, one on 10.9.1 and one on 11.1, and I'm only having issues with 11.1. Our 10.9.1 environment actually has higher demand, while our 11.1 environment has 100 ArcSOCs running at any given time. But the services we have are pretty large.

I'm just curious, how are you determining this happens when ArcSOCs are initiating at the same time? Like what tools do you use to determine that? 

I'm looking to upgrade it to 11.2 in the next couple months, and if that doesn't fix it, I'll open a support ticket. It's just difficult because I can't force reproduce the issue.
