Expected Downtime for Portal to Fail Over with HA Architecture

RussCoffey · 12-27-2022

We are deploying a Highly Available ArcGIS Enterprise 10.9.1 architecture and have been testing for a few weeks to determine the optimal configuration. We are currently focused on minimizing downtime due to events such as patching or hardware failures.

One result that came as a surprise to us was the amount of time it apparently takes to perform a failover, and I am specifically referring to Portal. Here is a scenario we have experienced: initiate the failover process by turning off the Portal service on the Primary machine (if there is a better way to initiate a failover, I am interested in hearing about it). Immediately, users of our site get no response, or maybe a 500 Error, for about 3-4 minutes until the Standby realizes that the Primary is down and promotes itself to Primary, after which users are back up and running fine.
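One way to put a number on the outage in a test like this is to poll the site from a client machine and record when it stops and starts responding. The following is a minimal Python sketch, not an Esri tool: the function names (`is_up`, `probe`, `outage_windows`) and the example URL are hypothetical, and it assumes the endpoint returns HTTP 200 when healthy and that the client trusts the site's TLS certificate.

```python
import time
import urllib.request
from typing import Callable, Iterable, List, Tuple


def is_up(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout.

    Any error (connection refused, timeout, HTTP 500, TLS failure) counts
    as "down", which matches what an end user would experience.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        return False


def probe(url: str, duration_s: float, interval_s: float = 1.0,
          check: Callable[[str], bool] = is_up) -> List[Tuple[float, bool]]:
    """Poll the URL for duration_s seconds and return (timestamp, ok) samples."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.time(), check(url)))
        time.sleep(interval_s)
    return samples


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse samples into (first_failure, first_recovery) timestamp pairs.

    An outage that is still open when sampling ends is not reported.
    """
    windows = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            windows.append((down_since, ts))
            down_since = None
    return windows


# Example usage (hypothetical URL; start it, then stop the Portal service):
#   samples = probe("https://portal.example.com/portal/home/", duration_s=600)
#   for start, end in outage_windows(samples):
#       print(f"down for {end - start:.0f} s")
```

A probe like this only measures the total outage as seen by a user; it cannot tell how much of that time is detection and how much is promotion.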

Is this the expected behavior and timeframe for switching?
There must be two components of that 3-4 minutes.
(1) The Standby machine performing checks against the Primary to see if it is running.
(2) Standby actually promoting itself to Primary.
How much of that 3-4 minutes is due to the health checks and how much is due to the actual promotion process?

Our deployment is configured with the default HA settings, as set in the portal.ha.monitor.interval file and documented here.
Based on this information, the health checks would only be responsible for 3 seconds of the 3-4 minute downtime, because Standby performs a health check with the Primary server every 1 second and promotes itself after 3 sequential failures.
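To make that arithmetic explicit, here is a small Python sketch. It assumes the values quoted in this post (a check every 1 second, promotion after 3 consecutive failures) rather than values verified against any particular installation, and the function names are illustrative only.

```python
def detection_seconds(interval_s: float, failures_required: int) -> float:
    """Approximate time for the standby to decide the primary is down."""
    return interval_s * failures_required


def promotion_seconds(observed_outage_s: float, interval_s: float,
                      failures_required: int) -> float:
    """Outage time left over once detection time is subtracted."""
    detection = detection_seconds(interval_s, failures_required)
    return max(0.0, observed_outage_s - detection)


# Values as stated in the post: 1 s interval, 3 sequential failures.
INTERVAL_S = 1.0
FAILURES_REQUIRED = 3

for observed in (180.0, 240.0):  # the observed 3-4 minute outage
    remaining = promotion_seconds(observed, INTERVAL_S, FAILURES_REQUIRED)
    print(f"observed {observed:.0f} s -> "
          f"detection {detection_seconds(INTERVAL_S, FAILURES_REQUIRED):.0f} s, "
          f"everything else {remaining:.0f} s")
```

Under those assumed settings, detection accounts for about 3 seconds and the remaining 177 to 237 seconds would have to come from the promotion itself.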

Should we really experience a 3-4 minute wait time for Standby to promote itself to Primary?
I am surprised because that is about the amount of time that it takes the Portal service to fully start up from a stopped state. But the Portal service on our Standby server is up and running and should theoretically be able to take over as soon as it realizes that the Primary is down.

We are very interested in hearing the experiences of others who have an HA deployment and have failed over for scheduled events such as patching, as well as due to unplanned events like hardware failure.

ReeseFacendini · 12-29-2022

Portal will take about 3-5 minutes to fully fail over; it's not an instantaneous process. Yes, both Portal services are running, and the health check is being performed frequently, but the database that keeps track of users and content is only active on a single node at a time (the primary node). When Portal fails over, the database starts up on the standby node, but it does take some time to make sure it's reading the latest changes to content and users from the shared location.

If you need an instant switchover to a working system, I would recommend having a complete duplicate of the system and using DNS changes to point to that second system. Users access the same URLs, but are technically pointed elsewhere.

BillFox · 12-28-2022

Hello Russ,

I believe Portal HA is "hot/hot", with both boxes handling user requests, in addition to the Primary/Standby functionality.

RussCoffey · 12-29-2022

Thanks for the reply, @BillFox. While Portal can function in an active-active mode, we can confirm that both machines are not responding to requests at the same time. It is definitely functioning in a Primary/Standby mode (otherwise, once we get to production, it would violate our licensing).

BillFox · 12-29-2022

Hi Russ, here are some additional Portal active-active details from @JonathanQuinn:

https://community.esri.com/t5/arcgis-enterprise-questions/activ-activ-quot-portal-for-arcgis-quot-in...

RussCoffey · 12-29-2022

Thank you for the response, @ReeseFacendini. While we were hoping for the failover to occur in only a few seconds, we just needed confirmation (and the explanation helps as well) that 3-5 minutes is what we should expect. We are trying to avoid managing the synchronization of content between servers, but we now understand that that is what we will have to do if we need a faster failover process. Thank you.