Portal for ArcGIS High Availability Startup Failure

DeanMoiler · ‎08-18-2020

Hi Guys,

I'm having an issue with 2 separate (staging/production) 10.7.1 Portal Environments with High Availability coming back online following server restarts.

From some investigation of the log files, it seems to occur when the servers are restarted at the same time, with the logging service on each starting within 0 - 10 seconds of the other. When the gap is more than 60 seconds, the nodes fail over and the issue doesn't seem to present itself.

Following a restart of both servers, the existing master node (Server 1) remains the master node, but the standby node (Server 2) is removed from the shared nodes.config file. Server 2 then seems to attempt to recover itself as standby node after around 3.5 hours if left alone.

During this time the master node logs report that it is running fine and monitoring the standby nodes:

HA: Node server1.domain.local is configured to be master.
HA: Monitoring the standby nodes.

and Server 2 does not recover its standby node status, with the following repeating every 11 minutes:

HA: Error in HA plugin. java.lang.Exception: Failed to start the Web Server. The startup timed out.

Leaving the nodes.config with just the master node in play:

master.node = server1.domain.local
standby.nodes =

After 3.5 hours Server 2 decides it does want to be standby node, and updates the nodes.config file:

HA: Store the standby node server2.domain.local from the standby.nodes property in the nodes.properties file.

And both end up in play in nodes.config:

master.node = server1.domain.local
standby.nodes = server2.domain.local
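For anyone wanting to watch for the degraded state, here is a minimal sketch that reads the two properties shown above and flags the case where a master is set but no standby is listed. It assumes the file is plain "key = value" text exactly as in the snippets above; the function names are my own, and treating standby.nodes as a comma-separated list is an assumption (only a single standby appears here).

```python
def parse_nodes_config(text):
    """Parse 'key = value' lines into a dict (format assumed from the snippets above)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comment lines, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def standby_nodes(settings):
    """Return the standby hosts as a list (comma separation is an assumption)."""
    value = settings.get("standby.nodes", "")
    return [name.strip() for name in value.split(",") if name.strip()]


def is_degraded(settings):
    """True when a master is configured but no standby node is listed."""
    return bool(settings.get("master.node")) and not standby_nodes(settings)
```

Reading the shared nodes.config file and passing its contents to parse_nodes_config, then calling is_degraded on the result, would show whether Server 2 has dropped out.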

The most notable issue here is that the master node web server does not display at all during this "not fully recovered" state, so the portal is essentially offline without the manual intervention of restarting portal services. Neither of the following pages will load:

https://server1.domain.local:7443/arcgis/portaladmin

https://server1.domain.local:7443/arcgis/home

To remedy the issue, simply stopping the Portal for ArcGIS Service on Server 2 allows the master node web server to display correctly! For some reason, the standby server not correctly acting as standby causes the master node web server to fail to display at all.

I can see (via netstat -ao) that during the failed state the localhost LISTENING 5701 ports do not seem to be running, though the servers appear ESTABLISHED via 5701, so are somewhat communicating. Once both servers are listed in the nodes.config file, both the localhost 5701 LISTENING and remote ESTABLISHED ports are running.
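As a scripted alternative to reading the netstat output, a rough sketch of the same check is below. It simply attempts a TCP connection to port 5701 and reports whether anything accepts it. The function name and timeout are my own choices, and it assumes the listener accepts connections on the address you probe (it may be bound to a specific interface rather than 127.0.0.1).

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port is accepted within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Calling is_listening("127.0.0.1", 5701) on each server during the failed state, and again once both nodes are back in nodes.config, should mirror the LISTENING difference described above.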

I also see the following occurring in the Catalina logs:

I have disabled McAfee for testing and internal firewalls are disabled, but the error still presented itself following several restarts.

Maybe (hopefully) Jonathan Quinn or someone else in the community has come across something like this before?

Thanks,

Dean

JonathanQuinn · ‎08-18-2020

This is a bug that's resolved at 10.8:

BUG-000121969 If both portal machines restart at the same time, the web server can become deadlocked

We're also working on some HA/DR patches for 10.6.1 and 10.7.1 where this will be included.

DeanMoiler · ‎08-18-2020

Brilliant, thanks for that Jonathan, I'll subscribe to this bug to await the patch at 10.7.1.

Cheers,

Dean

DeanMoiler · ‎12-15-2020

Hi @JonathanQuinn,

I've seen there is now a "Portal for ArcGIS High Availability and Disaster Recovery Quality Patch" available for 10.6 / 10.7, but it doesn't seem to indicate that the deadlocking issue is resolved.

Can you tell me whether BUG-000121969 was resolved for 10.7.1 in the patch above (perhaps as part of a different bug fix?), or whether the fix is coming in a later patch?

Thanks!

Dean

JonathanQuinn · ‎12-18-2020

Unfortunately, the changes required for that issue could not be ported back to 10.7.1. The issue is resolved at 10.8, so you'll need to upgrade to 10.8 or 10.8.1.

DeanMoiler · ‎12-20-2020

Thanks for the update Jonathan, that's very useful to know.

Have a Merry Christmas!

Dean