Hi Guys,
I'm having an issue with 2 separate (staging/production) 10.7.1 Portal Environments with High Availability coming back online following server restarts.
From some investigation of the log files, it seems to occur when the servers are restarted at the same time, with the logging service on each starting within 0 - 10 seconds of the other. If they start more than 60 seconds apart, the nodes fail over and the issue doesn't seem to present itself.
Following a restart of both servers, the existing master node (Server 1) remains the master node, but the standby node (Server 2) is removed from the shared nodes.config file. Server 2 then seems to attempt to recover itself as standby node after around 3.5 hours if left alone.
During this time the master node's logs report that it is running fine and monitoring the standby nodes:
HA: Node server1.domain.local is configured to be master.
HA: Monitoring the standby nodes.
and Server 2 does not recover its standby node status, with the following repeating every 11 minutes:
HA: Error in HA plugin. java.lang.Exception: Failed to start the Web Server. The startup timed out.
Leaving the nodes.config with just the master node in play:
master.node = server1.domain.local
standby.nodes =
After 3.5 hours Server 2 decides it does want to be standby node, and updates the nodes.config file:
HA: Store the standby node server2.domain.local from the standby.nodes property in the nodes.properties file.
And both end up in play in nodes.config:
master.node = server1.domain.local
standby.nodes = server2.domain.local
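For anyone wanting to watch for this state, here is a rough Python sketch that reads the nodes.config contents and flags the case where a master is listed but no standby is. The function names are made up for illustration, the key = value parsing is inferred only from the file contents shown above, and the comma-separated handling of multiple standby nodes is an assumption on my part.

```python
def parse_nodes_config(text):
    """Parse 'key = value' lines and return (master, list_of_standby_nodes)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    # Assumption: multiple standby nodes would be comma-separated.
    raw_standby = props.get("standby.nodes", "")
    standby = [n.strip() for n in raw_standby.split(",") if n.strip()]
    return props.get("master.node", ""), standby


def standby_missing(text):
    """True when a master is configured but the standby list is empty."""
    master, standby = parse_nodes_config(text)
    return bool(master) and not standby
```

Calling standby_missing() on the text of the shared nodes.config on a schedule would show when Server 2 drops out and when it rejoins.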
The most notable issue here is that the master node web server does not display at all during this "not fully recovered" state, so the portal is essentially offline without the manual intervention of restarting portal services. Neither of the following pages will load:
https://server1.domain.local:7443/arcgis/portaladmin
https://server1.domain.local:7443/arcgis/home
To remedy the issue, simply stopping the Portal for ArcGIS service on Server 2 allows the master node web server to display correctly! For some reason, the standby server not correctly acting as standby causes the master node web server to fail to display at all.
I can see (via netstat -ao) that during the failed state nothing on the local host is LISTENING on port 5701, though connections between the servers show as ESTABLISHED on 5701, so they are somewhat communicating. Once both servers are listed in the nodes.config file, both the local LISTENING socket and the remote ESTABLISHED connections on 5701 are present.
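The netstat check could be scripted along these lines. This is only a sketch: it assumes the usual Windows netstat column layout (Proto, Local Address, Foreign Address, State, PID), and port_states is a hypothetical helper name, not anything shipped with Portal.

```python
def port_states(netstat_output, port=5701):
    """Return the set of TCP states on lines where either address ends in :port."""
    suffix = f":{port}"
    states = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: Proto, Local Address, Foreign Address, State, PID.
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        if local.endswith(suffix) or remote.endswith(suffix):
            states.add(state)
    return states


def looks_degraded(netstat_output, port=5701):
    """True when the port has ESTABLISHED connections but no LISTENING socket."""
    states = port_states(netstat_output, port)
    return "ESTABLISHED" in states and "LISTENING" not in states
```

The input can be captured with subprocess.run(["netstat", "-ao"], capture_output=True, text=True).stdout and passed to looks_degraded() on each server.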
I also see the following occurring in the Catalina logs:

I have disabled McAfee for testing and internal firewalls are disabled, but the error still presented itself following several restarts.
Maybe (hopefully) Jonathan Quinn or someone else in the community may have come across something like this before?
Thanks,
Dean