HA Portal failed to start

Yurii_P · ‎04-10-2024

Hi all.

We have an HA deployment on two Ubuntu servers; each server has its own portal, server, data store, web adaptors, etc. The deployments were installed and configured with the official Chef cookbooks. Everything was working fine until the primary ArcGIS components stopped responding and the standby components started handling connections. For some reason the system wasn't working properly, so we decided to reboot the primary server and disable the ArcGIS components on the standby server so that the primary components would be the ones handling requests.

The problem is that the primary portal is not launching its internal Postgres database. The portal logs show the following:

<Msg time="2024-04-10T07:51:04,907" type="SEVERE" code="218010" source="Portal Admin" process="4127" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>

<Msg time="2024-04-10T07:51:04,907" type="WARNING" code="218012" source="Portal Admin" process="4127" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Error at check and update urls. The portal is unavailable at this time. Please contact your system administrator and check the logs on disk for more information.</Msg>

<Msg time="2024-04-10T07:51:11,304" type="SEVERE" code="218010" source="Portal Admin" process="4127" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>

<Msg time="2024-04-10T07:51:11,588" type="WARNING" code="218014" source="Portal" process="4127" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Starting Index Service.</Msg>

<Msg time="2024-04-10T07:51:16,347" type="SEVERE" code="218010" source="Portal Admin" process="4127" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>

<Msg time="2024-04-10T08:01:35,576" type="WARNING" code="218015" source="Portal" process="4127" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Started Index Service.</Msg>

<Msg time="2024-04-10T08:06:25,166" type="WARNING" code="9999" source="Portal" process="4127" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Unable to find 'HA_MASTER_NODE_DOWN' message in resource file.</Msg>

<Msg time="2024-04-10T08:06:25,166" type="WARNING" code="216028" source="Portal" process="4127" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">HA_MASTER_NODE_DOWN eip-esrieip-esri-ha</Msg>
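When the log is long, it can help to collapse it into one line per message code before reading it in detail. Below is a minimal sketch of that, assuming the log entries follow the `<Msg time="..." type="..." code="..." ...>text</Msg>` layout shown above; the function name `summarize` is hypothetical and not part of any Esri tooling.

```python
import re
from collections import Counter

# Matches the leading attributes and the body of one <Msg ...>...</Msg> entry.
# Assumes the attribute order time, type, code seen in the pasted log.
MSG_RE = re.compile(
    r'<Msg time="(?P<time>[^"]*)" type="(?P<type>[^"]*)" code="(?P<code>[^"]*)"'
    r'[^>]*>(?P<text>.*?)</Msg>',
    re.DOTALL,
)


def summarize(log_text):
    """Count entries per (type, code) and record the first timestamp of each."""
    counts = Counter()
    first_seen = {}
    for match in MSG_RE.finditer(log_text):
        key = (match.group("type"), match.group("code"))
        counts[key] += 1
        # setdefault keeps the earliest timestamp because entries are read in order
        first_seen.setdefault(key, match.group("time"))
    return counts, first_seen
```

Calling `summarize(open(path).read())` on a portal log file and printing `counts.most_common()` would show, for example, how often the SEVERE 218010 entry repeats compared with everything else.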

Portal service logs:

Wed Apr 10 07:24:49.483 UTC 2024 Connection string: jdbc:postgresql://primary-esri:7654,standby-esri-ha:7654/gwdb?targetServerType=master

Wed Apr 10 07:24:49.483 UTC 2024 Connecting to the configuration store using connection string 'jdbc:postgresql://primary-esri:7654,standby-esri-ha:7654/gwdb?targetServerType=master'.

Wed Apr 10 07:24:49.486 UTC 2024 The attempt to connect to the configuration store failed. Attempting reconnection...

Wed Apr 10 07:25:00.355 UTC 2024 Connection string: jdbc:postgresql://primary-esri:7654,standby-esri-ha:7654/gwdb?targetServerType=master

Wed Apr 10 07:25:00.355 UTC 2024 Connecting to the configuration store using connection string 'jdbc:postgresql://primary-esri:7654,standby-esri-ha:7654/gwdb?targetServerType=master'.

Wed Apr 10 07:25:00.357 UTC 2024 The attempt to connect to the configuration store failed. Attempting reconnection...

Wed Apr 10 07:25:05.413 UTC 2024 Connection string: jdbc:postgresql://primary-esri:7654,standby-esri-ha:7654/gwdb?targetServerType=master

Wed Apr 10 07:25:05.414 UTC 2024 Connecting to the configuration store using connection string 'jdbc:postgresql://primary-esri:7654,standby-esri-ha:7654/gwdb?targetServerType=master'.

Wed Apr 10 07:25:05.415 UTC 2024 The attempt to connect to the configuration store failed. Attempting reconnection...

Wed Apr 10 07:30:12.491 UTC 2024 warning: no-jdk distributions that do not bundle a JDK are deprecated and will be removed in a future release

Wed Apr 10 07:30:12.493 UTC 2024 ------------------------------------------------------------------------

Wed Apr 10 07:30:12.493 UTC 2024

Wed Apr 10 07:30:12.493 UTC 2024 WARNING: OpenSearch MUST be stopped before running this tool.

Wed Apr 10 07:30:12.493 UTC 2024

Wed Apr 10 07:30:12.493 UTC 2024 ------------------------------------------------------------------------

Wed Apr 10 07:30:12.493 UTC 2024

Wed Apr 10 07:30:12.493 UTC 2024 You should only run this tool if you have permanently lost all of the

Wed Apr 10 07:30:12.493 UTC 2024 cluster-manager-eligible nodes in this cluster and you cannot restore the cluster

Wed Apr 10 07:30:12.493 UTC 2024 from a snapshot, or you have already unsafely bootstrapped a new cluster

Wed Apr 10 07:30:12.493 UTC 2024 by running `opensearch-node unsafe-bootstrap` on a cluster-manager-eligible

Wed Apr 10 07:30:12.493 UTC 2024 node that belonged to the same cluster as this node. This tool can cause

Wed Apr 10 07:30:12.494 UTC 2024 arbitrary data loss and its use should be your last resort.

Wed Apr 10 07:30:12.494 UTC 2024

Wed Apr 10 07:30:12.494 UTC 2024 Do you want to proceed?

Wed Apr 10 07:30:12.494 UTC 2024

Wed Apr 10 07:30:12.494 UTC 2024 Node was successfully detached from the cluster

Wed Apr 10 07:41:36.297 UTC 2024 warning: no-jdk distributions that do not bundle a JDK are deprecated and will be removed in a future release

Wed Apr 10 07:41:36.298 UTC 2024 ------------------------------------------------------------------------

Wed Apr 10 07:41:36.298 UTC 2024

Wed Apr 10 07:41:36.298 UTC 2024 WARNING: OpenSearch MUST be stopped before running this tool.

Wed Apr 10 07:41:36.298 UTC 2024

Wed Apr 10 07:41:36.298 UTC 2024 ------------------------------------------------------------------------

Wed Apr 10 07:41:36.299 UTC 2024

Wed Apr 10 07:41:36.299 UTC 2024 You should only run this tool if you have permanently lost all of the

Wed Apr 10 07:41:36.299 UTC 2024 cluster-manager-eligible nodes in this cluster and you cannot restore the cluster

Wed Apr 10 07:41:36.299 UTC 2024 from a snapshot, or you have already unsafely bootstrapped a new cluster

Wed Apr 10 07:41:36.299 UTC 2024 by running `opensearch-node unsafe-bootstrap` on a cluster-manager-eligible

Wed Apr 10 07:41:36.299 UTC 2024 node that belonged to the same cluster as this node. This tool can cause

Wed Apr 10 07:41:36.299 UTC 2024 arbitrary data loss and its use should be your last resort.

Wed Apr 10 07:41:36.299 UTC 2024

Wed Apr 10 07:41:36.299 UTC 2024 Do you want to proceed?

Wed Apr 10 07:41:36.299 UTC 2024

Wed Apr 10 07:41:36.300 UTC 2024 Node was successfully detached from the cluster

Running ps -aux | grep portal shows only two portal processes, the portal itself and OpenSearch; the portal Tomcat and portal Postgres processes are missing.

I tried to launch Postgres manually using

/opt/arcgis/portal/framework/runtime/pgsql/bin/postgres -D /opt/arcgis/portal/usr/arcgisportal/db -p 7654

It launched successfully, but after a few minutes the process was terminated.
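To see exactly when the database stops listening, one option is to poll the hosts and port from the connection string in the service log. The following is a rough sketch, assuming a plain TCP connect is a good enough signal; the helper names `hosts_from_jdbc` and `port_accepts_connections` are hypothetical, and a successful connect only shows that something is listening on the port, not that Postgres accepts logins.

```python
import socket


def hosts_from_jdbc(url):
    """Extract (host, port) pairs from a multi-host PostgreSQL JDBC URL.

    Does not handle bracketed IPv6 addresses.
    """
    prefix = "jdbc:postgresql://"
    if not url.startswith(prefix):
        raise ValueError("not a PostgreSQL JDBC URL")
    # Everything between the prefix and the first "/" is the host list.
    authority = url[len(prefix):].split("/", 1)[0]
    pairs = []
    for entry in authority.split(","):
        host, _, port = entry.partition(":")
        # 5432 is the PostgreSQL default when no port is given
        pairs.append((host, int(port) if port else 5432))
    return pairs


def port_accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False
```

Looping over `hosts_from_jdbc(...)` with the connection string from the log and calling `port_accepts_connections` every few seconds would give a timestamped record of when port 7654 goes away on each machine.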

Any ideas how we can solve it?

JonathanQuinn · ‎04-10-2024

<Msg time="2024-04-10T08:06:25,166" type="WARNING" code="216028" source="Portal" process="4127" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">HA_MASTER_NODE_DOWN eip-esrieip-esri-ha</Msg>

This message means that the standby has started up, but the primary machine is not accessible. Since the standby can't get the most up-to-date data from the primary, it won't promote itself to primary. You'll need to stop the standby, start the primary, and troubleshoot why it's not healthy before starting the standby.

Yurii_P · ‎04-11-2024

Hi, thank you for your reply. The standby machine was stopped at that moment. To work around the Postgres connectivity problem, I launched the portal's Postgres manually, connected to it, took a dump of the gwdb database, and restored it on a standalone Postgres instance. Then I modified the portal config file and replaced the connection string to force the portal to connect to that Postgres instance. It looks like it helped a little, because after some time the portal somehow launched its own Postgres and Tomcat. Despite that, the portal is still unavailable and the portal logs contain:

<Msg time="2024-04-11T08:32:14,498" type="WARNING" code="218037" source="Portal Admin" process="160182" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Health Check failed, the portal is not ready.</Msg>
<Msg time="2024-04-11T08:32:24,503" type="WARNING" code="218037" source="Portal Admin" process="160182" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Health Check failed, the portal is not ready.</Msg>
<Msg time="2024-04-11T08:32:34,502" type="WARNING" code="218037" source="Portal Admin" process="160182" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Health Check failed, the portal is not ready.</Msg>
<Msg time="2024-04-11T08:32:44,502" type="WARNING" code="218037" source="Portal Admin" process="160182" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Health Check failed, the portal is not ready.</Msg>
<Msg time="2024-04-11T08:32:54,505" type="WARNING" code="218037" source="Portal Admin" process="160182" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Health Check failed, the portal is not ready.</Msg>
<Msg time="2024-04-11T08:33:04,475" type="WARNING" code="218037" source="Portal Admin" process="160182" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Health Check failed, the portal is not ready.</Msg>
<Msg time="2024-04-11T08:33:04,487" type="WARNING" code="218037" source="Portal Admin" process="160182" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Health Check failed, the portal is not ready.</Msg>
<Msg time="2024-04-11T08:33:14,498" type="WARNING" code="218037" source="Portal Admin" process="160182" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Health Check failed, the portal is not ready.</Msg>
<Msg time="2024-04-11T08:33:24,500" type="WARNING" code="218037" source="Portal Admin" process="160182" thread="1" methodName="" machine="EIP-ESRI" user="" elapsed="" requestID="">Health Check failed, the portal is not ready.</Msg>