AWS 10.8.1 HA Portal not restarting

11843
28
Jump to solution
08-24-2020 06:47 AM
ShanonLoughton
Esri Contributor

Hi

In AWS Ive deployed ArcGIS Enterprise 10.8.1 using Windows in HA and using ELB, both templates available here.
It works on first deployment and successful CloudFormation create-stack.
However when stopping/restarting Portal service on a given instance, or rebooting the Portal instances (either of them), they fail to restart Portal service, citing internal database not started properly, and symptoms like passwords rejected at /portaladmin login.
Here's one instance's set of logs:
Windows EventViewer

PostgreSQL : The description for Event ID 0 from source PostgreSQL cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer. (repeated, initially Information class, but further back is Error class)
Portal: The description for Event ID 5 from source Portal for ArcGIS cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer. (repeated, initially Information class, but further back is Error class)

"C:\Program Files\ArcGIS\Portal\framework\service\logs\service-error-0.log"

Mon Aug 24 13:01:13.237 UTC 2020 Exception in thread "I/O dispatcher 17" 
Mon Aug 24 13:01:13.237 UTC 2020 java.lang.NoSuchMethodError: org.apache.http.ConnectionClosedException: method 'void <init>()' not found
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:356)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
Mon Aug 24 13:01:13.237 UTC 2020 at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
Mon Aug 24 13:01:13.237 UTC 2020 at java.base/java.lang.Thread.run(Unknown Source)

"C:\arcgisportal\logs\10.72.4.42\portal\portal-20200824.123347-12940-0.0.log"

<Msg time="2020-08-24T12:34:34,908" type="SEVERE" code="218010" source="Portal Admin" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>
<Msg time="2020-08-24T12:34:34,908" type="WARNING" code="218012" source="Portal Admin" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">Error at check and update urls. The portal is currently unavailable. Please contact your portal administrator.</Msg>
<Msg time="2020-08-24T12:35:12,375" type="SEVERE" code="218010" source="Portal Admin" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>
<Msg time="2020-08-24T12:35:18,695" type="SEVERE" code="218010" source="Portal Admin" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>
<Msg time="2020-08-24T12:35:22,903" type="WARNING" code="218014" source="Portal" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">Starting Index Service.</Msg>
<Msg time="2020-08-24T12:36:28,437" type="SEVERE" code="218010" source="Portal Admin" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>
<Msg time="2020-08-24T12:36:49,886" type="SEVERE" code="218010" source="Portal Admin" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>
<Msg time="2020-08-24T12:37:44,483" type="SEVERE" code="218010" source="Portal Admin" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>
<Msg time="2020-08-24T12:38:09,93" type="SEVERE" code="218010" source="Portal Admin" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>
<Msg time="2020-08-24T12:38:41,750" type="WARNING" code="218015" source="Portal" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">Started Index Service.</Msg>
<Msg time="2020-08-24T12:38:41,828" type="WARNING" code="217060" source="Portal" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">The database server was found to be stopped. Re-starting it.</Msg>
<Msg time="2020-08-24T12:38:51,974" type="WARNING" code="217064" source="Portal" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">The web server was found to be stopped. Re-starting it.</Msg>
<Msg time="2020-08-24T12:43:09,108" type="WARNING" code="219999" source="Portal Admin" process="5948" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID=""> Failed to generate a token for user 'portaladmin'. java.lang.Exception: Server returned HTTP status code 404. Requested URL: https://10.72.4.42:7443/arcgis/sharing/generateToken</Msg>
<Msg time="2020-08-24T12:43:12,517" type="WARNING" code="219999" source="Portal Admin" process="5948" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID=""> Failed to generate a token for user 'portaladmin'. java.lang.Exception: Server returned HTTP status code 404. Requested URL: https://10.72.4.42:7443/arcgis/sharing/generateToken</Msg>
<Msg time="2020-08-24T12:43:15,819" type="WARNING" code="219999" source="Portal Admin" process="5948" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID=""> Failed to generate a token for user 'portaladmin'. java.lang.Exception: Server returned HTTP status code 404. Requested URL: https://10.72.4.42:7443/arcgis/sharing/generateToken</Msg>
<Msg time="2020-08-24T12:43:20,950" type="WARNING" code="219999" source="Portal Admin" process="5948" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID=""> Failed to generate a token for user 'portaladmin'. java.lang.Exception: Server returned HTTP status code 404. Requested URL: https://10.72.4.42:7443/arcgis/sharing/generateToken</Msg>
<Msg time="2020-08-24T12:43:31,158" type="WARNING" code="219999" source="Portal Admin" process="5948" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID=""> Failed to generate a token for user 'portaladmin'. java.lang.Exception: Server returned HTTP status code 404. Requested URL: https://10.72.4.42:7443/arcgis/sharing/generateToken</Msg>
<Msg time="2020-08-24T13:01:13,481" type="WARNING" code="217068" source="Portal" process="12940" thread="1" methodName="" machine="10.72.4.42" user="" elapsed="" requestID="">Error in the process monitor watchdog thread. sleep interrupted</Msg>

"C:\arcgisportal\logs\database\pg_log\postgresql-2020-08-24_060707.log"

2020-08-24 05:38:42 PDT: [6272]: LOG: database system was shut down at 2020-08-24 05:32:36 PDT
2020-08-24 05:38:42 PDT: [5824]: LOG: database system is ready to accept connections
2020-08-24 05:38:42 PDT: [10060]: LOG: incomplete startup packet
2020-08-24 05:38:42 PDT: [10180]: FATAL: no pg_hba.conf entry for host "::1", user "arcgis", database "arcgis", SSL off
2020-08-24 05:39:12 PDT: [12016]: LOG: incomplete startup packet
2020-08-24 05:40:12 PDT: [12196]: LOG: incomplete startup packet
2020-08-24 05:41:12 PDT: [12120]: LOG: incomplete startup packet
2020-08-24 05:42:12 PDT: [11232]: LOG: incomplete startup packet (repeated)
2020-08-24 05:42:33 PDT: [11348]: LOG: could not receive data from client: An existing connection was forcibly closed by the remote host. (repeated)

Thanks for any help.

ArcGIS Enterprise Admin Professional, AWS Solutions Architect Professional
28 Replies
JonathanQuinn
Esri Notable Contributor

I'm a bit doubtful that the problem is as simple as stopping and starting the standby. We go through a considerable amount of regression testing and this problem has not been discovered internally. The fact that you had many db folders means that the standby is unable to rejoin; that behavior is not described in the bug. Anyway, we'll need to have a close look at that issue.

0 Kudos
NicolasGIS
Occasional Contributor III

I found some error in Event viewer related to the service account running ArcGIS Enterprise:

"The machine-default permission settings do not grant Local Activation permission for the COM Server application with CLSID
{0358B920-0AC7-461F-98F4-58E32CD89148}
and APPID
{3EB3C877-1F16-487C-9050-104DBCD66683}
to the user MYCOMPANY\MYUSER SID (S-1-5-21-1526224874-1540688658-1361462980-7571414) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool."

CLSID {0358B920-0AC7-461F-98F4-58E32CD89148} refers to Wininet Cache task object

APPID {3EB3C877-1F16-487C-9050-104DBCD66683} refers to WinInetCacheServer

Could it explain why memory cache component couldn't start up ?

Cheers

0 Kudos
MichailMarinakis1
Occasional Contributor II

We have noticed something. In the HA portal, the standby machine that is complaining with "You are not authorized to use this resource." when admin is trying to log in, has lost the SSL Certificates configuration! When we joined this machine to the portal site, we had set correctly the root and server certificate. Screenshot 2020-11-25 153921.png

If we try to update this configuration, then we receive this

MichailMarinakis1_0-1606316921291.png

As a workaround, we have tried to:

  1. stop the standby machine,
  2. restart the primary,
  3. start the standby again

Sometimes this pattern works, sometimes not...

Another point is that the following link: https://standbymachine.domain.com:7443/arcgis/sharing/rest/generateToken returns a 404 error message. Everything works fine for the primary machine.

Any ideas will be very helpful!

0 Kudos
NicolasGIS
Occasional Contributor III

In the folder you mentionned:

<install directory>\framework\runtime\ds\usr\logs\MYVM01,

there is only an 'elasticlog' folder which I believe is only about tileCache datastore.

0 Kudos
SaurabhUpadhyaya
Occasional Contributor

We are deploying ArcGIS Servers(On Premises- Due to Date Security Restriction). Previously we are using Load Balancer to connect both ArcGIS Server and not using any Enterprise Portal. 

Now we have Enterprise Portal and ArcGIS Servers are federated with Portal(Only 1 Portal Server). we are planning to use load balancer and not able to found any esri document/help to add the load balancer.

 

2 ArcGIS Servers

1 Portal

1 Load Balancer

1 Postgres DB Request to all of you please suggest 

0 Kudos
MichailMarinakis1
Occasional Contributor II

Hi!

Same behavior for us with on premise HA portal. 

I have tried different patterns. stopping the standby restarting the primary one but no luck.

Any feedback can be useful. 

Cheers,

Michail

0 Kudos
JonathanQuinn
Esri Notable Contributor

Shanon Loughton‌ can you take a look at the localhost logs under <install directory>\framework\runtime\tomcat\logs? They'll show why the Sharing API is not initializing and returning a 404.

by Anonymous User
Not applicable

I forgot about this post during the Geonet backend changeover... looks like Im in dual user account land too..

The original environment is gone but a similar one I spun up showed similar symptoms of Portal in HA not restart-able. I have not checked if exactly the same situation, however it was resolved by:

  1. Ensuring the Standby machine is stopped first.
  2. And then Primary machine is stopped and started.
  3. Check that Primary machine works functions and is up for a while (>10 mins).
  4. Then start Standby machine and let it presumably sync/catchup/revert whatever.

Prior to the above resolution I also checked  <install directory>\framework\runtime\tomcat\logs - there were a variety of Java errors, SEVERE ones were about:

  • Unauthorized access, No valid token (presumably my attempts)
  • One or more listeners failed to start
  • Could not join cluster shutting down now.

Logs attached, 10.0.0.1 is Primary, 10.0.0.2 is Standby.

@ShanonLoughton 

0 Kudos
Trevor_Hart
Occasional Contributor

This is a very interesting discussion. We have had this as an ongoing issue for over a year now without a reliable fix.

We restart as per @Anonymous User  suggestion and this generally fixes it.

One thing to add here is that Postgres is actually available as I am able to log into the database using another database client.

@Anonymous User your logs show this error;

1-Dec-2020 01:50:28.894 WARNING [main] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesJdbc The web application [arcgis#portaladmin] registered the JDBC driver [org.postgresql.Driver] but failed to unregister it when the web application was stopped. To prevent a memory leak, the JDBC Driver has been forcibly unregistered.

This makes me think that its not the database but rather the (lack of) JDBC driver thats causes the problem.

 
WilliamCraft
MVP Regular Contributor

I'm having the exact same issue/symptoms with my Azure Esri HA deployment.  Everything seems to work fine until I restart the Portal service on either of the two Portal machines.  There doesn't appear to be a workaround in my case.  I've tried this with two entirely clean installations and both times it has happened to me.  Once I restart the Portal, I begin getting 400s and 500s on the Portal Sharing REST application which prevents Portal from behaving as expected.  The logs show all of the things mentioned above in others' posts.  @JonathanQuinn you stated previously "I'm a bit doubtful that the problem is as simple as stopping and starting the standby" but from many of our perspectives restarting the Portal breaks the configuration.  It sounds like this issue occurs for on-premise and cloud deployments of HA.  

0 Kudos