Problem during federation - attempting single-machine deployment migration/DR strategy

Dixie_MDavis · ‎04-19-2020

Hello. Hope all are staying safe and sane during this unusual time.

I am having difficulty trying to implement the single-machine deployment approach for a DR strategy as described in this ArcGIS Blog: https://www.esri.com/arcgis-blog/products/arcgis-enterprise/administration/migrate-to-a-new-machine-...

I tested this same strategy successfully last year, though it took me four attempts and some help from this community (@JQuinn-esristaff, specifically) to get it right. I am now at the final stage of implementing our new configuration using this strategy and find myself stuck again. I am having a similar problem restoring a webgisdr full backup made from the primary machine to a secondary machine that has been installed and configured with ArcGIS Enterprise 10.7.1 software. Before installing any software on the secondary machine (hostname: enterprise1.domain.com | IP address: 10.0.0.2), I edited the secondary machine's etc\hosts file to map the primary machine's FQDN (enterprise.domain.com) to the secondary machine's IP address (10.0.0.2), so that enterprise.domain.com would resolve to the secondary machine, as indicated in the blog.
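For anyone following along, the hosts-file mapping described above can be sanity-checked with a small script. This is only a sketch: `parse_hosts` is a hypothetical helper, and the sample text simply mirrors the FQDN and IP address given in the post.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to its IP address (names lowercased)."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name.lower()] = ip
    return mapping


# Illustrative entry for the secondary machine (values taken from the post above).
sample = """
# secondary machine answers to the primary machine's FQDN
10.0.0.2    enterprise.domain.com
"""
hosts = parse_hosts(sample)
print(hosts.get("enterprise.domain.com"))
```

On Windows the file normally lives at C:\Windows\System32\drivers\etc\hosts, so its contents could be read and passed to the same function.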

I think things break down during federation, but I am not sure. I suspect this because while I am logged into Portal (accessed successfully using enterprise.domain.com on the secondary machine), I get a login screen for Portal that references the secondary machine's hostname (enterprise1.domain.com). This happens after I have set the server for federation and saved it and before I set the same server as the hosting server (all using the enterprise.domain.com reference). Checking the Server security configuration within the Administration site shows that the "portalUrl" uses enterprise1.domain.com (the secondary FQDN), rather than enterprise.domain.com (the primary FQDN). So, the "privatePortalUrl" and the "portalUrl" values are not the same. Hence, when I tried a restore I got the error stating that the public Portal URLs are not the same, and the restore failed.
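The mismatch described above can be checked mechanically once the two values have been copied out of the Server security configuration. This is a minimal sketch: `portal_hosts_match` is a hypothetical helper, the flat dictionary layout is an assumption, and the URL paths in the sample are made up for illustration. Only the two key names come from the post.

```python
from urllib.parse import urlparse


def portal_hosts_match(security_config):
    """Return (match, public_host, private_host) for the two Portal URLs."""
    # urlparse(...).hostname is lowercased, so the comparison ignores case.
    public = urlparse(security_config["portalUrl"]).hostname
    private = urlparse(security_config["privatePortalUrl"]).hostname
    return public == private, public, private


# Hypothetical values reproducing the situation in the post.
config = {
    "portalUrl": "https://enterprise1.domain.com/portal",
    "privatePortalUrl": "https://enterprise.domain.com/portal",
}
match, public, private = portal_hosts_match(config)
print(match, public, private)
```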

Can anyone point out what I may be doing wrong? Is there a required order to installing the software or for applying the SSL certs? I have had problems with getting the SSL certs done properly in the past, could this be the issue? We are using a wildcard SSL and I have configured both Portal and Server using the cert as an exported root (*.cer) and as an exported existing cert (*.pfx). I have also used the Portal's checkURL utility against the Server's admin URL and it returns a status code of 200, though it does return false for the "secured" value.

Thanks for your time and attention. Thanks in advance. Best, Dixie.

Dixie_MDavis · ‎04-28-2020

So, I have not been able to solve my issue, but I do think I may know where things are breaking and what is contributing to the problem. Maybe this information will lead to more information or ideas for a resolution from someone.

I believe the problem lies with the configuration of the Data Store. During the preparation of the secondary machine (enterprise1.domain.com), the installation and configuration of every component other than the Data Store (Portal, its Web Adaptor, Server, and its Web Adaptor) work as expected and correctly pick up the hostname of the primary machine (enterprise.domain.com). I believe this because the primary machine's domain name is reflected either in the messages displayed during configuration or in the browser URL.

After the configuration of the Data Store, however, one of the properties (owning system URL) includes the secondary machine's hostname (https://enterprise1/<WA for Server>). It is blank before configuration and then it includes the secondary machine's hostname after configuration. I believe this may be why the restore is not working.

I believe what is contributing to the problem is that the VMs we are using (enterprise1.domain.com and enterprise.domain.com) were set up using hostnames in all capital letters. The primary machine, the secondary machine, and a common box we use for centralized data access all have hostnames in all caps. We have tried unsuccessfully to change the hostnames (via the system property and Active Directory), but Windows treats the two strings (all lowercase and all uppercase) as equivalent, so it will not allow us to apply any changes. We have found a reference that states the ComputerName registry value could be changed (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ComputerName\ActiveComputerName\ComputerName), but I am not sure about the downstream effects of this change. We would have to change both the primary and secondary machine names and potentially the data server. These changes would in effect be server name changes, right? So, it could break our primary Portal, right?

I have verified that the test system I used to successfully install, configure and restore a webgisdr export had all lowercase server names and the Data Store owning system URL was configured on the secondary machine with the primary machine's hostname (https://enterprise.domain.com/<WA for Server>).

I know case-sensitivity matters in some of the technologies that go into the different software components in Enterprise, so that is why I think the core of our problems is related to the server names. I am also, therefore, wary of changing them. If anyone has any suggestions/clues please let me know. I really appreciate your time and attention. Best, Dixie.


JonathanQuinn · ‎04-28-2020

The DR tool shouldn't care about the owning system URL returned by the Data Store. When you registered the Web Adaptor on your new site, what URL did you use to reach the Web Adaptor registration page? Was it enterprise1.domain.com/<wa>/webadaptor or enterprise.domain.com/<wa>/webadaptor?

If you used enterprise1.domain.com to reach the registration page, then that's the URL the Portal will configure for itself, even though the etc\hosts entry is in place to resolve enterprise.domain.com to the IP of the Web Adaptor machine. You need to access the Web Adaptor registration page through enterprise.domain.com.

Dixie_MDavis · ‎04-28-2020

Hi, Jonathan.

Thanks for your response.

For the Portal Web Adaptor, the configuration page was launched using localhost/<wa>/webadaptor. After getting past the security challenges due to non-HTTPS, it resolved to enterprise1.domain.com, but I forced it back to enterprise.domain.com, which stuck. The configuration message for the Portal WA returned the appropriate hostname and URL for the Portal: https://enterprise.domain.com/<wa>/home.

For the Server Web Adaptor, the configuration page launched using localhost/<wa>/webadaptor/server. I left it as localhost during the configuration. When the configuration was complete, it listed the correct server name as being configured - enterprise.domain.com and then reported that the URL to use to access the Services Directory was: https://localhost/<wa>/rest/services.

Is this the problem, then? That I did not use the FQDN for the primary server in the URL for web adaptor configuration pages? I used it within the forms for ArcGIS Server URL and for Portal URL. Please see the uploaded image.

Dixie_MDavis · ‎04-30-2020

After ensuring the primary hostname was part of the URL for launching the Web Adaptor configuration for both Portal and Server, I did successfully start a restore. It failed however after several hours. It completed the restore for Data Store, Server and then failed for Portal. The message was:

Failed to restore the Portal for ArcGIS:
Url: https://enterprise.domain.com/<portal wa>.
{"error":{"code":500,"details":null,"message":"Failed to import site. Failed to delete the database directory."}}

I saw from the Portal logs that it was looking for a path in the arcgisportal content directory that did not exist and there are several additional directories (5 of them) in the arcgisportal directory that begin with db and have a string of numbers after them, like db1588229730671. There is also a backedupContents20200429 directory under the arcgisdatastore directory. The Windows service account running Portal, Server and Data Store services has full NTFS permissions on both the arcgisportal and arcgisdatastore directories as set by the software install. That is not the case for the arcgisserver directory or the main software install directory.
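The 13-digit suffixes on those db directories look like Unix timestamps in milliseconds. That is only a guess, and nothing in this thread confirms it, but if it holds, decoding them would show when each directory was created and which restore attempt it belongs to. A minimal sketch, with `decode_db_dir` as a hypothetical helper:

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_db_dir(name):
    """Interpret 'db<13 digits>' as epoch milliseconds (an assumption); return UTC time or None."""
    m = re.fullmatch(r"db(\d{13})", name)
    if m is None:
        return None
    return EPOCH + timedelta(milliseconds=int(m.group(1)))


# Directory name quoted in the post above.
print(decode_db_dir("db1588229730671"))
```

Under that assumption, the example name decodes to 2020-04-30 06:55:30 UTC.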

Just wondering if anyone has insight into this issue?

Thanks in advance. Best, Dixie.

Dixie_MDavis · ‎04-30-2020

Tried the restore again, but this time copied the backup from the primary server to the secondary machine so that it was local. Updated the webgisdr.properties file and ran the import command again. The restore got further, but failed again. After over 8 hours, it reported that it could not start the database server.

{"error":{"code":500,"details":null,"message":"Failed to import site. java.lang.Exception: Failed to start the database server. The startup timed out. Please check the log file at Z:\\arcgisportal\\logs\\database\\pgsql.log."}}

Failed to restore the Portal for ArcGIS.

I checked the log noted in the error, and it contained only the line "1 file(s) copied." repeated several times.

I have not changed the logging level in the logback.xml file within the webgisdr directory.

Has anyone seen this behavior before? Anyone know how to resolve this so we can get a restore working on our secondary machine? Thanks in advance, Dixie.

JonathanQuinn · ‎05-01-2020

That type of problem requires a bit of digging. When Portal starts the database, it sends the request to start it and waits 5 minutes for it to become available. If it doesn't, Portal reports that the startup timed out. The only way to see why the database failed to start is through the Event Viewer logs. Take the timestamp of the error in Portal and look at the Event Viewer entries from the 5 minutes before it; you should see why the database failed to start. What version are you using? At 10.6 and earlier, you need to make sure that the db directory is the same between your source and target environments.
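As a rough illustration of that 5-minute lookback, the window to search in Event Viewer can be computed from the error timestamp. This is just a sketch: the helper names and the sample timestamps are hypothetical, and only the 5-minute wait comes from the explanation above.

```python
from datetime import datetime, timedelta

STARTUP_TIMEOUT = timedelta(minutes=5)  # wait time described in the reply above


def startup_window(error_time):
    """Return (start, end) of the period in which the database start was attempted."""
    return error_time - STARTUP_TIMEOUT, error_time


def in_window(event_time, error_time):
    """True if an Event Viewer entry falls inside the startup window."""
    start, end = startup_window(error_time)
    return start <= event_time <= end


# Hypothetical timestamps for illustration only.
error_time = datetime(2020, 5, 1, 12, 0, 0)
print(startup_window(error_time))
print(in_window(datetime(2020, 5, 1, 11, 56, 30), error_time))
```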

Dixie_MDavis · ‎05-01-2020

Hi, Jonathan. We are using 10.7.1. I will take a peek at the Event Viewer. The primary and secondary systems should be identical.

Dixie_MDavis · ‎05-02-2020

Here's what I found from the Event Viewer and postgres logs. The Event Viewer and webgisdr logs are in EDT. The postgres log is in PDT. The directory referred to below by the Event Viewer does exist, and the service account running the NT services has full NTFS permissions on it.

Error in webgisdr log - 19:57:48 (This is EDT) Failed to start the database server. The startup timed out.

Event viewer - 7:52:34 PM pg_ctl: server does not shut down
Event viewer - 7:52:40 PM pg_ctl: another server might be running; trying to start server anyway
Event viewer - 7:52:40 PM pg_ctl: could not start server Examine the log output.
Event viewer - 7:57:41 PM pg_ctl: directory "Z:/arcgisportal/db" does not exist

postgresql log
2020-04-30 16:58:44 PDT: [13396]: LOG: database system was interrupted; last known up at 2020-04-30 16:33:43 PDT
2020-04-30 16:58:44 PDT: [13396]: LOG: database system was not properly shut down; automatic recovery in progress
2020-04-30 16:58:44 PDT: [13396]: LOG: redo starts at 0/28027350
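Because the logs above use two different time zones, it may help to put the postgres entries on the same clock as the Event Viewer and webgisdr entries before comparing them. The sketch below uses fixed offsets (PDT = UTC-7, EDT = UTC-4); `pdt_to_edt` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")


def pdt_to_edt(stamp):
    """Convert a 'YYYY-MM-DD HH:MM:SS' postgres timestamp logged in PDT to EDT."""
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=PDT).astimezone(EDT)


# Timestamps taken from the postgresql log above.
for stamp in ("2020-04-30 16:33:43", "2020-04-30 16:58:44"):
    print(stamp, "PDT ->", pdt_to_edt(stamp).strftime("%Y-%m-%d %H:%M:%S"), "EDT")
```

On that basis the recovery entries at 16:58:44 PDT correspond to 19:58:44 EDT, roughly a minute after the 19:57:48 error in the webgisdr log.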

JonathanQuinn · ‎05-05-2020

Is the path actually Z:\arcgisportal\db? Is the Z:\ drive a physical drive on the machine, or is it a mapped drive? Do both environments use the Z:\ drive?