Problem during federation - attempting single-machine deployment migration/DR strategy

Dixie_MDavis · 04-19-2020

Hello. Hope all are staying safe and sane during this unusual time.

I am having difficulty trying to implement the single-machine deployment approach for a DR strategy as described in this ArcGIS Blog: https://www.esri.com/arcgis-blog/products/arcgis-enterprise/administration/migrate-to-a-new-machine-...

I tested this same strategy successfully last year, though it took me four attempts and some help from this community (@JQuinn-esristaff, specifically) to get it right. I am now at the final stage of implementing our new configuration using this strategy and find myself stuck again. I am having a similar problem trying to restore a webgisdr full backup made from the primary machine to a secondary machine that has been installed and configured with ArcGIS Enterprise 10.7.1 software. Before installing any software on the secondary machine (hostname: enterprise1.domain.com | IP address: 10.0.0.2), I edited the secondary machine's etc/hosts file to map the primary machine's FQDN (enterprise.domain.com) to the secondary machine's IP address (10.0.0.2), so that enterprise.domain.com would resolve to the secondary machine, as indicated in the blog.
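For anyone reproducing the hosts-file step, here is a minimal Python sketch for confirming the mapping. The hostname and IP address are the example values from this post; `parse_hosts` and `resolves_to` are hypothetical helper names, not part of any Esri tooling.

```python
import socket

def parse_hosts(text):
    """Map each hostname in hosts-file text to its IP address (first entry wins)."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name.lower(), ip)
    return mapping

def resolves_to(hostname, expected_ip):
    """True if the OS resolver (which normally consults the hosts file) returns expected_ip."""
    try:
        return socket.gethostbyname(hostname) == expected_ip
    except OSError:
        return False

# Example hosts-file content for the secondary machine (values taken from this post).
sample = """
# primary FQDN pointed at the secondary machine's own IP address
10.0.0.2    enterprise.domain.com
"""
hosts = parse_hosts(sample)
print(hosts.get("enterprise.domain.com"))
```

On the secondary machine itself, `resolves_to("enterprise.domain.com", "10.0.0.2")` would show whether the operating system actually honors the entry.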

I think things break down during the federation, but I am not sure. I suspect this because while I am logged into Portal (accessed successfully using enterprise.domain.com on the secondary machine), I get a login screen for Portal that references the secondary machine's hostname (enterprise1.domain.com). This happens after I have set the server for federation and saved it, and before I set the same server as the hosting server (all using the enterprise.domain.com reference). Checking the Server security configuration within the Administration site shows that the "portalUrl" uses enterprise1.domain.com (the secondary FQDN) rather than enterprise.domain.com (the primary FQDN). So, the "privatePortalUrl" and the "portalUrl" values are not the same. Hence, when I tried a restore, I got the error stating that the public Portal URLs are not the same, and the restore failed.
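The hostname comparison described above can be sketched in Python as follows. The "portalUrl" and "privatePortalUrl" keys come from this post; the surrounding "portalProperties" nesting, the sample URLs, and the function name `hosts_differing_from` are assumptions for illustration, so check them against the JSON your own Server admin site returns.

```python
from urllib.parse import urlsplit

def hosts_differing_from(security_config, expected_host):
    """Return the portal URL keys whose hostname is not expected_host."""
    # Assumption: both URLs sit under a "portalProperties" object.
    props = security_config.get("portalProperties", {})
    found = {}
    for key in ("portalUrl", "privatePortalUrl"):
        host = urlsplit(props.get(key, "")).hostname or ""
        if host != expected_host.lower():
            found[key] = host
    return found

# Hypothetical security configuration resembling the situation in this post.
sample = {
    "portalProperties": {
        "portalUrl": "https://enterprise1.domain.com/portal",
        "privatePortalUrl": "https://enterprise.domain.com:7443/arcgis",
    }
}
print(hosts_differing_from(sample, "enterprise.domain.com"))
```

An empty result would mean both URLs use the primary FQDN, which is what the restore appears to expect.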

Can anyone point out what I may be doing wrong? Is there a required order for installing the software or for applying the SSL certs? I have had problems getting the SSL certs done properly in the past; could this be the issue? We are using a wildcard SSL cert, and I have configured both Portal and Server using the cert as an exported root (*.cer) and as an exported existing cert (*.pfx). I have also used the Portal's checkURL utility against the Server's admin URL and it returns a status code of 200, though it does return false for the "secured" value.

Thanks for your time and attention. Thanks in advance. Best, Dixie.

Dixie_MDavis · 05-20-2020

Attempted another restore on the secondary server as a test. Used the same local webgisdr export and it failed with the same error as seen on 4/29 - "Failed to import site. Failed to delete the database directory." I cannot log on to the Portal home and the Portal admin only allows the upgrade command.

For disaster recovery, do we need to consistently start with a secondary server that has no content but is configured as the primary server?

The documentation mentions setting up scheduled tasks to routinely update the secondary machine with an export from the primary machine. Is the secondary machine in the same state each time, meaning it is a cloned VM with just the configured software? Or is it in the state it was left in after the previous restore, meaning it has content but is out of sync with the primary server?

JonathanQuinn · 05-27-2020

No, the portal can contain items; the restore process will delete everything and restore everything in the backup.

The "failed to delete the database directory" error is strange. Does that happen consistently? If you can, try to monitor the running processes on the machine. During the restore, the PG processes should go away, so that the old DB can be backed up/deleted before the new one is restored.

Dixie_MDavis · 05-26-2020

Tried the suggestion another analyst had to do a datastore restore from our primary server to the standby server prior to trying the full webgisdr import. We could not get the datastore restore to work using either a local copy or by using a file share.

We got one of two errors: 'Could not find a valid backup to restore the data store.' after copying the backup to the default local relational datastore location and using 'most-recent' for the target argument, or 'Failed to create relational data store database '{0}'.' after specifying the backup file name with target.

This weekend we were able to successfully re-run full and incremental backups and restores on our testing environment using the webgisdr. This leads me to believe our issue lies with the datastore and the problem with using upper case names in our original setup for the VMs (as I posted on 4/28). The testing environment does not have upper case host names.

We are considering creating a new set of VMs with lower-case server names to see if we can get success. We may go ahead and use 10.8 to take advantage of the read-only setting for Portal. We will report back if we are successful.

Best, Dixie.

JonathanQuinn · 05-27-2020

Has the analyst tested the capital letters in hostnames theory? They should be able to test that in-house to see if they can reproduce the problem.

Dixie_MDavis · 05-28-2020

Thanks for the suggestion, Jonathan. I have sent a note to our analyst and asked if they could test the theory.

Dixie_MDavis · 06-18-2020

Hi, Jonathan.

I was so hoping that I could post back that using VMs with lowercase hostnames fixed our issues. We have spent almost two weeks recreating our system using new VMs with lowercase names. And, this morning I found the full restore using our new secondary server failed. It reported one of the same errors we saw on an earlier attempt -

Failed to restore the Portal for ArcGIS:
Url: https://<primary server domain name>/<portal wa>.
{"error":{"code":500,"details":null,"message":"Failed to import site. Failed to delete the database directory."}}

We cannot log into the Portal. There are no items in the content area. Portal admin only allows us to Upgrade. There are several dbnnnnnnnnnnnn directories in the arcgisportal directory.

The catalina log does have errors relating to a token:

17-Jun-2020 21:09:52.685 SEVERE [https-jsse-nio-7443-exec-10] com.esri.commons.web.rest.providers.BaseProvider.getValue Token Required.

It also has one noting a servlet error relating to the request and form parameters not being what is expected for the URI:

https://<primary server domain name>/arcgis/portaladmin/importSite/?
"...contains form parameters in the request body but the request body has been consumed by the servlet or a servlet filter accessing the request parameters. Only resource methods using @FormParam will work as expected. Resource methods consuming the request body by other means will not work as expected.
17-Jun-2020 21:20:29.783 SEVERE [https-jsse-nio-7443-exec-8] com.sun.jersey.spi.container.ContainerResponse.mapMappableContainerException The RuntimeException could not be mapped to a response, re-throwing to the HTTP container
com.esri.arcgis.portal.admin.core.PortalException: The portal site has not been initialized. Please create a new site and try again."

The localhost log in the tomcat directory says the portal site has not been initialized. Please create a new site and try again.

The postgresql log has at least three errors stating that a particular .ready file is not found, for example:

2020-06-17 18:23:23 PDT: [4732]: LOG: could not create archive status file "pg_wal/archive_status/00000001000000000000002E.ready": No such file or directory

Looking in the Event Viewer for errors logged prior to the postgres errors, we found one event that stated the "Z:\arcgisportal\db is not a database cluster directory" and one that stated, "pg_ctl: server does not shut down".

There is an event regarding listening on a localhost ipv6 port. Does ipv6 need to be enabled for Enterprise? It is not on our VMs.

We do have a support request logged (#02547962), but I have not heard from the analyst since 6/3 when he emailed to say he was determining if he could create a test environment to test the uppercase host name issue.

I will send an update to our support analyst with the latest error information to find out if he has learned anything new.

Do you have any suggestions on additional things to try? Are there any other logs I should investigate?

Your input and feedback have been extremely helpful throughout this effort. Thanks for any suggestions or ideas you can provide.

Best, Dixie.

JonathanQuinn · 06-19-2020

I'm glad the hostname theory has some substance to it. I'd suggest making sure the analyst tests it and logs a bug if they can reproduce it.

The "failed to delete the db directory" error is a bit misleading. When we restore portal, we want to back up your data so we can roll back if there is an issue. The db folder is renamed to dbXXXX and the db folder from the backup is put in the same folder. We can only rename the db directory if the database is shut down. I'm not sure what you've done since the error was thrown (restarted the portal, etc.), but the next time it happens, check whether any postgres.exe processes are running, and then see when the PID file associated with those PG processes was created under framework\etc\pids. If the modified time of the PID file associated with the PG process is before the time you attempted to restore, then the database wasn't shut down correctly. If the database is still running, then you can't rename the db folder.
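The PID-file timing check described above could be scripted roughly like this in Python. The framework\etc\pids folder is from this post; the file name "postgres.pid", the timestamps, and the function name `pid_files_older_than` are placeholders, and the demo runs against a temporary folder rather than a real Portal install.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path

def pid_files_older_than(pids_dir, restore_started):
    """Return (name, modified time) for PID files last modified before the restore began."""
    stale = []
    for path in sorted(Path(pids_dir).iterdir()):
        if path.is_file():
            modified = datetime.fromtimestamp(path.stat().st_mtime)
            if modified < restore_started:
                stale.append((path.name, modified))
    return stale

# Demo on a throwaway folder; point pids_dir at <Portal install>\framework\etc\pids instead.
with tempfile.TemporaryDirectory() as tmp:
    pid_file = Path(tmp, "postgres.pid")           # hypothetical file name
    pid_file.write_text("4732")
    written = datetime(2020, 6, 17, 18, 0).timestamp()
    os.utime(pid_file, (written, written))         # pretend it was written at 18:00
    stale = pid_files_older_than(tmp, datetime(2020, 6, 17, 21, 0))

print(stale)
```

Any file listed was written before the restore attempt, which by the reasoning above suggests the matching process was never shut down.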

It appears this type of failure causes the rollback logic to fail, potentially because the database is still running. To get the site back to a working state, you can copy the .config-store-connection.json file from the db directory into framework\etc without the leading period. Portal should start working after that.
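As a rough Python sketch of that recovery step: the file name and the two folders are as described above, while the function name `restore_connection_file` and the ".bak" copy of any existing target are additions of this sketch, not part of the documented procedure. The demo uses temporary folders in place of the real db and framework\etc directories.

```python
import shutil
import tempfile
from pathlib import Path

def restore_connection_file(db_dir, etc_dir):
    """Copy .config-store-connection.json from db_dir into etc_dir without the leading period."""
    src = Path(db_dir) / ".config-store-connection.json"
    dst = Path(etc_dir) / "config-store-connection.json"
    if not src.is_file():
        raise FileNotFoundError(src)
    if dst.exists():
        # Precaution added in this sketch: keep a copy of whatever was there before.
        shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
    shutil.copy2(src, dst)
    return dst

# Demo with stand-in folders; substitute the real db and framework\etc paths.
with tempfile.TemporaryDirectory() as tmp:
    db, etc = Path(tmp, "db"), Path(tmp, "etc")
    db.mkdir()
    etc.mkdir()
    (db / ".config-store-connection.json").write_text("{}")
    copied = restore_connection_file(db, etc).name

print(copied)
```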

Check the log files for a message indicating that the database was found to be stopped, and portal is restarting it. That'd be an indication that portal did stop the database, but a timing issue with our internal process checks re-started the database during the restore.

Dixie_MDavis · 06-22-2020

Thanks for your response, Jonathan. Your posts have been very helpful.

I do believe the problem with the restore failing is related to the timing issue you mention. I found in the logs that the portal was restarting the database after it was found to be stopped.

I tried two more times to do a restore and got the same error with failing to delete the database and then got another error we had seen before that stated the database server could not be started and the startup had timed out. Each time I was able to bring the portal back using the method you described with copying over the json connection file. There was still no content, however, since the restore had failed and we were starting with an empty standby server.

I tried a third time to do another full restore with a newer backup from the primary server and it failed again with the same error regarding not being able to delete the database. This time I am not able to login to the Portal, nor can I fix it using the json connection file. Signing in from the Portal home page reports that the account is unauthorized. Trying to use the Portal admin states that the Portal is unavailable and to check the logs. The logs report that there are invalid primary and secondary checkpoint records and that the startup process was terminated. The portal log reports that the internal database is not running nor accepting connections and I should contact ESRI support.

My point in trying it the third time was to try and get to the point where I could restore the Portal content by doing a re-indexing and/or by doing a Portal site import using an export from the primary's Portal.

So, how can we get past the timing issue and the rollback logic that fails?

Is there something I can suggest to the analyst to try to help us work through this? We have had an open support case since the beginning of May and I need to get a resolution soon.

Would the same timing issue come into play if we were to try and restore our primary server with a backup?

Your responses have been invaluable. Thanks again for all the time you have provided. Best, Dixie.

JonathanQuinn · 06-22-2020

I think you need to determine definitively what happens to the postgres processes during a restore. At a very high level, it's:

1) Stop the database
2) Backup the content and database to a temporary location
3) Import the content
4) Restore the database

Steps 3 and 4 may be out of order, but that's the general idea. If the database fails to stop in step 1, then nothing else will work. What I suggest is to ensure that the standby portal is in a working state, take a backup of Portal directly in Portaladmin, and try to restore it. The DR tool isn't doing anything different than that, so it should help to isolate your testing. Once you start the restore, watch the processes on the portal machine, and you should see the postgres.exe processes go away. You may also want to run ProcMon with a filter on the C:\arcgisportal directory so you can see what is happening on the file system during the restore. You can set a filter for "path starts with C:\arcgisportal\db", which should capture anything within C:\arcgisportal\db as well as the backup folder, e.g. C:\arcgisportal\dbXXXX. ProcMon will help log what the issue was when Portal tried to rename C:\arcgisportal\db to C:\arcgisportal\dbXXXX. You may see access denied or some other error, which will help you identify what the problem was.
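One way to watch the postgres.exe processes during a restore without sitting in Task Manager is to poll the Windows `tasklist` command. The Python sketch below makes some assumptions: the function names and sample rows are illustrative, and `watch` only works on Windows, so the demo exercises just the parser.

```python
import csv
import io
import subprocess
import time

def postgres_pids(tasklist_csv, image_name="postgres.exe"):
    """Extract PIDs for image_name from `tasklist /FO CSV /NH` output."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return [int(row[1]) for row in rows
            if len(row) > 1 and row[0].lower() == image_name.lower()]

def watch(interval=2.0, samples=30):
    """Print a timestamped list of postgres.exe PIDs every `interval` seconds (Windows only)."""
    for _ in range(samples):
        out = subprocess.run(
            ["tasklist", "/FO", "CSV", "/NH", "/FI", "IMAGENAME eq postgres.exe"],
            capture_output=True, text=True,
        ).stdout
        print(time.strftime("%H:%M:%S"), postgres_pids(out))
        time.sleep(interval)

# Made-up tasklist rows, used only to exercise the parser.
sample = (
    '"postgres.exe","4732","Services","0","12,340 K"\n'
    '"java.exe","900","Services","0","512,000 K"\n'
    '"postgres.exe","5120","Services","0","8,112 K"\n'
)
print(postgres_pids(sample))
```

Start `watch()` just before kicking off the restore. If the PID list never empties, the database did not stop, which matches the failure described in step 1.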