Problem during federation - attempting single-machine deployment migration/DR strategy

04-19-2020 07:17 AM
Dixie_MDavis
Occasional Contributor

Hello. Hope all are staying safe and sane during this unusual time.

I am having difficulty trying to implement the single-machine deployment approach for a DR strategy as described in this ArcGIS Blog: https://www.esri.com/arcgis-blog/products/arcgis-enterprise/administration/migrate-to-a-new-machine-...

I tested this same strategy successfully last year, though it took me four attempts and some help from this community (@JQuinn-esristaff, specifically) to get it right. I am now at the final stage of implementing our new configuration using this strategy and find myself stuck again. I am having a similar problem restoring a webgisdr full backup made on the primary machine to a secondary machine that has been installed and configured with ArcGIS Enterprise 10.7.1 software. Before installing any software on the secondary machine (hostname - enterprise1.domain.com | IP address - 10.0.0.2), I edited the secondary machine's etc/hosts file to map the primary machine's FQDN (enterprise.domain.com) to the secondary machine's IP address (10.0.0.2), so that enterprise.domain.com would resolve to the secondary machine, as indicated in the blog.
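For anyone following along, the hosts-file mapping described above can be sanity-checked with a short script. This is only a minimal sketch and not part of the blog's procedure: the function names `parse_hosts` and `hosts_maps_to` are hypothetical, and the sample entry reuses the placeholder hostname and IP address from this post.

```python
def parse_hosts(text):
    """Return a dict mapping each hostname (lowercased) to its IP from hosts-file text."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        parts = line.split()
        ip, names = parts[0], parts[1:]
        for name in names:
            # Keep the first entry seen for a name (an assumption about resolver behavior).
            mapping.setdefault(name.lower(), ip)
    return mapping


def hosts_maps_to(text, hostname, expected_ip):
    """True if the hosts-file text maps hostname to expected_ip."""
    return parse_hosts(text).get(hostname.lower()) == expected_ip


# Placeholder entry: the secondary machine's IP answering to the primary FQDN.
sample = """
# DR mapping on the secondary machine
10.0.0.2    enterprise.domain.com
"""
print(hosts_maps_to(sample, "enterprise.domain.com", "10.0.0.2"))
```

In practice the text would be read from the machine's actual hosts file; the check only confirms the entry is present, not that every service on the machine honors it.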

I think things break down during federation, but I am not sure. I suspect this because, while I am logged into Portal (accessed successfully using enterprise.domain.com on the secondary machine), I get a Portal login screen that references the secondary machine's hostname (enterprise1.domain.com). This happens after I have set and saved the server for federation and before I set the same server as the hosting server (all using the enterprise.domain.com reference). Checking the Server security configuration within the Administration site shows that the "portalUrl" uses enterprise1.domain.com (the secondary FQDN) rather than enterprise.domain.com (the primary FQDN). So the "privatePortalUrl" and the "portalUrl" values are not the same. Hence, when I tried a restore, it failed with an error stating that the public Portal URLs are not the same.
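The mismatch described above can be spotted quickly by comparing the hostnames in the two URLs. The following is a minimal sketch under stated assumptions: it assumes the Server security configuration JSON carries a "portalProperties" object holding "portalUrl" and "privatePortalUrl", and the port and path in the sample values are illustrative only. The function names are hypothetical. It checks only for the symptom reported here and does not claim the two values must match in every deployment.

```python
from urllib.parse import urlparse


def portal_url_hosts(security_config):
    """Extract the hostname from portalUrl and privatePortalUrl (empty string if missing)."""
    props = security_config.get("portalProperties", {})  # assumed key name
    return {
        key: (urlparse(props.get(key, "")).hostname or "")
        for key in ("portalUrl", "privatePortalUrl")
    }


def hosts_match(security_config):
    """True only if both URLs are present and point at the same hostname."""
    hosts = portal_url_hosts(security_config)
    return bool(hosts["portalUrl"]) and hosts["portalUrl"] == hosts["privatePortalUrl"]


# Illustrative values modeled on the symptom in this post.
config = {
    "portalProperties": {
        "portalUrl": "https://enterprise1.domain.com:7443/arcgis",
        "privatePortalUrl": "https://enterprise.domain.com:7443/arcgis",
    }
}
print(portal_url_hosts(config))
print(hosts_match(config))
```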

Can anyone point out what I may be doing wrong? Is there a required order for installing the software or for applying the SSL certs? I have had problems getting the SSL certs set up properly in the past; could this be the issue? We are using a wildcard SSL certificate, and I have configured both Portal and Server using the cert as an exported root (*.cer) and as an exported existing cert (*.pfx). I have also used the Portal's checkURL utility against the Server's admin URL and it returns a status code of 200, though it does return false for the "secured" value.

Thanks for your time and attention. Thanks in advance. Best, Dixie.

28 Replies
Dixie_MDavis
Occasional Contributor

Hi, Jonathan.

Yes, the path "Z:\arcgisportal\db" exists and it is a physical drive.  Both the primary and secondary machines use the Z:\ drive for the software install and related directories.  

Dixie_MDavis
Occasional Contributor

I will log a support request to find out what might be going wrong. I feel like I am missing something, because it doesn't seem like it should be so difficult to get the restore completed. We have a relatively simple setup with a single machine and not a great deal of content in our Portal or being served from our Server.

Dixie_MDavis
Occasional Contributor

So, we have logged a support request and the analyst advised doing an export of the Portal site from the Portal's administrative REST endpoint.  We did that and then imported the export into the secondary machine's Portal administrative REST endpoint.  The process seems to have completed but there is no content.  The Portal is accessible, members and groups are present, but no content.  Does the absence of content indicate the process was not successful?  Is there a way to hook up the content somehow?

JonathanQuinn
Esri Notable Contributor

Do you see item ID folders within the content directory? Does the number of folders match between the source and target environments? Does the index match up? There's validation in the import that makes sure the content in the backup matches what was restored. You can try extracting the backup and inspecting it to make sure it contains the folders you'd expect, by comparing the number of folders to the source environment.
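The folder comparison suggested above can be scripted rather than counted by hand. This is a minimal sketch, assuming each item is stored as one sub-directory of whichever directory is passed in; the function names are hypothetical, and the paths in the usage comment are placeholders rather than paths confirmed in this thread.

```python
from pathlib import Path


def item_folders(content_dir):
    """Return the set of immediate sub-directory names under content_dir."""
    return {p.name for p in Path(content_dir).iterdir() if p.is_dir()}


def compare_item_folders(source_dir, target_dir):
    """Summarize folder counts and the names present on only one side."""
    source, target = item_folders(source_dir), item_folders(target_dir)
    return {
        "source_count": len(source),
        "target_count": len(target),
        "only_in_source": sorted(source - target),
        "only_in_target": sorted(target - source),
    }


# Usage (placeholder paths; point these at the directories holding item ID folders):
# report = compare_item_folders(r"\\primary\share\items", r"Z:\restored\items")
# print(report["source_count"], report["target_count"])
```

Listing the names that exist on only one side is more useful than a raw count, since it shows which items were created after the backup or failed to restore.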

Dixie_MDavis
Occasional Contributor

Hi, Jonathan. Yes, there are item ID folders in the content directory on the secondary machine. The number is not the same, however. The secondary machine has three more items. How do I check the index to tell if it matches up?

As for extracting the backup and inspecting the number of folders, the number of folders in the webgisdr full backup and the number in the portal admin export do match up, but they have roughly 300 fewer items.    

JonathanQuinn
Esri Notable Contributor

The index can be checked via the Portal Administrator Directory, https://portal.domain.com:7443/arcgis/portaladmin/system/indexer:

Indexer Status—ArcGIS REST API: Administer your portal | ArcGIS for Developers 

That will tell you if the content is there but the index (which populates search results) isn't aware of it. You can reindex the portal and see if that helps, if the index does turn out to be out of sync.

Reindex—ArcGIS REST API: Administer your portal | ArcGIS for Developers 

Depending on when you took the backup, there's a chance that new items were created in the original environment. If the difference was drastic, then that'd be a problem.
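To make the index check concrete, here is a minimal sketch that flags out-of-sync indexes in an indexer status response. It assumes the response is JSON with an "indexes" list whose entries carry "name", "databaseCount", and "indexCount"; those field names are an assumption to verify against the REST API documentation, the function name is hypothetical, and the counts in the sample are invented for illustration.

```python
def out_of_sync(indexer_status):
    """Return {name: (databaseCount, indexCount)} for indexes whose counts differ."""
    mismatched = {}
    for entry in indexer_status.get("indexes", []):  # assumed response shape
        db_count = entry.get("databaseCount")
        idx_count = entry.get("indexCount")
        if db_count != idx_count:
            mismatched[entry.get("name")] = (db_count, idx_count)
    return mismatched


# Invented sample: content exists in the database but is missing from the search index.
status = {
    "indexes": [
        {"name": "users", "databaseCount": 12, "indexCount": 12},
        {"name": "groups", "databaseCount": 5, "indexCount": 5},
        {"name": "search", "databaseCount": 340, "indexCount": 0},
    ]
}
print(out_of_sync(status))
```

An empty result would suggest the index and database agree; any entry returned is a candidate for a reindex.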

Dixie_MDavis
Occasional Contributor

Wow!  Thanks so much, Jonathan.  It appears that running the reindex worked.  The status returned an equal count for the users, groups and search (content?).  I ran reindex in full mode and then content was available.

So, since this was a test to restore the primary server, I need to try it again because I need a repeatable process that I can automate.  Also, there are about a dozen dbxxxxxxxxxxxxx sub-directories in the arcgisportal directory.  I will create another clone from the secondary's snapshot and begin again.  I will make sure I create a new backup with the webgisdr utility and try to import it as soon as possible during off-hours.  If the webgisdr import fails again, is it standard practice to try the portal endpoint export as I was advised by Support?

Thanks again for your patience and all your responses. We really want to get a repeatable, reliable process to act as our backup/disaster recovery strategy.

Best, Dixie.

JonathanQuinn
Esri Notable Contributor

The dbXXXXX folders are just backups of the existing database that's there. You can delete them once you've confirmed that the restore worked.
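Before deleting anything, it may help to list the candidate folders first. This is a minimal, list-only sketch: it assumes the backup folders sit next to the live "db" folder and have names that start with "db" followed by more characters, which is inferred from the "dbXXXXX" description rather than from documentation. The function name and the example path are hypothetical.

```python
from pathlib import Path


def find_db_backup_folders(portal_dir, live_name="db"):
    """List sub-directories named like 'db<suffix>', excluding the live 'db' folder itself."""
    root = Path(portal_dir)
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and p.name.startswith(live_name) and p.name != live_name
    )


# Usage (placeholder path): review the list by hand; this sketch deletes nothing.
# for name in find_db_backup_folders(r"Z:\arcgisportal"):
#     print(name)
```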

Trying the export/import in Portaladmin directly is a good troubleshooting step because you don't need to restore Data Store and Server to see if the Portal restore is going to fail. We plan to make that easier with the DR tool, but that work isn't in yet.

Yes, the DR tool needs to be consistently successful to be considered a viable DR approach, so this feedback is helpful.

Dixie_MDavis
Occasional Contributor

Hello again, Jonathan.

In the hopefully helpful feedback department, we conducted another restore test, and this time we got a different set of messages in the webgisdr log (see images below). The failure to validate servers concerned us, and we did see errors in the Server Manager logs, but the Data Store looked good after a describe. Initially we could not log into the Portal home, but after checking the status of the index and running a reindex from the Portal admin endpoint, we were able to log in and view content. We have not done an exhaustive test, but except for the Identity Store not being configured, everything looks good.

Screen shot of webgisdr log

Screen shot of ArcGIS Manager log

I am curious to know if others are finding more consistent results.  I also would like to know if others with a fairly simple, single-machine (VM) deployment such as ours are using any other tools, such as VM replication?  I will start a separate discussion thread.


Thanks for all your help!

JonathanQuinn
Esri Notable Contributor

The "Failed to validate servers" error is likely the result of a bug that causes multiple processes to start within Portal. They conflict with each other, which means the site doesn't work quite right. Restarting the Portal resolves that problem. The bug is fixed at 10.8. The errors in the Server logs can be ignored; they don't indicate a problem. We're looking into cleaning those errors up.
