webgisdr import fails at image server restore

TimHaverlandNOAA · ‎08-15-2024

Hi community

I have a tech support case logged on this issue, but wanted to see if the wisdom of this group can help.

Our environment: In our production AWS VPC, we have a 10.9.1 HA enterprise/portal deployment plus a federated image server, built with the Esri CloudFormation templates. Portal, server, and image server are available via maps.domain.gov/portal, /server, and /image respectively. The OS is CentOS 7 (Linux). This deployment is fully ArcGIS-patched.

We need to migrate off of CentOS 7 to Ubuntu, so I've set up a nearly identical 10.9.1/Ubuntu 20.04 deployment using the same CloudFormation templates in an isolated development VPC. This target is also fully ArcGIS-patched.

I am trying to do a webgisdr export from our production system and import it to our development system. The export operation seems to go fine and the backup file is copied to S3.

The webgisdr PORTAL_ADMIN_URL I'm using is https://maps.domain.gov/portal

Upon webgisdr import, the data store and ArcGIS Server appear to restore fine, but when the restore gets to the image server, webgisdr fails with the following error:

2024-08-12 23:12:04 ERROR [pool-3-thread-3] com.esri.arcgis.webgis.component.service.impl.ServerDRService - {"code":500,"messages":["Import site failed with the following error. Server machine 'https://maps.domain.gov/server/admin/' returned an error. 'Unauthorized access. Token not found. You can generate a token using the 'generateToken' operation.'"],"status":"error"}
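
For anyone triaging a similar failure, the JSON payload at the end of that log line can be pulled apart programmatically. This is a minimal Python sketch, not an Esri tool: the regular expression is inferred only from the single log line quoted above, and the names `parse_webgisdr_error` and `failing_machine` are hypothetical.

```python
import json
import re

# Assumption: layout inferred from the one log line quoted in this post;
# other webgisdr log lines may not follow it.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<logger>\S+) - "
    r"(?P<payload>\{.*\})\s*$"
)
MACHINE = re.compile(r"Server machine '([^']+)'")


def parse_webgisdr_error(line):
    """Return the log fields with the JSON payload decoded, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        fields["payload"] = json.loads(fields["payload"])
    except json.JSONDecodeError:
        return None
    return fields


def failing_machine(payload):
    """Return the URL named after "Server machine" in the payload messages, or None."""
    for message in payload.get("messages", []):
        found = MACHINE.search(message)
        if found:
            return found.group(1)
    return None


# The log line from this post, split across literals only for readability.
sample = (
    '2024-08-12 23:12:04 ERROR [pool-3-thread-3] '
    'com.esri.arcgis.webgis.component.service.impl.ServerDRService - '
    '{"code":500,"messages":["Import site failed with the following error. '
    "Server machine 'https://maps.domain.gov/server/admin/' returned an error. "
    "'Unauthorized access. Token not found. You can generate a token using "
    "the 'generateToken' operation.'"
    '"],"status":"error"}'
)

entry = parse_webgisdr_error(sample)
print(entry["timestamp"], entry["payload"]["code"], failing_machine(entry["payload"]))
```

In the quoted line, the machine named in the message is the /server admin endpoint, even though the failure surfaced while the image server was being restored.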

Wondering if anyone has seen this kind of error before.

JonathanEpstein · ‎08-15-2024

In my experience, originally obtained from a wonderful Esri Professional Services engineer, it's imperative that the exporting and importing systems (arcserver, arcportal, datastore, etc.) share exactly the same hostnames. You can/should finagle this by using /etc/hosts entries in both environments.

So even though arcserver01-private.maps.domain.gov in the old cluster had (say) IP 172.168.1.99 and in the new cluster it has IP 172.168.2.37, you should finagle this difference via /etc/hosts files, with one version of /etc/hosts on the old cluster and a different version on the new cluster.
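
One way to sanity-check that approach is to compare the hostnames defined in each cluster's /etc/hosts. The sketch below is only an illustration in Python: `parse_hosts` and `compare_hostnames` are hypothetical helper names, and the sample entries reuse the example hostname and IPs from this post rather than real values.

```python
def parse_hosts(text):
    """Map each hostname or alias in /etc/hosts-style text to its IP address."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Keep the first mapping seen for a given name.
            mapping.setdefault(name.lower(), ip)
    return mapping


def compare_hostnames(old_hosts_text, new_hosts_text):
    """Report names present on only one side, and (old_ip, new_ip) for shared names."""
    old = parse_hosts(old_hosts_text)
    new = parse_hosts(new_hosts_text)
    return {
        "missing_on_new": sorted(set(old) - set(new)),
        "missing_on_old": sorted(set(new) - set(old)),
        "shared": {name: (old[name], new[name]) for name in sorted(set(old) & set(new))},
    }


# Illustrative content only, based on the example in this post.
OLD_HOSTS = "172.168.1.99  arcserver01-private.maps.domain.gov\n"
NEW_HOSTS = "172.168.2.37  arcserver01-private.maps.domain.gov  # new cluster\n"

print(compare_hostnames(OLD_HOSTS, NEW_HOSTS))
```

An empty `missing_on_new` list means every hostname known to the old cluster is also defined on the new one, with the IPs free to differ.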

TimHaverlandNOAA · ‎08-15-2024

Thank you for that tip Jonathan. It's a little disconcerting that when one uses the cloudformation templates to build enterprise/portal, the individual server and portal machines are identified by IP address and not hostname. The IPs are different between my production and development environments, so I suppose it's possible that this is causing an issue. Will keep this on my short list of things to consider.

TimHaverlandNOAA · ‎08-20-2024

Giving this a bump. It seems like others attempting webgisdr have invoked the name of @JonathanQuinn. Jonathan, have you seen an error like the above when restoring a federated image server?

I dug into the tomcat logs and found that when the "Token not found" error occurs, the following request was made:

"POST /server/admin/data/trustedServers HTTP/1.1" 200 144

Not sure if that's a clue or the point at which an upstream problem finally breaks the process.
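
In case it helps others dig through the same logs, here is a rough Python sketch for pulling the admin requests out of a Tomcat access log. It assumes the request, status, and size fields look like the fragment quoted above (I am treating the trailing 144 as the response size); the rest of the line format and the function names are assumptions.

```python
import re

# Matches the quoted request plus the two numbers that follow it,
# e.g. "POST /server/admin/data/trustedServers HTTP/1.1" 200 144
REQUEST = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>HTTP/[\d.]+)" '
    r"(?P<status>\d{3}) (?P<size>\d+|-)"
)


def parse_request(line):
    """Return method, path, protocol, status, and size from an access log line, or None."""
    match = REQUEST.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def admin_requests(lines, prefix="/server/admin/"):
    """Yield parsed entries whose path starts with the given prefix."""
    for line in lines:
        entry = parse_request(line)
        if entry is not None and entry["path"].startswith(prefix):
            yield entry


fragment = '"POST /server/admin/data/trustedServers HTTP/1.1" 200 144'
for entry in admin_requests([fragment]):
    print(entry["method"], entry["path"], entry["status"], entry["size"])
```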

I checked to make sure my admin username and password were the same on the source and target systems.
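
For reference, one way to confirm the same credentials work against each portal is to request a token from the portal's `sharing/rest/generateToken` endpoint. This is a sketch under assumptions, not what webgisdr does internally: the function names, the 5-minute expiration, and the use of the portal URL as referer are my own choices, and `credentials_work` has to run from a machine that can reach the portal in question.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_token_request(portal_url, username, password, expiration_minutes=5):
    """Build (but do not send) a POST to the portal generateToken endpoint."""
    endpoint = portal_url.rstrip("/") + "/sharing/rest/generateToken"
    form = {
        "username": username,
        "password": password,
        "client": "referer",
        "referer": portal_url,  # assumption: the portal URL itself is used as referer
        "expiration": str(expiration_minutes),
        "f": "json",
    }
    return Request(endpoint, data=urlencode(form).encode("ascii"), method="POST")


def credentials_work(portal_url, username, password, timeout=30):
    """Send the request and report whether the JSON response contains a token."""
    request = build_token_request(portal_url, username, password)
    with urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return "token" in body


# Building the request is safe to do offline; nothing is sent here.
request = build_token_request("https://maps.domain.gov/portal", "admin", "placeholder")
print(request.get_method(), request.full_url)
```

Checking for the `token` key in the body, rather than relying on the HTTP status, avoids being misled by an error that comes back with a 200.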

JonathanEpstein · ‎08-20-2024

I'm the wrong Jonathan, I know, but I searched our corporate Slack for your error string "Unauthorized access. Token not found. You can generate a token using the 'generateToken'" and found:

511 [2023-10-05T23:39:15+00:00] DEBUG: Response: 200 {"status":"error","messages":["Could not find resource or operation 'REDACTED0-arcserver-02-c.REDACTED1.REDACTED2.net' on the system."],"code":404}

And my notes there say "a 404 error masquerading as a 200".

I've also seen a 499 error in this context.

HTH.
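
To make the "404 masquerading as a 200" point concrete, here is a small Python sketch that reads the status out of the JSON body instead of trusting the HTTP status line. The first body shape is the one quoted above; the nested `{"error": {"code": ...}}` shape is an assumption about how other ArcGIS REST responses report errors, and `effective_status` is a hypothetical name.

```python
import json


def effective_status(http_status, body_text):
    """Return the error code reported inside the JSON body, else the HTTP status."""
    try:
        body = json.loads(body_text)
    except (TypeError, ValueError):
        return http_status  # not JSON, so the HTTP status is all we have
    if isinstance(body, dict):
        # Shape quoted in this thread: {"status":"error","messages":[...],"code":404}
        if body.get("status") == "error" and isinstance(body.get("code"), int):
            return body["code"]
        # Assumed alternative shape: {"error": {"code": 499, "message": "..."}}
        error = body.get("error")
        if isinstance(error, dict) and isinstance(error.get("code"), int):
            return error["code"]
    return http_status


quoted_body = (
    '{"status":"error","messages":["Could not find resource or operation '
    "'REDACTED' on the system.\"],\"code\":404}"
)
print(effective_status(200, quoted_body))
```

The same check could be applied to the body returned for the trustedServers request that Tomcat logged with a 200.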

JonathanQuinn · ‎08-20-2024

You may be running into BUG-000150335, "The restore of federated servers with additional server functions may fail while the hosting server is being restored," which is fixed at 11.0. The restores of all federated servers, including the hosting server, start at the same time. However, internal operations called during a restore depend on the hosting server (e.g. adding to trusted server lists, managing relational drivers). If the hosting server is itself being restored and is unavailable, those internal operations may fail, resulting in obscure errors.

As for the hostname thing, the external/public URLs do have to be the same. If you take a backup from https://enterprise.domain.com/portal you have to restore to https://enterprise.domain.com/portal, but the internal machine names can be different. This has been the case since 10.5.1.

https://enterprise.arcgis.com/en/portal/latest/administer/linux/overview-disaster-recovery-replicati...
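
As a quick illustration of the "same public URL" requirement, the Python sketch below compares the source and target portal URLs after normalizing case, default ports, and trailing slashes. The normalization rules and the name `same_public_url` are my assumptions for illustration; webgisdr's own comparison may be stricter or looser.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"https": 443, "http": 80}


def normalize(url):
    """Reduce a URL to (scheme, host, port, path) with case and defaults folded."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port, parts.path.rstrip("/"))


def same_public_url(source_url, target_url):
    """Return True when both URLs point at the same scheme, host, port, and path."""
    return normalize(source_url) == normalize(target_url)


print(same_public_url("https://enterprise.domain.com/portal",
                      "https://enterprise.domain.com:443/portal/"))
print(same_public_url("https://enterprise.domain.com/portal",
                      "https://dev.domain.com/portal"))
```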

TimHaverlandNOAA · ‎08-27-2024

@JonathanQuinn I would be surprised if this was related to BUG-000150335. The webgisdr logs show that the hosting server is completely restored 20 minutes before the image server restore fails.

Odd thing is, I can do a webgisdr import/export on a development HA enterprise/portal plus image server system (both source and target systems built with the same cloudformation template) that has a small variety of feature, map image, and imagery services.

But when trying to import our production data I get this failure.

Tim

TimHaverlandNOAA · ‎11-08-2024

@JonathanQuinn @JonathanEpstein Still no luck with webgisdr import. I have a feeling that taking the system out of highly-available mode might help simplify things and allow webgisdr import to complete. Do you know if there's a way to temporarily shut down or remove standby components (portal, datastore, 2nd arcgis server machine) to see if that allows webgisdr import to complete; then add those components back in after the restore?