High Availability and Disaster Recovery.pdf

03-08-2019 08:20 AM

This presentation covers concepts associated with minimizing data loss and downtime using out-of-the-box methods, tools, and functionality available within ArcGIS Enterprise. It describes backing up and restoring an environment using the WebGIS DR tool, high availability, and disaster recovery using geographic redundancy.

Comments

Hi Jon -  I don't have a success story to share as of yet, but I am preparing to run a restore for our distributed enterprise environment at 10.6.1. 

Here's the scenario:

On or about March 7 of this year, there was a fiber cut that forced our machines to switch to a backup network. Four guest collaborations from my enterprise to our AGOL organization were lost. The host collaborations remained intact in AGOL. I probably should have run DR yesterday, March 18, 2019, but did not; instead I chose to drop the 4 guest collaborations and re-send invitations. I did that, and the collaborations were recovered and re-synced.

However, perhaps because of that action (I'm not yet certain), yesterday, March 18, 2019, my portal somehow lost its connectivity to its hosting server site. In my portal settings for Servers, I re-validated the hosting and federated sites, but somehow the 'hosting server' became none or null.

As a result, all my hosted feature layers (about 400) no longer appear as 'feature layer (hosted)' but as 'feature layer', as if they were referenced. When I try to overwrite the layers, the Pro sharing module errors out with 'layer with that name already exists'. For now, I plan to go ahead and run a restore as detailed here:

Restore ArcGIS Enterprise—Portal for ArcGIS (10.7) | ArcGIS Enterprise 

I created a toImport.properties file and reset the shared location to:

# The following accounts must have read and write permissions on the shared location:
# 1) The domain account used to run the web GIS software.
# 2) The account to run this tool.


SHARED_LOCATION = C:\TempImport\March-5-2019-6-16-29-PM-EST-FULL.webgissite

I then copied the webgissite file to the same folder, G:\TempImport, and changed the backup location to:

# Specify the Web GIS backup location if you've set the BACKUP_STORE_PROVIDER to FileSystem.
BACKUP_LOCATION = C:\TempImport

If you have any other suggestions or advice, that would be great. Otherwise, hopefully all will go well.

Thanks,

David

Sorry, I meant to add that I set up a temp import directory on the portal machine itself, granting permissions to the domain accounts.

Hey David Coley, sounds like you were in luck in having backups (or not luck, just following best practice). I'm not sure why the hosted service lost its configuration. A similar problem was resolved a number of releases ago.

If everything is on the same machine, setting the SHARED_LOCATION to the C:\ drive path will work. If you're distributed, which you mentioned, it needs to be a shared location that all machines running the software can reach. Let me know how it goes.
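As a rough illustration of the rule above, here is a small Python sketch that reads a webgisdr-style properties file and flags a drive-letter SHARED_LOCATION when the deployment is distributed. The helper names (`parse_properties`, `is_unc_path`, `shared_location_ok`) are hypothetical, and the sketch reads values literally without handling backslash escapes. It is not part of the WebGIS DR tool, and it does not verify that the service accounts can actually read and write the share.

```python
from pathlib import PureWindowsPath


def parse_properties(text):
    """Parse simple KEY = VALUE lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def is_unc_path(path):
    """Return True for network share paths, False for drive-letter paths."""
    # For a UNC path, the "drive" part is the server and share name,
    # which begins with two backslashes.
    return PureWindowsPath(path).drive.startswith("\\\\")


def shared_location_ok(props, distributed):
    """Check SHARED_LOCATION against the single-machine/distributed rule."""
    location = props.get("SHARED_LOCATION", "")
    if not location:
        return False
    # On a single machine a local path works; a distributed deployment
    # needs a location every machine can reach (assumed here to mean UNC).
    return is_unc_path(location) if distributed else True
```

With the values from the post above, `shared_location_ok(props, distributed=True)` returns `False` for a `C:\TempImport` shared location, which matches the advice that a distributed deployment needs a path all machines can reach.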

Hi Jon, thanks for the reply.  Well, our network suffered 2 separate fiber cuts in 3 days.  A backup network took over, at the same time our IT was re-configuring VPN firewall rules.  To say this was a perfect storm of mal-events would be correct.

Yes, I quickly discovered that if I just re-ran webgisdr as an -import command from the same location as my -export, I had no issues and the DR was successful. I did, however, encounter other issues. Hosted feature layers with a dependency (i.e. supporting views) had to be completely removed, along with the views themselves.

Unfortunately, 1 of my servers suddenly became unavailable to the default cluster on the federated site.

I still don't know why that would occur, but I had to work with tech support to remove the machine from the cluster, uninstall and reinstall ArcGIS Server, rejoin the site, re-add our commercial cert, and re-add the machine to the cluster.

After I did that, a second machine that participates in my hosting site also suddenly stopped and became unavailable. I don't know if this is caused by an indexing issue between the portal, the host / fed site, and the data stores, or if the webgisdr -export routine doesn't recognize that there are 2 machines in my clusters.

Just a follow-up. I was able to simply re-join my machine to the hosting site's default cluster. No uninstall / re-install of the ArcGIS Server component was necessary. I informed Akshita in tech support of this. The enterprise is stable.

Overall, I would say that in terms of DR, it is important to understand what actually has happened. Make sure your hosting and federated sites are stable before beginning to recover any items or objects.  If any items have a dependency at time of failure, the dependencies and the sources will need to be re-created and re-added to any webmaps and apps.  I spent the better part of a day chasing down feature layer sharing properties because my apps kept bouncing users to the portal sign-in page.

That being said, it is helpful that when a hosting site is unexpectedly disconnected from a portal, the features do remain as pseudo-referenced features. If that were not the case, dozens of feature layers would instantly become inaccessible to their webmaps, thus crashing apps.

Thanks again Jonathan Quinn‌ for this posting.

Thanks for the update. 

For the views, yes, there's a bug regarding that. It's fixed at 10.7.

As for the other issues, were there any logs that described why it was unavailable? What was the behavior? The DR tool does preserve multi-machine sites, so that shouldn't have been a problem as long as the machines were healthy to begin with. It's not so much the source, or where you back up from; it's more the target, or where you restore to. If the target has two or more machines, they are preserved during the restore. References to machines in the backup don't matter.

For the sharing problem, it sounds like sharing permissions on some layers weren't honored? Is that right? All of that is stored in the internal database, which is backed up and restored, so I'm not sure why you'd have to fix any sharing problems after a restore.

Sure, and that's good news on the views, especially since I see 10.7 is available today. I do not know how my hosting site became unavailable to the portal in the first place. It just happened after I lost 4 collaborations. Perhaps a Nessus port scan caused a communication break, or a loss of the F5; we use F5 to direct traffic, but not for load balancing.

Were there any logs: Monitor simply indicated that data was not being collected for the servers when they became unavailable. 

What was the behavior: When the servers became unavailable to their respective clusters, the environment became very sluggish. Yes, the machines were preserved, but again I have no idea why a machine would 'stop' when it did in fact show server.exe and soc.exe processes running, but only for the system and utility services. No map, image, or gp services were able to run on the server that 'left' the cluster. Once I re-joined the machines to the site, I re-added the Comodo certificate authority (CA) certificate, and we were back.

Again, I'm not sure about the sharing issues either. I had to perform a restore from a 2-week-old backup, so it is possible I had changed something in the interim.
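Since the age of the backup matters here, the sketch below shows one way to estimate it from a file name like the one earlier in this thread (March-5-2019-6-16-29-PM-EST-FULL.webgissite). The `Month-Day-Year-...` pattern is an assumption taken from that single example, not a documented naming convention, and `backup_date` / `backup_age_days` are hypothetical helper names.

```python
import re
from datetime import date

# English month names are listed explicitly so parsing does not depend on locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Assumed pattern: Month-Day-Year at the start of the file name.
NAME_PATTERN = re.compile(r"^([A-Za-z]+)-(\d{1,2})-(\d{4})-")


def backup_date(file_name):
    """Extract the calendar date from a backup file name."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized backup file name: {file_name}")
    month_name, day, year = match.groups()
    if month_name not in MONTHS:
        raise ValueError(f"unknown month name: {month_name}")
    return date(int(year), MONTHS.index(month_name) + 1, int(day))


def backup_age_days(file_name, today):
    """Number of whole days between the backup date and `today`."""
    return (today - backup_date(file_name)).days
```

Anything changed in the portal during that window (sharing settings, new items, collaborations) would not be in the backup, which is one plausible explanation for the sharing differences after the restore.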
