Proper shutdown/restart order for multi-machine HA ArcGIS Enterprise deployment?

JamesGough · ‎08-16-2022

We have a multi-machine highly-available ArcGIS Enterprise deployment, deployed on premises. When our infrastructure team does server patching and needs to restart all of the machines, what is the proper shutdown/restart order to avoid issues?

For example, patching occurred last night and this morning the portal site was not working. I had to restart both portal nodes to get it back up. From the logs, it looks like the portal servers were restarted before the file server (see table below), so the portal threw these severe errors while the file server was down:

Cannot write to directory path '\\<fileserver machine>\esri\ArcGIS_Portal\arcgisportal'. Please check that the location is valid and that the Portal service account has permissions to the location.

HA: Error in HA plugin. Cannot read from directory path '\\<fileserver machine>\esri\ArcGIS_Portal\arcgisportal'. Please check that the location is valid and that the Portal service account has permissions to the location.

It seems like the portal nodes never recovered from this error until I restarted them.

Our architecture basically looks like this (all components are on Windows Server 2016 machines running ArcGIS 10.8.1):

Machines	Role
portal-01, portal-02	Portal
hosting-01, hosting-02	ArcGIS Server - hosting site
federated-01, federated-02	ArcGIS Server - federated
image-01	ArcGIS Server - federated (image server)
raster-01	ArcGIS Server - federated (raster)
datastore-01, datastore-02	ArcGIS Data Store (relational)
fileserv-01	Windows file share (used for all shared directories for portal and server sites)
geoanalysis-01, geoanalysis-02, geoanalysis-03	ArcGIS Server - federated (GeoAnalytics Server)
bigdata-01, bigdata-02, bigdata-03	ArcGIS Data Store (spatiotemporal big data store)
geoevent-01	ArcGIS Server - federated (GeoEvent Server)

George_Thompson · ‎08-16-2022

Not an expert on this but here is a link to some documentation on best practices: https://enterprise.arcgis.com/en/portal/10.8/administer/windows/apply-patches-and-updates-to-highly-...

--- George T.

JamesGough · ‎08-16-2022

Thanks, that documentation is helpful but it only addresses restarting the individual components, not the whole environment. I think the main concern is the file share where all of the shared directories are. My intuition is that the file share server needs to be restarted first, followed by all of the other components.

But I'm unsure what the whole procedure should be. Do all of the components need to be shut down, and then only started after the file server has been fully restarted? Or can the components be left running, and then just restarted after the file server is restarted?

George_Thompson · ‎08-16-2022

Ah, OK. I am not sure about that one. Hopefully someone can chime in on that topic.

--- George T.

JamesGough · ‎08-16-2022

Thanks anyway. Maybe @JonathanQuinn has some recommendations.

ChristopherPawlyszyn · ‎08-16-2022

Preferably, the dependent machines would be shut down while restarting the file server (or at least the services stopped for those components), then brought back up after the file server becomes available on the network again.
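
To illustrate "brought back up after the file server becomes available", a startup script could poll the shared directory before starting the dependent services. This is only a minimal Python sketch: the function name `wait_for_share`, the timeout and polling interval, and the idea of gating service startup on it are all assumptions, not anything that ships with ArcGIS Enterprise.

```python
import os
import time


def wait_for_share(path, timeout_s=600, interval_s=10,
                   exists=os.path.isdir, sleep=time.sleep, clock=time.monotonic):
    """Poll until `path` is reachable or `timeout_s` elapses.

    Returns True if the directory became available, False on timeout.
    `exists`, `sleep`, and `clock` are injectable so the loop can be
    exercised without a real file share.
    """
    deadline = clock() + timeout_s
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A script might call `wait_for_share(r"\\<fileserver machine>\esri\ArcGIS_Portal\arcgisportal")` and only start the Portal service if it returns True. Note that this checks that the path is visible, not that the service account can write to it.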

More practically speaking, and as you saw when you restarted Portal for ArcGIS on both machines, restarting the components after the file server is back should bring things back to a working state.

Another point to consider: if your shared/backup locations for the ArcGIS Data Store configuration or WebGISDR backups are located on the same file server, you wouldn't want to restart it during the automatic/scheduled backup time(s).
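
One way to sanity-check a patching window against the backup schedule is a simple interval-overlap test. The sketch below is illustrative only: the function name `overlaps_backup` is hypothetical, and the backup start time and duration would have to come from your own Data Store and WebGISDR schedules.

```python
from datetime import datetime, timedelta


def overlaps_backup(restart_start, restart_minutes, backup_start, backup_minutes):
    """Return True if the restart window intersects the backup window."""
    restart_end = restart_start + timedelta(minutes=restart_minutes)
    backup_end = backup_start + timedelta(minutes=backup_minutes)
    # Two half-open intervals overlap when each starts before the other ends.
    return restart_start < backup_end and backup_start < restart_end
```

For example, with made-up times, a 30-minute file server restart starting at 02:15 would collide with a 60-minute backup starting at 02:00, so `overlaps_backup` would return True and the restart should be moved.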

-- Chris Pawlyszyn

JamesGough · ‎08-16-2022

I suspected that shutting down the machines while restarting the file server would be the safest option. Is there any risk in leaving the other machines up while restarting the file server? Could the portal or server sites become corrupted?

Also, good point about the data store automatic backups, I had not considered that!

ChristopherPawlyszyn · ‎08-17-2022

I couldn't say there is zero risk associated with it, hence the "preferably" in my reply above, but the error reported by Portal is specific to the HA plugin attempting to start the machines in the correct order according to the files in the ../content/items/portal-ha directory. If anyone were publishing to the site or creating content when the file server went down, there is a high risk of those items/services being corrupted, but it is less likely that the entire site would be affected.

Since all components expect that their shared directories are always available while running, clean-up from corrupted items/services is a variable process that may require manual intervention that isn't documented.

-- Chris Pawlyszyn

JamesGough · ‎08-17-2022

Makes sense. So would the ideal order be something like this?

Shutdown Order

1. Portal
2. Servers
3. Data store
4. File server

Start Up Order

1. File server
2. Data store
3. Servers
4. Portal

Also, does it matter which of the portal nodes come up as the primary machine?

ChristopherPawlyszyn · ‎08-17-2022

Yes, that order makes perfect sense. In practice, steps 1, 2, and 3 of the shutdown can be combined, as can steps 3 and 4 of the startup. Otherwise, hosted services may take a bit to regain access to the relational data store.
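
The ordering discussed above can be sketched as a small Python helper in which shutdown is the startup order reversed. The tier names are taken from the proposed order; `run_in_order` and the `action` callback (whatever actually stops or starts a tier in your environment) are hypothetical. The sketch keeps the tiers strictly sequential; combining steps as described would simply mean merging tiers in the list.

```python
# Tiers in startup order, following the order proposed in the thread.
STARTUP_ORDER = ["file server", "data store", "servers", "portal"]


def shutdown_order(startup=STARTUP_ORDER):
    """Shutdown is the startup order reversed."""
    return list(reversed(startup))


def run_in_order(tiers, action):
    """Call `action(tier)` for each tier in turn, stopping at the first failure.

    `action` is a placeholder for whatever stops or starts a tier and
    should return True on success.
    """
    done = []
    for tier in tiers:
        if not action(tier):
            raise RuntimeError(f"{tier!r} failed; tiers completed so far: {done}")
        done.append(tier)
    return done
```

Stopping at the first failure means that, for example, the servers are never started if the data store did not come up.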

In terms of Portal for ArcGIS, you want to make sure you shut down the standby to avoid failover (if doing things manually) or stagger the restart times between primary and standby (if scheduling automatically). This will keep things from becoming split-brained as the machines start up; the primary is recorded in the HA configuration file, so start-up should be consistent when it comes back online.
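
For the scheduled case, staggering can be as simple as offsetting each machine's restart time by a fixed gap. The sketch below is illustrative: `staggered_restarts` is a hypothetical helper, and both the 15-minute default gap and which Portal machine goes first are assumptions to adjust for your site.

```python
from datetime import datetime, timedelta


def staggered_restarts(first_start, machines, gap_minutes=15):
    """Return (machine, start_time) pairs in list order.

    Each machine is scheduled `gap_minutes` after the one before it.
    """
    return [(name, first_start + timedelta(minutes=i * gap_minutes))
            for i, name in enumerate(machines)]
```

Calling it with `["portal-01", "portal-02"]` gives the second machine a start time one gap after the first, so the two Portal machines are never restarting at the same moment.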

-- Chris Pawlyszyn