ArcGIS Enterprise HA: Tile cache failover

03-22-2021 12:36 PM
NicolasGIS
Frequent Contributor

Hello,

While setting up an HA architecture, I am wondering how HA for a tileCache datastore in primary-standby mode is supposed to work. The HA documentation states the following for primary-standby mode:

```

In primary-standby mode, your tile cache data store contains two machines. Both machines contain the same scene cache data. Scene layers access the cache data on the primary machine. If the primary machine fails, the standby machine becomes the primary machine, and scene layers access the cache data on the new primary machine. This allows continuous access to scene layers while you, as the ArcGIS Data Store administrator, recover or replace the machine that failed.

```

That is not the case for me. After configuring HA in primary-standby mode, if I switch off the VM hosting the primary tileCache datastore, the standby does not take over.

When I check the status in the ArcGIS Server admin interface after the VM has been offline for 15 minutes, I receive the following error:

```

Could not connect to the ArcGIS component at URL 'https://PORTAL01.COMPANY.COM:2443/arcgis/datastoreadmin/machines/PORTAL02.COMPANY.COM/validate'. The ArcGIS component on that machine may not be running or the machine may not be reachable at this time.Error: connect timed out

```

What am I doing wrong?

When I try the "makePrimary" option, I also receive an error:

```

Server machine 'https://PORTALQA02.COMPANY.COM:2443/arcgis/datastoreadmin/machines/PORTAL01.COMPANY.COM/makePrimary' returned an error. 'Attempt to make 'PORTAL02.COMPANY.COM' the primary data store machine with role 'PRIMARY' is not allowed.'

```

So far, in my experience, I cannot take the primary tileCache datastore offline for maintenance. Any ideas?

With the relational datastore, I found out that editing the property "failover_on_primary_stop" in C:\Program Files\ArcGIS\DataStore\framework\etc\datastore.properties solves this issue but it only works with relational datastore. 
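For reference, flipping that property could be scripted roughly as below. This is only a sketch under assumptions: the value `true` is assumed (the post names the property but not the value it was set to), `set_property` is a hypothetical helper, and the sample text is made up rather than copied from a real datastore.properties. It works on the file content as a string, so nothing on disk is touched.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties-file text with `key` set to `value`.

    An existing `key=...` line is replaced in place; otherwise a new line is
    appended. Comment lines and all other settings are left untouched.
    """
    lines = text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Made-up sample content; the real file holds many more settings.
sample = "# Settings for relational data store\nfailover_on_primary_stop=false\n"
updated = set_property(sample, "failover_on_primary_stop", "true")  # "true" is assumed
print(updated)
```

As noted above, this setting only affected the relational datastore in my tests.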

What about tileCache?

Thanks !

@JonathanQuinn maybe ?

 

9 Replies
ChristopherPawlyszyn
Esri Contributor

I've found in practice that the Tile Cache Data Store behaves the same as the rules outlined for the Relational failover (if the failover_on_primary_stop property is not enabled). With that being the case, this should apply to the failover process:

"The only human-initiated situations that cause a failover are if the primary data store machine is deliberately taken offline, or the ArcGIS Server site administrator runs the makePrimary REST command on the standby machine."
From: https://enterprise.arcgis.com/en/portal/latest/administer/linux/add-standby-machine.htm#:~:text=Fail....

Prior to shutting down the primary machine, you should be able to promote the standby to primary via the Admin API, then you can patch/restart the previously primary machine without interruption of your hosted scene layers. If that is not working as described, then we'd need to take a closer look at your particular configuration to get a better idea of what part of the process is breaking down.
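The error messages quoted in the original post show the shape of the URL that is called for the `validate` and `makePrimary` operations. A minimal sketch that only builds such a URL is below. The helper name `datastore_admin_url` is hypothetical, the port and path are copied from those error messages, and the thread does not say whether this endpoint may be called directly or what authentication and parameters it expects, so no request is sent.

```python
from urllib.parse import quote


def datastore_admin_url(admin_host: str, machine: str, operation: str) -> str:
    """Build a URL following the pattern seen in the thread's error messages.

    admin_host: machine whose datastoreadmin endpoint is addressed
    machine:    data store machine the operation targets
    operation:  one of the two operations that appear in the thread
    """
    if operation not in ("validate", "makePrimary"):
        raise ValueError(f"unsupported operation: {operation}")
    return (
        f"https://{admin_host}:2443/arcgis/datastoreadmin/machines/"
        f"{quote(machine, safe='')}/{operation}"
    )


# Host names are the placeholders used in the original post.
url = datastore_admin_url("PORTAL01.COMPANY.COM", "PORTAL02.COMPANY.COM", "makePrimary")
print(url)
```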


-- Chris Pawlyszyn
NicolasGIS
Frequent Contributor

Thanks for your reply @ChristopherPawlyszyn  !

Well, that is not what I am experiencing so far: tileCache does not behave the same as relational regarding failover. Even if the "failover_on_primary_stop" property is set, only the relational datastore fails over; the tileCache does not. That is not very surprising, because in the file "datastore.properties" the property "failover_on_primary_stop" is in the section "Settings for relational data store".

Here is a summary of what I consistently observe (tested at least two or three times):

- On a brand new deployment, I can promote either standby ("tileCache" or "relational") to primary. In that case, shutting off the VM hosting the standby datastores is not an issue: ArcGIS Server can still connect to them on the primary, and datastore validation from ArcGIS Server works.

=> Perfect when the intervention is planned

- But if the primary "tileCache" datastore is taken offline, the standby is not promoted and validating the datastore from ArcGIS Server Manager fails. It does work with the relational datastore.

And for the tileCache, nothing more can be done at that point:

=> Trying to validate it returns the following error:

```

Could not connect to the ArcGIS component at URL 'https://PORTAL01.COMPANY.COM:2443/arcgis/datastoreadmin/machines/PORTAL02.COMPANY.COM/validate'. The ArcGIS component on that machine may not be running or the machine may not be reachable at this time.Error: connect timed out

```

Trying to promote it to primary fails with the same error, because it tries to communicate with the primary.

So what do we do if we have a serious issue with the node that was hosting the primary "tileCache" datastore at that time, and it cannot be brought back online?

Regarding the configuration, it is basic HA composed of 2 nodes:

Portal01: Windows Server 2019 + Portal for ArcGIS 10.8.1 + ArcGIS Server 10.8.1 + Datastore 10.8.1 (relational + tileCache)

Portal02: same as Portal01

 

Thanks !

ChristopherPawlyszyn
Esri Contributor

Apologies, I wasn't clear in that statement. The events that initiate a failover with the tile cache data store match the events for the relational data store. While there is an option to failover on a graceful shutdown for the relational data store, the equivalent for the tile cache data store is not available. A graceful shutdown in this circumstance would include stopping the ArcGIS Data Store service, shutting down the machine, or restarting the machine.

The manually-initiated failover of the tile cache data store would have to occur by abruptly disconnecting the network connection on the machine that is currently the primary node, or by running the 'makePrimary' operation on the standby machine in the Server Admin endpoint while the primary is still accessible. Since you have additional components on the same machine, this is less simple to simulate as you would also cause a failover of Portal's roles if the primary nodes are the same for those two components. ArcGIS Server should run at half capacity without issue, just a reduction in the overall instances available to service requests.

 

This excerpt was what I was referring to for the similarities between the relational and tile cache data stores during failover events:

 

The following is a list of situations for which the standby machine becomes the primary for your relational data store. Note that the following three situations involve hardware or software failures.

  • The primary data store stops working. ArcGIS Data Store attempts to restart the data store on the primary machine. If it cannot restart, the data store fails over to the standby.
  • The primary's web app stops running and attempts to restart the web app on the primary machine. In the rare case that this does not work, the data store fails over to the standby machine.
  • The primary machine is unavailable. This can happen if the computer crashes, is unplugged, or loses network connectivity. ArcGIS Data Store makes five attempts to connect to the primary machine. If a connection is not possible after five attempts, the data store fails over to the standby machine.

From https://enterprise.arcgis.com/en/portal/latest/administer/linux/add-standby-machine.htm#ESRI_SECTION...
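The third condition in that excerpt (five connection attempts, then failover) can be sketched as the decision rule below. This is an illustration only, not ArcGIS Data Store's actual implementation: `should_fail_over` and the `probe` callback are hypothetical names, and no wait between attempts is modelled because the excerpt does not state an interval.

```python
from typing import Callable

MAX_ATTEMPTS = 5  # the quoted documentation mentions five connection attempts


def should_fail_over(probe: Callable[[], bool], max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Return True only if every one of `max_attempts` probes fails.

    `probe` stands in for "try to connect to the primary machine" and
    returns True when the connection succeeds.
    """
    for _ in range(max_attempts):
        if probe():
            return False  # primary answered, so no failover is needed
    return True


def scripted_probe(results):
    """Build a probe that replays a fixed sequence of outcomes (for the demo)."""
    outcomes = iter(results)
    return lambda: next(outcomes)


# Primary never answers: the rule says to fail over.
unreachable = should_fail_over(lambda: False)
# Primary answers on the third attempt: the rule says to stay put.
recovered = should_fail_over(scripted_probe([False, False, True]))
print(unreachable, recovered)
```

A graceful shutdown never reaches this rule in the first place, which is the gap discussed in this thread.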


-- Chris Pawlyszyn
NicolasGIS
Frequent Contributor

Thanks for your reply @ChristopherPawlyszyn 

I just destroyed (not shut down) the "master VM", which had Portal as master and both datastores (relational and tileCache) as primary.

Portal failed over and the relational datastore failed over (thanks to the failover_on_primary_stop property), but the tileCache datastore did not.

Isn't this a case where the tileCache should fail over?

Thanks

ChristopherPawlyszyn
Esri Contributor

Not sure what cloud platform you were testing in, but I was able to reproduce the same results in AWS. It looks like the termination of the instance falls under the graceful shutdown conditions since an ACPI shutdown event is triggered.

Terminate your instance - Amazon Elastic Compute Cloud
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/terminating-instances.html#what-happens-terminat...


-- Chris Pawlyszyn
NicolasGIS
Frequent Contributor

Many thanks @ChristopherPawlyszyn  for your reply and your test ! Much appreciated.

My VMs are hosted in a private cloud running on OpenStack. It might behave the same way as AWS.

I was afraid that there was something wrong with my setup, but from what I understand all the observed behaviors are expected?

Personally, I find it a pity that the tileCache does not fail over just like the relational datastore, but that is just feedback!

Thanks !

ChristopherPawlyszyn
Esri Contributor

My suspicion is that OpenStack operates in a similar fashion, so it would issue the shutdown command to the VM upon termination and not initiate the failover process. As I alluded to earlier in the thread, I do believe your tile cache data store is behaving as expected at this point and the failover conditions haven't been met. One way to initiate the failover may be to remove/detach/disable the network interface from the VM in OpenStack prior to terminating it, but I'm not sure of the level of effort needed to accomplish that on your hypervisor interface.

I went ahead and logged an enhancement request internally for adding that functionality, so that the development team is aware of the topic going forward. I agree that this would be a helpful step towards allowing testing of failover between the machines in a high availability tile cache configuration.

Another location where you can log the request is within the ArcGIS Enterprise Ideas board, since members of the product team review those ideas and determine relative interest.

ArcGIS Enterprise Ideas - Esri Community
https://community.esri.com/t5/arcgis-enterprise-ideas/idb-p/arcgis-enterprise-ideas

If you open a support case, the owning analyst should be able to attach the case to the existing enhancement. Feel free to reach out directly if you have any trouble with that process.


-- Chris Pawlyszyn
BorjaParesFuente
Occasional Contributor

Hi,

I have experienced this same behaviour in an ArcGIS Enterprise 10.7.1 HA environment. I couldn't find any solution.

Regards

Borja

Oiligriv
Frequent Contributor

Hi, the same problem occurs on 11.1 too.
I expected that when the network cable is disconnected, the standby TileCache Datastore would be promoted to primary (but it is not).
I can say, however, that the scene services continue to work correctly.

I managed to capture this from the Log

<Msg time="2024-10-08T16:33:32,176" type="WARNING" code="110130" source="Data Store" process="6192" thread="1" methodName="" machine="RPU-SEI-DB-W004.SEIAESRI.LOCAL" user="" elapsed="" requestID="">ArcGIS Data Store has detected an issue with 'db'.</Msg>
<Msg time="2024-10-08T16:33:32,176" type="SEVERE" code="110131" source="Data Store" process="6192" thread="1" methodName="" machine="RPU-SEI-DB-W004.SEIAESRI.LOCAL" user="" elapsed="" requestID="">ArcGIS Data Store encountered too many problems. Failover may be invoked if standby is configured.</Msg>
<Msg time="2024-10-08T16:33:32,176" type="SEVERE" code="110453" source="Data Store" process="6192" thread="1" methodName="" machine="RPU-SEI-DB-W004.SEIAESRI.LOCAL" user="" elapsed="" requestID="">Data store machine ' RPU-SEI-DB-W004.SEIAESRI.LOCAL' has failed.</Msg>
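To pull only the SEVERE entries out of log text like this, something along these lines may help. It is a rough sketch: `parse_messages` is a hypothetical helper, the sample is a trimmed copy of two of the entries above with most attributes dropped, and it assumes the entries form well-formed XML once wrapped in a single root element, which may not hold for a whole log file.

```python
import xml.etree.ElementTree as ET


def parse_messages(log_text: str) -> list[dict]:
    """Parse <Msg> entries into dicts with time, type, code and message text."""
    # Wrap the fragment in one root element so it parses as a single document.
    root = ET.fromstring(f"<log>{log_text}</log>")
    return [
        {
            "time": msg.get("time"),
            "type": msg.get("type"),
            "code": msg.get("code"),
            "text": (msg.text or "").strip(),
        }
        for msg in root.iter("Msg")
    ]


# Trimmed copy of two entries from this post (most attributes removed).
sample = (
    '<Msg time="2024-10-08T16:33:32,176" type="WARNING" code="110130">'
    "ArcGIS Data Store has detected an issue with 'db'.</Msg>\n"
    '<Msg time="2024-10-08T16:33:32,176" type="SEVERE" code="110131">'
    "ArcGIS Data Store encountered too many problems. "
    "Failover may be invoked if standby is configured.</Msg>"
)

severe = [m for m in parse_messages(sample) if m["type"] == "SEVERE"]
for m in severe:
    print(m["time"], m["code"], m["text"])
```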

 

Thank you

Virgilio
