
TileCache datastore very slow to restore

11-29-2021 10:37 PM
NicolasGIS
Frequent Contributor

Hello,

I am testing a webgisdr restore on a standby environment and found that the whole process now takes up to 18 hours, 15 of which are spent restoring the tileCache datastore.

My webgisdr backup is 45 GB, broken down as follows:

  • tileCache datastore: 25 GB
  • relational datastore: 11 GB
  • Portal for ArcGIS: 8 GB
  • AGS: 1 GB

[Screenshot: Capture d’écran 2021-11-30 à 06.29.53.png]

Isn't that too long? Could there be an issue with my tileCache datastore? I tried making a backup, uninstalling it, and restoring it, but it still takes ages.
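
For a rough sense of scale, the figures above can be turned into an average restore rate. This is only a back-of-the-envelope sketch: it assumes 1 GB = 1024 MB and that the full 15 hours went into the 25 GB tile cache, and the function name is made up for illustration.

```python
def effective_rate_mb_per_s(size_gb: float, hours: float) -> float:
    """Average throughput in MB/s for moving size_gb within the given hours."""
    return (size_gb * 1024) / (hours * 3600)


# Figures from this post: a 25 GB tile cache restored in about 15 hours.
rate = effective_rate_mb_per_s(25, 15)
print(f"{rate:.2f} MB/s")  # prints "0.47 MB/s"
```

An average under 1 MB/s would suggest the time goes into many small operations rather than into raw data transfer, although this arithmetic alone cannot confirm that.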

How long does it take on your side?

Any feedback or shared experience would be appreciated!

Thanks

21 Replies
JasonHansel1
Emerging Contributor

David,

Hey, how are you? We have antivirus disabled and the firewall(s) turned off.

DavidHoy
Esri Contributor

Sorry Jason - that's all I have.
I would suggest taking a look in the CouchDB forums - there may be some guru out there with special insight into backups.

NicolasGIS
Frequent Contributor

My tileCache restore is now taking more than 24 hours, and the job fails because of a job timeout:

2022-09-16 11:12:33 ERROR [pool-1-thread-2] com.esri.arcgis.webgis.component.service.impl.DataStoreDRService - {"jobId":"653c93cb-e0bd-475e-8120-87eae7f61f3a","description":"Deploy data store snapshot 20220910-180921-CEST-86-FULL from \\\\PATHTO\\backup\\temp\\WebGISSite1663222328934\\dataStore\\a3729eb8-a53e-4ac5-88a4-d90a56801b3b","lastModified":"2022-09-15 11:12","status":"scheduled"}
2022-09-16 11:12:33 INFO [main] com.esri.arcgis.webgis.util.WebGISUtil - The restore of ArcGIS Data Store has taken 24hr:00min:04sec.
2022-09-16 11:12:33 ERROR [main] com.esri.arcgis.webgis.service.impl.WebGISDRDispatcher - Exception: Failed to restore the ArcGIS Data Store.
2022-09-16 11:13:05 ERROR [main] com.esri.arcgis.webgis.client.WebGISDR - Failed to restore the ArcGIS Data Store.
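
To track how close each restore gets to the limit, the duration can be pulled out of the webgisdr log. A minimal sketch in Python: the regular expression is inferred from the single INFO line above, so other versions may format the duration differently, and `restore_duration` is a made-up helper name.

```python
import re
from datetime import timedelta
from typing import Optional

# Duration format as seen in the log line above, e.g. "24hr:00min:04sec".
DURATION_RE = re.compile(r"has taken (\d+)hr:(\d+)min:(\d+)sec")


def restore_duration(line: str) -> Optional[timedelta]:
    """Return the duration reported in a log line, or None if there is none."""
    match = DURATION_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(group) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


line = (
    "2022-09-16 11:12:33 INFO [main] com.esri.arcgis.webgis.util.WebGISUtil - "
    "The restore of ArcGIS Data Store has taken 24hr:00min:04sec."
)
duration = restore_duration(line)
print(duration)
print("over 24 hours:", duration > timedelta(hours=24))
```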

Any idea how I could increase this threshold?

I am under the impression that it's not a classic ArcGIS Server geoprocessing job (I could not find any related jobs in the arcgisserverjobs directories), so I don't know where to look.

I don't think it is related to the "TOKEN_EXPIRATION_MINUTES" property of webgisdr: I did not modify it, so it should be at its default of 60 minutes, and the restore obviously ran far longer than that before timing out. So I don't know what this property is used for... Only for ArcGIS Server sites?

https://enterprise.arcgis.com/en/portal/latest/administer/windows/create-web-gis-backup.htm

@JonathanQuinn or @ChristopherPawlyszyn, maybe?

Thanks!

ChristopherPawlyszyn
Esri Contributor

What version are you currently testing with? The 24-hour timeout is hard-coded, so the best approach would be to cut down on latency between the components and the file share(s), and to increase the I/O performance of the associated drives (if possible).

 

The first thing I'd consider is the proximity of the shared location to the ArcGIS Enterprise components, as the restore will be pulling those files for the individual component restores.

 

Are you able to back up only the tile cache manually, move the backup to a local drive on the standby machine, then restore it using the Data Store utilities? This would give us a good comparison of the network location versus local disk.
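
Before running the full manual restore, a coarse first comparison of the two locations can be made with a small throughput probe. This is only a sketch using the Python standard library: it times one sequential write and read of a temporary file, which does not reproduce the many-small-files pattern of a tile cache restore, and the read may be served from the OS cache. The function name and sizes are arbitrary choices.

```python
import os
import tempfile
import time
from typing import Tuple


def write_read_throughput(directory: str, size_mb: int = 16, chunk_mb: int = 4) -> Tuple[float, float]:
    """Write then read a temporary file in `directory`; return (write, read) in MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb // chunk_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device, not just the OS cache
        write_s = max(time.perf_counter() - start, 1e-9)

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(len(chunk)):
                pass
        read_s = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)
    return size_mb / write_s, size_mb / read_s


# Probe the local temp directory; add the backup share path to compare the two.
candidates = {"local temp": tempfile.gettempdir()}
for label, directory in candidates.items():
    write_rate, read_rate = write_read_throughput(directory)
    print(f"{label}: write {write_rate:.0f} MB/s, read {read_rate:.0f} MB/s")
```

If the share is much slower than the local disk even in this simple test, that already points at the network location.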


-- Chris Pawlyszyn
NicolasGIS
Frequent Contributor

It's done on 10.9.1, fully patched (including the "Durability Enhancement Patch" that is supposed to improve tileCache performance...)

It's really bad news that this value is hard-coded! Please consider externalizing it: we are only at the very beginning of our use of the tileCache, and BIM is becoming more and more of a thing. It's scary to think we are already at the limit!

Will do some more testing based on your feedback... for a change...

ChristopherPawlyszyn
Esri Contributor

I'm doing some benchmarking on my end as well and will post the results. Can you confirm that this is during a full-mode restore and not incremental mode?


-- Chris Pawlyszyn
NicolasGIS
Frequent Contributor

Thanks! In the meantime, I am trying to restore on a drive with higher I/O.

Yes, the issue occurs when restoring a webgisdr backup created with "BACKUP_RESTORE_MODE = full"
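
To double-check which mode a given properties file uses, the relevant keys can be read with a few lines of Python. This is a minimal sketch: the sample text is illustrative and only contains the two property names mentioned in this thread, and the parser handles plain KEY = VALUE lines only (no escapes or line continuations).

```python
def read_properties(text: str) -> dict:
    """Parse simple KEY = VALUE lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Illustrative excerpt; the values are the ones quoted in this thread.
sample = """
# webgisdr properties excerpt
BACKUP_RESTORE_MODE = full
TOKEN_EXPIRATION_MINUTES = 60
"""
props = read_properties(sample)
print(props["BACKUP_RESTORE_MODE"])  # prints "full"
```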

NicolasGIS
Frequent Contributor

Good news, @ChristopherPawlyszyn!

I was able to restore the tileCache datastore by "upgrading" the drive hosting it.

The drive on which it took more than 24 hours has these characteristics: 80 MB/s and 100 IO operations (both read and write).

I then tried a "better" drive: 300 MB/s and a rate of 5 IO operations per gigabyte, with a guaranteed minimum of 500 IO operations and a maximum of 2000 IO operations (both read and write),

and it restored successfully after 7 hours!
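
For anyone sizing a similar drive, the IO rule described above (5 IO operations per gigabyte, floor of 500, ceiling of 2000) works out as a simple clamp. A small sketch follows; the drive sizes in the loop are hypothetical, since the actual size of my drive is not given here.

```python
def provisioned_iops(size_gb: float, per_gb: float = 5, minimum: int = 500, maximum: int = 2000) -> float:
    """IO operations granted for a drive of size_gb under the per-gigabyte rule."""
    return max(minimum, min(maximum, size_gb * per_gb))


# Hypothetical drive sizes, only to show where the floor and ceiling apply.
for size_gb in (50, 100, 200, 400, 1000):
    print(size_gb, "GB ->", provisioned_iops(size_gb))
```

Under this rule, drives up to 100 GB get the 500 minimum, and the 2000 maximum is reached at 400 GB.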

 

ChristopherPawlyszyn
Esri Contributor

Glad that shortened the restore time, @NicolasGIS, although I understand the increased performance comes at a higher operational cost.

 

As for my testing, I published a 35 GB SLPK to an 11.0 stack and tested backup and restore to a separate site. The backup took a bit over an hour for the tile cache data store, and the restore took about the same (sending/staging to a network SMB share on a separate Windows machine). There's certainly room for improvement. I believe you're already attached to the existing performance defect logged in Salesforce (BUG-000139154), but I will discuss prioritization with the team going forward.


-- Chris Pawlyszyn
NicolasGIS
Frequent Contributor

Hi @ChristopherPawlyszyn,

Just a small follow-up: on the same infrastructure, after migrating from 11.0 to 11.1, the tileCache datastore restore went from 7 hours to 1 hour!

Very nice enhancement 🙂