Distributed Collaboration Support ETL Data Services

DavidRunneals2

Background: We tried to use distributed collaborations to keep our public facing ArcGIS Online services updated from our services on portal that were published from our data warehouse (which would get updated on a schedule). Apparently distributed collaborations DO NOT support services where the service has data that is completely truncated and reloaded (ie when moving data from a transactional database to a publication/data warehouse) because distributed collaborations use global IDs to compare changes and if you do truncate and reload an entire dataset, it apparently is double the work (it first deletes all the existing records since they no longer exist and then uploads the new records). Esri's workaround of comparing the transactional dataset to the publication dataset to do updates/deletes before inserts would be a major waste of processing time and slow down critical databases and processes running the ETL.

Idea: Distributed Collaborations needs to have an option to that bypasses the delta comparisons and simply drop/truncate the table in the organizations that it's being shared to and then append the exported replica. This would support many workflows and make life so much more simplistic for those of us who want to use distributed collaboration at an enterprise level to push data to ArcGIS Online.

Kelly Gerrow‌ Paul Barker‌

TaupoDistrict_Council · ‎08-10-2020

Hi David Runneals‌, thanks for your post. We have a truncate/append process that updates data that we are hoping to share with a Collaboration. Could you clarify what you mean by 'DOES NOT support?' - does the process break completely or just take a long time to sync following a truncate/append operation on the data? Thanks in advance.

DavidRunneals2 · ‎08-11-2020

Per a discussion with the ArcGIS Online Team (Kelly Gerrow‌), they brought up that global IDs should be preserved where possible. In some instances (such as ArcGIS Roads and Highways and legacy applications with other systems/databases), that is not possible to do. I unfortunately can't find where I saw the best practices (help me out Kelly).

We were experiencing both of what you mentioned. Some of our ETL datasets would just take a long time (and there's apparently an undocumented 2 hour timeout) while other datasets would exceed the 2 hour timeout and not sync at all. If you have a relatively small dataset, I think doing a truncate reload wouldn't be bad, but for larger ones it definitely doesn't work. We did notice that it seemed to work for a little while, but then it would fail to run, so I encourage you to do testing for at least a month or 2 before pushing your collaborations to production, so you don't have to spend a month or more undoing them like we have been doing.

One other thing worth trying is splitting out your larger datasets into their own groups/collaboration workspaces so they are synced by themselves instead of with multiple datasets. We tried this and I don't think it worked very well, but you might have better success.

A couple of other thoughts that came to mind... One of the esri guys on our call suggested using database hashes and comparing those to see if you really need to do a truncate/reload (Comparing Data of Two Oracle Tables Using MD5 Hash ). I also just came across this as well: How to copy or load data and preserve GlobalID values