
ETL Pattern: Avoid Serializing Bulk I/O

09-24-2024 11:53 AM
BruceHarold
Esri Frequent Contributor

If you're synchronizing an external data source into a hosted feature layer in ArcGIS Online or ArcGIS Enterprise, you want writing each periodic changeset to be as efficient as possible. Once your datasets reach hundreds of thousands or millions of features, calculating the minimal insert, update, and delete transactions becomes very attractive, minimizing downtime and transaction failure risk.
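At its core, the changeset calculation is a keyed diff between the source rows and the current target rows. Here is a minimal sketch in Python; the `id` key and field names are illustrative, not from the blog's tool:

```python
def compute_changeset(source_rows, target_rows, key="id"):
    """Derive minimal insert/update/delete sets by comparing rows on `key`.

    Returns (inserts, updates, deletes): rows only in the source, rows in
    both but changed, and key values only in the target.
    """
    src = {r[key]: r for r in source_rows}
    tgt = {r[key]: r for r in target_rows}
    inserts = [r for k, r in src.items() if k not in tgt]
    updates = [r for k, r in src.items() if k in tgt and r != tgt[k]]
    deletes = [k for k in tgt if k not in src]
    return inserts, updates, deletes
```

With a two-row source and a two-row target sharing one key, only the genuinely changed rows are written, which is what keeps the transaction small.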

The problem is that deriving edit transactions requires reading the target hosted layer, which is an expensive operation. This blog shows an approach that avoids querying your target feature layer: instead, a file geodatabase export of the data is triggered, automatically downloaded locally, and then read performantly to find the changeset, with no lasting cost in extra item storage.
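The export step maps to the portal's item export endpoint in the ArcGIS REST API. The following is a hedged sketch, not the tool's implementation; `username`, `item_id`, and `token` are placeholders you would supply:

```python
import json
import urllib.parse
import urllib.request

# Assumption: ArcGIS Online; an Enterprise portal would use its own URL.
PORTAL = "https://www.arcgis.com/sharing/rest"

def build_export_request(username, item_id, title, token, portal=PORTAL):
    """Build the URL and form body for the portal's export endpoint."""
    url = f"{portal}/content/users/{username}/export"
    params = {
        "itemId": item_id,
        "exportFormat": "File Geodatabase",
        "title": title,
        "token": token,
        "f": "json",
    }
    return url, urllib.parse.urlencode(params).encode()

def start_export(username, item_id, title, token):
    """Submit the export job; the JSON response includes the job and item ids."""
    url, body = build_export_request(username, item_id, title, token)
    with urllib.request.urlopen(url, data=body) as resp:
        return json.load(resp)
```

Once the job finishes, the export item's file geodatabase can be downloaded, read locally, and the item deleted so nothing accrues against your storage.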

Here is the approach in action:

[Animation: BulkUpsert2, avoiding the feature layer query]

In an earlier post I showed an example of deriving upsert and delete transactions by reading a target feature layer, which is a serialized approach. My target layer has over 1M features, and reading it takes several minutes. The approach above does better against the same service.

The post download has a toolbox containing the Pro 3.5 ArcGIS Data Interoperability ETL tool BulkUpsert2. The tool reads a CSV file at a URL (also avoiding serialization, which is the default for the server), but it also accesses portal content to trigger a file geodatabase export, waits in a loop until the export is finished, downloads and reads the exported content (not the service), and finally deletes the export item (the downloaded file geodatabase is also automatically deleted). This is core functionality. There are two mint green custom transformers in the tool. An EsriOnlineTokenGetter retrieves a token needed for a later HTTP call (if you are working with an Enterprise portal, you would use an EsriPortalTokenGetter).
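For readers curious what a token getter does under the hood, it amounts to one call to the portal's generateToken endpoint. This is a sketch of that REST call, not the transformer's actual internals:

```python
import json
import urllib.parse
import urllib.request

def build_token_request(username, password,
                        portal="https://www.arcgis.com/sharing/rest"):
    """Form the generateToken request an ArcGIS Online token getter makes."""
    url = f"{portal}/generateToken"
    params = {
        "username": username,
        "password": password,
        "referer": "https://www.arcgis.com",
        "f": "json",
    }
    return url, urllib.parse.urlencode(params).encode()

def get_token(username, password):
    """POST the request and return the short-lived token string."""
    url, body = build_token_request(username, password)
    with urllib.request.urlopen(url, data=body) as resp:
        return json.load(resp)["token"]
```

The token is then passed as a parameter on subsequent REST calls, such as the export request above.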

Spoiler
Pro Tip!  The blog download now contains the tool BulkUpsert2WebConnection, which avoids retrieving a token by using an OAuth2-based web connection. The web connection is supported by the Esri ArcGIS Connector package, which supplies the Esri ArcGIS Feature Service format used in the workspace. The web connection is then used by an HTTPCaller transformer, eliminating the need for a token. If your organization enforces multi-factor authentication, retrieving a token is blocked and the custom transformers will not work. It is now recommended practice to install the package, create a new web connection, and use it for feature service read/write and ArcGIS REST API HTTP calls.
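To illustrate why OAuth2 sidesteps the username/password problem: credentials are exchanged for an access token without a generateToken call. The web connection manages this for you (typically via an interactive authorization-code flow); the sketch below shows a different OAuth2 grant, client credentials with a registered application's id and secret, purely as an illustration:

```python
import json
import urllib.parse
import urllib.request

# Assumption: ArcGIS Online's OAuth2 token endpoint.
TOKEN_URL = "https://www.arcgis.com/sharing/rest/oauth2/token"

def build_oauth_request(client_id, client_secret):
    """App-credential token request; no user password is ever sent."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "client_credentials",
        "f": "json",
    }
    return TOKEN_URL, urllib.parse.urlencode(params).encode()

def get_oauth_token(client_id, client_secret):
    """POST the request and return the access token string."""
    url, body = build_oauth_request(client_id, client_secret)
    with urllib.request.urlopen(url, data=body) as resp:
        return json.load(resp)["access_token"]
```

Because no user password is involved, this style of authentication is compatible with organizations that enforce multi-factor authentication.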

The other custom transformer is a looping transformer you can inspect in its tab, labeled FeatureServiceExportLooper. It checks the export job status every 2 seconds until the job is complete. Note that how long an export job takes depends on how big your service is and on how busy the host portal is; I have seen ArcGIS Online queue jobs as well as run them immediately. Here is a run taking a little over a minute (just for the export) at a busy time in ArcGIS Online:

[Image: FeatureServiceExportLooper]

The net result, however, is a significant gain. Here is a screen capture from the earlier blog and the serialized approach; note the session duration.
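The wait-and-check pattern the FeatureServiceExportLooper implements can be sketched generically. This is an illustrative Python version, not the transformer itself; `check_status` stands in for the portal's job status call and is assumed to return a status string such as "completed" or "failed":

```python
import time

def wait_for_export(check_status, interval=2.0, timeout=600.0):
    """Poll an export job until it completes, fails, or times out.

    check_status: callable returning the current job status string.
    interval: seconds between polls (the looper uses 2 seconds).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = check_status()
        if status == "completed":
            return True
        if status == "failed":
            raise RuntimeError("export job failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("export job did not finish in time")
        time.sleep(interval)
```

A timeout is worth having in any such loop, since a busy portal may queue the job for an unpredictable length of time.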

[Image: Reading the target feature layer takes longer]

So that's how to avoid reading hosted feature layers with a serialized approach. Now you have an alternative to time-consuming changeset construction!
