
ETL Pattern: Upsert & Delete - for your Esri data maintenance

BruceHarold
Esri Regular Contributor
09-06-2024 12:37 PM

Today's featured transaction is upsert, ably assisted by delete.  If you are maintaining a geodatabase or hosted feature layer (ArcGIS Online or ArcGIS Enterprise), unless you're doing something drastic like a truncate, you can package all your ETL edits up into these two methods using ArcGIS Data Interoperability (but also see core options below).  Inserts and updates travel together as upserts, while deletes speak for themselves.

For example use case data, I'm using the City of Los Angeles Open Data Address Points.  If you're keen to take a look at the raw material, this link will trigger a CSV file download (161MB+, 1M+ rows).

Los Angeles address points

The map shows the result of ETL into a hosted feature layer, which is my target information product.  It could also be any type of geodatabase.  The raw data is typical of many ETL sources, with these properties:


  1. The data is not immediately accessible to ArcGIS
  2. The data schema is a little cryptic
  3. The data changes frequently, but only a small fraction of a big dataset
    • Edits may be inserts, updates or deletes
    • No metadata fields exist to track edits
  4. The data has a persistent primary key field

#1 & #2 above are basic ETL challenges which are easily solved. #3 suggests that upserts and deletes are candidate methods, while #4 is what makes upserts possible.  Using a match key is the secret sauce for this post, but like any good cooking show the recipe comes after a look at the result!

Upsert & Delete ETL Tool

This single ETL tool supports two stages in the lifecycle of my target information product:

  • Creation of the feature service
  • Applying updates to the feature service on demand

To create the feature service, the greyed-out writer labeled Creation would be enabled and the rest of the workspace would not exist.  After service creation I disabled this writer and added the rest of the workspace.  The workspace still reads the source CSV file at a public URL, but also reads the target feature service (which of course must already have been created), calculates the upsert and delete transactions, and applies them.  It's a simple and powerful ETL tool, but there is critical enabling work needed outside it.

The REST API Append operation that supports upserts requires the match key field (House_Number_ID in my case) to:

  • Have a unique index
  • Not allow null values
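
Outside the ETL tool, this Append operation can also be scripted directly.  Here is a minimal sketch using the ArcGIS API for Python; the layer URL, credentials, and CSV item ID are placeholders, and parameter names may vary slightly by API version:

```python
# Minimal sketch: upsert into a hosted feature layer via the REST Append
# operation, using the ArcGIS API for Python.  Assumes the source rows
# were already uploaded as a portal CSV item (item ID is a placeholder).
from arcgis.gis import GIS
from arcgis.features import FeatureLayer

gis = GIS("https://www.arcgis.com", "username", "password")
layer = FeatureLayer(
    "https://services.arcgis.com/xxxx/arcgis/rest/services/LA_Addresses/FeatureServer/0",
    gis,
)

# upserts=True maps to the REST parameter upsert=true; the match key
# field must have a unique index and disallow nulls, as described above.
layer.append(
    item_id="csv_item_id_placeholder",
    upload_format="csv",
    upserts=True,
    upsert_matching_field="House_Number_ID",
)
```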

Meeting these conditions requires two simple processing steps outside the ETL tool before upsert processing will work.  The ETL tool itself has no way to set these properties, and two factors are relevant: the Alter Field geoprocessing tool does not support changing the allow-nulls property of a feature layer field, but the Add Attribute Index geoprocessing tool does allow creating a unique index on one.

To work around the allow-nulls problem, I exported the initial feature layer to my project default geodatabase using the Export Features tool, unsetting the allow nulls property for the House_Number_ID field in the field map control.

Export Features to Geodatabase
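
If you prefer to script this export, the same field map manipulation looks roughly like this in arcpy; the paths and names are placeholders and the FieldMappings handling is a sketch of the same idea:

```python
import arcpy

# Placeholders: the initial hosted feature layer and the project geodatabase.
src = "LA_Addresses"  # layer in the active map, or a feature service URL
out_fc = r"C:\Project\Default.gdb\LA_Addresses"

# Build a field mapping from the source, then unset the output field's
# isNullable property for the match key before exporting.
fms = arcpy.FieldMappings()
fms.addTable(src)

i = fms.findFieldMapIndex("House_Number_ID")
fmap = fms.getFieldMap(i)
out_field = fmap.outputField
out_field.isNullable = False  # the property the Alter Field tool cannot change
fmap.outputField = out_field
fms.replaceFieldMap(i, fmap)

arcpy.conversion.ExportFeatures(src, out_fc, field_mapping=fms)
```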

With the output feature class in Pro, I then created a map layer, taking the opportunity to apply symbology other than the pink lemonade default, then overwrote my target feature layer.

With the overwritten target layer, I then added a unique index for my match key field:

Add Unique Index
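
This step is also scriptable; a short sketch with arcpy, assuming the target layer is in the active map under a placeholder name:

```python
import arcpy

# Add Attribute Index can create a unique index on a feature layer field,
# which is the property the upsert workflow requires on the match key.
arcpy.management.AddIndex(
    "LA_Addresses",         # placeholder layer name
    ["House_Number_ID"],    # match key field
    "House_Number_ID_idx",  # index name
    "UNIQUE",
    "ASCENDING",
)
```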

Now the target layer is in shape for configuring upsert (and delete) processing!

The ChangeDetector transformer is what generates the upsert and delete change sets.  Here are the settings:

Change Detector Parameters

Using match attributes:

Detection Fields

Change detection comes at the expense of reading the target features in addition to the source features but does deliver an optimally small edit payload.  For my subject matter data the whole job takes 5 minutes.
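
Conceptually, the transformer's job reduces to a keyed comparison of source and target rows.  A plain-Python sketch of the idea, with the match key assumed from this dataset:

```python
# Sketch of keyed change detection: compare source and target rows on the
# match key, emitting upserts (new or changed rows) and deletes (keys that
# have disappeared from the source).
def detect_changes(source_rows, target_rows, key="House_Number_ID"):
    src = {row[key]: row for row in source_rows}
    tgt = {row[key]: row for row in target_rows}

    upserts = [row for k, row in src.items() if k not in tgt or row != tgt[k]]
    deletes = [k for k in tgt if k not in src]
    return upserts, deletes

# Only the (usually small) changed fraction travels to the service,
# which is why the edit payload stays optimally small.
```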

To reiterate, while I'm showing a hosted feature layer as my target information product, upsert write capability is also available for geodatabase features via ArcGIS Data Interoperability.

For completeness, I'll give a shout out to two other places upsert capability is offered in ArcGIS, namely the Append geoprocessing tool and the Data Pipelines Add and Update output option. However, if you need to delete features as well as upsert them, then in core geoprocessing you'll need to use Delete Features or Delete Rows, or in Data Pipelines replace the output feature layer. This pipeline does that job for the feature layer my ETL tool operates on.

Data Pipeline Replacing Los Angeles Addresses
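
If you are scripting rather than using a pipeline, deletes by match key can go through a where clause; a hedged sketch with the ArcGIS API for Python, reusing keys like those produced by the change detection sketch above:

```python
from arcgis.features import FeatureLayer

# Placeholder layer URL; delete_keys would come from change detection.
layer = FeatureLayer(
    "https://services.arcgis.com/xxxx/arcgis/rest/services/LA_Addresses/FeatureServer/0"
)
delete_keys = ["12345", "67890"]  # placeholder match key values

# Delete by match key; the quoting assumes a text field, drop the
# quotes for a numeric key.
where = "House_Number_ID IN ({})".format(
    ",".join("'{}'".format(k) for k in delete_keys)
)
layer.delete_features(where=where)
```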

I'll take the opportunity here to call out that using Data Pipelines to maintain hosted feature layers in ArcGIS Online that were created by separate ETL processes is perfectly valid.  In my pipeline you'll see the Map Fields tool, which greatly assists in connecting the schema coming from a CSV file with the schema I defined using Data Interoperability.

So there it is, upsert is there for you, ready to go!

The blog download has my Spatial ETL tool plus associated toolbox with the models.

4 Comments
ShareUser
Esri Community Manager

What products are required for this?

BruceHarold
Esri Regular Contributor

Thanks for the question, I edited the article to hopefully make it clearer where upsert capability is supported in ArcGIS.  My worked example uses ArcGIS Data Interoperability, which can work against hundreds of data sources, but note that core geoprocessing also supports upsert in the Append geoprocessing tool, and ArcGIS Data Pipelines has an upsert capability as well.

BruceHarold
Esri Regular Contributor

Here is a run with some sample feature counts.

Upsert with feature counts

BruceHarold
Esri Regular Contributor

Hello everyone.  It is worth noting that upsert writes are asynchronous, which for small jobs (as can be created by change detection) has some overhead, so my example scenario isn't ideal.  In production you might consider upserts for cases where there are larger update and insert transactions, and use synchronous insert, update and delete write modes for smaller jobs.
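
For those synchronous write modes, the equivalent scripted call is applyEdits; a minimal sketch with the ArcGIS API for Python, with placeholder field values and geometry:

```python
from arcgis.features import FeatureLayer

# Placeholder layer URL and attribute values.
layer = FeatureLayer(
    "https://services.arcgis.com/xxxx/arcgis/rest/services/LA_Addresses/FeatureServer/0"
)

# edit_features wraps the synchronous applyEdits operation, so small
# insert/update/delete batches avoid the asynchronous Append job overhead.
adds = [{
    "attributes": {"House_Number_ID": "111", "Full_Address": "1 Main St"},
    "geometry": {"x": -118.25, "y": 34.05, "spatialReference": {"wkid": 4326}},
}]
updates = [{"attributes": {"OBJECTID": 42, "Full_Address": "2 Main St"}}]

layer.edit_features(adds=adds, updates=updates, deletes="43")  # deletes by OBJECTID
```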