ETL Patterns Blog

cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 

Other Boards in This Place


Latest Activity

(21 Posts)
BruceHarold
Esri Frequent Contributor

In a previous blog I explored the performance of no-code change data capture versus view source swap in a speed test when the goal is maintaining a hosted feature layer - and declared it a tie!  Now I have a new entrant from the coded solutions world - using an ArcGIS Online hosted notebook to calculate and apply the delta transaction for a hosted feature layer where the source data changes daily.

My subject matter data is the same as the previous blog, street address points for the city of Los Angeles, updated daily.

Los Angeles Address PointsLos Angeles Address Points

There are a little over a million points in the dataset, with a few hundred changes daily: inserts, updates and deletes.  I've been keen to write about an upsert use case (the combination of insert and update in a single transaction) for a while, as it is now supported with the Append geoprocessing tool in Pro and hence in ArcPy in the notebook advanced runtime.

I'm getting ahead of myself, so let's first set the scene.  To consider a coded ETL workflow you need to be confident your data is well managed, with little or no need to make fixes or apply transformations during the ETL process - because while you can view data in a notebook it is very difficult to do deep data inspection and discover data problems.  If you can't trust your data then you should be using ArcGIS Data Pipelines or ArcGIS Data Interoperability.

In this case the city is delivering well-curated data so I'm happy to recommend a coded lift-and-shift process.

Back to the tools used.  In the blog download you'll find an ArcGIS Pro 3.6 toolbox with a model.  The model creates a file geodatabase feature class named  Addresses using CSV data it downloads from the city Open Data site.  The feature class has a tuned schema (see the field map control in the Export Features tool) and also a primary key field House_Number_ID with a not-null constraint and an index, requirements for using upsert transactions.

Model Making Addresses DataModel Making Addresses Data

I published my feature service from ArcGIS Pro after applying some symbology and popup behavior and made sure that House_Number_ID has a unique index in the feature service.  Now for the notebook that maintains the service.

I'll let you step through the notebook code (in the blog download), but the processing steps are:

  • Use DuckDB to read the source CSV file from the open data site download URL
  • While reading, enforce a schema and make geometry in a memory relation (table)
  • Using ArcPy, create a memory feature class from data in the DuckDB relation
  • Merge existing and incoming address data into another memory feature class
    • This contains both old and new records
  • Use Find Identical to find identical sequences in the merged data across geometry and all fields
    • The data is in Web Mercator so a tolerance of 1m allows for coordinate precision differences
  • Run Frequency to support finding rows that are unique
    • Edit features do not have identical matches
  • Use set mathematics to determine the upsert and delete records
  • Run Append and Delete Rows functions with the correct records

If you inspect the notebook cell messages you'll see the whole job took 9 1/2 minutes (quite big data is read into the notebook and geoprocessed) but the actual write commits were small and took only a few seconds - pretty good in my book.

After setting up scheduled processing weekday mornings I now have a continuously maintained information product!

Please do comment in this post with any observations or questions.

more
1 0 287
BruceHarold
Esri Frequent Contributor

We'll look at three cases:

  • Infrequent bulk replacement
  • Frequent append and upsert
  • Regular insert, update and delete editing
Read more...

more
3 0 268
BruceHarold
Esri Frequent Contributor

Closing the loop by sharing the content

Read more...

more
2 0 597
BruceHarold
Esri Frequent Contributor

Make data on demand!

Read more...

more
3 0 530
BruceHarold
Esri Frequent Contributor

It's alpha release so stay tuned for updates!

Read more...

more
2 0 918
BruceHarold
Esri Frequent Contributor

OK I struggled with the title, there is too much goodness in it, just read it!

Read more...

more
3 6 2,435
BruceHarold
Esri Frequent Contributor

With the release of ArcGIS Pro 3.5, the stars align a little more when it comes to the use of GeoParquet.  You can now work with local GeoParquet files for your mapping and analysis needs, but it is also much easier to ingest big GeoParquet data from an S3-API-compliant object store!

This post is about how simple it is to bring remote GeoParquet data into your project.

The enabling technology is DuckDB, now included in the default Python environment in ArcGIS Pro 3.5 - no more package management just for this spectacularly useful client technology.

Here is an example, the entire Overture Maps Foundation divisions dataset accessed from their AWS S3 object store and written to my project home geodatabase.

Overture DivisionsOverture Divisions

Automation is key to GIS happiness, so to access this data I created a simple notebook which you can find in the post download.  You'll need ArcGIS Pro 3.5 to run it, or an earlier release with your Python environment extended with DuckDB 1.1+.

It takes me about 6 minutes to download the 1m+ features to my project home geodatabase, but a big chunk of that is taken up in a couple of best-practice steps, namely sorting the features on area (descending) and repairing any geometry issues.  The sort step is so small features display on top of large features, the geometry repair is commonly needed for point-rich data that "tiles the plain' like these divisions do.

The lift and shift itself is fast.

I'll let you inspect the notebook for yourselves, but note the option to apply an attribute or spatial filter on the features you download, for example within a bounding box in lat/long or the name of a country.  Instead of manually download a set of very large parquet files from S3 you now have a simple tool to go get what you want, any time you like!

more
5 3 3,454
BruceHarold
Esri Frequent Contributor

Now your data can get younger!

Read more...

more
3 0 1,387
ShareUser
Esri Community Manager

Automation has arrived!

Read more...

more
0 4 850
BruceHarold
Esri Frequent Contributor

It's easy!

Read more...

more
1 0 944
114 Subscribers