
Share Versioned Cloud Native Data With Optional Time-Travel

ShareUser, Esri Community Manager · 02-20-2025 09:53 AM

It is always satisfying to share powerful new ways to solve problems, especially when the solution has been "hiding in plain sight" for a while.  This time I'm showing how the combination of ArcGIS Enterprise branch versioning and cloud native data sharing delivers not only fast data access for people without portal access, but also the ability to travel back in time to when the data was younger.  Take these parcels: a previously undivided parcel, and the same parcel after it was split into three subdivisions.

Parcel subdivision

Picture a dataset with millions of features under heavy daily maintenance, exactly the load branch versioning is built to handle.  Your customers can access all or any part of the default version for any moment in time.  Forever.  Without extra load on your Enterprise portal.

So, how did I get there?  I simply noticed that the insert-only transaction model of branch versioning is a natural fit for incrementally creating GeoParquet files in cloud storage.  Together, those files preserve the data's state over time and can be queried spatially and temporally to build local data on demand for your area and time of interest.

It is, however, a very fancy query!  The good news is that you don't have to figure it out yourself: the blog download includes a notebook with examples for my parcel subject matter, so just plug yours in.

I didn't have to invent the query approach; Esri publishes workshop materials on the topic.  For example, if you go to around minute 18 in this presentation you'll see what such a query looks like.

I did have to create GeoParquet files that I can query, plus a maintenance workflow for initial and incremental parquet file creation.  It all starts with the source branch versioned Enterprise geodatabase feature class.  Normally you can't see the system fields that power branch versioning of a feature class, but if you add the archive class to the map they become available:

Archive class added to the map

A couple of things to note in the fields map:  ObjectID is demoted to an ordinary long integer (values are no longer unique) and various fields named GDB_* are added.  These fields power viewing the data at a moment in time, which is how branch versioning works: the latest state for a feature wins, and that state may be a delete, but the data history isn't lost (unless you drop it).  That is what makes time travel possible.
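To make the mechanics concrete, here is a minimal sketch of that kind of moment-in-time query in DuckDB.  It assumes the parquet copies of the archive rows expose ObjectID, GDB_BRANCH_ID and GDB_FROM_DATE as shown above, plus a delete flag (GDB_IS_DELETE is my guess at its name), and it is not the notebook's exact query; see the presentation linked above for Esri's version.

```python
# A minimal sketch (not the notebook's exact query) of travelling to a moment
# in time over the parquet copies of the archive rows with DuckDB.
# Assumptions: columns objectid, gdb_branch_id, gdb_from_date exist as shown
# above, and gdb_is_delete marks delete states (my guess at the field name).
import duckdb

con = duckdb.connect()
moment = "2024-06-30 00:00:00"        # hypothetical moment to travel to
pq_path = "parcels/*.parquet"         # hypothetical glob of the parquet files

parcels_at_moment = con.sql(f"""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY objectid
                   ORDER BY gdb_from_date DESC
               ) AS rn
        FROM read_parquet('{pq_path}')
        WHERE gdb_branch_id = 0
          AND gdb_from_date <= TIMESTAMP '{moment}'
    )
    SELECT * EXCLUDE (rn)
    FROM ranked
    WHERE rn = 1            -- latest state for each feature wins...
      AND gdb_is_delete = 0 -- ...unless that state is a delete
""").df()
```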

The archive class is also good for discovering which edit moments exist in your data.
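The same discovery works against the parquet copies too; here is a short sketch, again assuming the column names and a hypothetical path as above:

```python
# Sketch: list the edit moments captured in the default branch.
import duckdb

con = duckdb.connect()
moments = con.sql("""
    SELECT DISTINCT gdb_from_date
    FROM read_parquet('parcels/*.parquet')   -- hypothetical path
    WHERE gdb_branch_id = 0
    ORDER BY gdb_from_date
""").df()
print(moments)
```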

With the archive class exposing all the fields, the sharing and maintenance workflow became possible.  It goes like this:

  • Create an initial parquet file with all archive class rows where GDB_BRANCH_ID = 0
  • On any schedule that makes sense, create delta parquet files for new default branch row states (see the sketch after this list)
    • These have a GDB_FROM_DATE later than the maximum in all existing parquet files
    • They also have GDB_BRANCH_ID = 0
  • Maintain all parquet files in your favorite S3-compatible object store at a glob path
  • Give your data customers a notebook or script tool they can use to extract data
    • The supplied notebook requires DuckDB version 1.0.0 in the Python environment

Now, I'm advertising this as cloud native data distribution, but at the time of writing I'm still setting up my AWS account, so the attached notebook uses a local filesystem path; I'll update it when I have a public S3 URL path available.  In the meantime you can download sample data for testing here, here, here and here.  They are the initial bulk version copy and a few incremental delta files, each with a few days' edits.  Change the notebook pqPath variable to suit your environment until I get the S3 path in place.

Spoiler
The data I'm using isn't really being maintained in a branch versioned geodatabase; I made sample data.  Kindly see the data permissions in the item details for the links above.

You'll see in the notebook that I supply a template for extent and time travel queries.  I find I can extract all 2.7 million parcels in my data in a little over 3 minutes from local disk.  Access from S3 I would expect to be a little slower; we'll see when I have that set up.  Try out the notebook for yourself.

You might have some questions about the notebook; I'll see if I can anticipate a few:

  • DuckDB 1.0.0 is used as it is in the Esri Conda channel and later versions handle geometry differently
  • The bbox column in the parquet files is JSON type, but it is queried as varchar because DuckDB didn't seem to recognise the data as JSON (see the sketch after this list)
  • I tried using the built-in rowid pseudocolumn in DuckDB but got errors, so I overrode it
  • I tried writing the output feature class by bouncing through a spatially enabled dataframe but got errors
  • In the blog download the project atbx has a script tool I used to find desired output text field widths
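For what it's worth, here is a hedged sketch of the varchar-based bbox filtering mentioned above.  The xmin/ymin/xmax/ymax key names are an assumption on my part, so inspect one row of your parquet files to confirm the actual JSON structure before relying on it.

```python
# Sketch: an extent (area-of-interest) filter against the bbox column,
# treating the JSON as a varchar the way the notebook does.
# Assumptions: the bbox JSON carries xmin/ymin/xmax/ymax keys, and the
# column names match the archive fields shown earlier.
import duckdb

con = duckdb.connect()
xmin, ymin, xmax, ymax = 500000, 4500000, 510000, 4510000   # hypothetical AOI

aoi_parcels = con.sql(f"""
    SELECT *
    FROM read_parquet('parcels/*.parquet')
    WHERE gdb_branch_id = 0
      AND CAST(json_extract_string(bbox::VARCHAR, '$.xmin') AS DOUBLE) <= {xmax}
      AND CAST(json_extract_string(bbox::VARCHAR, '$.xmax') AS DOUBLE) >= {xmin}
      AND CAST(json_extract_string(bbox::VARCHAR, '$.ymin') AS DOUBLE) <= {ymax}
      AND CAST(json_extract_string(bbox::VARCHAR, '$.ymax') AS DOUBLE) >= {ymin}
""").df()
```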

Now I'm going to be a little selfish.  To make my sample data and parquet files I built a few ETL tools (Pro 3.4), which I could have scripted.  These tools are not in the blog download.  If you are interested in them, please message me and I can share them.  It will help the team here if we hear how many people are interested in this data sharing paradigm, so please help us to help you.