SQL-powered ETL. How hard can it be? You know COPY in PostgreSQL and how to stage data in cloud warehouses, right?
How about downloading 51 million road segments from a cloud object store into your Pro project?
Road segments in the USA
Or the same data for France?
Parisian transport segments from Overture via SQL
In the world of data engineering, the process of Extract, Transform, Load (ETL) can often be a daunting task, especially when dealing with complex columnar data from cloud-native sources. However, with the right tools and techniques, it can be simplified significantly. This blog post delves into using SQL to streamline ETL processes, specifically focusing on handling complex georelational data types. By leveraging the power of SQL, along with tools like DuckDB and ArcGIS, we can efficiently manage and transform large datasets, such as those provided by the Overture Maps Foundation. Whether you're dealing with GeoParquet, JSON, or CSV files, this guide will show you how to unpack and normalize complex data structures, making your ETL tasks more manageable and less time-consuming.
You're probably aware of GeoParquet as an upcoming format, and this post uses the format as subject matter. If it's possible to inject complex objects into columnar data, someone will do it. I suppose the rationale is to supply a rich data model in a single row and a complete dataset in a single file or set of files of the same schema. However, it puts pressure on us ETL folks to unpack into a normalized schema.
Now to throw a little light on the subject matter, look closely at the map above (click to enlarge). The linear features are transport segments of many types (road, rail, and water types, with many subtypes of road) from the Overture Maps Foundation transportation theme. Overture delivers global-scale data as GeoParquet files in S3 and Azure blob storage with anonymous access. The parquet files are partitioned into chunks of about 1GB using a common schema per theme. The segment features for an area of interest can be split across any number of parquet files; random spatial access is facilitated by a bounding box property on the data. This is big data, and you don't want to download it all to get going.
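To make the bounding box idea concrete, here is a minimal, hedged sketch (not the blog's notebook code) of pruning a remote read with the bbox columns. The bucket path and release tag follow Overture's published layout but are my assumptions, and credential handling can vary with DuckDB version, so check both against the release you use.

import duckdb

conn = duckdb.connect()                   # in-memory is enough for a quick probe
conn.sql("INSTALL httpfs;")               # S3-API object store access
conn.sql("LOAD httpfs;")
conn.sql("SET s3_region='us-west-2';")    # Overture's public bucket, anonymous access

# Assumed path pattern for the 2024-08-20.0 transportation segments
segment_files = "s3://overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/*"

# Only files and row groups whose bbox values overlap the window are read in full
conn.sql(f"""
    select count(*) as segment_count
    from read_parquet('{segment_files}', hive_partitioning=1)
    where bbox.xmin < 2.50 and bbox.xmax > 2.20
      and bbox.ymin < 48.95 and bbox.ymax > 48.80;  -- rough window around Paris
""").show()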
In the lower right-hand area, you'll see a yellow feature heading southeast out of Paris; it's the E54 route that makes its way to Mulhouse on the German border. Here is a smaller-scale view:
E54 from Paris to Mulhouse
Now you can see I have lots of data – millions and millions of transport segments in France. While the source data is of a global scale, I just extracted features in France. Not shown are many more millions of various event types - road surface, condition flags, access restrictions, speed zones, turn restrictions, routes, and so on. All are imported to ArcGIS from GeoParquet in an S3 bucket using SQL.
Full disclosure, the ETL is given a leg up by ArcPy, but nothing that will surprise you.
Processing uses notebooks in ArcGIS Pro. If you wanted, these could be hosted notebooks in ArcGIS Online or ArcGIS Enterprise. My invaluable helper in all this is the DuckDB package, supplying connectivity to web data sources, remote query capability for S3-API-compliant object stores, a performant local database, and above all, SQL support for files, including GeoParquet, and additionally supporting spatial SQL! Who knew files could behave like DBMS objects?!
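If you want to poke at those capabilities outside the blog's notebooks, the setup is small. This is a generic sketch using standard DuckDB calls (nothing specific to the notebooks), assuming you have already cloned your Pro environment and installed the duckdb package:

import duckdb

# A file-backed database persists extracted tables between notebook sessions;
# duckdb.connect() with no path gives an in-memory database instead.
conn = duckdb.connect("overture.duckdb")

conn.sql("INSTALL httpfs;")    # remote file and S3-API support
conn.sql("LOAD httpfs;")
conn.sql("INSTALL spatial;")   # spatial types, functions, and GDAL-based export
conn.sql("LOAD spatial;")
conn.sql("SET s3_region='us-west-2';")   # region of Overture's public bucket

# Quick smoke test that spatial SQL is live
conn.sql("select ST_AsText(ST_Point(2.35, 48.85)) as wkt;").show()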
Before we go further, what is one of these complex objects I'm talking about, I hear you say?
Here is one:
{
"speed_limits": [
{
"is_max_speed_variable": true,
"between": [
0,
0.025818175
]
},
{
"max_speed": {
"value": 120,
"unit": "km/h"
},
"is_max_speed_variable": true,
"between": [
0.025818175,
0.125278098
]
},
{
"max_speed": {
"value": 100,
"unit": "km/h"
},
"is_max_speed_variable": true,
"between": [
0.125278098,
0.219762685
]
},
{
"max_speed": {
"value": 80,
"unit": "km/h"
},
"is_max_speed_variable": true,
"between": [
0.219762685,
0.831591278
]
},
{
"is_max_speed_variable": true,
"between": [
0.831591278,
1
]
}
]
}
The above is a struct column value from a single transport segment that defines speed zones along the segment. There are five zones for the segment and four have a maximum speed property. They could also have a minimum speed property or conditional properties pertaining to heavy vehicles, axle count, time of day, and so on. The 'between' array is a pair of linear event coordinates that ArcGIS can display given a suitable route system (which we'll build). You're going to see how to unravel all this with SQL.
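To give a flavour of where this is heading, here is a hedged sketch of unpacking that speed_limits value with SQL. It assumes the DuckDB connection from the setup sketch above and the local segments table built later in the post; the struct field names follow the JSON shown above.

conn.sql("""
    create or replace temp view speed_events as
    select
        id,
        sl.max_speed.value       as max_speed,    -- null for zones without a max_speed
        sl.max_speed.unit        as speed_unit,
        sl.is_max_speed_variable as is_variable,
        sl."between"[1]          as begins,       -- from-measure (0..1 along the segment)
        sl."between"[2]          as ends          -- to-measure
    from (
        select id, unnest(speed_limits) as sl
        from segments
        where speed_limits is not null
    );
""")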
My data is the Overture Maps Foundation transportation theme at the 2024-08-20.0 release. Much of Overture's data is out of beta, but the transportation theme at the time of writing remains in beta, so while you may edit the notebooks to use a later release, there may be schema changes that break the notebooks.
Where do you start? You start by seeing what you've got in the data source. Here, I am doing that in a utility notebook:
Inspecting the segments theme
The important things in the upper cell are importing duckdb (you'll need to clone your environment and install it from conda-forge or PyPI) and creating a connection (which could be just to memory - no database path). The lower cell describes my subject matter data by reading the S3 hive. By inspection, I can see ordinary columns that are VARCHAR, but one, 'connector_ids', is an array, signified by the [] brackets at the end. There are a bunch of these structs, which I think of as JSON dictionaries.
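In case the screenshot is hard to read, the two cells boil down to something like this; the bucket path and release tag are my assumptions, so substitute your own.

import duckdb

conn = duckdb.connect()   # in-memory is fine for schema inspection
conn.sql("INSTALL httpfs;")
conn.sql("LOAD httpfs;")
conn.sql("SET s3_region='us-west-2';")

segment_files = "s3://overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/*"

# DESCRIBE against the remote hive reports each column's name and DuckDB type
conn.sql(f"describe select * from read_parquet('{segment_files}', hive_partitioning=1);").show()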
Frustratingly, the column_type details do not line-wrap, so a little Python to the rescue, and we can see struct columns can also be arrays, for example, as we see in speed_limits above.
Showing the full schema
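Something like the following does the trick, assuming the conn and segment_files from the previous cell; describe returns column_name and column_type as its first two fields.

schema = conn.sql(
    f"describe select * from read_parquet('{segment_files}', hive_partitioning=1);"
).fetchall()
for column_name, column_type, *rest in schema:
    print(f"{column_name}: {column_type}")   # full STRUCT/LIST definitions, no truncation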
So, having inspected the segment schema documentation and sample data, we can see the basic shape of it is segment linestring features (with unique key 'id'), WKB geometry, a bounding box struct, some descriptive fields, plus a set of array & struct fields we'll have to unpack into linear events.
In classic divide and conquer fashion, I split the job into tractable chunks with a view to making it easy for you to modify for your area of interest.
To get started, add the blog download notebooks to a Pro 3.3+ project; I created a new project. I'm guessing most people will have an area of interest, not the whole planet, so I based my processing around this. You are free to extract the whole planet, but it may take some time.😉
Overture includes a political divisions dataset named division_area. The notebook that extracts this data is DivisionArea. Open and run DivisionArea, and inspect the output.
DivisionArea features
Identify some features in your area of interest. Determine a layer definition query ('where' clause) for the data that defines the total set of features within which you want to extract transportation theme data from Overture. For example, my area of interest is the country of France:
country = 'FR' and subtype in ('country') and names.primary in ('France')
You don't need to apply a layer definition query; it's the SQL we want. You can select any number or combination of DivisionArea features, disjoint or not.
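To show the idea (the real logic lives in the Segments notebook's wktRelation line), here is a rough, hedged sketch of turning such a where clause into a single WKT area of interest. The division_area path is an assumption, and the aggregate simply dissolves the selected divisions into one geometry.

division_files = "s3://overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area/*"
where_clause = "country = 'FR' and subtype in ('country') and names.primary in ('France')"

# One WKT polygon covering the selected divisions
aoi_wkt = conn.sql(f"""
    select ST_AsText(ST_Union_Agg(ST_GeomFromWKB(geometry))) as wkt
    from read_parquet('{division_files}', hive_partitioning=1)
    where {where_clause};
""").fetchone()[0]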
The next step is to extract the segments data from S3 to a local project DuckDB database. That will save going back to the S3 bucket again for data.
Open the Segments notebook. Edit the line for wktRelation to use your where clause. You'll see a few samples I left behind. Save and run the notebook. Depending on the feature count in your area of interest, it may take some time to extract the data. The notebook also makes a route system. In my case, it took about 50 minutes for all of metropolitan France, including making a Connectors feature class, which you may consider optional.
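For orientation, the heart of that extraction looks something like the sketch below. It assumes the conn, segment_files, and aoi_wkt from the earlier sketches and uses a cheap bbox overlap test to limit the remote scan before the exact intersection.

# Bounding box of the area of interest
xmin, ymin, xmax, ymax = conn.sql(f"""
    select ST_XMin(g), ST_YMin(g), ST_XMax(g), ST_YMax(g)
    from (select ST_GeomFromText('{aoi_wkt}') as g);
""").fetchone()

# Pull matching segments down into a local table once, then work locally
conn.sql(f"""
    create or replace table segments as
    select *
    from read_parquet('{segment_files}', hive_partitioning=1)
    where bbox.xmax >= {xmin} and bbox.xmin <= {xmax}   -- bbox overlap prunes the remote read
      and bbox.ymax >= {ymin} and bbox.ymin <= {ymax}
      and ST_Intersects(ST_GeomFromWKB(geometry), ST_GeomFromText('{aoi_wkt}'));
""")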
Now that you have your transport segments table in DuckDB, you can run the rest of the notebooks, in any order, to make your transport information product - the various events on the segments.
Here is a simple one for road condition flags in the Flags notebook:
Flags notebook
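I won't reproduce the notebook verbatim, but the unpack it performs is along these lines. The road_flags column name and its values/between shape are my reading of the schema, so confirm them against the describe output before leaning on this.

conn.sql("""
    create or replace temp view flag_events as
    select
        id,
        unnest(rf."values")  as flag,     -- one row per flag value
        rf."between"[1]      as begins,
        rf."between"[2]      as ends
    from (
        select id, unnest(road_flags) as rf
        from segments
        where road_flags is not null
    );
""")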
If you do a little reading on DuckDB, you'll see it includes the GDAL library, which has the OpenFileGDB driver, so it's possible to use a simpler approach to write out the data. However, while this automatically derives a schema, it defaults text fields to 64K width and appears not to offer the OPENFILEGDB_DEFAULT_STRING_WIDTH configuration parameter, so I defined the schema manually with ArcPy. To find suitable text field widths, I used a two-stage process: first writing with very large widths, then checking the actual data size using a simple script tool (code below, atbx included in the blog download) to refine the text widths:
# Report the maximum data width of each text field in a table so the
# text field widths can be tightened on a second write.
import arcpy

tbl = arcpy.GetParameterAsText(0)
d = arcpy.da.Describe(tbl)
flds = [f.name for f in d['fields'] if f.type == 'String']
mDict = {f: 0 for f in flds}
with arcpy.da.SearchCursor(tbl, flds) as cursor:
    for row in cursor:
        for i, f in enumerate(flds):
            if row[i]:
                mDict[f] = max(mDict[f], len(row[i]))
for k in mDict.keys():
    arcpy.AddMessage(f"""Input table '{tbl}' field '{k}' has maximum data width '{mDict[k]}'""")
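For completeness, the simpler GDAL route mentioned above looks roughly like this (a hedged sketch; it writes a usable feature class, just with those over-wide text fields):

conn.sql("""
    copy (select id, ST_GeomFromWKB(geometry) as geom from segments)
    to 'segments_gdal.gdb'
    with (format gdal, driver 'OpenFileGDB');
""")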
You'll see in the notebooks that I may use multiple UNNEST steps on each array and struct. This powerful function does what it suggests and flattens the data - arrays into new rows and structs into new columns. However, if a row has a null value for an array or struct, then the unnest function does not emit the row, so you'll see UNION clauses to bring all rows through, for example in the AccessRestrictions notebook.
# when_using
conn.sql("""create or replace temp view restrictions_view3 as select
id,
access_type,
when_during,
when_heading,
unnest(when_using) as when_using,
when_recognized,
when_mode,
vehicle,
begins,
ends
from restrictions_view2 where when_using is not null
union all by name
select * exclude (when_using) from restrictions_view2 where when_using is null;""")
If 'when_using' is null we still want the row because we have yet to unpack 'when_recognized' and 'when_mode'. The exclude term prevents a column type error between VARCHAR values unnested from 'when_using' and STRUCT values in 'when_using' before unnesting.
All the notebooks except ProhibitedTransitions create event tables that you display on the map as linear events along the SegmentsM route system, using the begins and ends fields as from-measure and to-measure values. See the Make Route Event Layer geoprocessing tool.
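If you haven't driven that tool from Python before, the call is along these lines; the route identifier field and table names here are illustrative rather than the notebooks' exact names.

import arcpy

# Display an event table as linear events along the M-enabled route system
arcpy.lr.MakeRouteEventLayer(
    "SegmentsM",              # route feature class built by the Segments notebook
    "id",                     # route identifier field (assumed)
    "Flags",                  # one of the event tables produced by the notebooks
    "id LINE begins ends",    # line events defined by from-measure / to-measure fields
    "Flags Events"            # output layer name
)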
ProhibitedTransitions creates line features with a transition_order sequence flag.
That's it, bulk data delivered from a cloud object store with the help of SQL! Looking back at the problem before starting, it seemed very daunting, something for some serious developer time. Remote GeoParquet files in an S3 hive, complex column types with nested data, geometric scoping using linear referencing – all these problems go away when you throw the right tools at them. The star of the show, though, is SQL, making complex ETL simpler.
Notes: I had some feedback that prompts me to add this clarification.
The order in which to implement this sample is:
1. Run the DivisionArea notebook and work out the where clause for your area of interest.
2. Run the Segments notebook to extract the segments locally and build the route system.
3. Run the remaining event notebooks in any order.