Go Cloud Native! Overture GeoParquet, From Object Store To Feature Layer Via Online Notebook

01-19-2024 11:27 AM
BruceHarold
Esri Regular Contributor

Every now and then a compelling workflow is enabled by new ideas, and data distribution by cloud native formats is my example today.  Take this (very plain) map for example:

Places of interest in London, England

What you are looking at is a hosted group layer in ArcGIS Online with 306229 place points around London, England, with related tables: 306229 address details, 18566 business brand names, 237806 website URLs, 414651 source attributions, 206639 social media link URLs, and 279495 phone numbers.  If we explore places in Covent Garden we can see dozens of business brands:

Covent Garden

 

This being England, you might be interested in tea rooms about the city.  Using place category and related tables we can see not just their locations but their address, website, social media links and phone numbers:

Tea Rooms

Tea Rooms Related Data

 

This place data is one of the themes from the Overture Maps Foundation and is made available under this license.  If you surf the Overture website, you'll see it is a collaboration with Amazon, Meta, Microsoft and TomTom as steering members, Esri as a general member, and many other contributor members.  It is envisaged as a resource for "developers who build map services or use geospatial data".  I'm democratizing it a bit more here, giving you a pattern for consuming the data as hosted feature layers in your ArcGIS Online or ArcGIS Enterprise portals.

Let's dig into the details of how to migrate the data to feature services.

The data is made available in Amazon S3 and Azure object stores as Parquet files, with anonymous access.  I'll let you read up on the format details elsewhere, but Parquet is definitely one star of the cloud native show because it is optimized for querying by attribute column, and in the case of GeoParquet, this includes a spatial column (technically multiple spatial columns if you want to push your luck).  As GeoParquet is an emerging format it still has some things that are TODO, like a spatial index (which would let you query by spatial operators), but Overture very thoughtfully includes a bounding box property which is simple to query by X and Y.
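
As a minimal sketch of querying by the bounding box property, the helper below builds a SQL predicate from an extent. The struct field names (xmin, xmax, ymin, ymax) are an assumption about the release schema (earlier Overture releases used different names, so check the schema for your release), and the London extent is a hypothetical, approximate value for illustration only.

```python
def bbox_where(xmin, ymin, xmax, ymax, column="bbox"):
    """Build a SQL predicate keeping rows whose bounding box lies inside the extent.

    Assumes the bounding box is a struct column with xmin/xmax/ymin/ymax
    fields; adjust the names if your release's schema differs.
    """
    return (
        f"{column}.xmin >= {xmin} and {column}.xmax <= {xmax} "
        f"and {column}.ymin >= {ymin} and {column}.ymax <= {ymax}"
    )


# Hypothetical extent roughly covering Greater London (longitude/latitude).
where_exp = bbox_where(-0.52, 51.28, 0.34, 51.70)
print(where_exp)
```

The resulting string can be dropped into the where clause of any extract query. For point data such as places, the box collapses to the point, so a simple containment test like this is enough.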

The technology that is the second star of the cloud native show is DuckDB.  DuckDB enables SQL query of local or remote files like CSV, JSON and of course Parquet (and many more) as if they were local databases.  Remote file query is especially powerful when the host supports the Amazon S3 REST API, because a client that speaks it, as DuckDB does, can use HTTP range requests to read only the parts of a file that a query needs.  Equally powerful is DuckDB's ability to unpack complex data types (arrays and structs) into a rich data model like I'm doing here (base features and 1:M related tables), and not just flat tables.  Only the data needed to answer the query comes over the wire, not the whole remote file.
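
To make the base-feature-plus-related-tables idea concrete, here is a plain-Python sketch of what unpacking an array column amounts to conceptually. It is not how DuckDB does it internally (in the notebook the work is done in SQL); the sample record, the field names "name", "websites" and "phones", and the function split_place are all hypothetical.

```python
def split_place(place, array_fields=("websites", "phones")):
    """Split one nested place record into a base row and 1:M child rows.

    Each array field becomes a list of (id, value) tuples, which is the
    shape of a related table keyed on the parent feature's id.
    """
    base = {"id": place["id"], "name": place["name"]}
    children = {}
    for field in array_fields:
        values = place.get(field) or []  # a missing or null array yields no rows
        children[field] = [(place["id"], value) for value in values]
    return base, children


# Hypothetical record with one website and two phone numbers.
place = {
    "id": "abc123",
    "name": "Example Tea Room",
    "websites": ["https://example.com"],
    "phones": ["+44 20 0000 0000", "+44 20 0000 0001"],
}
base, children = split_place(place)
print(base)
print(children["phones"])
```

One place row fans out to as many child rows as the array has members, which is why the related table counts in the map differ from the place count.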

The third star of the cloud native show is ArcGIS Online hosted notebooks, which can be used to orchestrate DuckDB transactions and integrate the results into new or refreshed hosted feature layers in Online or Enterprise.  The superpower of this combination of Parquet, DuckDB and Python is that global scale data can be queried for a subset of interest, in a desired schema, using industry standard SQL, and the whole thing can be automated.  This forever deprecates the legacy workflow of downloading superset data files to retrieve the part you want.  At writing, the places data resolves to 6 files totaling 5.54GB, not something you want to haul over the wire before you start processing!  If you think about it, any file format you have to zip to move around and unzip to query (shapefile, file geodatabase) generates some friction; Parquet avoids this.

The notebook named DuckDBIntegration is what I used to import the places data and is in the blog download.  Notebooks aren't easy to share graphically but I'll make a few points.

Firstly, ArcGIS notebooks don't include the DuckDB Python API, so it needs to be installed from PyPI.  Here is my cell that does the import and also loads the spatial and httpfs extensions needed for DuckDB in this workflow.

 

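try:
    import duckdb
except ImportError:
    # DuckDB is not preinstalled in ArcGIS notebooks, so install a pinned version from PyPI
    !pip install duckdb==0.10.2
    import duckdb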
conn = duckdb.connect()
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set s3_region='us-west-2';")
release = "2024-04-16-beta.0"
outGDB = r"/arcgis/Places.gdb"

 

I did not model the full Overture places schema, only the elements that looked populated in my area of interest.
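
The extract cells query a view named places_view, and the cell that defines it isn't reproduced in this post. As a sketch under stated assumptions, a view over the remote hive could be built like this. The bucket name and path layout are assumptions not taken from this post (verify them against the Overture documentation for your release), and places_view_sql and create_places_view are hypothetical helper names.

```python
def places_view_sql(release, bucket="overturemaps-us-west-2"):
    """Return SQL that defines places_view over the remote places hive.

    The bucket and path layout are assumptions; confirm them for the
    release you are targeting.
    """
    url = f"s3://{bucket}/release/{release}/theme=places/type=place/*"
    return (
        "create or replace view places_view as "
        f"select * from read_parquet('{url}', filename=true, hive_partitioning=1);"
    )


def create_places_view(conn, release):
    """Run the view definition on an open DuckDB connection (httpfs loaded)."""
    conn.execute(places_view_sql(release))


sql_text = places_view_sql("2024-04-16-beta.0")
print(sql_text)
```

Defining a view costs nothing up front: no data is read until an extract query runs against it, so changing the release string is all it takes to point the notebook at a newer release.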

If you browse the places schema note the YAML tab.  Don't be surprised in 2025 if you see YAML powering AI to let you have conversations with your data 😉

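Each extract from the S3 hive follows the same pattern.  Here is the cell that extracts address data.  Note the UNNEST function, which with recursive := true unpacks the addresses column (an array of structs) into one row per address with one column per struct field.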

 

print('Extracting Addresses starting at {}'.format(getNow()))
sql = """with address as (select id, unnest(addresses, recursive := true) from places_view where addresses is not null
and {}) select id, freeform, locality, postcode, region, country from address;""".format(whereExp)
relation = duckdb.sql(sql)
addresses = arcpy.management.CreateTable(out_path=r"/arcgis/Places.gdb",out_name="Addresses").getOutput(0)
arcpy.management.AddField(in_table=addresses,field_name="id",field_type="TEXT",field_length=40,field_is_nullable="NULLABLE",field_is_required="REQUIRED")
arcpy.management.AddField(in_table=addresses,field_name="freeform",field_type="TEXT",field_length=1000,field_is_nullable="NULLABLE",field_is_required="NON_REQUIRED")
arcpy.management.AddField(in_table=addresses,field_name="locality",field_type="TEXT",field_length=100,field_is_nullable="NULLABLE",field_is_required="NON_REQUIRED")
arcpy.management.AddField(in_table=addresses,field_name="postcode",field_type="TEXT",field_length=1000,field_is_nullable="NULLABLE",field_is_required="NON_REQUIRED")
arcpy.management.AddField(in_table=addresses,field_name="region",field_type="TEXT",field_length=100,field_is_nullable="NULLABLE",field_is_required="NON_REQUIRED")
arcpy.management.AddField(in_table=addresses,field_name="country",field_type="TEXT",field_length=2,field_is_nullable="NULLABLE",field_is_required="NON_REQUIRED")
print('Addresses table created at {}'.format(getNow()))
with arcpy.da.InsertCursor(addresses,["id","freeform","locality","postcode","region","country"]) as iCursor:
    i = 0
    row = relation.fetchone()
    while row:
        iCursor.insertRow(row)
        i += 1
        # Report progress every 10,000 inserted rows
        if i % 10000 == 0:
            print('Inserted {} rows at {}'.format(i,getNow()))
        row = relation.fetchone()
# The with block closes the cursor, so no explicit del is needed
print('Addresses table populated at {}'.format(getNow()))
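
Since every extract follows the same fetch-and-insert pattern, the loop could be factored into a reusable helper. The sketch below is an alternative that fetches in batches with the relation's fetchmany method rather than one row at a time; whether that is any faster in practice is untested here. The names iter_rows and load_rows are hypothetical, and the two Fake classes are stand-ins so the sketch runs without DuckDB or ArcPy.

```python
def iter_rows(relation, batch_size=10000):
    """Yield rows from a relation, fetching batch_size rows per call."""
    while True:
        batch = relation.fetchmany(batch_size)
        if not batch:  # an empty batch means the result is exhausted
            break
        yield from batch


def load_rows(relation, cursor, batch_size=10000):
    """Insert every row of the relation through the cursor; return the count."""
    count = 0
    for row in iter_rows(relation, batch_size):
        cursor.insertRow(row)
        count += 1
    return count


class FakeRelation:
    """Stand-in for a DuckDB relation; only implements fetchmany."""
    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size):
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch


class FakeCursor:
    """Stand-in for arcpy.da.InsertCursor; records inserted rows."""
    def __init__(self):
        self.rows = []

    def insertRow(self, row):
        self.rows.append(row)


cursor = FakeCursor()
inserted = load_rows(FakeRelation([("a", 1), ("b", 2), ("c", 3)]), cursor, batch_size=2)
print(inserted, cursor.rows)
```

In the notebook, the real relation and an arcpy.da.InsertCursor opened in a with block would take the place of the stand-ins.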

 

You'll see I set specific field widths for text fields.  Common web formats (CSV, JSON, Parquet) don't let you tune data types like this, but I like to do so.  However, if you're sharp-eyed you'll see the postcode field has a crazy width of 1000.  To determine field widths, I kept a utility cell (in the download) in my notebook during construction and used it to explore the data, which is how I found the maximum field sizes.  It turns out the crazy big postcode I found looks like a data error.
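
The utility cell itself is in the download rather than this post, so the following is only a guess at its spirit: a helper that builds one SQL statement reporting the longest value in each text column. The function name max_length_sql and the relation name addresses_extract are hypothetical.

```python
def max_length_sql(fields, table):
    """Build SQL returning the maximum character length of each listed column."""
    columns = ", ".join(f"max(length({name})) as {name}_maxlen" for name in fields)
    return f"select {columns} from {table};"


# 'addresses_extract' is a hypothetical name for the unnested address rows.
sql_text = max_length_sql(["freeform", "postcode"], "addresses_extract")
print(sql_text)
```

Running a statement like this against an extract gives one row of maximum lengths, which makes an outlier such as an oversized postcode easy to spot before committing to a field width.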

Likely Data Error

To go into production with this approach, first figure out a query on extent or address fields (region, country) that works for you, then plug it into the notebook after sharing it to ArcGIS Online.  You'll need to supply your own credentials of course, and change the target GIS if you're using Enterprise for the output feature service.  To initialize the feature service, run the notebook manually down to the cell that zips up the working file geodatabase, download the zip file from the notebook, share it as an item and feature service, then plug in the item id.  Then at each Overture release, change the theme URL to match the release and refresh your service.
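
The zip and refresh steps can be sketched as follows. This is not the notebook's own code: zip_gdb and overwrite_service are hypothetical helpers, the item id is yours to supply, and the overwrite call assumes the service was originally published from a zipped file geodatabase with the same name and schema. The demo at the bottom zips a throwaway folder so the sketch runs without ArcGIS.

```python
import os
import shutil
import tempfile
import zipfile


def zip_gdb(gdb_path):
    """Zip a file geodatabase folder next to itself and return the zip path."""
    gdb_path = os.path.normpath(gdb_path)
    parent, name = os.path.split(gdb_path)
    base_name = os.path.join(parent, os.path.splitext(name)[0])
    # base_dir keeps the .gdb folder itself as the top level inside the archive
    return shutil.make_archive(base_name, "zip", root_dir=parent or os.curdir, base_dir=name)


def overwrite_service(item_id, zip_path, gis_url="home"):
    """Overwrite a hosted feature service from a zipped file geodatabase.

    Imports are local so the rest of the sketch runs without the ArcGIS
    API for Python; "home" connects as the signed-in notebook user.
    """
    from arcgis.gis import GIS
    from arcgis.features import FeatureLayerCollection

    gis = GIS(gis_url)
    item = gis.content.get(item_id)
    return FeatureLayerCollection.fromitem(item).manager.overwrite(zip_path)


# Demo with a stand-in folder rather than a real geodatabase.
work_dir = tempfile.mkdtemp()
demo_gdb = os.path.join(work_dir, "Places.gdb")
os.mkdir(demo_gdb)
open(os.path.join(demo_gdb, "a00000001.gdbtable"), "w").close()
zip_path = zip_gdb(demo_gdb)
with zipfile.ZipFile(zip_path) as archive:
    names = archive.namelist()
print(zip_path, names)
```

Keeping the geodatabase and zip names stable between runs matters, because an overwrite generally expects the replacement file to match the one the service was published from.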

To give you a feel for performance, it takes a few minutes for the S3 hive to be queried for each layer but the results write at tens of thousands of features per second.  It takes about 50 minutes for my test extent and data model, including overwrite of a target feature service.

Naturally, if the data schema changes, or if you want to support other Overture themes, you'll need to author notebook changes or new notebooks.  Do share!

I'm sure you'll agree that this is a powerful new paradigm that will change the industry.  I'm just following industry trends.  It will be fun to see if Parquet is what comes after the shapefile for data sharing.

The blog download has the notebook but not a file geodatabase (it's too big for the blog technology).  When you generate your own services, don't forget the data license here.  Have fun!

7 Comments
ShareUser
Esri Community Manager

Are you extracting data for your entire AOI and making a copy of it in a hosted feature service?  What benefit does that give you?  You said the Overture data was hosted on Amazon S3 and freely available to the public.  You also said the GeoParquet format is optimized for attribute queries.  Why would you want to make a copy of the Overture data and store it in a hosted feature service where you have to pay for storage?

BruceHarold
Esri Regular Contributor

@ShareUser A feature service with behaviors like relationship classes is easily digestible in ArcGIS Pro for mapping and analysis, a remote hive of GeoParquet files (much) less so, at least for a Pro user.

ShareUser
Esri Community Manager

So it's an interim measure until GeoParquet matures and ArcGIS Pro adds support for GeoParquet?

BruceHarold
Esri Regular Contributor

@ShareUser Now that is exactly the right question!  GeoParquet certainly has some very attractive properties, such as a rich data type set (including arrays and structs so you can have an implied data model in one table), queryable metadata, support in the S3 API across the major object stores etc.  This makes it a candidate to replace the venerable shapefile, if the user community chooses.

ShareUser
Esri Community Manager

So GeoParquet and Cloud Optimized GeoTIFF could replace simple map services and simple image services, leaving ArcGIS Server to do more geoprocessing, geocoding, etc.

BruceHarold
Esri Regular Contributor

@ShareUser I suppose simple datasets in object stores could be used that way, but on the vector data front I'm personally interested in how AI could be used to generate SQL that gives you the view of GeoParquet that you want to consume, in a smart enough client.