
Go Cloud Native! Overture GeoParquet, From Object Store To Feature Layer Via Online Notebook

01-19-2024 11:27 AM
BruceHarold
Esri Regular Contributor

Every now and then a compelling workflow is enabled by new ideas, and data distribution by cloud native formats is my topic today.  Here is my example data:

Places of interest in London, England

What you are looking at is a hosted group layer in ArcGIS Online with 309,402 place-of-interest points in London, England, plus related tables: 309,402 address details, 18,383 brand names, 239,309 website URLs, 417,720 source attributions, 207,114 social media link URLs, and 281,556 phone numbers.  Here is a chocolatier in Covent Garden:

Chocolatier in Covent Garden

This being England, you might expect there are tea rooms about the city.  Using place category and related tables we can see not just their locations but their address, website, social media links and phone numbers:

Tea Rooms in London

 

Tea Rooms and related data

This place data is one of the themes from Overture Maps Foundation and is made available under this license.  If you surf the Overture website, you'll see it is a collaboration of Amazon, Meta, Microsoft and TomTom as steering members, Esri as a general member, and many other contributor members, and is envisaged as a resource for "developers who build map services or use geospatial data".  I'm democratizing it a bit more here, giving you a pattern for consuming the data as hosted feature layers in your ArcGIS Online or ArcGIS Enterprise portals.

Let's dig into the details of how to migrate the data to feature services.

The data is made available in Amazon S3 and Azure object stores as Parquet files, with anonymous access.  I'll let you read up on the format details elsewhere, but Parquet is definitely one star of the cloud native show because it is optimized for querying by attribute column, and in the case of GeoParquet this includes a spatial column (technically multiple spatial columns, if you want to push your luck).  As GeoParquet is an emerging format it still has some things that are TODO, like a spatial index (which would let you query by spatial operators), but Overture very thoughtfully includes a bounding box property which is simple to query by X and Y.

The second star of the cloud native show is DuckDB.  DuckDB enables SQL query of local or remote files like CSV, JSON and of course Parquet (and many more) as if they were local databases.  Remote file query is especially powerful: if the hosting object store supports the Amazon S3 REST API, a client that speaks it - and DuckDB does - can send SQL queries over HTTP.  Just as powerful is DuckDB's ability to unpack complex data types (arrays and structs) into a rich data model like I'm doing here (base features and 1:M related tables), not just flat tables.  Only the query result is returned, not the remote file.

The third star of the cloud native show is ArcGIS Online hosted notebooks, which can be used to orchestrate DuckDB transactions and integrate the results into new or refreshed hosted feature layers in Online or Enterprise.  The superpower of this combination of Parquet, DuckDB and Python is that global scale data can be queried for just the subset of interest, in a desired schema, using industry standard SQL, and the whole thing can be automated.  This forever deprecates the legacy workflow of downloading superset data files just to retrieve the part you want.  At writing, the places data resolves to 6 files totaling 5.54GB, not something you want to haul over the wire before you start processing!  If you think about it, any file format you have to zip to move around and unzip to query (shapefile, file geodatabase) generates friction; Parquet avoids this.
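
To give you a flavor before we get to the notebook, here is a minimal standalone sketch of that idea - the bounding box values are illustrative coordinates roughly covering London, and the release value is a placeholder for an Overture release string - which counts places in an extent without pulling any Parquet files down:

# Minimal standalone sketch (illustrative values): count places inside a rough London
# extent directly against the remote Parquet files - only the scalar count comes back.
import duckdb

release = "<an Overture release string>"  # placeholder - substitute a current release
con = duckdb.connect()
con.sql("install httpfs;load httpfs;")
con.sql("set s3_region='us-west-2';")
count = con.sql(f"""
    select count(*)
    from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=places/type=place*/*', hive_partitioning=1)
    where bbox.xmin >= -0.6 and bbox.ymin >= 51.2
      and bbox.xmax <= 0.4 and bbox.ymax <= 51.8;
""").fetchone()[0]
print(count)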

The notebook named DuckDBIntegration is what I used to import the places data and is in the blog download.  Notebooks aren't easy to share graphically but I'll make a few points.

Firstly, ArcGIS notebooks don't include the DuckDB Python API, so it needs to be installed from PyPI.  Here is the code that does the import and also loads the spatial and httpfs extensions needed for DuckDB in this workflow.  The DuckDB version at writing is 1.0.0.

 

try:
    import duckdb
except ImportError:
    # The DuckDB Python API isn't preinstalled in ArcGIS notebooks, so install it from PyPI
    !pip install duckdb==1.0.0
    import duckdb
conn = duckdb.connect()
conn.sql("install spatial;load spatial;")   # spatial types and ST_* functions
conn.sql("install httpfs;load httpfs;")     # HTTP/S3 access to remote Parquet
conn.sql("set s3_region='us-west-2';")      # region hosting the Overture release
conn.sql("set enable_object_cache=true;")   # cache remote object metadata between queries

 

Once DuckDB is loaded it is a matter of extracting the layers of interest.  I did not model the full Overture places schema, for example omitting alternate place categories.

Spoiler
If you browse the places schema note the YAML tab.  Don't be surprised in 2025 if you see YAML powering AI to let you have conversations with your data 😉

As Overture data is of global scale, with hundreds of millions of features for the base Places layer, the extract must be limited to my area of interest.  The method I adopted was to use the Division Area theme as my selection layer.  This is all the world's political divisions from country to neighborhood scale - I build a selection geometry from a SQL query on division area subtypes and names.  The blog download includes a notebook ExtractDivisions which extracts the data to a file geodatabase feature class.  After some data exploration I determined this SQL where clause would define my area of interest:

 

country = 'GB' and subtype in ('county','locality') and names.primary in ('City Of London','London','Camden','Tower Hamlets','Islington')

 

Then in my cell that makes my extraction polygon I use this SQL and DuckDB's spatial operators to construct my selection geometry, both as a Well Known Text polygon and XY bounds variables:

 

# Collect each selected division polygon as WKT
sql = f"""select ST_AsText(ST_GeomFromWKB(geometry)) as wkt
          from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=divisions/type=division_area*/*', filename=true, hive_partitioning=1)
          where country = 'GB' and subtype in ('county','locality') and names.primary in ('City Of London','London','Camden','Tower Hamlets','Islington');"""
wktList = [row[0] for row in conn.sql(sql).fetchall()]
# Union the polygons pairwise into a single selection geometry
wkt = wktList[0]
for w in wktList[1:]:
    wkt = conn.sql(f"""select ST_AsText(ST_Union(ST_GeomFromText('{wkt}'),ST_GeomFromText('{w}')));""").fetchone()[0]
# Derive the XY bounds of the selection geometry for use as a bbox prefilter
xmin,ymin,xmax,ymax = conn.sql(f"""with aoi as (select ST_GeomFromText('{wkt}') as geom)
                                   select ST_XMin(geom),ST_YMin(geom),ST_XMax(geom),ST_YMax(geom) from aoi;""").fetchone()
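
As a design note, the pairwise union loop can usually be collapsed into a single aggregate query.  Here is a sketch of that, assuming your DuckDB spatial build includes the ST_Union_Agg aggregate (check your version) and reusing the same conn and release variables:

# Alternative sketch: union all selected division polygons in one aggregate query,
# assuming the spatial extension provides ST_Union_Agg.
sql = f"""with divisions as (
              select ST_GeomFromWKB(geometry) as geom
              from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=divisions/type=division_area*/*', filename=true, hive_partitioning=1)
              where country = 'GB' and subtype in ('county','locality') and names.primary in ('City Of London','London','Camden','Tower Hamlets','Islington')),
          aoi as (select ST_Union_Agg(geom) as geom from divisions)
          select ST_AsText(geom), ST_XMin(geom), ST_YMin(geom), ST_XMax(geom), ST_YMax(geom) from aoi;"""
wkt, xmin, ymin, xmax, ymax = conn.sql(sql).fetchone()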

 

I then make a view of the Places theme which defines the features I want:

 

sql = f"""create or replace view places_view as select *
          from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=places/type=place*/*',filename=false, hive_partitioning=1)
          where bbox.xmin >= {xmin}
            and bbox.ymin >= {ymin}
            and bbox.xmax <= {xmax}
            and bbox.ymax <= {ymax}
            and ST_Intersects(ST_GeomFromText('{wkt}'), ST_GeomFromWKB(geometry));"""
conn.sql(sql)
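
Before running the full extraction it's worth a quick sanity check that the view resolves and returns a plausible number of rows, for example:

# Optional sanity check - the count runs against the remote Parquet via the view
print(conn.sql("select count(*) from places_view;").fetchone()[0])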

 

The Places data schema models the expansion tables as arrays of structs, so that each place may have a 1:M relationship to its expansion properties.  If this terminology is new to you, it is equivalent to Python lists of dictionaries, but with the property that the dictionaries are guaranteed to have the same key names in each row.
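
To make that concrete, here is a purely illustrative (made-up) addresses value for a single place, written as the equivalent Python structure:

# Illustrative only - a hypothetical addresses value for one place, expressed as the
# equivalent Python list of dictionaries; every dictionary has the same key names.
addresses = [
    {"freeform": "1 Example Street", "locality": "London",
     "region": None, "postcode": "WC2E 1AA", "country": "GB"}
]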

Now that I have my Places view it is a matter of throwing SQL statements at it to create relations (cursors) to use with ArcPy insert cursors to write each layer.  If you inspect the ExtractDivisions notebook you'll see it is possible to use a built-in file geodatabase driver in DuckDB to dump data out quickly and easily.  However, this doesn't control the schema very well, so I take a longer-winded approach to creating the output data, relying on ArcPy.  Here is the code which writes the Places base layer:

 

# Create the output file geodatabase and the Places point feature class
arcpy.env.workspace = arcpy.management.CreateFileGDB(out_folder_path=out_folder_path,
                                                     out_name=out_name,
                                                     out_version="CURRENT").getOutput(0)
arcPlaces = arcpy.management.CreateFeatureclass(out_path=arcpy.env.workspace,
                                                out_name="Places",
                                                geometry_type="POINT",
                                                has_m="DISABLED",
                                                has_z="DISABLED",
                                                spatial_reference=sR).getOutput(0)
arcpy.management.AddField(in_table=arcPlaces,field_name="id",field_type="TEXT",field_length=32)
arcpy.management.AddField(in_table=arcPlaces,field_name="name_primary",field_type="TEXT",field_length=1000)
arcpy.management.AddField(in_table=arcPlaces,field_name="category_main",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcPlaces,field_name="confidence",field_type="FLOAT")
arcpy.management.AddField(in_table=arcPlaces,field_name="update_time",field_type="DATE")
arcpy.management.AddField(in_table=arcPlaces,field_name="version",field_type="SHORT")
# Query the view for the fields of interest; geometry comes back as WKB
sql = """select id,
                names.primary as name_primary,
                categories.main as category_main,
                confidence,
                try_cast(update_time as datetime) as update_time,
                version,
                geometry
                from places_view;"""
duckPlaces = conn.sql(sql)
# Stream the DuckDB relation into the feature class with an insert cursor
with arcpy.da.InsertCursor(arcPlaces,["id","name_primary","category_main","confidence","update_time","version","shape@"]) as iCursor:
    row = duckPlaces.fetchone()
    i = 1
    while row:
        if i % 10000 == 0:
            print('Inserted {} Places rows at {}'.format(str(i),getNow()))
        row = list(row)
        row[-1] = arcpy.FromWKB(row[-1])  # convert WKB bytes to an arcpy geometry
        iCursor.insertRow(row)
        i += 1
        row = duckPlaces.fetchone()

 

So, the above creates my point layer.  Then it is a matter of extracting each expansion table, which all follow the same pattern.  Here is the code that extracts addresses from the Places view:

 

# Expand the addresses array of structs into one row per address, keyed by place id
sql = """with address as (select id, unnest(addresses, recursive := true) from places_view where addresses is not null)
         select id, freeform, locality, region, postcode, country from address;"""
duckAddresses = conn.sql(sql)
arcAddresses = arcpy.management.CreateTable(out_path=arcpy.env.workspace,out_name="Addresses").getOutput(0)
arcpy.management.AddField(in_table=arcAddresses,field_name="id",field_type="TEXT",field_length=32)
arcpy.management.AddField(in_table=arcAddresses,field_name="freeform",field_type="TEXT",field_length=300)
arcpy.management.AddField(in_table=arcAddresses,field_name="locality",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcAddresses,field_name="region",field_type="TEXT",field_length=50)
arcpy.management.AddField(in_table=arcAddresses,field_name="postcode",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcAddresses,field_name="country",field_type="TEXT",field_length=2)
with arcpy.da.InsertCursor(arcAddresses,["id","freeform","locality","region","postcode","country"]) as iCursor:
    row = duckAddresses.fetchone()
    i = 1
    while row:
        if i % 10000 == 0:
            print('Inserted {} Addresses rows at {}'.format(str(i),getNow()))
        iCursor.insertRow(row)
        i += 1
        row = duckAddresses.fetchone()

 

Note the fancy "unnest" operator in the SQL.  This expands the array of address structs into new rows for each address, inheriting the id field for each expanded row.
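
If you want to see unnest in isolation, here is a self-contained toy query on made-up literal data (same connection) showing one input row with two address structs expanding to two output rows:

# Toy unnest example: one row with a two-element array of structs expands to
# two rows, each inheriting the id value.
sql = """with t as (select 'abc' as id,
                           [{'freeform': '1 Example St', 'postcode': 'AA1 1AA'},
                            {'freeform': '2 Example St', 'postcode': 'BB2 2BB'}] as addresses)
         select id, unnest(addresses, recursive := true) from t;"""
print(conn.sql(sql).fetchall())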

Once all expansion tables are extracted I make the relationship classes from Places points to each expansion table:

 

# Relate Places.id to the id field held in each expansion table
# (for a simple relationship, origin_foreign_key is the key field in the destination table)
for destination in ["Addresses","Brands","Websites","Sources","Socials","Phones"]:
    arcpy.management.CreateRelationshipClass(origin_table="Places",
                                             destination_table=destination,
                                             out_relationship_class="Places_{}".format(destination),
                                             relationship_type="SIMPLE",
                                             forward_label=destination,
                                             backward_label="Places",
                                             message_direction="NONE",
                                             cardinality="ONE_TO_MANY",
                                             attributed="NONE",
                                             origin_primary_key="id",
                                             origin_foreign_key="id",
                                             destination_primary_key="",
                                             destination_foreign_key="")

 

Then, as good practice, attach some metadata to each layer: feature counts, refresh time and license details:

 

from datetime import datetime
import pytz

# Stamp each output with a title, count, refresh time and the Overture data license
current_time_utc = datetime.now(pytz.utc)
pst = pytz.timezone('US/Pacific')
current_time_pst = current_time_utc.astimezone(pst)
current_time_pst_formatted = current_time_pst.strftime('%Y-%m-%d %H:%M:%S')
objDict = {obj:0 for obj in ["Places","Addresses","Brands","Phones","Socials","Sources","Websites"]}
for k in objDict.keys():
    objDict[k] = int(arcpy.management.GetCount(k).getOutput(0))
for obj in objDict.keys():
    objCount = objDict[obj]
    new_md = arcpy.metadata.Metadata()
    new_md.title = f"Overture Maps Foundation {obj} in London, release {release}."
    new_md.tags = f"{obj},London,Overture"
    new_md.summary = f"For mapping and analysis of {obj} in ArcGIS."
    new_md.description = f"{objCount} {obj} features in London, Great Britain."
    new_md.credits = f"Esri, {current_time_pst_formatted} Pacific."
    new_md.accessConstraints = "https://cdla.dev/permissive-2-0/"
    tgt_item_md = arcpy.metadata.Metadata(obj)
    if not tgt_item_md.isReadOnly:
        tgt_item_md.copy(new_md)
        tgt_item_md.save()
    else:
        print(f"{obj} metadata is read only.")

 

I'll let you inspect the rest of the DuckDBIntegration notebook code yourselves, but all that is left is to zip up the output geodatabase and use it to overwrite the feature service being maintained.  The most important point about the feature service is that it must have been created by sharing a file geodatabase item, not shared from Pro.  Making the initial file geodatabase item is just a matter of running the notebook down to the cell that does the zipping, then downloading the zip file manually and using it to create the item.
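
For reference, the overwrite step boils down to something like the sketch below, using the ArcGIS API for Python.  The item ID and zip path are placeholders you would replace with your own, and it assumes the notebook is signed in as the owner of a hosted feature layer originally published from a file geodatabase item:

# Minimal sketch of the overwrite step - the item ID and zip path are placeholders.
from arcgis.gis import GIS
from arcgis.features import FeatureLayerCollection

gis = GIS("home")  # the hosted notebook's signed-in connection
item = gis.content.get("YOUR_FEATURE_LAYER_ITEM_ID")
flc = FeatureLayerCollection.fromitem(item)
flc.manager.overwrite("/arcgis/home/Overture.gdb.zip")  # path to the zipped file geodatabase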

To go into production with this approach, first figure out a query on extent or division area fields that works for you, then plug it into the notebook.  You'll need to supply your own credentials of course, and change the target GIS if you're using Enterprise for the output feature service.  Then at each Overture release, change the release variable to match Overture and refresh your service.

To give you a feel for performance, the download notebook takes about 6 1/2 minutes to run - amazingly fast for the data scale.  I am, though, co-located with the S3 region being accessed.

Naturally, if the data schema changes or to support other Overture themes, you'll need to author notebook changes or new notebooks.  Do share!

I'm sure you'll agree that this is a powerful new paradigm that will change the industry.  I'm just following industry trends.  It will be fun to see if Parquet is what comes after the shapefile for data sharing.

The blog download has the notebook but not a file geodatabase (it's too big for the blog technology).  When you generate your own services, don't forget the data license here.  Have fun!
