
ETL Pattern: Working With Cloud Well-Known Files

BruceHarold, Esri Regular Contributor
08-20-2024

For a while now, file-based data has been migrating to the web, and we're all used to manually downloading what we need and taking it from there in desktop software like ArcGIS Pro.

This blog is about automating that experience, replacing the download and processing steps with easy-to-understand, well-defined and shareable tooling that delivers your information product on demand.

At two ends of the data-scale spectrum I have worldwide political divisions and building permit activity in Vancouver, Canada.  The former dataset is big and sourced from an AWS S3 bucket; the latter is small and sourced from an open data portal.  Both are ingested using notebooks in ArcGIS Pro with the same basic tooling.

The small dataset first:

Vancouver building permits

The big dataset:

Worldwide political divisions

And in the middle ground, the political divisions for just Germany:

A view of political divisions for Germany

All these datasets were ingested similarly but with noteworthy differences we'll discuss below.

If you're not familiar with GeoParquet you will be, and I'm making it my well-known format of choice for this post.  Other common cloud file types like CSV, Excel and JSON would work too.  By "cloud" I mean not just where you go to get them but formats that lend themselves to remote query without your having to download and inspect a local copy.

While I say that the same basic tooling can reach the big data on S3 and the small data on a website, the capabilities of the respective servers matter.  S3 knows how to deliver files according to queries against Hive-partitioned storage, whereas a website just knows how to deliver downloads of whole files.
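
To make "remote query" concrete, here is a minimal sketch (my own, not part of the blog download) that asks DuckDB to describe a cloud-hosted Parquet file and count its rows without pulling the whole file down.  The URL is a hypothetical placeholder, and the metadata-only reads assume the server supports HTTP range requests.

import duckdb

conn = duckdb.connect()
conn.sql("install httpfs;load httpfs;")   # remote (HTTP/S3) file access

url = "https://example.com/data/some_dataset.parquet"  # hypothetical placeholder
# DESCRIBE touches only the Parquet footer metadata, not the row data
print(conn.sql(f"describe select * from read_parquet('{url}');"))
# Filters and column selections are pushed down, so only the needed
# row groups and columns are fetched over the wire
print(conn.sql(f"select count(*) from read_parquet('{url}');"))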

Let's tackle the smaller dataset first: the building permits, ingested by automated download from a website.

The data resides in an open data portal; if you browse the site you'll see download format options.

The download is automated in the notebook VancouverBuildingPermits.  If you inspect the code you'll note I'm using DuckDB to help out; it is an in-process SQL database with extensions for spatial data and for reading files over the web.  You'll see the download and conversion to a feature class is very simple - the tooling infers the schema from the Parquet source and the SQL statements.

 

# Download Vancouver's issued building permits (2024) as Parquet and load them
# into the project default geodatabase
import arcpy
from arcgis.features import GeoAccessor, GeoSeriesAccessor  # enables df.spatial
import duckdb
import math  # used by the field-width rounding step in the downloaded notebook
import os
arcpy.env.overwriteOutput = True
# Open data portal Parquet export endpoint, filtered to permits issued in 2024
url = r"https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/issued-building-permits/exports/parquet?lang=en&refine=issuedate%3A%222024%22&timezone=America%2FLos_Angeles"
conn = duckdb.connect()
conn.sql("set force_download=true;")       # fetch the whole file rather than issuing range requests
conn.sql("install httpfs;load httpfs;")    # remote (HTTP/S3) file access
conn.sql("install spatial;load spatial;")  # spatial types and functions
# Expose the remote Parquet export as a temporary view
conn.sql(f"create temp view permitsView as select * from read_parquet('{url}');")
bldgPermits = conn.sql("""select permitnumber as Permit_Number,
                                 try_cast (permitnumbercreateddate as date) as Permit_Number_Created_Date,
                                 try_cast (issuedate as date) as Issue_Date,
                                 permitelapseddays::int as Permit_Elapsed_Days,
                                 projectvalue::float as Project_Value,
                                 typeofwork as Type_Of_Work,
                                 address as Address,
                                 projectdescription as Project_Description,
                                 permitcategory as Permit_Category,
                                 applicant as Applicant,
                                 applicantaddress as Applicant_Address,
                                 propertyuse as Property_Use,
                                 specificusecategory as Specific_Use_Category,
                                 buildingcontractor as Building_Contractor,
                                 buildingcontractoraddress as Building_Contractor_Address,
                                 issueyear as Issue_Year,
                                 geolocalarea as Geo_Loc_Area,
                                 yearmonth as Year_Month,
                                 ST_AsText(ST_GeomFromWKB(geom)) as SHAPE
                                 from permitsView;""")
# Convert the result to a spatially enabled dataframe, then write it to the
# project default geodatabase
df = bldgPermits.df()
df.spatial.set_geometry("SHAPE",sr=4326,inplace=True)
aprx = arcpy.mp.ArcGISProject("CURRENT")
gdb = aprx.defaultGeodatabase
out_fc = "Issued_Building_Permits"
location = os.path.join(gdb,out_fc)
df.spatial.to_featureclass(location,sanitize_columns=False)

 

If you inspect the code in the download you'll see some extra steps I used to round text field widths up to multiples of 10 - an unnecessary flourish in this case, but it shows how to do this if you expect the data to be edited in future and you don't want truncation of values.
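
If you just want the gist of that step, a sketch along these lines (my paraphrase, not the notebook's exact code, using a stand-in dataframe and an illustrative column list) pads each text column's maximum observed length up to the next multiple of 10:

import math
import pandas as pd

# Illustrative stand-in for the permits dataframe built above
df = pd.DataFrame({"Address": ["123 Main St", "4567 Commercial Drive"],
                   "Project_Description": ["New deck", "Seismic upgrade"]})

text_fields = ["Address", "Project_Description"]
field_lengths = {}
for field in text_fields:
    max_len = int(df[field].astype(str).str.len().max())
    # round up to the next multiple of 10 so future edits aren't truncated
    field_lengths[field] = max(10, math.ceil(max_len / 10) * 10)
print(field_lengths)   # {'Address': 30, 'Project_Description': 20}

The resulting lengths can then drive explicit AddField calls instead of letting to_featureclass infer widths from the data.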

So that is a simple case of ingesting a small cloud-resident file into ArcGIS with minimal control over the schema - note again standard SQL statements are available to define your view of the cloud-sourced file.  Compare this approach to defining the schema using geoprocessing in this related post.

Let's step up to global scale data using the Division_Area notebook, using data from Overture Maps Foundation.

 

# Get the Division_Area layer from Overture Maps Foundation
import arcpy
from datetime import datetime
import duckdb
import os
aprx = arcpy.mp.ArcGISProject("CURRENT")
homeFolder = aprx.homeFolder
gdb = aprx.defaultGeodatabase
release = "2024-07-22.0"
def getNow():
    return str(datetime.utcnow().replace(microsecond=0))
arcpy.env.overwriteOutput = True
sR = arcpy.SpatialReference(4326)
conn = duckdb.connect()
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set s3_region='us-west-2';")
conn.sql("set enable_object_cache=true;")

# Make the DuckDB relation
# Insert any desired where clause, for example "country = 'GB'"
# The order-by expression helps to display small divisions on top of large ones, but is expensive
print(f"Processing starting at {getNow()}")
whereExp = "1 = 1"
sql = f"""select id,
                 bbox.xmin as xmin,
                 bbox.ymin as ymin,
                 bbox.xmax as xmax,
                 bbox.ymax as ymax,
                 version,
                 subtype,
                 names.primary as primary_name,
                 class,
                 region,
                 country,
                 norms.driving_side as driving_side,
                 geometry
                 from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=divisions/type=division_area/*',filename=true, hive_partitioning=1)
                 where {whereExp}
                 order by ST_Area(ST_GeomFromWKB(geometry)) desc;"""
duckDivisions = conn.sql(sql)

# Make the output feature class with precise schema control
print(f"Creating output feature class at {getNow()}")
arcDivisions = arcpy.management.CreateFeatureclass(out_path=gdb,
                                                   out_name="Division_Areas",
                                                   geometry_type="POLYGON",
                                                   spatial_reference=sR).getOutput(0)
arcpy.management.AddField(in_table=arcDivisions,field_name="id",field_type="TEXT",field_length=32)
for f in ["xmin","ymin","xmax","ymax"]:
    arcpy.management.AddField(in_table=arcDivisions,field_name=f,field_type="FLOAT")
arcpy.management.AddField(in_table=arcDivisions,field_name="version",field_type="SHORT")
arcpy.management.AddField(in_table=arcDivisions,field_name="subtype",field_type="TEXT",field_length=20)
arcpy.management.AddField(in_table=arcDivisions,field_name="primary_name",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcDivisions,field_name="class",field_type="TEXT",field_length=6)
arcpy.management.AddField(in_table=arcDivisions,field_name="region",field_type="TEXT",field_length=10)
arcpy.management.AddField(in_table=arcDivisions,field_name="country",field_type="TEXT",field_length=2)
arcpy.management.AddField(in_table=arcDivisions,field_name="driving_side",field_type="TEXT",field_length=10)

# Write the output
print(f"Writing output at {getNow()}")
with arcpy.da.InsertCursor(arcDivisions,["id","xmin","ymin","xmax","ymax","version",
                                         "subtype","primary_name","class","region",
                                         "country","driving_side","shape@"]) as iCursor:
    row = duckDivisions.fetchone()
    i = 1
    while row:
        if i % 100000 == 0:
             print(f"Inserted {i} Division_Areas rows...")
        row = list(row)
        row[-1] = arcpy.FromWKB(row[-1])
        iCursor.insertRow(row)
        i+=1
        row = duckDivisions.fetchone()
del iCursor

# Repair geometries, drop null geometries
print(f"Fixing any bad geometries at {getNow()}")
arcpy.management.RepairGeometry(
    in_features=arcDivisions,
    delete_null="DELETE_NULL",
    validation_method="OGC")

print(f"Finished at {getNow()}")

 

The principal differences between this global-scale ingest and the building permits ingest are that the source is obtained by a wildcard query against Hive-partitioned storage on S3, and the schema is defined manually.  The tooling is the same.
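
Incidentally, the mid-scale Germany map shown earlier only needs a different where clause.  A one-line change like this (assuming, as the notebook's own 'GB' example suggests, that the country column holds ISO 3166-1 alpha-2 codes) lets DuckDB's predicate pushdown skip Parquet row groups that can't match:

# Restrict the global Division_Area read to Germany before running the query above
whereExp = "country = 'DE'"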

So that is a look at user-focused ingest of data at large and small scale.  But what if you're a data owner who wants to offer flexible, on-demand delivery of cloud well-known file data to a wide catchment of users, without their having to build tooling of their own?  This is where the mid-scale example of political divisions for Germany comes in.

The client library we're using - DuckDB - is also a storage format, and one with spatial database capabilities, including views.  For cloud-sourced data it is therefore straightforward to offer a data product that contains only a view definition, which any end user reads on demand.  This means that for the effort of figuring out a view definition, a data product of any scale can be delivered in a tiny database file.

The blog download contains such a data product - mydivisionview.duckdb - which was created by the notebook MakeViewDatabase and can be ingested by anyone using the notebook ReadViewDatabase.  In this case a file of only 268KB delivers data at national (or even global) scale.  This is the power of cloud-native well-known files read on demand from the cloud: if the source dataset changes, the same DuckDB database will always access the latest data.
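
MakeViewDatabase isn't reproduced in this post, but its core is a single CREATE VIEW statement persisted into a DuckDB file.  A minimal sketch of that idea (my reconstruction, not the notebook's exact code; the release and the Germany filter are illustrative assumptions) looks like this:

# Sketch: build a shareable DuckDB file whose only content is a view over
# cloud-hosted GeoParquet
import duckdb

release = "2024-07-22.0"
conn = duckdb.connect("MyDivisionView.duckdb")  # persisted database file
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set s3_region='us-west-2';")
conn.sql(f"""create or replace view MyDivisionView as
             select id,
                    bbox.xmin as xmin, bbox.ymin as ymin,
                    bbox.xmax as xmax, bbox.ymax as ymax,
                    version, subtype,
                    names.primary as primary_name,
                    class, region, country,
                    norms.driving_side as driving_side,
                    geometry
             from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=divisions/type=division_area/*',filename=true, hive_partitioning=1)
             where country = 'DE';""")
conn.close()

Anyone opening the file only needs the httpfs and spatial extensions loaded, which the ReadViewDatabase notebook below takes care of.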

Below is the code for ReadViewDatabase.  If you copy mydivisionview.duckdb into your project home folder it will deliver the view contents into your project default geodatabase in seconds (15 on my machine):

# Read the MyDivisionView view from the shared DuckDB file and write its
# contents to the project default geodatabase
import arcpy
from datetime import datetime
import duckdb
import os
def getNow():
    return str(datetime.utcnow().replace(microsecond=0))
print(f"Extracting data at {getNow()}")
arcpy.env.overwriteOutput = True
aprx = arcpy.mp.ArcGISProject("CURRENT")
homeFolder = aprx.homeFolder
gdb = aprx.defaultGeodatabase
sR = arcpy.SpatialReference(4326)
duckDB = os.path.join(homeFolder,"MyDivisionView.duckdb")
conn = duckdb.connect(duckDB)
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set enable_object_cache=true;")
duckDivisions = conn.sql(f"""select id,
                                xmin::float as xmin,
                                ymin::float as ymin,
                                xmax::float as xmax,
                                ymax::float as ymax,
                                version::short as version,
                                subtype,
                                primary_name,
                                class,
                                region,
                                country,
                                driving_side,
                                geometry
                                from MyDivisionView;""")
arcDivisions = arcpy.management.CreateFeatureclass(gdb,
                                                   out_name="My_Division_Areas",
                                                   geometry_type="POLYGON",
                                                   spatial_reference=sR).getOutput(0)
arcpy.management.AddField(in_table=arcDivisions,field_name="id",field_type="TEXT",field_length=32)
for f in ["xmin","ymin","xmax","ymax"]:
    arcpy.management.AddField(in_table=arcDivisions,field_name=f,field_type="FLOAT")
arcpy.management.AddField(in_table=arcDivisions,field_name="version",field_type="SHORT")
arcpy.management.AddField(in_table=arcDivisions,field_name="subtype",field_type="TEXT",field_length=20)
arcpy.management.AddField(in_table=arcDivisions,field_name="primary_name",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcDivisions,field_name="class",field_type="TEXT",field_length=6)
arcpy.management.AddField(in_table=arcDivisions,field_name="region",field_type="TEXT",field_length=10)
arcpy.management.AddField(in_table=arcDivisions,field_name="country",field_type="TEXT",field_length=2)
arcpy.management.AddField(in_table=arcDivisions,field_name="driving_side",field_type="TEXT",field_length=10)
with arcpy.da.InsertCursor(arcDivisions,["id","xmin","ymin","xmax","ymax","version","subtype",
                                         "primary_name","class","region","country",
                                         "driving_side","shape@"]) as iCursor:
    row = duckDivisions.fetchone()
    i = 1
    while row:
        #if i % 1000 == 0:
        #     print('Inserted {} My_Division_Area rows'.format(i))
        row = list(row)
        row[-1] = arcpy.FromWKB(row[-1])
        iCursor.insertRow(row)
        i+=1
        row = duckDivisions.fetchone()
del iCursor
conn.close()
print(f"Data extracted at {getNow()}")

I'll finish with a few summary points:

  • Small cloud well-known files hosted on a website are easily ingested.
  • Cloud well-known files of unlimited scale in object storage are easily ingested.
  • Cloud well-known file data of any scale, behind any view definition, can be shared in a tiny database container.
