More From The Duck Pond - Using DuckDB In ArcGIS Pro

ShareUser · ‎01-24-2024

In an earlier post I got my feet wet using DuckDB in a hosted notebook in ArcGIS Online. That post highlighted reading GeoParquet files in an object store to maintain a feature service. That's a powerful workflow but it is much more likely your daily work involves wrangling common file formats which take up way too much time to ingest - like CSV, Excel and JSON. This post is about making your life easier by showing how DuckDB enables SQL for any file format and also supports simple conversion to Esri formats. If you have basic SQL and Python skills - as in you can read this sample and the web help - you will be fine.

Here is some live action, Bloomington Indiana open 311 incidents migrated to a feature class in seconds.

Bloomington 311 IncidentsBloomington 311 Incidents

The notebook (in the blog download) reads a CSV file at a public URL and writes a feature class into the project default geodatabase. Notebooks are a good place to start compared to a script tool because you can work in a cell interactively as you figure out your SQL. Once things are working you might copy the code to a script tool for easy automation on a schedule.

But first you need to install DuckDB! The usual route to adding a Python package to your environment is to clone the arcgispro-py3 environment then add a package in the Package Manager UI backstage in Pro. You would look for a package named python-duckdb. This errored for me (it happens) so I did a manual install. I cloned my default environment to one I named arcgispro-py3-quack and installed DuckDB by opening the Python Command Prompt from the ArcGIS program group then running this command ('username' will be yours):

conda install -c conda-forge python-duckdb=1.0.0 -p "C:\Users\username\AppData\Local\ESRI\conda\envs\arcgispro-py3-quack" --yes

Make sure to get the latest DuckDB distribution, at writing it is 1.0.0.

Let's walk through each notebook cell. The first one is pretty simple, the imports and some calls to duckdb to set up the environment.

# Imports, environment

import arcpy
import duckdb
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import os
import requests
from urllib.parse import urlparse
conn = duckdb.connect()
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set s3_region='us-west-2';")
conn.sql("set enable_object_cache=true;")
arcpy.env.overwriteOutput = True

The second cell makes a pandas dataframe by reading the CSV data while automatically inferring data types - such a big help. You'll notice I code defensively for the situation the server I'm hitting doesn't support HTTP range request access like DuckDB wants - I download the data - which in this case was necessary.

# Read data into duckdb

url = r'https://data.bloomington.in.gov/resource/aw6y-t4ix.csv?$limit=200000'
filename = os.path.basename(urlparse(url).path)
base, extension = os.path.splitext(filename)
try:
    sql = "create or replace view bloomington311_view as select * from read_csv_auto('{}');".format(url)
    conn.sql(sql)
except:
    response = requests.get(url)
    with open(filename, 'w', encoding="utf-8") as f:
        f.write(response.text)
    f.close()
    sql = "create or replace view bloomington311_view as select * from read_csv_auto('{}');".format(filename)
    conn.sql(sql)
# Build the query you want in SQL to rename, cast or otherwise process the data.
# This dataset has a column in WKT already, if your data does not then you'll need
# to make one using the ST_ functions in DuckDB.
sql = """select service_request_id, requested_datetime, updated_datetime, closed_date, status_description, source,
service_name, description, agency_responsible, address, city, state, try_cast (zip as varchar(10)) as zip, sladays,
request_complete_days, sla_diff_days,
geocoded_column as SHAPE from bloomington311_view where SHAPE is not null;"""
df = conn.sql(sql).df()

Note the point about using or creating WKT values for your geometry. The spatial extension for DuckDB has a rich set of spatial functions.

Now convert the dataframe to a feature class. You have other conversion options. Note that the arcgis Python API has no idea where the dataframe came from - DuckDB will do fine! - so I spatially enable the dataframe and then write to a geodatabase feature class. No wrangling of attribute field properties, they just work (see the SQL above, the classic situation of casting ZIP codes to text came up).

# Write the feature class

df.spatial.set_geometry("SHAPE",sr=4326,inplace=True)
aprx = arcpy.mp.ArcGISProject("CURRENT")
gdb = aprx.defaultGeodatabase
out_fc = arcpy.ValidateTableName(base,gdb)
location = os.path.join(gdb,out_fc)
df.spatial.to_featureclass(location)

Lastly, announcement!

print("Created feature class {}".format(location))

That's it, fast, simple, smart migration of data with some help from DuckDB!

The notebook is in the blog download, let us know how you get on.