
More From The Duck Pond - Using DuckDB In ArcGIS Pro

BruceHarold
Esri Frequent Contributor
01-23-2024 01:34 PM

In an earlier post I got my feet wet using DuckDB in a hosted notebook in ArcGIS Online.  That post highlighted reading GeoParquet files in an object store to maintain a feature service.  That's a powerful workflow, but it is much more likely your daily work involves wrangling common file formats that take way too much time to ingest - like CSV, Excel and JSON.  This post is about making your life easier by showing how DuckDB lets you run SQL directly against these file formats and also supports simple conversion to Esri formats.  If you have basic SQL and Python skills - as in you can read this sample and the web help - you will be fine.
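To give a flavor of the idea before we dive in, here is a minimal sketch of querying CSV and JSON files directly with SQL - the file names are hypothetical:

import duckdb

conn = duckdb.connect()
# Query a CSV file directly - no import step, the schema is inferred
conn.sql("select * from read_csv_auto('incidents.csv') limit 5").show()
# JSON works the same way via the json extension
conn.sql("select count(*) from read_json_auto('incidents.json')").show()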

Here is some live action: Bloomington, Indiana open 311 incidents migrated to a feature class in seconds.

Bloomington 311 Incidents

 

The notebook (in the blog download) reads a CSV file at a public URL and writes a feature class into the project default geodatabase.  Notebooks are a good place to start compared to a script tool because you can work interactively in a cell as you figure out your SQL.  Once things are working you might copy the code to a script tool for easy automation on a schedule, along the lines of the sketch below.
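A script tool wrapper can be as thin as this - a minimal sketch, with a hypothetical parameter order:

import arcpy

if __name__ == "__main__":
    # Tool parameters: source URL in, output feature class out
    url = arcpy.GetParameterAsText(0)
    out_fc = arcpy.GetParameterAsText(1)
    # ... the notebook code shown below goes here ...
    arcpy.AddMessage("Created feature class {}".format(out_fc))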

But first you need to install DuckDB!  The usual route to adding a Python package to your environment is to clone the arcgispro-py3 environment and then add a package in the Package Manager UI in the Pro backstage.  You would look for a package named python-duckdb.  This errored for me (it happens), so I did a manual install.  I cloned my default environment to one I named arcgispro-py3-quack and installed DuckDB by opening the Python Command Prompt from the ArcGIS program group, then running this command ('username' will be yours):

 

conda install -c conda-forge python-duckdb=1.0.0 -p "C:\Users\username\AppData\Local\ESRI\conda\envs\arcgispro-py3-quack" --yes

 

Make sure to get the latest DuckDB distribution; at the time of writing it is 1.0.0.
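A quick way to confirm the package landed in your new environment is to check the version from the Python Command Prompt:

import duckdb
print(duckdb.__version__)  # e.g. 1.0.0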

Let's walk through each notebook cell.  The first one is pretty simple: the imports and some calls to duckdb to set up the environment.

 

# Imports, environment

import arcpy
import duckdb
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import os
import requests
from urllib.parse import urlparse
conn = duckdb.connect()
conn.sql("install spatial;load spatial;")  # spatial types and ST_ functions
conn.sql("install httpfs;load httpfs;")  # read files over HTTP and S3
conn.sql("set s3_region='us-west-2';")
conn.sql("set enable_object_cache=true;")
arcpy.env.overwriteOutput = True

 

The second cell makes a pandas dataframe by reading the CSV data while automatically inferring data types - such a big help.  You'll notice I code defensively for the situation where the server I'm hitting doesn't support the HTTP range request access DuckDB wants: if the remote read fails I download the data, which in this case was necessary.

 

# Read data into duckdb

url = r'https://data.bloomington.in.gov/resource/aw6y-t4ix.csv?$limit=200000'
filename = os.path.basename(urlparse(url).path)
base, extension = os.path.splitext(filename)
try:
    # Read the CSV straight from the URL (needs HTTP range request support)
    sql = "create or replace view bloomington311_view as select * from read_csv_auto('{}');".format(url)
    conn.sql(sql)
except Exception:
    # Fall back to downloading the file locally
    response = requests.get(url)
    with open(filename, 'w', encoding="utf-8") as f:
        f.write(response.text)
    sql = "create or replace view bloomington311_view as select * from read_csv_auto('{}');".format(filename)
    conn.sql(sql)
# Build the query you want in SQL to rename, cast or otherwise process the data.
# This dataset has a column in WKT already, if your data does not then you'll need
# to make one using the ST_ functions in DuckDB.
sql = """select service_request_id, requested_datetime, updated_datetime, closed_date, status_description, source,
service_name, description, agency_responsible, address, city, state, try_cast(zip as varchar(10)) as zip, sladays,
request_complete_days, sla_diff_days,
geocoded_column as SHAPE from bloomington311_view where SHAPE is not null;"""
df = conn.sql(sql).df()

 

Note the point about using or creating WKT values for your geometry.  The spatial extension for DuckDB has a rich set of spatial functions.
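For example, if your source has plain longitude and latitude columns instead of WKT, you can build the geometry in the query - a minimal sketch, assuming hypothetical lon and lat columns:

# Build a WKT SHAPE column from plain lon/lat columns (names are hypothetical)
sql = """select *, ST_AsText(ST_Point(lon, lat)) as SHAPE
from bloomington311_view
where lon is not null and lat is not null;"""
df = conn.sql(sql).df()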

Now convert the dataframe to a feature class.  You have other conversion options (see the sketch after the code below).  Note that the arcgis Python API has no idea where the dataframe came from - DuckDB will do fine! - so I spatially enable the dataframe and then write it to a geodatabase feature class.  There is no wrangling of attribute field properties; they just work (see the SQL above, where the classic situation of casting ZIP codes to text came up).

 

# Write the feature class

df.spatial.set_geometry("SHAPE", sr=4326, inplace=True)  # spatially enable the dataframe
aprx = arcpy.mp.ArcGISProject("CURRENT")
gdb = aprx.defaultGeodatabase
out_fc = arcpy.ValidateTableName(base, gdb)  # derive a legal feature class name
location = os.path.join(gdb, out_fc)
df.spatial.to_featureclass(location)
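
As an aside on the other conversion options: the spatial extension can also write GDAL-supported formats directly from DuckDB, so a sketch of skipping pandas entirely (the output path is hypothetical) looks like this:

# Export straight from DuckDB via GDAL - no dataframe needed
conn.sql("""copy (
    select service_request_id, ST_GeomFromText(geocoded_column) as geom
    from bloomington311_view where geocoded_column is not null
) to 'bloomington311.geojson' with (format gdal, driver 'GeoJSON');""")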

 

Lastly, announce the result!

 

print("Created feature class {}".format(location))

 

That's it: fast, simple, smart migration of data with some help from DuckDB!

The notebook is in the blog download; let us know how you get on.

2 Comments
MatejVrtich
Esri Contributor

Hi @BruceHarold,

Thank you for all the great articles you've written about ETL; they've been really inspiring.

This particular one encouraged me to dive into the duck, especially with the recent addition of DuckDB support in the default Python environments in both ArcGIS Pro and ArcGIS Server.

I wanted to share some of my own experiments, so I created this little "Quack" Python script tool just for fun (pasted below).  It uses a DuckDB to Arrow to ArcGIS pipeline, as I wanted to investigate the Arrow support in ArcGIS as well.

The idea behind the tool is to be shared as a Web Tool in ArcGIS Enterprise and to offer an open way to load data into an app as an in-memory feature layer or table, with an option to store the result as a hosted feature layer as well (handled by the Web Tool). Both Map Viewer and Experience Builder can work directly with in-memory data, which provides a great UX for using the Web Tool itself.

So, have fun — and please keep publishing 🙂

Regards,
Matej

I'm pasting it here as I do not see the attach file button, sorry 🙂

import duckdb
import arcpy
import pyarrow as pa

def add_esri_metadata(arrow_table: pa.Table, wkid: int = 4326) -> pa.Table:
    """Attach ArcGIS-compatible geometry metadata."""
    sr = arcpy.SpatialReference(wkid)
    wkt = sr.exportToString()

    new_fields = []
    for field in arrow_table.schema:
        metadata = {
            k if isinstance(k, bytes) else k.encode("utf-8"): (
                v if isinstance(v, bytes) else v.encode("utf-8")
            )
            for k, v in (field.metadata or {}).items()
        }

        if field.name == "geometry":
            metadata[b"esri.encoding"] = (
                b"WKB"  # TODO: add support for more encodings
            )
            metadata[b"esri.sr_wkt"] = wkt.encode("utf-8")

        if field.name == "OBJECTID":
            metadata[b"esri.oid"] = b"esri.int64"

        new_fields.append(pa.field(field.name, field.type, metadata=metadata))

    return pa.Table.from_arrays(
        arrow_table.columns,
        schema=pa.schema(new_fields, metadata=arrow_table.schema.metadata),
    )


def quack(sql, result="memory/quack"):

    con = duckdb.connect(":memory:")

    for ext in ("httpfs", "spatial"):
        con.execute(f"INSTALL {ext};")
        con.execute(f"LOAD {ext};")

    # Execute the query and get an Arrow table
    arr = con.execute(sql).arrow()

    # Check if any geometry column exists
    has_geometry = False
    geometry_names = ["geometry", "geom", "shape", "wkb_geometry", "the_geom"]

    # Check column names (case-insensitive)
    column_names = [col.lower() for col in arr.column_names]

    # Check for a geometry column by name
    for geom_name in geometry_names:
        if geom_name in column_names:
            has_geometry = True
            break

    # Also check schema for spatial types (DuckDB spatial extension)
    if not has_geometry:
        for field in arr.schema:
            field_type_str = str(field.type).lower()
            if any(
                spatial_type in field_type_str
                for spatial_type in [
                    "geometry",
                    "point",
                    "linestring",
                    "polygon",
                    "multipoint",
                    "multilinestring",
                    "multipolygon",
                    "geometrycollection",
                ]
            ):
                has_geometry = True
                break

    if arcpy.Exists(result):
        arcpy.AddMessage(f"Output {result} already exists - using Delete")
        arcpy.management.Delete(result)

    # Use appropriate tool based on geometry presence
    if has_geometry:
        arcpy.AddMessage("Geometry detected - using CopyFeatures")
        arcpy.management.CopyFeatures(add_esri_metadata(arr), result)
    else:
        arcpy.AddMessage("No geometry detected - using CopyRows")
        arcpy.management.CopyRows(arr, result)

    con.close()
    return result


if __name__ == "__main__":

    sql = arcpy.GetParameterAsText(0)
    result = arcpy.GetParameterAsText(1)

    result = quack(sql, result)

    arcpy.SetParameterAsText(1, result)

 

Edit: I should warn you against exposing the SQL parameter this way (without restrictions) in a Web Tool/GP Service when sharing to ArcGIS Enterprise, as it gives a lot of power if passed directly into the execute function. You can even list files from the file system inside the ArcSOC.exe running the tool with SELECT * FROM glob('*'); or even return the contents of files themselves with SELECT string_agg(content, '\n') AS file_content FROM read_text(filename);
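One rough mitigation, sketched below, is to vet the statement before executing it. This is a blunt pattern check, not a complete defense, so treat it as a starting point only:

import re

def check_sql(sql):
    """Crude pre-flight check: allow a single SELECT only and block
    the file system table functions. Not a complete defense!"""
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        raise ValueError("Only a single statement is allowed")
    if not re.match(r"(?is)^\s*select\b", statement):
        raise ValueError("Only SELECT statements are allowed")
    if re.search(r"(?i)\b(glob|read_text|read_blob)\s*\(", statement):
        raise ValueError("File system functions are not allowed")
    return statement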

BruceHarold
Esri Frequent Contributor

Thanks Matej!  This is really useful!

And I will keep publishing - in fact just finishing rewriting two posts now...
