
ETL Pattern: Working With Local Well-Known Files

08-05-2024 09:59 AM
BruceHarold
Esri Regular Contributor

At the simple end of the ETL Patterns continuum is the most common format of them all: CSV, or comma-separated values, and its close cousin, the Excel spreadsheet.  Vast troves of data move around in these formats because they are easily shared and human readable, so what's not to like in ArcGIS?  Well, these things:

  • No proper schema built into the format
  • Column names often not valid in ArcGIS
  • Text fields with numeric values are treated as numeric
  • Text field widths are assumed to be 8000 (CSV) or 255 (Excel) bytes
  • Integer fields may be treated as double precision float
  • Null values may be encoded as zero or another placeholder value
  • Geometry is often encoded in an unsupported format for feature class creation

Take this popular dataset from data.gov, mirrored from Washington's open data portal:

First, the data as a map:

Washington EV Population

Now the source CSV file:

CSV or Excel Data Issues

In red I'm highlighting some of the usual suspects: column names have spaces or metacharacters, USPS postal codes can have leading zeros so should be text, zeroes have been used to encode nulls in the Electric Range and Base MSRP columns, Vehicle Location is in WKT format, and 2020 Census Tract has big integers and a name beginning with a number.

The subject dataset has velocity (it is regularly updated), so if you're interested in using it, or any of the thousands like it, you'll want to automate handling these issues with geoprocessing.

First, some research!  In the blog download is a script tool I used to scan the CSV file for maximum data widths; here is the code:

 

import arcpy

# Input table (for example, a CSV file) supplied as the first tool parameter
tbl = arcpy.GetParameterAsText(0)
d = arcpy.da.Describe(tbl)

# Names of all text fields in the input
flds = [f.name for f in d['fields'] if f.type == 'String']

# Track the maximum data width seen in each text field
mDict = {f: 0 for f in flds}
with arcpy.da.SearchCursor(tbl, flds) as cursor:
    for row in cursor:
        for i, f in enumerate(flds):
            if row[i]:
                mDict[f] = max(mDict[f], len(row[i]))

for k in mDict:
    arcpy.AddMessage(f"Input table '{tbl}' field '{k}' has maximum data width '{mDict[k]}'")

 

This yields:

Maximum Text Field Widths

Ignoring the Postal Code column, which local knowledge tells me is five characters wide, I can now see the width required for each text field.

I'm now in a position to design a schema that correctly handles the subject file.  I chose ModelBuilder to persist my processing because it can wrap the core geoprocessing and Python functions I plan to use.  The finished model is shown below, and we'll walk through it.

EVPopulation Model

The first order of business is to impose a designed schema.  This is done with the field map control in Export Table, which essentially copies the data into an in-memory table with the schema you want.  The Field Map control lets you rename, cast, and even construct fields very flexibly.

In this case there is only one source field for each output field, so the Action is always First, and only the field properties are manipulated.

Field Map
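If you'd rather script this step than click it out, here is a minimal sketch of the same idea in Python: build a field map and hand it to Export Table.  The file path and column names are placeholders for illustration (ArcGIS may report the source CSV field names differently), and only the Postal Code field is shown.

import arcpy

# Hypothetical path - substitute your own copy of the CSV
csv_path = r"C:\Data\Electric_Vehicle_Population_Data.csv"

# Build a field map that renames and casts one source column
fms = arcpy.FieldMappings()
fm = arcpy.FieldMap()
fm.addInputField(csv_path, "Postal Code")  # source column name as reported for the CSV

# Redefine the output field: valid name, text type, width 5
out_fld = fm.outputField
out_fld.name = "Postal_Code"
out_fld.aliasName = "Postal Code"
out_fld.type = "String"
out_fld.length = 5
fm.outputField = out_fld
fms.addFieldMap(fm)

# Repeat for the remaining columns, then copy the CSV into an in-memory table
arcpy.conversion.ExportTable(csv_path, r"memory\ev_population", field_mapping=fms)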

Downstream of Export Table, the output feature class is created using the in-memory table as a template, and rows (still with no geometry) are appended to it.  Geometry is then created from the WKT column, and the two fields with zero-encoded nulls are fixed with Calculate Field tools (note the conditional processing).  Lastly, the Vehicle Location field is dropped as surplus to requirements.
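If you want the equivalent of those two fixes outside ModelBuilder, here is a minimal sketch in Python.  The feature class path and field names (Vehicle_Location, Electric_Range, Base_MSRP) are assumptions standing in for your own schema, and the WKT coordinates are assumed to be WGS84.

import arcpy

fc = r"C:\Data\EV.gdb\EVPopulation"  # hypothetical output feature class

# Build point geometry from the WKT column, e.g. "POINT (-122.3 47.6)"
sr = arcpy.SpatialReference(4326)
with arcpy.da.UpdateCursor(fc, ["Vehicle_Location", "SHAPE@"]) as cursor:
    for wkt, shape in cursor:
        if wkt:
            cursor.updateRow([wkt, arcpy.FromWKT(wkt, sr)])

# Replace zero-encoded values with real nulls
with arcpy.da.UpdateCursor(fc, ["Electric_Range", "Base_MSRP"]) as cursor:
    for rng, msrp in cursor:
        cursor.updateRow([None if rng == 0 else rng, None if msrp == 0 else msrp])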

Are we done?  No!  Best practice is full automation, and this data is coming from the web.

The URL used for file download isn't valid as a geoprocessing Text File input like the model wants.  There is also a risk that a user might supply an input CSV file with a name other than Electric_Vehicle_Population_Data.csv, which would trigger the field map in Export Table to revert to default field handling, undoing all our schema design work.

To handle both web data input and consistent file name input the model EVPopulation is wrapped in a parent model, URL2EVPopulation:

URL2EVPopulation

Web Download

First, a Calculate Value model tool downloads the current data to a consistently named file of Text File parameter type; then the EVPopulation model is called to process the data.  See how a very simple Python snippet takes care of automation? 😉
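The snippet needs to do no more than fetch the file to a fixed name.  A minimal sketch along those lines, with a placeholder URL standing in for the real download link, looks like this:

import os
import urllib.request
import arcpy

# Placeholder URL - substitute the actual download link for the dataset
url = "https://example.com/Electric_Vehicle_Population_Data.csv"

# Always write to the same file name so the field map in Export Table stays valid
out_csv = os.path.join(arcpy.env.scratchFolder, "Electric_Vehicle_Population_Data.csv")
urllib.request.urlretrieve(url, out_csv)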

Now we have automated CSV file handling!  I stretched the concept of "local well-known files" a little by reaching out to the web for data, but these days files across the internet are local.  I can schedule or manually run URL2EVPopulation any time I like.

The models and script tool are in the blog download.

 
