
Building a Data Driven Organization, Part #8: Automate making CSV data beautiful

11-17-2021 06:20 AM
BruceHarold
Esri Regular Contributor

This post is in two parts: first, handling common problems with CSV data headed into your GIS; second, some useful ways to share messages about data changes.  In both cases we will automate processing with ArcGIS Data Interoperability in ArcGIS Pro 2.9, with a nod to ArcPy for a helping hand (but you don't have to code).

You will learn how to send a message card to a Microsoft Teams channel so your whole organization can be notified when data events happen:

Teams channel message card

 

There are oceans of CSV files out there.  CSV is the 'lowest common denominator' of portable tabular 'database' (possibly excepting column-aligned text): many apps can ingest it, it is easy to host, send and retrieve, and if you're desperate you can actually read it manually.  This has made CSV a highly successful distribution format, including for data that changes frequently.  My sample data is daily 311 requests.  Here I am reviewing the data:

Every time I tackle CSV the same issue arises - what's the schema?  Rarely do curators publish the definition of each column or provide a schema.ini file, so the schema needs to be discovered if you want a quality outcome.  Many subsidiary issues arise.  Is what looks like an integer column always an integer in all rows?  Can I reliably convert date-looking columns into actual datetime data?  CSV has no null value - an empty string "" might mean null, or might need to be an actual empty string.  What field width do I need for each string field?  What numeric types will work for the numbers in the data?  It goes on.  I'll give you a pattern here to make schema discovery and data definition easy.
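To make those questions concrete, here is a minimal sketch (plain Python, with a hypothetical file name) of the kind of full scan that answers two of them - whether a column is always an integer, and the widest string it holds:

import csv

def scan_csv_schema(path):
    """Scan every row and report, per column, whether all values
    parse as integers and the maximum string width observed."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        stats = {name: {"always_int": True, "max_width": 0}
                 for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                value = value or ""  # guard against short rows
                s = stats[name]
                s["max_width"] = max(s["max_width"], len(value))
                try:
                    int(value)
                except ValueError:
                    # "" and "nan" both land here - the reason an
                    # integer-looking column ends up typed as text
                    s["always_int"] = False
        return stats

for field, s in scan_csv_schema("get_it_done_311.csv").items():  # hypothetical file name
    print(field, s)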

You see I started with a clerical examination in Excel.  It is OK for browsing, but I would not convert CSV via XLSX and the geoprocessing tool Excel to Table 'hands free', because no data issues get found or fixed.  Similarly, Table to Table is robust but does things like create text fields with 8000-byte widths.  Apart from loose field definitions there are going to be data errors you should repair, and nothing beats a combination of automated and interactive inspection for that.

So what did browsing the data in Excel tell us?  Luckily the web page where the CSV is downloaded from has a data dictionary that describes most fields' purposes, but not their definitions.  However, with a little clerical effort we can make these observations:

CSV Field Notes:

service_request_id: Unique identifier, looks to be integer
service_request_parent_id: Self join to service_request_id if it's a duplicate request; I see "nan" where nulls should be
sap_notification_number: If the 311 request is in the city's SAP system this is the identifier, looks like a 64-bit integer
date_requested: Looks like ISO format, great!
case_age_days: How long it has taken to 'get it done' in whole days, so should be small integer
case_record_type: Team that is responsible at the city, text
service_name: Classification of the 311 service type, text
date_closed: Looks like ISO format again
status: Current request state - might update during the request lifespan, text
lat: Latitude (but I see some are missing), double
lng: Longitude (missing if latitude is missing), double
street_address: Civic address of the 311 request, text
zipcode: ZIP code, may be ZIP+4, should be text but I see some decimal values!
council_district: District, encoded as an integer
comm_plan_code: Community, encoded as an integer
comm_plan_name: Community, as text
park_name: Park or open space name if applicable, text
case_origin: Method used to submit the request, text
referred: If the status field is 'referred' this describes to whom, text
public_description: Crowd-sourced text, my favorite data!  It can be any width
imfloc: Not described, drop it
floc: Not described, drop it

 

So far so good: we have a rough idea of how the schema should look.  How do we refine it?

In the post's download toolbox is a Python script tool, ReportCSVProperties, which does a full scan of the data and creates an in-memory metadata table showing a field definition that will work for each column, i.e. numbers will be preserved and text will not be truncated.  Running it with the sample data, the message stream looks like this (each 1% of progress is reported; most are elided here):

Inspecting 323441 rows...
Making in-memory table...
Inspecting data...
1% complete
...
50% complete
...
100% complete
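If you want to roll your own version of this step, the in-memory metadata table can be built with ArcPy along these lines (a sketch, not the tool's actual source; the field names are illustrative and 'findings' would come from a scan like the one sketched earlier):

import arcpy

# Create the in-memory table that will hold one row per CSV column
table = arcpy.management.CreateTable("memory", "csv_schema")[0]
arcpy.management.AddField(table, "field_name", "TEXT", field_length=64)
arcpy.management.AddField(table, "field_type", "TEXT", field_length=16)
arcpy.management.AddField(table, "field_width", "LONG")

# Illustrative findings: (name, workable type, max width seen)
findings = [("street_address", "Text", 104), ("case_age_days", "Short", 0)]
with arcpy.da.InsertCursor(table, ["field_name", "field_type", "field_width"]) as cursor:
    for row in findings:
        cursor.insertRow(row)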

The output table gives me a good start with the schema:

ReportCSVProperties Output

 

If I trusted these definitions I could use this schema information in the field map control in Table to Table to directly convert the data to geodatabase.  But I happen to know there are, at a minimum, "nan" string values polluting what should be an integer field (causing it to be treated as string), plus datetime and XY coordinate fields that would require subsequent geoprocessing to expose properly.  For the record, though, you could adopt the schema and apply field widths like this (I usually round widths up as a little future-proofing, so for example the street_address field width of 104 I bump up to 120):

Table to Table Field Map Example

 
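The same width bump can be scripted with ArcPy field mappings; here is a sketch (the input path and output names are hypothetical):

import arcpy

csv_path = r"C:\data\get_it_done_311.csv"  # hypothetical path

# Start from the field map Table to Table would derive, then widen street_address
field_mappings = arcpy.FieldMappings()
field_mappings.addTable(csv_path)

index = field_mappings.findFieldMapIndex("street_address")
field_map = field_mappings.getFieldMap(index)
out_field = field_map.outputField
out_field.length = 120  # round 104 up for a little future-proofing
field_map.outputField = out_field
field_mappings.replaceFieldMap(index, field_map)

arcpy.conversion.TableToTable(csv_path, r"C:\data\sandiego.gdb", "requests_311",
                              field_mapping=field_mappings)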

My goal here is showing you automated ETL, not step-wise processes, so we will jump into ArcGIS Data Interoperability functionality.  Beginning with the CSV source: it lives at a hyperlink, and I right-clicked on the dataset icon to get the download URL.  So let's go web-to-spatial-features in one tool, correcting and casting as we go:

Import 311 CSV to Geodatabase Features

 

Let's look at the CSV reader.  I manually set the schema using the number types and string widths found by my helper tool, except where I knew the "nan" values threw it off; I also declared the ISO-format datetimes and created geometry from the lat/lng fields.  When you specify a field to be integer, any string values are replaced with nulls, so the "nan" values (meaning "not a number", a computing convention usually cased as "NaN") are fixed.  So just at the reader step I have the exact schema I want.  But wait, there's more!  One of the great features of the Workbench app is being able to cache data at each canvas element and inspect it.  Look what the reader turned up!  Turn caching on and run just the reader, then click the green magnifying glass to inspect the cache.
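As an aside, the same reader-level fixes can be approximated outside Workbench with pandas; a sketch, with the file name and column subset assumed:

import pandas as pd

df = pd.read_csv(
    "get_it_done_311.csv",  # hypothetical file name
    dtype={"zipcode": "string"},  # keep ZIP codes as text, not floats
    na_values=["nan", ""],  # treat "nan" strings and empties as null
    parse_dates=["date_requested", "date_closed"],  # ISO datetimes parse cleanly
)

# The "nan" strings are now real nulls, so the self-join field can be
# cast to a nullable integer type instead of being left as text
df["service_request_parent_id"] = df["service_request_parent_id"].astype("Int64")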

Reader with caching enabled

 

Inspecting the cache

 

The data should all be in San Diego, but I see dots all over the US, plus one in Australia.  Selecting features with no geometry finds another 2450.  These issues need fixing!  To repair null and out-of-area geometries I geocoded the features that were not in the city area, and where geocoding failed I emailed a note suggesting the data be reviewed.

Email from the ETL process

 
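In a standalone Python script, an equivalent notification could be as simple as this (a sketch; the server and addresses are placeholders):

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "311 records failed geocoding - please review"
msg["From"] = "etl@example.com"  # placeholder addresses
msg["To"] = "data-steward@example.com"
msg.set_content("Some records had null or out-of-area geometry after today's load.")

with smtplib.SMTP("smtp.example.com") as server:  # placeholder SMTP host
    server.send_message(msg)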

That gets my CSV data cleaned up and into a geodatabase, from which I made a feature service.  Note there are four layers in the service, one for each request status type - Closed, In Process, New and Referred.

Feature Service from CSV

 

So that's part one; you can see how easy it is to get CSV data into beautiful shape in your GIS.

Now for the fun part, beautiful messaging when the data changes!

The data changes daily, so to refresh the feature service I made the ETL tool RefreshGetItDone311.  Here is a screen grab:

Refresh 311 Feature Service

This tool is a bit busier than its parent, which writes to geodatabase, because it reads and writes the four layers in the feature service after doing change detection.  Step through it yourself; note the two tunnels, and how I merge existing geometry and changed attribution to write only the daily delta to the service.  This workspace can be made into a scheduled task to maintain the data continually - see the sketch below.  I'll leave the service up for a while, so you can add it to a map yourself.
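If you go the scheduled task route, the job can be a short ArcPy script that runs the ETL tool; a sketch, where the toolbox path and alias are assumptions you would adjust to your download location:

import arcpy

# Hypothetical toolbox path and alias
arcpy.ImportToolbox(r"C:\tools\GetItDone311.tbx", "etl")

# Spatial ETL tools run like any other geoprocessing tool once imported
arcpy.etl.RefreshGetItDone311()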

The real thing I want to highlight, though, is the very last transformer in the workspace, a WorkspaceRunner.  This performs some postprocessing - refreshing a pie chart item and sending a Microsoft Teams adaptive card with a map link to a channel - smart and attractive messaging.

Pie Chart made from new 311 features

Here is the postprocessing workspace, ReportGetItDone311.

Update bar chart and message Teams

New 311 features with case_age_days equal to zero are counted, a pie chart image is made (segmented by request type), the pie chart image is used to update an image item in ArcGIS Online, and lastly a card message is sent to a Teams channel webhook.

First, the pie chart.  While authoring the ReportGetItDone311 tool I created an image item; if you make image items public you can use their data URL as a file path in a card.  I also created a web map with the New features, filtered by case_age_days = 0, and added a bar chart to the web map.  You can read here that you can create and modify web maps just by sending URLs to ArcGIS Online.
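The chart refresh can also be scripted; here is a sketch using matplotlib and the ArcGIS API for Python (the counts, credentials and item ID are placeholders):

import matplotlib.pyplot as plt
from arcgis.gis import GIS

# Placeholder counts by request type for case_age_days = 0
counts = {"Graffiti": 120, "Pothole": 85, "Illegal Dumping": 40}
plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.0f%%")
plt.title("New 311 requests by type")
plt.savefig("pie.png")

# Overwrite the existing public image item so its data URL stays stable
gis = GIS("https://www.arcgis.com", "username", "password")  # placeholder credentials
item = gis.content.get("<image item id>")  # placeholder item ID
item.update(data="pie.png")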

With my pie chart and web map items shared publicly, the last step was to tell the world whenever I run RefreshGetItDone311.  I thought this was going to be scary hard, but it turned out to be crazy simple.  First I made an incoming webhook in my Teams channel of choice:

Incoming webhook

 

This gives me the URL to send a card to with an HTTPCaller.  Now I need the payload JSON.  If you open App Studio in Teams you get a handy card editor:

Teams Card Editor

 

I copied the JSON out of the card editor, manually made an edit that a little research indicated was necessary, and then in my HTTPCaller plugged in some runtime variable substitution to pick up the date and feature count for the day's processing.  Just to prove it wasn't a fluke, I'll run it again in the morning before posting this blog.
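Under the hood the HTTPCaller is simply doing an HTTP POST, so the same message can be sent from Python; here is a sketch with a trimmed-down card payload (the webhook URL, link and values are placeholders, and your card editor JSON will be richer):

import requests

webhook_url = "https://outlook.office.com/webhook/..."  # placeholder - your channel's URL

payload = {
    "@type": "MessageCard",
    "@context": "https://schema.org/extensions",
    "summary": "Get It Done 311 refresh",
    "title": "Get It Done 311 refreshed",
    "text": "123 new requests loaded on 2021-11-17.",  # values substituted at runtime
    "potentialAction": [{
        "@type": "OpenUri",
        "name": "Open web map",
        "targets": [{"os": "default", "uri": "https://www.arcgis.com/apps/..."}]  # placeholder link
    }]
}

requests.post(webhook_url, json=payload)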

...zzzzzzz...

Here is the result:

Card in Teams

 

Web Map Linked From Card

 

Pie Chart Item In Online

So there you have it: plain old CSV turned into beautifully typed feature service data, talking to us with a beautiful message.  The tools are in the post's download (I will deactivate the Teams webhook URL).