Building a Data Driven Organization, Part #13: Handle Structured Data Simply And With Precision

987
2
02-24-2022 05:19 AM
BruceHarold
Esri Regular Contributor
1 2 987

Very often making information products means implementing a feature classification system using business rules of a user community.  Examples abound: administrative divisions of land, classes of asset, agricultural production, forms of land cover, quality metrics, ownership hierarchies, and the example I'm running with in this post - NAICS industrial classifications.

There are more than a thousand industry classes when you go to the most granular level.  You will appreciate that querying them can be complex and error prone.  This post is about you as a data curator making complex queries simple for your colleagues by using ArcGIS Data Interoperability.

With NAICS data, queries can look like this, needing just a few clicks:

Selecting Data ClassesSelecting Data Classes

And not like this, scrolling though 1000+ rows and guessing search terms:

Industry Classes by SQL QueryIndustry Classes by SQL Query

Best of all - you don't have to write a line of code.

As a bonus extra I'm going to include some tips on precision geocoding of this industry data.  Let's get going!

My subject industry data is from an Open Data site for the City of Seattle, specifically their Active Business License Tax Certificate dataset.  Seattle use Socrata technology, which is directly readable by Data Interoperability once you download and install the relevant Hub package.  Socrata is a popular choice for open data sites, typically being used for purely tabular data, as is the case here.  My hypothetical use case is I run a business specializing in pest control and I want to map potential customers in the food business for target marketing.  I'm going to build a Spatial ETL tool that lets me use smart pickers for this complex data and appropriate geocoding options to make the data spatial, then I can share the tool with anyone interested in the same business data but perhaps for different industry sectors.

The North American Industrial Classification System is used in the United States, Canada and Mexico and is broadly compatible with the UN's International Standard Industrial Classification System, there are analogs worldwide.  The business license dataset is not spatial but is very well maintained and contains good address data.  I'm going to geocode it with tight quality control; I'll do this by confining geocode matches to high precision point of interest (POI, meaning matches include the business name) categories available in the ArcGIS Online World Service - the ones that match my industry classes of interest.  Successful POI geocodes validate business names and can return things like website and phone number details.  I will also allow the high-precision cases of normal address geocoding, if a POI match cannot be found we use the street address.  No matching to any sort of zone or area will be allowed.  I will geocode to match the industry classifications of interest, with a fallback to precise types of street address.

Here is the tree-pick experience for geocode category, the categories themselves are built into the geocode service but the picker is enabled by Data Interoperability, just like the industry class one:

Geocode CategoriesGeocode Categories

There are several categories I picked that are out of sight in the screen capture above, but navigating to them was very easy given the tree structure and hierarchy of available choices.

So I can easily choose my Socrata-hosted industries of interest to extract that data, and I can pick how I want the data geocoded by category.  Is there anything else I need to control?  Yes, when inspecting the data I could see a few food business licensees located across the border in Canada, so to support specifying two allowed countries I built a picker for the country code parameter:

Country CodeCountry Code

Country code is a hard filter, no matches can stray into another country no matter how much an address is like one in another country.

For completeness I also built a picker for the language code parameter which you can read about at the same link as the one above for country code:

Language CodeLanguage Code

For my data, specifying language code wasn't useful, but if you need to work multilingually you can.

Now let's get to the real value proposition for this post, showing you how to build smart ETL tool parameters.  The 'trick' is using advanced options when building Choice parameters.

In my tool that extracts and geocodes business data I have a number of user parameters.  To make each tree picker I used the Import option (see to the right in the Choice Configuration area) to read a CSV file supplying the tree definition.  The CSV file has Display and Value columns that the importer is looking for:

Choice Tree Parameter ConfigurationChoice Tree Parameter Configuration

The CSV files are in the post download, the one for industry classes looks like this:

Industry Parameter Import CSVIndustry Parameter Import CSV

The only remarkable detail for this file is I used a pipe delimiter and not a comma as the display values contain commas.

How did I make the CSV files?  They are all built with one Data Interoperability workspace MakeParameterCSVs.fmw, in the post download.  Start Workbench from the Analysis ribbon in Pro, open the file and we can walk through it.

MakeParameterCSVsMakeParameterCSVs

To make Categories.csv and Countries.csv for geocoding parameter purposes the geocode service definition is queried and the JSON response picked apart; this is a simple example of handling JSON from the web.  LangCodes.csv is made by querying the web help.  In the post download is a Python script GeocodeServerProperties.py that will give you an idea of all available properties in the service.  As a side note, if you are geocoding against a StreetMap geocode service on-premise the same properties are available.

To make NAICSCodes.csv the appropriate NAICS spreadsheet download URL is read by an Excel reader and the desired hierarchy built.  The Socrata dataset is also read to make sure only industry codes that exist in Seattle will be available in the destination tool parameter.

With the desired CSV files in place my ETL tool - BusinessExtractor - can be configured.  Let's walk through that:

BusinessExtractor WorkspaceBusinessExtractor Workspace

Basically Socrata is read and geocoded with the findAddressCandidates REST endpoint.  If the data was bigger in row count I would have used geocodeAddresses.  I use 4 concurrent threads in the HTTPCaller and performance was excellent.  Note that you will need to edit the ArcGISOnlineTokenGetter to use your credentials, and you will need geocoding privilege in Online.  Seasoned users will know there is a Geocoder transformer  I could have used, but it doesn't support category filtering or language code.

Spoiler
The Geocoder transformer is scheduled for a rewrite so stay tuned for a post on that!

Geocoding parameters are taken from the input pickers.  The most challenging part was parsing the category parameter to make it correctly formed for the REST API.

Here is how categories arrive from the user interface (order does not matter)...

"African Food" "American Food" "Argentinean Food" "Australian Food" "Austrian Food" Intersection "Point Address" "Street Address" Subaddress "BBQ and Southern Food" Bakery "Balkan Food" "Belgian Food" Bistro "Brazilian Food" Breakfast Brewpub "British Isles Food" Burgers "Cajun and Creole Food" "Californian Food" "Caribbean Food" "Chicken Restaurant" "Chilean Food" "Chinese Food" "Coffee Shop" "Continental Food" Creperie "East European Food" "Fast Food" "Filipino Food" Fondue "French Food" "Fusion Food" "German Food" "Greek Food" Grill "Hawaiian Food" "Ice Cream Shop" "Indian Food" "Indonesian Food" "International Food" "Irish Food" "Italian Food" "Japanese Food" "Korean Food" "Kosher Food" "Latin American Food" "Malaysian Food" "Mexican Food" "Middle Eastern Food" "Moroccan Food" "Other Restaurant" Pastries Pizza "Polish Food" "Portuguese Food" Restaurant "Russian Food" "Sandwich Shop" "Scandinavian Food" Seafood Snacks "South American Food" "Southeast Asian Food" "Southwestern Food" "Spanish Food" "Steak House" Sushi "Swiss Food" Tapas "Thai Food" "Turkish Food" "Vegetarian Food" "Vietnamese Food" Winery Butcher "Candy Store" Grocery Market "Wine and Liquor" "Bar or Pub"

...and here is how they go to the category parameter in the API (again, order does not matter):

Intersection,Subaddress,Bakery,Bistro,Breakfast,Brewpub,Burgers,Creperie,Fondue,Grill,Pastries,Pizza,Restaurant,Seafood,Snacks,Sushi,Tapas,Winery,Butcher,Grocery,Market,African Food,American Food,Argentinean Food,Australian Food,Austrian Food,Point Address,Street Address,BBQ and Southern Food,Balkan Food,Belgian Food,Brazilian Food,British Isles Food,Cajun and Creole Food,Californian Food,Caribbean Food,Chicken Restaurant,Chilean Food,Chinese Food,Coffee Shop,Continental Food,East European Food,Fast Food,Filipino Food,French Food,Fusion Food,German Food,Greek Food,Hawaiian Food,Ice Cream Shop,Indian Food,Indonesian Food,International Food,Irish Food,Italian Food,Japanese Food,Korean Food,Kosher Food,Latin American Food,Malaysian Food,Mexican Food,Middle Eastern Food,Moroccan Food,Other Restaurant,Polish Food,Portuguese Food,Russian Food,Sandwich Shop,Scandinavian Food,South American Food,Southeast Asian Food,Southwestern Food,Spanish Food,Steak House,Swiss Food,Thai Food,Turkish Food,Vegetarian Food,Vietnamese Food,Candy Store,Wine and Liquor,Bar or Pub

Here is the end result, a high proportion of point of interest geocodes with the vast majority of the remainder geocoded to rooftop locations (PointAddress and SubAddress).  Exactly what I was looking for to use in my pest control business!

Seattle Food BusinessesSeattle Food Businesses

The workspaces, CSV files and toolbox are in the post download.  Now you can build smart input parameters for your ETL tools!

2 Comments
KimOllivier
Occasional Contributor III

Wow!

BruceHarold
Esri Regular Contributor

@KimO you should know by now 😉