No Schema Gap, No Shapefiles and No Problems!

BruceHarold, Esri Regular Contributor
12-19-2022 11:34 AM

As I write, it's the time of year to take stock of how you're doing and see if there are any lessons for next year.  I just took a look at how we're doing at supporting data movement - ETL's reason for being - and I have work to do, so I'll make a start here.

ArcGIS Data Interoperability sits in the middle of spatial data sharing workflows, and I'm seeing two big camps.  In one camp there are millions of Esri feature services being cooperatively maintained, and in the other there are millions of Shapefiles being shared; also in the "file" camp are CSV and Excel, adding more millions of their own.  It's just a fact of life that not everyone can stand up a feature service for every data source, and files are here to stay.

This post is about avoiding schema risk when sharing file-based data.  It is very simple.

Feature services power mapping, apps and analysis, live as managed items in your Online or Enterprise portal and, importantly for foundational data, support industry-standard relational data types (i.e. a schema) for the job they need to do.  The three file types I mention above do not.  CSV is honest about it, Excel behaves itself inside its native app but puts up a fight otherwise, and Shapefile talks to us from 1998 in a widely spoken but easily misunderstood dialect.

I'd better get a graphic in here before the search engines lose interest in me.  For a target schema to talk about, I deployed an Esri solution - Address Data Management - which you can find from the app switcher in arcgis.com if you have appropriate privileges:

Solutions: Get your industry standard schema here!

When you deploy a solution, a folder is created in your content containing the items needed to support it.  I highlight a hosted feature service, SiteAddresses, in my deployment.  It is empty (no features) when deployed but defines the schema I'll target when loading file-based data.

Deployed solution items - note SiteAddresses

For some file-based source data I'll go with the City and County of San Francisco's Enterprise Addressing System Addresses with Units (EAS) open data, specifically the Shapefile download option available via the Export link at that URL.

At the URL in a browser, the Columns in this Dataset table lets you inspect the Column Name, Description and Type details, including the API name for each field if you expand the row pulldown.  The API name is the name that ends up in the Shapefile, unless it is non-unique within 10 characters, in which case it gets an automatically generated name - as we see with "address_number_suffix" becoming "address__2" here.  As is usual for Shapefile, take the column typing with a grain of salt; for example, ZIP Code (API/Shapefile field name zip_code) is typed as a number but in the target schema needs to be a text field of width 5 characters.
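The 10-character renaming and the ZIP Code retyping can be sketched in plain Python. The exact uniquification scheme varies by driver, so the helper below is illustrative, not the rule this download service applies:

```python
# Sketch of the 10-character DBF field-name limit behind Shapefile renaming.
# Uniquification rules vary by driver; the generated names here are
# illustrative, though this scheme happens to reproduce "address__2".

def truncate_dbf_names(api_names):
    """Truncate each field name to 10 chars, de-duplicating collisions."""
    seen = {}
    result = {}
    for name in api_names:
        short = name[:10]
        if short in seen:
            # Collision: append a counter, staying within 10 characters.
            seen[short] += 1
            short = f"{short[:8]}_{seen[short]}"
        else:
            seen[short] = 1
        result[name] = short
    return result

names = truncate_dbf_names(
    ["address_number", "address_number_suffix", "zip_code"]
)
print(names["address_number"])         # address_nu
print(names["address_number_suffix"])  # address__2 (collides with the first)
print(names["zip_code"])               # zip_code (already unique)

# The Shapefile types zip_code as a number; the target schema wants
# 5-character text, so leading zeros must be restored.
def fix_zip(value):
    return str(int(value)).zfill(5) if value is not None else None

print(fix_zip(2110))  # 02110
```

Retyping the numeric field back to padded text is the kind of correction the ETL below handles for us.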

OK so we have some source data and a target schema, how do we connect the two?

Let's say we're in the file sharing camp because we cannot write to the SiteAddresses service but need to supply files which the service administrator can use to do so.  If we just supply the Shapefile we haven't done the job; we've just handed the administrator the problem.  He or she would have to learn our source schema and build a tool for data loading.  The point here is to provide data that is ready to validate and load - and it's crazy simple.  I'm going with OGC GeoPackage as my file deliverable, as ArcGIS Pro can use the Append geoprocessing tool to load it into the feature service.  You can also use Mobile Geodatabase.  SQLite workspaces are easy to share and, most importantly, support a proper schema.

Create a Spatial ETL tool and add a Shapefile reader for the EAS data and an OGC GeoPackage writer, taking care to use the Import from Dataset option for the table definition.

Import from Dataset

Navigate to the target feature service to import the schema.  This assumes you have access rights; if not, ask the data customer to give you a GeoPackage or Mobile Geodatabase containing empty objects so you can pick up the schema from that.

Import the feature service schema

Now you have a perfect starting point for your ETL.  Note that for my particular example the SiteAddresses service has multiple layers and I'm selecting two: one with the base address fields and the other with postal details.  The service has a relationship class between the two, and this sort of detail is exactly why schema observance is important.

ETL starting point with perfect schema

Now what?  Ignoring the postal detail table for now, the next step is to conduct a preliminary schema mapping.  You'll also notice I included a few transformers to handle null values coming from the Shapefile (these might be empty strings or encoded as special values), generate USNG values needed for my output, and perform reprojection - this is handled automatically, but my habit is to use the Esri engine.  The last step is to connect fields with a switchyard approach between the last transformer and the layer writer.  This isn't perfect: you can only map input to output fields with 1:M cardinality; doing things like concatenating multiple fields into one and performing calculations between them is coming up...
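The null handling those transformers do can be sketched like this; the sentinel values below are assumptions for illustration, not a list Data Interoperability uses:

```python
# Shapefile (DBF) has no real NULL, so missing values often arrive as
# empty strings, blanks, or sentinel numbers. A loading workflow maps
# these pseudo-nulls to real nulls before writing to a typed schema.
# The sentinel set here is a hypothetical example.

SENTINELS = {"", " ", 0, -9999}

def nullify(record, fields):
    """Return a copy of record with pseudo-nulls replaced by None."""
    out = dict(record)
    for f in fields:
        if out.get(f) in SENTINELS:
            out[f] = None
    return out

rec = {"zip_code": "", "address__2": " ", "unit_num": "12A"}
clean = nullify(rec, ["zip_code", "address__2", "unit_num"])
print(clean)  # {'zip_code': None, 'address__2': None, 'unit_num': '12A'}
```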

Preliminary schema mapping

Now the secret sauce.  Right click on the last connector and select "Replace Link with AttributeManager":

Replace the last connector with an AttributeManager

The AttributeManager duly appears, and you can finalize your schema adjustments.  Green arrows signify fields whose data is connected; yellow and red are unconnected and will send or receive no data.  System-managed output fields like ObjectID and the edit tracking set can be removed in the writer if you don't like looking at all that red ;-).

AttributeManager

Clicking on the AttributeManager gear wheel to access its properties brings up the processing grid.  Processing happens top to bottom (important to remember if you calculate values used in lower fields, or remove fields and attempt to use them after removal) and supports creating fields, calculating field values and re-ordering fields.  I forgot to map address__2 to ADDRNUMSUF in my initial mapping and didn't have a source field to supply MUNICIPALITY values, so I fixed both issues in the grid.
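The top-to-bottom behavior can be sketched in Python; the grid model and field handling below are illustrative, not the AttributeManager's actual implementation:

```python
# Sketch of top-to-bottom evaluation in a processing grid: each step runs
# in order, so a later step can read a value an earlier step computed,
# and a removed field is gone for every step after the removal.

def run_grid(feature, steps):
    """Apply (action, field, fn) steps in order, like the grid rows."""
    for action, field, fn in steps:
        if action == "set":
            feature[field] = fn(feature)
        elif action == "remove":
            feature.pop(field, None)
    return feature

feature = {"address_nu": "1455", "address__2": "B", "MUNICIPALITY": None}
steps = [
    # Fix the missed mapping: address__2 -> ADDRNUMSUF.
    ("set", "ADDRNUMSUF", lambda f: f["address__2"]),
    # No source field for MUNICIPALITY, so supply a constant.
    ("set", "MUNICIPALITY", lambda f: "San Francisco"),
    # Steps below this row would no longer see address__2.
    ("remove", "address__2", None),
]
result = run_grid(feature, steps)
print(result["ADDRNUMSUF"])    # B
print(result["MUNICIPALITY"])  # San Francisco
print("address__2" in result)  # False
```

Reordering the removal above the ADDRNUMSUF step would break the mapping, which is exactly the ordering pitfall to remember in the real grid.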

 

AttributeManager processing grid

Setting attribute values is supported by pickers for a large set of string, math and datetime functions, making expression building easy and vastly more productive than working in SQL or coding in Python.

Successful ETL from file data to GeoPackage

There you have it.  Take any file-based data that is weakly typed (not to disparage the data curators here; it's what they have to work with) and share it in a format and schema that will support enterprise GIS workflows.  Best of all - no code!

The Data Interoperability workspace is in the download.