Fast Batch Geocoding In Any Environment With ArcGIS Data Interoperability

04-23-2021 08:55 AM
BruceHarold
Esri Regular Contributor

If your system of record is one of the many storage options that are not supported geoprocessing workspaces in ArcGIS Pro, you might be tempted to adopt less-than-optimal workflows like manually bouncing data through CSV or file geodatabase just to get the job done.  This blog is about avoiding that: simply reading your data from where it lives, geocoding it, and writing the output where you want it, all from the comfort of your Pro session, or even as an Enterprise web tool.

Here are 1 million addresses I geocoded from Snowflake and wrote back to Snowflake without the data touching the ground:

BruceHarold_0-1619182483599.png

An aside - I got the Snowflake data into a memory feature class using this tool.

We're seeing people who need flexibility in one or both of the storage technologies involved: where their address data is managed and where the geocoded spatial data lives.  Data has gravity and there is no need to fight that.  Let's see how to achieve this.

Full disclosure: the blog download contains the fmw source for a Data Interoperability ETL tool for ArcGIS Pro 2.8, which at the time of writing isn't released.  I'm using that release because it has the FME 2021 engine, which supports parallelized HTTP requests, and that matters for performance.  If you don't have Pro 2.8, ask our good friends at Safe Software for an evaluation of FME 2021 to surf the workspace.

Here is how it looks when edited, and I always like this bit: you can't see any code because there isn't any!  We're 21% through the 21st century; who writes code any more just to get work done? 😉

 

BruceHarold_0-1619531035154.png

 

Partly I'm talking about moving data around, but you already knew you could do that; it's the batch geocoding that is the value here, so let's dig into that.

When you're batch geocoding data that is in flight between systems of record, the Geocode Addresses geoprocessing tool isn't suitable: it requires table input and writes feature classes.  Instead, I'm using the geocodeAddresses REST endpoint.  To make the endpoint available I published a StreetMap Premium locator to my portal.  I could also have used ArcGIS Online's World Geocoding Service or a service published from a locator I built myself; the API is identical.  Which way you go will depend on a few decision points:

  • StreetMap locators on-premise don't send your data out to the internet
  • StreetMap locators can be scaled how you want on your own hardware
  • StreetMap locators have a fixed cost
  • Online requires no setup but has a variable cost based on throughput
  • Online sends your address data out to the internet (albeit securely)

I want to make the point you can scale your batch geocoding how you want, so I went with StreetMap.

Now, how to drive it.  Under the covers the geocoding engine considers an address in its entirety, scoring how well it matches all known addresses and picking a winner (or failing to).  You will notice you can supply address content in parts (base address plus zone fields) or as one value: SingleLine.  If you supply address parts, the service in fact concatenates them in some well-known orders, based on the address grammar for the area covered, before submitting them to the engine.  Assuming you know the structure of your address data as well as Esri does, you may as well do this yourself and supply the whole address as SingleLine values, and you'll see that is what I do in my sample.  The only other data dependency is an integer key field in your data you can name ObjectID; this comes out the other end as the ResultID field, which you can use to join the results back to your source data.
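If you want to see what one of those batch requests looks like outside the ETL tool, here is a minimal Python sketch of a geocodeAddresses call using SingleLine values.  The locator URL, token and addresses are placeholders, not the ones from my workspace:

```python
import json
import requests

LOCATOR_URL = "https://myserver.example.com/server/rest/services/StreetMapNA/GeocodeServer"  # placeholder
TOKEN = "<portal-token>"  # placeholder

def geocode_batch(rows):
    """rows: iterable of (key, single_line_address) tuples."""
    records = [{"attributes": {"OBJECTID": key, "SingleLine": addr}} for key, addr in rows]
    resp = requests.post(
        f"{LOCATOR_URL}/geocodeAddresses",
        data={
            "addresses": json.dumps({"records": records}),
            "outSR": 4326,
            "f": "json",
            "token": TOKEN,
        },
    )
    resp.raise_for_status()
    # Each result's attributes carry ResultID, which echoes the OBJECTID you sent,
    # so you can join the coordinates back to your source rows.
    return resp.json()["locations"]

# Example: one small batch
locations = geocode_batch([(1, "380 New York St, Redlands, CA 92373")])
for loc in locations:
    print(loc["attributes"]["ResultID"], loc["location"])
```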

Let's go through the parameters of the ETL tool.

BruceHarold_0-1619184412316.png

CountryCode is a hard filter for country.  If your locator supports multiple countries, like StreetMap North America does, and your data is from known countries, customize and use this parameter.  If you only have a single-country locator, don't supply a value.

MatchOutOfRange is a switch for finding addresses a small way beyond known house number ranges for street centerline matches.

The BatchSize should not be larger than the maximum allowed for your locator.  You can see this number in the locator service properties.  The blog download also has a handy Python script for reporting service properties; edit it for your URL.
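If you would rather not dig out the script, a couple of lines of Python against the service root return the same information; the URL and token below are placeholders for your own service:

```python
import requests

LOCATOR_URL = "https://myserver.example.com/server/rest/services/StreetMapNA/GeocodeServer"  # placeholder
TOKEN = "<portal-token>"  # placeholder

# The geocode service root, requested as JSON, includes the locator properties.
info = requests.get(LOCATOR_URL, params={"f": "json", "token": TOKEN}).json()
props = info.get("locatorProperties", {})
print("MaxBatchSize:", props.get("MaxBatchSize"))
print("SuggestedBatchSize:", props.get("SuggestedBatchSize"))
```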

Categories lets you filter the acceptable match types; you could customize this parameter to allow only certain types of Address matches, for example.

LocationType lets you select rooftop or roadside coordinates for exact house point matches.

GeocodeConcurrency isn't a locator property, it's a property of the batch handling in the ETL tool.  My sample uses 4, meaning at any time the service is handling 4 batch requests.  Make sure to configure as many geocode service instances as you need and make this property agree.  In my case I don't have serious metal to run on, just a small virtual machine somewhere in the sky, but if you have a need for a 64-instance setup then go for it.  At some point, though, you'll be limited by how fast you can read data, send it out and catch the results.  My guess is few people will need more than 8 instances in a large geography.
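To see what this concurrency means outside the ETL tool, here is a rough Python sketch that keeps four batch requests in flight at once.  It reuses the hypothetical geocode_batch function from the earlier sketch and assumes addresses is a list of (key, single-line) tuples:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 1000    # stay at or under your locator's MaxBatchSize
CONCURRENCY = 4      # match the number of geocode service instances you configured

def batches(rows, size):
    """Split the address list into locator-sized chunks."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def geocode_all(addresses):
    results = []
    # Up to CONCURRENCY batch requests are in flight at any time.
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        for locations in pool.map(geocode_batch, batches(addresses, BATCH_SIZE)):
            results.extend(locations)
    return results
```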

If you're a Data Interoperability or FME user you'll know there is already a Geocoder transformer in the product, and it can use Enterprise or Online locators.  However, it works on one feature at a time, and I want multiple concurrent batches for performance.  In my particular case I'm using Snowflake, and it's possible to configure an external function to geocode against an Esri geocode service, but that is also one feature at a time (stay tuned for news from Esri on this function in Snowflake).

That's pretty much it.  Surf the ETL tool to see how I build the batch JSON and how I interpret the result JSON, then implement your own workflows.  Enjoy!

But wait, there's more!  I edited the tool on 27th April 2021 to squeeze more performance (30%!) out of the already parallelized geocode step by managing the HTTP calls in two queues.  While one set of 4 batches is processing, another set is uploading.  This is the bookmark to look for.

BruceHarold_0-1619527944275.png

 

 

 

13 Comments
JoeBorgione
MVP Emeritus

We're 21% through the 21st century, who writes code any more just to get work done?

I sure do...  When can we expect 2.8 to be released so we can give this a go?

BruceHarold
Esri Regular Contributor

Hi Joe, Data Interoperability 2.8 beta will be available early next week.

BruceHarold
Esri Regular Contributor

Joe & everyone, note that I updated the tool to double-dip on parallelized geocoding by managing two queues.

Natalie_Campos
New Contributor III

Hi Bruce

Nearly 2 years later, would this still be the best possible route for "Batch" geocoding within Data Interoperability?

BruceHarold
Esri Regular Contributor

Hi Natalie, it is probably simpler and easier to use the Geocoder transformer now; it has had some love since I created this sample.

Natalie_Campos
New Contributor III

Thanks @BruceHarold. We did try the out-of-the-box Geocoder transformer and it takes significantly longer, resulting in a timeout from the SQL database. See below.

2023-04-13 00:38:46|      |     |ERROR |2023-04-13 00:38:46|25145.9|  0.0|ERROR |Microsoft SQL Server Non-Spatial Reader: Query failed, possibly due to a malformed statement.  Query Text `[no query]'. Provider error `(-2147467259) TCP Provider: An existing connection was forcibly closed by the remote host.'

Additionally, we posted a Batch Geocoding question in the FME Community and Safe said the HTTPCaller would be our best option. So we plan to move forward using this HTTPCaller method. We thank you for the workflow.

One issue we continue to have is special characters.  When the same dataset is geocoded from ArcGIS Pro, the table is geocoded with no errors or issues. When the same table is used in the Data Interop workbench, we are required to use a large number of StringReplacer transformers to prevent the workbench from failing.

Do you have any advice on the difference between how the Pro GP tool or wizard reads the SQL table vs. Data Interoperability?

 

@LindsayKing1

@AviCueva

 

BruceHarold
Esri Regular Contributor

Hi

Are you replacing accented characters in a specific language or HTML elements?  Please describe the data.  Unicode should work.

AviCueva
New Contributor

Thanks Bruce,

Data comes in from a SQL View.  Data has been entered by numerous individuals using different devices/writing styles.

We take the SingleAddress field from the SQL view, pass the data through StringReplacer transformers, then pass that to the batch geocoder.

The following are examples of the type of string replacement needed to avoid issues with the batch geocoder:

Text to Replace: \"C\"
Replacement Text: C

Other strings that need the same treatment:

  • \"C\"
  • (
  • )
  • \"B\"
  • #
  • @
  • ½
  • “D”
  • \"D\"
  • C/O
  • \"F\"
In addition, a zip code with a trailing dash will also cause the batch geocoder to fail.

We're trying to get the batch geocoder up and running, as the out-of-the-box Geocoder's performance is dramatically slower when dealing with large numbers of records (800k+).

With the string replacements in the model, we've been able to successfully get the batch geocoder working. (Thank you!)  However, it's fairly fragile as we're one bad input away from failure.

Any advice/assistance you could provide would be much appreciated!

BruceHarold
Esri Regular Contributor

I haven't run into these character sensitivities myself so can't offer much advice.

If you could bundle up a set of addresses and share them, we might be able to track down what is going on by plugging them into a browser against findAddressCandidates; if you like, open a support call if this reproduces the issue.

AviCueva
New Contributor

Bruce, I've tried running the same set of data through the findAddressCandidates REST endpoint.  It does not error out through the browser.

I'll take your advice and bundle up some of the trouble address points and enter a support call. Thank you!

Natalie_Campos
New Contributor III

@BruceHarold 

Is there a "best practice" for maintaining attributes with the data when using this approach?  We added the attributes to the JSONTemplater to ensure the attributes stay with the record, and then use the JSONFragmenter to extract both the geocoding JSON and the attribute JSON. This essentially created two JSON fields to be extracted. This has worked, but also seems to have increased our batch geocoding time for 800k records from under 2 hours to over 8.

 

@AviCueva 

BruceHarold
Esri Regular Contributor

Hi, assuming your data has a key field, you could preserve just that field through geocoding and join your attributes back onto the geocode results afterwards, so you don't have to haul the data around in JSON documents.
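As a rough illustration of that idea (a sketch only, with illustrative names; locations is assumed to be the list of results returned by the geocodeAddresses calls):

```python
import pandas as pd

# Flatten the geocode results, keeping ResultID (the key sent as OBJECTID).
results = pd.DataFrame(
    [{"ResultID": loc["attributes"]["ResultID"],
      "x": loc["location"]["x"],
      "y": loc["location"]["y"],
      "score": loc["score"]}
     for loc in locations]
)

# Read the source rows (key column plus all the other attributes) and join.
source = pd.read_csv("addresses.csv")
joined = source.merge(results, left_on="MyKey", right_on="ResultID", how="left")
```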