Hi all. I have a question that I am hoping you all could help me with.
I am using ArcGIS Pro (but I could also use ArcMap 10.5), with StreetMap Premium for geocoding. I have a large dataset of addresses where one person may have multiple addresses. Sometimes these are actually different addresses, and sometimes the same address is repeated. I want to geocode the addresses and then, if they geocode to the same address/XY coordinate, collapse them so that each address is only one row (with the begin and end dates taken from the first and last rows, respectively).
If the address fields are not exactly the same, the locator will not match them to the same coordinate and usually falls back to the zip code. In the example below, rows 1 and 2 will match to the same coordinate, but row 3 will probably geocode to the zip code since the street name is not recognized (please take into account I made up the address). Usually this is because the street name is misspelled while all of the other fields are the same. Is there some tool to clean these addresses when this happens? I am familiar with SAS programming if anyone knows of a way to clean the addresses in that program, but I was wondering if anyone here has done this before, or if there is an easier way to do it in ArcGIS rather than geocoding, exporting, cleaning in SAS, and then bringing the data back into ArcGIS Pro.
Input:
Row  ID  Address            City      State  Zipcode  DateBegin  DateEnd
1    1   123 Northbay Lane  Sometown  CA     12345    1/1/2015   2/28/2015
2    1   123 Northbay Ln    Sometown  CA     12345    3/1/2015   11/30/2016
3    1   123 Nrthby Lane    Sometown  CA     12345    12/1/2016  3/10/2017
Desired output:
Row  ID  Address            City      State  Zipcode  DateBegin  DateEnd
1    1   123 Northbay Lane  Sometown  CA     12345    1/1/2015   3/10/2017
In this example I show just one person, but this can happen several times per person, and the datasets are several hundred thousand observations, so I can't make the changes manually.
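For concreteness, the collapse I'm after (once rows geocode to the same XY) would look something like this in Python; the field names and coordinates here are made up just to illustrate:

```python
from datetime import datetime
from itertools import groupby

# Hypothetical geocoded rows for one person: all three share the same XY.
rows = [
    {"id": 1, "x": -117.1, "y": 33.1, "begin": "1/1/2015", "end": "2/28/2015"},
    {"id": 1, "x": -117.1, "y": 33.1, "begin": "3/1/2015", "end": "11/30/2016"},
    {"id": 1, "x": -117.1, "y": 33.1, "begin": "12/1/2016", "end": "3/10/2017"},
]

def parse(d):
    return datetime.strptime(d, "%m/%d/%Y")

# Group on person ID plus geocoded coordinate, then keep the earliest
# DateBegin and the latest DateEnd within each group.
key = lambda r: (r["id"], r["x"], r["y"])
collapsed = []
for k, grp in groupby(sorted(rows, key=key), key=key):
    grp = list(grp)
    collapsed.append({
        "id": k[0], "x": k[1], "y": k[2],
        "begin": min(grp, key=lambda r: parse(r["begin"]))["begin"],
        "end": max(grp, key=lambda r: parse(r["end"]))["end"],
    })
```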
Any advice would be greatly appreciated!
It doesn't matter if you use ArcGIS Pro, ArcGIS 10.5, 10.4, 10.3, 9.3 or even ArcView 3.x; standardizing addresses is the bane of geocoding. My personal approach would be to clean the data up first and then geocode, rather than geocode and look for duplicate points.
Take a look at A better way to parse an address? I got some help from Geonet user Darren Wiens to put together a Python script that parses out address components. Once you have all the components of an address parsed, you can build a series of summary tables to examine redundancy: look for non-standard street types (LANE v LN), look for redundant house number + street name combinations, etc. It's not sexy work, and your eyes might get crossed after a while, but after messing with geocoding mass quantities of ugly address data for the past umpteen years (in 9-1-1, nobody knows their address), that's been my approach. And don't worry, your eyes eventually un-cross...
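As a rough sketch of the idea (this isn't the script from that thread, just a minimal regex parse over made-up addresses), the parse-then-summarize step might look like:

```python
import re
from collections import Counter

# Made-up sample addresses.
addresses = [
    "123 Northbay Lane",
    "123 Northbay Ln",
    "456 Oak St",
    "789 Oak Street",
]

# Naive parse: house number, street name, street type (last token).
pattern = re.compile(r"^(?P<num>\d+)\s+(?P<name>.+?)\s+(?P<type>\S+)$")
parsed = [pattern.match(a).groupdict() for a in addresses]

# Summary table of street types exposes non-standard variants: LANE v LN.
type_counts = Counter(p["type"].upper() for p in parsed)

# Redundant house number + street name combinations.
combo_counts = Counter((p["num"], p["name"].upper()) for p in parsed)
```

Real addresses will break a regex this naive (directionals, units, multi-word types), but the summary-table idea is the same at any scale.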
Best of luck!
Are the names important to you? You may want to start there. Create a summary table based on the person's name. That will tell you how many times each name appears, which is in theory how many addresses that person has listed. I'd take a look at the names with the highest numbers of addresses. Create a relate between your summary table and the original address table with the name field as the key. Select the name with the highest count value and then use the related table tool to expose all the records in the original table. You might find a common name (think John Q. Public) that appears 50 times. But there may be 10 different John Q. Publics, each associated with 5 addresses. You also may find that John Public lives at the same address as John Q. Public. Ohh the possibilities.
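Outside of ArcGIS, the same summary-and-relate idea can be sketched in a few lines of Python (hypothetical records, obviously):

```python
from collections import Counter

# Hypothetical (name, address) records; a name repeats once per address row.
records = [
    ("John Q. Public", "123 Northbay Ln"),
    ("John Q. Public", "456 Oak St"),
    ("John Public",    "123 Northbay Ln"),
    ("Jane Doe",       "789 Elm Ave"),
]

# "Summary table": how many address rows each name has.
name_counts = Counter(name for name, _ in records)

# "Relate" step: pull all original rows for the most frequent name.
top_name, _ = name_counts.most_common(1)[0]
related = [r for r in records if r[0] == top_name]
```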
A couple years ago I was given just under a million records of law enforcement calls to geocode. I'll call the department AnyTown PD. Some of the addresses were good. Some had addresses that were just Arizona, or Illinois, or Mexico, or even blank. Not much to work with, right? Did you know there are 48 ways to spell AnyTown? I could go on, but you get the picture.
With summary tables and relates you can expose tons of information you can use to help filter it down. Like I said, it's not real pretty work; no whales are going to be saved and you won't slow down global warming, but you will get a handle on your addresses.
We do not have names in this dataset, so thankfully that is one less thing to clean (we use study IDs). I will start on the tables for those with the most address entries (I think the highest number was around 20-30). As for the parsing code you linked to, all of those addresses are similar in that they do not have apartment numbers at the end. I have noticed that for our addresses an apartment number is sometimes provided (or a unit, room, or lot number). I think to parse the addresses correctly I am going to have to identify the records with this extra information and then move that information to the address2 field, right?
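Something like this hypothetical regex is what I had in mind for pulling those out (the list of unit keywords is just a guess at what's in our data):

```python
import re

# Assumed trailing unit designators: APT/UNIT/RM/ROOM/LOT/# plus a value.
unit_re = re.compile(r"\s+((?:APT|UNIT|RM|ROOM|LOT|#)\s*\S+)$", re.IGNORECASE)

def split_unit(address):
    """Move a trailing unit/apartment designator into an address2 field."""
    m = unit_re.search(address)
    if m:
        return address[:m.start()], m.group(1)
    return address, ""

addr1, addr2 = split_unit("123 Northbay Ln Apt 4B")
```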
That law enforcement dataset sounds daunting! I am glad you get to share what you learned from it with us on here.
For your geocoding, which tools and in what order are you using the various tools in your workflow? For example, are you using Standardize Addresses in your workflow? If so, how is it falling short for you?
I see the hierarchy of cleaning up addresses as: address parsing, address standardizing, and address validation. Address matching can be done at any level of the hierarchy, but the results are more accurate the higher in the hierarchy it is done.
Address validation is the holy grail, right? After all, what good is a standardized address if the address doesn't exist or isn't valid? I guess one could argue there is value in matching made-up addresses, but most people working with addresses are interested in valid addresses. There are numerous companies that offer address validation, although it usually costs money and is done through online APIs, which can be unappealing for different reasons.
Assuming you want to stay in the Esri ecosystem and simply clean the data up a bit before feeding it through your geocoding workflow, you could do some parsing and standardization on the front end. The approach I would take is:
1. Parse the addresses into their components (house number, street name, street type, unit).
2. Standardize the components (e.g., LANE and Lane both become LN).
3. Fuzzy match the distinct street names against each other to catch misspellings.
The biggest guesswork comes with the fuzzy matching, because at some point a threshold will have to be set for what is considered "the same." Although step 3 could be bypassed and fuzzy matching could be done on the full address, that would involve comparing each record against all other records. Assuming the number of misspelled street names is small relative to the total number of addresses, I think step 3 leads to a more efficient process.
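For the fuzzy-matching step, a minimal sketch using Python's standard-library difflib (made-up street names, and an arbitrary frequency floor and similarity cutoff that you would have to tune) might look like:

```python
from difflib import get_close_matches
from collections import Counter

# Hypothetical distinct street names with their row counts.
street_counts = Counter({"NORTHBAY": 240, "OAK": 118, "ELM": 57, "NRTHBY": 2})

# Treat rare names as misspelling candidates and fuzzy-match them against
# the frequent ("trusted") names. Both thresholds here are guesswork.
MIN_COUNT = 10   # names rarer than this are suspect
CUTOFF = 0.7     # similarity required to call two names "the same"

trusted = [n for n, c in street_counts.items() if c >= MIN_COUNT]
fixes = {}
for name, count in street_counts.items():
    if count < MIN_COUNT:
        match = get_close_matches(name, trusted, n=1, cutoff=CUTOFF)
        if match:
            fixes[name] = match[0]
```

Matching the small set of distinct street names, rather than every full address against every other, is what keeps this tractable at several hundred thousand rows.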
Thank you for your response! To be honest, I am somewhat new to geocoding datasets this large (with so many rows per person, and self-reported data leads to some very creative spellings of street names!), so I had not even thought to parse and standardize the fields. I don't think we would spend money on validating through a company, but after reading your response I think I am going to follow the steps you suggest. Thank you again for the advice!