Geocoding Large Datasets: What to Do with Tied Candidates

KevinWolff · ‎02-25-2011

Hi all! I am a graduate student in Criminology and have a few questions about geocoding that I was hoping experts in this field could help me address. We are working on a large, multi-city project which involves geocoding crime data from over 100 cities across the U.S., so I have a lot of incident-level data to geocode in order to aggregate to counts by census tract. Additionally, some of the data is not of the best quality, and therefore there is often a fair number of tied candidates.

My question is: within the geography/geocoding literature and/or in practice, when addressing large datasets (2+ million records), what is generally done with the cases where the candidates tie? While there are not a tremendous number of ties (3-5%), with datasets this large it would be impractical to go through and select one of the tied addresses by hand for each tied case. I am wondering if there is any literature within the field, or a general rule which is followed, for adding these cases to the matched or unmatched pile.

It seems to me that because our ultimate goal is census tract counts it may be appropriate to use the ties because it is unlikely that choosing either of the matched addresses would lead to the point being assigned to a different tract. However, I am not sure that this is the best way to approach this issue and that is why I have come to you all for help.

I appreciate any knowledge on this issue, including sources I should be familiar with. Thank you for your time. Cheers

Sincerely,
Kevin Wolff

JoeBorgione · ‎02-25-2011

I don't know of any literature citations to recommend; however, just for fun you might make some random checks on your ties and see whether the tied addresses fall in different tracts. You'll need to decide what a significant sample size would be.
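The random check suggested here could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the record layout, the `candidate_tracts` key, and the tract IDs are all made up, and it presumes each tied candidate has already been assigned a tract (for example by a spatial join done elsewhere).

```python
import random

def sample_tie_tract_disagreement(tied_records, sample_size, seed=0):
    """Estimate the share of tied records whose candidates span several tracts.

    tied_records: list of dicts, each with a hypothetical "candidate_tracts"
    key listing the census tract ID of every tied candidate for that record.
    Returns (share, crossing_records) for a simple random sample.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    n = min(sample_size, len(tied_records))
    sample = rng.sample(tied_records, n)
    # A record "crosses" tracts when its candidates disagree on the tract ID.
    crossing = [r for r in sample if len(set(r["candidate_tracts"])) > 1]
    share = len(crossing) / n if n else 0.0
    return share, crossing

# Made-up records: only record 2 has candidates in different tracts.
tied = [
    {"id": 1, "candidate_tracts": ["T001", "T001"]},
    {"id": 2, "candidate_tracts": ["T001", "T002"]},
    {"id": 3, "candidate_tracts": ["T003", "T003"]},
    {"id": 4, "candidate_tracts": ["T004", "T004"]},
]
share, crossing = sample_tie_tract_disagreement(tied, sample_size=4)
print(share, [r["id"] for r in crossing])
```

If the share of tract-crossing ties in the sample is small, accepting ties is unlikely to move many incidents between tracts; how large the sample needs to be is still a judgment call.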

Also, I think your partial hits might be more of a problem. Let's say you have an incident logged as 1234 S Main St and you get a partial match at 1234 N Main St because S Main St is now called S Old Main St but the LEO still calls it S Main St. You may want to take a look at your partial hits; working in 9-1-1, I always look carefully at those that match at less than 100%. Obviously that's not a practical standard for the number of records you are dealing with.

Hope this helps-

That should just about do it....

KevinWolff · ‎02-28-2011

Thank you for the quick response Joe. I just wanted to make sure I understand correctly what you are recommending that I could do to address this issue. For the ones which are 100 percent matched, but tied, I could take a sample and verify that they do in fact lie within the same tract. However, I understand that the partial matches may be much more of a problem: if the two candidates are matched at 78 it is more likely that they are in very different locations. Is it possible then to have Arc count them as matched if their score is 100 but unmatched if it is any less (if I find that I am comfortable with the 100 percent matches)?

Secondly, is there anything known about how Arc chooses 1 of the tied candidates as the match address? Is this process random? Or is it determined by the layout of the line file? It occurred to me if it were random, it is likely to be less of an issue, however if it just chooses the first one it comes across it may be biased in some way. Any insight into this process would be greatly appreciated.

Thanks again for all the help! Cheers.

JoeBorgione · ‎02-28-2011

My responses follow each quoted passage.

"Thank you for the quick response Joe. I just wanted to make sure I understand correctly what you are recommending that I could do to address this issue. For the ones which are 100 percent matched, but tied, I could take a sample and verify that they do in fact lie within the same tract. However, I understand that the partial matches may be much more of a problem: if the two candidates are matched at 78 it is more likely that they are in very different locations."

Look at two addresses: 1234 S Main St and 1234 N Main St. Toss out the pre-directions and they are an exact match. That's why partial hits can be problematic.

"Is it possible then to have Arc count them as matched if their score is 100 but unmatched if it is any less (if I find that I am comfortable with the 100 percent matches)?"

Absolutely; when you create your locator you can adjust what scores count as a match or a miss. You can also score ties as a miss if you like.

"Secondly, is there anything known about how Arc chooses 1 of the tied candidates as the match address? Is this process random? Or is it determined by the layout of the line file? It occurred to me if it were random, it is likely to be less of an issue, however if it just chooses the first one it comes across it may be biased in some way. Any insight into this process would be greatly appreciated."

I've never really thought about it that way; typically I view ties as a problem in the data I'm matching against, and I go in and fix the problem. If two addresses match against two separate streets, that means the ranges overlap, and I like to fix those.

"Thanks again for all the help! Cheers."

Good luck
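The two locator settings mentioned in this reply (a minimum score for a match, and whether ties count as a match) can be illustrated with a small Python sketch. It is not ArcGIS code and does not call any ArcGIS API; the function name, parameters, and the "matched" / "tied" / "unmatched" labels are assumptions chosen for illustration only.

```python
def classify(candidates, min_match_score=100, match_if_tied=False):
    """Sort one geocoded address into "matched", "tied", or "unmatched".

    candidates: list of (score, location_id) tuples for a single input
    address. The layout is hypothetical, not an ArcGIS data structure.
    """
    if not candidates:
        return "unmatched"
    best = max(score for score, _ in candidates)
    if best < min_match_score:
        # Partial hits below the threshold are treated as misses.
        return "unmatched"
    top = [c for c in candidates if c[0] == best]
    if len(top) > 1 and not match_if_tied:
        # Several candidates share the best score: keep them in their own pile.
        return "tied"
    return "matched"

print(classify([(100, "a")]))                                  # single 100 hit
print(classify([(100, "a"), (100, "b")]))                      # tie at 100
print(classify([(78, "a"), (78, "b")]))                        # partial tie
print(classify([(100, "a"), (100, "b")], match_if_tied=True))  # ties accepted
```

Keeping ties in a separate pile, rather than folding them straight into the misses, leaves room to sample them later and decide whether they are safe to accept.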


KarynBackus · ‎03-01-2011

In 9.x, the matching of ties is arbitrary. As I understand it, the first candidate that matches at 100% is used. If choosing not to match when candidates are tied, then cases with duplicate candidates with the same geometry will remain unmatched. This can be an issue when using alternate names (like 10th St and Tenth St).

In v10, the documentation states that candidates with the same geometry will not be considered Ties. This avoids the 10th vs Tenth issue. However, matching of ties with different geometry is still arbitrary.
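The distinction described here, where top-scoring candidates that share a geometry are not counted as a tie, can be sketched in Python as below. This only illustrates the idea and does not reproduce ArcGIS internals; the candidate layout, the coordinates, and the use of exact coordinate equality to mean "same geometry" are all assumptions.

```python
def is_true_tie(candidates):
    """Return True only if the best-scoring candidates differ in geometry.

    candidates: list of (score, (x, y)) tuples for one input address.
    Candidates that share a point (e.g. "10th St" and "Tenth St" as
    alternate names for the same segment) collapse to one location.
    """
    if not candidates:
        return False
    best = max(score for score, _ in candidates)
    # A set keeps each distinct geometry once, so duplicates drop out.
    top_geometries = {geom for score, geom in candidates if score == best}
    return len(top_geometries) > 1

# Made-up coordinates for illustration.
alternate_names = [(100, (512340.0, 3301200.0)), (100, (512340.0, 3301200.0))]
overlapping_ranges = [(100, (512340.0, 3301200.0)), (100, (514100.0, 3299800.0))]
print(is_true_tie(alternate_names), is_true_tie(overlapping_ranges))
```

Only the second case would still need a decision (or a fix to overlapping address ranges in the reference data), since the choice between its candidates is arbitrary.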