Using a cursor to compare two side by side fields

JoeBorgione · ‎09-27-2017

I need a little bit of inspiration. I have two side by side fields in a table, let's call then Addr1 & Addr2. They happen to be the input and result of a geocoding process where Addr1 is the matched address and Addr2 is the input address. This particular table are just matches less than 100%. Some of the reasons for jump right at you, others not so much. But, it's a list of 3100 records and I'd rather have a cursor go through it than stare at the records and not see the difference.

Having read Richard Fairhursts /blogs/richard_fairhurst/2014/11/08/turbo-charging-data-manipulation-with-python-cursors-and-diction..., I'd like to try the dictionary approach, but I'm a little fuzzy as to proper way to construct the dictionary and then cursor through it. Consider the following record where the prefix direction is mismatched:

I have come up with 4 different MisMatchTypes and would like to update the MisMatch to the appropriate value. All the records are in this format: HouseNumber, PreFix, Name, Type/SufDir.

The Match Values are pretty simple:

1 = HouseNumber Mismatch (probably won't have any of these...)

2 = PreDir Mismatch (very common)

3 = Name Mismatch (misspellings on either the input or the matched data source)

4 = SufDir/Type Mismatch. (very common)

In the example above, it is a type 2 mismatch

With the two fields having a standard format, they are easy to parse out (see A better way to parse an address? ) to compare the various elements. It's the cursor/dictionary piece that I need some help with.

TIA

Richard Fairhurst‌

Dan Patterson‌

dkwiens‌

bixb0012‌

That should just about do it....

DarrenWiens2 · ‎09-27-2017

Inside your update cursor, just split the Addr1 and Addr2 fields and compare (substitute row[0], row[1] as fits your cursor):

if Addr1.split(" ")[0] <> Addr2.split(" ")[0]:
  mismatch = 1
if Addr1.split(" ")[1] <> Addr2.split(" ")[1]:
  mismatch = 2
...
if Addr1.split(" ")[3] <> Addr2.split(" ")[3]:
  mismatch = 4‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

You might want to rather use a string field to store the mismatches, as there could be multiple mismatch types (e.g. "HPN" vs. "H" vs. "PNS" etc.).

View solution in original post

DarrenWiens2 · ‎09-27-2017

Inside your update cursor, just split the Addr1 and Addr2 fields and compare (substitute row[0], row[1] as fits your cursor):

if Addr1.split(" ")[0] <> Addr2.split(" ")[0]:
  mismatch = 1
if Addr1.split(" ")[1] <> Addr2.split(" ")[1]:
  mismatch = 2
...
if Addr1.split(" ")[3] <> Addr2.split(" ")[3]:
  mismatch = 4‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

You might want to rather use a string field to store the mismatches, as there could be multiple mismatch types (e.g. "HPN" vs. "H" vs. "PNS" etc.).

JoeBorgione · ‎09-27-2017

Thanks Darren- as always you're suggestions are appreciated. Testing now.....

(eta: that was a fun one to work through....)

That should just about do it....

DanPatterson_Retired · ‎09-27-2017

Joe... You should really start looking into numpy or pandas, then bring everything back using Extend Table

Joins are permanent and the need for nested, double recycling cursors etc is easily overcome.

JoshuaBixby · ‎09-27-2017

You don't need to do anything crazy with dictionaries here since you are directly comparing values within the same row of a cursor. Although I agree with Darren's overall approach, I am not sure if just splitting the address fields based on spaces will be robust enough. You said the addresses are standardized, but we all know there are limits to standards.

I recommend using an address parsing library, like usaddress 0.5.4 — usaddress 0.5.4 documentation. Let the address parser do the hard work, and it can also handle a wider range of address discrepancies.

>>> import usaddress
>>> 
>>> for label in usaddress.LABELS:
...     print label
...     
AddressNumberPrefix
AddressNumber
AddressNumberSuffix
StreetNamePreModifier
StreetNamePreDirectional
StreetNamePreType
StreetName
StreetNamePostType
StreetNamePostDirectional
SubaddressType
SubaddressIdentifier
BuildingName
OccupancyType
OccupancyIdentifier
CornerOf
LandmarkName
PlaceName
StateName
ZipCode
USPSBoxType
USPSBoxID
USPSBoxGroupType
USPSBoxGroupID
IntersectionSeparator
Recipient
NotAddress
>>> 
>>> addr1 = usaddress.tag('5 W ALBION VILLAGE WAY')
>>> addr2 = usaddress.tag('5 E ALBION VILLAGE WAY')
>>> 
>>> for label in usaddress.LABELS:
...     if addr1[0].get(label, None) != addr2[0].get(label, None):
...         print label
...         
StreetNamePreDirectional
>>> ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

JoeBorgione · ‎09-28-2017

Joshua- hehehehe... I'm the guy who set the standard for the addresses.....

Joshua Bixby‌

That should just about do it....