Python hashlib to compare shapefiles\feature classes

PeterWilson · ‎07-21-2015

I'd like to find out if anyone within the community has used Python hashlib to compare shapefiles or feature classes. Would you need to compare the geometry and records to figure out if the shapefiles or feature classes are the same. Any advice in how I can use python hashes to compare shapefiles or feature classes will be appreciated.

Regards

BruceHarold · ‎08-12-2015

Hi

I attempted this a few years back, the code from my tool is attached.

Really I was wanting to mimic the ChangeDetector transformer in Data Interoperability/FME.

If you have real work to do buy Data Interoperability or FME, or if you are working with polyline features the new Detect Feature Changes tool (Advanced) might do the trick.

Regards

View solution in original post

NeilAyres · ‎07-21-2015

Why not just use the Feature Compare tool in Data Management/Data Comparison

DanPatterson_Retired · ‎08-11-2015

closure on this Peter?

XanderBakker · ‎08-11-2015

Unfortunately, the Feature Compare tool is not smartest tool there is... If I compare two files which are the same, then the result is as expected:

FeatureClass: Shape types are the same.
FeatureClass: Feature types are the same.
Table: Table row counts are the same.
FeatureClass: Feature class extents are the same.
GeometryDef: GeometryDefs are the same.
Field: Field properties are the same.
Table: Table row counts are the same.
SpatialReference: Spatial references are the same.
Table: Rows are the same.

However if I would delete feature 2 from the featureclass, like this:

... and run the tool sorting on the common field (myField) you will notice that it compares based on the row number, not on the value in the sort field. It just sorts and compares row 2 from the base features to row 2 from the test features and will find that everything is different starting from record 2...

So hashing might be a solution...

XanderBakker · ‎08-11-2015

now... if we would hash it like this and compare both featureclasses...

import arcpy
def main():
    fc1 = r"C:\GeoNet\FeatureCompare\test.gdb\polygons1"
    fc2 = r"C:\GeoNet\FeatureCompare\test.gdb\polygons2"

    # create dictionary for each featureclass
    flds1 = getFields(fc1)
    dct_fc1 = {getHashFeature(r, r[1], flds1):r[0] for r in arcpy.da.SearchCursor(fc1, flds1)}
    flds2 = getFields(fc2)
    dct_fc2 = {getHashFeature(r, r[1], flds2):r[0] for r in arcpy.da.SearchCursor(fc2, flds2)}

    for h2, oid2 in dct_fc2.items():
        if h2 in dct_fc1:
            print "OID {0} from fc2 corresponds to OID {1} in fc1".format(oid2, dct_fc1[h2])
        else:
            print "OID {0} from fc2 has no match with features in fc1".format(oid2)

    for h1, oid1 in dct_fc1.items():
        if h1 in dct_fc2:
            print "OID {0} from fc1 corresponds to OID {1} in fc2".format(oid1, dct_fc2[h1])
        else:
            print "OID {0} from fc1 has no match with features in fc2".format(oid1)

def getFields(fc):
    """Get all fields without fields to exclude like OID, Geometry, """
    flds_exclude = getFieldsExclude(fc)
    flds = [fld.name for fld in arcpy.ListFields(fc) if fld.name not in flds_exclude]
    flds.insert(0,"SHAPE@")
    flds.insert(0,"OID@")
    return flds

def getFieldsExclude(fc):
    """List of fields to exclude"""
    lst = []
    props = ["OIDFieldName", "shapeFieldName", "areaFieldName", "lengthFieldName"]
    desc = arcpy.Describe(fc)
    for prop in props:
        if hasattr(desc, prop):
            lst.append(eval("desc.{0}".format(prop)))
    return lst

def getHashFeature(row, geom, flds):
    """Hash feature (without OID, length and area fields)"""
    # row: OBJECTID, SHAPE, other flds
    # md5, sha1, sha224, sha256, sha384, sha512
    import hashlib
    dct = {}
    for fld in flds[2:]:
        dct[fld] = row[flds.index(fld)]
    dct[flds[1]] = geom.JSON
    h = hashlib.sha512(str(dct))
    return h.hexdigest()

if __name__ == '__main__':
    main()

It returns this:

OID 3 from fc2 corresponds to OID 3 in fc1
OID 5 from fc2 corresponds to OID 5 in fc1
OID 4 from fc2 corresponds to OID 4 in fc1
OID 1 from fc2 corresponds to OID 1 in fc1
OID 3 from fc1 corresponds to OID 3 in fc2
OID 5 from fc1 corresponds to OID 5 in fc2
OID 4 from fc1 corresponds to OID 4 in fc2
OID 2 from fc1 has no match with features in fc2
OID 1 from fc1 corresponds to OID 1 in fc2

...it matches the features based on the hash (sha512) generated.

However, don't get your hopes up to high since the code is terribly slow...

BruceHarold · ‎08-12-2015

Hi

I attempted this a few years back, the code from my tool is attached.

Really I was wanting to mimic the ChangeDetector transformer in Data Interoperability/FME.

If you have real work to do buy Data Interoperability or FME, or if you are working with polyline features the new Detect Feature Changes tool (Advanced) might do the trick.

Regards

XanderBakker · ‎08-12-2015

Hi Bruce Harold looks promising. Will definitely have a look at it. Thanks for sharing!

PeterWilson · ‎01-12-2016

I wasn't able to generate my own comparison tool, so I ended up utilizing FME Software.

XanderBakker · ‎01-12-2016

To bad to hear that you couldn't create the tool, but FME is a very good alternative. You may want to mark the answer by Harold as correct answer

WesMiller · ‎02-18-2016

I've been trying for a while to come up with something similar and even converted Bruce's code to use multiprocessing. Below is the code that shows the most promise it will process 67000+ features and compare against 67000+ features in about a minute.

#Script to compare 2 feature classes
#Author Wes Miller


import arcpy
import numpy as np
import datetime


#Print start time and date 
dt = datetime.datetime.now()
print dt


#The Feature Class to Check
originFC = arcpy.GetParameterAsText(0)
#The feature class with updates needed for feature class above
changeFC = arcpy.GetParameterAsText(1)


origDesc = arcpy.Describe(originFC)
oidName = origDesc.oidFieldName


field_names = arcpy.GetParameterAsText(2)#Example: "Shape","TMS","OWNERNAME","TOTBDGVAL","DISTRICT","ADD1","ADD2","ADD3","MAP","BLOCK","PARCEL","CALCULATED_ACREAGE","ADDR_SITE"


originFCarr = arcpy.da.FeatureClassToNumPyArray(originFC,(field_names),null_value=-9999)
originFCarrID = arcpy.da.FeatureClassToNumPyArray(originFC,(oidName))
changeFCarr = arcpy.da.FeatureClassToNumPyArray(changeFC,(field_names),null_value=-9999)
changeFCarrID = arcpy.da.FeatureClassToNumPyArray(changeFC,(oidName))


print "Arrays Complete"
count = 0
deletesarr = np.where(np.invert(np.in1d(originFCarr,changeFCarr)))
addsarr = np.where(np.invert(np.in1d(changeFCarr,originFCarr)))
adds = []
for each in  addsarr[0]:
    adds.append(changeFCarrID[each][0])
deletes = []
for each in deletesarr[0]:
    deletes.append(originFCarrID[each][0])


#Current Outputs are printed here for testing purposes
print adds
print 
print deletes


ndt = datetime.datetime.now()
print ndt