Table Join/comparison between different gdbs

EliChadwick · ‎02-03-2013

I have a process that copies data between dev and UAT. There is a need for a QA script that can confirm that the data copies across cleanly. We are primarily interested in whether the "Shape_Area" is the same between the two gdbs (including some tolerance for rounding).

Dev is a file gdb and UAT is an enterprise gdb. The main point is that they are different gdbs.

The dev gdb and the UAT gdb have the same name. The tables to compare also have the same name as do the fields that you would join on.

We initially used two SearchCursors which worked but was slow.

I explored the MakeQueryTable_management function but it requires the tables to be in the same workspace.

Here is the current approach:

## This approach involves appending the shape_area field from ## the target dataset to a copy of the source dataset and then ## doing a comparison between the areas. The results get reported ## and the copy gets dropped. This was necessary due to a bug in ## 10.1 that prevents the deletion of the right most field  print "Importing arcpy module...\n"  import arcpy  ## Compare Feature Area print "Compare Feature Area\n"  ## Variables sourceFC = "C:\\GIS\\Source.gdb\\UA1991_NZTM" destFC = "C:\\GIS\\Dest.gdb\UA1991_NZTM" TempFC = "C:\\GIS\\Source.gdb\\TEMPFC"  ## Create a temporary table to perform the join on. print "Take a copy\n" arcpy.Copy_management(sourceFC, TempFC, "")  ## Join the temporary table to the target table print "Perform Join\n" arcpy.JoinField_management (TempFC, "UA1991", destFC, "UA1991","Shape_Area")  ## Variables fields = ["UA1991","UA1991_NAME","Shape_Area","Shape_Area_1"] whereClause = "Shape_Area <> Shape_Area_1"  ## Create a cursor object containing all records ## where the shape_area differs between datasets ## and report output to screen print "Find records that differ" notEqualCur = arcpy.da.SearchCursor(TempFC, fields, whereClause) for row in notEqualCur:     print "\nUA1991_NAME = " + str(row[1])         print "UA1991 = " + str(row[0])     print "S Area = " + str(row[2])     print "D Area = " + str(row[3])     areaDiff = (row[2] - row[3])     areaDiffPerc = areaDiff/row[2]     if abs(areaDiffPerc) > 0.0001:         print "AREA DIFFERENCE: " + str(areaDiff)         print "AREA DIFFERENCE PERCENTAGE: " + str(abs(areaDiffPerc))  ## Drop the temporary table arcpy.Delete_management(TempFC)  ## delete the cursor del notEqualCur  print "\nDone."

I'm interested in a faster approach with less IO. Is it possible to do a join across gdbs without copying the fields to the source table?

MathewCoyle · ‎02-05-2013

Hi Mathew

We investigated the Feature Compare function. I was under the impression that the two datasets required the same records and record count. In our case there is a chance that one or more rows may have been dropped. I understood this to mean that it was not possible to do a comparison on datasets with different row counts. That is why the search cursor approach was used. Spin around one cursor populating another cursor using a where clause = slow. I'd love to hear if there was a way to use Feature Compare on datasets with different rowcounts.

Thanks
Eli

It will flag the different row counts as a miscomparison but that has nothing to do with the geometry which you are comparing when you select compare geometry only. Also, it sounds like you were using nested cursors which are very slow. You want to use dictionary cursor populations and calculations.

View solution in original post

MathewCoyle · ‎02-04-2013

I have a hard time believing two search cursors would take longer than copying out both datasets and joining them locally. In any case, have you tried the Feature Compare tool?

VikramS · ‎02-04-2013

I have a process that copies data between dev and UAT. There is a need for a QA script that can confirm that the data copies across cleanly. We are primarily interested in whether the "Shape_Area" is the same between the two gdbs (including some tolerance for rounding).

Dev is a file gdb and UAT is an enterprise gdb. The main point is that they are different gdbs.

The dev gdb and the UAT gdb have the same name. The tables to compare also have the same name as do the fields that you would join on.

We initially used two SearchCursors which worked but was slow.

I explored the MakeQueryTable_management function but it requires the tables to be in the same workspace.

Here is the current approach:
## This approach involves appending the shape_area field from
## the target dataset to a copy of the source dataset and then
## doing a comparison between the areas. The results get reported
## and the copy gets dropped. This was necessary due to a bug in
## 10.1 that prevents the deletion of the right most field

print "Importing arcpy module...\n"

import arcpy

## Compare Feature Area
print "Compare Feature Area\n"

## Variables
sourceFC = "C:\\GIS\\Source.gdb\\UA1991_NZTM"
destFC = "C:\\GIS\\Dest.gdb\UA1991_NZTM"
TempFC = "C:\\GIS\\Source.gdb\\TEMPFC"

## Create a temporary table to perform the join on.
print "Take a copy\n"
arcpy.Copy_management(sourceFC, TempFC, "")

## Join the temporary table to the target table
print "Perform Join\n"
arcpy.JoinField_management (TempFC, "UA1991", destFC, "UA1991","Shape_Area")

## Variables
fields = ["UA1991","UA1991_NAME","Shape_Area","Shape_Area_1"]
whereClause = "Shape_Area <> Shape_Area_1"

## Create a cursor object containing all records
## where the shape_area differs between datasets
## and report output to screen
print "Find records that differ"
notEqualCur = arcpy.da.SearchCursor(TempFC, fields, whereClause)
for row in notEqualCur:
    print "\nUA1991_NAME = " + str(row[1])    
    print "UA1991 = " + str(row[0])
    print "S Area = " + str(row[2])
    print "D Area = " + str(row[3])
    areaDiff = (row[2] - row[3])
    areaDiffPerc = areaDiff/row[2]
    if abs(areaDiffPerc) > 0.0001:
        print "AREA DIFFERENCE: " + str(areaDiff)
        print "AREA DIFFERENCE PERCENTAGE: " + str(abs(areaDiffPerc))

## Drop the temporary table
arcpy.Delete_management(TempFC)

## delete the cursor
del notEqualCur

print "\nDone."
I'm interested in a faster approach with less IO. Is it possible to do a join across gdbs without copying the fields to the source table?

Don't access your feature class straight from disc . Make both of them as Feature Layer using MakeFeatureLayer_management and try running search cursor . I believe it should be much faster .

JamesCrandall · ‎02-04-2013

Faster approach: Copy your FC's to the IN_MEMORY space then peform the join and querying there.

http://gis.stackexchange.com/questions/31699/ways-to-speed-up-python-scripts-running-as-arcgis-tools

EliChadwick · ‎02-04-2013

We investigated the Feature Compare function. I was under the impression that the two datasets required the same records and record count. In our case there is a chance that one or more rows may have been dropped. I understood this to mean that it was not possible to do a comparison on datasets with different row counts. That is why the search cursor approach was used. Spin around one cursor populating another cursor using a where clause = slow. I'd love to hear if there was a way to use Feature Compare on datasets with different rowcounts.

EliChadwick · ‎02-04-2013

[/HR]

I have a hard time believing two search cursors would take longer than copying out both datasets and joining them locally. In any case, have you tried the Feature Compare tool?

We investigated the Feature Compare function. I was under the impression that the two datasets required the same records and record count. In our case there is a chance that one or more rows may have been dropped. I understood this to mean that it was not possible to do a comparison on datasets with different row counts. That is why the search cursor approach was used. Spin around one cursor populating another cursor using a where clause = slow. I'd love to hear if there was a way to use Feature Compare on datasets with different rowcounts.

EliChadwick · ‎02-04-2013

I have a hard time believing two search cursors would take longer than copying out both datasets and joining them locally. In any case, have you tried the Feature Compare tool?

Hi Mathew

We investigated the Feature Compare function. I was under the impression that the two datasets required the same records and record count. In our case there is a chance that one or more rows may have been dropped. I understood this to mean that it was not possible to do a comparison on datasets with different row counts. That is why the search cursor approach was used. Spin around one cursor populating another cursor using a where clause = slow. I'd love to hear if there was a way to use Feature Compare on datasets with different rowcounts.

Thanks
Eli

EliChadwick · ‎02-04-2013

Faster approach: Copy your FC's to the IN_MEMORY space then peform the join and querying there..

Don't access your feature class straight from disc . Make both of them as Feature Layer using MakeFeatureLayer_management and try running search cursor . I believe it should be much faster .

Thanks for your responses. I will give it a go and let you know how I get on.

I tried replicating the join using the Data interoperability extension (FME) and it ran in seconds so it will be interesting to compare the results.

MathewCoyle · ‎02-05-2013

Hi Mathew

We investigated the Feature Compare function. I was under the impression that the two datasets required the same records and record count. In our case there is a chance that one or more rows may have been dropped. I understood this to mean that it was not possible to do a comparison on datasets with different row counts. That is why the search cursor approach was used. Spin around one cursor populating another cursor using a where clause = slow. I'd love to hear if there was a way to use Feature Compare on datasets with different rowcounts.

Thanks
Eli

It will flag the different row counts as a miscomparison but that has nothing to do with the geometry which you are comparing when you select compare geometry only. Also, it sounds like you were using nested cursors which are very slow. You want to use dictionary cursor populations and calculations.