I'm having a little trouble setting up this script. Here are the bullet points.
I've looked at using ModelBuilder with no luck. With Python dictionaries and search cursors I'm getting stuck trying to join against [SiteID] and [EasyID] at the same time. I also don't know how to return dictionaries containing just the integer EasyIDs and loop through them, updating with the next smallest integer.
Here is what I've got so far. Much of it stems from what I read in Richard Fairhurst's post Turbo Charging Data Manipulation with Python Cursors and Dictionaries.
import arcpy

# Tables
T1 = r"C:\Python\Scratch.gdb\Table1"
T2 = r"C:\Python\Scratch.gdb\Table2"
fields = ["FeatureID", "SiteID", "EasyID"]

# Get FeatureID dictionaries for each table
T1Dict = {r[0]: r[0:] for r in arcpy.da.SearchCursor(T1, fields)}
T2Dict = {r[0]: r[0:] for r in arcpy.da.SearchCursor(T2, fields)}

# Get SiteID+EasyID dictionaries for each table
T1ConcatDict = {str(r[1]) + "," + str(r[2]): r[0] for r in arcpy.da.SearchCursor(T1, fields)}
T2ConcatDict = {str(r[1]) + "," + str(r[2]): r[0] for r in arcpy.da.SearchCursor(T2, fields)}

# First: if T2.FeatureID is Null but T2.SiteID and T2.EasyID are in T1, update T2.FeatureID
with arcpy.da.UpdateCursor(T2, fields) as updateRows:
    for updateRow in updateRows:
        # Store the join value by combining the two key field values of the row being updated
        keyValue = str(updateRow[1]) + "," + str(updateRow[2])
        # Verify that the keyValue is in the dictionary
        if keyValue in T1ConcatDict and updateRow[0] is None and updateRow[1] is not None:
            # Transfer the value stored under the keyValue to the updated field: FeatureID
            updateRow[0] = T1ConcatDict[keyValue]
            updateRows.updateRow(updateRow)

# Rebuild the dictionaries if they are needed again
T2ConcatDict = {str(r[1]) + "," + str(r[2]): r[0] for r in arcpy.da.SearchCursor(T2, fields)}
T2Dict = {r[0]: r[0:] for r in arcpy.da.SearchCursor(T2, fields)}

'''
# Get Max(EasyID) within SiteID -- incomplete attempt
NumberList = []
for value in T1Dict[2]:
    try:
        NumberList.append(int(value))
    except ValueError:
        continue
T1EasyNumberDict = [s[2] for s in T1Dict[2] if s.isdigit()]
T1MaxEasyDict = max(T1EasyNumberDict)
'''

# Second: if T2.FeatureID and T2.EasyID are Null, update T2.EasyID with the next smallest
# number (as a string) in either T1 or T2 for the specific SiteID
with arcpy.da.UpdateCursor(T2, fields) as updateRows:
    for updateRow in updateRows:
        keyValue = str(updateRow[1]) + "," + str(updateRow[2])
        if keyValue in T1Dict and updateRow[0] is None and updateRow[1] is None:
            # Perhaps retrieving the max int occurs here?
            updateRow[2] = max(T1Dict[keyValue][2], T2Dict[keyValue][2])
            updateRows.updateRow(updateRow)

# Third: insert into T1 if T2.SiteID is not Null and T2.EasyID is not Null
# Fourth: update T2.FeatureID with T1.FeatureID from the previous insert
#         where T2.SiteID = T1.SiteID and T2.EasyID = T1.EasyID
# Lastly: insert any T1 features into T2
#         where T1.EasyID not in (select EasyID from T2 where T2.SiteID = T1.SiteID)
#         and T1.FeatureID not in (select FeatureID from T2)
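To make the two sticking points concrete, here is a plain-Python sketch (no arcpy; the sample rows are taken from the example table in my edit below) of the composite SiteID+EasyID join key and of finding the next unused integer EasyID for a site:

```python
# Plain-Python sketch; sample (FeatureID, SiteID, EasyID) rows for illustration only.
t1_rows = [(358589, 136238, "1"), (358590, 136238, "2"), (486028, 136238, "8")]

# Composite SiteID+EasyID key -> FeatureID, same idea as T1ConcatDict above
t1_concat = {"{},{}".format(site, easy): fid for fid, site, easy in t1_rows}

def next_unused_easy_id(site_id, rows):
    """Return the smallest positive integer not already used as an EasyID
    for this SiteID (non-integer EasyIDs like 'N30c' are ignored)."""
    used = set()
    for _, site, easy in rows:
        if site == site_id and easy is not None and str(easy).isdigit():
            used.add(int(easy))
    n = 1
    while n in used:
        n += 1
    return n

print(t1_concat["136238,8"])                 # -> 486028
print(next_unused_easy_id(136238, t1_rows))  # -> 3 (1, 2, 8 are taken)
```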
Thank you for any advice.
Edit: Here is a sample of sites with two explanatory columns (Status and Result).
T1.FeatureID | T2.FeatureID | T1.SiteID | T2.SiteID | T1.EasyID | T2.EasyID | Status | Result
358589 | 358589 | 136238 | 136238 | 1 | 1 | Existing T1 and T2 Feature | No Change
358590 | 358590 | 136238 | 136238 | 2 | 2 | Existing T1 and T2 Feature | No Change
358594 | 358594 | 136238 | 136238 | 4 | 4 | Existing T1 and T2 Feature | No Change
652538 | 652538 | 136238 | 136238 | 5 | 5 | Existing T1 and T2 Feature | No Change
486028 | 486028 | 136238 | 136238 | 8 | 8 | Existing T1 and T2 Feature | No Change
486029 | 486029 | 136238 | 136238 | 9 | 9 | Existing T1 and T2 Feature | No Change
525300 | 525300 | 136238 | 136238 | 34 | 34 | Existing T1 and T2 Feature | No Change
574802 | 574802 | 136238 | 136238 | 998 | 998 | Existing T1 and T2 Feature | No Change
670911 | | 136238 | | 300 | | New T1 Feature | (Step 6) Inserted Into T2
493840 | | 136238 | | 9996 | | New T1 Feature | (Step 6) Inserted Into T2
493839 | | 136238 | | 9997 | | New T1 Feature | (Step 6) Inserted Into T2
493831 | | 136238 | | 9999 | | New T1 Feature | (Step 6) Inserted Into T2
696019 | | 136238 | | 105-106 | | New T1 Feature | (Step 6) Inserted Into T2
696037 | | 136238 | | 9999N | | New T1 Feature | (Step 6) Inserted Into T2
696014 | | 136238 | | Area1 | | New T1 Feature | (Step 6) Inserted Into T2
670910 | | 136238 | | N | | New T1 Feature | (Step 6) Inserted Into T2
580636 | | 136238 | | N30c | | New T1 Feature | (Step 6) Inserted Into T2
360401 | | 136401 | | AC | | Existing T1 Feature | (Skip) Not Inserted since no T2.SiteID match
360402 | | 136401 | | SP | | Existing T1 Feature | (Skip) Not Inserted since no T2.SiteID match
360510 | | 136427 | | Area 1 | | Existing T1 Feature | (Skip) Not Inserted since no T2.SiteID match
362653 | | 136635 | | 15 | | Existing T1 Feature | (Skip) Not Inserted since no T2.SiteID match
362943 | 362943 | 136698 | 136698 | 1 | 1 | Existing T1 and T2 Feature | No Change
362944 | 362944 | 136698 | 136698 | 2 | 2 | Existing T1 and T2 Feature | No Change
362945 | 362945 | 136698 | 136698 | 3 | 3 | Existing T1 and T2 Feature | No Change
362946 | 362946 | 136698 | 136698 | 4 | 4 | Existing T1 and T2 Feature | No Change
362947 | 362947 | 136698 | 136698 | 5 | 5 | Existing T1 and T2 Feature | No Change
362950 | 362950 | 136698 | 136698 | 11C | 11C | Existing T1 and T2 Feature | No Change
362948 | 362948 | 136698 | 136698 | 8 | | New T2 Feature, Exists in T1 | (Step 5) Update T2.EasyID
362949 | 362949 | 136698 | 136698 | 9 | | New T2 Feature, Exists in T1 | (Step 5) Update T2.EasyID
362951 | | 136698 | 136698 | 15 | 15 | New T2 Feature, Exists in T1 | (Step 1) Update T2.FeatureID
362954 | | 136698 | 136698 | 16 | 16 | New T2 Feature, Exists in T1 | (Step 1) Update T2.FeatureID
362955 | | 136698 | 136698 | 17 | 17 | New T2 Feature, Exists in T1 | (Step 1) Update T2.FeatureID
362956 | | 136698 | 136698 | 18 | 18 | New T2 Feature, Exists in T1 | (Step 1) Update T2.FeatureID
362957 | | 136698 | 136698 | 19 | 19 | New T2 Feature, Exists in T1 | (Step 1) Update T2.FeatureID
| | | 136698 | | 20 | New T2 Feature | (Step 3,4) Inserted into T1, Update T2.FeatureID
| | | 136698 | | 21 | New T2 Feature | (Step 3,4) Inserted into T1, Update T2.FeatureID
| | | 136698 | | 22 | New T2 Feature | (Step 3,4) Inserted into T1, Update T2.FeatureID
| | | 136698 | | 25 | New T2 Feature | (Step 3,4) Inserted into T1, Update T2.FeatureID
| | | 136698 | | | New T2 Feature | (Step 2,3,4) Get next lowest integer for T2.EasyID -> 6
| | | 136698 | | | New T2 Feature | (Step 2,3,4) Get next lowest integer for T2.EasyID -> 7
| | | 136698 | | | New T2 Feature | (Step 2,3,4) Get next lowest integer for T2.EasyID -> 10
| | | 136698 | | | New T2 Feature | (Step 2,3,4) Get next lowest integer for T2.EasyID -> 11
I think this is one of those cases where it would be so valuable to have sample data attached to the thread. Also, you could mention Richard Fairhurst (see Tagging people, places, and content within your post) since you based the script on his post. This will send him a notification and he might be able to look at your code.
Anyway, please attach a sample of the data, to be able to debug the code.
Good points, Xander Bakker. I have included example info and added the mention. Thanks.
What distinguishes the skipped records from the records that were inserted into T2 in step 6? I see no way to keep track of what is new in T1 since the last time the script was run to make that choice. All of the T2 insertions and the skipped records have no T2.SiteID, so that is not a difference that can drive the choice. I also don't agree that you have the steps in the correct order from what I can see. Your logic appears backwards, since normally I would deal with Nulls and new versus old records as my first steps in any comparison script, not towards the end. My scripts deal with new versus old by renaming existing data and deriving current data from another source, so that the comparison is easy to make. Possibly a variation of that approach would apply here, so that a last-run version of the data is created and you can be sure you know what is really new and what you have processed before.
I would also add the ObjectID field for both tables to your field list and reorder the fields as:
["SiteID", "EasyID", "FeatureID", "OID@"]
The ObjectID would be used in subroutines to impose order on Null values and to validate your assumptions of unique keys.
Anyway, you never said whether errors are occurring or whether just unexpected values are being assigned. Unexpected values indicate a logic failure, while errors indicate a syntax or data validation failure. I am almost certain you will experience many logic errors while developing and testing the script, since there are so many dependencies at each stage that have to be considered, so develop only on test data and back up your data before trying it out on your live data.
My blog avoided going into anything this complex, because it aims to make the core principles and the approach I was demonstrating easy to follow. You may want to look at this post to see an example where I adapted the approach to deal with a much more complex many-to-many relationship between tables, for further ideas about ways to vary the basic approach outlined in the blog.
This script will run at night. The T1 features that don't have a SiteID in T2 are basically ignored until the T2 user adds the first feature with that SiteID. The user may have manually matched the SiteID and EasyID before the script runs. Additionally, T2 users adding the first occurrences of features in either table will skip adding EasyID and rely on unique ones being generated.
One other note: once a feature has all 3 IDs it will never be changed by the users.
Perhaps if I pre-filter T1 to T2 SiteIDs, the script would be simpler. I believe you have identified step 1 as redundant, since it occurs later and max(EasyID) is taken across both tables.
In what way are you getting stuck trying to join against [SiteID] and [EasyID] at the same time? Are you getting errors? The basic approach to a combined key will work as shown for the first 25 lines of code (I can't follow your overall logic beyond that).
Null values in the key most likely cause most of the problems, so I would restructure the code order to deal with the second part of your script first: fill in Null values in the EasyID field before worrying about the FeatureID field at all. That involves processing a list of EasyIDs in a single-key dictionary of just SiteID key values first, to verify the unique-value assumption for the non-Null EasyID values as well as to fill in the Null EasyID values. Don't build dictionaries for T2 at all until they can be used. In any case, the cursor dictionary approach is your best option and can handle this whole set of processes, but each step must happen in the correct order to avoid faulty assumptions about what set of fields contains unique values at each stage of the script.
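As a minimal illustration of that uniqueness check (plain Python, with hypothetical key pairs standing in for cursor rows), counting each SiteID+EasyID combination exposes any duplicates before you rely on the key:

```python
from collections import defaultdict

# Hypothetical (SiteID, EasyID) pairs standing in for cursor rows;
# the last pair deliberately duplicates the first.
pairs = [(136238, "1"), (136238, "2"), (136698, "1"), (136238, "1")]

counts = defaultdict(int)
for key in pairs:
    counts[key] += 1

# Any key seen more than once violates the unique-value assumption
dups = [key for key, n in counts.items() if n > 1]
print(dups)  # -> [(136238, '1')]
```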
Clearly you are dealing with a highly complex interrelationship between these two tables and a large set of rules that I have yet to understand. I have no context for how these records and values came into existence or what uses they will serve in the future. More crucially, you have given me no information about the interrelationship this script has with user actions. Every step you expect a user to do or not do creates a point of failure for your script and any of your rules and assumptions. If the user has to manually set off the script, you must always start by verifying they did everything you expected them to do and didn't do anything you didn't expect them to do, as far as your script assumptions are concerned.
Also, Xander is correct that mentioning my full name in a post puts a message in my inbox, which is the reason I saw this post when he did that.
Richard,
I haven't gotten far enough to get any errors. I don't know how to build/filter the dictionary of T2.OID, T2.SiteID, and the string EasyID, then gather only the integer EasyIDs and return the first missing integer starting from 1. Without that, I can't test updating Null EasyIDs in new T2 features.
Thank you for posting that link to stackexchange! It looks very similar to the components of my scenario. I will update after testing the code.
Your approach still mystifies me and I still don't understand your business rules. Your rules may make sense to you and may be correct for your business needs, but on the surface they at least partially conflict with my experience in synchronizing data and matching tables. The picture in your head of how everything should work is not transferring into mine yet.
I highly recommend that you reconsider the rule that fills in Null EasyIDs with the next smallest unused integer for a site.
The EasyID is anything but easy to understand or program as you have described it. Filling in these blanks this way seems arbitrary to me, especially given that the natural sort of the strings is actually '1', '3', '4A', '55', '6-7', 'S'. Why fill in blanks at all? Over time that means any deleted records will have their SiteID + EasyID combination reused for an entirely unrelated record, and therefore that key is only unique within the snapshot in time before the script reuses it. In other words, you will never be able to use the SiteID + EasyID key if you ever have to compare two different data snapshots that were taken before and after the script ran.

This rule may make sense to you, but in my experience this is bad database practice. Unique keys (single or multi-field) are only valuable in my experience if they are unique to one record over all time or support actual data relationships, and they become a problem if they are ever reused for completely unrelated records. I personally don't want to help implement this rule, since it seems excessively complicated to me, and I believe from experience that a day will come when you will want to use that key to recover from a data corruption event, and the code that implements this rule will make that recovery nearly impossible. You will also greatly increase the likelihood of creating data corruption if you accidentally link together two snapshots that reassigned the same keys to different records.
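To see why the natural sort of a string field ignores numeric value, take the example values above, shuffled:

```python
# A string field sorts character by character, not numerically
easy_ids = ["55", "S", "1", "6-7", "3", "4A"]
print(sorted(easy_ids))  # -> ['1', '3', '4A', '55', '6-7', 'S']
# Numerically 6 comes before 55, but '55' sorts first because '5' < '6'
```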
I have several other questions about this EasyID field. How many characters are allowed in this field? Why does it contain letters, and what is the significance of those letters? Why are there dashes combining two numbers? Since this field is a string field, how do your users handle the fact that it will never sort numerically in any table, since you don't include leading spaces or zeros to right-justify the values? What type of business are you working for where this business process was developed to track any of this data in either table?
Based on what little I do understand (or think I understand), I will try to present some code that should fit your needs. This code is more or less what I would start with. Key fields should always come first in the field list and value fields should always follow. I would incorporate the OID field into the code processes and dictionaries as a fail-safe unique key for linking back to the original table wherever the user-defined keys turn out to be duplicated and not unique.
The code below handles both a 1:1 and 1:M relationship possibility, so even if the key value is not unique you will be able to trap that and fix it.
import arcpy
import sys

# Tables
T1 = r"C:\Python\Scratch.gdb\Table1"
T2 = r"C:\Python\Scratch.gdb\Table2"
fields = ["SiteID", "EasyID", "FeatureID", "OID@"]

# Initialize T1 as a dictionary
T1Dict = {}
# Initialize a list to hold any concatenated key duplicates found
T1KeyDups = []
# Open a search cursor and iterate rows
with arcpy.da.SearchCursor(T1, fields) as searchRows:
    for searchRow in searchRows:
        # Build a composite key value from the 2 key fields
        keyValue = '{};{}'.format(searchRow[0], searchRow[1])
        if not keyValue in T1Dict:
            # Key not in dictionary. Add the key pointing to a list of a list of field values
            T1Dict[keyValue] = [list(searchRow[2:])]
        else:
            # Key in dictionary is not unique
            T1KeyDups.append(keyValue)
            # Append a list of field values to the list the key points to
            T1Dict[keyValue].append(list(searchRow[2:]))
del searchRows, searchRow

# Sample of how to access the keys, record count, and record values of the dictionary
for keyValue in T1Dict.keys():
    for i in range(0, len(T1Dict[keyValue])):
        print("The SiteID;EasyID key is {} with {} record(s). Record {} has FeatureID {} and ObjectID {}.".format(
            keyValue, len(T1Dict[keyValue]), i + 1, T1Dict[keyValue][i][0], T1Dict[keyValue][i][1]))

if len(T1KeyDups) > 0:
    # Duplicate keys exist in T1.
    # Give a warning and either exit the script or else fix T1 before proceeding
    print("Duplicate keys found! They are:")
    for keyValue in T1KeyDups:
        for i in range(0, len(T1Dict[keyValue])):
            print("The SiteID;EasyID key is {} with {} record(s). Record {} has FeatureID {} and ObjectID {}.".format(
                keyValue, len(T1Dict[keyValue]), i + 1, T1Dict[keyValue][i][0], T1Dict[keyValue][i][1]))
    # Either exit or fix T1 here
    sys.exit(-1)
If you are determined to implement the rule as you have described it, here is some code that should work: it builds a SiteID dictionary for T2, finds the sorted list of integers not yet used for each SiteID, and assigns the next unused number to the Null EasyID records associated with each SiteID. Also, do not bother filtering cursors if you intend to put everything in a dictionary. It is faster to put everything into the dictionary and then do all of the logic tests, type validations, and list tracking in code. You might filter for the Null EasyID values prior to running the update cursor, but even if you have 1 million records to process, the update cursor will take only about 5 minutes to run through all of them (and the SQL that filters for Null values might take longer, since Null-value queries run pretty slowly, especially if EasyID is not indexed).
# Get the list of EasyIDs associated with each SiteID in a dictionary for T1
T1SiteIDDict = {}
with arcpy.da.SearchCursor(T1, fields) as searchRows:
    for searchRow in searchRows:
        keyValue = searchRow[0]
        if not keyValue in T1SiteIDDict:
            # Key not in dictionary. Add the key pointing to a list of EasyID values
            T1SiteIDDict[keyValue] = [searchRow[1]]
        else:
            # Append the EasyID value to the list the key points to
            T1SiteIDDict[keyValue].append(searchRow[1])
del searchRows, searchRow

# Get the list of EasyIDs associated with each SiteID in a dictionary for T2
T2SiteIDDict = {}
with arcpy.da.SearchCursor(T2, fields) as searchRows:
    for searchRow in searchRows:
        keyValue = searchRow[0]
        if not keyValue in T2SiteIDDict:
            T2SiteIDDict[keyValue] = [searchRow[1]]
        else:
            T2SiteIDDict[keyValue].append(searchRow[1])
del searchRows, searchRow

SiteIDDict = {}
for keyValue in T2SiteIDDict.keys():
    intList = []
    for easyID in T2SiteIDDict[keyValue]:
        # Skip Nulls; keep only pure-integer EasyIDs
        if easyID is not None and str(easyID).isdigit():
            intList.append(int(easyID))
    if keyValue in T1SiteIDDict:
        for easyID in T1SiteIDDict[keyValue]:
            if easyID is not None and str(easyID).isdigit():
                intList.append(int(easyID))
    # Remove the already-used numbers from the numbers 1 to 9999
    # and store a sorted list of unused numbers for each SiteID
    SiteIDDict[keyValue] = sorted(set(range(1, 10000)) - set(intList))
import arcpy
import sys

# Tables
T1 = r"C:\Python\Scratch.gdb\Table1"
T2 = r"C:\Python\Scratch.gdb\Table2"
fields = ["SiteID", "EasyID", "FeatureID", "OID@"]

with arcpy.da.UpdateCursor(T2, fields) as updateRows:
    for updateRow in updateRows:
        if updateRow[1] is None:
            templist = SiteIDDict[updateRow[0]]
            # EasyID is a string field, so store the number as a string
            updateRow[1] = str(templist[0])
            # Updates of the list also affect the list in the dictionary,
            # so the assigned number will not be reused
            templist.remove(templist[0])
            updateRows.updateRow(updateRow)
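For example, running the set-difference idea from the snippet above against the sample SiteID 136698 (integer EasyIDs in use across T1 and T2, taken from the example table) yields exactly the numbers the Result column expects:

```python
# Integer EasyIDs already in use for SiteID 136698 in the sample data
# (non-integer values like '11C' are excluded)
used = {1, 2, 3, 4, 5, 8, 9, 15, 16, 17, 18, 19, 20, 21, 22, 25}

# Sorted pool of unused numbers from 1 to 9999 for this SiteID
pool = sorted(set(range(1, 10000)) - used)

# The four Null-EasyID rows take the front of the pool
print(pool[:4])  # -> [6, 7, 10, 11]
```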
Wow, thank you for these excellent resources! I will try to digest this over the weekend, but it looks very promising. Sorry that I didn't provide all of the aspects of this dilemma; I tried to simplify it enough that Python masters like yourself wouldn't have to read a novel. As you ascertained, this is complex.
The business table (T1) sits on a 2005 SQL Server, unsupported since 10.2. EasyID is a 7-character string used for labeling at a site, and it is the syncing bane of my existence. Users are allowed to assign "C5-C8" so that one FeatureID represents numerous real-world things; C5, C6, C7, and C8 could also exist in T1 individually. It is a data-quality mess. To combat this, the groupings will not make it to T2. Instead, C5, C6, C7, and/or C8 will be digitized by the T2 user, if so desired.
Luckily, users are unable to delete rows and can't view or change the FeatureID or SiteID. The other table (T2) and a copy of the Sites table are layers in a hosted service for collecting new site features. I update the Sites with a scheduled script and thanks to Collector for ArcGIS 10.3 honoring relationships, all features maintain their SiteID.
In short, users are limited to picking an EasyID during feature creation in either table, and they could possibly update EasyID on T2. On the off chance they do change EasyID on a feature existing in both T1 and T2, there will be a script that joins by FeatureID and updates the one with the earlier edit date.
I could go on and on, but don't want to spoil your weekend. Thanks again Richard Fairhurst!