Find specific key-value tuples within a dictionary?

JohnLay · 01-08-2015

I am trying to build a process that checks a dictionary for an existing key-value combination, where the key is a variable and the value is always one of three values (16, 17, or 18). I quickly ran into the issue of searching a dictionary with a dictionary and found someone's solution of turning the search into a frozenset. That removed the TypeError: unhashable type: 'dict' error message, but the code is still not doing what I want it to do.

I am trying to verify that a table with tens of thousands of Building IDs contains exactly 3 occurrences of each Building ID, and that each occurrence is accompanied by only a 16, 17, or 18 value in the next column. So far I've only been playing around with snippets of code to see if I could make it work.

WIND = 'L_DAMAGE_RESULTS_WIND'
readList = ["BLDG_ID", "HAZARD_ID"]
PassFail = "PASS"
WINDDict = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(WIND, readList)}
with arcpy.da.SearchCursor(WIND, "BLDG_ID") as cursor:
    for row in cursor:
        lookup = {row[0]:(18)}
        key = frozenset(lookup.items())
        if key not in WINDDict:
           PassFail = "fail"
           
print PassFail
fail
print lookup
{u'370139999': (18,)}

I thought the above would result in a "PASS" because u'370139999': (18,) does exist in WINDDict, but it didn't.
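One likely reason for the "fail": WINDDict is keyed by BLDG_ID alone, so a frozenset built from lookup.items() can never match one of its keys. A dict also keeps only one value per key, so three rows that share a BLDG_ID collapse into a single entry. Here is a minimal sketch of testing for (BLDG_ID, HAZARD_ID) pairs with a set of tuples instead. The rows list is made up and stands in for the SearchCursor; the variable names are hypothetical.

```python
# Hypothetical rows standing in for arcpy.da.SearchCursor(WIND, readList).
rows = [
    (u'370139999', 16),
    (u'370139999', 17),
    (u'370139999', 18),
]

# A dict keyed by BLDG_ID keeps only the last row seen for each ID.
wind_dict = {r[0]: r[1:] for r in rows}

# The frozenset of items is not a BLDG_ID, so it is never found as a key.
lookup = {u'370139999': (18,)}
key = frozenset(lookup.items())
found_as_key = key in wind_dict

# A set of (BLDG_ID, HAZARD_ID) tuples keeps every pair and supports `in`.
pairs = set(rows)
found_as_pair = (u'370139999', 18) in pairs
missing_pair = (u'370139999', 19) in pairs

print(found_as_key)
print(found_as_pair)
print(missing_pair)
```

With the pairs in a set, each membership test is a single hash lookup, which should hold up fine for tens of thousands of rows.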

I'm still wrapping my head around dictionaries, so I would appreciate any help that is offered.

UPDATE:

So the problem appears to be with the unicode within the dictionary.

WINDDict = {29184: (u'3701310027', 17), 1: (u'370131', 16), 2: (u'3701310', 16), 3: (u'37013100', 16), 4: (u'370131000', 16), 5: (u'3701310000', 16), 6: (u'3701310001', 16), ...}

if (u'3701310027', 17)in WINDDict:
    print "yes"
else:    
    print "no"
...     
no

if 17 in WINDDict:
    print "yes"
else:    
    print "no"
...     
yes

I have no idea how to handle this.
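A note on the "no" and "yes" results above: this is probably not a unicode problem. The `in` operator on a dictionary checks only the keys, and the keys in this WINDDict look like row IDs, so 17 is found as a key while the tuple (a value) is not. A minimal sketch using three entries from the post plus one made-up entry with key 17 (an assumption, since the printed dictionary is truncated):

```python
# Three entries taken from the WINDDict shown above, plus one hypothetical
# entry whose key is 17 (assumed to exist in the truncated part).
wind_dict = {
    29184: (u'3701310027', 17),
    1: (u'370131', 16),
    2: (u'3701310', 16),
    17: (u'hypothetical-id', 16),
}

# `in` on a dict looks only at the keys.
tuple_is_key = (u'3701310027', 17) in wind_dict   # the tuple is a value, not a key
number_is_key = 17 in wind_dict                   # 17 happens to be a key

# To test for a value, search the values explicitly.
tuple_is_value = (u'3701310027', 17) in wind_dict.values()

print(tuple_is_key)
print(number_is_key)
print(tuple_is_value)
```

Searching .values() scans every entry, so for repeated lookups it may be better to build a set of the value tuples once and test against that.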

Message was edited by: John Lay to add more explanation.

ChrisSnyder · 01-08-2015

I agree with Mr. Bixby, no need to store row ids. This code using set() objects also does the job:

magicNumberSet = set([16,17,18]) #the building must have all of these codes to be in the yesList
buildingDict = {}
searchRows = arcpy.da.SearchCursor('L_DAMAGE_RESULTS_WIND', ["BLDG_ID", "HAZARD_ID"])
for searchRow in searchRows:
    buildingId, hazardId = searchRow
    if buildingId in buildingDict:
        buildingDict[buildingId].add(hazardId)
    else:
        buildingDict[buildingId] = set([hazardId])
yesList = [buildingId for buildingId in buildingDict if magicNumberSet.issubset(buildingDict[buildingId])]
noList = list(set(buildingDict.keys()).difference(yesList))

JohnLay · 01-09-2015

It's not quite that the building must have all three; it is more like there must be 3 buildings with the same ID that each have one of the three hazards. The table would look like this:

BLDG_ID  HAZARD_ID  OTHER FIELDS
37013    16         other info unique to HAZ_ID 16
37013    17         other info unique to HAZ_ID 17
37013    18         other info unique to HAZ_ID 18
37014    16         other info unique to HAZ_ID 16
...
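The requirement described above (exactly three rows per BLDG_ID, one each for hazards 16, 17, and 18) could be checked along these lines. This is only a sketch: check_buildings and the sample rows are hypothetical, and in practice the rows would come from the SearchCursor.

```python
from collections import defaultdict

REQUIRED = frozenset([16, 17, 18])


def check_buildings(rows):
    """Return the BLDG_IDs that do not have exactly one row per required hazard."""
    hazards_by_building = defaultdict(list)
    for bldg_id, hazard_id in rows:
        hazards_by_building[bldg_id].append(hazard_id)
    # A building passes only with three rows whose hazards are exactly 16, 17, 18.
    return sorted(
        bldg_id
        for bldg_id, hazards in hazards_by_building.items()
        if len(hazards) != 3 or set(hazards) != REQUIRED
    )


# Hypothetical rows standing in for the (BLDG_ID, HAZARD_ID) cursor output.
sample_rows = [
    ('37013', 16), ('37013', 17), ('37013', 18),
    ('37014', 16), ('37014', 17),                 # only two rows
    ('37015', 16), ('37015', 16), ('37015', 18),  # duplicate 16, no 17
]

failures = check_buildings(sample_rows)
pass_fail = "PASS" if not failures else "fail"
print(failures)
print(pass_fail)
```

Keeping the hazards in a list rather than a set preserves duplicates, so a building with four rows or a repeated hazard is still caught by the length check.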

Like I mentioned to Joshua Bixby above, I will play with this some later this afternoon.

Thank you both for your suggestions.

JohnLay · 01-09-2015

OK, this almost does what I was looking for. (Joshua Bixby and James Crandall, I just haven't gotten to your examples yet. James, I had some trouble installing pandas, but am square now.)

I'm a little lost with it though. Please walk me through the bits I'm missing so that I can apply the info instead of just copying it.

magicNumberSet = set([16,17,18]) 
buildingDict = {}  
searchRows = arcpy.da.SearchCursor('L_DAMAGE_RESULTS_WIND', ["BLDG_ID", "HAZARD_ID"])  
for searchRow in searchRows:
    buildingId, hazardId = searchRow  
    if buildingId in buildingDict:  
        buildingDict[buildingId].add(hazardId)  
    else:  
        buildingDict[buildingId] = set([hazardId])  
yesList = [buildingId for buildingId in buildingDict if magicNumberSet.issubset(buildingDict[buildingId])]  
noList = list(set(buildingDict.keys()).difference(yesList))

Chris, I get a little lost around line 9. If buildingDict[buildingId].add(hazardId) is appending to the value set, I assume buildingDict[buildingId] = set([hazardId]) is creating the first instance. I'm not really sure what "=" is supposed to mean here. My brain automatically goes to one being equal to the other, which can't be the case.
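On the "=" question: in Python a single "=" is assignment (it stores a value under a name or key), while "==" is the equality test. A small sketch of the two branches from the loop, using a made-up building ID and hypothetical variable names:

```python
building_dict = {}

# First time a building is seen: `=` binds the key to a brand-new set.
building_dict['37013'] = set([16])

# Later rows: the key already exists, so .add() grows that same set.
building_dict['37013'].add(17)
building_dict['37013'].add(17)   # adding a duplicate changes nothing

# `==` is the comparison; it returns True or False and stores nothing.
is_equal = building_dict['37013'] == set([16, 17])

# setdefault() does the "create if missing, then add" step in one line.
building_dict.setdefault('37014', set()).add(18)

print(is_equal)
print(sorted(building_dict['37013']))
```

The if/else in the original loop and the setdefault() line do the same job; the if/else is just more explicit about the first-time case.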

In line 10, for each key in the dictionary, if ([16,17,18]) is a subset of the value set, the key is added to the list. This would mean that sets ([17,18]) and ([17,18,19]) would be excluded from the list, but ([16,17,18,19]) would be added. Using magicNumberSet.issuperset(buildingDict[buildingId]) would mean the reverse is true. By replacing it with magicNumberSet==set(buildingDict[buildingId]), I would essentially be saying that the sets must be equal before being added to the list. Correct? I need to check that there are always 3 buildings, and only 3 buildings, with a hazard ID of 16, 17, and 18. If there are only 2 buildings or 4 buildings, whatever the hazard value, the table fails. Does the order in which the values appear in the set matter?
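The three comparisons discussed above can be tried side by side on the example sets. This is a sketch with hypothetical variable names. It also shows that order does not matter for sets, and that a set silently drops duplicates, so set equality alone cannot tell three rows from four.

```python
magic = set([16, 17, 18])

candidates = [
    set([17, 18]),
    set([17, 18, 19]),
    set([16, 17, 18, 19]),
    set([18, 16, 17]),      # same members as magic, written in another order
]

# magic.issubset(s): every magic number appears in s (extras allowed).
subset_results = [magic.issubset(s) for s in candidates]

# magic.issuperset(s): s contains nothing outside the magic numbers.
superset_results = [magic.issuperset(s) for s in candidates]

# magic == s: exactly the same members; order is irrelevant for sets.
equal_results = [magic == s for s in candidates]

# Duplicates vanish when rows are collected into a set.
duplicate_hidden = set([16, 16, 17, 18]) == magic

print(subset_results)
print(superset_results)
print(equal_results)
print(duplicate_hidden)
```

Because of that last point, a "3 and only 3" rule needs a row count per building in addition to the equality test.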

JamesCrandall · 01-09-2015

I know there's no interest in pandas, but it really simplifies things and supercharges performance. Combined with the arcpy.da.TableToNumPyArray method, it is super easy to integrate.

Test data (I just created a .csv file but it could be a gdb table or just about anything else):

BLDG_ID,HAZARD_ID,OTHER FIELDS
37013,11,other info unique to HAZ_ID 11
37013,46,other info unique to HAZ_ID 46
37013,9,other info unique to HAZ_ID 9
37013,16,other info unique to HAZ_ID 16
37013,17,other info unique to HAZ_ID 17
37013,18,other info unique to HAZ_ID 18
37014,8,other info unique to HAZ_ID 8
37014,6,other info unique to HAZ_ID 6
37014,33,other info unique to HAZ_ID 33
37014,16,other info unique to HAZ_ID 16
37014,17,other info unique to HAZ_ID 17
37014,18,other info unique to HAZ_ID 18

This does exactly what you want OP:

import pandas as pd

dat = r"H:\pandas_testdat.csv"
df = pd.read_csv(dat)
# keep only the rows whose HAZARD_ID is 16, 17, or 18
df2 = df[df['HAZARD_ID'].isin([16,17,18])]
print df2.values

Yeah. That's all that is necessary

Result:

BLDG_ID,HAZARD_ID,OTHER FIELDS
37013L 16L 'other info unique to HAZ_ID 16'
37013L 17L 'other info unique to HAZ_ID 17'
37013L 18L 'other info unique to HAZ_ID 18'
37014L 16L 'other info unique to HAZ_ID 16'
37014L 17L 'other info unique to HAZ_ID 17'
37014L 18L 'other info unique to HAZ_ID 18'