Compare two tables field values using wildcard

Per_Olav_Kaasa · ‎01-12-2018

Hello, I'm currently working with one feature class table containing 400 000 objects and one database table containing 700 objects. I'm currently working in ArcMap 10.4 and using a Geodatabase with Python 2.7. What I'm trying to do is match the database table containing 700 objects against the FC. What I want is to match them, treating all 700 objects as wildcars %700 ojects%, and match those towards the FC that contains text from the table. I have 2 tables with 1 column in each that im trying to match. I have not yet found a good way to treat all rows in a column as wildcard to search against another column.

This is my code so far. This only will print out what the database table values. Next i wanted to get the FC values and compare them. I've read about maybe using a dictionary to make it easier to extract values but I've yet to figure out how.

import arcpy, unicodedata, datetime, openpyxl 
arcpy.env.workspace =r"C:\Users\kaaper\Documents\DataDrivenPages\DataDrivenPages.gdb"

def rows_as_dicts(cursor):
    colnames = cursor.fields
    for row in cursor:
        yield dict(zip(colnames, row)

with arcpy.da.SearchCursor(r"C:\Users\kaaper\Documents\DataDrivenPages\DataDrivenPages.gdb\Ordbok", '*') as sc:
    for row in rows_as_dicts(sc):
        print row['field']‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Any input will be appreciated.

XanderBakker · ‎01-15-2018

To create a list from all the values in a field you can do something like this:

import arcpy

# settings
tbl1 = r"C:\Users\kaaper\Documents\DataDrivenPages\DataDrivenPages.gdb\YourTableName1"
fld1= "your field name from table 1"
tbl2 = r"C:\Users\kaaper\Documents\DataDrivenPages\DataDrivenPages.gdb\YourTableName2"
fld1= "your field name from table 2"

# create lists from tables
lst1 = [r[0] for r in arcpy.da.SearchCursor(tbl1, (fld1))]
lst2 = [r[0] for r in arcpy.da.SearchCursor(tbl2, (fld2))]
‍‍‍‍‍‍‍‍‍‍‍
# optionally you can use list(set(lst1)) and list(set(lst2)) to get lists with unique values‍‍‍‍‍‍‍‍‍‍‍‍‍

To compare the lists you can use something like this (it would be easier to have access to the data, this example creates random text values in both lists):

def main():
    import random

    # create list of table 1
    lst1 = [chr(i+65) for i in range(10)]
    print("\nContent Table 1:\n{}".format(lst1))

    lst2 = []
    for a in range(100):
        i = random.randrange(0, 10)
        chr1 = lst1[i]
        j = random.randrange(0, 26)
        chr2 = chr(j+65)
        txt = chr1 + chr2
        lst2.append(txt)
    print("\nContent Table 2:\n{}".format(lst2))

    # dict 1
    dct1 = {}
    for value1 in lst1:
        for value2 in lst2:
            if value1.upper() in value2.upper():
                if value1 in dct1:
                    data = dct1[value1]
                    data.append(value2)
                    data = list(set(data))
                    dct1[value1] = data
                else:
                    data = [value2]
                    dct1[value1] = data
    print("\nContent Dict 1:\n{}".format(dct1))

    # dict 2
    dct2 = {}
    for value2 in lst2:
        for value1 in lst1:
            if value1.upper() in value2.upper():
                if value2 in dct2:
                    data = dct2[value2]
                    data.append(value1)
                    data = list(set(data))
                    dct2[value2] = data
                else:
                    data = [value1]
                    dct2[value2] = data
    print("\nContent Dict 2:\n{}".format(dct2))

if __name__ == '__main__':
    main()‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

This will yield:

Content Table 1:
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

Content Table 2:
['CO', 'CA', 'GK', 'HT', 'HZ', 'IF', 'BJ', 'CD', 'GF', 'FS', 'FP', 'DO', 'BH', 'CW', 'CA', 'BV', 'IQ', 'DN', 'AC', 'IP', 'GY', 'EK', 'JH', 'EC', 'DN', 'CO', 'GM', 'FT', 'IV', 'AX', 'JO', 'HT', 'GK', 'JL', 'CX', 'HQ', 'EF', 'GM', 'CB', 'JB', 'DC', 'CE', 'AM', 'EA', 'CA', 'CA', 'EB', 'FI', 'HM', 'DO', 'FN', 'HI', 'AA', 'CL', 'HR', 'AS', 'HA', 'DI', 'BI', 'IJ', 'FN', 'IJ', 'GC', 'AQ', 'DF', 'HF', 'II', 'EP', 'AW', 'AP', 'FO', 'CY', 'FL', 'AS', 'BT', 'DD', 'BX', 'JH', 'JV', 'JN', 'JL', 'HR', 'IK', 'DU', 'BY', 'AP', 'FJ', 'EN', 'AT', 'IU', 'AL', 'EQ', 'DS', 'JP', 'HU', 'IN', 'IM', 'HV', 'CE', 'II']

Content Dict 1:
{'A': ['AA', 'AC', 'EA', 'CA', 'AM', 'AL', 'AQ', 'AP', 'AS', 'AT', 'AW', 'AX', 'HA'], 'C': ['AC', 'CO', 'CL', 'CB', 'CA', 'DC', 'EC', 'CD', 'CY', 'CX', 'GC', 'CE', 'CW'], 'B': ['CB', 'BH', 'BI', 'BJ', 'EB', 'BT', 'BV', 'BX', 'BY', 'JB'], 'E': ['EN', 'EK', 'EF', 'EA', 'CE', 'EB', 'EC', 'EQ', 'EP'], 'D': ['DN', 'DO', 'DI', 'DF', 'DD', 'DC', 'CD', 'DU', 'DS'], 'G': ['GC', 'GF', 'GM', 'GK', 'GY'], 'F': ['FP', 'FS', 'FT', 'DF', 'HF', 'EF', 'GF', 'FI', 'FJ', 'FO', 'FL', 'FN', 'IF'], 'I': ['DI', 'IQ', 'IP', 'IU', 'BI', 'IV', 'II', 'IK', 'IJ', 'HI', 'IN', 'IM', 'FI', 'IF'], 'H': ['HZ', 'HR', 'HV', 'HQ', 'BH', 'HT', 'HU', 'HI', 'JH', 'HM', 'HA', 'HF'], 'J': ['JO', 'JV', 'JP', 'BJ', 'JL', 'JN', 'IJ', 'JH', 'FJ', 'JB']}

Content Dict 2:
{'HZ': ['H'], 'FP': ['F'], 'FS': ['F'], 'FT': ['F'], 'DO': ['D'], 'HR': ['H'], 'EA': ['A', 'E'], 'AA': ['A'], 'HQ': ['H'], 'BH': ['H', 'B'], 'DC': ['C', 'D'], 'HT': ['H'], 'HU': ['H'], 'JL': ['J'], 'BV': ['B'], 'JO': ['J'], 'JH': ['H', 'J'], 'FL': ['F'], 'HM': ['H'], 'JV': ['J'], 'FI': ['I', 'F'], 'FJ': ['J', 'F'], 'HA': ['A', 'H'], 'HF': ['H', 'F'], 'BY': ['B'], 'JB': ['J', 'B'], 'FO': ['F'], 'HV': ['H'], 'DU': ['D'], 'GK': ['G'], 'DI': ['I', 'D'], 'IJ': ['I', 'J'], 'DS': ['D'], 'JP': ['J'], 'DF': ['D', 'F'], 'FN': ['F'], 'CB': ['C', 'B'], 'IK': ['I'], 'DD': ['D'], 'BX': ['B'], 'EC': ['C', 'E'], 'AC': ['A', 'C'], 'EN': ['E'], 'CO': ['C'], 'EK': ['E'], 'CL': ['C'], 'IQ': ['I'], 'IP': ['I'], 'CA': ['A', 'C'], 'EF': ['E', 'F'], 'AM': ['A'], 'AL': ['A'], 'GY': ['G'], 'CD': ['C', 'D'], 'AQ': ['A'], 'GF': ['G', 'F'], 'AS': ['A'], 'CX': ['C'], 'CE': ['C', 'E'], 'BJ': ['J', 'B'], 'AW': ['A'], 'IN': ['I'], 'AX': ['A'], 'GM': ['G'], 'IV': ['I'], 'CW': ['C'], 'EP': ['E'], 'EB': ['B', 'E'], 'IF': ['I', 'F'], 'II': ['I'], 'DN': ['D'], 'BT': ['B'], 'EQ': ['E'], 'AP': ['A'], 'CY': ['C'], 'HI': ['I', 'H'], 'JN': ['J'], 'IU': ['I'], 'IM': ['I'], 'GC': ['C', 'G'], 'BI': ['I', 'B'], 'AT': ['A']}‍‍‍‍‍‍‍‍‍‍‍

Per_Olav_Kaasa · ‎01-15-2018

Thank you for the example and help so far, I'm uploading 2 screenshots from the tables to give you a better understanding of what I'm trying to compare. Table1 is to the left and Table2 to the right.

XanderBakker · ‎01-15-2018

Can you share (a portion of) the actual data. At first sight I don't see many matches in the screenshot you provided.

Per_Olav_Kaasa · ‎01-17-2018

Uploaded the files in reply below.

JoshuaBixby · ‎01-15-2018

Since column1, Navn, can contain multiple terms/tags/tokens that you want to match; the SQL will get complicated enough that I don't know if every data store that ArcGIS supports will work. I think working within Python will be the way to go.

Instead of a screenshot, can you export those to a text or CSV file and attach it? As helpful as a screenshot can be, it is much easier to work with actual data than a picture of data.

JoshuaBixby · ‎01-15-2018

Whether Python or SQL, you have a lot of conditions you are trying to work with. For example, what does a comma and parentheses mean in column1? Do commas separate multiple terms? If so, do both terms need to be present to match column2? What about parentheses? What about the one row that is in quotes, is that one term or multiple terms?

JoshuaBixby · ‎01-15-2018

Will the new output, let's call it table3, just look like?

column1	column2
Mountain	Mountainrange
Mountain	Westmountainrange
pass	Donnors Pass
pass	Northwest Passage
pass	Dyatlov Pass

Per_Olav_Kaasa · ‎01-17-2018

The new output should show something like that yes, the text that is separated by comma in column1 is a diffraction of a word, that must both be checked against column2, as 1 object. So i need to have the code to treat comma as a delimiter to jump to next keyword. The ones in parenthesis is just a short explanation in other languages, not important for the actual search. Column1 is a dictionary, while column2 is a placename base for a large area. I'm uploading the files here, but only a portion(50 000) of the 400 000 in table2 since its too big.

JoshuaBixby · ‎01-17-2018

I realize part of our conversation is being lost in translation, so please bear with my questions....

Looking at Ordbok.csv, I assume ORDLISTE_–_GLOSSARY is the field/column containing terms for matching. Earlier, you stated that "both [must] be checked against column2, as 1 object." It is that last part, "as 1 object," that is throwing me. I think an example will be best for my question. Looking at Ordbok.csv:

OBJECTID	ORDLISTE_–_GLOSSARY
56	bru, brua
57	bru, brui
58	brun, brunen

Looking at ObjectID 56, is the match "bru OR brua" or "bru AND brua". I was assuming the first, but your "as 1 object" statement made me question my assumption.

XanderBakker · ‎01-17-2018

If I look at the data I kinda feel that the objective might be comparing "SNAVN" from the file Navn.csv against the field "field" from the file Ordbok.csv. That is just a guess and I don't we should be guessing...

There will be multiple hits either way and one should define what to do in that case. Lot of question marks before we can continue to write some code.