Python Lookup Table?

JohnDye · ‎12-02-2013

I'm trying to make a lookup table of sorts in my code.

I have a SearchBox in which a user can enter the name of any retailer to search for the retailer in our retailer database. If found, a layerfile will be created and populated in the dataframe for the given retailer.

What the user enters into that searchbox is instantiated under the variable 'SearchTerm'.

Since a user's 'SearchTerm' is going to be inherenly subjective, I'm trying to figure out how I can account for common variances in a 'SearchTerm' which should equate to the same retailer.

For Example:
If the user is searching for the Retailer 'TJ Maxx', they might enter into the SearchBox any one one of the following:

'TJ Maxx', 'TJ Max' or 'TJX'

All three of those search terms should resolve to 'TJMaxx', which is how it is identified in the database. I thought about perhaps using a python dictionary where the name of the dataset in the Database would be the key and the possible search terms could be the values, but when I played with it I couldn't find a way to associate multiple values with a single key in a super simple way.

Basically, I just need to figure out an efficient way to resolve multiple 'tags' to a single, unique return, if that makes any sense.

StacyRendall1 · ‎12-02-2013

Seems like you are on the right track, although it may be possible to do something with regular expressions too (unfortunately I do not understand them enough to know if they will help).

Something like this should do what you want:

d = {'TJMaxx': ['TJ Maxx', 'TJ Max', 'TJX', 'TJMaxx']}  SearchTerm = 'TJ Maxx'  selected = None  # iterate through dictionary, seeing if term in sub list for k in d:     if SearchTerm in d:         selected = k  if selected is not None:     ## do something else:     ## report invalid entry

It is potentially inefficient in that it has to search through all entries of the dictionary, but even with a hundred thousand retailers it takes less than 0.1 seconds to perform the search on my rubbish computer (with a million it takes about a second).

If you were really worried about the search efficiency you could index the dictionary into a list, like so:

d = {'TJMaxx': ['TJ Maxx', 'TJ Max', 'TJX', 'TJMaxx']}  # make lists of value, key pairs as items in separate lists l = [[],[]] for k in d:     for i in d:         l[0].append(i)         l[1].append(k)  SearchTerm = 'TJ Maxx'  # search for item in list, return 'key' value if SearchTerm in l[0]:     selected = l[1][l[0].index(SearchTerm)]     ## do something else:     ## report invalid entry

Creating the list takes about five times longer than searching the dictionary, but searching the list takes half the time of searching the dictionary. This means if you can find a way of only creating the list once per time you use the tool (rather than once per search entry), or create the list, print it, then hard code it (instead of the dictionary) you will be able to increase the efficiency. If you change the inputs you will need to recreate the list....

View solution in original post

StacyRendall1 · ‎12-02-2013

Seems like you are on the right track, although it may be possible to do something with regular expressions too (unfortunately I do not understand them enough to know if they will help).

Something like this should do what you want:

d = {'TJMaxx': ['TJ Maxx', 'TJ Max', 'TJX', 'TJMaxx']}  SearchTerm = 'TJ Maxx'  selected = None  # iterate through dictionary, seeing if term in sub list for k in d:     if SearchTerm in d:         selected = k  if selected is not None:     ## do something else:     ## report invalid entry

It is potentially inefficient in that it has to search through all entries of the dictionary, but even with a hundred thousand retailers it takes less than 0.1 seconds to perform the search on my rubbish computer (with a million it takes about a second).

If you were really worried about the search efficiency you could index the dictionary into a list, like so:

d = {'TJMaxx': ['TJ Maxx', 'TJ Max', 'TJX', 'TJMaxx']}  # make lists of value, key pairs as items in separate lists l = [[],[]] for k in d:     for i in d:         l[0].append(i)         l[1].append(k)  SearchTerm = 'TJ Maxx'  # search for item in list, return 'key' value if SearchTerm in l[0]:     selected = l[1][l[0].index(SearchTerm)]     ## do something else:     ## report invalid entry

Creating the list takes about five times longer than searching the dictionary, but searching the list takes half the time of searching the dictionary. This means if you can find a way of only creating the list once per time you use the tool (rather than once per search entry), or create the list, print it, then hard code it (instead of the dictionary) you will be able to increase the efficiency. If you change the inputs you will need to recreate the list....

ShaunWalbridge · ‎12-02-2013

Generally, it'll probably be worth figuring out a way to display multiple search results to your users, unless you do have a relatively limited set of terms, and can come up enumerate the alternatives. One way of handling the ambiguous input from users is to use fuzzy string matching. It's related to the regular expressions Stacy mentioned, but is specifically for the comparison of inexact terms.

There are a number of algorithms for fuzzy matching, many of them built in to the database, such as SOUNDEX and Double Metaphone. If you're using SQL Server, you might want to look into Full Text Indexing, which can be used to generate a fuzzy search automatically. By doing it in the database, you'll save the work of having to do as much custom Python code to interact with the results, you can just use the sorted database results instead.

If you want to stick to doing things in Python, one way is to build up a list of all of your search terms, and then do the double metaphone or other similarity search against the user's input. One library I've used in the past that is handy is jellyfish, which contains a number of different string comparison engines.

cheers,
Shaun

JohnDye · ‎12-03-2013

Stacy,
Thank you so much!!! I had my dictionary defined exactly as you did with values indexed to a list but could not figure out how to iterate through each item in the value list. The entire list was always returned as a single value for the key. This is exactly what I was trying to do.

JohnDye · ‎12-03-2013

Something like this should do what you want:

d = {'TJMaxx': ['TJ Maxx', 'TJ Max', 'TJX', 'TJMaxx']}

SearchTerm = 'TJ Maxx'

selected = None

# iterate through dictionary, seeing if term in sub list
for k in d:
    if SearchTerm in d:
        selected = k

if selected is not None:
    ## do something
else:
    ## report invalid entry

Stacy,
Thank you so much!!! I had my dictionary defined exactly as you did but could not figure out how to iterate through each item in the value list. The entire list was always returned as a single value for the key. This is exactly what I was trying to do.

DataAcquisition · ‎12-05-2013

Does Fuzzy Wuzzy work using addresses for this kind of thing? I was only recently exposed to it but it seems like it could return false positives/general confusion based on address numbers.