
Slow Select By Attribute processing

07-31-2011 09:54 PM
MPickering
Emerging Contributor
Hi there,

I have written some code that reads each line of a text file and adds the lines to a list. It then loops through the list and builds a where-clause from each item, which it uses to Select Layer By Attribute features in a feature layer. Once it has finished looping through the list, it copies all selected features to a shapefile.

My concern is that it is extremely slow to process, and I am wondering if there is anything I can do to speed the process up, or perhaps a different way I could approach this to make the script faster?

Any ideas on how I could speed up my script would be greatly appreciated...

Here's my code:

# Create the geoprocessor (the 9.3-style arcgisscripting module is assumed here)
import arcgisscripting
gp = arcgisscripting.create(9.3)

# Input text file of KEY values
output = "D:\\Subblocks\\SubblocksOutput.txt"

# Set overwrite property
gp.OverWriteOutput = 1

# Make feature layer
gp.MakeFeatureLayer("D:\\SUBBLOCKS.shp", "feature_lyr")

# Open output file in read only mode
text_file = open(output, "r") 

# Loop through text file and add to list
lines = text_file.readlines()
print len(lines)

for line in lines:
    print line
    query = "\"KEY\"" + " = " + "\'" + line.rstrip("\n") + "\'"
    gp.SelectLayerByAttribute_management("feature_lyr", "ADD_TO_SELECTION", query)

# Copy selected features to shapefile
gp.CopyFeatures_management("feature_lyr", "D:\\Subblocks\\SELECTED_SUBBLOCKS.shp")

# Close text file
text_file.close()   

# Free memory
del gp
6 Replies
ThomMackey
Regular Contributor
Hi M,

Not certain it'd definitely fix it, but I would probably build the query as one string and use the Select gp tool to extract the records all at once. This removes the loop and avoids repeatedly modifying the layer's selection, which may speed it up. It would also mean you wouldn't need to create a feature layer from the shapefile in the first place, I believe.

Assuming that your text file has a Key on every line, something like this might work (untested!):


# Input text file of KEY values
output = "D:\\Subblocks\\SubblocksOutput.txt"

# Input shapefile and output dataset for the Select tool
# (no feature layer needed, since Select works directly on the shapefile)
input_shapefile = "D:\\SUBBLOCKS.shp"
output_dataset = "D:\\Subblocks\\SELECTED_SUBBLOCKS.shp"

# Set overwrite property
gp.OverWriteOutput = 1

# Open output file in read only mode
text_file = open(output, "r") 
# Read the file into a list
data = text_file.readlines()
# Strip the data of trailing newlines
cleandata = [x.strip() for x in data]
# Quick function to put the data into SQL syntax
querify = lambda st: '"KEY" = \'%s\''%st
# Apply that function to all data (transforming each element from "a1" to "KEY" = 'A1')
queryterms = map(querify,cleandata)
# Put " OR " between each of the queries
querystring = " OR ".join(queryterms)
# Use the Select tool once
gp.Select(input_shapefile,output_dataset,querystring)

# Close text file
text_file.close() 

# Free memory
del gp


So this is basically reading the text file into a list and then building one long SQL query string, which will be "KEY" = 'A1' OR "KEY" = 'B2' OR "KEY" = 'C3'...
It then passes that string as the argument to the Select tool, which extracts records matching the query into a new dataset. Note that you might have to look up the actual syntax of the Select tool; I'm not certain of the order of arguments.
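For what it's worth, I believe Select lives in the Analysis toolbox and takes the input features, the output feature class, and then the where clause, so the call would look something like this (again untested, so worth checking against the tool's help page):

# Select (Analysis toolbox): input features, output feature class, where clause
gp.Select_analysis(input_shapefile, output_dataset, querystring)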

Let me know how it goes!
-Thom
MPickering
Emerging Contributor
Hi Thom,

Thanks so much for your reply! It worked perfectly! 😄

I followed your example exactly and it processed the 12,000 lines in minutes, which is brilliant. Before, it was still processing after a full day and I would just give up on it.

Another change I made was converting my shapefile to a feature class in a file geodatabase and then using this to create a feature layer (gp.MakeFeatureLayer) for the processing, and this also dramatically improved the process.
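In case it's useful to anyone else, that part looked roughly like this (the geodatabase and feature class names here are just placeholders for the example):

# Create a file geodatabase and copy the shapefile into it as a feature class
gp.CreateFileGDB_management("D:\\Subblocks", "Subblocks.gdb")
gp.FeatureClassToFeatureClass_conversion("D:\\SUBBLOCKS.shp", "D:\\Subblocks\\Subblocks.gdb", "SUBBLOCKS")

# Make a feature layer from the geodatabase feature class
gp.MakeFeatureLayer("D:\\Subblocks\\Subblocks.gdb\\SUBBLOCKS", "feature_lyr")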

Thanks again!

m
ThomMackey
Regular Contributor
Glad it worked! Days to minutes is a pretty big improvement! I didn't realise you had that many records; I've had the Select tool fail when trying to use an SQL string with more than ~10,000 conditions in it (using a similar process), so I'm glad it worked out for you 🙂 If you end up with more in future and it starts to break, I'd probably do something like:

# Read and clean all the query values (in_file is the open text file of KEY values)
all_lines_to_query = [x.strip() for x in in_file.readlines()]

# Split the list into two halves (integer division, Python 2 style)
first_half = all_lines_to_query[:len(all_lines_to_query)/2]
second_half = all_lines_to_query[len(all_lines_to_query)/2:]


Then do the process on both halves, and use the Merge (or is it Append?) gp tool to merge the results. It's a bit messy but it's the only way I've found that worked on large sets.
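Roughly like this (untested; the part filenames are just placeholders I've made up, and querify is the little helper from the earlier snippet):

# Build a query string for each half
querystring1 = " OR ".join(map(querify, first_half))
querystring2 = " OR ".join(map(querify, second_half))

# Run Select once per half, into two temporary outputs
gp.Select(input_shapefile, "D:\\Subblocks\\selected_part1.shp", querystring1)
gp.Select(input_shapefile, "D:\\Subblocks\\selected_part2.shp", querystring2)

# Merge the two halves into the final output (inputs as a semicolon-separated string)
gp.Merge_management("D:\\Subblocks\\selected_part1.shp;D:\\Subblocks\\selected_part2.shp",
                    "D:\\Subblocks\\SELECTED_SUBBLOCKS.shp")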

Anyway, happy to help 🙂

-Thom
MPickering
Emerging Contributor
Hi Thom,

I was a little worried it wouldn't work for that many records too, but when I changed the shapefile to a feature class in a file geodatabase and ran the script, it worked. Before, as a shapefile, it just crashed.

I have read on other posts that people have had similar issues with the SQL string failing when there are too many characters or conditions in the statement. I did see one post that mentioned using the 'IN' keyword in the SQL syntax, e.g. "KEY" IN ('COOK1729A', 'COOK1729B', etc., etc.), and building the query that way. Apparently they had success with this for a large number of conditions, but I haven't tested it...

I'll definitely keep the batching code you posted in mind. It's a tidy bit of code that I will no doubt use in the future.

Thanks again Thom.

m
ThomMackey
Regular Contributor
Good tip! I always forget about SQL's "IN" operator. I just tested it with one of my old scripts, and was able to extract >300,000 unique records in one step - no messing around with merging!

So doing it that way, your script would go something like:

# The field you're querying
queryfield = "KEY"

# Open the text file of values (same input file as before)
text_file = open("D:\\Subblocks\\SubblocksOutput.txt", "r")

# Read the file into a list
data = text_file.readlines()

# Strip the data of trailing newlines
cleandata = [x.strip() for x in data]

# Put single quotes around each string for SQL-friendliness
queryterms = ["'%s'"%x for x in cleandata]

# Put ", " between each of the queries
comma_sep_terms = ", ".join(queryterms)

# Then make the full expression
querystring = '"%s" IN (%s)'%(queryfield,comma_sep_terms)

# Use the Select tool once (input_shapefile and output_dataset as defined earlier)
gp.Select(input_shapefile, output_dataset, querystring)


Much cleaner! It seems odd, because I've always understood that "IN" is logically equivalent to multiple "OR"s (if anything a little slower), but for whatever reason, the GP prefers it.

Thanks heaps for mentioning that 😄
MPickering
Emerging Contributor
Good to know it worked and glad I could return the favour. I think I will use this method too. 😉

Thanks again Thom, you've saved me a lot of frustration and time.

m