Hello,
Someone wrote the code below which works wonderfully to select a given percentage of polygons. I have about 1 hour of experience with python, but would like to alter the code so that it selects a number of polygons (opposed to a percentage). In particular, i would like it to allow me to select just one polygon at random. Thanks in advance for your help.
# Given an input layer, return an output *.lyr layer file containing a random
# selection of features.
#
# Original author: Leah Saunders, ESRI Inc, on Arcscripts.com. July 2007
# Major modification by Stephen Lead, 21st Feb 2008
import os, sys, random, arcgisscripting
try:
gp = arcgisscripting.create()
gp.overwriteoutput = 1
# Specify input featureclass, output *.lyr file and the percentage of
# random points to return. Set these parameters in ArcToolbox as shown.
inputFC = sys.argv[1] # Feature Class or Feature Layer
outputLyr = sys.argv[2] # Layer File
inpct = sys.argv[3] # Long
# Ensure that the input percentage is between 1 and 100%
inpct = min(int(inpct),100)
inpct = max(int(inpct),1)
# Work out how many features to select
inputDirname = os.path.dirname(inputFC)
inputBasename = os.path.basename(inputFC)
gp.workspace = inputDirname
desc = gp.describe(inputFC)
totpnts = gp.getcount(inputFC)
numValues = int(round(totpnts * float(inpct) / 100.0))
gp.addmessage("Selecting " + str(numValues) + " random features")
# Generate a list of all features, and select randomly from this
inList = []
randomList = []
fldname = desc.OIDFieldName
rows = gp.SearchCursor(inputFC)
row = rows.next()
gp.addmessage ("Loading all IDs into a list")
while row:
id = row.GetValue(fldname)
inList.append(id)
row = rows.next()
selpnts = 0
gp.addmessage("Creating the list of randomly selected features")
while len(randomList) < numValues:
selpnts += 1
selItem = random.choice(inList)
randomList.append(selItem)
inList.remove(selItem)
# Select features whose OID value occurs in the random list, generate
# a *.lyr file from this selection. (Leading and trailing [ and ] marks
# need to be removed from the list object)
theLen = len(str(randomList))
sqlexp = '"' + fldname + '"' + " in " + "(" + str(randomList)[1:theLen - 1] + ")"
selectionLyr = inputBasename + " selection"
gp.MakeFeatureLayer_management(inputFC, selectionLyr, sqlexp)
gp.SaveToLayerFile_management(selectionLyr, outputLyr)
gp.addmessage("\nOutput layer " + outputLyr + " contains features randomly selected from " + inputBasename +
"\n")
except:
gp.adderror("Error running script. Try specifying the full path to the input layer")
# END OF FILE
I am a VB programmer,not a Python guru. The changes I made are untested.... but doing something along the lines of the changes below should accomplish what you want .... without having to completely re-write the code: (numPoly) is the number of randomly selected Polygons you wish between 1 and something less than the total # of possible polygons there are to select from.
# Given an input layer, return an output *.lyr layer file containing a random # selection of features. # # Original author: Leah Saunders, ESRI Inc, on Arcscripts.com. July 2007 # Major modification by Stephen Lead, 21st Feb 2008 import os, sys, random, arcgisscripting try: gp = arcgisscripting.create() gp.overwriteoutput = 1 # Specify input featureclass, output *.lyr file and the percentage of # random points to return. Set these parameters in ArcToolbox as shown. inputFC = sys.argv[1] # Feature Class or Feature Layer outputLyr = sys.argv[2] # Layer File #inpct = sys.argv[3] # Long numPoly = sys.argv[3] #Long Must be less than Actual Number of Polygons # Ensure that the input percentage is between 1 and 100% #inpct = min(int(inpct),100) #inpct = max(int(inpct),1) # Work out how many features to select inputDirname = os.path.dirname(inputFC) inputBasename = os.path.basename(inputFC) gp.workspace = inputDirname desc = gp.describe(inputFC) totpnts = gp.getcount(inputFC) #numValues = int(round(totpnts * float(inpct) / 100.0)) numValues = numPoly gp.addmessage("Selecting " + str(numValues) + " random features") # Generate a list of all features, and select randomly from this inList = [] randomList = [] fldname = desc.OIDFieldName rows = gp.SearchCursor(inputFC) row = rows.next() gp.addmessage ("Loading all IDs into a list") while row: id = row.GetValue(fldname) inList.append(id) row = rows.next() selpnts = 0 gp.addmessage("Creating the list of randomly selected features") while len(randomList) < numValues: selpnts += 1 selItem = random.choice(inList) randomList.append(selItem) inList.remove(selItem) # Select features whose OID value occurs in the random list, generate # a *.lyr file from this selection. (Leading and trailing [ and ] marks # need to be removed from the list object) theLen = len(str(randomList)) sqlexp = '"' + fldname + '"' + " in " + "(" + str(randomList)[1:theLen - 1] + ")" selectionLyr = inputBasename + " selection" gp.MakeFeatureLayer_management(inputFC, selectionLyr, sqlexp) gp.SaveToLayerFile_management(selectionLyr, outputLyr) gp.addmessage("\nOutput layer " + outputLyr + " contains features randomly selected from " + inputBasename + "\n") except: gp.adderror("Error running script. Try specifying the full path to the input layer") # END OF FILE
or if you have a Geostatistical Analyst license you could use the SubsetFeatures gp tool, viz.,
arcpy.SubsetFeatures_ga('myPolys', 'outPolys', '', 5, "ABSOLUTE_VALUE") will return 5 randomly selected polygons. and
arcpy.SubsetFeatures_ga('myPolys', 'outPolys', '', 5, "PERCENTAGE_OF_INPUT") will return a random 5% of the polygons.
or if you have n polygons, then generate 5 random numbers between 1 and n and use these as polygon id's and extract those features. Which is basically what the script above does.
-Steve
You could use this code:
def main(): import arcpy from random import randint fc_in = arcpy.GetParameterAsText(0) # input featureclass fl_out = arcpy.GetParameterAsText(1) # output layerfile cnt = arcpy.GetParameter(2) # number of features to select cnt_feats = int(arcpy.GetCount_management(fc_in).getOutput(0)) if cnt > 0 and cnt < cnt_feats: fld_oid = arcpy.Describe(fc_in).OIDFieldname lst_oids = [r[0] for r in arcpy.da.SearchCursor(fc_in, (fld_oid))] sel_oid = [] while len(sel_oid) < cnt: i = randint(0, cnt_feats-1) oid = lst_oids if not str(oid) in sel_oid: sel_oid.append(str(oid)) oids = ", ".join(sel_oid) where = "{0} IN ({1})".format(arcpy.AddFieldDelimiters(fc_in, fld_oid), oids) arcpy.MakeFeatureLayer_management(fc_in, "selection", where) arcpy.SaveToLayerFile_management("selection", fl_out) if __name__ == '__main__': main()
If we are already importing random, then we can rely on random.sample to do the heavy lifting for us, assuming we have already went to the effort of building an OID list and want sampling without replacement.
def main(): import arcpy from random import sample fc_in = arcpy.GetParameterAsText(0) # input featureclass fl_out = arcpy.GetParameterAsText(1) # output layerfile cnt = arcpy.GetParameter(2) # number of features to select fld_oid = arcpy.Describe(fc_in).OIDFieldname lst_oids = [oid for oid, in arcpy.da.SearchCursor(fc_in, (fld_oid))] oids = ", ".join(map(str, sample(lst_oids, cnt))) where = "{0} IN ({1})".format(arcpy.AddFieldDelimiters(fc_in, fld_oid), oids) arcpy.MakeFeatureLayer_management(fc_in, "selection", where) arcpy.SaveToLayerFile_management("selection", fl_out) if __name__ == '__main__': main()
If building an OID list in-memory becomes an issue, one could resort to using a reservoir sampling approach.
def stream_sample(iterator, k): from random import randint result = [next(iterator) for _ in range(k)] n = k for item in iterator: n += 1 s = randint(0, n) if s < k: result= itemreturn result def main(): import arcpy fc_in = arcpy.GetParameterAsText(0) # input featureclass fl_out = arcpy.GetParameterAsText(1) # output layerfile cnt = arcpy.GetParameter(2) # number of features to select fld_oid = arcpy.Describe(fc_in).OIDFieldname sample_oids = [oid for oid, in stream_sample(arcpy.da.SearchCursor(fc_in, "OID@"), cnt)] oids = ", ".join(map(str, sample_oids)) where = "{0} IN ({1})".format(arcpy.AddFieldDelimiters(fc_in, fld_oid), oids) arcpy.MakeFeatureLayer_management(fc_in, "selection", where) arcpy.SaveToLayerFile_management("selection", fl_out) if __name__ == '__main__': main()
A plus of using reservoir sampling is that the memory footprint can be quite modest to trivial when working with very large datasets. A minus of using reservoir sampling is that calling random so many times can add noticeable overhead; that said, I can still sample from a million records in a couple seconds.
The stream_sample function is taken from JesseBuesking on the StackExchange thread: pick N items at random. The code from JesseBuesking is basically just implementing Don Knuth's algorithm for picking random elements from a set whose cardinality is unknown.
A completely different approach to relying on application logic to select random records would be to rely on the database management system to select random records. If someone is using an enterprise DBMS (Oracle, SQL Server, PostgreSQL, etc..), the database will support some way of returning randomly selected records.
def main(): import arcpy import os fc_in = arcpy.GetParameterAsText(0) # input featureclass fl_out = arcpy.GetParameterAsText(1) # output layerfile cnt = arcpy.GetParameter(2) # number of features to select dbms = # use 'oracle' or 'sqlserver' oid_fld = arcpy.Describe(fc_in).OIDFieldName oracle_where_clause = ( "{} IN (SELECT {} FROM " "(SELECT {} FROM {} ORDER BY dbms_random.value) " "WHERE rownum <= {})".format( oid_fld, oid_fld, oid_fld, os.path.split(fc_in)[1], cnt ) ) sqlserver_where_clause = ( "{} IN (SELECT TOP {} {} FROM {} ORDER BY NEWID())".format( oid_fld, cnt, oid_fld, os.path.split(fc_in)[1] ) ) if dbms == 'oracle': where_clause = oracle_where_clause elif dbms == 'sqlserver': where_clause = sqlserver_where_clause arcpy.MakeFeatureLayer_management(fc_in, "tmpLayer") arcpy.SelectLayerByAttribute_management("tmpLayer", "NEW_SELECTION", where_clause) arcpy.MakeFeatureLayer_management("tmpLayer", "selection") arcpy.SaveToLayerFile_management("selection", fl_out) if __name__ == '__main__': main()
In many ways, this would not be a very good "general" approach for several reasons. One, relying on the DBMS makes the code more involved or less portable because every DBMS seems to have a different approach to selecting random records. Second, passing SQL through ArcGIS tools always seems to have a sketchiness about it. It does work, but I have definitely run into issues as well.
Looking at the code above, one might wonder why lines 29 and 30 exist, i.e., why not just pass the SQL directly to the MakeFeatureLayer tool. When I first ginned up this code, I tried doing just that, but I found an interesting/odd behavior. The SQL to select random records actually became embedded in the definition of the feature layer so every time ArcMap was refreshed, the records kept changing. ArcMap didn't like that, there would be issues with displaying polygons at times. It would be neat, though, with attribute-only data to load a dynamically, randomly changing table into ArcMap for testing at times.
Did some quick tests selecting 25 random records against a ~700,000 record non-versioned feature class in Oracle. Of the three different approaches I mentioned above (random.sample, reservoir sampling, and SQL sample), random.sample was the quickest taking about 4.5 seconds on average to make a feature layer. SQL sample took about 1.4x longer than random.sample while reservoir sampling took 5.1x longer.
It seems the overhead of calling random over N items becomes quite impactful with hundreds of thousands of records, and that impact will only grow as N grows. Also, the memory impact from fully populating an OID list was quite a bit smaller than I anticipated. The reservoir sampling code above is quite simple and isn't distributed, which I have seen some high-performing distributed reservoir sampling code, but it still demonstrates the trade-offs of different approaches.
The SQL sampling performed fairly well, a bit better than I expected. Given it wasn't that much slower than random.sampling, it seems a more efficient piece of SQL might make the approach more competitive. It would be interesting to see an SQL Server comparison, but that will have to wait for another day.
Wow, thanks everyone for your responses! So far I've only had a chance to try Ted's code. Ted, for some reason, it didn't work, but after seeing what you did, which made a lot of sense to me (my background is also in VB), I simply replaced this line from the original code:
numValues = int(round(totpnts * float(inpct) / 100.0))
with this:
numValues = int(inpct)
Which seems to work.
Thanks everyone else for your scripts, too; going through each of them, figuring out what you've done, and then tinkering with them will be very helpful as I try to familiarize myself with Python. Much appreciated!
-Jason
I just gave an approach based upon your code to get you unstuck.... I figured the experts would provide the "best practice" alternatives, which greatly helps coding development providing you understand the solution. I did enjoy the finesse displayed at the numerous ways the problem was attacked. You hit upon my gripe with python -- does not work consistently between machines.
The code I sent works on mine. I have used other code that works for another person but does not work in my environment. At least with VB it was getting the correct reference then the code worked on any machine. I do not have a lot of confidence stating the code will work for just because it works on my machine. So I enjoy lots of different ways of attacking the problem (give greater chance that one solution will work for me!)