Help altering some python code to randomly select one feature

JasonKelly1 · ‎02-13-2015

Hello,

Someone wrote the code below which works wonderfully to select a given percentage of polygons. I have about 1 hour of experience with python, but would like to alter the code so that it selects a number of polygons (opposed to a percentage). In particular, i would like it to allow me to select just one polygon at random. Thanks in advance for your help.

# Given an input layer, return an output *.lyr layer file containing a random

# selection of features.

#

# Original author: Leah Saunders, ESRI Inc, on Arcscripts.com. July 2007

# Major modification by Stephen Lead, 21st Feb 2008

import os, sys, random, arcgisscripting

try:

gp = arcgisscripting.create()

gp.overwriteoutput = 1

# Specify input featureclass, output *.lyr file and the percentage of

# random points to return. Set these parameters in ArcToolbox as shown.

inputFC = sys.argv[1] # Feature Class or Feature Layer

outputLyr = sys.argv[2] # Layer File

inpct = sys.argv[3] # Long

# Ensure that the input percentage is between 1 and 100%

inpct = min(int(inpct),100)

inpct = max(int(inpct),1)

# Work out how many features to select

inputDirname = os.path.dirname(inputFC)

inputBasename = os.path.basename(inputFC)

gp.workspace = inputDirname

desc = gp.describe(inputFC)

totpnts = gp.getcount(inputFC)

numValues = int(round(totpnts * float(inpct) / 100.0))

gp.addmessage("Selecting " + str(numValues) + " random features")

# Generate a list of all features, and select randomly from this

inList = []

randomList = []

fldname = desc.OIDFieldName

rows = gp.SearchCursor(inputFC)

row = rows.next()

gp.addmessage ("Loading all IDs into a list")

while row:

id = row.GetValue(fldname)

inList.append(id)

row = rows.next()

selpnts = 0

gp.addmessage("Creating the list of randomly selected features")

while len(randomList) < numValues:

selpnts += 1

selItem = random.choice(inList)

randomList.append(selItem)

inList.remove(selItem)

# Select features whose OID value occurs in the random list, generate

# a *.lyr file from this selection. (Leading and trailing [ and ] marks

# need to be removed from the list object)

theLen = len(str(randomList))

sqlexp = '"' + fldname + '"' + " in " + "(" + str(randomList)[1:theLen - 1] + ")"

selectionLyr = inputBasename + " selection"

gp.MakeFeatureLayer_management(inputFC, selectionLyr, sqlexp)

gp.SaveToLayerFile_management(selectionLyr, outputLyr)

gp.addmessage("\nOutput layer " + outputLyr + " contains features randomly selected from " + inputBasename +

"\n")

except:

gp.adderror("Error running script. Try specifying the full path to the input layer")

# END OF FILE

TedKowal · ‎02-13-2015

I am a VB programmer,not a Python guru. The changes I made are untested.... but doing something along the lines of the changes below should accomplish what you want .... without having to completely re-write the code: (numPoly) is the number of randomly selected Polygons you wish between 1 and something less than the total # of possible polygons there are to select from.

# Given an input layer, return an output *.lyr layer file containing a random
# selection of features.
#
# Original author: Leah Saunders, ESRI Inc, on Arcscripts.com. July 2007
# Major modification by Stephen Lead, 21st Feb 2008


import os, sys, random, arcgisscripting
try:
    gp = arcgisscripting.create()
    gp.overwriteoutput = 1


    # Specify input featureclass, output *.lyr file and the percentage of
    # random points to return. Set these parameters in ArcToolbox as shown.
    inputFC = sys.argv[1] # Feature Class or Feature Layer
    outputLyr = sys.argv[2] # Layer File
    #inpct = sys.argv[3] # Long
    numPoly = sys.argv[3] #Long Must be less than Actual Number of Polygons


    # Ensure that the input percentage is between 1 and 100%
    #inpct = min(int(inpct),100)
    #inpct = max(int(inpct),1)


    # Work out how many features to select
    inputDirname = os.path.dirname(inputFC)
    inputBasename = os.path.basename(inputFC)


    gp.workspace = inputDirname
    desc = gp.describe(inputFC)
    totpnts = gp.getcount(inputFC)
    #numValues = int(round(totpnts * float(inpct) / 100.0))
    numValues = numPoly
    gp.addmessage("Selecting " + str(numValues) + " random features")


    # Generate a list of all features, and select randomly from this
    inList = []
    randomList = []
    fldname = desc.OIDFieldName
    rows = gp.SearchCursor(inputFC)
    row = rows.next()
    gp.addmessage ("Loading all IDs into a list")
    while row:
        id = row.GetValue(fldname)
        inList.append(id)
        row = rows.next()


    selpnts = 0
    gp.addmessage("Creating the list of randomly selected features")
    while len(randomList) < numValues:
        selpnts += 1
        selItem = random.choice(inList)
        randomList.append(selItem)
        inList.remove(selItem)


    # Select features whose OID value occurs in the random list, generate
    # a *.lyr file from this selection. (Leading and trailing [ and ] marks
    # need to be removed from the list object)
    theLen = len(str(randomList))
    sqlexp = '"' + fldname + '"' + " in " + "(" + str(randomList)[1:theLen - 1] + ")"
    selectionLyr = inputBasename + " selection"
    gp.MakeFeatureLayer_management(inputFC, selectionLyr, sqlexp)
    gp.SaveToLayerFile_management(selectionLyr, outputLyr)


    gp.addmessage("\nOutput layer " + outputLyr + " contains features randomly selected from " + inputBasename + "\n")
except:
    gp.adderror("Error running script. Try specifying the full path to the input layer")


# END OF FILE

SteveLynch · ‎02-13-2015

or if you have a Geostatistical Analyst license you could use the SubsetFeatures gp tool, viz.,

arcpy.SubsetFeatures_ga('myPolys', 'outPolys', '', 5, "ABSOLUTE_VALUE") will return 5 randomly selected polygons. and

arcpy.SubsetFeatures_ga('myPolys', 'outPolys', '', 5, "PERCENTAGE_OF_INPUT") will return a random 5% of the polygons.

or if you have n polygons, then generate 5 random numbers between 1 and n and use these as polygon id's and extract those features. Which is basically what the script above does.

-Steve

XanderBakker · ‎02-14-2015

You could use this code:

def main():
    import arcpy
    from random import randint

    fc_in = arcpy.GetParameterAsText(0) # input featureclass
    fl_out = arcpy.GetParameterAsText(1) # output layerfile
    cnt = arcpy.GetParameter(2) # number of features to select

    cnt_feats = int(arcpy.GetCount_management(fc_in).getOutput(0))

    if cnt > 0 and cnt < cnt_feats:
        fld_oid = arcpy.Describe(fc_in).OIDFieldname
        lst_oids = [r[0] for r in arcpy.da.SearchCursor(fc_in, (fld_oid))]

        sel_oid = []
        while len(sel_oid) < cnt:
            i = randint(0, cnt_feats-1)
            oid = lst_oids
            if not str(oid) in sel_oid:
                sel_oid.append(str(oid))

        oids = ", ".join(sel_oid)
        where = "{0} IN ({1})".format(arcpy.AddFieldDelimiters(fc_in, fld_oid), oids)

        arcpy.MakeFeatureLayer_management(fc_in, "selection", where)
        arcpy.SaveToLayerFile_management("selection", fl_out)

if __name__ == '__main__':
    main()

JoshuaBixby · ‎02-15-2015

If we are already importing random, then we can rely on random.sample to do the heavy lifting for us, assuming we have already went to the effort of building an OID list and want sampling without replacement.

def main():  
    import arcpy  
    from random import sample  

    fc_in = arcpy.GetParameterAsText(0) # input featureclass  
    fl_out = arcpy.GetParameterAsText(1) # output layerfile  
    cnt = arcpy.GetParameter(2) # number of features to select

    fld_oid = arcpy.Describe(fc_in).OIDFieldname
    lst_oids = [oid for oid, in arcpy.da.SearchCursor(fc_in, (fld_oid))]

    oids = ", ".join(map(str, sample(lst_oids, cnt)))
    where = "{0} IN ({1})".format(arcpy.AddFieldDelimiters(fc_in, fld_oid), oids)

    arcpy.MakeFeatureLayer_management(fc_in, "selection", where)  
    arcpy.SaveToLayerFile_management("selection", fl_out)  

if __name__ == '__main__':  
    main()

If building an OID list in-memory becomes an issue, one could resort to using a reservoir sampling approach.

def stream_sample(iterator, k):
    from random import randint
    result = [next(iterator) for _ in range(k)]

    n = k
    for item in iterator:
        n += 1
        s = randint(0, n)
        if s < k:
            result = item
    return result
 
 def main():  
    import arcpy  
  
    fc_in = arcpy.GetParameterAsText(0) # input featureclass  
    fl_out = arcpy.GetParameterAsText(1) # output layerfile  
    cnt = arcpy.GetParameter(2) # number of features to select
    
    fld_oid = arcpy.Describe(fc_in).OIDFieldname
    sample_oids = [oid for oid, in stream_sample(arcpy.da.SearchCursor(fc_in, "OID@"), cnt)]
    oids = ", ".join(map(str, sample_oids))
   
    where = "{0} IN ({1})".format(arcpy.AddFieldDelimiters(fc_in, fld_oid), oids)  
    arcpy.MakeFeatureLayer_management(fc_in, "selection", where)  
    arcpy.SaveToLayerFile_management("selection", fl_out)  

if __name__ == '__main__':  
    main()

A plus of using reservoir sampling is that the memory footprint can be quite modest to trivial when working with very large datasets. A minus of using reservoir sampling is that calling random so many times can add noticeable overhead; that said, I can still sample from a million records in a couple seconds.

The stream_sample function is taken from JesseBuesking on the StackExchange thread: pick N items at random. The code from JesseBuesking is basically just implementing Don Knuth's algorithm for picking random elements from a set whose cardinality is unknown.

JoshuaBixby · ‎02-15-2015

A completely different approach to relying on application logic to select random records would be to rely on the database management system to select random records. If someone is using an enterprise DBMS (Oracle, SQL Server, PostgreSQL, etc..), the database will support some way of returning randomly selected records.

def main():  
    import arcpy  
    import os

    fc_in = arcpy.GetParameterAsText(0) # input featureclass  
    fl_out = arcpy.GetParameterAsText(1) # output layerfile  
    cnt = arcpy.GetParameter(2) # number of features to select
    dbms = # use 'oracle' or 'sqlserver'

    oid_fld = arcpy.Describe(fc_in).OIDFieldName

    oracle_where_clause = (
        "{} IN (SELECT {} FROM "
               "(SELECT {} FROM {} ORDER BY dbms_random.value) "
               "WHERE rownum <= {})".format(
            oid_fld, oid_fld, oid_fld, os.path.split(fc_in)[1], cnt
        )
    )
   sqlserver_where_clause = (
        "{} IN (SELECT TOP {} {} FROM {} ORDER BY NEWID())".format(
            oid_fld, cnt, oid_fld, os.path.split(fc_in)[1]
        )
    ) 
    if dbms == 'oracle':
        where_clause = oracle_where_clause
    elif dbms == 'sqlserver':
        where_clause = sqlserver_where_clause

    arcpy.MakeFeatureLayer_management(fc_in, "tmpLayer")
    arcpy.SelectLayerByAttribute_management("tmpLayer", "NEW_SELECTION", where_clause)
    arcpy.MakeFeatureLayer_management("tmpLayer", "selection")
    arcpy.SaveToLayerFile_management("selection", fl_out)


if __name__ == '__main__':  
    main()

In many ways, this would not be a very good "general" approach for several reasons. One, relying on the DBMS makes the code more involved or less portable because every DBMS seems to have a different approach to selecting random records. Second, passing SQL through ArcGIS tools always seems to have a sketchiness about it. It does work, but I have definitely run into issues as well.

Looking at the code above, one might wonder why lines 29 and 30 exist, i.e., why not just pass the SQL directly to the MakeFeatureLayer tool. When I first ginned up this code, I tried doing just that, but I found an interesting/odd behavior. The SQL to select random records actually became embedded in the definition of the feature layer so every time ArcMap was refreshed, the records kept changing. ArcMap didn't like that, there would be issues with displaying polygons at times. It would be neat, though, with attribute-only data to load a dynamically, randomly changing table into ArcMap for testing at times.

JoshuaBixby · ‎02-15-2015

Did some quick tests selecting 25 random records against a ~700,000 record non-versioned feature class in Oracle. Of the three different approaches I mentioned above (random.sample, reservoir sampling, and SQL sample), random.sample was the quickest taking about 4.5 seconds on average to make a feature layer. SQL sample took about 1.4x longer than random.sample while reservoir sampling took 5.1x longer.

It seems the overhead of calling random over N items becomes quite impactful with hundreds of thousands of records, and that impact will only grow as N grows. Also, the memory impact from fully populating an OID list was quite a bit smaller than I anticipated. The reservoir sampling code above is quite simple and isn't distributed, which I have seen some high-performing distributed reservoir sampling code, but it still demonstrates the trade-offs of different approaches.

The SQL sampling performed fairly well, a bit better than I expected. Given it wasn't that much slower than random.sampling, it seems a more efficient piece of SQL might make the approach more competitive. It would be interesting to see an SQL Server comparison, but that will have to wait for another day.

JasonKelly1 · ‎02-16-2015

Wow, thanks everyone for your responses! So far I've only had a chance to try Ted's code. Ted, for some reason, it didn't work, but after seeing what you did, which made a lot of sense to me (my background is also in VB), I simply replaced this line from the original code:

numValues = int(round(totpnts * float(inpct) / 100.0))

with this:

numValues = int(inpct)

Which seems to work.

Thanks everyone else for your scripts, too; going through each of them, figuring out what you've done, and then tinkering with them will be very helpful as I try to familiarize myself with Python. Much appreciated!

-Jason

TedKowal · ‎02-17-2015

I just gave an approach based upon your code to get you unstuck.... I figured the experts would provide the "best practice" alternatives, which greatly helps coding development providing you understand the solution. I did enjoy the finesse displayed at the numerous ways the problem was attacked. You hit upon my gripe with python -- does not work consistently between machines.

The code I sent works on mine. I have used other code that works for another person but does not work in my environment. At least with VB it was getting the correct reference then the code worked on any machine. I do not have a lot of confidence stating the code will work for just because it works on my machine. So I enjoy lots of different ways of attacking the problem (give greater chance that one solution will work for me!)