I have more than 1 million points in California and I want to randomly select about 10% of them, retaining their attributes. "Create Random Points" doesn't work because it creates new points rather than selecting existing ones. I found a Python script but can't make it work. Does anyone know a way using existing tools?
How about adding a field and calculating a random number to it, then sorting by the new field and selecting the first 10%?
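A minimal sketch of that idea in plain numpy, outside the field calculator, assuming the OIDs are simply 0..n-1 (they are stand-ins here, not your real OIDs):

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1_000_000          # stand-in for your ~1M points
oids = np.arange(n)    # stand-in for the feature OIDs

# "add a field" of random keys, sort by it, keep the first 10%
keys = rng.random(n)
sample = oids[np.argsort(keys)[: n // 10]]

print(len(sample))   # 100000
```

Because each record gets exactly one key, sorting and slicing can never pick the same record twice.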
Subset Features will do this; however, it requires a Geostatistical Analyst license.
If you have the license, using Geostatistical Analyst is definitely the most robust and statistically sound method of working with random subsets.
add a numeric integer field....
I prefer to use a small code block, something like
import numpy as np
def rand_int():
    return np.random.randint(0, 10)
expression box ---> rand_int()
Now the above processes row by row, returning a random integer, as per the docstring:
"""
Return random integers from `low` (inclusive) to `high` (exclusive).
Return random integers from the "discrete uniform" distribution in the
"half-open" interval [`low`, `high`). If `high` is None (the default),
then results are from [0, `low`).
"""
>>> a = np.random.randint(0, 10, 1000000)
>>> b = a[a==5]
>>> len(b)
100247
>>>
which is pretty good: you get about 100,000 plus change (it will never be exact) out of 1 million values equal to 5.
So easy! Made even easier by just limiting the random numbers to 10 and then selecting by attribute for one of the numbers. This is definitely a case of me overthinking it; such a simple and elegant solution became obvious as soon as Bill set me on it. Thank you, sir!
I'm sure there are some gotchas with this, but quick & dirty, here's how you can do it without adding a new field:
>>> import random
... fc = 'bc_geoname' # your feature class
... arcpy.SelectLayerByAttribute_management(fc,"CLEAR_SELECTION") # clear selection to consider all features
... feature_count = int(arcpy.GetCount_management(fc).getOutput(0)) # count features
... percent = 0.10 # enter your desired percentage
... rnd_set = set() # create an empty set
... while len(rnd_set) < (feature_count * percent): # do until your set is full
... rnd_set.add(random.randint(0,feature_count-1)) # make a random integer and try to add it to the set
... where = '"FID" in ({0})'.format(','.join(map(str,rnd_set))) # include set in SQL where clause
... arcpy.SelectLayerByAttribute_management(fc,"NEW_SELECTION",where) # select the FIDs in the set
This runs instantly on 40,000 features, so I assume it will work for you, unless it hits some memory limit.
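The selection logic can be exercised outside ArcGIS; here is the same loop with the arcpy calls stripped out (the count of 40,000 and the "FID" field name are just carried over from the post above):

```python
import random

feature_count = 40_000    # pretend GetCount returned this
percent = 0.10

rnd_set = set()
while len(rnd_set) < feature_count * percent:
    # duplicate draws are absorbed by the set, so the loop
    # keeps going until exactly 10% unique FIDs are collected
    rnd_set.add(random.randint(0, feature_count - 1))

where = '"FID" in ({0})'.format(','.join(map(str, sorted(rnd_set))))

print(len(rnd_set))   # 4000
```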
Darren has a good point and raises another important distinction.
One advantage of adding a field is that you get 10 samples (in this case) which don't overlap and will never choose the same record, useful if you have to pull out several samples of about 10% each. This is sampling without replacement, for doing things like t-tests. Darren's approach should be used if you need sampling with replacement and the tests associated with it.
Both methods have their ups and downs and one should know both.
It is also important to note that random.randint and np.random.randint are but two of a whole family of distributions you can draw from, i.e. beta, binomial, chisquare, f, gamma, geometric, gumbel, hypergeometric, laplace, logistic, lognormal, ... poisson, Cauchy, exponential, standard_t, uniform, Weibull, allowing one to pull samples out for distribution testing.
To explore:
import numpy as np
dir(np.random)
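For example, drawing from a few of those distributions (shown with the newer default_rng generator; the same names also exist as functions directly on np.random):

```python
import numpy as np

rng = np.random.default_rng(0)

pois = rng.poisson(lam=3.0, size=100_000)
unif = rng.uniform(0.0, 1.0, size=100_000)
logn = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

# with 100,000 draws the sample statistics sit close
# to the theoretical values (mean 3.0, mean 0.5, median 1.0)
print(pois.mean())
print(unif.mean())
print(np.median(logn))
```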
I believe using a set (vs. a list) ensures that this is essentially sampling without replacement, albeit without the performance benefit of skipping already-selected values. Using a list would be sampling with replacement.
A set does reduce the sample size, however (duplicate draws are discarded), so it depends upon the sample size needed and the particular set that is drawn; plus the record is permanent.
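numpy also has this distinction built in: choice on a Generator takes a replace flag, so you can get either behaviour directly without a retry loop (sizes below are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1_000_000, 100_000

# without replacement: exactly k records, every index unique
without = rng.choice(n, size=k, replace=False)

# with replacement: duplicate draws are possible,
# so the count of unique indices shrinks below k
with_repl = rng.choice(n, size=k, replace=True)

print(len(np.unique(without)))       # 100000
print(len(np.unique(with_repl)) < k)
```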
Another way is to produce the array (potentially larger) and save it to disk (*.npy or *.npz), then draw from that by appending to a feature class.
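A small sketch of the save-and-reload step; the append into a feature class would go through arcpy (e.g. a cursor) and is not shown, and the temp-file path here is made up for the example:

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(2)
sample = rng.integers(0, 10, size=1_000_000)

# save the array to disk, then reload it later to draw from
path = os.path.join(tempfile.mkdtemp(), "random_sample.npy")
np.save(path, sample)

restored = np.load(path)
print(np.array_equal(sample, restored))   # True
```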