# How can I randomly select a number of point features and retain their attributes?

06-01-2016 08:55 AM
New Contributor

I have >1 million points in California and want to randomly select ~10% of them while retaining their attributes. "Create Random Points" doesn't work because it creates new points rather than selecting existing ones. I found a Python script but can't get it to work. Does anyone know a way using existing tools?

1 Solution

Accepted Solutions
Occasional Contributor III

How about adding a field and calculating a random number to it, then sorting by the new field and selecting the first 10%?
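A minimal sketch of that idea in plain Python (the record list and the 10% cut-off here are made up; in ArcGIS the random value would live in a new field populated via the Field Calculator):

```python
import random

# Hypothetical records: (object_id, attribute) pairs standing in for point features.
records = [(oid, "attr_%d" % oid) for oid in range(1000)]

# Tag each record with a random number (the "calculated field").
tagged = [(random.random(), rec) for rec in records]

# Sort by the random field and keep the first 10%.
tagged.sort(key=lambda t: t[0])
sample = [rec for _, rec in tagged[: len(records) // 10]]

print(len(sample))  # 100 of the 1000 records, attributes intact
```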

21 Replies

Esri Regular Contributor

Subset Features will do it; however, it requires a Geostatistical Analyst license.

MVP Esteemed Contributor

If you have the license, using Geostatistical Analyst is definitely the most robust and statistically sound method of working with random subsets.

MVP Esteemed Contributor

I prefer to use a small code block, something like

```python
import numpy as np

def rand_int():
    return np.random.randint(0, 10)

# expression box ---> rand_int()
```

Now the above processes row by row, returning a random integer as described in the `np.random.randint` docstring:

```
Return random integers from `low` (inclusive) to `high` (exclusive).

Return random integers from the "discrete uniform" distribution in the
"half-open" interval [`low`, `high`). If `high` is None (the default),
then results are from [0, `low`).
```

```python
>>> a = np.random.randint(0, 10, 1000000)
>>> b = a[a == 5]
>>> len(b)
100247
>>>
```

which is pretty good: you get about 100,000 plus change (it will never be exact) out of 1 million values equal to 5.
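The same idea applied as a selection: tag each row with a random digit 0–9 and keep the rows that drew one particular value. A sketch (the seed and row count below are illustrative, and the fixed seed just makes the result repeatable):

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded so the run is reproducible
n = 1_000_000
tags = rng.integers(0, 10, n)     # one random digit per row

mask = tags == 5                  # "select by attribute" on the random field
print(mask.sum() / n)             # roughly 0.1, never exactly
```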

New Contributor

So easy! Made even easier by just limiting the random numbers to 10 and then selecting by attribute for one of the numbers. This is definitely a case of me overthinking when such a simple and elegant solution became obvious as soon as Bill set me on it. Thank you, sir!

MVP Honored Contributor

I'm sure there are some gotchas with this, but quick & dirty, here's how you can do it without adding a new field:

```python
>>> import random
... fc = 'bc_geoname'  # your feature class
... arcpy.SelectLayerByAttribute_management(fc, "CLEAR_SELECTION")  # clear selection to consider all features
... feature_count = int(arcpy.GetCount_management(fc).getOutput(0))  # count features
... percent = 0.10  # enter your desired percentage
... rnd_set = set([])  # create a set
... while len(rnd_set) < (feature_count * percent):  # do until your set is full
...     rnd_set.add(random.randint(0, feature_count - 1))  # make a random integer and try to add it to the set
... where = '"FID" in ({0})'.format(','.join(map(str, rnd_set)))  # include set in SQL where clause
... arcpy.SelectLayerByAttribute_management(fc, "NEW_SELECTION", where)  # select the FIDs in the set
```

This runs instantly on 40,000 features, so I assume it will work for you, unless it hits some memory limit.
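Stripped of the arcpy calls, the sampling core of that snippet can be tried on its own; the feature count below is made up to match the 40,000 mentioned above:

```python
import random

feature_count = 40_000
percent = 0.10

rnd_set = set()
while len(rnd_set) < feature_count * percent:
    rnd_set.add(random.randint(0, feature_count - 1))  # duplicate draws are silently ignored

# The same where clause the arcpy version builds
where = '"FID" in ({0})'.format(','.join(map(str, sorted(rnd_set))))
print(len(rnd_set))  # exactly 4000 distinct FIDs
```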

MVP Esteemed Contributor

Darren has a good point and raises another important distinction.

One advantage of adding a field is that you end up with 10 (in this case) samples which don't overlap and will never choose the same record, should you need to pull out several samples of about 10%. That is sampling without replacement, for doing things like t-tests. Darren's approach should be used if you need sampling with replacement and the tests associated with it.

Both methods have their ups and downs and one should know both.
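For what it's worth, NumPy makes the distinction explicit: `np.random.choice` (and `Generator.choice`) takes a `replace` argument, so both schemes are a one-liner. A sketch with stand-in object IDs:

```python
import numpy as np

rng = np.random.default_rng(0)
oids = np.arange(1000)  # stand-in object IDs

without = rng.choice(oids, size=100, replace=False)  # no OID repeats
with_rep = rng.choice(oids, size=100, replace=True)  # OIDs may repeat

print(len(set(without.tolist())))  # always 100 when replace=False
```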

It is also important to note that `random.randint` and `random_integers` are but 2 of a bunch of distributions you can draw from (beta, binomial, chisquare, f, gamma, geometric, gumbel, hypergeometric, laplace, logistic, lognormal, ... poisson, Cauchy, exponential, standard_t, uniform, Weibull), allowing one to pull samples out for distribution testing.

Explore:

```python
import numpy as np
dir(np.random)
```

MVP Honored Contributor

I believe using a set (vs. a list) ensures that this is essentially sampling without replacement, albeit without the performance benefit of skipping values that have already been selected. Using a list would be sampling with replacement.
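A toy illustration of that set-vs-list point (the draws below are made up):

```python
draws = [7, 3, 7, 1, 3]   # random draws, with repeats

print(len(draws))         # 5 - a list keeps duplicates (with replacement)
print(len(set(draws)))    # 3 - a set silently drops them (without replacement)
```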

MVP Esteemed Contributor

A set reduces the sample size, however, so it depends upon the sample size needed and the set that is drawn; plus, the record is permanent.

Another way is to produce the array (potentially larger) and save it to disk (*.npy or *.npz), then draw from that by appending to a feature class.
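A sketch of that save-and-reload round trip with `np.save`/`np.load` (the temp path is illustrative):

```python
import os
import tempfile
import numpy as np

sample = np.random.randint(0, 10, 1000)   # the drawn array

path = os.path.join(tempfile.mkdtemp(), "sample.npy")
np.save(path, sample)                     # writes sample.npy to disk

restored = np.load(path)                  # read it back later,
                                          # e.g. before appending to a feature class
print(np.array_equal(sample, restored))   # True
```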