topic Re: How can I randomly select a number of point features and retain their attributes? in Data Management Questions

How can I randomly select a number of point features and retain their attributes?

DarinJensen — Wed, 01 Jun 2016 15:55:36 GMT

I have >1m points in California and I want to randomly select ~10% and retain their attributes. "Create random points" doesn't work because it is creating, not selecting. I found a python script but can't make it work. Does anyone know a way using existing tools?

Re: How can I randomly select a number of point features and retain their attributes?

BillDaigle — Wed, 01 Jun 2016 16:02:23 GMT

How about adding a field and calculating a random number to it, then sorting by the new field and selecting the first 10%?
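A minimal sketch of this idea in plain Python, outside ArcGIS. The `ids` list and the `select_fraction` helper are hypothetical stand-ins for the attribute table and the field calculation, not ArcGIS tools: give each record a random key, sort on it, and keep the first 10%.

```python
import random

def select_fraction(ids, fraction, seed=None):
    """Return a random `fraction` of ids by sorting on a random key."""
    rng = random.Random(seed)
    # Equivalent to adding a field and calculating a random number into it.
    keyed = [(rng.random(), i) for i in ids]
    # Sort by the random key, then keep the first `fraction` of the rows.
    keyed.sort()
    k = int(round(len(ids) * fraction))
    return [i for _, i in keyed[:k]]

ids = list(range(1, 1001))            # hypothetical record IDs
picked = select_fraction(ids, 0.10, seed=42)
print(len(picked))                    # 100 records, each ID at most once
```

Unlike the "pick one digit out of ten" shortcut, this gives an exact count, at the cost of a sort.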

Re: How can I randomly select a number of point features and retain their attributes?

SteveLynch — Wed, 01 Jun 2016 16:12:14 GMT

SubsetFeatures would do it; however, it requires a Geostatistical Analyst license.

Re: How can I randomly select a number of point features and retain their attributes?

DanPatterson_Retired — Sat, 11 Dec 2021 14:22:12 GMT

add a numeric integer field....

I prefer to use a small code block, something like

import numpy as np
def rand_int():
    return np.random.randint(0, 10)

expression box --- > rand_int()

Now the above processes row by row, returning a random integer as described in the randint docstring:

"""
Return random integers from `low` (inclusive) to `high` (exclusive).

Return random integers from the "discrete uniform" distribution in the
"half-open" interval [`low`, `high`). If `high` is None (the default),
then results are from [0, `low`).
"""

>>> a = np.random.randint(0, 10, 1000000)
>>> b = a[a==5]
>>> len(b)
100247
>>>

which is pretty good: you get about 100,000 plus change (it will never be exact) out of 1 million values equal to 5.
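The size of the "plus change" can be estimated. Assuming each row independently gets one of the 10 digits with equal chance, the number of rows equal to 5 follows a binomial distribution, so a rough sketch of the expected count and its typical spread is:

```python
import math

n = 1_000_000      # number of rows
p = 1 / 10         # chance a row gets any one digit from randint(0, 10)

expected = n * p                       # mean of a binomial(n, p)
spread = math.sqrt(n * p * (1 - p))    # standard deviation of a binomial(n, p)

print(expected, spread)                # roughly 100000 and 300
```

The 100,247 in the run above is within one standard deviation of the expected count.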

Re: How can I randomly select a number of point features and retain their attributes?

JoshuaBixby — Wed, 01 Jun 2016 16:58:34 GMT

If you have the license, using Geostatistical Analyst is definitely the most robust and statistically sound method of working with random subsets.

Re: How can I randomly select a number of point features and retain their attributes?

DarinJensen — Wed, 01 Jun 2016 17:06:39 GMT

So easy! Made even easier by just limiting the random numbers to 10 and then selecting by attribute for one of the numbers. This is definitely a case of me overthinking when such a simple and elegant solution became obvious as soon as Bill set me on it. Thank you, sir!

Re: How can I randomly select a number of point features and retain their attributes?

DarrenWiens2 — Sat, 11 Dec 2021 14:22:15 GMT

I'm sure there are some gotchas with this, but quick & dirty, here's how you can do it without adding a new field:

>>> import random
... import arcpy  # already available in the ArcMap Python window
... fc = 'bc_geoname'  # your feature class
... arcpy.SelectLayerByAttribute_management(fc, "CLEAR_SELECTION")  # clear selection to consider all features
... feature_count = int(arcpy.GetCount_management(fc).getOutput(0))  # count features
... percent = 0.10  # enter your desired percentage
... rnd_set = set()  # create an empty set
... while len(rnd_set) < (feature_count * percent):  # do until your set is full
...     rnd_set.add(random.randint(0, feature_count - 1))  # make a random integer and try to add it to the set
... where = '"FID" in ({0})'.format(','.join(map(str, rnd_set)))  # include set in SQL where clause
... arcpy.SelectLayerByAttribute_management(fc, "NEW_SELECTION", where)  # select the FIDs in the set

This runs instantly on 40,000 features, so I assume it will work for you, unless it hits some memory limit.
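One gotcha worth a sketch: the loop above assumes the IDs run from 0 to feature_count-1, which holds for shapefile FIDs but not necessarily for geodatabase OBJECTIDs, which can have gaps after deletes. Assuming the actual IDs have first been read into a list (the `oids` list and `build_where` helper below are hypothetical; in arcpy the list could come from a search cursor), `random.sample` draws an exact number of unique IDs without a retry loop:

```python
import random

def build_where(id_field, oids, percent, seed=None):
    """Pick `percent` of the given IDs without repeats and build a where clause."""
    rng = random.Random(seed)
    k = int(round(len(oids) * percent))
    chosen = rng.sample(oids, k)          # k unique IDs taken from the real ID list
    id_list = ",".join(str(i) for i in sorted(chosen))
    return chosen, '"{0}" IN ({1})'.format(id_field, id_list)

oids = [1, 2, 3, 5, 8, 9, 12, 15, 16, 20]   # hypothetical IDs with gaps
chosen, where = build_where("OBJECTID", oids, 0.30, seed=1)
print(where)
```

The resulting string could then be passed as the where clause in the same way as in the snippet above.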

Re: How can I randomly select a number of point features and retain their attributes?

BillDaigle — Wed, 01 Jun 2016 17:13:50 GMT

May fail depending on the location of your data. Oracle has a limit of 1000 items inside an "IN" clause. I'm not sure what the limits are for file gdbs or other databases.
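If the IN-list limit is a concern, one possible workaround is to split the IDs into several shorter lists joined with OR. This is only a sketch: the `chunked_in_clause` helper is hypothetical and has not been run against Oracle or any other database here, and very long where clauses may hit other limits.

```python
def chunked_in_clause(field, ids, max_items=1000):
    """Build 'field IN (...) OR field IN (...)' with at most max_items per list."""
    ids = list(ids)
    parts = []
    for start in range(0, len(ids), max_items):
        chunk = ids[start:start + max_items]
        parts.append("{0} IN ({1})".format(field, ",".join(map(str, chunk))))
    # An empty ID list yields an empty string, which the caller should handle.
    return " OR ".join(parts)

where = chunked_in_clause("OBJECTID", range(1, 2501))   # hypothetical IDs
print(where.count(" IN ("))   # 3 lists: 1000 + 1000 + 500 items
```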

Re: How can I randomly select a number of point features and retain their attributes?

DarrenWiens2 — Wed, 01 Jun 2016 17:16:37 GMT

Shapefile, ftw.

Re: How can I randomly select a number of point features and retain their attributes?

DanPatterson_Retired — Wed, 01 Jun 2016 17:21:05 GMT

Darren has a good point and raises another important distinction.

One advantage of adding a field is that you have 10 (in this case) samples which don't overlap and will never choose the same record, if you had to pull out several samples of about 10%. This is sampling without replacement, for doing things like t-tests. Darren's approach should be used if you need sampling with replacement and those associated tests.

Both methods have their ups and downs and one should know both.

It is also important to note that random.randint and randominteger are but 2 of a bunch of distributions you can draw from, i.e. beta, binomial, chisquare, f, gamma, geometric, gumbel, hypergeometric, laplace, logistic, lognormal, .... poisson, Cauchy, exponential, standard_t, uniform, Weibull, allowing one to pull samples out for distribution testing.

explore

import numpy as np

dir(np.random)

Re: How can I randomly select a number of point features and retain their attributes?

DarrenWiens2 — Wed, 01 Jun 2016 17:27:06 GMT

I believe using a set (vs. list) ensures that this is essentially sampling without replacement, without the performance benefit of ignoring pre-selected values. Using a list would be sampling with replacement.
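The difference between the two containers can be sketched in plain Python. The population of 100 IDs and the seed are arbitrary assumptions for illustration:

```python
import random

rng = random.Random(0)
population = list(range(100))     # hypothetical record IDs
k = 10

# Growing a set until it holds k items: duplicate draws are silently dropped,
# so the result has exactly k distinct records (no record appears twice).
picked_set = set()
while len(picked_set) < k:
    picked_set.add(rng.randint(0, len(population) - 1))

# Appending k draws to a list: the same record may appear more than once.
picked_list = [rng.randint(0, len(population) - 1) for _ in range(k)]

print(len(picked_set), len(set(picked_list)))
```

The set always ends up with k distinct IDs, while the list may hold fewer than k distinct IDs.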

Re: How can I randomly select a number of point features and retain their attributes?

DanPatterson_Retired — Wed, 01 Jun 2016 17:31:06 GMT

A set reduces the sample size, however, so it depends upon the sample size needed and the set that is drawn, plus the record is permanent.

Another way is to produce the array (potentially larger) and save the array to disk (*.npy or .npz), then draw from that by appending to a featureclass.

Re: How can I randomly select a number of point features and retain their attributes?

DarrenWiens2 — Wed, 01 Jun 2016 17:36:57 GMT

I'm not sure how that applies to my example. I'm not making 10% number of choices, I'm growing the set until it reaches 10% of the size of the feature count. The sample size has not been reduced by the set.

Re: How can I randomly select a number of point features and retain their attributes?

DanPatterson_Retired — Wed, 01 Jun 2016 17:48:00 GMT

In brief... amongst other things, there is sampling with replacement, sampling without replacement, producing samples, producing separate samples that can be replicated, sampling within samples and sampling from known distributions in those cases. Too much detail for here, but Cross Validated has some great threads, and Bill Huber (whuber for those that know him from the GIS days) has some quite insightful information in his threads if you are interested in statistics and its applications in GIS.

Re: How can I randomly select a number of point features and retain their attributes?

DarrenWiens2 — Wed, 01 Jun 2016 18:01:11 GMT

Not trying to be too confrontational, but none of that relates to my example fitting under "sampling with replacement". Even if you technically replace the possible choices (like using random does), but ignore duplicates (like using a set does), that is still sampling without replacement (you can ignore the probability of selecting a duplicate). Maybe I'm wrong, but I'd like an explanation rather than a link to an entire stats forum.

Maybe the point you're missing and I need to be explicit about is that this only works using the FID field, which is guaranteed to be unique.

Re: How can I randomly select a number of point features and retain their attributes?

DanPatterson_Retired — Sat, 11 Dec 2021 14:22:17 GMT

Darren, by producing your framework once, you can ensure that you can draw from multiple samples in a variety of ways. That is the only point I was making. In the case of ensuring that you don't have duplication of a record, you divide your sample into almost equal sizes (there are exact ones, but we will go with an easy one).

Simple case

I just need 10% of the sample .. just do it once

I need to do it again, and duplicate observations don't matter ... use either of the two proposed methods

I need a 10%, but no duplicate observations... Your proposal requires checking that a record hasn't been selected... no biggy

Let's set up some scenarios:

>>> a = np.random.randint(0, 10, 1000)
>>> size = np.array([len(a[a==i]) for i in range(10)])
>>> size
array([ 93,  99, 100, 115, 108, 115,  86,  93,  94,  97])
>>> size.mean()
100.0

10% sample, draw 50% from 1 of our 10 samples and 50% from another of our samples...this ensures no duplicates

>>> d = np.concatenate((a[a==1][:50], a[a==3][:50]))   # draw 50 from a==1 and a==3 or
>>> d
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3])

or any combination of

np.concatenate((a[a==X][:50], a[a==Y][:50]))  # where X !=Y  lots of combinations, absolutely no duplication

Now you could vary the proportions if you want as well, here is an example 3 proportions are used

>>> # pull 30, 60, 10 out of a == 1, 3, 5 proportionally
>>> np.concatenate((a[a==1][:30], a[a==3][:60], a[a==5][:10]))
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5])
>>>

Now if you want replacement, you can just replace the == with either >=, <= or combinations thereof, and you can vary your proportions in both cases by changing your slices. All of these have been for a uniform distribution, but the principles apply to other distributions as well.
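The digit approach gives groups of roughly equal size. If exactly equal, non-overlapping groups are wanted, one possible sketch is to shuffle the IDs once and deal them out. This uses plain Python rather than numpy, and the `disjoint_groups` helper and the ID range are hypothetical:

```python
import random

def disjoint_groups(ids, n_groups, seed=None):
    """Shuffle ids once and deal them into n_groups non-overlapping groups."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    # Every n_groups-th item goes to the same group; sizes differ by at most 1.
    return [shuffled[g::n_groups] for g in range(n_groups)]

groups = disjoint_groups(range(1000), 10, seed=7)
print([len(g) for g in groups])     # ten groups of 100

# A 10% sample built from half of one group and half of another has no duplicates.
sample = groups[1][:50] + groups[3][:50]
```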

Re: How can I randomly select a number of point features and retain their attributes?

DarrenWiens2 — Wed, 01 Jun 2016 18:57:51 GMT

I think I see what you're getting at. I was thinking of replacement within the initial sample, and you're talking about other parallel samples. It's possible that what I'm describing has a different term.

Re: How can I randomly select a number of point features and retain their attributes?

DanPatterson_Retired — Wed, 01 Jun 2016 19:16:10 GMT

The terminology is not consistent among disciplines.

Sampling without replacement generally means that once an observation is pulled, it is out of the selection pool and not replaced with another record... or if the records (aka population) are fixed, it is outta there. So if you want to do another sample, that record is gone and you have to pull from the remainder.

Sampling with replacement basically means (for a finite population) a record can be selected more than once or not at all, if you draw many samples from the same number of records (population).

This introduces tricky stuff if you want to compare samples drawn from the same population... you don't want a record selected twice, since it would appear in two (or more) samples. You just want to see if randomly drawn samples (since the assigning of the 1 to 10 is the only random part) share the same characteristics. If the samples don't have the same characteristics, then the population cannot be adequately described by a sample. In which case, you have to draw more samples from the population, increase your population size and/or find out what is causing the differentiation in the primary variable in the first place. Classic case... let's say precipitation in BC. I am sure you wouldn't just take a random number of locations (say 10%) and away you go... I would venture that even taking 50% of the stations would be dicey. So in a case like that, you might want to stratify your sample based on elevation and/or proximity to the coast and/or ad nauseam, and then if you still had too many, you might want to sample from the stratified sample.

So in short, the approach one takes depends on what you have to do after. All approaches are valid, but some sampling strategies can be interesting, like the varying proportions from a 10% random sample. You can really play around with the numbers in a quasi-automated way to help determine appropriate sample size, whether the population is uniform, etc.

I guess we should have branched this off to a discussion.
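The stratification idea mentioned in this post can be sketched as follows. The stations, the elevation bands and the `stratified_sample` helper are all made up for illustration; a real workflow would derive the strata from the data.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Sample `fraction` of the records within each stratum given by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for members in strata.values():
        k = max(1, int(round(len(members) * fraction)))  # at least one per stratum
        sample.extend(rng.sample(members, k))            # no repeats within a stratum
    return sample

# Hypothetical stations: (station id, elevation band)
stations = [(i, "low" if i < 60 else "mid" if i < 90 else "high") for i in range(100)]
picked = stratified_sample(stations, key=lambda s: s[1], fraction=0.10, seed=3)
print(len(picked))   # 6 low + 3 mid + 1 high = 10
```

Each band contributes in proportion to its size, so a small band such as "high" is not left out by chance.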

Re: How can I randomly select a number of point features and retain their attributes?

TimothyStoebner — Thu, 02 Jun 2016 15:59:13 GMT

Actually the Create Random Points tool does work. Just do a spatial join with your existing data and the new random points. Make sure you use your existing layer as a "Constraining Feature Class" in the creating random points tool. You could also use select by location instead of spatial join.

Re: How can I randomly select a number of point features and retain their attributes?

DarrenWiens2 — Thu, 02 Jun 2016 16:09:14 GMT

I believe you'd never be guaranteed to get 10% of the points using this method because some of the new random points would be closest to duplicate original points, or vice versa. But perhaps you could provide an example.

edit: fitting to the previous discussion, this would be an example of sampling with replacement.