I have >1m points in California and I want to randomly select ~10% and retain their attributes. "Create random points" doesn't work because it is creating, not selecting. I found a python script but can't make it work. Does anyone know a way using existing tools?
I believe using a set (vs. list) ensures that this is essentially sampling without replacement, without the performance benefit of ignoring pre-selected values. Using a list would be sampling with replacement.
A set reduces the sample size, however, so it depends on the sample size needed and the set that is drawn, plus the record is permanent.
Another way is to produce the array (potentially larger) and save it to disk (*.npy or *.npz), then draw from that by appending to a feature class.
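A minimal sketch of the save-then-draw idea, assuming NumPy only. The record count, the file name and the temporary folder are made up for illustration, and appending the selected records to a feature class is left out:

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(42)          # fixed seed so the framework can be rebuilt
n_records = 1000                         # hypothetical feature count
labels = rng.integers(0, 10, n_records)  # one random group label (0-9) per record

# Save the labels once, then reload them whenever a sample is needed.
path = os.path.join(tempfile.mkdtemp(), "sample_labels.npy")  # hypothetical location
np.save(path, labels)
reloaded = np.load(path)

# The records labelled 0 form one sample of roughly 10%.
sample_rows = np.flatnonzero(reloaded == 0)
```

Because the labels live on disk, the same draw can be repeated later, or a different group can be pulled, without regenerating the random numbers.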
I'm not sure how that applies to my example. I'm not making 10% number of choices, I'm growing the set until it reaches 10% of the size of the feature count. The sample size has not been reduced by the set.
In brief... amongst other things, there is sampling with replacement, sampling without replacement, producing samples, producing separate samples that can be replicated, sampling within samples, and sampling from known distributions in those cases. Too much detail for here, but Cross Validated has some great threads, and Bill Huber (whuber, for those who know him from the GIS days) has some quite insightful information in his threads if you are interested in statistics and its applications in GIS.
Not trying to be too confrontational, but none of that relates to my example fitting under "sampling with replacement". Even if you technically replace the possible choices (like using random does), but ignore duplicates (like using a set does), that is still sampling without replacement (you can ignore the probability of selecting a duplicate). Maybe I'm wrong, but I'd like an explanation rather than a link to an entire stats forum.
Maybe the point you're missing and I need to be explicit about is that this only works using the FID field, which is guaranteed to be unique.
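For reference, the grow-a-set approach described above can be sketched roughly as follows. This is not the original script; the function name, the seed argument and the range of IDs standing in for FID values are all assumptions:

```python
import random


def sample_ids(ids, fraction=0.10, seed=None):
    """Grow a set of randomly chosen IDs until it holds `fraction` of them."""
    rng = random.Random(seed)
    ids = list(ids)
    target = int(len(ids) * fraction)
    chosen = set()
    while len(chosen) < target:
        chosen.add(rng.choice(ids))  # a repeated pick leaves the set unchanged
    return chosen


fids = range(1000)                   # stand-in for the unique FID values
picked = sample_ids(fids, 0.10, seed=1)

# The standard library draws without replacement in a single call as well.
direct = random.Random(1).sample(list(fids), 100)
```

Both results contain 100 distinct IDs. The set version relies on the IDs being unique, which is why it only works on a field like FID.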
Darren, by producing your framework once, you can ensure that you can draw from multiple samples in a variety of ways. That is the only point I was making. In the case of ensuring that you don't have duplication of a record, you divide your sample into almost equal sizes (there are exact ones, but we will go with an easy one).
Simple case
I just need 10% of the sample .. just do it once
I need to do it again and duplicate observations don't matter... use either of the two proposed methods
I need 10%, but no duplicate observations... your proposal requires checking that a record hasn't been selected... no biggie
Let's set up some scenarios:
>>> import numpy as np
>>> a = np.random.randint(0, 10, 1000)
>>> size = np.array([len(a[a==i]) for i in range(10)])
>>> size
array([ 93,  99, 100, 115, 108, 115,  86,  93,  94,  97])
>>> size.mean()
100.0
For a 10% sample, draw 50% from one of our 10 samples and 50% from another... this ensures no duplicates
>>> d = np.concatenate((a[a==1][:50], a[a==3][:50]))  # draw 50 from a==1 and 50 from a==3
>>> d
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
or any combination of
np.concatenate((a[a==X][:50], a[a==Y][:50]))  # where X != Y; lots of combinations, absolutely no duplication
Now you could vary the proportions if you want as well. Here is an example where three proportions are used:
>>> # pull 30, 60, 10 out of a == 1, 3, 5 proportionally
>>> np.concatenate((a[a==1][:30], a[a==3][:60], a[a==5][:10]))
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5])
Now if you want replacement, you can just replace the == with either >=, <= or combinations thereof and you can vary your proportions in both cases by changing your slices. All of these have been for a uniform distribution, the principles also apply to other distributions as well.
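The draws above return the group labels themselves rather than records. To pull actual rows, the same idea can be sketched with row indices. This sketch uses the exact-size variant (a shuffled array of 100 copies of each label) instead of randint, and the helper name `draw` is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Exactly 100 records per group: shuffle 100 copies of each label 0-9.
labels = rng.permutation(np.repeat(np.arange(10), 100))


def draw(labels, counts):
    """counts maps a group label to the number of rows to take from that group."""
    parts = [np.flatnonzero(labels == g)[:n] for g, n in counts.items()]
    return np.concatenate(parts)


half_half = draw(labels, {1: 50, 3: 50})     # 50 rows each from groups 1 and 3
mixed = draw(labels, {1: 30, 3: 60, 5: 10})  # 30/60/10 from groups 1, 3 and 5
```

Each result holds 100 distinct row indices, since the groups never overlap and each row carries exactly one label.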
I think I see what you're getting at. I was thinking of replacement within the initial sample, and you're talking about other parallel samples. It's possible that what I'm describing has a different term.
The terminology is not consistent among disciplines.
Sampling without replacement generally means that once an observation is pulled, it is out of the selection pool and is not replaced with another record. If the set of records (aka the population) is fixed, it is outta there. So if you want to draw another sample, that record is gone and you have to pull from the remainder.
Sampling with replacement basically means (for a finite population) a record can be selected more than once or not at all, if you draw many samples from the same number of records (population).
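A small sketch of the two definitions using NumPy's random Generator. The population of 1000 record IDs and the seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
ids = np.arange(1000)  # hypothetical population of record IDs

# Without replacement: every ID appears at most once in the sample.
without = rng.choice(ids, size=100, replace=False)

# With replacement: an ID may appear more than once, or not at all.
with_repl = rng.choice(ids, size=100, replace=True)

# A second sample without replacement has to come from the remainder.
remaining = np.setdiff1d(ids, without)
second = rng.choice(remaining, size=100, replace=False)
```

Here `without` and `second` never share a record, while `with_repl` may repeat one.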
This introduces tricky stuff if you want to compare samples drawn from the same population... you don't want a record selected twice, since it would appear in two (or more) samples. You just want to see if randomly drawn samples (since the assigning of the 1 to 10 is the only random part) share the same characteristics. If the samples don't have the same characteristics, then the population cannot be adequately described by a sample. In which case, you have to draw more samples from the population, increase your population size and/or find out what is causing the differentiation in the primary variable in the first place. Classic case... let's say precipitation in BC. I am sure you wouldn't just take a random number of locations (say 10%) and away you go... I would venture that even taking 50% of the stations would be dicey. So in a case like that, you might want to stratify your sample based on elevation and/or proximity to the coast and/or ad nauseam, and then if you still had too many, you might want to sample from the stratified sample.
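As a rough illustration of stratifying before sampling, assuming made-up station elevations and arbitrary band edges (none of these numbers come from real BC data):

```python
import numpy as np

rng = np.random.default_rng(3)
elevation = rng.uniform(0, 3000, 500)   # hypothetical station elevations (m)
edges = [500, 1500]                     # hypothetical elevation band edges (m)
strata = np.digitize(elevation, edges)  # band index 0, 1 or 2 per station

fraction = 0.10
picked = []
for s in np.unique(strata):
    members = np.flatnonzero(strata == s)           # stations in this band
    k = max(1, round(len(members) * fraction))      # at least one per band
    picked.append(rng.choice(members, size=k, replace=False))
picked = np.concatenate(picked)
```

Every band contributes about 10% of its own stations, so a sparsely populated band is not left out by chance.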
So in short, the approach one takes depends on what you have to do afterwards. All approaches are valid, but some sampling strategies can be interesting, like the varying proportions from a 10% random sample. You can really play around with the numbers in a quasi-automated way to help determine appropriate sample size, whether the population is uniform, etc.
I guess we should have branched this off to a discussion.
May fail depending on the location of your data. Oracle has a limit of 1000 items inside an "IN" clause. I'm not sure what the limits are for file gdbs or other databases.
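One possible workaround, sketched under the assumption that the database accepts several IN lists joined with OR. The function name and the OBJECTID field name are placeholders:

```python
def chunked_in_clause(field, ids, max_items=1000):
    """Build a where clause whose IN lists each hold at most max_items IDs."""
    ids = sorted(ids)
    chunks = [ids[i:i + max_items] for i in range(0, len(ids), max_items)]
    parts = [
        "{} IN ({})".format(field, ", ".join(str(i) for i in chunk))
        for chunk in chunks
    ]
    return " OR ".join(parts)


# 2500 IDs are split into three IN lists of 1000, 1000 and 500 items.
where = chunked_in_clause("OBJECTID", range(1, 2501))
```

For a million-point layer the clause still gets very long, so writing the selected IDs to a table and joining on it may be the more practical route.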
Shapefile, ftw. 
