Select random group by DATE

RafaelRuas_Martins · ‎03-13-2017

Hi everyone.

I'd like to select randomly based on a case field.

I have a telemetry FC w/ more than 20k animal locations, w/ date/time field (yyyy-mm-dd hh:mm:ss).

I'd like to select points randomly, based on Date/Time field: ONE POINT RANDOMLY PER DAY.

If possible, the selected points must have more than 12 hours difference.

Can anyone help me with a code sample, please?

Tks.

DanPatterson_Retired · ‎03-13-2017

how many observations per day? one per hour?, per minute? per second? A batch query to sequentially select the records by day (Julian day if needed and I presume these are in sorted tabular form). Assign a random sequence to the day's records, then chose one record at random. so a Select by attributes, to get a list of records, assign a random sequence to those records and select one at random. ... move on to the next day. Which begs the question, you are going to all this trouble of randomizing the time in the day selection, why not go to a completely randomized approach or are you going to deal with separating out the stratified nature/issues as well?

RafaelRuas_Martins · ‎03-13-2017

Hi Dan.

As I sad, there are more than 20k records, coleted in 20 minutes step along almost one year. In order to minimize spatial dependence, the ideia is to select randomly 1 point per day. The amount of observations per day varies because GPS failed some times. We'd like one record per day randomly separately by 12h (minimum) to avoid the last record of a day and the first record of next day. Randomized all points approach will maybe compromise season sample to estimate home range. Sorry for my English, I hope you unterstand me.

RandyBurton · ‎03-13-2017

Just a rough idea of how you might do it, but it doesn't check separation by 12h.... Edit: I've added some code that should give you the 12h difference.

import time
from datetime import datetime
from random import randint

startDay = int(time.mktime(datetime.strptime('2010-01-01','%Y-%m-%d').timetuple()))
stopDay = int(time.mktime(datetime.strptime('2010-01-31','%Y-%m-%d').timetuple()))

randomTime = 0

for day in range(startDay, stopDay, 86400):
    if randomTime > 43200:
        randomTime = 43200 - (86400 - randomTime)
    else:
        randomTime =  0

    randomTime = randint(randomTime, 86399)
    dateString = datetime.fromtimestamp(day+randomTime).strftime('%Y-%m-%d %H:%M:%S')

    if randomTime < 43200:
        # morning
        print "SELECT TOP 1 column FROM table WHERE dateTimeField > '" + dateString + "' ORDER BY dateTimeField ASC"
    else:
        #afternoon
        print "SELECT TOP 1 column FROM table WHERE dateTimeField < '" + dateString + "' ORDER BY dateTimeField DESC"
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

RandyBurton · ‎03-13-2017

Another option might be to assume there are 72 records added per day. Then select at random one of those records. You could then use a "SELECT ... IN ( ...)" query. Edit: I've updated this to insure that records are separated by at least 36 (approximately 1/2 day's worth). It may not insure that one record per day is selected, but the average record is 1 day apart and should have a difference of at least 12h.

from random import randint

# assume 72 records added per day, aproximately 1 every 20 minutes
# 365 days * 72 records would total 26280 records
# record ID should be apart by at least 36

idList = []

randomNum = 0

for x in range(1, 26280, 72):
    if randomNum > 36:
        randomNum = 36 - (72 - randomNum)
    else:
        randomNum = 0

    randomNum = randint(randomNum, 72)
    idList.append(randomNum + x)
    # print randomNum + x
    
print "SELECT * FROM table WHERE ObjectID IN " + str(tuple(idList))
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

JoshuaBixby · ‎03-14-2017

There may be a single-pass solution, but my first thoughts went to a two-pass solution. The approach below relies on creating a new column (sel_field) that will be used to identify selected records using a value of 1. The first pass over the data set builds a dictionary containing selected records, the second pass updates the new field to reflect the selection.

from collections import defaultdict
from datetime    import datetime
from itertools   import groupby
from random      import sample, shuffle

fc =             # full path to feature class
dt_field =       # datetime field name
sel_field =      # selection field name
time_gap =       # minimum time gap between samples (seconds)

selected = defaultdict(int)

with arcpy.da.SearchCursor(fc, ["OID@", dt_field],
                           sql_clause=(None, "ORDER BY " + dt_field)) as cur:
    date_groups = groupby(cur, lambda x: datetime.date(x[1]))
    _, record_group = next(date_groups)
    (oid, dt), = sample(list(record_group), 1)
    selected[oid] = 1
    prev_dt = dt
    for _, record_group in date_groups:
        records = [(oid, dt) 
                   for oid, dt
                   in record_group
                   if (dt - prev_dt).total_seconds() > time_gap]
        if records:
            shuffle(records)
            oid, dt = next(iter(records))
            selected[oid] = 1
            prev_dt = dt
            
with arcpy.da.UpdateCursor(fc, ["OID@", sel_field]) as cur:
    for oid, _ in cur:
        cur.updateRow([oid, selected[oid]])‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍