Creating data for testing purposes

04-04-2016 08:32 AM
DanPatterson_Retired
MVP Emeritus

Have you ever come across a situation like one of these:

  • you need to test out something but don't have the data
  • you are sick of trying to get a function to work in the field calculator
  • you want to test out one of ArcMap's functions but none of your data are suitable
  • all you need are some points which have a particular distribution
  • someone forgot to post a sample of their data on GeoNet for testing and you don't have a match
  • you forgot to collect something in the field

Well, this lesson is for you.  It is a culmination of a number of the previous lessons and a few
NumPy Snippets and Before I Forget posts.  I have attached a script to this post below.

There is also a GitHub repository that takes this one step further providing more output options... see Silly on GitHub

The following provides the basic requirements to operate a function should you choose not to
incorporate the whole thing.  Obviously, the header section enclosed within triple quotes
isn't needed but the import section is.

# -*- coding: UTF-8 -*-
"""
:Script:   random_data_demo.py
:Author:   Dan.Patterson AT carleton.ca
:Modified: 2015-08-29
:Purpose:
:  Generate an array containing random data.  Optional fields include:
:  ID, Shape, text, integer and float fields
:Notes:
:  The numpy imports are required for all functions
"""
#-----------------------------------------------------------------------------
# Required imports

from functools import wraps
import numpy as np
import numpy.lib.recfunctions as rfn
np.set_printoptions(edgeitems=5, linewidth=75, precision=2,
                    suppress=True,threshold=5,
                    formatter={'bool': lambda x: repr(x.astype('int32')),
                               'float': '{: 0.2f}'.format})
#-----------------------------------------------------------------------------
# Required constants  ... see string module for others
str_opt = ['0123456789',
           '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~',
           'abcdefghijklmnopqrstuvwxyz',
           'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
           'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
           ]
#-----------------------------------------------------------------------------
# decorator
def func_run(func):
    """Prints basic function information and the results of a run.
    :Required:  from functools import wraps
    """
    @wraps(func)
    def wrapper(*args,**kwargs):
        print("\nFunction... {}".format(func.__name__))
        print("  args.... {}\n  kwargs.. {}".format(args, kwargs))
        print("  docs.... \n{}".format(func.__doc__))
        result = func(*args, **kwargs)
        print("{!r:}\n".format(result))  # comment out if results not needed
        return result                    # for optional use outside.
    return wrapper
#-----------------------------------------------------------------------------
# functions

Before I go any further, let's have a look at the above code.

  • from functools import wraps - I will be using decorators to control output, and wraps handles all the fiddly stuff in decorators (see Before I Forget # 14)
  • import numpy.lib.recfunctions as rfn - a useful module for working with ndarrays and recarrays in particular
  • np.set_printoptions - allows you to control how arrays are formatted when printing or working from the command line.  Most of the parameters are self-explanatory or you will soon get the drift
  • func_run - the decorator function presented in BIF # 14.
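To see what wraps actually buys you, here is a minimal sketch comparing a decorator written without it to one written with it.  The names plain_decorator, wrapped_decorator, add_plain and add_wrapped are made up for this illustration and are not part of the script.

```python
from functools import wraps


def plain_decorator(func):
    """Decorator that does NOT use functools.wraps."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def wrapped_decorator(func):
    """Decorator that uses functools.wraps, as func_run does."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain_decorator
def add_plain(a, b):
    """Add two numbers."""
    return a + b


@wrapped_decorator
def add_wrapped(a, b):
    """Add two numbers."""
    return a + b


# Without wraps the decorated function reports the wrapper's name and no docs
print(add_plain.__name__, add_plain.__doc__)
# With wraps the original name and docstring are carried over
print(add_wrapped.__name__, add_wrapped.__doc__)
```

Both versions return the same result when called; only the metadata seen from outside differs.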

Now back to the main point.  If you would like to generate data with some control over the output, the functions below do so and put the results together into a standalone table or feature class.

An example follows:

Array generated....
array([(0, (7.0, 1.0), 'E0', '(0,0)', 'A', 'ARXYPJ', 'cat', 'Bb', 0, 9.380410289877375),
       (1, (2.0, 9.0), 'D0', '(4,0)', 'B', 'RAMKH', 'cat', 'Aa', 9, 1.0263298179133362),
       (2, (5.0, 8.0), 'C0', '(1,0)', 'B', 'EGWSC', 'cat', 'Aa', 3, 2.644448491753841),
       (3, (9.0, 7.0), 'A0', '(1,0)', 'A', 'TMXZSGHAKJ', 'dog', 'Aa', 8, 6.814471938888746),
       (4, (10.0, 3.0), 'E0', '(1,0)', 'B', 'FQZCTDEY', '-1', 'Aa', 10, 2.438467639965038)], 
       ............. < snip >
      dtype=[('ID', '<i4'), ('Shape', [('X', '<f8'), ('Y', '<f8')]), 
             ('Colrow', '<U2'), ('Rowcol', '<U5'), ('txt_fld', '<U1'), 
             ('str_fld', '<U10'), ('case1_fld', '<U3'), ('case2_fld', '<U2'), 
             ('int_fld', '<i4'), ('float_fld', '<f8')])

Here are the code snippets...

Code snippets
def pnts_IdShape(N=10, x_min=0, x_max=10, y_min=0, y_max=10, simple=False):
    """  Create an array with a nested dtype which emulates a shapefile's
    :  data structure.  This array is used to append other arrays to enable
    :  import of the resultant into ArcMap.  Array construction, after hpaulj
    :  http://stackoverflow.com/questions/32224220/ 
    :    methods-of-creating-a-structured-array
    :simple - False (default) gives the nested X, Y dtype shown in the sample
    :         output, True gives a single two-value 'Shape' field
    """
    Xs = np.random.randint(x_min, x_max, size=N)
    Ys = np.random.randint(y_min, y_max, size=N)
    IDs = np.arange(0, N)
    c_stack = np.column_stack((IDs, Xs, Ys))
    if simple:     # version 1
        dt = [('ID', '<i4'),('Shape', '<f8', (2,))]  # short version, optional form
        a = np.ones(N, dtype=dt)
        a['ID'] = c_stack[:, 0]
        a['Shape'] = c_stack[:, 1:]                  # this line too
    else:          # version 2
        dt = [('ID', '<i4'), ('Shape', ([('X', '<f8'),('Y', '<f8')]))]
        a = np.ones(N, dtype=dt)
        a['Shape']['X'] = c_stack[:, 1]
        a['Shape']['Y'] = c_stack[:, 2]
        a['ID'] = c_stack[:, 0]
    return a
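
The two dtype forms used in pnts_IdShape hold the same numbers but are addressed differently.  A small sketch, using made-up coordinates, shows the difference: the short form stores the coordinates as a two-column subarray, while the nested form gives X and Y their own field names.

```python
import numpy as np

N = 3
xy = np.array([[7.0, 1.0], [2.0, 9.0], [5.0, 8.0]])  # made-up coordinates

# version 1: 'Shape' is a single field holding a 2-element subarray
dt_simple = [('ID', '<i4'), ('Shape', '<f8', (2,))]
a1 = np.ones(N, dtype=dt_simple)
a1['ID'] = np.arange(N)
a1['Shape'] = xy

# version 2: 'Shape' is a nested record with named X and Y fields
dt_nested = [('ID', '<i4'), ('Shape', [('X', '<f8'), ('Y', '<f8')])]
a2 = np.ones(N, dtype=dt_nested)
a2['ID'] = np.arange(N)
a2['Shape']['X'] = xy[:, 0]
a2['Shape']['Y'] = xy[:, 1]

print(a1['Shape'][:, 0])   # X values by column position
print(a2['Shape']['X'])    # X values by field name
```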



def colrow_txt(N=10, cols=2, rows=2, zero_based=True):
    """  Produce spreadsheet like labels either 0- or 1-based.
    :N  - number of records/rows to produce.
    :cols/rows - this combination will control the output of the values
    :cols=2, rows=2 - yields (A0, A1, B0, B1)
    :  as optional classes regardless of the number of records being produced
    :zero-based - True for conventional array structure,
    :             False for spreadsheet-style classes
    """

    if zero_based:
        start = 0
    else:
        start = 1; rows = rows + 1
    UC = (list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))[:cols]  # see constants
    dig = (list('0123456789'))[start:rows]
    cr_vals = [c + r for r in dig for c in UC]
    colrow = np.random.choice(cr_vals,N)
    return colrow


Yields

array(['D0', 'E0', 'C0', 'E0', 'C0', 'C0', 'D0', 'D0', 'E0', 'D0'], 
      dtype='<U2')
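
If you are unsure which labels a given cols/rows combination can produce, it may help to look at the pool that is sampled from.  The sketch below repeats the label-building lines of colrow_txt in a helper called label_pool (a made-up name, not part of the script) and leaves out the random draw.

```python
def label_pool(cols=2, rows=2, zero_based=True):
    """Return the list of labels that colrow_txt samples from."""
    if zero_based:
        start = 0
    else:
        start = 1
        rows = rows + 1
    UC = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")[:cols]
    dig = list('0123456789')[start:rows]
    return [c + r for r in dig for c in UC]


print(label_pool(2, 2, True))    # ['A0', 'B0', 'A1', 'B1']
print(label_pool(2, 2, False))   # ['A1', 'B1', 'A2', 'B2']
print(label_pool(5, 1, True))    # ['A0', 'B0', 'C0', 'D0', 'E0']
```

A pool like the last one (cols=5, rows=1) would be consistent with the output shown above, although the arguments used for that run are not stated.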
def rowcol_txt(N=10,rows=2,cols=2):
    """  Produce array-like labels in a tuple format.
    """
    rc_vals = ["({},{})".format(r, c) for c in range(cols) for r in range(rows)]
    rowcol = np.random.choice(rc_vals, N)
    return rowcol


Yields

array(['(2,0)', '(2,0)', '(4,0)', '(0,0)', '(4,0)', '(2,0)', '(4,0)',
       '(0,0)', '(2,0)', '(0,0)'], 
      dtype='<U5')

def rand_text(N=10,cases=3,vals=str_opt[3]):
    """  Generate N samples from the letters of the alphabet denoted by the 
    :  number of cases.  If you want greater control on the text and
    :  probability, see rand_case or rand_str.
    :
    : vals:  see str_opt in required constants section
    """
    vals = list(vals)
    txt_vals = np.random.choice(vals[:cases],N)
    return txt_vals

Yields
array(['C', 'C', 'C', 'B', 'A', 'B', 'A', 'C', 'C', 'C'], 
      dtype='<U1')
def rand_str(N=10,low=1,high=10,vals=str_opt[3]):
    """  Returns N strings, each built from 'size' random letters
    : - create the cases as a list:  string.ascii_lowercase or ascii_uppercase etc
    : - determine how many letters. Ensure min <= max. Add 1 to max to alleviate low==high
    : - shuffle the case list each time through loop
    """
    vals = list(vals)
    letts = np.arange(min([low,high]),max([low,high])+1)  # number of letters
    result = []
    for i in range(N):
        np.random.shuffle(vals)
        size = np.random.choice(letts)     # a single integer, usable in a slice
        result.append("".join(vals[:size]))
    result = np.array(result)
    return result

Yields
array(['ZDULHYJSB', 'LOSZJNB', 'PKECZOIJ', 'ZV', 'DENCBP', 'XRNITEJ',
       'HJMDLBNSEF', 'DWLYPQF', 'HZOUTBSLN', 'MOEXR'], 
      dtype='<U10')
def rand_case(N=10,cases=["Aa","Bb"],p_vals=[0.8,0.2]):
    """  Generate N samples from a list of classes with an associated probability
    :  ensure: len(cases)==len(p_vals) and  sum(p_vals) == 1
    :  small sample sizes will probably not yield the desired p-values
    """
    p = np.rint(np.array(p_vals)*N).astype(int)   # convert to integer counts
    kludge = [np.repeat(cases[i], p[i]).tolist() for i in range(len(cases))]
    case_vals = np.array([val for i in range(len(kludge)) for val in kludge[i]])
    np.random.shuffle(case_vals)
    return case_vals

Yields
array(['cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat',
       'fish'], 
      dtype='<U4')
# or
array(['Aa', 'Bb', 'Aa', 'Aa', 'Aa', 'Aa', 'Bb', 'Aa', 'Aa', 'Aa'], 
      dtype='<U2')
def rand_int(N=10,begin=0,end=10):
    """  Generate N random integers within the range begin - end
    """
    int_vals = np.random.randint(begin,end,size=(N))
    return int_vals

Yields
array([7, 1, 4, 1, 6, 4, 5, 2, 2, 2])
def rand_float(N=10,begin=0,end=10):
    """  Generate N random floats within the range begin - end
    Technically, N random integers are produced then a random
    amount within 0-1 is added to the value
    """
    float_vals = np.random.randint(begin,end,size=(N))  # randint excludes end
    float_vals = float_vals + np.random.rand(N)
    return float_vals

Yields
array([ 8.40,  9.09,  0.90,  9.64,  8.63,  5.05,  2.07,  8.13,  9.91,  0.22])

The above functions can be used with the main portion of the script and your own function.

Sample function
# required imports
# required constants
# pnts_IdShape  function
# rand_case  function
# rand_int  function
def blog_post():
    """sample run"""
    N = 10
    id_shape = pnts_IdShape(N,x_min=300000,x_max=300500,y_min=5000000,y_max=5000500)
    case1_fld = rand_case(N,cases=['cat','dog','fish'],p_vals=[0.6,0.3,0.1])
    int_fld = rand_int(N,begin=0,end=10)
    fld_names = ['Pets','Number']
    fld_data = [case1_fld,int_fld]
    arr = rfn.append_fields(id_shape,fld_names,fld_data,usemask=False)
    return arr

if __name__ == '__main__':
    """create ID,Shape,{txt_fld,int_fld...of any number}
    """
    returned = blog_post()
array([(0, (300412.0, 5000473.0), 'dog', 4),
       (1, (300308.0, 5000043.0), 'cat', 4),
       (2, (300443.0, 5000170.0), 'dog', 5),
       (3, (300219.0, 5000240.0), 'cat', 0),
       (4, (300444.0, 5000067.0), 'cat', 9),
       (5, (300486.0, 5000106.0), 'cat', 3),
       (6, (300242.0, 5000145.0), 'cat', 5),
       (7, (300038.0, 5000341.0), 'dog', 7),
       (8, (300335.0, 5000495.0), 'cat', 9),
       (9, (300345.0, 5000108.0), 'fish', 7)], 
      dtype=[('ID', '<i4'), ('Shape', [('X', '<f8'), ('Y', '<f8')]), 
                  ('Pets', '<U4'), ('Number', '<i4')])

You will notice in the above example that the rand_case function was used to assign the type of pet based upon p-values of 0.6, 0.3 and 0.1, with cats being favored, as they should be, and
this is reflected in the data.  The coordinates in this example were left as integers, reflecting a 1m resolution.
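
A quick way to check the claim is to tally the classes with np.unique.  The sketch below copies the 'Pets' column from the sample output above into a small array; in practice you would pass the field from the returned array instead.

```python
import numpy as np

# 'Pets' column copied from the sample output above
pets = np.array(['dog', 'cat', 'dog', 'cat', 'cat',
                 'cat', 'cat', 'dog', 'cat', 'fish'])
labels, counts = np.unique(pets, return_counts=True)
for lab, cnt in zip(labels, counts):
    # class, count and observed proportion
    print("{:<5} {:>2}  ({:.1f})".format(lab, cnt, cnt / len(pets)))
```

For this run the tally is 6 cats, 3 dogs and 1 fish, which matches the requested 0.6, 0.3 and 0.1.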

It is possible to add a random perturbation of floating point values in the range +/- 0.99 to add centimeter values if you desire.

This is not shown here, but I can provide the example if needed.
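
For anyone who wants to try it in the meantime, one possible sketch of such a perturbation follows.  The helper name perturb_xy and its max_shift parameter are made up for this illustration, and the code assumes the nested X, Y form of the 'Shape' field shown in the output above.

```python
import numpy as np


def perturb_xy(a, max_shift=0.99):
    """Return a copy of a with uniform noise added to Shape X and Y.
    Hypothetical helper, assumes the nested ('Shape', [('X'..), ('Y'..)]) dtype.
    """
    out = a.copy()
    n = len(out)
    out['Shape']['X'] += np.random.uniform(-max_shift, max_shift, n)
    out['Shape']['Y'] += np.random.uniform(-max_shift, max_shift, n)
    return out


# three points taken from the sample output, rebuilt here so the block runs alone
dt = [('ID', '<i4'), ('Shape', [('X', '<f8'), ('Y', '<f8')])]
pnts = np.zeros(3, dtype=dt)
pnts['ID'] = np.arange(3)
pnts['Shape']['X'] = [300412.0, 300308.0, 300443.0]
pnts['Shape']['Y'] = [5000473.0, 5000043.0, 5000170.0]

moved = perturb_xy(pnts)
print(np.round(moved['Shape']['X'], 2))   # rounded to centimetres
print(np.round(moved['Shape']['Y'], 2))
```

The input array is left untouched because the helper works on a copy.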

The 'Number' field in this example simply reflects the number of pets per household.

Homework...

Using NumPyArrayToFeatureClass, create a shapefile using a NAD_1983_CSRS_MTM_9 projection

(Projected, National Grids, Canada, NAD83 CSRS_MTM_9)

Answer...

>>> import arcpy
>>> a = blog_post()  # do the run if it isn't done
>>> # ..... snip ..... the output
>>> # ..... snip ..... now create the featureclass
>>> SR_name = 32189  # u'NAD_1983_CSRS_MTM_9'
>>> SR = arcpy.SpatialReference(SR_name)
>>> output_shp ='F:/Writing_Projects/NumPy_Lessons/Shapefiles/out.shp'
>>> arcpy.da.NumPyArrayToFeatureClass(a, output_shp, 'Shape', SR)

Result

NumPy_Lessons_06_1.png

That's all...

About the Author
Retired Geomatics Instructor at Carleton University. I am a forum MVP and Moderator. Current interests focus on python-based integration in GIS. See... Py... blog, my GeoNet blog...