Calculate custom statistics on a column in GDB

08-08-2019 08:51 AM
DavidKulpanowski1
New Contributor II

I am using ArcGIS Pro 2.3 and Python 3.6
My goal is to extract a column of numeric data from my geodatabase and analyze it using numpy.

However, I am not sure how to open a specific column in a geodatabase and make the data available for statistical analysis. Typically I use pyodbc to query a database and then run NumPy on the result. Enclosed at the bottom is the Python code for what I am hoping to achieve, written with pyodbc, but geodatabases and arcpy are not my expertise.

Here is my pseudocode:

- Use arcpy to add a column and calculate values for that column.
- Use arcpy to open the geodatabase and select that specific column.
- Use NumPy or SciPy on that column to generate descriptive statistics such as mean, median, mode, percentiles, etc. (NOTE: Pandas would also be a good option.)
- After calculating statistics on this column, create a histogram.
- Continue with further statistical analysis once I get past this roadblock.
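The descriptive-statistics step, once the response-time column is out of the geodatabase and in a NumPy array, might look like the following minimal sketch. The `seconds` array here is fabricated and simply stands in for whatever a cursor or `arcpy.da.TableToNumPyArray` would return:

```python
import numpy as np

# Hypothetical response times in seconds, standing in for the
# values pulled out of the ResponseT field.
seconds = np.array([240.0, 310.0, 275.0, 600.0, 180.0, 420.0])

print("count :", seconds.size)
print("mean  :", seconds.mean())
print("median:", np.median(seconds))
print("p90   :", np.percentile(seconds, 90))
```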

The part I am stuck on is bridging the gap: how do I open the geodatabase, select out a specific column, and analyze that data as a DataFrame or via a cursor?

Can you please point the way to solving this problem?

Shown below is the code I currently have, which adds a response-time column and calculates the fire truck response times. My next step is to analyze the data.

There is another similar post, but it is not quite what I am attempting to do.

This is what I currently have started:

# Create a numeric column in the geodatabase
# Update that column with DTMArriveOnScene - DTMDispatch, in seconds
# Use the response times (in seconds) to generate descriptive statistics
# Use the response times to build a histogram
# Write this information out to an HTML page so the user can have a report

import arcpy

# Set environment settings
arcpy.env.workspace = "C:/Users/kulpandm/Documents/ArcGIS/Projects/CFD/CFD.gdb"

# Set local variables
inTable = "CincinnatiCFD2"
fieldName = "ResponseT"
fieldPrecision = 9
fieldAlias = "Response Time"

# Execute AddField (commented out here; only needed once)
# arcpy.AddField_management(inTable, fieldName, "LONG", fieldPrecision,
#                           field_alias=fieldAlias,
#                           field_is_nullable="NULLABLE")

# Arcade expression for the response time in seconds
expression = "DateDiff($feature.DTMArriveOnScene, $feature.DTMDispatch, 'seconds')"

# Execute CalculateField
arcpy.CalculateField_management(inTable, fieldName, expression, "ARCADE")

Shown above, I have added a column and calculated the data. Now I want to analyze it using NumPy; this is the part where I am stuck.

Shown below is an example of how I solved this before, using pyodbc and Pandas.

# -*- coding: utf-8 -*-
## David Kulpanowski
## 27 July 2018
## This is an example of how I do this using SQL Server and Python.

import time

import matplotlib.pyplot as plt
import numpy
import pandas as pd
import pyodbc
import scipy.stats as stats
#from scipy.stats import mannwhitneyu
#from scipy.stats import kruskal
#from statsmodels.graphics.gofplots import qqplot
#from matplotlib import pyplot

StartTime = time.perf_counter()  # time.clock() is deprecated in Python 3
cnxn = pyodbc.connect('DRIVER={SQL Server}; SERVER=MyServer; DATABASE=MyDatabase; Trusted_Connection=yes')
Fires = pd.read_sql("""
SELECT ResponseTime
FROM FireTruckDetails
;
""", cnxn)
##print(Fires.head(n=20))
##Fires.to_csv('c:/temp/FiresResponseTime.csv', sep=',')
cnxn.close()

# Descriptive statistics
print('The current statistics')
print(Fires.ResponseTime.describe())

fires_counts = Fires.ResponseTime.count()
print(fires_counts)

####
## Create a histogram for the current time frame for all delta echo responses
####
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim(0, 500)
plt.hist(Fires.ResponseTime, bins=numpy.arange(0, 900, 60), range=[0, 900],
         cumulative=False, color='Crimson',
         label='Fire Truck Response times', alpha=0.75)
plt.title('Current time frame responses')
plt.xticks(numpy.arange(0, 900, 60), rotation='vertical')
plt.savefig('C:/Temp/FiresHistogram.svg', format='svg', orientation='landscape')
plt.show()
plt.clf()


Accepted Solutions
DanPatterson_Retired
MVP Esteemed Contributor

Sample code to create/deal with nodata values in featureclass tables, fixing the <null> issue. It uses the code below to 'fix' <null> to something more pythonic/numpy-friendly.

import arcpy
import numpy as np

# ---- featureclass section, arcpy dependent via arcgisscripting
#
def _make_nulls_(in_fc, int_null=-999):
    """Return null values for a list of field objects, excluding objectid
    and geometry-related fields.  Throw in whatever else you want.

    Parameters
    ----------
    in_fc : text
        Path to the input featureclass.  The fields come from
        arcpy.da.Describe; arcpy.ListFields would also give a field list.
    int_null : integer
        A default to use for integer nulls since there is no ``nan``
        equivalent.  Other options include

        >>> np.iinfo(np.int32).min  # -2147483648
        >>> np.iinfo(np.int16).min  # -32768
        >>> np.iinfo(np.int8).min   # -128

        >>> [i for i in cur.__iter__()]
        >>> [[j if j else -999 for j in i] for i in cur.__iter__()]
    """
    nulls = {'Double': np.nan, 'Single': np.nan, 'Float': np.nan,
             'Short': int_null, 'SmallInteger': int_null, 'Long': int_null,
             'Integer': int_null, 'String': str(None), 'Text': str(None),
             'Date': np.datetime64('NaT')}
    #
    desc = arcpy.da.Describe(in_fc)
    if desc['dataType'] != 'FeatureClass':
        print("Only featureclasses are supported")
        return None, None
    in_flds = desc['fields']
    shp = desc['shapeFieldName']
    good = [f for f in in_flds if f.editable and f.name != shp]
    fld_dict = {f.name: f.type for f in good}
    fld_names = list(fld_dict.keys())
    null_dict = {f: nulls[fld_dict[f]] for f in fld_names}
    # ---- insert the OBJECTID field
    return null_dict, fld_names

Sample code to read just the attribute data, plus a representation of the OBJECTID and shape properties (centroid):

def fc_data(in_fc):
    """Pull all editable attributes from a featureclass table.  During the
    process, <null> values are changed to an appropriate type.

    Parameters
    ----------
    in_fc : text
        Path to the input featureclass.

    Notes
    -----
    The output objectid and geometry fields are renamed to `OID_`,
    `X_cent`, `Y_cent`, where the latter two are the centroid values.
    """
    flds = ['OID@', 'SHAPE@X', 'SHAPE@Y']
    null_dict, fld_names = _make_nulls_(in_fc, int_null=-999)
    out_flds = flds + fld_names
    a = arcpy.da.FeatureClassToNumPyArray(
        in_fc, out_flds, skip_nulls=False, null_value=null_dict)
    new_names = ['OID_', 'X_cent', 'Y_cent']
    a.dtype.names = new_names + out_flds[3:]
    return np.asarray(a)
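Once `fc_data` returns its structured array, a single field can be pulled out by name and fed straight to NumPy's nan-aware functions. A minimal sketch, where the structured array below is fabricated to stand in for `fc_data`'s output (the `ResponseT` field name and all values are made up):

```python
import numpy as np

# Fabricated stand-in for the structured array fc_data would return,
# with np.nan marking a <null> in the ResponseT field.
a = np.array([(1, 0.0, 0.0, 240.0),
              (2, 1.0, 1.0, np.nan),
              (3, 2.0, 2.0, 310.0)],
             dtype=[('OID_', '<i4'), ('X_cent', '<f8'),
                    ('Y_cent', '<f8'), ('ResponseT', '<f8')])

rt = a['ResponseT']            # one column, as a plain 1-D float array
print(np.nanmean(rt))          # ignores the nan null
print(np.nanmedian(rt))
```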


5 Replies
DanPatterson_Retired
MVP Esteemed Contributor
DavidKulpanowski1
New Contributor II

Hi Dan:

Thank you for all this information. I have downloaded the Python files for your toolbox and I am poring over them for ideas.

DanPatterson_Retired
MVP Esteemed Contributor

If you get stuck, give a shout; I use numpy extensively, and in fact a lot of the scipy stuff has a numpy base.  One thing to note is that you will probably see a lot of 'nan' functions, e.g. np.nanmean, nanthis, nanthat.  They are just the equivalent functions that account for nodata/missing values which may appear in an array (aka a field or column).  You can create the nan masks yourself, or you can create the appropriate nodata values when reading the featureclass table columns.
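The nan-function distinction Dan describes can be seen in a couple of lines (the values are made up):

```python
import numpy as np

# A response-time column with one missing value encoded as nan.
col = np.array([200.0, np.nan, 400.0])

print(np.mean(col))     # nan -- the ordinary function propagates missing data
print(np.nanmean(col))  # 300.0 -- the nan-aware version ignores it
```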


DavidKulpanowski1
New Contributor II

Thank you Dan:

I have downloaded the code you passed over and I am learning from it.

The missing piece that I could not figure out appears in line 18 in your post above.

a = arcpy.da.FeatureClassToNumPyArray(
    in_fc, out_flds, skip_nulls=False, null_value=null_dict)

Using this line of code I was able to pass in the fields I needed (in this case, ambulance response times) and start calculating statistics on them using NumPy. Now that I have an array of data, I can also begin hypothesis testing and write code for other statistical cases.
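The hypothesis-testing step might then look something like the following, using the `mannwhitneyu` import that was commented out in the earlier pyodbc script. The two samples are fabricated response-time arrays, not real data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Fabricated response times (seconds) for two time periods.
day = np.array([240.0, 310.0, 275.0, 290.0, 260.0, 305.0])
night = np.array([360.0, 410.0, 395.0, 330.0, 420.0, 380.0])

# Two-sided Mann-Whitney U test: do the two distributions differ?
stat, p = mannwhitneyu(day, night, alternative='two-sided')
print(stat, p)
if p < 0.05:
    print('Response-time distributions differ')
```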

But this one line of code was the logjam that stopped me.

thanks,

David