# Calculate custom statistics on a column in GDB

Question asked by 1968Tigers on Aug 8, 2019
Latest reply on Aug 21, 2019 by 1968Tigers

I am using ArcGIS Pro 2.3 and Python 3.6.
My goal is to extract a column of numeric data from my geodatabase and analyze it using NumPy.

However, I am not sure how to open a specific column in a geodatabase and make it available for statistical analysis. Typically I use pyodbc to query a database and then use NumPy on that data. Enclosed at the bottom is the Python code for what I am hoping to achieve, written with pyodbc. But geodatabases and arcpy are not my expertise.

Here is my pseudocode:

- Use arcpy to add a column and calculate values for that column.
- Use arcpy to open the geodatabase and select that specific column.
- Use NumPy or SciPy on that column to generate descriptive statistics such as mean, median, mode, percentiles, etc. (NOTE: Pandas would also be a good option.)
- After calculating statistics on this column, create a histogram.
- Continue with further statistical analysis once I get past this roadblock.

The part I am stuck on is how to bridge the gap from the geodatabase: selecting out a specific column and analyzing that data as a DataFrame or via a cursor.

Can you please point the way to solving this problem?
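For reference, the standard bridge here is `arcpy.da.TableToNumPyArray`, which reads one or more fields from a geodatabase table straight into a structured NumPy array. A minimal sketch: the arcpy call is shown commented out because it needs an ArcGIS Pro installation, so a synthetic array stands in for the extracted column (the table and field names match the ones used later in the post).

```python
import numpy as np

# With ArcGIS available, the column can be pulled straight into NumPy:
#
#   import arcpy
#   arr = arcpy.da.TableToNumPyArray("CincinnatiCFD2", ["ResponseT"],
#                                    skip_nulls=True)
#   seconds = arr["ResponseT"]
#
# Synthetic stand-in for the extracted response-time column (seconds):
seconds = np.array([240, 305, 412, 198, 360, 275], dtype=np.int64)

# Descriptive statistics work directly on the array
print(seconds.mean())
print(np.median(seconds))
print(np.percentile(seconds, 90))
```

From there the array behaves like any other NumPy data, so the SciPy/pandas workflow from the pyodbc script applies unchanged.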

Shown below is the code I currently have, which adds a response time column and calculates the fire truck response times. My next step is to analyze the data.

There is another post that is similar, but not quite what I am attempting to do.

This is what I currently have started:

```python
# Create a column in the geodatabase that is numeric
# update that column with an equation DTMArriveOnScene - DTMDispatch in seconds
# Use the number of seconds of fire truck response times to generate descriptive statistics
# Use the number of seconds of fire truck response times for a histogram
# Write this information out to an html page so the user can have a report

import arcpy

# Set environment settings
arcpy.env.workspace = "C:/Users/kulpandm/Documents/ArcGIS/Projects/CFD/CFD.gdb"

# Set local variables
inTable = "CincinnatiCFD2"
fieldName = "ResponseT"
fieldPrecision = 9
fieldAlias = "Response Time"

# arcpy.AddField_management(inTable, fieldName, "LONG", fieldPrecision,
#                           field_alias=fieldAlias, field_is_nullable="NULLABLE")

# Arcade expression: response time in seconds
expression = "DateDiff($feature.DTMArriveOnScene, $feature.DTMDispatch, 'seconds')"

# Execute CalculateField
arcpy.management.CalculateField(inTable, fieldName, expression, "ARCADE")
```

Shown above, I have added a column and calculated the data. Now I want to analyze it using NumPy. This is the part where I am stuck.
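Once the column is in a NumPy array, wrapping it in a pandas Series gives back the same `.describe()` workflow used in the pyodbc script below. A sketch with hypothetical response-time values; in practice the Series would be built from the geodatabase column (e.g. the array returned by `arcpy.da.TableToNumPyArray`):

```python
import pandas as pd

# Hypothetical response times in seconds, standing in for the
# ResponseT column extracted from the geodatabase.
seconds = pd.Series([240, 305, 412, 198, 360, 275], name="ResponseT")

print(seconds.describe())     # count, mean, std, min, quartiles, max
print(seconds.quantile(0.9))  # 90th-percentile response time
```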

Shown below is an example of how I solved this issue before using PyODBC and Pandas.

```python
# -*- coding: utf-8 -*-
## David Kulpanowski
## 27 July 2018
## This is an example of how I do this using SQL Server and Python.

import pyodbc
import pandas as pd
import time
#import scipy
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy
#from scipy.stats import mannwhitneyu
#from scipy.stats import kruskal
#from statsmodels.graphics.gofplots import qqplot
#from matplotlib import pyplot

StartTime = time.perf_counter()  # time.clock() is deprecated
cnxn = pyodbc.connect('DRIVER={SQL Server}; SERVER=MyServer; DATABASE=MyDatabase; Trusted_Connection=yes')
cursor = cnxn.cursor()
Fires = pd.read_sql_query("""
SELECT ResponseTime
FROM FireTruckDetails
;
""", cnxn)
##Fires.to_csv('c:/temp/FiresResponseTime.csv', sep = ',')
cursor.close()
cnxn.close()

# descriptive statistics
print('The current statistics ')
print(Fires.ResponseTime.describe())

fires_counts = Fires.ResponseTime.count()
print(fires_counts)

####
## create a histogram for the current time frame for all delta echo responses
####
fig = plt.figure()
plt.hist(Fires.ResponseTime.dropna(), bins=50)
plt.xlabel('Response time (seconds)')
plt.ylabel('Count')
plt.show()
```
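For the final HTML-report step in the pseudocode, pandas can render the summary table directly with `to_html()`. A minimal sketch with synthetic values; the output filename and heading text are illustrative:

```python
import pandas as pd

# Hypothetical response times, standing in for the ResponseT column.
seconds = pd.Series([240, 305, 412, 198, 360, 275], name="ResponseT")

# describe() returns a Series; to_frame() gives a one-column table
# that pandas can render as an HTML fragment for the report.
html = seconds.describe().to_frame().to_html()

with open("response_time_report.html", "w") as f:
    f.write("<h1>Fire Truck Response Times</h1>\n")
    f.write(html)
```

A saved histogram image (`fig.savefig(...)`) can be referenced from the same page with an `<img>` tag.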