Which statistical library

08-09-2018 05:40 AM
by Anonymous User
Not applicable

I'd like opinions (no code necessary) on libraries for basic descriptive statistics using the Python tools that ship in the box with an ArcGIS 10.3 Standard licence (so Python 2.7). I'm behind a corporate firewall and cannot install any additional libraries.

I'm not planning on doing ML or predictive analytics just yet. I have a simple requirement to:

  • get simple aggregates (counts, count not null, count grouping by) 
  • measures of central tendency and dispersion (mean, mode, median, range, variance & standard deviation)
  • box plots, histograms and pie charts
  • output results to XLS

Inputs would be file GDB feature classes ranging in volume from 5 records to 5 million records, with most  under 1 million records.   Execution would be infrequent, maybe once or twice a day to assist with generic data profiling.  I'm not hooking this up to a high volume web service, so speed is not critical.  The ability to write clear code is.

I suspect I could achieve my requirements with one or more of: 

  • arcpy.da.SearchCursor (yeah, the long way 'round)
  • arcpy.Statistics_analysis
  • scipy.stats
  • numpy
  • pandas
  • sqlite3
  • matplotlib
  • xlwt
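As a rough illustration of the aggregates listed above (count, count not null, count grouped by), here is a minimal sketch using only the standard-library sqlite3 module, so nothing extra needs installing. The table name, field names and rows are hypothetical; in practice the rows would come from something like an arcpy.da.SearchCursor over the feature class.

```python
import sqlite3

# Hypothetical rows standing in for what a cursor over a feature class
# might yield: (land_use, area_ha). None marks a null attribute value.
rows = [
    ("forest", 12.5),
    ("forest", None),
    ("urban", 3.0),
    ("urban", 4.5),
    ("water", None),
]

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (land_use TEXT, area_ha REAL)")
conn.executemany("INSERT INTO parcels VALUES (?, ?)", rows)

# COUNT(*) counts every row; COUNT(column) skips nulls.
total, not_null = conn.execute(
    "SELECT COUNT(*), COUNT(area_ha) FROM parcels"
).fetchone()

# Row count, non-null count and mean for each group.
by_group = conn.execute(
    "SELECT land_use, COUNT(*), COUNT(area_ha), AVG(area_ha) "
    "FROM parcels GROUP BY land_use ORDER BY land_use"
).fetchall()

conn.close()
```

Here `total` and `not_null` hold the overall counts and `by_group` holds one tuple per land-use value. SQL covers counts and means, but it has no built-in median or mode, so those would still need one of the other options on the list.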

Getting into one or more of these will be a time investment that I hope will still be useful when I (hopefully) upgrade to ArcGIS 10.6 and Pro with Python 3.6 in the next year.

So, which do you think are worth learning for this purpose? 

11 Replies
DanPatterson_Retired
MVP Emeritus

ArcMap 10.5 and 10.6 ship with various Python modules, including matplotlib, scipy and numpy. scipy and numpy provide the stats functionality, and matplotlib handles the graphing.

Statistical functions (scipy.stats) — SciPy v1.1.0 Reference Guide https://docs.scipy.org/doc/scipy/reference/stats.html

statistics Examples — Matplotlib 2.0.2 documentation

Statistics — NumPy v1.14 Manual 

my blog has several stats and graphing examples

by Anonymous User
Not applicable

Thanks for responding, Dan. It's good to know that SciPy, Matplotlib and NumPy will be available in a version of ArcGIS that I may have in the future. If you were working on this problem now, as described above in 10.3, would you start with SciPy or NumPy? And what are your thoughts on Pandas? I see it did not make your short-list; is that for a reason?

DanPatterson_Retired
MVP Emeritus

Michael... missed this

Pandas is largely a fancier jacket over numpy (oversimplified).

Everything in the SciPy 'stack' depends on numpy, so if you are working with numbers, then that is the way to go.

You can treat 'missing' data using 'masked arrays' and/or use what are called 'nan' functions (i.e. nan means Not a Number).
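A small sketch of both approaches, assuming missing values have already been converted to nan in a float array. The sample values are made up. Note that np.nanmean needs numpy 1.8 or later, so it may not exist in an older bundled numpy; masked arrays have been around much longer.

```python
import numpy as np

# Made-up values with one missing entry stored as nan.
values = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# The ordinary mean is poisoned by the nan.
plain_mean = np.mean(values)

# 'nan' functions simply ignore the nan entries.
nan_mean = np.nanmean(values)

# A masked array hides the invalid entries from every later calculation.
masked = np.ma.masked_invalid(values)
masked_mean = masked.mean()
valid_count = masked.count()  # number of unmasked (non-missing) values
```

The masked array route is handy when several statistics are needed from the same data, since the mask travels with the array.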

numpy is blazingly fast and can benefit from a variety of other modules that speed things up even further.

Pandas is pretty good for a person looking at working with array/tabular data, but at some stage you will probably step back to numpy, particularly if you want to work with some of the graphics libraries like matplotlib, seaborn etc.

by Anonymous User
Not applicable

Thanks Dan.

JoshuaBixby
MVP Esteemed Contributor

If you are using ArcGIS 10.3 and have pandas and scipy, then someone has already installed additional libraries because those were not bundled with ArcGIS until ArcGIS 10.4.x.

If "getting into one or more of these will be a time investment" for you, then I suspect you are new to Python.  You could use this as an opportunity to learn some Python packages like scipy and pandas (after you upgrade); but given your requirements, I think sticking with ArcGIS geoprocessing tools (e.g., Statistics) is likely your best bet until you get more familiar with Python. 

by Anonymous User
Not applicable

Thanks Joshua. Perhaps I should have been clearer in saying I do not have the ability to install additional packages.

I am relatively new to Python. But I have some experience in Java and .NET, so I'm not new to code. I am using this as an opportunity to dive into Python. I'm not so much looking for an easy path as a better path (if that makes sense?). Do you have experience with any of those packages? Did you find one more flexible or functional than another?

DarrenWiens2
MVP Honored Contributor

The libraries you've listed are good at certain things, and don't do some things at all. At the beginning at least, I'd suggest being a jack of all trades and master of none. Learn that numpy is good at handling numbers (e.g. rasters), pandas is good at handling tables, arcpy handles ArcGIS-style data (but not as fast as numpy or pandas), and matplotlib draws your graphs. Do a lot of googling for how to do specific things when you need to do them. Honestly, I wouldn't try to learn everything about any of these libraries.

DanPatterson_Retired
MVP Emeritus

You should invest your time in Python 3... ArcGIS Pro uses Python 3.5/3.6, and there is now a basic statistics module built into Python's standard library.

9.7. statistics — Mathematical statistics functions — Python 3.7.0 documentation 

>>> import sys
>>> sys.version
'3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]'
>>> import statistics
>>> dir(statistics)
 ['Decimal', 'Fraction', 'StatisticsError', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_coerce', '_convert', '_counts', '_exact_ratio', '_fail_neg', '_find_lteq', '_find_rteq', '_isfinite', '_ss', '_sum', 'bisect_left', 'bisect_right', 'chain', 'collections', 'decimal', 'groupby', 'harmonic_mean', 'math', 'mean', 'median', 'median_grouped', 'median_high', 'median_low', 'mode', 'numbers', 'pstdev', 'pvariance', 'stdev', 'variance']
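To give a feel for the functions in that listing, here is a minimal sketch with made-up numbers. It needs Python 3.4 or later, so it will not run on the Python 2.7 that ships with ArcMap.

```python
import statistics

# Made-up sample values.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# pvariance/pstdev treat the data as the whole population;
# variance/stdev treat it as a sample (n - 1 in the denominator).
pop_variance = statistics.pvariance(data)
pop_stdev = statistics.pstdev(data)
sample_stdev = statistics.stdev(data)

# There is no range function; min and max are builtins.
value_range = max(data) - min(data)
```

Everything here works on plain lists, which keeps the code readable, though for millions of records numpy is likely to be considerably faster.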
by Anonymous User
Not applicable

I agree, in principle. But for now that is not an option.
