I'd like opinions, no code necessary, on libraries for performing basic descriptive stats using the python tools that ship in the box with an ArcGIS 10.3 standard licence (so Python 2.7). I'm behind a corporate firewall with no ability to install any additional libraries.
I'm not planning on doing ML or predictive analytics just yet. I have a simple requirement to:
Inputs would be file GDB feature classes ranging in volume from 5 records to 5 million records, with most under 1 million records. Execution would be infrequent, maybe once or twice a day to assist with generic data profiling. I'm not hooking this up to a high volume web service, so speed is not critical. The ability to write clear code is.
I suspect I could achieve my requirements with one or more of:
Getting into one or more of these will be a time investment that I'd hope will still be useful when (I hope) I am able to upgrade to ArcGIS 10.6 and Pro with Python 3.6 in the next year.
So, which do you think are worth learning for this purpose?
ArcMap 10.5/10.6 ships with various Python modules, including matplotlib, scipy and numpy. scipy and numpy provide a range of statistics functions, and matplotlib handles the graphing.
Statistical functions (scipy.stats) — SciPy v1.1.0 Reference Guide https://docs.scipy.org/doc/scipy/reference/stats.html
statistics Examples — Matplotlib 2.0.2 documentation
Statistics — NumPy v1.14 Manual
My blog has several stats and graphing examples.
Thanks for responding Dan. It's good to know that SciPy, Matplotlib and NumPy will be available in a version of ArcGIS that I may have in the future. If you were working on this problem now, as described above in 10.3, would you start with SciPy or NumPy? And what are your thoughts on Pandas? I see it did not make your short-list; is that for a reason?
Michael... missed this
Pandas is largely a fancier jacket over numpy (oversimplified).
Everything in the SciPy 'stack' depends on numpy, so if you are working with numbers, then that is the way to go.
You can treat 'missing' data using 'masked arrays' and/or use what are called 'nan' functions (i.e. nan is Not a Number).
Numpy is blazingly fast and can benefit from a variety of other modules that speed things up even further.
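To make the two approaches to missing data concrete, here is a minimal sketch. The sample values are made up for illustration, and note that np.nanmean needs NumPy 1.8 or later, so it may not be available in an older bundled numpy; the masked-array route does not have that restriction.

```python
from __future__ import print_function  # harmless on Python 3, needed on 2.7

import numpy as np

# Hypothetical field values; NaN marks a missing measurement.
values = np.array([2.0, 4.0, np.nan, 8.0, 10.0])

# Option 1: mask the invalid entries, then use the ordinary methods.
masked = np.ma.masked_invalid(values)
masked_mean = masked.mean()    # mean of the unmasked values only
masked_count = masked.count()  # number of unmasked values

# Option 2: nan-aware functions skip NaN directly.
# np.nanmean requires NumPy 1.8+; np.nanmin and np.nanmax are older.
nan_mean = np.nanmean(values)
nan_min = np.nanmin(values)
nan_max = np.nanmax(values)

print(masked_mean, masked_count, nan_mean, nan_min, nan_max)
```

Both routes ignore the missing value rather than letting it turn the result into nan, which is what a plain values.mean() would do here.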
Pandas is pretty good for a person looking at working with array/tabular data, but at some stage, you will probably step back, particularly if you want to work with some of the graphics libraries like matplotlib, seaborn etc.
Thanks Dan.
If you are using ArcGIS 10.3 and have pandas and scipy, then someone has already installed additional libraries because those were not bundled with ArcGIS until ArcGIS 10.4.x.
If "getting into one or more of these will be a time investment" for you, then I suspect you are new to Python. You could use this as an opportunity to learn some Python packages like scipy and pandas (after you upgrade); but given your requirements, I think sticking with ArcGIS geoprocessing tools (e.g., Statistics) is likely your best bet until you get more familiar with Python.
Thanks Joshua. Perhaps I should have been clearer in saying I do not have the ability to install additional packages.
I am relatively new to Python. But I have some experience in Java and .net, so I'm not new to code. I am using this as an opportunity to dive into Python. I'm not so much looking for an easy path, as a better path (if that makes sense?). Do you have experience with any of those packages? Did you find one more flexible or functional than another?
The libraries you've listed are good at certain things, and don't do some things at all. At the beginning at least, I'd suggest being a jack of all trades and master of none. Learn that numpy is good at handling numbers (e.g. rasters), pandas is good at handling tables, arcpy handles ArcGIS-style data (but not as fast as numpy or pandas), and matplotlib draws your graphs. Do a lot of googling for how to do specific things when you need to do them. Honestly, I wouldn't try to learn everything about any of these libraries.
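As a rough illustration of the "numpy handles the numbers" part, here is a sketch of basic profiling on a structured array. The array is built by hand as a stand-in for what arcpy.da.TableToNumPyArray would hand back from a feature class; the field names (OBJECTID, LENGTH_M, CLASS) and the describe helper are hypothetical, not anything from arcpy or numpy.

```python
from __future__ import print_function

from collections import Counter

import numpy as np

# Stand-in for a table exported from ArcGIS as a structured array.
# Field names and values are made up for this example.
rows = np.array(
    [(1, 12.5, 'A'), (2, 7.0, 'B'), (3, 20.5, 'A'), (4, 10.0, 'B')],
    dtype=[('OBJECTID', 'i4'), ('LENGTH_M', 'f8'), ('CLASS', 'U1')],
)


def describe(column):
    """Return basic descriptive statistics for a 1-D numeric array."""
    return {
        'count': int(column.size),
        'min': float(column.min()),
        'max': float(column.max()),
        'mean': float(column.mean()),
        'median': float(np.median(column)),
        'std': float(column.std(ddof=1)),  # sample standard deviation
    }


# Numeric field: summary statistics.
length_stats = describe(rows['LENGTH_M'])

# Text field: frequency of each distinct value.
class_counts = Counter(rows['CLASS'].tolist())

print(length_stats)
print(class_counts)
```

Selecting a field by name gives a plain numeric array, so the same helper can be reused for every numeric column in the table.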
You should invest your time in Python 3... ArcGIS Pro uses Python 3.5/3.6, and there is a fundamental statistics package now built into Python.
9.7. statistics — Mathematical statistics functions — Python 3.7.0 documentation
>>> import sys
>>> sys.version
'3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]'
>>> import statistics
>>> dir(statistics)
['Decimal', 'Fraction', 'StatisticsError', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_coerce', '_convert', '_counts', '_exact_ratio', '_fail_neg', '_find_lteq', '_find_rteq', '_isfinite', '_ss', '_sum', 'bisect_left', 'bisect_right', 'chain', 'collections', 'decimal', 'groupby', 'harmonic_mean', 'math', 'mean', 'median', 'median_grouped', 'median_high', 'median_low', 'mode', 'numbers', 'pstdev', 'pvariance', 'stdev', 'variance']
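For a sense of how a few of those names are used, here is a short sketch with made-up sample values. The statistics module is part of the standard library from Python 3.4 onward, so this will not run on the Python 2.7 that ships with ArcMap.

```python
import statistics

# Hypothetical sample values for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    'mean': statistics.mean(sample),
    'median': statistics.median(sample),
    'mode': statistics.mode(sample),
    'stdev': statistics.stdev(sample),    # sample standard deviation (n - 1)
    'pstdev': statistics.pstdev(sample),  # population standard deviation (n)
}

for name, value in summary.items():
    print(name, value)
```

The module works on plain lists, so it is fine for small tables, but for millions of records numpy will be considerably faster.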
I agree, in principle. But for now that is not an option.