
Table fields that are not simple

DanPatterson
MVP Esteemed Contributor

It seemed like a simple idea at the time.

Take the outputs from another data source, concatenate the values to a string and toss them into a new field.

Looks nice eh?

table_of_data.png

Well, well, you can even query the whole stringy thing.  Don't get me wrong, the whole Select By Attributes and the SQL thing are pretty cool

table_query.png

Sorting is fun too.  Who would have thought that arcpy-base had so many friends (dependencies).

arcpy-base.png

That is when I asked myself: what are the interdependencies/requirements for Spyder, my favorite Python IDE?  SQL away; it was the sorting that got me on a roll.

That is when I thought .... what about this package?  how about that one?

Time for a big summary, which wasn't going to work with the existing options.

That is when I remembered those arc angels who coded the numpy stuff!!   (not the post-Stevie Ray band or things divinical)

TableToNumPyArray  and NumPyArrayToTable.

Follow along. 

1-3    import arcpy of course, plus numpy and npg

5-7    specify your table of interest and its field(s).  We now have an array; none of this "for row in cursor" stuff.  'vals' is a view of the data in 'arr', so we can explore it and manipulate it.

9      list comprehensions can be fun:

  1. for every value (v) in the array (vals)
  2. split the value at the commas (note I used "," instead of ", " since there was a mixture)
  3. strip off any leading/trailing spaces to be on the safe side

10     flatten the whole dataset

11     get the unique entries and their counts.  Remember, the field was one big messy string/former-list thing.

13-15  create the output array and send the whole summary back to Pro

import arcpy
import numpy as np
import npg  # Dan-Patterson/numpy_geometry helpers (provides flatten)

tbl = r"C:\arcpro_npg\Project_npg\npgeom.gdb\dep_required_by"
arr = arcpy.da.TableToNumPyArray(tbl, "Required_by")
vals = arr["Required_by"]

big = [i.strip() for v in vals for i in v.split(",")]
big_flat = npg.flatten(big)
uniq1, cnts1 = np.unique(big_flat, return_counts=True)

out_tbl = r"C:\arcpro_npg\Project_npg\npgeom.gdb\dep_summary"
out_arr = np.asarray(list(zip(uniq1, cnts1)), dtype=[('Package', 'U50'), ('Counts', 'i4')])
arcpy.da.NumPyArrayToTable(out_arr, out_tbl)

 

Now the next time I am interested in exploring the python package infrastructure for Pro, I have a workflow.

table_summary.png

Don't be afraid to use what was given to you.  Arcpy and NumPy play nice together.  

 

 

13 Comments
LindsayRaabe_FPCWA
MVP Regular Contributor

Mate, I look at this stuff and I'm amazed. I know there is sooooooo much that you can do in python and I'm barely scraping the surface!

DanPatterson
MVP Esteemed Contributor

lots I could write about...

I don't suppose there is much market for inner, outer, and left outer joins using numpy and arcpy?

production of the tables and the output: 20 minutes

production of the image below (e.g. rearranging tables and the fiddly stuff): 1 hour

bunchOfjoins.png

import arcpy
t0 = r"C:\arcpro_npg\Project_npg\npgeom.gdb\table0"
t1 = r"C:\arcpro_npg\Project_npg\npgeom.gdb\table1"
table0 = arcpy.da.TableToNumPyArray(t0, "*")
table1 = arcpy.da.TableToNumPyArray(t1, "*")
#
import numpy.lib.recfunctions as rfn
#
# join the two structured arrays on the common key field 'f0'
inner_ = rfn.join_by('f0', table0, table1, 'inner', usemask=False)
outer_ = rfn.join_by('f0', table0, table1, 'outer', usemask=False)
lft_outer_ = rfn.join_by('f0', table0, table1, 'leftouter', usemask=False)
#
t_in = r"C:\arcpro_npg\Project_npg\npgeom.gdb\t0t1_inner"
t_out = r"C:\arcpro_npg\Project_npg\npgeom.gdb\t0t1_outer"
t_lft_out = r"C:\arcpro_npg\Project_npg\npgeom.gdb\t0t1_lft_outer"

arcpy.da.NumPyArrayToTable(inner_, t_in)
arcpy.da.NumPyArrayToTable(outer_, t_out)
arcpy.da.NumPyArrayToTable(lft_outer_, t_lft_out)

I could have used arcpy.da.ExtendTable but I was on an 'rfn' roll
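For the record, a minimal sketch of that ExtendTable route (same 'f0' key assumed; 'f1' is a stand-in for whichever columns you want to carry over).  It bolts the array's fields onto an existing table in place instead of writing a new one:

import arcpy

t0 = r"C:\arcpro_npg\Project_npg\npgeom.gdb\table0"
t1 = r"C:\arcpro_npg\Project_npg\npgeom.gdb\table1"
# grab the key plus the columns to append; carrying over OBJECTID or
# fields already present in t0 would cause a clash
arr1 = arcpy.da.TableToNumPyArray(t1, ["f0", "f1"])
# join arr1's fields onto t0 in place, matching rows on 'f0'
arcpy.da.ExtendTable(t0, "f0", arr1, "f0", append_only=False)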

HaydenWelch
MVP Regular Contributor

numpy is such a magical package. And the NumPyArrayToTable function is probably the easiest way to create a new table. You don't even need to live in numpy land until that last step if you don't want:

 

from arcpy.da import NumPyArrayToTable
from arcpy.da import SearchCursor
import numpy as np

IN_TABLE = r"<in_table>"
SUMMARY_FIELD = 'Field_Name'
OUT_TABLE = r"<out_table>"

vals = [val.strip() for row in SearchCursor(IN_TABLE, [SUMMARY_FIELD]) for val in row[0].split(',')]
unique = np.array([(val, vals.count(val)) for val in set(vals)], dtype=[('Package', 'U50'), ('Counts', 'i4')])
NumPyArrayToTable(unique, OUT_TABLE)

 

DanPatterson
MVP Esteemed Contributor

Common imports and names

import arcpy
from arcpy.da import SearchCursor
import numpy as np
import npg
import numpy.lib.recfunctions as rfn
#
t0 = r"C:\arcpro_npg\Project_npg\npgeom.gdb\table0"
t1 = r"C:\arcpro_npg\Project_npg\npgeom.gdb\table1"
table0 = arcpy.da.TableToNumPyArray(t0, "*")
table1 = arcpy.da.TableToNumPyArray(t1, "*")
#
IN_TABLE = r"C:\arcpro_npg\Project_npg\npgeom.gdb\dep_required_by"
SUMMARY_FIELD = "Required_by"
tbl = IN_TABLE  # same table; 'tbl' is the name used in the timings below

 

Timing both, after fixing row.split() to row[0].split() in the cursor version:

%%timeit
arr = arcpy.da.TableToNumPyArray(tbl, "Required_by")
vals = arr["Required_by"]
big = [i.strip() for v in vals for i in v.split(",")]
big_flat = npg.flatten(big)
uniq1, cnts1 = np.unique(big_flat, return_counts=True)
out_arr = np.asarray(list(zip(uniq1, cnts1)), dtype=[('Package', 'U50'), ('Counts', 'i4')])


4.01 ms ± 636 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
srch_cr = SearchCursor(IN_TABLE, [SUMMARY_FIELD])
vals = [val.strip() for row in srch_cr for val in row[0].split(',')]
unique = np.array([(val, vals.count(val)) for val in set(vals)], dtype=[('Package', 'U50'), ('Counts', 'i4')])


4.84 ms ± 785 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr = arcpy.da.TableToNumPyArray(tbl, "Required_by")
1.51 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit srch_cr = SearchCursor(IN_TABLE, [SUMMARY_FIELD])
76.7 µs ± 9.73 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The totals are comparable.  Note, though, that TableToNumPyArray (1.51 ms) returns a fully populated array object, while creating the SearchCursor (76.7 µs) only returns a reference; the actual reading is deferred until you iterate.

The table-read step (TableToNumPyArray in one version, the cursor comprehension in the other) is the clinch point.

Now, to really have a look underneath, do a

dir(srch_cr)
[.... snip ...
'__subclasshook__',
 '_as_narray',
 '_dtype',
 'fields',
 'next',
 'reset']

# explore away

arr = srch_cr._as_narray()
type(arr)
numpy.ndarray

arr.shape
(252,)

so you can skip the "for row in rows" thing and go directly to a numpy array.
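Putting that to work, a minimal sketch reusing the common names above (with the caveat that _as_narray is private, so it may change between releases):

srch_cr = SearchCursor(IN_TABLE, [SUMMARY_FIELD])
arr = srch_cr._as_narray()     # a structured ndarray, straight from the cursor
vals = arr[SUMMARY_FIELD]      # same view-of-a-field pattern as before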

 

HaydenWelch
MVP Regular Contributor

Wow, I can't believe that I never noticed that da.*Cursor objects have a private `_as_narray` attribute... I really want to exploit this in the future.

I honestly didn't expect my goofy pure python implementation to be that close to your numpy solution though. I was just trying to make a point about how you can still use numpy even if you would rather stick with pure python. Although I feel like at this point numpy is a subset of pure python lol.

I'm sure if you threw any appreciably sized dataset at it there would be no contest, especially because the .count() call in my original post is O(n^2) in the worst case and likely not too far off that in the best case.

I think that could be solved by using a Counter:

 

from collections import Counter
unique = np.array(list(Counter(vals).items()), dtype=[('Package', 'U50'), ('Counts', 'i4')])

 

The slowest part is still the string parsing.

 

Also, thanks for catching my row[0] bug; as you can tell, I tested this with a dictionary proxy and replaced that with the Cursor afterwards. I always forget to extract the 0th item from single-field cursors...

 

Edit: I did some testing that isolated out the arcpy-related stuff, and I think the pure python solution is a bit faster (unless I'm messing up the numpy function calls):

import this                      # the Zen of Python; this.s is its (rot13) text
import numpy as np
from collections import Counter

words = this.s * 1000

vals = [val.strip() for row in words.split('\n') for val in row.split(' ') if val.strip()]

def python():
    return np.array(list(Counter(vals).items()), dtype=[('Package', 'U50'), ('Counts', 'i4')])

def numpy():
    v, u = np.unique(vals, return_counts=True)
    return np.array(list(zip(v, u)), dtype=[('Package', 'U50'), ('Counts', 'i4')])

# both routes produce the same (Package, Counts) pairs
numpy_list = sorted(numpy().tolist())
python_list = sorted(python().tolist())

assert numpy_list == python_list

I just used `this.s` as my input string (multiplied by 1000 so it has some heft). The string parsing is shared between the two functions; the only difference is that the python function gets the unique counts from a Counter while the numpy version uses np.unique.

>>> %timeit python()
13.8 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit numpy()
72.5 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

When you check the time on np.unique vs collections.Counter, you can see exactly where all the time is spent:

%timeit v, u = np.unique(vals, return_counts=True)
69.4 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit Counter(vals)
13.1 ms ± 911 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I initially thought that np.unique would pull ahead with a larger dataset (this test is a list of ~140,000 strings), but when I upped the size by a factor of 100, I got the same results:

>>> words = this.s*100000
...
>>> %timeit Counter(vals)
1.43 s ± 95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit v, u = np.unique(vals, return_counts=True)
6.48 s ± 560 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

 

DanPatterson
MVP Esteemed Contributor

Although I feel like at this point numpy is a subset of pure python lol.

Not so. A lot of the NumPy package (SciPy too) is written in C-ish languages for speed.

Broadcasting — NumPy v2.2 Manual is the key.
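A tiny illustration of the idea (not from the thread, just the two-line flavor of it):

import numpy as np

a = np.arange(5)                 # shape (5,)   -> [0 1 2 3 4]
b = np.arange(3).reshape(3, 1)   # shape (3, 1) -> a column [0 1 2]

grid = b * 10 + a   # (3, 1) and (5,) broadcast to (3, 5): no loops, no tiling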

I like it because of its ability to deal with geometry objects, although I had to create my own geometry class to avoid working with object/ragged arrays.

Check out Dan-Patterson/numpy_geometry: A numpy geometry class and functions that work with arcpy and ESRI fea...

The code is a work in progress (now that I have been retired for a while).

There is a lot of work involving arrays and array standards as well, although Esri isn't on the official list:

Consortium for Python Data API Standards

Also check out the code in

C:\...install_folder...\Resources\ArcPy\arcpy

C:\...install_folder...\Resources\ArcToolBox\Scripts

C:\...install_folder...\Resources\ArcToolBox\toolboxes

You would be surprised how many times NumPy is imported and used in arc* code

Have fun

DanPatterson
MVP Esteemed Contributor

Also, the current version of NumPy that Pro uses is best suited to numbers.

10,000,000 random integers in the range 0-100:

%timeit python()
1.55 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit numpy()
546 ms ± 44.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You will have to create a separate conda environment if you want to work with more recent versions of Python (3.12, 3.13) and NumPy (2.2.0), and pandas if you work with text a lot.

The newer version of NumPy handles string data better than the version you are currently using.

 

Addendum

Don't do timing on battery power; a small but interesting fact 😉

import numpy as np
from collections import Counter

this = np.random.randint(0, 100, size=10000000, dtype='int')  # note: a numpy array

def python():
    return np.array(list(Counter(this).items()))

def numpy():
    v, u = np.unique(this, return_counts=True)
    return np.array(list(zip(v, u)))

 


%timeit python()
807 ms ± 6.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit numpy()
312 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

 

HaydenWelch
MVP Regular Contributor

@DanPatterson  Thanks for the additional info! I was kinda joking with the numpy being pure python line; it's just so prevalent in code that its functionality might as well be part of the standard library lol. Python even has the __matmul__ (@) operator method that, as far as I know, is only implemented by numpy.
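(A tiny illustration of that operator, for anyone following along:)

import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6).reshape(3, 2)

c = a @ b            # the @ operator dispatches to a.__matmul__(b)
d = np.matmul(a, b)  # the same 2x2 matrix product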

And I have read through a lot of your GitHub (been following you for a while now!), you have some amazing work on there. You also have helped me learn a lot more about numpy in general as before seeing how you used it I only used it for simple matrices and big math operations.

Definitely going to check out the Consortium documentation now, hadn't heard of them before.

As for the version with integers, I'm not surprised that numpy pulls ahead. A Hash Map is basically cheating when it comes to string counting, especially one that's built in a single iteration of the value list.

 

Final note, I still can't get the numpy timing you have. I was in fact on my laptop (with battery power) when I ran those first tests, but I just re-ran both the string and integer tests on a much more powerful desktop and got these results:

Numpy: 1.24.3
Python: 3.11.8
# this.s*100000 (str)
>>> %timeit python()
464 ms ± 28.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit numpy()
3.42 s ± 77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# vals = [random.randint(0,100) for _ in range(1000000)]
>>> %timeit python()
23.3 ms ± 895 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit numpy()
43 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
 
Numpy: 2.2.3
Python: 3.13.2
# this.s*100000 (str)
>>> %timeit python()
454 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit numpy()
3.91 s ± 58.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# vals = [random.randint(0,100) for _ in range(1000000)]
>>> %timeit python()
24.2 ms ± 805 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit numpy()
46.5 ms ± 905 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

 

I'm curious what I'm doing wrong to get such a different result from you...

DanPatterson
MVP Esteemed Contributor

my this:   np.random.randint(0, 100, size=10000000, dtype='int')  # 10,000,000

your this: random.randint(0, 100) for _ in range(1000000)  # 1,000,000

a factor of 10

HaydenWelch
MVP Regular Contributor

Totally missed that your randint was numpy flavored...

That alone would kill performance, as the np.unique call now has to convert Python dynamic ints to numpy fixed-width ints, something the Counter doesn't have to worry about since it just looks at the hash of each input.

Goes to show that if you're going to use numpy, you should make sure you're using the correct dtype and not the Python type:

# vals = np.random.randint(0, 100, size=10000000, dtype='int') 
>>> %timeit python()
508 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit numpy()
184 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

 Thanks for going down this rabbit hole with me by the way, it was both fun and informative! Hopefully anyone reading along has learned something useful from this. I know I did!

DanPatterson
MVP Esteemed Contributor

Data types — NumPy v2.2 Manual: lots in there on their association with Python and C,

and if you have time

Datetimes and timedeltas — NumPy v2.2 Manual

HaydenWelch
MVP Regular Contributor

Numpy really is as close as you can get to just writing C code without writing C code.

Also, I didn't know about the datetime64 type; it seems like it would be really useful for geological work, seeing as the standard datetime module only supports AD date ranges.

I haven't seen it used that much and am definitely going to look into examples of it used in the wild now.
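A small sketch of why it reaches that far: datetime64 is just a 64-bit count of units from the 1970 epoch, so BC dates fall out naturally:

import numpy as np

# 10,000 years before 1970, at year resolution; datetime.date can't get here
ybp = np.datetime64(0, 'Y') - np.timedelta64(10000, 'Y')
print(ybp)   # -8030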

DanPatterson
MVP Esteemed Contributor

There is also the new StringDType()

Working with Arrays of Strings And Bytes — NumPy v2.2 Manual
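A quick taste (NumPy 2.0+, so a separate conda environment from Pro's current default):

import numpy as np
from numpy.dtypes import StringDType   # new in NumPy 2.0

# variable-length UTF-8 strings: no fixed 'U50' width, no silent truncation
pkgs = np.array(["numpy", "arcpy", "a-much-longer-package-name"],
                dtype=StringDType())
print(pkgs.dtype)   # StringDType()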

I feel a blog coming on 🤔

About the Author
Retired Geomatics Instructor (also DanPatterson_Retired). Currently working on geometry projects (various) as they relate to GIS and spatial analysis. I use NumPy, python and kin and interface with ArcGIS Pro.