The Iterable Cursor: Python Built-ins & Itertools

JoshuaBixby · ‎02-02-2016

This is the third in a multi-part series on ArcPy cursors; particularly, working with ArcPy cursors as iterable objects in Python. The first part in the series looks at some of the important components of iteration in Python. The second part in the series looks at iterating and looping over ArcPy Data Access cursors. The third part in the series looks at using several Python built-in and itertool functions with ArcPy Data Access cursors. The fourth part in the series will look at using generators or generator expressions to separate selection or filtering logic for code re-use. Fifth or following parts are unknown at this point.

The first two parts in this series cover iteration components of Python and iterating/looping in Python, both of which are crucial to understanding and working with iterables. Beyond manually iterating or looping over an iterable, there are numerous Python built-in functions that work with iterables and sequences. Starting back in version 2.3, an itertools module was introduced that contains "functions creating iterators for efficient looping."

The focus of this series is treating ArcPy Data Access cursors as Python iterables. Like most things in life, there is more than one way to answer a question using ArcGIS, and I want to provide some contrasting examples along with Pythonic examples. As much as I like writing idiomatic Python, there will be plenty of times when using native geoprocessing tools will outperform straight Python. Writing fast code is great, but it's a separate discussion for a different day.

There are so many built-in and itertool functions that work with iterables, I can't possibly demonstrate them all, but I will demonstrate a handful to illustrate how such functions work nicely with ArcPy Data Access cursors. For this and future parts in the series, I will leave the previous examples behind in favor of a real-world dataset that readers can download and experiment with themselves. Specifically, I will use the USA States layer package included with Esri Data & Maps and available on ArcGIS.com.

The same ArcPy Data Access SearchCursor will be used for most of the examples below:

>>> layer = r'USA States\USA States (below 1:3m)'
>>> fields = ["STATE_ABBR", "SHAPE@AREA", "SUB_REGION", "POP2010"]
>>> cursor = arcpy.da.SearchCursor(layer, fields)
>>>
‍‍‍‍

One question that comes up from time to time in the forums/GeoNet is how to count the number of records in a data set or selection set using cursors:

>>> # Example 1: Get record count using ArcGIS geoprocessing tool
>>> arcpy.GetCount_management(layer)
<Result '52'>
>>>
>>> # Example 2: Get record count using variable as counter
>>> with cursor:
...     i = 0
...     for row in cursor:
...         i = i + 1   # or i += 1
...     print i
...
52
>>>
>>> # Example 3: Get record count using built-in list and len functions
>>> with cursor:
...     print len(list(cursor))
...
52
>>>
>>> # Example 4: Get record count using built-in sum function
>>> with cursor:
...     print sum(1 for row in cursor)
...
52
>>>
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Looking over the record counting examples:

Example 1 uses ArcGIS's Get Count geoprocessing tool.
- The Get Count tool retrieves the number of records in a feature class, table, layer, or raster; it does not operate against cursors. It can also be used in the GUI.
Example 2 loops over the cursor using a variable as a counter.
- A counting loop structure is common across a wide range of programming languages.
Example 3 uses built-in list and len functions.
- The list function converts the entire cursor into a Python list, and the len function returns the length of the list, which corresponds to the number of records in the cursor.
- Creating a Python list does require copying the entire contents of the cursor into memory, which could become an issue for extremely large data sets.
Example 4 uses the built-in sum function along with a generator expression.
- Using a Python generator expression instead of a list does not copy the entire contents of the cursor into memory.

I included Example 1 because it is the highest performing approach to getting record counts of data sets and selection sets within the ArcGIS framework, but it doesn't deal with a cursor as a Python iterable, which is the focus of this series. Example 2 is functionally and syntactically correct, although I would argue it isn't the most Pythonic. Examples 3 and 4 both use Python built-in functions that treat a cursor as an iterable, but it is arguable whether Example 3 or 4 is more Pythonic because each approach has strengths and weaknesses.

Another question that comes up occasionally is how to retrieve a record from a data set based on the minimum or maximum values of one of the fields in the data set. This next set of examples will retrieve the record/row for the state with the highest population in 2010 (POP2010):

>>> # Retrieve table name for data source of layer
>>> desc = arcpy.Describe(layer)
>>> fc_name = desc.featureClass.baseName
>>>
>>> # Example 5: Get maximum population record using ArcGIS geoprocessing tools
>>> summary_table = arcpy.Statistics_analysis(layer,
...                                           "in_memory/summary_max",
...                                           "POP2010 MAX")
...                                      
>>> arcpy.AddJoin_management(layer,
...                          "POP2010",
...                          summary_table,
...                          "MAX_POP2010",
...                          "KEEP_COMMON")
...                     
>>> joined_fields = [(field if "@" in field else ".".join([fc_name, field]))
...                  for field
...                  in fields]
>>> cursor_sql_join = arcpy.da.SearchCursor(layer, joined_fields)
>>> print next(cursor_sql_join)
(u'CA', 41.639274447708424, u'Pacific', 37253956)
>>> del cursor_sql_join
>>> arcpy.RemoveJoin_management(layer, "summary_max")
<Result 'USA States\\USA States (below 1:3m)'>
>>>
>>> # Example 6: Get maximum population record using SQL subquery
>>> sql = "POP2010 IN ((SELECT MAX(POP2010) FROM {}))".format(fc_name)
>>> cursor_sql_subqry = arcpy.da.SearchCursor(layer, fields, sql)                                
>>> print next(cursor_sql_subqry)
(u'CA', 41.639274447708424, u'Pacific', 37253956)
>>> del cursor_sql_subqry
>>>
>>> # Example 7: Get maximum population record by looping and comparing
>>> with cursor:
...     max_row = next(cursor)
...     for row in cursor:
...         if row[3] > max_row[3]:
...             max_row = row
...        
>>> print max_row
(u'CA', 41.639274447708424, u'Pacific', 37253956)
>>>
>>> # Example 8: Get maximum population record using built-in max function
>>> from operator import itemgetter
>>> with cursor:
...     print max(cursor, key=itemgetter(3))
...
(u'CA', 41.639274447708424, u'Pacific', 37253956)
>>>
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Looking over the maximum record examples:

Examples 5 and 6 require the table name for the data source of the layer.
- The table name was found using the ArcPy Describe function and properties of two different Describe objects (the property lookups were chained together on line 03)
- Examples 5 and 6 both need the underlying table name but for different reasons.
  - With Example 5, the base table name is needed to re-create the cursor after the summary statistics table is joined to the layer's source table.
  - With Example 6, the base table name is needed to create the subquery used in the cursor's SQL WHERE clause.
Example 5 uses ArcGIS's Summary Statistics and Add Join geoprocessing tools.
- The Summary Statistics tool finds the maximum population value in the dataset. The Add Join tool allows the result of the Summary Statistics tool to be linked back to the original layer to find the corresponding record with the maximum population.
- Before creating a cursor against the newly joined layer, the original fields need to be updated to prepend the table name to the field name (line 16).
Example 6 uses an SQL subquery in the WHERE clause of the SQL query.
- The SQL subquery identifies the maximum population value. The value(s) returned from the SQL subquery are used to select record(s) in the original data set.
Example 7 loops over the cursor using a variable to hold the record with the maximum value.
- A loop structure is common across a wide range of programming languages.
Example 8 uses the built-in max function along with the operator.itemgetter function.
- The max function operates on an iterable, but it is necessary to use operator.itemgetter since the cursor represents an iterable of tuples and not an iterable of numeric data types.

I included Example 5 because it only uses ArcGIS geoprocessing tools and can be implemented in the GUI. Although it can be implemented in the GUI with no scripting skills, Example 5 is also the most cumbersome, i.e., it has the most steps, is the slowest, and creates intermediate products. With just a little bit of SQL or Python knowledge, the doors open to more eloquent and higher performing approaches. Similar to Example 2 above, Example 7 is functionally and syntactically correct, although I would argue Example 8 is more Pythonic.

Instead of getting just the State with the largest population in 2010, let's print States and their populations by descending population:

>>> # Example 9: Print State and population by descending population
>>> #            using Sort geoprocessing tool
>>> sorted_table = arcpy.Sort_management(layer,
...                                      "in_memory/sorted_pop",
...                                      "POP2010 DESCENDING")
...                                      
>>> cursor_sorted_table = arcpy.da.SearchCursor(sorted_table, fields)
>>> with cursor_sorted_table:
...     for state, area, sub_region, pop2010 in cursor_sorted_table:
...         print "{}, {}".format(state, pop2010)
...         
CA, 37253956
TX, 25145561
NY, 19378102
...
DC, 601723
WY, 563626
PR, -99
>>> del cursor_sorted_table
>>> 
>>> # Example 10: Print State and population by descending population
>>> #             appending SQL ORDER BY clause
>>> sql = "ORDER BY POP2010 DESC"
>>> cursor_orderby_sql = arcpy.da.SearchCursor(layer, fields, sql_clause=(None, sql))
>>> with cursor_orderby_sql:
...     for state, area, sub_region, pop2010 in cursor_orderby_sql:
...         print "{}, {}".format(state, pop2010)
...         
CA, 37253956
TX, 25145561
NY, 19378102
...
DC, 601723
WY, 563626
PR, -99
>>> del cursor_orderby_sql
>>> 
>>> # Example 11: Print State and population by descending population
>>> #             using built-in sorted function
>>> with cursor:
...     for state, area, sub_region, pop2010 in sorted(cursor,
...                                                    key=itemgetter(3),
...                                                    reverse=True):
...         print "{}, {}".format(state, pop2010)
...         
CA, 37253956
TX, 25145561
NY, 19378102
....
DC, 601723
WY, 563626
PR, -99
>>>
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Looking over the descending population examples:

Example 9 uses ArcGIS's Sort goeprocessing tool.
- The Sort tool sorts the original table into a new table. A new cursor needs to be defined using the newly sorted feature class.
Example 10 uses an SQL ORDER BY clause.
- The SQL ORDER BY clause is passed as part of the sql_clause while creating a new cursor.
Example 11 uses the built-in sorted function along with the operator.itemgetter function.
- The sorted function operates on an iterable, but it is necessary to use operator.itemgetter since the cursor represents an iterable of tuples and not an iterable of numeric data types.
- The sorted function converts the entire cursor into a Python list, and creating a Python list does require copying the entire contents of the cursor into memory, which could become an issue for extremely large data sets.

I included Example 9 because it uses ArcGIS geoprocessing tools and can be implemented easily in the GUI. Similar to Example 5 above, Example 9 is cumbersome if one is scripting instead of using the GUI. Example 10 uses some basic SQL for a straightforward solution, although it does involve having to create another cursor instead of recycling the existing cursor. Example 11 is idiomatic in that it uses the built-in sorted function and treats the cursor as an iterable. Since sorted does return a newly sorted Python list from an iterable, using the function could become an issue with extremely large data sets.

There are numerous other examples I thought up, but this post is already longer than I expected. I believe the 3 sets of examples above demonstrate how Python built-in functions that operate on iterables can be used with ArcPy Data Access cursors to write idiomatic Python for ArcGIS. The next part of the series looks at using generators and generator expressions with ArcPy Data Access cursors.