Select to view content in your preferred language

When EXISTS Doesn't: File Geodatabases and Correlated Subqueries

3768
9
10-04-2016 02:51 PM
Labels (1)
JoshuaBixby
MVP Esteemed Contributor
3 9 3,768

In /blogs/tilting/2016/09/27/performance-at-a-price-file-geodatabases-and-sql-support?sr=search&searchI..., I bring attention to the SQL support trade-off Esri made while developing the file geodatabase (FGDB) as a replacement for the personal geodatabase (PGDB).  In that post, several links are provided for those who are interested in learning more about SQL support in file geodatabases.  Although there is plenty of overlap in content between the various sources/links, there are also important statements that only exist in one place or another, and knowing that can be important when troubleshooting errors or trying to understand spurious results from data stored in file geodatabases.

One area where users can get themselves into trouble with file geodatabases and SQL is the EXISTS condition or operator.  Looking at the SQL Reference (FileGDB_SQL.htm) in the File Geodatabase API ( @Esri Downloads ) or the SQL reference for query expressions used in ArcGIS:

[NOT] EXISTS

Returns TRUE if the subquery returns at least one record; otherwise, it returns FALSE. For example, this expression returns TRUE if the OBJECTID field contains a value of 50:

EXISTS (SELECT * FROM parcels WHERE "OBJECTID" = 50)
EXISTS is supported in file, personal, and ArcSDE geodatabases only.

Seeing that SQL EXISTS is supported in the file geodatabase, how does someone get himself into trouble using it?  Unfortunately, it is easier than you might expect, and the answer is correlated subqueries.

Correlated subqueries are fairly common when working with EXISTS, so common in fact that Microsoft's EXISTS (Transact-SQL) documentation and Oracle's EXISTS Condition documentation both use a correlated subquery in at least one code example.  At its simplest, a correlated subquery is a subquery that relates back to one or more tables in the outer query.  The theory and empiricism of correlated subqueries goes well beyond this blog post, but I will mention that correlated subqueries do have some JOIN-like properties, or at least appearances.

The following example is adapted from Select Max value arcpy‌, the same GeoNet question that got me researching this issue many months back.  Let's start with 2 basic tables (TableA and TableB), each having a text and integer field, and one table containing a subset of records from the other table.

A screenshot showing 2 basic tables with a handful of records

>>> fgdb = # path to file geodatabase
>>> pgdb = # path to personal geodatabase
>>> egdb = # path to enterprise geodatabase, SQL Server used in example
>>> gpkg = # path to GeoPackage
>>> gdbs = (fgdb, pgdb, egdb, gpkg)
>>> 
>>> values = (("A", 1), ("A", 2), ("B", 1), ("B", 2),
...           ("C", 2), ("C", 3), ("D", None), ("E", 1))
... 
>>> table_names = ("TableA", "TableB")
>>> 
>>> for gdb in gdbs:
...     for i in xrange(2,0,-1):
...         table = arcpy.CreateTable_management(gdb, table_names[i-1])
...         arcpy.AddField_management(table, "id", "TEXT")
...         arcpy.AddField_management(table, "version", "LONG")   
...         with arcpy.da.InsertCursor(table, ("id", "version")) as cur:
...             for value in values[::i]:
...                 cur.insertRow(value)
...     
...     qry = ("EXISTS (SELECT 1"
...                    "  FROM TableB"
...                    " WHERE TableB.id = TableA.id"
...                    "  AND TableB.version = TableA.version)")
...     
...     table_view = arcpy.MakeTableView_management(table, "table_view", qry)
...     print arcpy.GetCount_management(table_view)
...     arcpy.Delete_management(table_view)
0
3
3
3
>>> ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

The sample code above creates the two basic tables in a file geodatabase, personal geodatabase, enterprise geodatabase (I used SQL Server), and a GeoPackage.  The code then creates a table view for each of the geodatabases by using EXISTS with a simple correlated subquery to select the records from TableA that have a matching id and version in TableB.  The correct table view is shown in the screenshot above.  One can see from the results of the code that no records are returned from the file geodatabase.

So why doesn't the file geodatabase return any records?  Are subqueries not supported?  Seeing that the documentation for EXISTS mentions subqueries, it seems they have to be supported, at least to some degree.  Let's take a closer look at the Subqueries section of SQL reference for query expressions used in ArcGIS:

Subqueries

Note:

Coverages, shapefiles, and other nongeodatabase file-based data sources do not support subqueries. Subqueries that are performed on versioned ArcSDE feature classes and tables will not return features that are stored in the delta tables. File geodatabases provide the limited support for subqueries explained in this section, while personal and ArcSDE geodatabases provide full support. For information on the full set of subquery capabilities of personal and ArcSDE geodatabases, refer to your DBMS documentation.

A subquery is a query nested within another query. It can be used to apply predicate or aggregate functions or to compare data with values stored in another table....

Subquery support in file geodatabases is limited to the following:

  • IN predicate. For example:
"COUNTRY_NAME" NOT IN (SELECT "COUNTRY_NAME" FROM indep_countries)
  • Scalar subqueries with comparison operators. A scalar subquery returns a single value. For example:
"GDP2006" > (SELECT MAX("GDP2005") FROM countries)

For file geodatabases, the set functions AVG, COUNT, MIN, MAX, and SUM can only be used within scalar subqueries.

  • EXISTS predicate. For example:
EXISTS (SELECT * FROM indep_countries WHERE "COUNTRY_NAME" = 'Mexico')

The documentation states "limited support" for subqueries, but it also states the EXISTS predicate is supported.  If subqueries aren't the problem, maybe correlated subqueries are the problem.  After all, none of the EXISTS examples in all of the documentation uses a correlated subquery.  Unfortunately, guessing is what we are left with because correlated subqueries are not mentioned explicitly in any of the file geodatabase documentation I can find.

From this simple example, it is clear the file geodatabase gives incorrect results when using EXISTS with correlated subqueries.  Are the incorrect results a bug or limitation of SQL support in the file geodatabase?  Even if one tries to argue the latter, there is still the problem of the user getting incorrect results instead of an error message, which should be what the user gets if correlated subqueries are not supported.

9 Comments
LanceShipman
Esri Regular Contributor

File Geodatabase does not support correlated subqueries,

JoshuaBixby
MVP Esteemed Contributor

After publishing the blog post, I got to wondering how GeoPackages are handled.  It turns out, they work just fine, i.e., one can create the tables and table view from a GeoPackage and get the correct results.  I went ahead and updated the code to include using a GeoPackage.

AndrewWilson99
Frequent Contributor

lshipman-esristaff‌, thanks for the inside scoop.   Do you have any idea if that recognition of response differences in geodatabases means that the documentation will be updated to be a bit more explicit, that there will be fixes (next version, patches, hotfixes) to allow SQL Correlated Subqueries to work in fGDB like they work in other data formats in the future, or is it just that this as far as it goes for the time?   Thanks in advance.

LanceShipman
Esri Regular Contributor

I plan on updating the documentation Post 10.5. Support of SQL Correlated Subqueries is not currently planned for the fGDB, but if a persuasive business case is submitted we will consider it.

JoshuaBixby
MVP Esteemed Contributor

Although my blog post example involved two tables, one of our most common uses for correlated subqueries is actually self-correlated subqueries:

>>> fgdb = # path to file geodatabase
>>> pgdb = # path to personal geodatabase
>>> egdb = # path to enterprise geodatabase, SQL Server used in example
>>> gpkg = # path to GeoPackage
>>> gdbs = (fgdb, pgdb, egdb, gpkg)
>>> 
>>> values = (("A", 1), ("A", 2), ("B", 1), ("B", 2),
...           ("C", 2), ("C", 3), ("D", None), ("E", 1))
... 
>>> table_name = "TableA"
>>> 
>>> for gdb in gdbs:
...     table = arcpy.CreateTable_management(gdb, table_name)
...     arcpy.AddField_management(table, "id", "TEXT")
...     arcpy.AddField_management(table, "version", "LONG")   
...     with arcpy.da.InsertCursor(table, ("id", "version")) as cur:
...         for value in values:
...             cur.insertRow(value)
...     
...     qry = ("NOT EXISTS (SELECT 1"
...                    "  FROM TableA a"
...                    " WHERE TableA.id = a.id"
...                    "  AND TableA.version < a.version)")
...     
...     table_view = arcpy.MakeTableView_management(table, "table_view", qry)
...     arcpy.TableToTable_conversion(table_view, gdb, table_name + "_max")
...     print arcpy.GetCount_management(table_view)
...     arcpy.Delete_management(table_view)
8
5
5
5‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

The use of a self-correlated subquery allows for selecting the maximum or minimum record for a given attribute.  In the example above, a self-correlated subquery is used to select the record with the maximum value of version for each value of id.

Bud
by
Esteemed Contributor

@LanceShipman  

Regarding: 

"File Geodatabase does not support correlated subqueries" 

Does that still apply to ArcGIS Pro 3.0.3? There seems to be a lot of stuff supported in FGDBs these days, including INNER JOIN (SQL for reporting and analysis on file geodatabases) and database views. I'd like to think correlated subqueries would be supported in this modern age. 

I ask because of a related post I wrote here: File Geodatabase SQL expression to get greatest n per group 

Bud
by
Esteemed Contributor

BUG-000156143 - An SQL query containing the EXISTS predicate validates successfully but returns incorrect results from a file geodatabase.

ENH-000156158 - Raise exception when using unsupported SQL expressions in file geodatabases.

Esri Canada Case 03264937 - Exception not raised when unsupported FGDB SQL used (correlated subquery)

Bud
by
Esteemed Contributor

@LanceShipman 

10-25-2016 12:35 PM

I plan on updating the documentation Post 10.5. Support of SQL Correlated Subqueries is not currently planned for the fGDB, but if a persuasive business case is submitted we will consider it.

I would say the "greatest 1 per group" scenario is a worthwhile business case for supporting correlated subqueries in file geodatabases. Currently, there doesn't seem to be a way to do that in FGDB SQL expressions.

It's a common requirement. For example: Selecting the most recent records based on unique values in another field

About the Author
I am currently a Geospatial Systems Engineer within the Geospatial Branch of the Forest Service's Chief Information Office (CIO). The Geospatial Branch of the CIO is responsible for managing the geospatial platform (ArcGIS Desktop, ArcGIS Enterprise, ArcGIS Online) for thousands of users across the Forest Service. My position is hosted on the Superior National Forest. The Superior NF comprises 3 million acres in northeastern MN and includes the million-acre Boundary Waters Canoe Area Wilderness (BWCAW).