LIKE and SIMILAR: File Geodatabases and Pattern Matching

JoshuaBixby
07-19-2018 09:10 AM

As I raised in /blogs/tilting/2016/09/27/performance-at-a-price-file-geodatabases-and-sql-support, Esri made trade-offs when developing the file geodatabase as a replacement for, or upgrade to, the personal geodatabase.  One of the biggest trade-offs was, and continues to be, support for SQL.  In /blogs/tilting/2016/10/04/when-exists-doesnt-file-geodatabases-and-correlated-subqueries I provide one example where Esri partially implements an SQL operator.

While partially implementing a standard isn't unusual, partially implementing a distinct component or aspect of a standard is.  In Esri's case with EXISTS and SQL, it would have been better not to implement EXISTS at all than to partially implement it and allow common SQL patterns using it to generate incorrect results.

Another part of SQL support lacking in the file geodatabase is expanded pattern matching for strings.  I have lamented the lack of such functionality in GeoNet responses from time to time, and I have even gone so far as to log enhancement requests with Esri Support and submit an ArcGIS Idea (Expanded Pattern Matching in File Geodatabases (and documentation would be nice)), but it all seemed for naught, until a few months ago....

It turns out Esri has implemented SQL support for expanded pattern matching in file geodatabases; they just haven't told anyone about it.  So how long has this support been around, and what all is supported?  Both good questions.  It would be great to get answers directly from Esri, but let's not hold our breath, given that the file geodatabase has been around for almost 12 years and Esri has yet to say a peep about expanded pattern matching support.

With respect to the first question, i.e., how long expanded pattern matching support has been included, I experimented back to ArcGIS 9.3 (I could not find ArcGIS 9.2 installer files to go all the way back to the introduction of the file geodatabase).  As interesting as my experimentation was, there is too much to include here.  Suffice it to say, what I share below only works properly in ArcGIS Desktop 10.6 or later and ArcGIS Pro 2.0 or later.

Unlike basic pattern matching using LIKE, expanded pattern matching is implemented across database platforms using a range of operators/predicates, e.g., SIMILAR TO, POSIX-style regular-expression operators, and various vendor-specific functions.

In terms of expanded pattern matching and the file geodatabase, Esri has chosen the SIMILAR predicate, which was originally codified in ISO/IEC 9075-1:1999.  Although PostgreSQL implements SIMILAR as well (see PostgreSQL: Documentation: 9.3: Pattern Matching), I have found several important cases where functionality supported in PostgreSQL does not work in the file geodatabase.  To date, the Firebird documentation on SIMILAR TO comes the closest to describing the behavior I have seen with the file geodatabase.

So, what does this all mean for querying feature classes and tables in a file geodatabase?  Instead of making up some data to demonstrate the functionality, let's use someone's data and their question from a GeoNet post:  ArcGIS Select Query help.

First step, build a scratch table containing the data to query:

>>> import arcpy
>>> 
>>> field = {'field_name':'Field1', 'field_type':'TEXT', 'field_length':24}
>>> values = [
...     'A1,F4,A10',
...     'A1,A17',
...     'F1,G6, A1',
...     'A10, A1',
...     'D1, A17',
...     'D2, D4',
...     'A1 ,D2',  # value not included in original post, added for demonstration 
...     'G6, A10'  # value not included in original post, added for demonstration
... ]
... 
>>> scratchGDB = arcpy.env.scratchGDB
>>> table = arcpy.CreateTable_management(scratchGDB, "tmpTable")
>>> arcpy.AddField_management(table, **field)
>>> with arcpy.da.InsertCursor(table, field['field_name']) as cur:
...     for value in values:
...         cur.insertRow([value])
...         
>>> 
>>> with arcpy.da.SearchCursor(table, "*") as cur:
...     for row in cur:
...         print(row)
...         
(1, u'A1,F4,A10')
(2, u'A1,A17')
(3, u'F1,G6, A1')
(4, u'A10, A1')
(5, u'D1, A17')
(6, u'D2, D4')
(7, u'A1 ,D2')
(8, u'G6, A10')
>>> 
>>> del row, cur, value
>>> field = field['field_name']
>>> 
>>> # create helper function to select and print records using cursor
>>> def select_and_print(table, where_clause):
...     with arcpy.da.SearchCursor(table, "*", where_clause) as cur:
...         for row in cur:
...             print(row)
...             
>>>

Now the question:

I am trying to select records showing A1 value.  I have tried using query  Field1 LIKE '%A1%', it is also selecting A10. Is there any way I can select only value A1?

Let's try what the user originally tried and see the result:

>>> sql = "{} LIKE '%A1%'".format(field)
>>> print(sql)
Field1 LIKE '%A1%'
>>> select_and_print(table, sql)
(1, u'A1,F4,A10')
(2, u'A1,A17')
(3, u'F1,G6, A1')
(4, u'A10, A1')
(5, u'D1, A17')
(7, u'A1 ,D2')
(8, u'G6, A10')
>>>

We can see the problem the user was facing: basic SQL pattern matching (LIKE) with a wildcard (%) before and after the query value (A1) produces false-positive selections (A10 and A17).
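To see why, it helps to translate LIKE's wildcards into regular-expression terms.  The following standalone Python sketch (standard library only, no arcpy; the like() helper is my own illustration, not an Esri or arcpy API) emulates LIKE semantics: % matches any run of characters, so '%A1%' matches anything containing 'A1' as a substring, which includes every A10 and A17 value.

```python
import re

def like(value, pattern):
    """Emulate SQL LIKE: '%' matches any run of characters, '_' exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value) is not None

values = ["A1,F4,A10", "A1,A17", "F1,G6, A1", "A10, A1",
          "D1, A17", "D2, D4", "A1 ,D2", "G6, A10"]

# every value except 'D2, D4' contains the substring 'A1'
print([v for v in values if like(v, "%A1%")])
```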

So if %A1% selects A10 and A17 in addition to A1, what if we put a comma after the value and before the wildcard:

>>> sql = "{} LIKE '%A1,%'".format(field)
>>> print(sql)
Field1 LIKE '%A1,%'
>>> select_and_print(table, sql)
(1, u'A1,F4,A10')
(2, u'A1,A17')
>>>

Adding the comma dropped the earlier false positives, but now certain records we want are not being selected, e.g., OIDs/records 3 and 7.

As you can start to see, there isn't a single basic pattern that will work in this situation.  In order to use basic pattern matching with LIKE, several predicates will need to be chained together.

>>> sql = ("{0} LIKE 'A1,%' OR\n"
...        "{0} LIKE 'A1 %' OR\n"
...        "{0} LIKE '% A1 %' OR\n"
...        "{0} LIKE '% A1,%' OR\n"
...        "{0} LIKE '%,A1 %' OR\n"
...        "{0} LIKE '%,A1,%' OR\n"
...        "{0} LIKE '% A1' OR\n"
...        "{0} LIKE '%,A1'").format(field)
...        
>>> print(sql)
Field1 LIKE 'A1,%' OR
Field1 LIKE 'A1 %' OR
Field1 LIKE '% A1 %' OR
Field1 LIKE '% A1,%' OR
Field1 LIKE '%,A1 %' OR
Field1 LIKE '%,A1,%' OR
Field1 LIKE '% A1' OR
Field1 LIKE '%,A1'
>>> select_and_print(table, sql)
(1, u'A1,F4,A10')
(2, u'A1,A17')
(3, u'F1,G6, A1')
(4, u'A10, A1')
(7, u'A1 ,D2')
>>>

The possibility of spaces before or after the value being queried means more conditions to address, which means more LIKE predicates.  For this data set, a total of eight LIKE predicates are needed to select the records of interest.
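The eight-predicate union can be sanity-checked outside the geodatabase.  Here is a small stdlib-only Python sketch using fnmatch, with SQL's % rewritten as fnmatch's * (an emulation for illustration, not how the file geodatabase evaluates the query):

```python
from fnmatch import fnmatchcase

# the eight LIKE patterns from the query, with '%' rewritten as '*'
patterns = ["A1,*", "A1 *", "* A1 *", "* A1,*",
            "*,A1 *", "*,A1,*", "* A1", "*,A1"]

values = ["A1,F4,A10", "A1,A17", "F1,G6, A1", "A10, A1",
          "D1, A17", "D2, D4", "A1 ,D2", "G6, A10"]

# a value is selected if it matches any one of the chained predicates
matches = [v for v in values if any(fnmatchcase(v, p) for p in patterns)]
print(matches)  # the same five records the chained LIKE query selects
```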

Since basic pattern matching with LIKE is implemented the same across database platforms, the code above is quite portable, even if it is a bit unwieldy and inefficient from a query execution perspective.  One can imagine the number of predicate statements getting really large, really quickly if the requirements for the query are more involved than our example here.

As the complexity of the data, the search conditions, or both grows, expanded/complex pattern matching can be easier to implement and sometimes more efficient to execute.  Using the SIMILAR TO documentation from Firebird referenced above, the following SQL statement can be used to query the records:

>>> sql = "{} SIMILAR TO '(A1[ ,]%|%[ ,]A1[ ,]%|%[ ,]A1)'".format(field)
>>> print(sql)
Field1 SIMILAR TO '(A1[ ,]%|%[ ,]A1[ ,]%|%[ ,]A1)'
>>> select_and_print(table, sql)
(1, u'A1,F4,A10')
(2, u'A1,A17')
(3, u'F1,G6, A1')
(4, u'A10, A1')
(7, u'A1 ,D2')
>>>

Since SIMILAR is based on regular expression searching and the Firebird documentation appears to be fairly accurate/consistent with Esri's implementation, I will forgo explaining the structure of the SIMILAR statement above.  My intent with demonstrating the use of SIMILAR with file geodatabase data isn't to explain how regular expressions work, although I strongly encourage anyone who works with text data to learn regular expressions.
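That said, readers who want to experiment with the pattern outside ArcGIS can approximate SIMILAR semantics with Python's re module.  Since SIMILAR matches against the entire string, re.fullmatch is a reasonable stand-in, with SQL's % swapped for .* while the bracket classes and alternation carry over as-is (a rough emulation under those assumptions, not Esri's engine):

```python
import re

# the SIMILAR TO pattern with '%' translated to '.*'
similar = "(A1[ ,].*|.*[ ,]A1[ ,].*|.*[ ,]A1)"

values = ["A1,F4,A10", "A1,A17", "F1,G6, A1", "A10, A1",
          "D1, A17", "D2, D4", "A1 ,D2", "G6, A10"]

# fullmatch mimics SIMILAR anchoring the pattern to the whole string
selected = [v for v in values if re.fullmatch(similar, v)]
print(selected)  # the same five records the SIMILAR query selects
```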

My goal with this blog post is to point out that Esri does support the use of SIMILAR in the file geodatabase, and that Esri really needs to document that support.

P.S. 

I would be remiss if I didn't point out Vince Angelo's comment in ArcGIS Select Query help.  Although one can use text-based pattern matching to query values from lists stored in a field/column, such an approach is sub-optimal when it comes to working with relational data stores.  A more robust approach to structuring and querying the data is what Vince suggests:

If you had a second table organized:

keycol   attrval
1        A1
1        F4
1        A10
2        A1
2        A17
3        F1
3        G6
3        A1
4        A10
4        A1

then you could fashion a query:

SELECT *
FROM mytable
WHERE keycol IN (
   SELECT keycol
   FROM secondtab
   WHERE attrval = 'A1')

- V
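Vince's normalized design can be tried outside the file geodatabase as well.  Below is a minimal sketch using Python's built-in sqlite3 module, following his table and column names (the split-and-normalize step that populates the second table is my own illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE mytable (keycol INTEGER, Field1 TEXT)")
cur.execute("CREATE TABLE secondtab (keycol INTEGER, attrval TEXT)")

rows = [(1, "A1,F4,A10"), (2, "A1,A17"), (3, "F1,G6, A1"), (4, "A10, A1")]
cur.executemany("INSERT INTO mytable VALUES (?, ?)", rows)

# normalize: one row per (key, value) pair instead of a comma-delimited list
pairs = [(k, v.strip()) for k, raw in rows for v in raw.split(",")]
cur.executemany("INSERT INTO secondtab VALUES (?, ?)", pairs)

# exact equality on attrval -- no wildcard edge cases to worry about
cur.execute("""SELECT * FROM mytable
               WHERE keycol IN (SELECT keycol FROM secondtab
                                WHERE attrval = 'A1')""")
result = cur.fetchall()
print(result)  # all four keys have an exact 'A1' entry
```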

About the Author
I am currently a Geospatial Systems Engineer within the Geospatial Branch of the Forest Service's Chief Information Office (CIO). The Geospatial Branch of the CIO is responsible for managing the geospatial platform (ArcGIS Desktop, ArcGIS Enterprise, ArcGIS Online) for thousands of users across the Forest Service. My position is hosted on the Superior National Forest. The Superior NF comprises 3 million acres in northeastern MN and includes the million-acre Boundary Waters Canoe Area Wilderness (BWCAW).