Fix the ArcGIS "Query Layer" framework and make it scalable to ultra large databases.

Idea created by mboeringa2010 Champion on Aug 16, 2018

    Despite the high potential for accessing non-geodatabase spatial databases, Query Layer's currently have some serious limitations and issues that limit their value. Especially problematic is that Query Layers in the current implementation cannot properly scale to be used against ultra large spatial database, like e.g. an OpenStreetMap database for the entire globe. This ArcGIS Idea outlines a number of changes that I think are necessary to have Query Layers truly live up to their potential.


    For reference, the Make Query Layer tool has the following syntax according to the ArcGIS Help:


    MakeQueryLayer_management (input_database, out_layer_name, query, {oid_fields}, {shape_type}, {srid}, {spatial_reference})


    What are the current issues?:


    1) - ArcGIS needs to know the spatial extent of a layer before it can properly use it. For layers registered with an ESRI Geodatabase, the spatial extent is stored and known, but this is not the case for Query Layers. In the current implementation of Query Layers,  ArcGIS Pro will attempt to calculate the spatial extent based on a full scan of the WHERE clause records, which may be dozens or even hundreds of millions of records for a global database.


    This is problematic, as, in order to calculate it, it needs to pull all of these records to the local ArcGIS instance running on your computer. This means huge and prolonged network traffic, as ArcGIS transfers each and every record of the table over the network. I have witnessed ArcGIS pulling > 50Mbs network traffic for hours on end while creating a single Query Layer and calculating the Spatial Extent, meaning that potentially dozens of GB of data are being transferred over the local network just to calculate the spatial extent. This behavior was confirmed by a support call to ESRI.


    Even more problematic is that this behavior is even triggered when accessing dataset properties using "arcpy.Describe". Even if you do not create a Query Layer, but want to access the dataset properties using arcpy and Describe, a similar huge performance bottle neck is hit, with again huge network traffic and the application locking up.


    So what needs to change?: While the current behavior gives an accurate calculation result for the spatial extent, it seems to me to be an an overkill in the context of globally sized databases with hundreds of millions of records. In my opinion, ESRI should take a random sample of the records to determine an estimate of the spatial extent. Considering users are allowed to manually enter a spatial extent, the exact size seems non-critical and an estimate would likely do. This would be very much equivalent to what a database ANALYZE command does, which also only takes a sample of records, and would strongly reduce the amount of required IO and time needed to create a Query Layer, or access some of its properties.


    2) - In the current implementation, there is a fundamental difference between Query Layers having a Definition Query set in their Layer properties, or the Definition Query appended to the actual SQL statement defining the Query Layer. I naively assumed ESRI's Query Layer implementation would concatenate the Definition Query to the SQL of the Query Layer, and send this of to the database. This is not the case.


    E.g. a Query Layer with the following properties:


    "SELECT * FROM <TABLE_NAME>" as Query Layer SQL statement


    "<FIELD> = X" as Definition Query

     is NOT the same as / equivalent to a Query Layer with:

    "SELECT * FROM <TABLE_NAME> WHERE <FIELD> = X" as Query Layer SQL statement


    from a performance and database point of view in the current ArcGIS implementation of accessing non-geodatabase spatial tables. The latter is much faster if the WHERE condition is selective, as the former is in essence considered one giant table without any selection, and the Definition Query will only be evaluated locally on your own computer once the full dataset has been pulled over the network, which is a costly and time consuming operation.


    For many operations in ArcGIS, there could be a considerable performance boost for Query Layers, if ArcGIS by default always concatenated any Definition Query set in the layer's properties to the existing Query Layer SQL statement when accessing layer properties or sending the SQL to the database to create a recordset for a Query Layer or for opening the Attribute Table.


    After all: that is what a Definition Query is about: it is supposed to "define" what records belong to a layer, yet ArcGIS ignores this "definition-of-a-layer" when accessing the Query Layer and/or its properties if it is set as a layer property Definition Query, instead of concatenated to the Query Layer's SQL statement.


    3) - In the current implementation, if a Query Layer's SQL statement does not return any results (0 records), the Query Layer's geometry type defaults to Point. To be honest, I was gobsmacked to learn this when I read the Help, especially since the Query Layer command actually has an option to specify the geometry type ({shape_type}), yet the current implementation does not use it to set the geometry type if there are no records returned by the SQL, but defaults to Point always. This is a serious hindrance to automation using arcpy and Python.


    Why is this such a problem? Just imagine a programming language, that supports integers and strings. Now it has one peculiarity: if the integer value happens to be negative, it will be magically turned into a string value... I don't think I have to explain why this is wrong, yet this is what the current Query Layer implementation essentially does!


    What are the consequences for automation using arcpy? As a concrete example: I have a script that sets symbology using the ApplySymbologyFromLayer_management command. This command requires the geometry of the input feature class to match the geometry type of the layer file's symbology. In my code, I know which geometry type I am handling, yet, if I attempt to run this command and the SQL of a Query Layer happens to return 0 records, my script for updating the symbology suddenly fails for any Line and Polygon layer as the Query Layer being created magically turned into a Point feature layer!


    Having 0 records being returned by a Query Layer is very well possible if dealing with a variety of data across the globe and complex SQL statements. This should be supported.


    So what needs to change?: the Query Layer framework should use the Query Layer command's {shape_type} variable to use as default whenever a Query Layer returns 0 records and when the parameter is specified (it is currently "optional"). Only when {shape_type} is not specified, should it default to Point geometry.


    4) - A similar issue exist for the Spatial Reference. If 0 records, it will default to NAD1983, yet, as with the geometry type, the command includes the ability to set the SRID through code ({srid}, {spatial_reference}), but this is ignored.


    So what needs to change?: if specified, use the {srid} or {spatial_reference} parameter input of the Make Query Layer command to set the Spatial Reference of the resulting layer, even if it has 0 records.