
Impact of Disabled Fields on SQL Server Queries in ArcGIS Pro

2 weeks ago
gisarchitect
Regular Contributor

Hello Everyone,

I have a question about data performance in ArcGIS Pro when using SQL Server as a data source.

Specifically, when I disable certain fields in a layer before adding it to the map, how does this impact the underlying database query? Does SQL Server still read all data pages and fetch all fields from disk, or is the query optimized to retrieve only the enabled fields that are actively being displayed?

My expectation is that the query should be optimized since fewer fields are being queried. However, I am uncertain about this because of how SQL Server stores data. For instance, if the data uses row-store storage, the database pages would still contain all fields, and the I/O operations on the disk would read all fields regardless of what is needed. This behavior would contradict the expected optimization.

I would greatly appreciate it if anyone could shed light on how ArcGIS Pro handles such scenarios and whether this affects database performance.

1 Solution

Accepted Solutions
VinceAngelo
Esri Esteemed Contributor

There's quite a bit wrong with the assumptions behind this, so let's make sure the ground is clear:

  • ArcGIS Pro is a database client. Changing display properties on a layer has no impact on the data within the database server, only on the queries which are generated (and results returned)
  • Queries against a database server have no impact on the data storage of the rows, which are usually bundled up into blocks in the database I/O subsystem.
  • Databases read blocks, not rows, so if the DBMS thinks it needs any row in the block, it has to read the entire block. Blocks are cached, so that supplemental reads of any other rows in the block will likely not cost more (see fragmentation, later)
  • Rows which span blocks generally require chaining to the supplemental block(s)
  • Databases often "read ahead" once they detect a sequence of blocks is being queried, flushing out least recently used blocks, so that subsequent reads are "faster" (which isn't always the case, especially if the read-ahead wasn't needed, and the flushed blocks could have been needed instead)

Now then:

Specifically, when I disable certain fields in a layer before adding it to the map, how does this impact the underlying database query?


It changes the column list in the SELECT clause, and therefore the contents of the rows which are returned, and nothing more.
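For illustration, with a hypothetical PARCELS feature class (the table and column names here are invented, not what Pro actually emits), the generated statement shrinks from the first form to the second:

    -- All fields enabled: every column is requested
    SELECT OBJECTID, SHAPE, PARCEL_ID, OWNER_NAME, LEGAL_DESC, NOTES
    FROM dbo.PARCELS;

    -- OWNER_NAME, LEGAL_DESC, and NOTES disabled in the layer:
    -- only the remaining columns appear in the SELECT list
    SELECT OBJECTID, SHAPE, PARCEL_ID
    FROM dbo.PARCELS;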


Does SQL Server still read all data pages and fetch all fields from disk, or is the query optimized to retrieve only the enabled fields that are actively being displayed?

Usually the former, though it's really "all fields from all rows from all blocks (or the subset of all blocks indicated by the index), plus all of the chained blocks". In order to optimize for selected columns, you'd need to store every combination of the row contents for every possible query, using orders of magnitude more disk space, and making updates an I/O nightmare.
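You can watch this for yourself in SQL Server Management Studio with SET STATISTICS IO, which reports how many pages a statement read. On the same hypothetical PARCELS table, a narrow column list typically reports about the same number of logical reads as the full one, because each page holds whole rows either way (unless the optimizer can answer the narrower query from an index instead):

    SET STATISTICS IO ON;

    -- Full column list: scans every page of the table
    SELECT OBJECTID, SHAPE, PARCEL_ID, OWNER_NAME, LEGAL_DESC, NOTES
    FROM dbo.PARCELS;

    -- Narrow column list: still scans the same pages on a heap or
    -- clustered-index scan, since the pages contain entire rows
    SELECT OBJECTID, PARCEL_ID
    FROM dbo.PARCELS;

    SET STATISTICS IO OFF;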


My expectation is that the query should be optimized since fewer fields are being queried. However, I am uncertain about this because of how SQL Server stores data. For instance, if the data uses row-store storage, the database pages would still contain all fields, and the I/O operations on the disk would read all fields regardless of what is needed. This behavior would contradict the expected optimization.


Which indicates the expectation is incorrect. The key piece you're missing, though, is the optimizer, which might be able to pull the data from the skinnier pages (and therefore more densely packed blocks) associated with indexes. This is why a covering index is created.
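As a sketch of what that looks like in SQL Server, using the same invented names: a nonclustered index whose key and INCLUDE columns cover the query lets the engine read only the narrow index pages and never touch the base table.

    -- Covering index for queries that filter on PARCEL_ID and only
    -- need OWNER_NAME back
    CREATE NONCLUSTERED INDEX IX_PARCELS_PARCELID_COVER
        ON dbo.PARCELS (PARCEL_ID)
        INCLUDE (OWNER_NAME);

    -- Satisfied entirely from the index pages
    SELECT PARCEL_ID, OWNER_NAME
    FROM dbo.PARCELS
    WHERE PARCEL_ID = 12345;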


[Could anyone] shed light on how ArcGIS Pro handles such scenarios and whether this affects database performance.

ArcGIS Pro is a database client. It fashions SQL queries. It doesn't play any role in how the database responds to the queries.

The other pieces missing here are fragmentation and partitioning. If you load imagery footprint data in a database, and add to it each day, for years, then do a query for "all images between time x and time y", and have indexes on the image timestamp, the RDBMS is going to find the starting block pretty quickly, and probably will be able to plow through the subsequent blocks to get the remaining rows. But if you instead query on imagery platform, without an index, then the database needs to do a full table scan to inspect each block for rows, and each row for data. Even if you have an index which identifies the blocks where all Landsat7 images are, the database still needs to read a lot of blocks, and depending on the size of the rows, and how they are allocated across the blocks, might indeed need to read ALL the blocks. This, in essence, is fragmentation -- the need to read "unnecessary" data because it is stored in the same block as the data you want.
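To make the imagery example concrete (a hypothetical FOOTPRINTS table; the names are invented):

    -- Date-range queries can seek into this index and read forward
    CREATE NONCLUSTERED INDEX IX_FOOTPRINTS_ACQUIRED
        ON dbo.FOOTPRINTS (ACQUIRED_AT);

    -- Finds the starting block quickly, then plows through neighbors
    SELECT OBJECTID, PLATFORM, ACQUIRED_AT
    FROM dbo.FOOTPRINTS
    WHERE ACQUIRED_AT >= '2020-01-01' AND ACQUIRED_AT < '2020-02-01';

    -- No index on PLATFORM: every block must be scanned and inspected
    SELECT OBJECTID, PLATFORM, ACQUIRED_AT
    FROM dbo.FOOTPRINTS
    WHERE PLATFORM = 'Landsat7';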

Now, you could build a clustered index on the platform, which instructs the database to build an index, and order the rows in the table so that consecutive rows all share the same platform (until it changes), but then the footprint data would be fragmented with respect to date. 
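In SQL Server terms that might look like the statement below (hypothetical names again); keep in mind a table can have only one clustered index, so ordering by platform gives up any physical ordering by date.

    -- Physically orders the rows by platform, then acquisition time,
    -- so all rows for one platform land in adjacent pages
    CREATE CLUSTERED INDEX CIX_FOOTPRINTS_PLATFORM
        ON dbo.FOOTPRINTS (PLATFORM, ACQUIRED_AT);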

The other thing you could do is make several virtual tables (partitions) and store all the consecutive-date data for each platform separately, so a "platform and time" query is likely to process the fewest rows. Note that you could also partition by date and cluster by platform, if that made more sense in your overall query set. And because the partitions are very nearly tables themselves, the database can leverage multiple CPUs to search the partitions in parallel, assembling the final result set after reducing the number of blocks read as much as it can.
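A minimal sketch of date partitioning in SQL Server (function, scheme, and a table built on them; the names and boundary dates are invented):

    -- One partition per year of acquisitions
    CREATE PARTITION FUNCTION PF_FOOTPRINTS_YEAR (datetime2)
        AS RANGE RIGHT FOR VALUES ('2021-01-01', '2022-01-01', '2023-01-01');

    CREATE PARTITION SCHEME PS_FOOTPRINTS_YEAR
        AS PARTITION PF_FOOTPRINTS_YEAR ALL TO ([PRIMARY]);

    CREATE TABLE dbo.FOOTPRINTS_PART
    (
        OBJECTID    int          NOT NULL,
        PLATFORM    varchar(32)  NOT NULL,
        ACQUIRED_AT datetime2    NOT NULL
    )
    ON PS_FOOTPRINTS_YEAR (ACQUIRED_AT);

    -- A "platform and time" query touches only the matching partition(s)
    SELECT OBJECTID, PLATFORM, ACQUIRED_AT
    FROM dbo.FOOTPRINTS_PART
    WHERE PLATFORM = 'Landsat7'
      AND ACQUIRED_AT >= '2022-01-01' AND ACQUIRED_AT < '2023-01-01';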

Finally, there are spatial queries, which depend on the storage of an indication of where the features are, either by a grid algorithm (Esri ST_Geometry) or a corner+envelope-size mechanism (R-trees). Data can also be spatially fragmented, as in the years-long sequential storage of imagery footprints example above, and can be defragmented by creating a clustered index on something spatial, like UTM Zone, or country code, or province code (or a combination, like UTM+Admin1, then centroid Y value). Though, again, this optimization will reduce the performance of temporal queries (it really is a zero-sum game -- optimizations have costs, like UPDATEs on indexed fields needing to update the index values in addition to the rows, and are subject to the law of diminishing returns -- too many partitions will overfill the catalog and eventually slow queries to a crawl).
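On the SQL Server side, the native spatial index is the grid-based flavor. A rough sketch with invented names follows (SQL Server requires the table to have a primary key before a spatial index can be created, and the bounding box should match the data's actual extent):

    -- Grid-tessellation spatial index on the footprint geometry
    CREATE SPATIAL INDEX SIX_FOOTPRINTS_SHAPE
        ON dbo.FOOTPRINTS (SHAPE)
        USING GEOMETRY_AUTO_GRID
        WITH (BOUNDING_BOX = (xmin = -180, ymin = -90, xmax = 180, ymax = 90));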

Since the majority of the work is done in packing the result stream back to the client as rows, leaving out wide, unnecessary field values does mean less transmission delay across the network from the database.

- V


2 Replies

gisarchitect
Regular Contributor

@VinceAngelo 

I want to extend my sincerest thanks for your detailed and thorough explanation in response to my question. Your breakdown of how ArcGIS Pro interacts with SQL Server and the database principles underlying these interactions has been incredibly insightful.
