API does not always update spatial index

__11 · ‎11-28-2011

I'm having a problem with the Spatial Index (Grid 1) when I add data to a table.

Starting with an empty table, Grid 1 Index is zero (in ArcCatalog), as expected.

I open the database and table, add some data (one row, a polylineZM, with 33 points), close the table and database.

ArcCatalog now shows that Grid1 index is 1800. If I click Recalculate, it doesn't change, so I presume 1800 is correct.

Open the database and table, add another row (polylineZM, 42 points), close the table and database.

Reload in ArcCatalog which now shows both rows, but still says Grid1 index is 1800. Click Recalculate and it changes to 2000. Ie adding the second row with the API did not update the index.

Why does the API update the index when I add the first row, but not the second? Is there an API function I need to call to ensure the index is updated after adding rows?

Extracts from the code, showing the relevant functions that I call:

 OpenGeodatabase(databasePath, geodatabase);
 geodatabase.OpenTable(tableName, table);

 table.SetWriteLock();
 table.LoadOnlyMode(true);

 Row row();
 table.CreateRowObject(row);

 MultiPartShapeBuffer polyline;
 polyline.Setup(shapePolylineZM, 1, ...);

 Point* points;
 polyline.GetPoints(points);
  
 double* zs;
 polyline.GetZs(zs);

 double* ms;
 polyline.GetMs(ms);

 for (...)
 {
  points->x = ...
  points->y = ...
  points++;

  *zs++ = ...
  *ms++ = ...
 }

 polyline.CalculateExtent();
 row.SetGeometry(polyline);

 table.Insert(row))

 table.LoadOnlyMode(false);
 table.FreeWriteLock();

 geodatabase.CloseTable(table);
 CloseGeodatabase(geodatabase);

VinceAngelo · ‎11-28-2011

It's probably not wise to assume that the auto-generated index from one record is the optimal
grid size. It would probably take a few hundred records before the calculation algorithm has a
large enough sample size to take a reasonable guess, and it's still just that, a guess. In reality,
it generally takes ten to fifteen *thousand* records before the grid size makes any measureable
difference in performance. Generally speaking, you don't want any index at all until the table
has all records, since the performance cost of loading features with the spatial index enabled is
relatively large (this is the reason to use Table::LoadOnlyMode). You certainly wouldn't want the
API changing the index size as you add features , since that would mean an O(N^2) or O(N^3)
performance cost, and you might never complete a million row table load.

- V

__11 · ‎11-28-2011

For now I'm not concerned with efficiency, so much as understanding how the index works, and getting the code to generate consistent output. The example I quoted was the simplest one I could create that would reproduce the problem. In other cases I would be adding many rows in between the calls to LoadOnlyMode. However, I do have a requirement to open an existing table, add rows, and recalculate the existing spatial index. How do I do that?

The description of the LoadOnlyMode command itself in less than helpful, merely saying

/// Begin or End load only mode.
/// @param[in] loadOnlyMode true to begin LoadOnlyMode, false to end.
fgdbError LoadOnlyMode(bool loadOnlyMode);

The comments in samples\ShapeBuffer\ShapeBuffer.cpp are more helpful, saying

// End load only mode. This rebuilds all indexes including the spatial index.
// If all grid size values in the spatial index are zero, the values will be
// calculated based on the existing data.
multipoint2Table.LoadOnlyMode(false);

The statement that "if all grid size values ... are zero the values will be calculated ..." is consistent with what I have observed - in particular it appears that if the grid size values are not zero (because I previously calculated them) they are NOT updated.

So how do I update the spatial index if it exists, ie if the value is non-zero? I need to be able to:

Add some rows to an empty table and create the spatial index.

Add some more rows to the table and update the existing/non-zero spatial index.

Item 1 seems possible, but what about item 2? These two additions may occur in separate independent runs of the application code, and it not feasible to just not create the index until the end of the second run (because I don't know at the end of the first run whether there will be more data added later).

Is it possible to clear/zero the existing spatial index so that LoadOnlyMode(false) will create a new correct index?

VinceAngelo · ‎11-29-2011

[my longer, more complete reply got eaten by a post failure]

It is not necessary to change the index size to index new features. If a spatial index exists,
then newly inserted features are reflected in the index, and are immediately available.

I question the wisdom of a requirement to rebuild the index with every load. Any table that
is small enough so that this is is not a very expensive task is unlikely to see measurable benefit.

I also question the assumption that the grid selected by "AUTOMATIC" calculation is in fact the
best possible grid size. Optimal grid calculation must incorporate knowledge of the queries to
be performed, something which "bouncing" the spatial index (load-only true, index size 0, load-
only false) cannot provide. Its unlikely that even a poor grid size selection will result in a measurable
performance hit with a feature count under 50k. By the time you reach a million rows, the grid
size *can* have measurable impact, but I wouldn't trust the AUTOMATIC-calculated index size
with that many features without further experimentation.

In my experience, spatial reference parameter selection and spatial defragmentation have had
more than an order of magnitude greater impact than changing the grid size, so you probably
don't want to spend too much time on grid size optimization
.
- V

DavidSousa · ‎12-20-2011

It would be very undesirable for the spatial index grid sizes to be recalculated automatically each time you added more features. That would add an enormous amount of overhead, and would not be likely to yield any improvement in the index itself.

Ideally, the grid sizes that are calculated should achieve a balance between a grid size that is too large or too small. If the grid size is too large, the result will be an index that is not sufficiently selective, that is, each grid cell contains an excessive number of features which means that a large number of false positives could be returned from the index scan. On the other hand, too small of a grid size causes the index to become very large, due to the fact that each grid cell intersected by the feature's envelope must have a separate reference for that feature. This slows down the index scan because it has to do a much larger amount of file i/o to read through the index.

The grid index calculation algorithm works best when the size distribution of the features currently contained in the feature class is representative. Ideally, all the features should be present, and then the algorithm is run. If only some of the features are present, and they are not representative of the true size distribution, the calculation will be thrown off. For that reason, the best workflow is to enter load-only mode, add all of your features, and then exit load-only mode. Exiting load-only mode triggers the grid size calculation and the loading of the index.

Here is a brief description of how the algorithm works:

The envelope of each feature is obtained. We look at the delta X and delta Y of the envelope, and use the larger of the two.

The resulting delta values undergo a logarithmic transformation, and the resulting values are used to create a histogram which represents the size distribution of the features.

The histogram is statistically analyzed and the initial grid size that is selected is approximately at the 66th percentile of the size distribution.

Depending on the number of features that are beyond the 66th percentile, it is possible that a second and even a third grid size might be calculated.

The algorithm seems to do a reasonably good job, and is vastly superior to the situation that existed prior to 9.2.

I hope this helps in understanding how the index works.