Geodatabase size increases dramatically when closing a table

01-20-2014 10:51 PM
UlrichEgger
New Contributor III
I have observed that file geodatabases written with the File Geodatabase API are relatively large. When I open such a feature class in ArcMap, add a field, and then delete that same field again, the size shrinks dramatically (to about 1/10th of the original).

I tried to debug the problem and found that at the end of writing data into the file geodatabase, the size is still fine.
The size grows during the call to FileGDBAPI::Geodatabase::CloseTable(Table& table) (see the code snippet below).

In our case this matters because we often deal with a large number of polygons, so it makes a big difference whether the result database is 1 GB or 10 GB.

int GeodatabaseResultWriter::Terminate()
{
    if (GdbWrapper::CheckResult(gdb->waterLevelTable.FreeWriteLock()) != 0)
        return 1;
    // size of geodatabase still small
    if (GdbWrapper::CheckResult(gdb->geodatabase.CloseTable(gdb->waterLevelTable)) != 0)
        return 1;
    // geodatabase has grown!
    if (GdbWrapper::CheckResult(CloseGeodatabase(gdb->geodatabase)) != 0)
        return 1;

    return 0;
}
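
GdbWrapper::CheckResult is our own helper and not part of the FileGDB API. A minimal equivalent, assuming it does nothing more than turn a non-zero fgdbError into a message and a return code, might look like this:

#include <iostream>
#include <string>
#include <FileGDBAPI.h>

// Hypothetical stand-in for GdbWrapper::CheckResult: returns 0 on success,
// otherwise prints the API's error description and returns 1.
int CheckResult(fgdbError hr)
{
    if (hr == S_OK)
        return 0;

    std::wstring errorText;
    FileGDBAPI::ErrorInfo::GetErrorDescription(hr, errorText);
    std::wcerr << L"FileGDB error " << hr << L": " << errorText << std::endl;
    return 1;
}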
15 Replies
VinceAngelo
Esri Esteemed Contributor
Keep in mind that the storage increase is due to fragmentation caused by updates.
The "compact" function rewrites the file, causing storage to initially grow by the
size of the compacted data before the fragmented file is deleted.

If your updates are so voluminous that a 2 GB table grows to 100 GB, you might want
to look at using memory-based techniques for doing the updates, and only flushing
the results when the changes have been completed.

- V
DavidSousa
New Contributor III
> I thought I would run it once before the table is closed (which, in my experience, is when the size of the geodatabase on disk grows).



Compact is an operation on the Geodatabase object, and it does not matter if the table is open or closed.

As I explained before, the table does not increase in size when you close it.  The file size increases because of inserts or updates.  But there can be a delay in updating the file system metadata to reflect the newly increased size; when the file is closed, that process is accelerated.
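
For reference, a sketch of what running a compact through the API before closing the geodatabase could look like. The method name CompactDatabase() is an assumption; the only thing established above is that compact is an operation on the Geodatabase object, so check the header of your API version for the exact call:

#include <FileGDBAPI.h>

// Sketch: compact the geodatabase, then close it. Expect disk usage to grow
// temporarily by the size of the compacted data before the fragmented
// storage is released, as described earlier in the thread.
int CompactAndClose(FileGDBAPI::Geodatabase& geodatabase)
{
    fgdbError hr;

    // Assumed method name; verify against your FileGDBAPI.h.
    if ((hr = geodatabase.CompactDatabase()) != S_OK)
        return 1;

    if ((hr = FileGDBAPI::CloseGeodatabase(geodatabase)) != S_OK)
        return 1;

    return 0;
}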
UlrichEgger
New Contributor III
> Keep in mind that the storage increase is due to fragmentation caused by updates.
> The "compact" function rewrites the file, causing storage to initially grow by the
> size of the compacted data before the fragmented file is deleted.
>
> If your updates are so voluminous that a 2 GB table grows to 100 GB, you might want
> to look at using memory-based techniques for doing the updates, and only flushing
> the results when the changes have been completed.
>
> - V


OK, but then I need to know how to flush. I would expect the table to have methods
to either flush results at the end of writing or, alternatively, reserve the needed space at
the beginning. I don't see any benefit in keeping the values in our own arrays until the
end of the simulation if we then do the same writes at a later stage anyway.

Currently, we create all columns at the beginning. Do you think it might be worth
trying to create the columns not during initialization but directly before writing to them?

So we would get a sequence like "create column, write values, create column, write values ..."
instead of "create all columns, write values, write values, ...."

And is there any method to write an array of values to a column as one block instead of
writing single values?
VinceAngelo
Esri Esteemed Contributor
> Currently, we create all columns at the beginning. Do you think it might be worth
> trying to create the columns not during initialization but directly before writing to them?

No.

> And is there any method to write an array of values to a column as one block instead of
> writing single values?

No.

I think the root problem here is that you're using the wrong API.  File geodatabase, like the SQL
databases it models, is a row-oriented framework -- row insert is dirt cheap, but column update
is ruinously expensive -- and you're trying to use it to fill a table in column-major order.  I think
a spreadsheet API might be more appropriate to your use case.  If you must use FGDBAPI, then
you're going to have to create one table per "column" (a join key and the column data), finish
your modelling, then open the N tables and join them on the fly to produce the final table.

- V
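
To make the one-table-per-"column" layout concrete, here is a rough sketch of writing one time step into its own table of (join key, value) pairs. It assumes the per-time-step tables already exist (for example created from a template XML definition), and all table and field names are placeholders rather than anything from the posts above:

#include <string>
#include <vector>
#include <FileGDBAPI.h>

// Sketch: write one time step's results into its own table, keyed by feature.
// Table and field names (FeatureKey, Value) are hypothetical; error handling
// is reduced to a single return code for brevity.
int WriteTimeStep(FileGDBAPI::Geodatabase& geodatabase,
                  const std::wstring& tablePath,         // e.g. L"\\waterLevel_t0"
                  const std::vector<double>& values)     // one value per feature
{
    using namespace FileGDBAPI;
    fgdbError hr;

    Table table;
    if ((hr = geodatabase.OpenTable(tablePath, table)) != S_OK)
        return 1;

    for (size_t i = 0; i < values.size(); ++i)
    {
        Row row;
        if ((hr = table.CreateRowObject(row)) != S_OK)
            return 1;

        row.SetInteger(L"FeatureKey", static_cast<int>(i));  // join key
        row.SetDouble(L"Value", values[i]);

        if ((hr = table.Insert(row)) != S_OK)                 // insert once, never update
            return 1;
    }

    return (geodatabase.CloseTable(table) == S_OK) ? 0 : 1;
}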
UlrichEgger
New Contributor III
> No.
>
> No.
>
> I think the root problem here is that you're using the wrong API.  File geodatabase, like the SQL
> databases it models, is a row-oriented framework -- row insert is dirt cheap, but column update
> is ruinously expensive -- and you're trying to use it to fill a table in column-major order.  I think
> a spreadsheet API might be more appropriate to your use case.  If you must use FGDBAPI, then
> you're going to have to create one table per "column" (a join key and the column data), finish
> your modelling, then open the N tables and join them on the fly to produce the final table.
>
> - V


I am afraid a spreadsheet won't meet our needs because we want to display a feature class. The values from the feature class's attribute table are to be shown in multiple layers (one for each time step), and we want to use ArcGIS symbology.

But I think we will look into the option with the separate tables. In that case, we would join
each time-step layer with the table that contains the information for that time step.
VinceAngelo
Esri Esteemed Contributor
If you used a spreadsheet to accumulate your data, then you could transfer the
completed table at little cost.  The same goes for memory.  But you will never
achieve efficient column-oriented updates of an FGDB.

- V
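
As an illustration of the memory-based route, a rough sketch under the assumption that the results fit in RAM: accumulate per-feature values in ordinary containers during the simulation and insert each row exactly once at the end. The field names (FeatureKey, WaterLevel_t0, ...) are placeholders that would have to exist in the table already, and geometry handling is omitted:

#include <map>
#include <sstream>
#include <vector>
#include <FileGDBAPI.h>

// Sketch: flush in-memory simulation results to an already-open table in a
// single pass. One insert per feature avoids the repeated updates that
// fragment the file.
int FlushResults(FileGDBAPI::Table& table,
                 const std::map<int, std::vector<double> >& resultsByFeature)
{
    using namespace FileGDBAPI;
    fgdbError hr;

    for (std::map<int, std::vector<double> >::const_iterator it = resultsByFeature.begin();
         it != resultsByFeature.end(); ++it)
    {
        Row row;
        if ((hr = table.CreateRowObject(row)) != S_OK)
            return 1;

        row.SetInteger(L"FeatureKey", it->first);

        // One field per time step, e.g. WaterLevel_t0, WaterLevel_t1, ...
        // (placeholder names; the fields must already be defined on the table).
        for (size_t t = 0; t < it->second.size(); ++t)
        {
            std::wostringstream field;
            field << L"WaterLevel_t" << t;
            row.SetDouble(field.str(), it->second[t]);
        }

        if ((hr = table.Insert(row)) != S_OK)
            return 1;
    }

    return 0;
}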