Slow performance reading FGDB

ahuarte · ‎06-26-2011

Hello, I'm using this wonderful API to read FGDBs. I'm checking the performance, and the results are not very favorable compared to the same data in shapefile format.

For a layer of ~500,000 records, the FGDB reader needs ~5 s, while the shapefile reader needs less than half that.

The source code is equivalent, although it is true that I use file mapping in the shapefile reader and load the DBF values only when needed. The FGDB reader is implemented as a pair of C# and C++ .NET DLLs.

In order to speed up the FGDB API, I wonder whether file mapping is used internally in the API.

Would it be possible to consider some of these optimizations?

1. Access to the values in a row by field index as well as by name.
2. The ability to keep a row's data disconnected from its parent 'EnumRows', so that data is loaded only when needed instead of reading all the row values every time. Or access to the data through a byte stream retrieved from the row, or something similar.
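
To make the second request concrete, here is a minimal sketch of a row that owns a copy of its raw bytes and decodes a field only when it is asked for. Everything in it is an assumption for illustration: `DetachedRow`, `makeRow`, and the fixed-width native-endian int32 layout are hypothetical, and none of it reflects the File Geodatabase API or its on-disk format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical detached row: owns a copy of the raw record bytes, so it can
// outlive the cursor that produced it. Not part of the File Geodatabase API.
class DetachedRow {
public:
    DetachedRow(std::vector<std::uint8_t> bytes, std::vector<std::size_t> offsets)
        : bytes_(std::move(bytes)), offsets_(std::move(offsets)) {}

    std::size_t fieldCount() const { return offsets_.size(); }

    // Decode one field by index, only when the caller asks for it.
    // Assumes (for illustration) that every field is a native-endian int32.
    std::int32_t getInt32(std::size_t fieldIndex) const {
        assert(fieldIndex < offsets_.size());
        std::int32_t value = 0;
        std::memcpy(&value, bytes_.data() + offsets_[fieldIndex], sizeof value);
        ++decodeCount_;
        return value;
    }

    // Instrumentation: how many fields have actually been decoded.
    std::size_t decodeCount() const { return decodeCount_; }

private:
    std::vector<std::uint8_t> bytes_;
    std::vector<std::size_t> offsets_;
    mutable std::size_t decodeCount_ = 0;
};

// Example helper: pack int32 values back to back into a raw record.
DetachedRow makeRow(const std::vector<std::int32_t>& values) {
    std::vector<std::uint8_t> bytes(values.size() * sizeof(std::int32_t));
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < values.size(); ++i) {
        offsets.push_back(i * sizeof(std::int32_t));
        std::memcpy(bytes.data() + offsets.back(), &values[i], sizeof(std::int32_t));
    }
    return DetachedRow(std::move(bytes), std::move(offsets));
}
```

A viewer that only draws geometries would never call `getInt32`, so `decodeCount()` would stay at zero for every row it touches.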

ahuarte · ‎07-01-2011

I've detailed an example.

Or a highly configurable fast map viewer, or a GIS format converter .... where the application does not know the input format of the geometries, and the format reader does not know what use will be made of them.

To keep it short, all I ask, if it is feasible, is a method on the Row that returns a byte[] of the attributes, similar to what already exists for the geometry.

Thank you very much

VinceAngelo · ‎07-01-2011

The problem with your #1-B example is that it ignores the possibility of NULLs. If you want
to submit an ER at ideas.esri.com, you'll need to add a bitmap of length (ncols+7)/8 bytes
to the front of the bytestream. I still don't think this will result in significant time savings.
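
A minimal sketch of such a NULL bitmap, assuming one flag per column packed into (ncols+7)/8 bytes. The function names and the bit order within each byte (column i in bit i%8 of byte i/8) are assumptions for illustration, not a documented format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to hold one null flag per column: (ncols + 7) / 8.
std::size_t nullBitmapBytes(std::size_t ncols) { return (ncols + 7) / 8; }

// Build the bitmap that would sit at the front of the attribute byte stream.
// A set bit means the column is NULL (bit order is an assumption).
std::vector<std::uint8_t> buildNullBitmap(const std::vector<bool>& isNull) {
    std::vector<std::uint8_t> bitmap(nullBitmapBytes(isNull.size()), 0);
    for (std::size_t i = 0; i < isNull.size(); ++i) {
        if (isNull[i]) {
            bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        }
    }
    return bitmap;
}

// Test a single column's flag without touching the attribute bytes.
bool columnIsNull(const std::vector<std::uint8_t>& bitmap, std::size_t col) {
    assert(col / 8 < bitmap.size());
    return ((bitmap[col / 8] >> (col % 8)) & 1u) != 0;
}
```

For a 100-column table the prefix is 13 bytes per row, which a reader has to consume before it can interpret any attribute value.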

- V

ahuarte · ‎07-03-2011

ok, thanks

I will try in ideas.esri.com

VinceAngelo · ‎08-08-2011

Just to close up this thread, I did some experimentation with some custom
tools to measure access performance of a random-generated dataset...

First I generated a 100k row shapefile with roughly the same size .shp as
.dbf (50.8Mb and 49.1Mb, respectively). Then I used ArcGIS 10 to populate
it in a file geodatabase (66.7Mb).

Next I used a 'C' app to time how long it took to read the files (pure I/O,
without any semantic parsing). On my reference machine (a 4-CPU/4Gb RAM
laptop running 64-bit Windows 7) [and after an initial read pass to cache
the I/O], it took 220, 188, and 265 milliseconds to access the .shp, .dbf,
and a0000000a.gdbtable files (respectively). Each time represents ~1 micro-
second per 512-byte block (kicking the blocksize up to 256k with setvbuf
dropped all access times to under 100ms).
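
A rough sketch of that kind of raw-read measurement, written here in C++ with the C stdio calls. It is not the tool used above; `readWholeFile` and `timeReadMs` are hypothetical names, and the 512-byte request size simply mirrors the block size mentioned in the text.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Read a file front to back in 512-byte requests and return the byte count.
// If streamBufBytes > 0, the stdio buffer is enlarged with setvbuf first
// (for example 256 * 1024 for a 256k buffer). Returns -1 if open fails.
long long readWholeFile(const char* path, std::size_t streamBufBytes) {
    assert(path != nullptr);
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return -1;

    std::vector<char> streamBuf(streamBufBytes);
    if (streamBufBytes > 0) {
        // setvbuf must be called before any other operation on the stream.
        std::setvbuf(fp, streamBuf.data(), _IOFBF, streamBuf.size());
    }

    char block[512];
    long long total = 0;
    std::size_t got = 0;
    while ((got = std::fread(block, 1, sizeof block, fp)) > 0) {
        total += static_cast<long long>(got);
    }
    std::fclose(fp);  // close while streamBuf is still alive
    return total;
}

// Wall-clock time of one full read pass, in milliseconds.
double timeReadMs(const char* path, std::size_t streamBufBytes) {
    const auto start = std::chrono::steady_clock::now();
    readWholeFile(path, streamBufBytes);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

As in the description above, a first untimed pass is needed if the goal is to measure cached I/O rather than disk latency.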

I then wrote a custom app with the File Geodatabase API, and applied the
millisecond timer to break down the processing times on a FGDB query into
three components -- Open (OpenGeodatabase + Geodatabase.OpenTable +
table.GetFieldInformation + various FieldInfo calls), Read (table.Search +
EnumRows.Next), and Get (row.Get*), plus a total:

FGDBa 0 1229 1673 2902
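
One way to get an Open / Read / Get breakdown like this is to wrap each call in a small accumulating timer. The sketch below is an assumption about how such instrumentation could look, not the tool that produced these numbers; `PhaseTimer` is a hypothetical name, and the File Geodatabase API calls themselves are left to the caller's lambdas.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Accumulates elapsed time for three phases and reports milliseconds.
// Time is summed in nanoseconds so that many sub-millisecond calls
// (one per row or per field) are not rounded away individually.
struct PhaseTimer {
    enum Phase { Open = 0, Read = 1, Get = 2 };
    std::int64_t ns[3] = {0, 0, 0};

    // Run `work` and charge its duration to the given phase.
    template <typename Fn>
    void measure(Phase phase, Fn&& work) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        ns[phase] += std::chrono::duration_cast<std::chrono::nanoseconds>(
                         stop - start).count();
    }

    std::int64_t millis(Phase phase) const { return ns[phase] / 1000000; }
    std::int64_t totalMs() const { return (ns[0] + ns[1] + ns[2]) / 1000000; }
};
```

In use, the cursor fetch would go inside `measure(PhaseTimer::Read, ...)` and each attribute getter inside `measure(PhaseTimer::Get, ...)`, with the four values printed once at the end.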

Parsing just the geometry data (without a change to the selection list)
was much faster:

FGDBg 0 1170 0 1170

As was limiting the selection list to just "SHAPE" (though, curiously, not
as fast as a "*" column list and just calling GetGeometry):

FGDBs 16 1122 48 1186

I then instrumented a shapefile reader with the same millisecond timer and
compared the access performance using five different access methodologies:
A) Bind query on all columns
B) Bind query on *only* the .shp component (simulating an empty .dbf)
C) Using getter by column number on all columns
D) Using getter by column name on all columns
E) Using a custom getStream function

Timing is in milliseconds, with values Open (includes column describe),
Read, Get, and Total --

10 cols:
         Open   Read    Get  Total
shpA        0   2387      0   2387
shpB        0   1622      0   1622
shpC        0   2372    280   2652
shpD       16   2480    484   2980
shpE        0   2449    125   2574

To determine the impact of column count I reorganized the dBase attributes
into more columns of shorter length (to preserve the overall size), and then
re-ran the measurements with 20, 50 and 100 columns using the FGDB and
shape methodologies above:

20 cols:
         Open   Read    Get  Total
FGDBa       0   1240   4096   5336
FGDBg      16   1310     94   1420
FGDBs      16   1404      0   1420
shpA        0   2480      0   2480
shpB        0   1638      0   1638
shpC        0   2434    405   2839
shpD        0   2355   1202   3557
shpE        0   2464    188   2652

50 cols:
         Open   Read    Get  Total
FGDBa      15   2200  17628  19843
FGDBg       0   2090     31   2121
FGDBs      16   2105     16   2137
shpA        0   2574      0   2574
shpB        0   1638      0   1638
shpC        0   2399    736   3135
shpD        0   2577   6003   8580
shpE        0   2620    219   2839

100 cols:
         Open   Read    Get  Total
FGDBa       0   3819  61669  65488
FGDBg       0   3370     46   3416
FGDBs      16   3417     16   3494
shpA        0   2652      0   2652
shpB        0   1638      0   1638
shpC        0   2840   1014   3854
shpD        0   2844  24846  27690
shpE        0   2620    485   3105

Since the time progression on both FGDBa and shpD suggests a "big O
N-squared" algorithm issue (doubling the columns increases duration by
roughly four times), I re-wrote my shapefile findColByName function
to use a circular search algorithm (each search starts with the column
after the last found position), and then re-tested (methodology F):

shpF        Open   Read    Get  Total
10 cols        0   2299    447   2746
20 cols        0   2465    468   2933
50 cols        0   2780    808   3588
100 cols       0   2574   2028   4602
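
The circular search described above can be sketched as follows. This is an illustration under assumptions, not the actual findColByName code: `ColumnIndex` and `makeColumns` are hypothetical, and the comparison counter exists only to show the difference in work. If a name-based getter scans the column list from the start on every call, reading n columns in table order costs n(n+1)/2 string compares (5050 for 100 columns); starting each search just after the previous hit brings that down to n.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical column lookup table, for illustration only.
struct ColumnIndex {
    std::vector<std::string> names;
    std::size_t next = 0;         // where the next circular search starts
    std::size_t comparisons = 0;  // instrumentation: string compares done

    // Plain linear search: always starts at column 0.
    int findLinear(const std::string& name) {
        for (std::size_t i = 0; i < names.size(); ++i) {
            ++comparisons;
            if (names[i] == name) return static_cast<int>(i);
        }
        return -1;
    }

    // Circular search: starts at the column after the last hit and wraps.
    int findCircular(const std::string& name) {
        const std::size_t n = names.size();
        for (std::size_t step = 0; step < n; ++step) {
            const std::size_t i = (next + step) % n;
            ++comparisons;
            if (names[i] == name) {
                next = (i + 1) % n;
                return static_cast<int>(i);
            }
        }
        return -1;
    }
};

// Example helper: build columns named "c0", "c1", ...
ColumnIndex makeColumns(std::size_t n) {
    ColumnIndex idx;
    for (std::size_t i = 0; i < n; ++i) {
        idx.names.push_back("c" + std::to_string(i));
    }
    return idx;
}
```

The benefit depends on access order: columns requested in table order hit on the first compare, while a random order still averages about n/2 compares per lookup.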

In conclusion, it appears that the getter calls on attributes with the
FILEGDB API do have a significant performance cost (this is especially
true when there are a large number of attributes in the table). Fortunately
for your case, you have the option of recoding to delay the use of 'Get'
attribute accessors until they are actually necessary.

I've recommended that a circular search algorithm be used in the getter
and setter functions, and that consideration be given to overloading the
Row accessors using an integer index (vice name) which should give uniform
access cost without regard to the attribute retrieval order.
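
A minimal sketch of what an index-based overload could look like, assuming a row that stores its values in column order. `SketchRow`, `fieldIndex`, and both `getString` overloads are hypothetical and are not the File Geodatabase API's Row class; the point is only that the name is resolved once, outside the row loop, and every later access is a direct array lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row with both accessor styles, for illustration only.
class SketchRow {
public:
    SketchRow(std::vector<std::string> names, std::vector<std::string> values)
        : names_(std::move(names)), values_(std::move(values)) {
        assert(names_.size() == values_.size());
    }

    // Resolve a column name to its position; meant to be called once.
    // Returns -1 if the column does not exist.
    int fieldIndex(const std::string& name) const {
        for (std::size_t i = 0; i < names_.size(); ++i) {
            if (names_[i] == name) return static_cast<int>(i);
        }
        return -1;
    }

    // Index-based getter: same cost for every column, in any order.
    const std::string& getString(std::size_t index) const {
        return values_.at(index);
    }

    // Name-based getter, layered on top of the index overload.
    const std::string& getString(const std::string& name) const {
        const int index = fieldIndex(name);
        if (index < 0) throw std::out_of_range("no such column: " + name);
        return getString(static_cast<std::size_t>(index));
    }

private:
    std::vector<std::string> names_;
    std::vector<std::string> values_;
};
```

A caller that needs a handful of attributes per row would look up their indices right after opening the table and pass those integers inside the loop.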

- V

ahuarte · ‎08-11-2011

Thank you very much for your reply!

The testing you have done is impressive. Certain details are only noticed when managing large volumes of data.

I look forward to seeing whether faster access to the attributes gets implemented 🙂