
Slow performance reading FGDB

06-26-2011 11:05 PM
ahuarte
Deactivated User
Hello, I'm using this wonderful API to read FGDBs. I'm checking its performance and I am not getting very favorable results compared to the same data in shapefile format.

For a layer of ~500,000 records, the FGDB reader needs ~5sg, but the SHAPE reader needs less than half that.

The source code is equivalent in both cases, though it is true that I use file mapping in the SHAPE reader and load the DBF values only when needed. The FGDB reader is implemented as a pair of C# and C++ .NET DLLs.

In order to accelerate the FGDB API, I wonder whether file mapping is used internally by the API.
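
For reference, this is roughly what I mean by "file mapping" on the SHAPE side: a minimal sketch that scans the .shp record headers through a memory-mapped view (the path comes from the command line, and error handling is omitted):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class ShpMappedScan
{
    static void Main(string[] args)
    {
        string path = args[0]; // path to the .shp file
        long fileLen = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
        {
            long pos = 100; // the main file header is 100 bytes
            int records = 0;
            while (pos < fileLen)
            {
                // Record header: record number + content length, both
                // big-endian; the content length is in 16-bit words.
                int contentWords = FromBigEndian(view.ReadInt32(pos + 4));
                pos += 8 + contentWords * 2;
                records++;
            }
            Console.WriteLine("{0} records scanned", records);
        }
    }

    // ReadInt32 assumes little-endian, so swap the byte order.
    static int FromBigEndian(int v)
    {
        uint u = (uint)v;
        return (int)((u >> 24) | ((u & 0x00FF0000) >> 8) |
                     ((u & 0x0000FF00) << 8) | (u << 24));
    }
}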

Would it be possible to add some of these optimizations?

1. Access to the values in a row by field index, not only by field name.
2. The ability to keep the data of a row disconnected from its parent 'EnumRows', so that values are loaded only when needed instead of always reading every value in the row. Alternatively, access to the data as a byte stream recovered from the row, or something similar.
14 Replies
VinceAngelo
Esri Esteemed Contributor
What are you using to read the shape file?

What is "~5sg"?

Is the data point, line, or polygon?

How large is the file that corresponds to the table in the FGDB?  How large are the .shp and .dbf
files of the shapefile?

Are you saying that you cache 500K rows of geometry in RAM, and only access attributes when you need them? What are your access criteria?

- V
ahuarte
Deactivated User
Sorry for my bad English!

~5sg -> approximately 5 seconds

The data is a polygon feature class of about 500,000 records, approximately 500 MB as a shapefile (SHP+DBF+SHX).

I don't cache the data; I read it in the style of an ADO.NET IDataReader.

Speaking in the context of a map viewer: most of the time I only need to read the geometry. If a user sets a color theme based on an attribute, then it is necessary to read some of the attributes, but the "expensive" reading of the rest can be postponed. This would be possible if the row could be decoupled from your reader, or if I could get a byte stream of the row data to be parsed only as necessary.
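
To illustrate, my SHAPE reader keeps the raw DBF record bytes and parses a field only when the renderer asks for it. A simplified sketch (the offset and length tables are built once from the DBF header):

using System.Globalization;
using System.Text;

// One row of the .dbf, kept as raw bytes until a value is requested.
public class LazyDbfRow
{
    private readonly byte[] record;   // one raw DBF record from disk
    private readonly int[] offsets;   // byte offset of each field
    private readonly int[] lengths;   // byte length of each field

    public LazyDbfRow(byte[] record, int[] offsets, int[] lengths)
    {
        this.record = record;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    // dBase stores every field as fixed-width text padded with spaces,
    // so nothing is decoded until a field is actually read.
    public string GetString(int field)
    {
        return Encoding.ASCII.GetString(record, offsets[field], lengths[field]).Trim();
    }

    // Numeric fields are stored as text as well.
    public double GetDouble(int field)
    {
        return double.Parse(GetString(field), CultureInfo.InvariantCulture);
    }
}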
VinceAngelo
Esri Esteemed Contributor
Five seconds doesn't sound slow for a full table scan of 500 MB of data. Can you provide the code you're using in both cases?

There are a number of advantages of file geodatabases over shapefiles, among them:
+ Support for numeric nulls
+ Second resolution date type (vice day resolution of dBase)
+ Variable width attribute rows (vice dBase fixed-width)
+ Unlimited width string fields (vice 254 in dBase)
+ 64 character attribute field names (vice 11 in dBase)
+ Full support for all types supported by ArcGIS (NSTRING, UUID, CLOB, NCLOB)
+ SQL query access
+ Open access to spatial and attribute index queries

- V
ahuarte
Deactivated User
I agree, that time does not seem slow to me either :-), but if users are accustomed to waiting 2 seconds for a layer to render, after the format change that time doubles, and the advantages of the new format no longer matter to them.

I think the key is the "expensive" calls to FileGDBAPI::Row::GetXXXX() needed to recover the attribute values of a row: n fields -> n calls to these functions.

Most of the time the calls to these functions are unnecessary, but I have to make them in case any alphanumeric value of a geometry needs to be retrieved later. If it were possible to get a single "byte stream" of a row's attribute values, like your "ShapeBuffer" for the geometry, the time savings would grow with the number of fields (n fields -> one call in the best case, n+1 calls in the worst).

Thank you very much for your replies!
VinceAngelo
Esri Esteemed Contributor
You haven't provided your code, and I haven't done performance benchmarking on FGDBAPI,
but I'm willing to wager my son (with a 103.5 fever and cranky as all get-out) that the real
cost is in reading the row from disk, not in the accessor function to copy the row members
from memory.

I'm quite fond of the ArcSDE 'C' API, which uses row accessor functions *or* bind variables --
it's so much cleaner to organize a "copy" as:

 
while ((sr = SE_stream_fetch(stream1)) == SE_SUCCESS) {
    if ((sr = SE_stream_execute(stream2)) != SE_SUCCESS) {
        /* handle write error */
    }
}
if (sr != SE_FINISHED) {
    /* handle read error */
}


but the accessor design pattern is well-established (especially in Java, where bind variables
are considered poor form).

The only time that accessor functions are really inefficient is when transferring large BLOBs
(and CLOBs, NCLOBs,..., and geometries). The majority of your cost difference is in the fact
that you're *not* reading the .dbf component (and therefore doing half the I/O).

You can certainly file an enhancement request for an alternate method to access the row
buffer, but since ArcGIS doesn't even use bind variables with ArcSDE, it's probably going
to be a hard sell.

- V
ahuarte
Deactivated User
The most important thing first: I hope your son gets better soon!

Regarding the rest, I am not sure what to say. It is clear that the cost is in reading from disk. It is also clear that the best optimization of any code is to avoid unnecessary calls.

I do not know if I've managed to explain this particular case well.

I do not mean that the data should be read from disk twice, but that all the attributes of a row could be obtained in a single byte[] or something similar.

Note that right now I must make n C#/C++ calls to obtain the attribute values, even though when rendering the geometry on a map they are not required.

I think adding this functionality could be interesting: defer the interpretation of the attributes by reading them from disk as a single byte stream, to be interpreted by the client code only when strictly necessary.


FgdbDataRow row = ...

// #0) Geometry.
Element element = Element.FromShape( row.GetGeometry() );

// #1) Attributes.
element.pProperties = new Property[10];

// #1-A) -> Current Option: one C#/C++ transition per field.
for (int i = 1, icount = 10; i < icount; i++)
{
   Property property = element.pProperties[i];
   object value = null;

   switch (m_oFeatureClass.GetEsriPropertyType(i))
   {
      case EsriPropertyType.esriFieldTypeInteger:
      {
         int valueT = 0;
         if (row.GetInteger(property.Name, ref valueT)) value = valueT; //-> ### 10 calls C#/C++ to Row::GetInteger!
         break;
      }
      // ... other field types ...
   }
   property.Value = value;
}

// #1-B) -> Proposed Option: a single call for the whole attribute buffer.
byte[] pMstream = row.GetAttributesBuffer(); //-> One call!
element.BufferAttribs = pMstream;
...
for (int i = 1, icount = 10, offsetPos = 0; i < icount; i++, offsetPos += 4)
{
   // Only if the i-th attribute is needed (a 4-byte integer assumed here).
   int value = BitConverter.ToInt32(element.BufferAttribs, offsetPos);
   element.pProperties[i].Value = value;
}




Thank you very much!
VinceAngelo
Esri Esteemed Contributor
My son's doing fine now, thanks.

There is no reason to iterate the attribute list if you don't need the attributes.  I doubt it's
required even if you do select all the columns, but I know it's not required if you select
only the geometry column in the Search request that generates the EnumRows iterator.
(If nothing else, you should specify the geometry column first.)
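
In the C++ API the subfields list is the first argument to Table::Search, so the idea looks something like this (I'm borrowing the hypothetical FgdbTable/FgdbEnumRows/FgdbDataRow wrapper names from your snippet; your actual binding signatures will differ):

// Ask only for the geometry column: the returned rows carry no
// attributes, so no Row::GetXXX() calls are ever made.
FgdbTable table = m_oFeatureClass.Table;        // hypothetical wrapper
FgdbEnumRows rows = table.Search("SHAPE", "", true);

FgdbDataRow row;
while ((row = rows.Next()) != null)
{
    Element element = Element.FromShape(row.GetGeometry());
    // ... hand the element to the renderer; attributes were never fetched ...
}
rows.Close();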

File Geodatabase format uses a compression algorithm on the geometries, but if the
"a000*.gdbtable" file that corresponds to your table is larger than the .shp of the shapefile,
there is no chance for the FGDB to be faster in a full-table scan query (and even if it is
smaller, there's still the decompression cost to factor into the equation).

I'd recommend you modify the Search, then instrument your code with a millisecond-resolution timer, so you can see where the application is spending its time while reading the two different formats.
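
Something along these lines would do it (System.Diagnostics.Stopwatch; the two delegates stand in for whichever reader you're timing):

using System;
using System.Diagnostics;

static class ScanTimer
{
    // Splits one full scan into fetch time (disk I/O, decompression)
    // and decode time (accessor calls, attribute parsing).
    public static void TimeScan(Func<bool> fetchNext, Action decodeRow)
    {
        var swFetch = new Stopwatch();
        var swDecode = new Stopwatch();

        swFetch.Start();
        while (fetchNext())
        {
            swFetch.Stop();
            swDecode.Start();
            decodeRow();
            swDecode.Stop();
            swFetch.Start();
        }
        swFetch.Stop();

        Console.WriteLine("fetch : {0} ms", swFetch.ElapsedMilliseconds);
        Console.WriteLine("decode: {0} ms", swDecode.ElapsedMilliseconds);
    }
}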

- V
ahuarte
Deactivated User
I already know where the time cost is: the n Row::GetXXX() calls via C#/C++, which my SHAPE reader does not make a priori.

The problem is that when reading the geometry I do not know which attributes the application will need later, so I load them all in order to have everything necessary for that geometry.

When painting the map, the renderer may use some of the attributes of the geometries, but when reading the element I do not know a priori which ones.

In the SHAPE case I only get the byte[] of the DBF row, and when the renderer is painting and requires a value, then I parse the stream.

For heavy layers this behavior saves precious time (~6sec -> ~3.6sec for 500,000 records).

That is why I suggested a similar approach for FGDBs, if it were possible.

Thank you very much
VinceAngelo
Esri Esteemed Contributor
I can't think of a single application I've ever written that needed a full table scan for every query
(and didn't know exactly what attributes would be needed). FGDB would be much faster than
shapefile with a spatial subset, but because of your particular requirements, shapefile format
is going to be your best option.

- V