Poor performance with FileGDB and AddField_management

11-28-2011 08:27 AM
KyleShannon
New Contributor III
I am getting poor performance adding fields to a feature class in a FileGDB.  This is an ArcGIS Server geoprocessing service, and for each call AddField takes between 1.3 and 1.8 seconds.  With 25 fields to add (I don't know what they are ahead of time) it takes far too long to run for a service.  If I use in_memory or shapefiles, it takes about 1/10 the time.  Is there a way to add multiple fields at once to avoid the overhead (I assume) of opening and closing the FileGDB?  Shapefiles and in_memory features don't support all the functionality I need.  Below is a quick example straight from my arcpy console (the exec calls are for more accurate times):

>>> fc = 'C:\\Users\\kshannon\\Documents\\ArcGIS\\Default.gdb\\output'
>>> shp = 'C:\\Users\\kshannon\\Documents\\ArcGIS\\output.shp'
>>> gdb_code = 'start = time.clock()\narcpy.AddField_management(fc,new_field,"DOUBLE")\nprint("AddField:{0}".format(time.clock()-start))'
>>> shp_code = 'start = time.clock()\narcpy.AddField_management(shp,new_field,"DOUBLE")\nprint("AddField:{0}".format(time.clock()-start))'
>>> new_field = "NEW_FIELD"
>>> exec(shp_code)
AddField:0.683299860168

>>> exec(gdb_code)
AddField:1.4100630674


Neither feature class is loaded in ArcMap.  The time gap grows quite a bit with 25-30 new fields.  Any suggestions?
12 Replies
AndrewBrown
New Contributor III
Hi Kyle, what functionality do you need to keep?
ChrisSnyder
Regular Contributor III
Could it be that there is something different/special about Default.gdb?

If you copy your table or feature class to a brand new FGDB, does it still take a long time?

What if you run a Compact on Default.gdb?

It is probably a bad practice (performance-wise) to use Default.gdb, as there is a lot of opportunity for it to get quite fragmented.
KyleShannon
New Contributor III
I am using a 'scratch' geodatabase, as described in this example:

http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/Guide_to_the_geoprocessing_service_exa...

So I am using a brand-new geodatabase each time.  The basic process is this (a rough sketch follows the list):

1) Client submits a feature set of points that may or may not contain attributes, but will likely have an ID.
2) I create a new feature class and copy the attributes from the input feature set into the new feature class.
3) I call a secondary service and get more attributes for the point features.  I add fields to the new feature class based on the result of the call to the other service. 
4) I use an update cursor to fill in the new attributes.
5) I return the feature class to the client.
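
Roughly, steps 2-4 look like the sketch below. The names here (the c:/jobs paths and the ELEV/SLOPE fields) are made up for illustration; the real fields only come back from the secondary service at run time.

import arcpy

# Hypothetical per-job scratch workspace; the real one is generated by the service
arcpy.CreateFileGDB_management("c:/jobs", "scratch.gdb")
fc = arcpy.CreateFeatureclass_management("c:/jobs/scratch.gdb", "points", "POINT")

# Step 3: add the fields discovered from the secondary service
new_fields = [("ELEV", "DOUBLE"), ("SLOPE", "DOUBLE")]  # placeholders
for name, ftype in new_fields:
    arcpy.AddField_management(fc, name, ftype)  # the 1.3-1.8 sec per-call cost

# Step 4: fill in the new attributes with an update cursor (10.0-era API)
rows = arcpy.UpdateCursor(fc)
for row in rows:
    row.setValue("ELEV", 0.0)  # placeholder; real values come from the service
    rows.updateRow(row)
del rows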

I imagine that every time I call AddField on the gdb, it acquires a lock (or mutex, or whatever they call it for a file gdb), adds the field, and releases the lock.  If that's the case, exposing an AddFields(list()) would probably improve performance quite a bit.  But I really have no idea.  Thanks for the replies.

k
ChrisSnyder
Regular Contributor III
So my point is that the particular FGDB that you are using (C:\Users\kshannon\Documents\ArcGIS\Default.gdb) is a new and evil ESRI invention in v10.0. By "default" all the geoprocessing tools write their results to this specific FGDB. My point is that there is a large potential for this specific FGDB and the data within it to become quite fragmented over time - the more fragmented it gets, the slower the performance becomes (like adding fields to a table stored there). If performance is important, I would recommend not using the Default.gdb.

When you say "I am using a brand new geodatabase each time", is it that you are deleting and recreating the Default.gdb every time (which would certainly take a while)?

What functionality offered in a FGDB precludes using an in_memory feature class?

An interesting feature of an in_memory table is that field adds and deletes are virtually instantaneous, which contrasts with a FGDB. Depending on the number of records, field deletes in a FGDB can take a very long time. Field adds, however, in a "fresh" FGDB should take less than a second even when the table has a lot of records.
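
For instance, a quick sketch of that comparison (paths hypothetical, and the tables here are empty, so a FGDB table with many records would show a much bigger delete cost):

import time
import arcpy

# Hypothetical test data: one feature class on disk, one in memory
arcpy.CreateFileGDB_management("c:/data", "temp.gdb")
gdb_fc = arcpy.CreateFeatureclass_management("c:/data/temp.gdb", "t", "POINT")
mem_fc = arcpy.CreateFeatureclass_management("in_memory", "t", "POINT")

for fc in (gdb_fc, mem_fc):
    arcpy.AddField_management(fc, "DROP_ME", "DOUBLE")
    start = time.clock()
    arcpy.DeleteField_management(fc, "DROP_ME")  # a FGDB rewrites the table
    print("{0}: DeleteField took {1} sec".format(fc, time.clock() - start))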
KyleShannon
New Contributor III
So my point is that the particular FGDB that you are using (C:\Users\kshannon\Documents\ArcGIS\Default.gdb) is a new and evil ESRI invention in v10.0. By "default" all the geoprocessing tools write their results to this specific FGDB. My point is that there is a large potential for this specific FGDB and the data within it to become quite fragmented over time - the more fragmented it gets, the slower the performance becomes (like adding fields to a table stored there). If performance is important, I would recommend not using the Default.gdb.

When you say "I am using a brand new geodatabase each time", is it that you are deleting and recreating the Default.gdb every time (which would certainly take a while)?


I don't use Default.gdb; that was only for the example.  I have a scratch.gdb that the service generates.  I timed the creation/copy of the feature set into scratch.gdb, and the time was small compared to a single call to AddField().

What functionality offered in a FGDB precludes using an in_memory feature class?


Nothing yet.  I have found workarounds for most cases.  In one service we have to support field aliases; both file GDBs and in_memory support this, but shapefiles do not.  Another service has to reproject the points to EPSG:4326 (in some cases), and file GDBs and shapefiles can be the output of Project_management(), but in_memory can't, according to the docs.  So if a case comes up where I have to reproject the input feature set and use aliases, a file GDB is the only answer.
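
For that combined case, one possible shape of the workaround (a sketch only, with hypothetical paths; the SpatialReference-by-WKID constructor may not be available in every release):

import arcpy

# Hypothetical inputs
in_fc = "c:/jobs/scratch.gdb/input_points"
proj_fc = "c:/jobs/scratch.gdb/points_4326"

# Project must write to disk (FGDB or shapefile), not to in_memory
sr = arcpy.SpatialReference(4326)  # WGS 84
arcpy.Project_management(in_fc, proj_fc, sr)

# Then do the alias-aware field work on an in_memory copy
mem_fc = arcpy.CopyFeatures_management(proj_fc, "in_memory/points")
arcpy.AddField_management(mem_fc, "ELEV", "DOUBLE", field_alias="Elevation (m)")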

An interesting feature of an in_memory table is that field adds and deletes are virtually instantaneous, which contrasts with a FGDB. Depending on the number of records, field deletes in a FGDB can take a very long time. Field adds, however, in a "fresh" FGDB should take less than a second even when the table has a lot of records.


It doesn't.  I created an empty FGDB and added a single point feature class with no fields.  I also created a shapefile the same way.  I tested it with this script:

import time

import arcview  # sets the ArcView license level before importing arcpy
import arcpy

# The FGDB and shapefile feature classes were created ahead of time (see
# above); only the in_memory feature class is created here.
gdb = "c:/data/temp.gdb/test_gdb"
shp = "c:/data/test_shp.shp"
mem = "in_memory/test"

arcpy.CreateFeatureclass_management("in_memory", "test", "POINT")

# Add 25 DOUBLE fields to each workspace type and time the loop
start = time.clock()
for i in range(25):
    field = "FIELD_{0}".format(i)
    arcpy.AddField_management(gdb, field, "DOUBLE")
print("GDB time: {0}".format(time.clock() - start))

start = time.clock()
for i in range(25):
    field = "FIELD_{0}".format(i)
    arcpy.AddField_management(shp, field, "DOUBLE")
print("SHP time: {0}".format(time.clock() - start))

start = time.clock()
for i in range(25):
    field = "FIELD_{0}".format(i)
    arcpy.AddField_management(mem, field, "DOUBLE")
print("MEM time: {0}".format(time.clock() - start))


Output:

C:\Users\k\Desktop>c:\Python26\ArcGIS10.0\python.exe add_field_test.py
GDB time: 14.9272383107
SHP time: 0.505886972475
MEM time: 0.53527747266


That is a pretty big time difference.  This was run out of the arcmap environment, so no drawing or any other overhead I can think of.  The only thing I can think of is the lock issue.  Is there a way to open a gdb and keep it open and act on it?  Like I said, for now I don't need gdbs, but it seems that performance could be improved.  I may need it in the future.  Am I doing anything glaringly wrong?
ChrisSnyder
Regular Contributor III
This was run out of the arcmap environment


That might be the issue... What if you run it via PythonWin (or some other IDE)?

I remember a while back someone had determined that the ArcMap Python window interface (for whatever reason) performed quite a bit slower than other IDEs when running a search cursor (I think that was what it was). I can't seem to find that post... Anyway.

I run my stuff almost exclusively in PythonWin, and AddField for a FGDB has pretty good performance.
KyleShannon
New Contributor III
This was run out of the arcmap environment

Worded that wrong - "out" meaning not in ArcMap.  I just ran it from the command line:

C:\Users\k\Desktop>c:\Python26\ArcGIS10.0\python.exe add_field_test.py
GDB time: 14.9272383107
SHP time: 0.505886972475
MEM time: 0.53527747266


I ran another test, creating the feature classes for each type, and performance is still terrible:

from osgeo import ogr  # GDAL/OGR Python bindings
import time

import arcview  # sets the ArcView license level before importing arcpy
import arcpy

# temp.gdb must already exist; the feature classes are created below
gdb = "c:/data/temp.gdb/test_gdb"
shp = "c:/data/test_shp.shp"
mem = "in_memory/test"
ogr_shp = "c:/data/test_ogr.shp"

print("Creating feature classes...")
arcpy.CreateFeatureclass_management("c:/data/temp.gdb","test_gdb","POINT")
arcpy.CreateFeatureclass_management("c:/data","test_shp.shp","POINT")
arcpy.CreateFeatureclass_management("in_memory","test","POINT")
arcpy.CreateFeatureclass_management("c:/data","test_ogr.shp","POINT")
print("Feature classes created.")

n = 25

start = time.clock()
for i in range(n):
    field = "FIELD_{0}".format(i)
    arcpy.AddField_management(gdb,field,"DOUBLE")
t = time.clock() - start
gt = t
print("GDB time: {0}({1} sec per call)".format(t,t / n))

start = time.clock()
for i in range(n):
    field = "FIELD_{0}".format(i)
    arcpy.AddField_management(shp,field,"DOUBLE")
t = time.clock() - start
st = t
print("SHP time: {0}({1} sec per call)".format(t,t / n))

start = time.clock()
for i in range(n):
    field = "FIELD_{0}".format(i)
    arcpy.AddField_management(mem,field,"DOUBLE")
t = time.clock() - start
print("MEM time: {0}({1} sec per call)".format(t,t / n))

start = time.clock()
ds = ogr.Open(ogr_shp, 1)  # 1 = open for update
lyr = ds.GetLayer()
for i in range(n):
    field = "FIELD_{0}".format(i)
    field_defn = ogr.FieldDefn(field, ogr.OFTReal)
    lyr.CreateField(field_defn)
ds.Destroy()  # flush and close the datasource
t = time.clock() - start
print("OGR time: {0}({1} sec per call)".format(t,t / n))

print("Deleting feature classes...")
arcpy.Delete_management(gdb)
arcpy.Delete_management(shp)
arcpy.Delete_management(mem)
arcpy.Delete_management(ogr_shp)
print("Feature classes deleted.")

print("AddField is on shp is {0} times faster than on gdb".format(gt/st))


OGR is a spatial data access library with Python bindings (the vector side of GDAL).  The results are the same:

C:\Users\k\Desktop>c:\Python26\ArcGIS10.0\python.exe add_field_test.py
Creating feature classes...
Feature classes created.
GDB time: 15.0151810344(0.600607241376 sec per call)
SHP time: 0.521969915836(0.0208787966335 sec per call)
MEM time: 0.550496636539(0.0220198654616 sec per call)
OGR time: 0.657494203585(0.0262997681434 sec per call)
Deleting feature classes...
Feature classes deleted.
AddField on shp is 28.7663725032 times faster than on gdb


Are you getting better performance?  Can anyone post times for a call to AddField on a simple gdb?
ChrisSnyder
Regular Contributor III
Okay - you're right, FGDBs are slow...

I don't have OGR installed... So no OGR times...

Creating feature classes...
Feature classes created.
GDB time: 23.3795888001(0.935183552003 sec per call)
SHP time: 0.984445672708(0.0393778269083 sec per call)
MEM time: 0.992119302926(0.039684772117 sec per call)
Deleting feature classes...
Feature classes deleted.
AddField on shp is 23.7489883375 times faster than on gdb

Go shapefiles!!!

How about doing all the field adding and calcing using in_memory, and then copying to the FGDB? I use this method quite a bit for the added speed in calcing and dropping fields. I never paid that much attention to how long it took to add a field in a FGDB, but yep, it is pretty slow relative to the competition.
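
A sketch of that pattern (paths hypothetical):

import arcpy

# Hypothetical output location
mem_fc = "in_memory/work"
out_fc = "c:/data/temp.gdb/final"

arcpy.CreateFeatureclass_management("in_memory", "work", "POINT")

# Field adds are nearly free in_memory...
for i in range(25):
    arcpy.AddField_management(mem_fc, "FIELD_{0}".format(i), "DOUBLE")

# ...populate the fields with an update cursor here...

# ...then pay the FGDB write cost once instead of 25 schema changes
arcpy.CopyFeatures_management(mem_fc, out_fc)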
ChrisSnyder
Regular Contributor III
Maybe you could pre-build a FGDB FC (maybe several different schemas, depending on the data needs), and the script could then just copy and populate the template(s)?
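
Something like this, with hypothetical template and target paths:

import arcpy

# A template FC with the full schema pre-built, copied once per job
template_fc = "c:/data/templates.gdb/points_schema_a"
job_fc = "c:/jobs/scratch.gdb/points"

# Copying a ready-made schema skips the per-field AddField cost entirely
arcpy.Copy_management(template_fc, job_fc)
# ...then insert the job's features and calc the values...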