
Why is arcpy.append_management so slow?

Question asked by agelfert on Jan 24, 2019
Latest reply on Jan 24, 2019 by agelfert

So I was working on a little project merging a large number of datasets into one. Irritatingly, each partial dataset resided in a feature class of the same name, scattered across several hundred file geodatabases that also shared a name except for a publication date appended to the filename, and each was stuffed into its own ZIP archive.

My process:

  • Iterate over a list of archives
  • Unzip each GDB, logging the occasional bad ZIP file
  • With all GDBs in the same directory, iterate over them
  • Access the feature class in each GDB
  • Copy all features into an in-memory "temp" feature class
  • Add a date field and populate it with the publication date from the GDB name
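The unzip-and-log part of the steps above can be sketched without arcpy. This is a minimal illustration using only the standard library; the function name `extract_archives` and the paths are my own, not from the original script:

```python
import logging
import zipfile
from pathlib import Path

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def extract_archives(archive_paths, out_dir):
    """Unzip each archive into out_dir, logging and skipping bad ZIP files.

    Returns two lists: paths that extracted cleanly and paths that were bad.
    """
    extracted, bad = [], []
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in archive_paths:
        try:
            with zipfile.ZipFile(path) as zf:
                zf.extractall(out_dir)
            extracted.append(path)
        except zipfile.BadZipFile:
            logger.warning("Bad ZIP file, skipping: %s", path)
            bad.append(path)
    return extracted, bad
```

With all the good GDBs landing in one `out_dir`, the rest of the loop can simply iterate over that directory.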

So much for the background. But which arcpy method should I use to write all these partial datasets to an SDE feature class? All that was needed was to append all features from source to target with a matching list of fields (identical schemas). So I tried:

arcpy.Append_management(['in_memory\\temp'], target_fc, "TEST")

This seemed to work fine for a smaller test dataset. With approx. 1,200 features per feature class, the process took about 6 seconds per feature class. That's slower than 1,200 inserts in SQL, but I could live with it. So I added a logger from the Python standard library's logging module to capture processing times and pointed my script at the directory with 800+ GDBs, an 80 GB directory with all the database overhead.

 

And it ran... and ran... and ran...

 

The 1st run:
Start 01/22/2019 04:14:14 PM
Finish 01/22/2019 11:41:36 PM
Total: 7h:27m:22s

 

I was not prepared for that kind of lengthy ordeal and started researching. I stumbled across this thread (Loop processing gradually slow down with arcpy script), and the mention of arcpy.SetLogHistory caught my attention. I had no idea that arcpy logs to an Esri history file even when you're running code outside of any application.

 

So I set the flag to False, and ran my script again.

arcpy.SetLogHistory(False)

The 2nd run:
Start 01/23/2019 05:03:23 PM
Finish 01/24/2019 12:49:21 AM
Total: 7h:45m:58s

Even worse!

 

Here is my code:

from datetime import datetime

arcpy.CopyFeatures_management(fc_source, 'in_memory\\temp')
arcpy.AddField_management('in_memory\\temp', "date_reported", 'DATE')

# this is the date portion of the GDB name, e.g. ..._20190122.gdb
datestring = gdb[-12:-4]
updatedate = datetime.strptime(datestring, '%Y%m%d').strftime('%m/%d/%Y')

fields = "date_reported"

# stamp every copied feature with the publication date
with arcpy.da.UpdateCursor('in_memory\\temp', fields) as cursor:
   for row in cursor:
      row[0] = updatedate
      cursor.updateRow(row)

# append only after the update cursor has been released
try:
   arcpy.Append_management(['in_memory\\temp'], SDE_target_fc, "TEST")
   logger.info("Successfully wrote data for {0} to SDE".format(datestring))
except Exception:
   logger.exception("Unable to write data to SDE...")
arcpy.Delete_management("in_memory\\temp")
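The filename-to-date step in the code above can be exercised without arcpy. A minimal sketch, assuming GDB names end in an 8-digit date followed by `.gdb` (the function name and the example filename are mine):

```python
from datetime import datetime

def publication_date(gdb_name):
    """Extract the YYYYMMDD suffix from a GDB filename such as
    'roads_20190122.gdb' and return it reformatted as MM/DD/YYYY."""
    datestring = gdb_name[-12:-4]   # 8-digit date just before '.gdb'
    return datetime.strptime(datestring, '%Y%m%d').strftime('%m/%d/%Y')
```

For example, `publication_date('roads_20190122.gdb')` returns `'01/22/2019'`; the `strptime` call also rejects filenames whose suffix is not a valid date.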

This was driving me crazy, so I pulled the timestamps from my log and, using matplotlib/pandas, plotted the diffs for each feature class for both runs. X is a count of the iterations over feature classes; Y is the time in seconds each INSERT (i.e. the portion of the code doing the append) took.
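Turning the logged timestamps into per-iteration diffs is a few lines of standard-library Python. A sketch, assuming the timestamps were pulled from the log as strings in the same format the run summaries use (the function name is mine):

```python
from datetime import datetime

def append_durations(timestamps, fmt='%m/%d/%Y %I:%M:%S %p'):
    """Given the per-feature-class timestamps pulled from the log,
    return the elapsed seconds between consecutive entries."""
    parsed = [datetime.strptime(ts, fmt) for ts in timestamps]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]
```

The resulting list can be plotted directly, e.g. with `pandas.Series(durations).plot()`, to get the iteration-count-versus-seconds chart described here.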

Two questions that come to mind:

1) Why does each INSERT take longer than the one before?

2) What the heck happens when I get to feature class #500 or so where that steady slowing trend goes berserk?

 

Just for the sake of completeness, I also ran the script with the "NO_TEST" option - with the same result.

arcpy.Append_management(['in_memory\\temp'], SDE_target_fc, "NO_TEST")

 

Since arcpy.Append_management clearly wasn't performing, I followed the advice in this thread (Using arcpy.da.InsertCursor to insert entire row that is fetched from search cursor?) and replaced the Append call with:

# pull the field names from the target so source and target columns line up
dsc = arcpy.Describe(target_fc)
fields = dsc.fields
fieldnames = [field.name for field in fields]
...
with arcpy.da.SearchCursor('in_memory\\temp', fieldnames) as sCur:
   with arcpy.da.InsertCursor(target_fc, fieldnames) as iCur:
      for row in sCur:
         iCur.insertRow(row)

The 3rd run:

Start 01/24/2019 01:40:31 PM
Finish 01/24/2019 02:19:09 PM
Total: 38m:38s
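The totals can be sanity-checked straight from the start and finish timestamps; a quick standard-library check of the 3rd run's numbers (the helper name is mine):

```python
from datetime import datetime

FMT = '%m/%d/%Y %I:%M:%S %p'

def run_total(start, finish):
    """Elapsed time between the logged start and finish timestamps."""
    return datetime.strptime(finish, FMT) - datetime.strptime(start, FMT)

total = run_total('01/24/2019 01:40:31 PM', '01/24/2019 02:19:09 PM')
seconds = int(total.total_seconds())
print(seconds)                 # 2318 seconds
print(divmod(seconds, 60))     # (38, 38), i.e. 38m:38s
print(round(seconds / 850, 1)) # ~2.7 s per feature class
```

That 2318 s over roughly 850 feature classes is where the "about 3 sec per transaction" figure below comes from.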

 

Now we're talking. So much faster! In fact, 2318 seconds for 850 INSERTs comes out to about 3 seconds per transaction, and when you plot it:

Now, that's the kind of behavior I'd expect: no matter how many INSERTs you do, each one takes about 2-4 seconds. So my question: what in the world is Append_management doing? Clearly, it's the wrong tool for what I was doing despite its name. Or is it not?
