I have very large datasets that I process using arcpy. These datasets are broken into geographical regions, so with each run of my ETL scripts I set arcpy.env.extent to a polygon that defines the region I'm processing. If I don't set the extent, a function like
arcpy.management.GetCount(r"sdefile.sde\myFeature_class")
will take maybe 8-10 seconds, but if I have arcpy.env.extent set to a relatively simple polygon bounding box, the same function takes 20 MINUTES! The computer I'm running the code on is a different machine from the SDE database, which is a very large Oracle Enterprise cluster. The feature class I have the most trouble with has about 5 million records per region. The features all have spatial indexes on them.
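For context, the extent setup looks roughly like this (simplified; the boundary feature class name here is a placeholder, not my real data):

import arcpy

# Hypothetical region boundary feature class; the real layer name differs.
region_fc = r"Q:/ARCGIS-DEV.REGION1.sde/REGION1_BOUNDARY"
arcpy.env.extent = arcpy.Describe(region_fc).extent  # limit subsequent tools to this bounding box

# With the extent set, the same count now takes ~20 minutes instead of seconds.
print(arcpy.management.GetCount(r"Q:/ARCGIS-DEV.REGION1.sde/REGION1_BUILDINGS").getOutput(0))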
Calling Get Count on an entire dataset is fast because it is basically a metadata lookup rather than actually counting records. Twenty minutes does seem a bit long for Get Count, even on 5 million records, but enterprise geodatabase connections are very sensitive to latency as well as the usual factors that impact database performance. How long does a Select Layer By Attribute take when you run a query like "REGION = 'name'"?
Ran:
from timeit import default_timer as timer
import arcpy

# Plain GetCount on the feature class (no extent, no selection)
start = timer()
print(arcpy.management.GetCount(r"Q:/ARCGIS-DEV.REGION1.sde/REGION1_BUILDINGS").getOutput(0))
end = timer()
print("Time taken: {:.2f} seconds".format(end - start))

# Attribute selection on the same feature class
start = timer()
arcpy.management.SelectLayerByAttribute(r"Q:/ARCGIS-DEV.REGION1.sde/REGION1_BUILDINGS", "NEW_SELECTION", "REGION='REGION1'")
end = timer()
print("Time taken for SelectLayerByAttribute: {:.2f} seconds".format(end - start))
And the first call ran in 13.55 seconds; the second was still running after 2 hours when I killed the process.
The results indicate the issue isn't with using arcpy.env.extent with a polygon per se. The fact that a polygon filter took twenty minutes while a simple "REGION='REGION1'" query took two or more hours tells me: 1) the dataset is quite large (which you already indicated), and 2) the REGION field probably isn't indexed, or isn't indexed correctly, if the operation takes that long.
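If the field does turn out to be unindexed, adding an attribute index on REGION is a quick first step (a minimal sketch; the connection path comes from your example, and on versioned data you may need to run this from the data owner's connection):

import arcpy

buildings = r"Q:/ARCGIS-DEV.REGION1.sde/REGION1_BUILDINGS"

# Add a non-unique attribute index on the REGION field.
arcpy.management.AddIndex(buildings, ["REGION"], "REGION_IDX", "NON_UNIQUE", "NON_ASCENDING")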
At this point I would run the same attribute test outside of ArcGIS. If you see a comparable difference when connecting to the database from a client outside of ArcGIS, it points more to a data-structure and database-performance issue than an ArcGIS software issue.
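For example, from plain Python (a sketch assuming the python-oracledb package; the credentials, DSN, and table name here are placeholders):

import oracledb  # pip install oracledb

# Placeholder credentials and service name; substitute your own connection details.
conn = oracledb.connect(user="gis_reader", password="***", dsn="oracle-host/ORCLPDB")
cur = conn.cursor()

# The same attribute filter the ArcGIS selection uses, run directly against the table.
cur.execute("SELECT COUNT(*) FROM REGION_BUILDINGS WHERE REGION = 'REGION1'")
print(cur.fetchone()[0])

conn.close()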
Using DBeaver,
select count(*) from REGION_BUILDINGS where REGION = 'REGION1';
took 53 seconds, so definitely not a database issue, and that column isn't even indexed...
🤔
Major caveat for me - I do not write any Python code at all, but I did find a related support case regarding poor performance with versioned data, where the user was running a select very similar to yours (REGION_BUILDINGS where REGION='REGION1'). The long and the short of it (and the resolution for the customer) was:
Resolution: It is not efficient to use layers and layer selections in the client's context.
GP Intersect and arcpy search cursor recommended.
Now, how to rewrite the Python from GetCount to Intersect and a Search Cursor I do not know, but it is something to explore/experiment with; a rough sketch is below. Hope this helps.
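For what it's worth, a minimal, untested sketch of the search-cursor approach (the path and where clause are taken from the question):

import arcpy

buildings = r"Q:/ARCGIS-DEV.REGION1.sde/REGION1_BUILDINGS"

# Count rows with a server-side where clause instead of a layer selection.
count = 0
with arcpy.da.SearchCursor(buildings, ["OID@"], where_clause="REGION='REGION1'") as cursor:
    for _ in cursor:
        count += 1
print(count)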
Is the data versioned? If so, the query that ran for 53 seconds is not comparable to the query from ArcGIS, because for traditionally versioned data ArcGIS has to resolve the base table against the delta (adds/deletes) tables rather than just reading the base table.
If the data is versioned, what does the state tree look like?
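You can check quickly from Python (a sketch using the path from the question):

import arcpy

desc = arcpy.Describe(r"Q:/ARCGIS-DEV.REGION1.sde/REGION1_BUILDINGS")
# isVersioned reports whether the feature class is registered as traditionally versioned.
print(desc.isVersioned)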
And adding to Joshua's question about whether the data is traditionally versioned or not, have the eGDB maintenance geoprocessing tools been run recently? Compress? Analyze? Analyze Datasets? Rebuild Indexes? If not, then it would not surprise me that performance is poor.
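If they haven't, the usual routine from an administrative connection looks roughly like this (a sketch; the admin connection file name is a placeholder):

import arcpy

# Placeholder admin connection; Compress normally runs as the sde/admin user.
admin_sde = r"Q:/ARCGIS-DEV.SDE_ADMIN.sde"

arcpy.management.Compress(admin_sde)  # trim the versioning state tree

# Refresh statistics and rebuild indexes, including the geodatabase system tables.
arcpy.management.AnalyzeDatasets(admin_sde, include_system="SYSTEM")
arcpy.management.RebuildIndexes(admin_sde, include_system="SYSTEM")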