Zonal Statistics takes 1.5 hours in AGPro, in Python the same code takes a few days

03-02-2022 03:08 AM
MichaelLedwith
New Contributor III

Hi,

I've been pulling out my hair trying to find a solution to this enigma. Zonal Statistics as Table takes less than two hours if I run it using the AG Pro 2.9.2 tool. If I copy the Python code and run it in a 64-bit IDLE shell (3.7.11), it takes almost two days.

I've tested this on my workstation and three different virtual desktops. All are fairly well equipped Win10 environments with plenty of disk space, RAM, a graphics card or graphics support, etc. There are plenty of CPU and GPU resources available during the process.

Here's the long and short of the situation:

1. The in_zone_data consists of slightly more than 4000 polygons that have been converted to a raster with 10 m resolution and a unique ID. The raster has a defined coordinate system and is approximately 5000 X 5000 pixels, unsigned 16-bit;

2. The in_value_raster is the result of two pre-processes. First, the red band is extracted from an RGB image via Raster Functions (in AG Pro) or arcpy.ia.ExtractBand (in Python). Focal Statistics is then run on the red band via Raster Functions (in AG Pro) or arcpy.ia.FocalStatistics (in Python). Both processes finish almost instantaneously in AG Pro and in Python. The raster has a defined coordinate system, 0.16 m resolution, 625 000 X 625 000 pixels, unsigned 8-bit. In AG Pro, the raster layer source for the Focal Stats raster is on a separate drive (D:\Temp\ArcGISProTemp9836\);

3. Zonal Statistics as Table. Running this tool in AG Pro on the Raster Layers takes about one hour and forty-five minutes. Running the same process (arcpy.ia.ZonalStatisticsAsTable) in Python takes days. 

4. I've set up the environments so that they are identical in AG Pro and Python (workspace, scratch, extent, etc.). The processing times are unchanged.

5. I've converted the in_zone_data to a raster layer via arcpy.management.MakeRasterLayer. The processing times are essentially the same.

Any explanation? Anyone experience something similar? This snippet is part of a larger program that iterates through tons of different areas so I really need to be able to run this via a script and not manually.

No, I cannot provide anyone with the input data, but all help is certainly appreciated.

Thanks!

by Anonymous User
Not applicable

Agreeing with DanPatterson here. It's quite possible that a standalone script is not applying the parallel processing factor. Here is a link to the documentation with the scripting portion highlighted:

https://pro.arcgis.com/en/pro-app/2.8/tool-reference/environment-settings/parallel-processing-factor...
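For reference, here is a minimal sketch of setting that environment explicitly in a standalone script. `arcpy.env.parallelProcessingFactor` is the environment the linked page describes; the helper name `set_parallel_factor` is hypothetical, and the import is guarded only so the snippet also loads outside an ArcGIS Pro Python environment.

```python
try:
    import arcpy  # only present in an ArcGIS Pro Python environment
except ImportError:
    arcpy = None


def set_parallel_factor(env, factor="75%"):
    """Set the parallel processing factor on an arcpy-style env object.

    `factor` is passed as a string: a percentage such as "75%" or a
    process count such as "4".
    """
    env.parallelProcessingFactor = str(factor)
    return env.parallelProcessingFactor


if arcpy is not None:
    # Set the value explicitly rather than relying on whatever default applies.
    set_parallel_factor(arcpy.env, "75%")
```

Whether this setting explains the slowdown reported in this thread is a separate question; the sketch only shows where the setting goes.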

MichaelLedwith
New Contributor III

Well, I certainly want it to be understood that I do appreciate any and all help. 

I have been using ESRI products (starting with ArcInfo) since the late 80's so I do know a bit about things. What would be great would be an explanation that goes along with any suggestions. For example, why would running the process in a 32-bit IDLE shell speed up the time?

Now I don't know what the default parallel processing parameter is for Pro and I didn't change it when I ran the original process. The CPU on my workstation is an Intel i7-6700K with four cores (it's also hyperthreaded, so that makes eight logical cores). I'll re-run the script with a parallel processing value of 75% and see what happens. However, I still can't imagine that this parameter could change the processing time from 36 hours to less than 2.

DonMorrison1
Occasional Contributor III

This is very curious and I hope somebody from ESRI can explain it. To answer your question: in some cases where memory is constrained, running in 32-bit mode can be more efficient because of the smaller pointer size (4 bytes vs 8 bytes). This results in smaller stack and heap sizes and less swapping to disk. But I'm quite sure that is not the problem here. You could try running the Python profiler (cProfile), which will tell you which Python functions are consuming the most time, but this gives you no visibility into the underlying C code, which is where I'd guess the real work is happening.
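A minimal sketch of the cProfile suggestion, using only the standard library. The names `profile_call` and `stand_in_workload` are hypothetical; the stand-in would be replaced by whatever function wraps the slow geoprocessing call.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def stand_in_workload(n):
    # Placeholder for the real work, e.g. a function that runs the zonal step.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(stand_in_workload, 100_000)
    print(report)
```

As noted above, time spent inside compiled code is attributed to the single Python-level call that entered it, so the report would show which call is slow but not why.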

MichaelLedwith
New Contributor III

I increased the parallel processing to 75% and re-started the process at 18:42 last evening. It finished at 09:33 this morning. So that's about 15 hours. Definitely better than 36 hours but the improvement might have something to do with less activity on the server during the night.

Update: I ran ZonalStatisticsAsTable again this morning in ArcGIS Pro with parallel processing set to 75% and the process time was 1:57:32 (slightly longer).

SarmisthaChatterjee
Esri Contributor

Hi Michael, thank you for reporting this. I am a Senior Product Engineer on the Spatial Analyst team, and I have a few questions for you:

  1. What is the environment Cell size within Pro and the standalone Python environment?
  2. I am assuming you are setting the Parallel processing factor to 75%, as you mentioned in your comment, both within Pro and in Python. Are you setting it as a percentage or as a number of processors?
  3. Have you saved the output of Focal statistics before using it as input in Zonal Statistics tools in Pro? If so, what is the output format? And what is the cell size of this output?
  4. Are you setting the Output Coordinate System environment in either Pro or Python? If so, is it the same as the value_raster or something else?
MichaelLedwith
New Contributor III

1. The cell size is unchanged, which means that it's 10m (Maximum of Inputs);

2. Percentage: (75%);

3. Saving the output from the Focal Statistics (in this case 16 cm resolution, 625 000 X 625 000 pixels) takes hours and hours. The whole reason for using Raster Functions is to greatly increase the speed of the operation;

4. All the inputs have the same defined coordinate system (EPSG: 3006). No settings have been changed or given.

Honestly, I'm running exactly the same commands. In ArcGIS Pro, the process goes fast; in Python 3.7.11 (IDLE, 64-bit), it is amazingly slow.

MichaelLedwith
New Contributor III

Update:

I converted a single polygon to raster (10m res, 55 X 66 pixels) and reran ZonalStatisticsAsTable. In ArcGIS Pro, it took two (2) seconds! In Python, the process took slightly less than 12 hours.

N.B.: it's only ZonalStatisticsAsTable that's giving me trouble. Other arcpy commands run at typical speeds in Python.

I've tested this in three different Virtual Desktop environments and on my workstation. The results are all similar. My next step is to completely remove all ESRI traces from one of the VD instances and re-install Pro. 

SarmisthaChatterjee
Esri Contributor

Thank you for your response.

Recommendations:

Based on your response and your script, I have the following recommendations:

  1. Before executing Focal Statistics, specify
    1. arcpy.env.cellSize to your zone raster (10m)
    2. arcpy.env.snapRaster to your value raster
  2. Before executing Zonal Statistics or Zonal Statistics as Table, you can save the output from Focal Statistics to disk to check how much time it takes to create this raster dataset. You can also check the cell size to make sure that it is 10m (OPTIONAL).
  3. You can specify the output from Focal Statistics as the value raster for Zonal Statistics or Zonal Statistics as Table. At this point, the output from Focal Statistics, if not saved earlier, will be saved at 10m cell size, which should take substantially less time.
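The recommendations above could be sketched roughly as follows. This is an illustration under assumptions, not the original script: the helper names `prepare_environment` and `run_zonal`, the zone field `"Value"`, and the default Focal Statistics neighborhood are placeholders, and the import is guarded only so the snippet loads outside an ArcGIS Pro Python environment.

```python
try:
    import arcpy  # only present in an ArcGIS Pro Python environment
except ImportError:
    arcpy = None


def prepare_environment(env, value_raster, cell_size=10):
    """Apply analysis settings before any raster function is created."""
    env.cellSize = cell_size       # analysis cell size, matching the 10 m zone raster
    env.snapRaster = value_raster  # snap raster, as recommended above
    return env


def run_zonal(zone_raster, rgb_raster, out_table):
    """Hypothetical end-to-end run; requires arcpy and real dataset paths."""
    prepare_environment(arcpy.env, rgb_raster)
    red = arcpy.ia.ExtractBand(rgb_raster, [1])  # band 1 assumed to be red
    focal = arcpy.ia.FocalStatistics(red)        # tool defaults as placeholders
    return arcpy.ia.ZonalStatisticsAsTable(zone_raster, "Value", focal, out_table)
```

The point of the ordering is that the cell size and snap raster are in place before Focal Statistics is called, so anything derived from its output is evaluated at 10 m.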

Comments:

I have also put together some comments below to explain my recommendations.

1. Cell size: It looks like you have not explicitly specified the environment cell size. The Zonal Statistics or Zonal Statistics as Table tool is executing with default cell size, which is Maximum of inputs. So, your Zonal Statistics output cell size is 10m. The Zonal Statistics as Table output is also calculated based on 10m cell size.

2. Focal Statistics: Since there is no analysis environment cell size specified, Focal Statistics is calculated with 0.16m cell size, resulting in a large raster. You are not saving this large raster immediately; instead, you are passing the output from Focal Statistics to Zonal Statistics or Zonal Statistics as Table. Since you are not saving the Focal Statistics output, it feels like Focal Statistics is executing quickly. In reality, Focal Statistics is returning a raster function, which stores only the analytical model and not a raster dataset on disk.

If you are not interested in the Focal Statistics output as the final result, you can pass this raster function output to another tool/raster function. However, depending on the nature of the subsequent operation, either the raster function will be extended with an additional analytical operation (a function chain), or a dataset will first be created and then used as input to the analytical tool.

Usually, local and focal operations create a function chain, and a raster dataset is created only when the final output is saved at the end of the analysis, which makes the execution faster.

If a raster function or a function chain is used as input to zonal or global tools/raster functions, a dataset will first be created internally before the tool runs on it. In this case, the local/focal operation is not chained together with the zonal/global operation.

3. Zonal Statistics as Table: You are specifying the output of Focal Statistics as input to Zonal Statistics or Zonal Statistics as Table. Due to the nature of the operation (zonal), the tool requires a raster dataset as the value input. So, it triggers the save internally, which creates a large raster with 0.16m cell size. This raster is then internally resampled to a 10m value raster for performing the analysis with the 10m zone raster. All this extra work in Zonal Statistics as Table is probably what takes the additional time, and it can be avoided by specifying the analysis cell size and snap raster before performing Focal Statistics.
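To give a sense of scale for that internal save, here is a rough back-of-the-envelope calculation using the raster dimensions quoted earlier in the thread. It assumes a single band stored uncompressed at one byte per cell and ignores pyramids, tiling, and compression, so the real files would differ.

```python
def raster_cells(width_px, height_px, cell_cm, target_cell_cm):
    """Return (native cell count, cell count after resampling to the target size)."""
    native = width_px * height_px
    # Ground extent in centimetres, kept as integers to avoid float rounding.
    width_cm = width_px * cell_cm
    height_cm = height_px * cell_cm
    resampled = (width_cm // target_cell_cm) * (height_cm // target_cell_cm)
    return native, resampled


# 625 000 x 625 000 pixels at 0.16 m (16 cm), resampled to 10 m (1000 cm).
native, coarse = raster_cells(625_000, 625_000, cell_cm=16, target_cell_cm=1_000)
bytes_per_cell = 1  # unsigned 8-bit, uncompressed (assumption)

print(f"0.16 m raster: {native:,} cells, about {native * bytes_per_cell / 1e9:.1f} GB")
print(f"10 m raster:   {coarse:,} cells, about {coarse * bytes_per_cell / 1e6:.0f} MB")
print(f"ratio: {native / coarse:,.2f}x")
```

Under these assumptions the full-resolution raster has roughly 3900 times as many cells as the 10 m version, which is consistent with the internal save dominating the run time.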

It is not yet clear why the tool executes faster in Pro than in your standalone Python script. I am hoping that by specifying the cell size and snap raster explicitly, you will see a performance improvement. We can then take a closer look at your workflow in Pro to understand the source of the difference in execution time.

Please let me know if this works with your dataset, and if you have additional questions.

Thanks,

Sarmistha

MichaelLedwith
New Contributor III

Thanks for the explanation, Sarmistha (particularly about the internal save). It appears that setting env.cellSize to 10m did the trick. Interestingly enough, setting the cellSize within EnvManager didn't work, but setting the cellSize on a separate line cut the time down from 36 hours to less than two.
