I am writing up a workflow for handling a large imagery dataset at my organization, so that future users can learn how to create mosaic datasets. Quick summary of my workflow:
1. Create Mosaic Dataset
2. Add Rasters to Mosaic Dataset
3. Define Overviews
4. Build Overviews
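For anyone who would rather script these four steps than run them from the GP pane, here is a minimal sketch using the arcpy Data Management tools of the same names. The paths, the mosaic name, the WKID, and the helper names `plan_mosaic_workflow` and `run_plan` are all hypothetical placeholders, and I am assuming the tiles are plain raster datasets. The plan is kept as plain data so it can be inspected on a machine without ArcGIS installed.

```python
def plan_mosaic_workflow(gdb, name, tiles_dir, overview_dir, wkid):
    """Return the four steps as (tool name, positional arguments) pairs."""
    mosaic = f"{gdb}/{name}"  # path of the mosaic dataset inside the geodatabase
    return [
        ("CreateMosaicDataset", [gdb, name, wkid]),
        ("AddRastersToMosaicDataset", [mosaic, "Raster Dataset", tiles_dir]),
        # overview_dir may be a folder or a geodatabase
        ("DefineOverviews", [mosaic, overview_dir]),
        ("BuildOverviews", [mosaic]),
    ]


def run_plan(steps, factor="100%"):
    """Run the plan with arcpy (needs the ArcGIS Pro Python environment)."""
    import arcpy  # imported here so the plan can be built without ArcGIS

    # Set the environment explicitly instead of relying on the tool default.
    arcpy.env.parallelProcessingFactor = factor
    for tool, args in steps:
        if tool == "CreateMosaicDataset":
            # Turn the WKID into a spatial reference object for the tool.
            args = args[:2] + [arcpy.SpatialReference(args[2])]
        getattr(arcpy.management, tool)(*args)


# Hypothetical example values; replace with real paths before running.
steps = plan_mosaic_workflow(
    "C:/data/imagery.gdb", "tiles_md", "C:/data/tiles", "C:/data/imagery.gdb", 4326
)
for tool, args in steps:
    print(tool, args)
```

Calling `run_plan(steps)` inside ArcGIS Pro's Python environment would execute the steps in order. Setting the factor explicitly only removes one variable; I have not tested whether it changes the behavior described below.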
Step 4 is where I encountered the issue. On my first test, building the overviews took 35 minutes for ~900 imagery tiles. During step 3, I designated that the overviews be stored in the file geodatabase.
Once I finished my documentation, I decided to run through it to make sure it was coherent and easy to follow. I did everything the same, except this time, when defining overviews in step 3, I designated that they be stored in a folder on my C drive rather than in a file geodatabase. When I clicked Run, I immediately noticed that the tool was going much slower. When it finally finished, it had taken over 5 hours to run.
I decided to comb through the tool logs in the GP history panel and I came across one big discrepancy.
Just before the tool started generating overviews, it said the following:
2023-10-18T13:44:59.883: Distributing mosaic dataset operation across 10 parallel instances on the specified host: [GIS-0123].
When I go to that same spot in the log for the second run, I can clearly see that it did not say anything about parallel processing. Here is a screenshot from the logs (the log on the right is the 35 minute run and the log on the left is the 5 hour run).
The only thing I did differently was in the Define Overviews step, so I can't figure out why the parallel processing was affected. I did not tweak the parallel processing environment settings in either tool. Does anyone have any experience with this issue?
Build Overviews (Data Management)—ArcGIS Pro | Documentation
The Parallel Processing Factor documentation refers to Build Overviews specifically:
Parallel Processing Factor (Environment setting)—ArcGIS Pro | Documentation
However, for cases in which all your processes are I/O bound to a disk or to an enterprise database connection, you may get better performance by specifying more processes than you have cores. For example, the Add Rasters to Mosaic Dataset tool is I/O bound when the mosaic dataset is stored in an enterprise database. Also, the Build Overviews tool is primarily I/O bound to the disk. You can use more processes than your machine has cores by specifying either a percent value greater than 100% or a number of processes greater than the number of cores on your machine. For example, if you have a 4-core machine, specifying 8 or 200% will spread operations over 8 processes.
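To make the arithmetic in that passage concrete, here is a small sketch of how a factor value maps to a process count. The function name `resolve_process_count` is my own, and rounding a fractional result up to the next whole process is my assumption about how a percentage is resolved; the passage above only confirms the 4-core, 8 or 200% case.

```python
import math


def resolve_process_count(factor: str, cores: int) -> int:
    """Translate a factor such as "8" or "200%" into a process count."""
    text = factor.strip()
    if text.endswith("%"):
        percent = float(text[:-1])
        # Assumption: a fractional result is rounded up to a whole process.
        return math.ceil(cores * percent / 100)
    # A plain number is taken as the process count itself.
    return int(text)


print(resolve_process_count("200%", 4))  # 8, the documentation's example
print(resolve_process_count("8", 4))     # 8, same result as an absolute count
```

In arcpy the setting is exposed as `arcpy.env.parallelProcessingFactor` and takes the same kind of string value.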
That's very interesting. I know you didn't write the documentation, but based on all your helpful posts across the community, I think you know the ins and outs of the software better than many Esri employees. So do you know what they mean when they say "I/O bound"? I know I/O means input/output, but the way I'm reading that, it seems to imply that writing to a file GDB isn't I/O bound.
Isn't pretty much everything on a computer I/O bound at some level? I'm trying to understand what the difference is, in terms of I/O, between writing to a folder on my local drive and writing to a file geodatabase.
Regarding the value you use for parallel processing if you input it manually, do you know what the limit is? Based on the example they provided, are you limited to 2 processes per core? For example, I have an i7-12700, which has 20 threads (8 P-cores and 4 E-cores). Which number would I double? I'm guessing the P-cores, since they are typically the workhorse cores, so I would use 16?
Sorry for throwing all these questions at you! I'm trying to learn this stuff as best I can so I can document it for the future!