I have been experimenting with scaling up a large geoprocessing job. Currently, I am running a 1:1 join from the GeoAnalytics toolset.
The data:
The environment:
My question:
On the smaller servers, I could see the CPU and RAM max out for significant periods of time. When RAM spilled over into the pagefile, I flagged that as the bottleneck and increased the RAM for the next run. Generally, increasing the size of the VM reduced the duration of the join process. Things got interesting once I increased memory to 256 GB and beyond.
What appears to happen is that the CPU fires up to 100% usage early on, while RAM slowly climbs to about 92%. With 32 cores or more, the CPU looks like it burns through the operation pretty quickly, then drops to 3% for the rest of the duration. The part I can't figure out is that the duration doesn't decrease when scaling past 32 cores x 256 GB. The CPU burns through the operations faster, but then the job sits there with high memory usage for another 25 minutes. Moving to a VM with significantly higher specs barely improved it.
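In case it helps narrow this down, here is a rough sketch of the kind of sampler I could run alongside the join to see whether the quiet middle phase is disk-bound rather than CPU-bound. It assumes the third-party psutil package is installed; the 10-second interval and the counters shown are illustrative, not anything from the job itself:

```python
# Sketch: periodically print system-wide disk I/O deltas plus CPU/RAM,
# to correlate with the "quiet" middle phase of the join.
import time

def deltas(before, after):
    """Per-key difference between two snapshots of monotonic counters."""
    return {key: after[key] - before[key] for key in before}

def io_snapshot(psutil_mod):
    """Cumulative disk read/write in MB since boot."""
    io = psutil_mod.disk_io_counters()
    return {"read_mb": io.read_bytes / 2**20,
            "write_mb": io.write_bytes / 2**20}

def sample(n_samples=6, interval=10):
    import psutil  # assumed installed; imported lazily so the helpers above need nothing extra
    psutil.cpu_percent(interval=None)  # prime the CPU counter (first call returns 0)
    prev = io_snapshot(psutil)
    for _ in range(n_samples):
        time.sleep(interval)
        cur = io_snapshot(psutil)
        d = deltas(prev, cur)
        print(f"disk +{d['read_mb']:.0f} MB read / +{d['write_mb']:.0f} MB written, "
              f"cpu {psutil.cpu_percent(interval=None):.0f}%, "
              f"ram {psutil.virtual_memory().percent:.0f}%")
        prev = cur
```

If the middle phase shows large write deltas while CPU sits at 3%, that would point at disk activity (for example, spilling intermediate results) rather than compute.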
Any ideas on what the bottleneck could be? The network doesn't seem to be active during this period (high input at the beginning and high output at the end, but little in the long middle). The job logs all look like this: