I have been experimenting with scaling up a large geoprocessing job. Currently, I am running a 1:1 join from the GeoAnalytics toolset.
The data:
The environment:
My question:
On the smaller servers, I could see the CPU and RAM max out for significant periods of time. When RAM spilled over into the pagefile, I flagged that as the bottleneck and increased the RAM for the next run. Generally, increasing the size of the VM reduced the duration of the join process. Things got interesting once I increased memory to 256 GB and beyond.
What appears to happen is that the CPU fires up to 100% usage early on, while RAM slowly climbs to about 92%. With 32 cores or more, the CPU looks like it burns through the operation pretty quickly, then drops to 3% for the rest of the duration. The part I can't figure out is that the duration doesn't decrease when scaling past 32 cores x 256 GB. The CPU burns through the operations faster, but then the job sits there with high memory usage for another 25 minutes. Moving to a VM with significantly higher specs barely improved it.
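In case it helps narrow this down, here is a rough sketch of the kind of sampler I could run alongside the join to see whether the quiet middle phase is disk-bound rather than CPU-bound. It assumes the third-party psutil package is installed; the 10-second interval and the counters shown are illustrative, not anything from the job itself:

```python
# Sketch: periodically print system-wide disk I/O deltas plus CPU/RAM,
# to correlate with the "quiet" middle phase of the join.
import time

def deltas(before, after):
    """Per-key difference between two snapshots of monotonic counters."""
    return {key: after[key] - before[key] for key in before}

def io_snapshot(psutil_mod):
    """Cumulative disk read/write in MB since boot."""
    io = psutil_mod.disk_io_counters()
    return {"read_mb": io.read_bytes / 2**20,
            "write_mb": io.write_bytes / 2**20}

def sample(n_samples=6, interval=10):
    import psutil  # assumed installed; imported lazily so the helpers above need nothing extra
    psutil.cpu_percent(interval=None)  # prime the CPU counter (first call returns 0)
    prev = io_snapshot(psutil)
    for _ in range(n_samples):
        time.sleep(interval)
        cur = io_snapshot(psutil)
        d = deltas(prev, cur)
        print(f"disk +{d['read_mb']:.0f} MB read / +{d['write_mb']:.0f} MB written, "
              f"cpu {psutil.cpu_percent(interval=None):.0f}%, "
              f"ram {psutil.virtual_memory().percent:.0f}%")
        prev = cur
```

If the middle phase shows large write deltas while CPU sits at 3%, that would point at disk activity (for example, spilling intermediate results) rather than compute.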
Any ideas on what the bottleneck could be? The network doesn't seem to be active during this period (high input at the beginning and high output at the end, but little in the long middle). The job logs all look like this: