
Finding a Performance Bottleneck

06-02-2023 08:44 AM
ChrisDougherty
Occasional Contributor

I have been doing some experimentation on scaling up the processing of a large geoprocessing job. Currently, I am running a 1:1 join from the GeoAnalytics toolset. 

The data:

  • Large nationwide polygon data in parquet format
    • Hosted in Azure Blob Storage
  • Statewide wetlands data as a hosted feature service

The environment: 

  • A hosting server, a GeoAnalytics server, and a Spatiotemporal Big Data Store server
  • All are VMs in Azure
  • I have been experimenting with the size of the GeoAnalytics server, from 8 CPU cores w/ 64 GB RAM all the way up to 48 cores x 384 GB RAM

My question:

On the smaller servers, I could see the CPU and RAM max out for significant periods of time. When the RAM would spill over into the pagefile, I noted that as the bottleneck and increased the RAM for the next run. Generally, increasing the size of the VM reduced the duration of the join process. What is interesting is when I increased memory to 256 GB and beyond. 

What appears to happen is the CPU fires up to 100% usage early on, while RAM slowly climbs to about 92%. With 32 cores or more, the CPU looks like it burns through the operation pretty quickly, then drops to 3% for the rest of the run. The part I can't figure out is that the total duration doesn't decrease after scaling past 32 cores x 256 GB: the CPU gets through the operations faster, but then the server sits there with high memory usage for another 25 minutes. Bumping the VM to significantly higher specs barely improved it.

Any ideas on what the bottleneck could be? Network doesn't seem to be active during this time period (high input at the beginning and high output at the end, but little during the long middle). The job logs all look like this:

  • esriJobMessageTypeInformative: {"messageCode":"BD_101029","message":"3824/6980 distributed tasks completed.","params":{"completedTasks":"3824","totalTasks":"6980"}}
1 Reply
VenkataKondepati
Occasional Contributor

Short take: you’ve likely hit a post-compute bottleneck (skew/stragglers, shuffle spills, or downstream ingest) rather than pure CPU. In GeoAnalytics (Spark under the hood), joins run in phases: read → partition/shuffle → compute → write/ingest. Past ~32 cores / 256 GB, your compute phase speeds up, but total duration stalls because the long middle is waiting on a slower phase.

Here’s how I’d triage and tune:

1) Remove the hosted feature service from the join path

Hosted feature services are great for apps, not for big joins.

Stage both inputs as files (Parquet/Big Data File Share in Blob) or in the Spatiotemporal Big Data Store (STBDS).

Then run Join Features (GeoAnalytics) from file→file or STBDS→STBDS.

If you ultimately need a hosted feature layer, write to STBDS/file first, then “copy to hosted” as a separate step. This avoids mid-job throttling/applyEdits constraints.
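
Rough sketch of the file-to-file pattern with the ArcGIS API for Python. Item and dataset names are placeholders, and the join_features parameter names are from the geoanalytics module as I recall them, so double-check them against your API version:

```python
# Sketch: run Join Features entirely against a registered big data file share
# (Parquet in Blob), keeping the hosted feature service out of the join path.
from arcgis.gis import GIS
from arcgis.geoanalytics.summarize_data import join_features

gis = GIS("https://yourportal.example.com/portal", "your_user", "your_password")

# The big data file share item registered against the Blob container
# (title is a placeholder for your own item).
bdfs = gis.content.search("bigDataFileShares_myshare", "Big Data File Share")[0]
datasets = {lyr.properties.name: lyr for lyr in bdfs.layers}

polygons = datasets["nationwide_polygons"]   # placeholder dataset names
wetlands = datasets["wetlands_parquet"]

# 1:1 spatial join; write the result as a new output rather than pushing
# edits into an existing hosted service mid-job.
result = join_features(
    target_layer=polygons,
    join_layer=wetlands,
    join_operation="JoinOneToOne",
    spatial_relationship="Intersects",
    output_name="polygons_joined_wetlands",
)
```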

2) Check for skew and stragglers

A few partitions with many polygon overlaps can stall the job while others finish.

In the join tool, reduce spatial grid index size (finer tiling) or repartition to increase parallelism.

If available, enable speculative execution so slow tasks get duplicated.

Sanity-check the wetlands state(s) that might dominate (e.g., coastal states) and consider partitioning by state and running in parallel.
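
One cheap way to confirm stragglers is to timestamp those BD_101029 progress messages and watch the completion rate: a healthy job climbs steadily, a skewed one races to ~90% and then crawls. A rough sketch, assuming you have dumped the job messages to a log file with a leading ISO timestamp on each line:

```python
# Sketch: measure the straggler tail from GeoAnalytics job progress messages.
# Assumes each log line starts with an ISO timestamp and contains a BD_101029
# message like {"completedTasks":"3824","totalTasks":"6980"} (the format shown
# in the job messages above).
import json
import re
from datetime import datetime

progress = []  # (timestamp, fraction of distributed tasks completed)
with open("job_messages.log") as fh:
    for line in fh:
        match = re.search(r'\{.*"messageCode":"BD_101029".*\}', line)
        if not match:
            continue
        params = json.loads(match.group(0))["params"]
        ts = datetime.fromisoformat(line.split()[0])  # leading timestamp (assumption)
        progress.append((ts, int(params["completedTasks"]) / int(params["totalTasks"])))

# If 0-90% takes a few minutes and 90-100% takes ~25, a handful of fat
# partitions (skew) or the ingest phase is dominating the run.
t_start = progress[0][0]
t_90 = next(t for t, frac in progress if frac >= 0.90)
t_end = progress[-1][0]
print(f"0-90%: {t_90 - t_start}   90-100%: {t_end - t_90}")
```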

3) Watch the right place for bottlenecks

While GeoAnalytics Server shows low CPU, your STBDS/Elasticsearch may be busy indexing (disk/IOPS). Check its node metrics (CPU, disk, heap, indexing rate, shard health).

Put Premium SSD on STBDS data volumes; ensure temp/scratch disks are SSD too (shuffle spill is IO-bound).

Keep Blob storage and VMs in the same region/VNet; enable ADLS Gen2/hierarchical namespace if you can, and increase client parallelism.
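
If you can't easily get at the Data Store's internal metrics, even a minimal poller on the suspect node during the quiet middle will show whether the time is going into disk instead of CPU. Sketch using psutil:

```python
# Sketch: sample CPU, memory, and disk throughput once a second. Run it on the
# GeoAnalytics node (shuffle/temp spill) and on the STBDS node (indexing) while
# the job is in its low-CPU "quiet middle".
import time
import psutil

prev = psutil.disk_io_counters()
while True:
    time.sleep(1)
    cur = psutil.disk_io_counters()
    read_mb = (cur.read_bytes - prev.read_bytes) / 1e6
    write_mb = (cur.write_bytes - prev.write_bytes) / 1e6
    print(
        f"cpu={psutil.cpu_percent():5.1f}%  "
        f"mem={psutil.virtual_memory().percent:5.1f}%  "
        f"disk_read={read_mb:7.1f} MB/s  disk_write={write_mb:7.1f} MB/s"
    )
    prev = cur
```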

4) Spark/GA knobs that matter

Shuffle partitions: set to ~2–3× total cores (too small → fat partitions & spills; too big → overhead).

Executor memory overhead: bump to prevent GC churn when memory sits ~90%.

Broadcast the small side if feasible (state wetlands per state is small) to avoid a full shuffle (GA sometimes does this, but size thresholds can be conservative).
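
GeoAnalytics Server manages its own Spark configuration, so you can't set these per job, but if you reproduce the join in a plain PySpark session against the same Parquet (handy for isolating the Spark side), the equivalents look like this. Paths, the join key, and sizes are placeholders, and the plain attribute join here stands in for the spatial join the GA tool actually does:

```python
# Sketch: Spark-level equivalents of the knobs above, in a standalone PySpark
# session (illustration only; GeoAnalytics Server configures Spark itself).
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

total_cores = 32
spark = (
    SparkSession.builder
    .appName("join-tuning-sketch")
    # ~2-3x total cores: enough partitions to avoid fat tasks that spill
    .config("spark.sql.shuffle.partitions", str(total_cores * 3))
    # headroom beyond the executor heap so the job isn't stuck in GC at ~90% RAM
    .config("spark.executor.memoryOverhead", "8g")
    # re-launch slow straggler tasks speculatively
    .config("spark.speculation", "true")
    .getOrCreate()
)

# Placeholder paths to the same Parquet inputs in Blob/ADLS.
polygons = spark.read.parquet("abfss://container@account.dfs.core.windows.net/polygons/")
wetlands = spark.read.parquet("abfss://container@account.dfs.core.windows.net/wetlands/")

# If one side is small (e.g., a single state's wetlands), broadcasting it
# avoids the full shuffle. "state_fips" is a placeholder join key.
joined = polygons.join(broadcast(wetlands), on="state_fips", how="inner")
joined.write.parquet("abfss://container@account.dfs.core.windows.net/joined/")
```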

5) Output path matters

Writing straight to hosted feature layers forces batched applyEdits and service-level throttles; that often shows up as the “long middle.”

Prefer Parquet/STBDS output, then promote to hosted as a second, controllable step.
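
If a hosted layer is the end goal, the promote step can be the Copy To Data Store tool run on the file-based result. Names below are placeholders, and the function signature is from the arcgis geoanalytics module as I recall it:

```python
# Sketch: promote a result sitting in the big data file share (Parquet) to a
# hosted feature layer as a separate step, after the expensive join is done.
from arcgis.gis import GIS
from arcgis.geoanalytics.manage_data import copy_to_data_store

gis = GIS("https://yourportal.example.com/portal", "your_user", "your_password")

bdfs = gis.content.search("bigDataFileShares_myshare", "Big Data File Share")[0]
datasets = {lyr.properties.name: lyr for lyr in bdfs.layers}
joined = datasets["polygons_joined_wetlands"]  # placeholder result dataset

# Batch-copy into the data store / hosted layer on your own schedule, so the
# service write path never runs inside the join job itself.
hosted_result = copy_to_data_store(
    input_layer=joined,
    output_name="polygons_joined_wetlands_hosted",
)
```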

6) Quick experiments to confirm

A/B: run the same join with the wetlands exported to Parquet in Blob (no feature service anywhere in the job). If the 25-min plateau disappears or shrinks, the feature service path was the culprit (see the sketch after this list).

Partition sample: limit to 2–3 states with very different geometry densities to see if duration is driven by skew.

Increase STBDS nodes (or shards) briefly—if the tail shortens, ingest/index is your bottleneck.
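
For the A/B test, one way to get the wetlands out of the feature service and into Parquet is to pull them down with the ArcGIS API for Python and write them with geopandas. The layer URL is a placeholder, and for a very large layer you would chunk the query (for example by state or OID range):

```python
# Sketch: export a hosted wetlands layer to Parquet for the A/B run.
import json

import geopandas as gpd
from arcgis.features import FeatureLayer
from arcgis.gis import GIS

gis = GIS("https://yourportal.example.com/portal", "your_user", "your_password")
wetlands_fl = FeatureLayer(
    "https://yourportal.example.com/server/rest/services/Hosted/Wetlands/FeatureServer/0",
    gis=gis,
)

# Pull features (chunk by state/OID range for very large layers) and convert
# via GeoJSON into a GeoDataFrame, then write Parquet for the file share.
fset = wetlands_fl.query(where="1=1", out_sr=4326)
gdf = gpd.GeoDataFrame.from_features(
    json.loads(fset.to_geojson)["features"], crs="EPSG:4326"
)
gdf.to_parquet("wetlands.parquet")
# Then upload wetlands.parquet to the Blob container behind the big data file share.
```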

7) Rule-outs

If network is quiet in the middle, it's rarely Blob read; it's usually shuffle/spill/index.

CPU ~3% with RAM ~92% typically means waiting on IO (shuffle/temp/index), not “more cores needed.”

Bottom line: Treat the hosted feature service as an output target only, not as a join input. Stage both sides in Parquet or STBDS, tune partitions/grid to kill skew, ensure fast SSD for shuffle/index, and right-size shuffle partitions + memory overhead. Past that, scale the Data Store (ingest/index) rather than piling more cores onto the GeoAnalytics Server.
