RetinaNet runs on only one CPU socket instead of two

02-17-2024 05:24 AM
Bigfoot48
New Contributor

I'm currently testing my setup to see how it performs for deep learning.

My current setup includes:

  • HP Proliant ML350 GEN9
  • CPUs: 2x E5-2698 v3
  • GPU: Tesla P40
  • 64 GB of RAM
  • 2x NVMe 1TB

I'm trying to follow this course: "Detection of electric utility features and vegetation encroachments from satellite images using deep learning | ArcGIS API for Python."


During the fitting process, I noticed it seems to be running on only one CPU instead of two, capping itself at 50% of total CPU capacity.
[Screenshot: Bigfoot48_0-1708175501423.png]


A few weeks before this, I did the course "Change Detection of Buildings from Satellite Imagery - Preview" on arcgis.com. At that time, I had only one CPU installed and also noticed only 50% CPU usage.

 

[Screenshot: Bigfoot48_2-1708175904011.png]

Is there a setting in Jupyter Notebook that I need to adjust? Or should I run the commands differently to make full use of my system's resources?

1 Reply
MarcoBoeringa
MVP Regular Contributor

Welcome to the wonderful world of "processor groups", "CPU sets", "processor affinity", NUMA (Non-Uniform Memory Access, where each CPU reaches its own local RAM faster than the other socket's remote RAM, with the faster local RAM being preferred) and other processor-scheduling details you have likely never heard of, but that do start to creep into your work once you hit large numbers of logical processors on multi-socket server systems.

As I also learned the hard way, software doesn't just magically use all the logical processors on your system. In fact, most software out there was never designed to take full advantage of systems with 64 or more logical processors, simply because such systems barely existed until very recently.
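One quick way to see how many logical processors the OS reports, versus how many this particular process is actually allowed to use, is a small stdlib-only Python check. This is a sketch, not part of the course code; note that `os.sched_getaffinity` is Linux-only, while on Windows a process is by default confined to a single processor group of at most 64 logical processors:

```python
import os

# Total logical processors the OS reports.
total = os.cpu_count()
print(f"Logical processors reported: {total}")

# On Linux, the affinity mask shows which CPUs this process may actually
# run on; if it is smaller than the total, the process is pinned to a
# subset (for example, a single socket or NUMA node).
if hasattr(os, "sched_getaffinity"):
    usable = len(os.sched_getaffinity(0))
    print(f"Logical processors usable by this process: {usable}")
```

Running this inside the same Jupyter kernel that does the training shows whether the notebook process itself has been restricted, or whether the limit comes from the library's own threading.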

One brief introduction is this article from Bitsum:

https://bitsum.com/general/the-64-core-threshold-processor-groups-and-windows/

There is a lot more to read out there about these subjects, but I can tell you it doesn't necessarily provide answers or working solutions.

However, as for practical advice, you might attempt to:

- Disable hyperthreading

- Disable NUMA

in your system's BIOS. In your case, this will create a system that logically behaves more or less like a single-socket 32-core system without hyperthreading, with both CPUs having equal and predictable access times to local and remote RAM. In this configuration it is more likely that all cores will be used, but they will have (slightly) higher RAM latency / access times.
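Before touching the BIOS, it can also be worth checking whether the training process has simply been pinned to one socket by its launcher or container. The following is a hedged, Linux-only sketch (the `os.sched_*` affinity calls do not exist on Windows, where per-process affinity is capped at one processor group) that tries to widen the process's CPU mask to every logical processor:

```python
import os

def widen_affinity():
    """Try to widen this process's CPU affinity to all logical processors.

    Linux-only sketch. A launcher, cgroup, or container may have pinned
    the process to a subset of CPUs (e.g. one socket / NUMA node).
    Returns the number of CPUs the process may now run on, or None if
    the platform lacks the sched_* affinity API (e.g. Windows, macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        os.sched_setaffinity(0, range(os.cpu_count()))
    except OSError:
        pass  # some CPUs may be forbidden by a cgroup; keep current mask
    return len(os.sched_getaffinity(0))

print(widen_affinity())
```

If the returned count stays at half your logical processors even after this call, the restriction is coming from below the process level (processor groups, cgroups, or the BIOS topology), not from Python.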

While both of these options are designed to make your system faster, they do not always do so in practice, and may in fact reduce the overall performance of specific workflows, such as those that need access to all physical cores. Note that disabling NUMA does not seem to be recommended on older multi-socket AMD systems from what I have read so far, as the links between the different CPUs' memory appear to have far too high latency, actually harming overall performance; this seems to be less of an issue on Intel systems. I don't know if it is still an issue with modern AMD systems (likely not).

If you do disable these options in your system's BIOS, I strongly recommend running a controlled performance test of your workflow before and after the change, to see which configuration is faster. It may turn out that disabling the options is actually harmful (though in my case it was beneficial to disable them).
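A controlled before/after comparison can be as simple as timing the same workload several times in each BIOS configuration and comparing medians. A minimal sketch; `workload()` here is a hypothetical stand-in that you would replace with your actual `fit()` call:

```python
import statistics
import time

def benchmark(fn, repeats=5):
    """Run fn several times and return the median wall-clock seconds.

    The median is less sensitive to one-off interference (caching,
    background tasks) than a single measurement or the mean.
    """
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def workload():
    # Hypothetical CPU-bound stand-in; substitute your real training step.
    sum(i * i for i in range(1_000_000))

print(f"median: {benchmark(workload):.3f} s")
```

Run the identical script in both configurations, keeping everything else (drivers, data, batch size) fixed, so the only variable is the BIOS setting.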
