Distributed GPU deep learning training in Windows

09-15-2021 08:46 PM
TimG
by
New Contributor III

Hello

We have 2 GPUs on a Windows machine running ArcGIS Pro 2.8.3 and have tried distributed GPU training as described here:

https://developers.arcgis.com/python/guide/utilize-multiple-gpus-to-train-model/

I have not managed to get it to work and I am wondering if this is because it only works in Linux.

The above web page contains several inconsistencies. For example:

  • it refers to a Windows path, WindowsPath('C:/Users/Admin/AppData/Local/Temp/train_model.py'), but later on nvidia-smi is run in Ubuntu

Also

  • the train_model.py file won't run without modification, because the line below fails unless PSPNetClassifier is first imported from arcgis.learn
    • m = PSPNetClassifier(data)

 

The issue I am getting is that only one GPU starts processing, and training then crashes at the end of the 1st epoch (excerpts from the crash below):

File "C:\ArcGIS\Pro_\bin\Python\envs\arcgispro-py3\lib\site-packages\fastai\callback.py", line 347, in on_batch_end
dist.all_reduce(val, op=dist.ReduceOp.SUM)
AttributeError: module 'torch.distributed' has no attribute 'all_reduce'

and

AttributeError: module 'torch.distributed' has no attribute 'barrier'

Has anyone managed to get this going in Windows?

Many thanks

1 Solution

Accepted Solutions
SandeepKumar1
Esri Contributor

Hi @TimG ,

 

Multi-GPU training is not yet supported on Windows. It will be supported in the ArcGIS API for Python version 1.9.1 and up.

 

Thanks,

Sandeep
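
Editor's note: the two AttributeErrors in the question are consistent with a PyTorch build whose torch.distributed module lacks the collective operations. A minimal diagnostic sketch, assuming PyTorch is importable in the environment being tested, is shown here. The function name distributed_report is hypothetical (not part of arcgis.learn or PyTorch), and the result only describes the local PyTorch build; it does not by itself confirm that arcgis.learn multi-GPU training will work.

```python
def distributed_report():
    """Summarize whether the local PyTorch build exposes distributed collectives."""
    report = {
        "torch_installed": False,
        "distributed_available": False,
        "has_all_reduce": False,
        "has_barrier": False,
        "gpu_count": 0,
    }
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed here, so return the defaults.
        return report

    report["torch_installed"] = True
    # is_available() reports whether the distributed package was compiled in.
    is_available = getattr(dist, "is_available", None)
    report["distributed_available"] = bool(is_available and is_available())
    # These are the two attributes the traceback in the question complains about.
    report["has_all_reduce"] = hasattr(dist, "all_reduce")
    report["has_barrier"] = hasattr(dist, "barrier")
    # Number of CUDA devices PyTorch can see (0 if CUDA is unavailable).
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in distributed_report().items():
        print(f"{key}: {value}")
```

If has_all_reduce or has_barrier comes back False, the crash described in the question would be expected regardless of how many GPUs are present.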

5 Replies
TimG
by
New Contributor III

@SandeepKumar1 ok thanks. Any idea when that will be?

SandeepKumar1
Esri Contributor

Hi,

ArcGIS API for Python version 1.9.1 has been released. If you are an Anaconda user, you can get it along with all the deep learning dependencies by running this command in a clean environment:

conda install -c esri arcgis_learn=1.9.1 python=3.8

Thanks,

Sandeep
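
Editor's note: after installing, it may help to confirm that the environment really has version 1.9.1 or later, since that is the minimum stated above for multi-GPU training on Windows. The sketch below is illustrative only. The helper names are hypothetical, and the distribution name "arcgis" passed to importlib.metadata is an assumption about how the package registers itself.

```python
import re


def parse_version(text):
    """Return the leading dotted-numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum="1.9.1"):
    """Compare two version strings numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_arcgis_version():
    """Look up the installed version, or None if it cannot be determined."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        return None  # importlib.metadata requires Python 3.8+
    try:
        return version("arcgis")  # assumed distribution name
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_arcgis_version()
    if found is None:
        print("arcgis package metadata not found in this environment")
    else:
        print(f"arcgis {found}: meets 1.9.1 minimum = {meets_minimum(found)}")
```

A numeric comparison is used rather than a string comparison so that a version such as 1.10.0 is treated as newer than 1.9.1.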

TimG
by
New Contributor III

Hi @SandeepKumar1 

I created a blank environment using "conda create --name dl13", but got the issues below when trying to install arcgis_learn and python.

(dl13) C:\ArcGIS\Pro_\bin\Python\envs\arcgispro-py3>conda install -c esri arcgis_learn=1.9.1 python=3.8
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package python conflicts for:
arcgis_learn=1.9.1 -> python[version='>=3.6,<3.7.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0']
arcgis_learn=1.9.1 -> boost=1.73 -> python[version='2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.7|>=3|>=3.6|>=3.9,<3.10.0a0|3.4.*|>=3.9,<3.10|>=3.8,<3.9|>=3.7,<3.8|>=3.6,<3.7']
python=3.8

I also tried cloning arcgispro-py3 and installing arcgis_learn and python into the clone, but got pages of conflicts, such as:

Package keras-gpu conflicts for:
esri/win-64::deep-learning-essentials==2.8=arcgispro_4 -> keras-gpu=2.3
defaults/win-64::keras-gpu==2.3.1=0
defaults|defaults/win-64::keras-gpu==2.3.1=0

Package swat conflicts for:
esri|esri/win-64::swat==1.8.1=py37_0
esri/win-64::arcpy==2.8=py37_arcgispro_29734 -> swat
esri/win-64::swat==1.8.1=py37_0

Any ideas?

Many Thanks

Tim

SandeepKumar1
Esri Contributor

Hi Tim,

According to the documentation available here: https://developers.arcgis.com/python/guide/install-and-set-up/#Install-using-Python-Command-Prompt-o...

arcgis_learn is a metapackage designed for standalone Anaconda environments. You can install it in vanilla Anaconda environments, but not in ArcGIS Pro conda environments.

For ArcGIS Pro I would recommend using the deep learning installer, but multi-GPU support will only be available in ArcGIS Pro 2.9, once it is released.

For now, you can use a standalone Anaconda environment and install arcgis_learn in it.

Thanks,

Sandeep
