Distributed GPU deep learning training in Windows

09-15-2021 08:46 PM
TimG
by
New Contributor III

Hello

We have 2 GPUs on a Windows machine running ArcGIS 2.8.3 and have tried distributed GPU training as described here:

https://developers.arcgis.com/python/guide/utilize-multiple-gpus-to-train-model/

I have not managed to get it to work, and I am wondering if this is because it only works in Linux.

The above web page contains several inconsistencies. For example, it refers to a Windows path:

  • WindowsPath('C:/Users/Admin/AppData/Local/Temp/train_model.py'), but later on nvidia-smi is run in Ubuntu.

Also:

  • the train_model.py file won't run without modification, because the line below fails unless PSPNetClassifier is first imported from arcgis.learn:
    • m = PSPNetClassifier(data)

 

The issue I am getting is that only one GPU starts processing, and then it crashes at the end of the 1st epoch (excerpts from the crash below):

File "C:\ArcGIS\Pro_\bin\Python\envs\arcgispro-py3\lib\site-packages\fastai\callback.py", line 347, in on_batch_end
dist.all_reduce(val, op=dist.ReduceOp.SUM)
AttributeError: module 'torch.distributed' has no attribute 'all_reduce'

and

AttributeError: module 'torch.distributed' has no attribute 'barrier'
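
Both errors above are what one would expect if the installed PyTorch build does not include distributed support, which would be consistent with multi-GPU training not being available on this Windows setup. A minimal sketch for checking this, assuming PyTorch is importable in the active environment; `missing_distributed_ops` is a hypothetical helper name written for this example, not part of any library:

```python
import importlib

# Collective operations named in the traceback excerpts above.
REQUIRED_OPS = ("all_reduce", "barrier")


def missing_distributed_ops(dist_module, required=REQUIRED_OPS):
    """Return the names in `required` that `dist_module` does not expose."""
    return [name for name in required if not hasattr(dist_module, name)]


if __name__ == "__main__":
    try:
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        # is_available() reports whether this build includes distributed support.
        print("distributed available:", dist.is_available())
        print("missing ops:", missing_distributed_ops(dist))
```

If the second line lists `all_reduce` or `barrier` as missing, the crash at the end of the first epoch is likely a limitation of the installed build rather than a mistake in the training script.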

Has anyone managed to get this going in Windows?

Many thanks

1 Solution

Accepted Solutions
by Anonymous User
Not applicable

Hi @TimG ,

 

Multi-GPU training is not yet supported on Windows. It will be supported in Python API for ArcGIS version 1.9.1 and up.

 

Thanks,

Sandeep

TimG
by
New Contributor III

@Anonymous User OK, thanks. Any idea when that will be?

by Anonymous User
Not applicable

Hi,

ArcGIS API for Python version 1.9.1 has been released. If you are an Anaconda user, you can get it, along with all the deep learning dependencies, by running this command in a clean environment:

conda install -c esri arcgis_learn=1.9.1 python=3.8
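
After the install finishes, it may be worth confirming that the environment actually picked up version 1.9.1 or later, since that is the version quoted in this thread for multi-GPU support on Windows. A minimal sketch, assuming the package exposes `arcgis.__version__`; `parse_version` and `meets_minimum` are hypothetical helpers written for this example:

```python
import re

# Version quoted in this thread for multi-GPU support on Windows.
MINIMUM = (1, 9, 1)


def parse_version(text):
    """Turn a string such as '1.9.1' into a tuple of ints.

    Parsing stops at the first component that does not start with a digit,
    and any non-numeric suffix on a component is ignored.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(text, minimum=MINIMUM):
    """Return True if the version string is at least `minimum`."""
    return parse_version(text) >= minimum


if __name__ == "__main__":
    try:
        import arcgis  # assumption: the module exposes __version__
    except ImportError:
        print("arcgis is not installed in this environment.")
    else:
        print(arcgis.__version__, "meets minimum:", meets_minimum(arcgis.__version__))
```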

Thanks,

Sandeep

TimG
by
New Contributor III

Hi @Anonymous User 

I created a blank environment using "conda create --name dl13" but, as can be seen below, got the following errors when trying to install arcgis_learn and python.

(dl13) C:\ArcGIS\Pro_\bin\Python\envs\arcgispro-py3>conda install -c esri arcgis_learn=1.9.1 python=3.8
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package python conflicts for:
arcgis_learn=1.9.1 -> python[version='>=3.6,<3.7.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0']
arcgis_learn=1.9.1 -> boost=1.73 -> python[version='2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.7|>=3|>=3.6|>=3.9,<3.10.0a0|3.4.*|>=3.9,<3.10|>=3.8,<3.9|>=3.7,<3.8|>=3.6,<3.7']
python=3.8

I also tried to clone arcgispro-py3 and install arcgis_learn and python into the clone, but got pages of conflicts, such as:

Package keras-gpu conflicts for:
esri/win-64::deep-learning-essentials==2.8=arcgispro_4 -> keras-gpu=2.3
defaults/win-64::keras-gpu==2.3.1=0
defaults|defaults/win-64::keras-gpu==2.3.1=0

Package swat conflicts for:
esri|esri/win-64::swat==1.8.1=py37_0
esri/win-64::arcpy==2.8=py37_arcgispro_29734 -> swat
esri/win-64::swat==1.8.1=py37_0

Any ideas?

Many Thanks

Tim

 

by Anonymous User
Not applicable

Hi Tim,

From the documentation available here https://developers.arcgis.com/python/guide/install-and-set-up/#Install-using-Python-Command-Prompt-o...

arcgis_learn is a metapackage designed for standalone Anaconda environments. You can install it in vanilla Anaconda environments, but not in ArcGIS Pro conda environments.

For ArcGIS Pro, I would recommend using the deep learning installer, but multi-GPU support will only be available in ArcGIS Pro 2.9, once it is released.

For now you can use a standalone anaconda environment and install arcgis_learn in it.

Thanks,

Sandeep

 

TimG
by
New Contributor III

Hi Sandeep

I have tried all sorts of things to get it to work, including a standalone vanilla Anaconda environment, but in Windows it gives the errors below.

conda install -c esri arcgis_learn=1.9.1 python=3.8
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: |
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package python conflicts for:
arcgis_learn=1.9.1 -> boost=1.73 -> python[version='>=2.7,<2.8.0a0|>=3.9,<3.10.0a0|>=3.9,<3.10|>=3.8,<3.9|>=3.7,<3.8|>=3.6,<3.7|>=3.6']
arcgis_learn=1.9.1 -> python[version='>=3.6,<3.7.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0']
python=3.8

 

I have also tried in Linux. The install gets further there, but then starts producing pages of errors such as:

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: esri/linux-64::torch-cluster-1.5.9-py38_torch18.0_cuda11.1_2, esri/linux-64::torch-scatter-2.0.7-py38_torch18.0_cuda11.1_2, esri/linux-64::torch-spline-conv-1.2.1-py38_torch18.0_cuda11.1_2, esri/linux-64::torch-sparse-0.6.10-py38_torch18.0_cuda11.1_1
  path: 'lib/python3.8/site-packages/test/utils.py'

 Any ideas?

Thanks

Tim

by Anonymous User
Not applicable

Hi @TimG,
Can you try it again now?

TimG
by
New Contributor III

Yes @Anonymous User, it installs now.

Many thanks for your help.

PavanYadav
Esri Contributor

Please see my response at https://community.esri.com/t5/arcgis-pro-questions/train-deep-learning-model-using-multiple-gpus-on/...

Cheers!

Pavan Yadav | Product Engineer - Imagery and AI
Esri | 380 New York | Redlands, 92373 | USA

https://www.linkedin.com/in/pavan-yadav-1846606/ 
