Hello
We have two GPUs on a Windows machine running ArcGIS Pro 2.8.3 and have tried distributed GPU training as described here:
https://developers.arcgis.com/python/guide/utilize-multiple-gpus-to-train-model/
I have not managed to get it to work, and I am wondering whether this is because it only works on Linux.
The above web page contains several inconsistencies, such as referring to a Windows path with:
Also
The issue I am getting is that only one GPU starts processing, and then it crashes at the end of the first epoch (excerpts from the crash below):
File "C:\ArcGIS\Pro_\bin\Python\envs\arcgispro-py3\lib\site-packages\fastai\callback.py", line 347, in on_batch_end
dist.all_reduce(val, op=dist.ReduceOp.SUM)
AttributeError: module 'torch.distributed' has no attribute 'all_reduce'
and
AttributeError: module 'torch.distributed' has no attribute 'barrier'
Has anyone managed to get this going in Windows?
Many thanks
Hi @TimG ,
Multi-GPU training is not yet supported on Windows. It will be supported in ArcGIS API for Python version 1.9.1 and up.
Thanks,
Sandeep
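For anyone hitting the same AttributeError, a quick way to see whether the installed PyTorch build exposes the distributed collectives at all can be sketched as below. The function name `distributed_support_report` is made up for this illustration; it only uses `torch.distributed.is_available()`, `torch.cuda.device_count()` and plain `hasattr` checks, and it does not prove that multi-GPU training in arcgis.learn will work.

```python
import importlib.util


def distributed_support_report():
    """Summarise whether the local PyTorch build exposes torch.distributed collectives.

    Hypothetical helper for illustration; it is not part of arcgis.learn or fastai.
    """
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch
    import torch.distributed as dist

    report["torch_version"] = torch.__version__
    # Number of CUDA devices PyTorch can see (0 if CUDA is unavailable).
    report["cuda_device_count"] = torch.cuda.device_count()
    # False when the build was compiled without the distributed package.
    report["distributed_available"] = dist.is_available()
    # The two attributes the traceback in this thread complains about.
    report["has_all_reduce"] = hasattr(dist, "all_reduce")
    report["has_barrier"] = hasattr(dist, "barrier")
    return report


if __name__ == "__main__":
    for key, value in distributed_support_report().items():
        print(f"{key}: {value}")
```

If `distributed_available` comes back False, or the last two entries are False, the errors quoted in the original post are what one would expect from that build.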
@Anonymous User OK, thanks. Any idea when that will be?
Hi,
ArcGIS API for Python version 1.9.1 has been released. If you are an Anaconda user, you can get it, along with all the deep learning dependencies, by running this command in a clean environment:
conda install -c esri arcgis_learn=1.9.1 python=3.8
Thanks,
Sandeep
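After installing, one way to confirm that the environment actually picked up 1.9.1 or later is sketched below. It assumes the API is registered under the distribution name `arcgis`; the helper names `version_tuple`, `meets_minimum` and `installed_arcgis_version` are made up for this example.

```python
import re
from importlib import metadata


def version_tuple(text):
    """Turn a version string such as '1.9.1' into (1, 9, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))[:3]


def meets_minimum(installed, minimum="1.9.1"):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_arcgis_version():
    """Return the installed 'arcgis' distribution version, or None if it is absent."""
    try:
        return metadata.version("arcgis")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_arcgis_version()
    if found is None:
        print("arcgis is not installed in this environment")
    else:
        print(f"arcgis {found}; at least 1.9.1: {meets_minimum(found)}")
```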
Hi @Anonymous User
I created a blank environment using "conda create --name dl13" (as can be seen in the prompt below), but got the issues below when trying to install arcgis_learn and python.
(dl13) C:\ArcGIS\Pro_\bin\Python\envs\arcgispro-py3>conda install -c esri arcgis_learn=1.9.1 python=3.8
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versions
Package python conflicts for:
arcgis_learn=1.9.1 -> python[version='>=3.6,<3.7.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0']
arcgis_learn=1.9.1 -> boost=1.73 -> python[version='2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.7|>=3|>=3.6|>=3.9,<3.10.0a0|3.4.*|>=3.9,<3.10|>=3.8,<3.9|>=3.7,<3.8|>=3.6,<3.7']
python=3.8
I also tried cloning arcgispro-py3 and installing arcgis_learn and python into the clone, but got pages of conflicts, such as:
Package keras-gpu conflicts for:
esri/win-64::deep-learning-essentials==2.8=arcgispro_4 -> keras-gpu=2.3
defaults/win-64::keras-gpu==2.3.1=0
defaults|defaults/win-64::keras-gpu==2.3.1=0
Package swat conflicts for:
esri|esri/win-64::swat==1.8.1=py37_0
esri/win-64::arcpy==2.8=py37_arcgispro_29734 -> swat
esri/win-64::swat==1.8.1=py37_0
Any ideas?
Many Thanks
Tim
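As a side note on reading the "Package python conflicts for" output above: the two Python constraints printed for arcgis_learn=1.9.1 can be checked by hand against the python=3.8 pin. The sketch below does that with a deliberately simplified matcher (it handles only `>=`, `<` and `X.Y.*` clauses and ignores pre-release ordering, so it is not conda's real solver logic; the function names are invented for this example). Both printed constraints admit 3.8, which suggests, but does not prove, that the pin itself is not the real source of the failure.

```python
import re


def parse(version):
    """Keep only the leading numeric components, so '3.7.0a0' becomes (3, 7, 0)."""
    numeric_part = re.split(r"[a-z]", version)[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))


def clause_ok(candidate, clause):
    """Check one clause such as '>=3.8', '<3.9.0a0' or '3.6.*'."""
    if clause.startswith(">="):
        return candidate >= parse(clause[2:])
    if clause.startswith("<"):
        return candidate < parse(clause[1:])
    if clause.endswith(".*"):
        prefix = parse(clause[:-2])
        return candidate[: len(prefix)] == prefix
    raise ValueError(f"unsupported clause: {clause}")


def satisfies(version, spec):
    """'|' separates alternatives; ',' joins clauses that must all hold."""
    candidate = parse(version)
    return any(
        all(clause_ok(candidate, clause) for clause in alternative.split(","))
        for alternative in spec.split("|")
    )


# Constraints copied from the conda output in this thread.
direct = ">=3.6,<3.7.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0"
via_boost = (
    "2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.7|>=3|>=3.6|>=3.9,<3.10.0a0"
    "|3.4.*|>=3.9,<3.10|>=3.8,<3.9|>=3.7,<3.8|>=3.6,<3.7"
)

if __name__ == "__main__":
    print("3.8.0 allowed by arcgis_learn:", satisfies("3.8.0", direct))
    print("3.8.0 allowed via boost:", satisfies("3.8.0", via_boost))
```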
Hi Tim,
From the documentation available here https://developers.arcgis.com/python/guide/install-and-set-up/#Install-using-Python-Command-Prompt-o...
arcgis_learn is a metapackage designed for standalone Anaconda environments. You can install it in vanilla Anaconda environments, but not in ArcGIS Pro conda environments.
For ArcGIS Pro, I would recommend using the deep learning installer, but multi-GPU support will only be available in ArcGIS Pro 2.9, once it is released.
For now, you can use a standalone Anaconda environment and install arcgis_learn in it.
Thanks,
Sandeep
Hi Sandeep
I have tried all sorts of things to get it to work, including a standalone vanilla Anaconda environment, but on Windows it gives the errors below.
conda install -c esri arcgis_learn=1.9.1 python=3.8
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: |
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versions
Package python conflicts for:
arcgis_learn=1.9.1 -> boost=1.73 -> python[version='>=2.7,<2.8.0a0|>=3.9,<3.10.0a0|>=3.9,<3.10|>=3.8,<3.9|>=3.7,<3.8|>=3.6,<3.7|>=3.6']
arcgis_learn=1.9.1 -> python[version='>=3.6,<3.7.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0']
python=3.8
I have also tried on Linux; the install gets further, but then starts producing pages of errors such as:
ClobberError: This transaction has incompatible packages due to a shared path.
packages: esri/linux-64::torch-cluster-1.5.9-py38_torch18.0_cuda11.1_2, esri/linux-64::torch-scatter-2.0.7-py38_torch18.0_cuda11.1_2, esri/linux-64::torch-spline-conv-1.2.1-py38_torch18.0_cuda11.1_2, esri/linux-64::torch-sparse-0.6.10-py38_torch18.0_cuda11.1_1
path: 'lib/python3.8/site-packages/test/utils.py'
Any ideas?
Thanks
Tim
Hi @TimG,
Can you try it again now?
Yes, @Anonymous User, it installs now.
Many thanks for your help.
Please see my response at https://community.esri.com/t5/arcgis-pro-questions/train-deep-learning-model-using-multiple-gpus-on/...
Cheers!
Pavan Yadav | Product Engineer - Imagery and AI
Esri | 380 New York | Redlands, 92373 | USA
https://www.linkedin.com/in/pavan-yadav-1846606/