Hi there, I am trying to use multiple GPUs to train arcgis.learn models, but I get the error shown below.
To Reproduce
Steps to reproduce the behavior:
python -m torch.distributed.launch --nproc_per_node=8 test.py
The code in test.py is as follows:
from arcgis.learn import prepare_data, UnetClassifier, DeepLab
import os
import torch
print('Available devices ', torch.cuda.device_count())
path = r'/home/gpu_test02/data/label382_chip512_rota120'
data = prepare_data(path, chip_size=512, batch_size=6)
m = DeepLab(data)
m.unfreeze()
m.fit(1, 0.001)
m.save('model_test')
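As far as I understand, the launcher starts one copy of test.py per process, so every line above (including the save step) runs once in each of the 8 processes. A minimal sketch for checking this from inside the script, assuming the launcher exports the `RANK` and `WORLD_SIZE` environment variables (the helper name `process_info` is hypothetical and not part of arcgis.learn):

```python
import os

def process_info():
    """Return (rank, world_size) for this process.

    Assumes the launcher exports RANK and WORLD_SIZE; falls back to a
    single-process default when they are not set.
    """
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return rank, world_size

rank, world_size = process_info()
print(f"process {rank} of {world_size}")
```

If each process prints a different rank, all of them are executing the same save path at roughly the same time.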
Error (the same traceback is printed by each failing process, with the lines interleaved; it is shown once here):
File "/home/gpu_test02/miniconda3/envs/arcgis/lib/python3.6/site-packages/arcgis/learn/models/_arcgis_model.py", line 1000, in _save
os.makedirs(self.learn.path / self.learn.model_dir)
File "/home/gpu_test02/miniconda3/envs/arcgis/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/home/gpu_test02/data/label382_chip512_rota120/models/checkpoint_2021-06-19_15-30-17'
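The traceback looks like a race: several processes call `os.makedirs` on the same checkpoint directory, one succeeds, and the others fail because the directory already exists. Below is a minimal sketch of race-tolerant directory creation using only the standard library. It is an illustration of the general technique, not the actual arcgis.learn code or fix; the helper name `ensure_dir` and the directory name are hypothetical.

```python
import os
import tempfile

def ensure_dir(path):
    """Create `path` if needed; do not fail if it already exists.

    exist_ok=True makes the call safe when another process has
    created the same directory a moment earlier.
    """
    os.makedirs(path, exist_ok=True)
    return path

# Usage: repeated calls (or calls from several processes) are harmless.
base = tempfile.mkdtemp()
target = os.path.join(base, "models", "checkpoint_example")  # hypothetical name
ensure_dir(target)
ensure_dir(target)  # no FileExistsError on the second call
```

Another common approach is to let only one process (for example rank 0) create directories and write files, but whether arcgis.learn exposes a way to do that is not something I have verified.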