Issues training arcgis.learn models on multiple GPUs

06-20-2021 11:57 PM
ChinaEsri
New Contributor

Hi there, I am trying to use multiple GPUs to train arcgis.learn models, but I get the error shown below.

To Reproduce
Steps to reproduce the behavior:

python -m torch.distributed.launch --nproc_per_node=8 test.py

The code in test.py is as follows:

from arcgis.learn import prepare_data, UnetClassifier, DeepLab
import os
import torch

# Report how many GPUs PyTorch can see
print('Available devices ', torch.cuda.device_count())

# Load the exported training chips
path = r'/home/gpu_test02/data/label382_chip512_rota120'
data = prepare_data(path, chip_size=512, batch_size=6)

# Build a DeepLab model, unfreeze all layers, train for 1 epoch at lr 0.001
m = DeepLab(data)
m.unfreeze()
m.fit(1, 0.001)
m.save('model_test')

Error (the same traceback is printed by several processes with their output interleaved; one copy is shown):

  File "/home/gpu_test02/miniconda3/envs/arcgis/lib/python3.6/site-packages/arcgis/learn/models/_arcgis_model.py", line 1000, in _save
    os.makedirs(self.learn.path / self.learn.model_dir)
  File "/home/gpu_test02/miniconda3/envs/arcgis/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/home/gpu_test02/data/label382_chip512_rota120/models/checkpoint_2021-06-19_15-30-17'
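
The traceback looks consistent with every launched process calling os.makedirs on the same checkpoint directory: one call creates it and the rest fail. I have not confirmed this against the arcgis.learn source, so treat it as a guess. The sketch below is not arcgis.learn code. It only imitates the pattern with the standard library, using threads as a stand-in for the GPU processes, and the names make_dir_strict, make_dir_tolerant and run_workers are made up for illustration. It compares a plain os.makedirs call with one that passes exist_ok=True.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_dir_strict(path):
    """Same call shape as in the traceback: fails if the directory already exists."""
    os.makedirs(path)


def make_dir_tolerant(path):
    """Variant that accepts a directory another worker has already created."""
    os.makedirs(path, exist_ok=True)


def run_workers(fn, path, n_workers=8):
    """Call fn(path) from n_workers threads and count the FileExistsError failures."""
    def attempt(_):
        try:
            fn(path)
            return 0
        except FileExistsError:
            return 1

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(attempt, range(n_workers)))


with tempfile.TemporaryDirectory() as root:
    # Hypothetical directory names standing in for the timestamped checkpoint folder
    strict_failures = run_workers(make_dir_strict, os.path.join(root, "models", "checkpoint_a"))
    tolerant_failures = run_workers(make_dir_tolerant, os.path.join(root, "models", "checkpoint_b"))
    print("strict failures:", strict_failures)
    print("tolerant failures:", tolerant_failures)
```

With the plain call, only one worker can create the directory and every other worker raises FileExistsError. With exist_ok=True, none of them do. Whether the library can be made to behave this way, or to save from a single process only, is what I would like to know.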

Platform:

  • OS: CentOS 7
  • Python API Version 1.8.4
