Train Deep Learning Model stuck on Epoch 1

08-05-2020 12:36 AM
MartynasBielinis
New Contributor

I am trying to run the Train Deep Learning Model tool in ArcGIS Pro 2.6 using single-shot detection (SSD) with a ResNet-34 backbone, but it seems to be stuck on Epoch 1. The progress bar has stayed at 0% for hours with no change. The messages show text for "Training Loss" and "Validation Loss" but nothing after that.

Has anyone dealt with a similar problem?

4 Replies
jdetka
New Contributor

I am having the same problem but with Pro 2.7. Did you ever find a solution? 

 

MartyRyan
New Contributor III

I am having the same issue: Epoch 1 shows no progress after 18 hours. I am running this on my CPU, which I know makes for slow processing. (This is a test to see what my CPU will do.) The only indication of progress is that Task Manager shows my ArcGIS Pro project using almost 100% of my CPU, and the power usage trend shows "Very high". I am using a 0.5 m resolution image, so I realize this may take a long time. It took 33 hours to export my training data from this same image on this computer.

MartyRyan
New Contributor III

I discovered that, at least in my case, there doesn't seem to be an error. I am using only my CPU, so progress is slow. I am now at Epoch 3 (10%); I started training at 2:00 PM on Wednesday, 3/3.
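
For anyone unsure whether their training is running on the CPU or a GPU, here is a minimal sketch of a check that could be run in the Python environment used for deep learning. It assumes that environment is PyTorch-based; the function name describe_training_device is made up for this example and is not part of ArcGIS Pro.

```python
def describe_training_device():
    """Return a short description of the device PyTorch would use."""
    try:
        # Assumption: PyTorch is installed in the active Python environment.
        import torch
    except ImportError:
        return "PyTorch not installed; cannot check for a GPU"
    if torch.cuda.is_available():
        # Name of the first CUDA device visible to PyTorch.
        return "GPU: " + torch.cuda.get_device_name(0)
    return "CPU only (no CUDA-capable GPU visible to PyTorch)"


print(describe_training_device())
```

If this reports CPU only, long epochs like the ones described here are to be expected rather than a sign that the tool has hung.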

MartyRyan
New Contributor III

I was having the same issues with slowness. In my case it was not an error, just a matter of where I was processing. We set up a remote PC with an 8 GB NVIDIA GPU (Quadro M4000), and it made more progress in 1 hour than my CPU did in 28 hours. Comparable times for epoch-to-epoch progress:

Start to Epoch 2 (5%): 19 hours on CPU, 20 minutes on GPU
Epoch 2 (5%) to Epoch 3 (10%): 9.5 hours on CPU, 4 minutes on GPU

I got a trained model of 20 epochs in 3 hours.
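
To put those timings in perspective, here is a small sketch of the arithmetic using only the numbers reported above. The helper name speedup is made up for this example, and the 20-epoch CPU figure is a rough extrapolation that assumes every epoch takes as long as the 9.5-hour one, which may not hold in practice.

```python
def speedup(cpu_hours, gpu_minutes):
    """Ratio of CPU time to GPU time for the same stretch of training."""
    return cpu_hours * 60 / gpu_minutes


# Timings reported in this thread.
first_stretch = speedup(19, 20)    # start to Epoch 2 (5%)
second_stretch = speedup(9.5, 4)   # Epoch 2 (5%) to Epoch 3 (10%)

# Rough extrapolation (assumption): 20 epochs at 9.5 CPU hours each.
projected_cpu_hours = 20 * 9.5

print(f"First stretch: about {first_stretch:.0f}x faster on GPU")
print(f"Second stretch: about {second_stretch:.1f}x faster on GPU")
print(f"Projected CPU time for 20 epochs: about {projected_cpu_hours:.0f} hours")
```

On these numbers the GPU was roughly 57 to 140 times faster per stretch, and the same 20-epoch run that took 3 hours on the GPU would have taken on the order of 190 hours (about 8 days) on the CPU.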