
Performance Issues with Advanced Notebook with GPU?

Question asked by knoop_umich on Oct 17, 2020
Latest reply on Oct 21, 2020 by knoop_umich

ArcGIS Notebooks seem like a great way to deliver a uniform environment to a large group of users, particularly in higher education, where students often bring a wide variety of hardware and operating system combinations to the table. Our initial experience, however, is that the performance is terrible, particularly with the "Advanced with GPU support - 4.0 runtime" kernel.

 

For example, we have a lot of students and researchers interested in training AI models for automated plant species identification for use with Survey123, as exemplified in this Esri-provided Notebook: Plant species identification using a TensorFlow-Lite model within mobile devices. To help them get started, I envisioned offering a workshop where each user could work through this example using an ArcGIS Notebook, so that we don't have to spend workshop time configuring each user's local environment.

 

Running this particular Esri example Notebook as an ArcGIS Notebook on ArcGIS Online, however, appears to require ~5.2 days to complete the training step! (Which means it will never finish, given what appears to be a 48-hour limit on how long a Notebook can run.) The example output provided in the Notebook suggests that step took ~12.5 hours for the Notebook author, and we see similar 10-18 hour runtimes for the training step on a typical on-premises GIS workstation.

 

Training AI models can certainly take time. However, it feels like the "Advanced with GPU" kernel is significantly under-resourced (or over-priced), or we missed something in the documentation about how we are supposed to use it...

 

For instance, at a cost of 0.5 credits per minute, the ~5.2-day run would cost 3,744 credits. If you use the pay-as-you-go pricing for developers to convert this to an approximate cost, the result is ~$375!

 

If I spin up my own AWS g4dn.xlarge instance, which includes an NVIDIA T4 GPU, and run the example Notebook, the training step takes just under 19 hours. The cost of an On-Demand g4dn.xlarge instance is $0.71/hour, which translates to a total cost of ~$13.50 to train the model (plus a few more cents for network, storage, etc.). That is quite a price difference: $375 vs. $13.50, not to mention the time difference of days versus hours!
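For anyone who wants to check the arithmetic, here is a small sketch of the comparison. The 0.5 credits/minute rate, the ~5.2-day and 19-hour runtimes, and the $0.71/hour instance price are the figures from this post; the $0.10-per-credit conversion is an assumption inferred from 3,744 credits coming out to roughly $375, and the function names are purely illustrative.

```python
MINUTES_PER_HOUR = 60


def notebook_cost(hours, credits_per_minute=0.5, usd_per_credit=0.10):
    """Return (credits, USD) for a hosted notebook billed per minute.

    usd_per_credit=0.10 is an assumed conversion rate, inferred from the
    post's figures (3,744 credits is about $375), not an official price.
    """
    credits = hours * MINUTES_PER_HOUR * credits_per_minute
    return credits, credits * usd_per_credit


def instance_cost(hours, usd_per_hour=0.71):
    """Return USD for a cloud instance billed per hour (rate from the post)."""
    return hours * usd_per_hour


# Runtimes quoted in the post: ~5.2 days hosted vs. ~19 hours on g4dn.xlarge.
agol_credits, agol_usd = notebook_cost(5.2 * 24)
aws_usd = instance_cost(19)

print(f"Hosted notebook: {agol_credits:,.0f} credits, about ${agol_usd:,.2f}")
print(f"g4dn.xlarge:     about ${aws_usd:,.2f}")
print(f"Cost ratio:      about {agol_usd / aws_usd:.0f}x")
```

Swapping in a different credit price or instance rate only changes the default arguments; the gap stays large unless the hosted runtime itself drops substantially.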

 

I've looked around for information about what hardware is backing the "Advanced with GPU support - 4.0 runtime" kernel, but was unable to find any. Whatever the configuration is, however, it is not providing the sort of performance boost for tasks like AI model training that I would've expected for a GPU-backed environment. 
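One way to probe the hardware from inside a running Notebook might be to query the NVIDIA driver directly. This is only a sketch: it assumes the `nvidia-smi` command-line tool is present in the Notebook's container, which I have not confirmed for the hosted runtime, and `describe_gpus` is just an illustrative helper name. If the tool is missing, the function returns `None` rather than failing.

```python
import shutil
import subprocess


def describe_gpus():
    """Return one 'name, total memory' string per visible GPU.

    Returns None when nvidia-smi is not installed or cannot be run,
    which may well be the case in a hosted notebook container.
    """
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # One CSV line per GPU; drop blank lines.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(describe_gpus())
```

A result of `None` or an empty list would suggest the kernel is not actually seeing a GPU, which would be worth ruling out before comparing training times.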

 

Is there some trick to getting better performance out of the "Advanced with GPU" kernel on ArcGIS Online?
