My guess is that arcpy Python code is just a thin layer around compiled C code which does all the heavy lifting, so trying to optimize that thin layer would not buy you very much. Using techniques such as in-memory feature classes, caching objects in Python dicts, and even multithreading can provide really significant performance improvements.
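For what it's worth, here is a minimal sketch of two of those techniques; the paths and field names (PARCEL_ID, OWNER) are made-up stand-ins for your own data:

import arcpy

src = r"C:\data\county.gdb\parcels"   # hypothetical on-disk feature class
tmp = r"memory\parcels"               # in-memory workspace, no disk I/O

# 1. Copy once into the memory workspace and run later geoprocessing against that.
arcpy.management.CopyFeatures(src, tmp)

# 2. Cache attributes in a plain dict instead of re-querying with cursors.
owners = {}
with arcpy.da.SearchCursor(tmp, ["PARCEL_ID", "OWNER"]) as rows:
    for parcel_id, owner in rows:
        owners[parcel_id] = owner

# Later lookups are dict hits, not another cursor pass over the feature class.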
I've chased through all that in the debugger and you can watch this happen -- with arcpy calls it first reformats the parameters into the old pre-arcpy format (it was so long ago I forget; it's called gp geoscripting or something), then it calls that old thing, and then that formats the call again for the compiled code. It seems dodgy to me, like a pile of cards resting on a freight train going into a dark tunnel, but hey! it shipped!
ANYWAY, THAT part happens pretty quickly. What always seems slow to me is the import. I have not seen any other Python module load more slowly than arcpy. I can see what feels like a 10:1 speed up when I can remove "import arcpy". My suspicion is that it's dragging along doing license checks even when there are no calls in my script (meaning I accidentally did the "import arcpy" then never needed to do any arcpy calls). I always figured I have no control over that stage so I never worried about it. Either I need arcpy or I don't.
Using "import arcgis" seems fast, but the login is slow.
So my approach generally is to run one Python script that does a lot of work standalone instead of driving 3 or 4 scripts from a PYT toolbox. It's always a ton of extra work to set up a PYT, and so far it's never been worth the effort. It runs slower and no one else (normal ArcPro GIS people) ever uses it anyway.
I always break the business code out of PYTs anyway, into several files that can be tested independently, so adding a "toolbox" just makes things more work and slower. MUCH harder to test too.
For me, with my teensy rural county datasets, what really rips along is Pandas! YEAH! I find that debugging in a Jupyter notebook with Pandas is awesome. Load the dataset into a dataframe in one cell, then use that one dataframe all day long while crunching the data different ways. Once I have the process down I can just paste the code into one .py file. Sometimes I just save the ipynb itself with some notes for next time and I am done.
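The notebook pattern is really just this (the file and column names are placeholders):

# Cell 1: load once
import pandas as pd
parcels = pd.read_csv("parcels_export.csv")

# Cells 2..n: crunch the same dataframe different ways all day
parcels.groupby("ZONING")["ACRES"].sum()
parcels[parcels["ACRES"] > 40].sort_values("ACRES", ascending=False).head(20)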
The slowdowns are absolutely license-based; the first run of a script is almost always the slowest. Same when you load a toolbox for the first time.
I've built a somewhat maintainable system for breaking PYT files out into modular toolboxes (a series of Python modules with tool classes that can be tracked per file), where the PYT itself only loads the tool objects from those modules.
At the end of the day though, getting data into a more standard format and ingesting it with pure pandas (especially if you have an Nvidia GPU and can get cuDF set up properly) will always be fastest. The only reason to ever use Python toolboxes is if you need something that mappers and designers can use within Pro itself, or if you want to standardize some process behind an easy-to-use interface for multiple non-programmers.
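The appeal of cuDF is that it mostly mirrors the pandas API, so the swap is close to drop-in. A rough sketch, assuming a RAPIDS install, a supported NVIDIA GPU, and a made-up CSV export:

import cudf

parcels = cudf.read_csv("parcels_export.csv")      # same call shape as pandas, runs on the GPU
summary = parcels.groupby("ZONING")["ACRES"].sum()
print(summary.to_pandas())                         # pull the small result back to the CPU

Newer RAPIDS releases also ship a cudf.pandas accelerator mode that patches pandas itself, if you'd rather not change any imports.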
Thanks for confirming many of my sketchy thoughts.
I wrote a template for PYT years ago that broke out as you describe: (1) PYT, (2) business logic*, (3) utilities. Then a month ago I looked at the current project and said "why do I break things out like this? is that the best way?" and tried fewer files, and YES, what you suggest is best: make the PYT file as small as possible. Keep 100% of the actual work in separate files.
PYT files break every time I touch them. SO get them working separately, then never touch them.
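"As small as possible" for me looks something like the sketch below. The module and function names (parcel_report, run_report) are made up, but the shape is the point: the tool class inside the .pyt is pure wiring, and the real work lives in an importable module that can be tested without Pro.

import arcpy
from parcel_report import run_report      # hypothetical business-logic module


class ParcelReport(object):
    def __init__(self):
        self.label = "Parcel Report"
        self.description = "Thin wrapper; all the logic is in parcel_report.py"

    def getParameterInfo(self):
        in_features = arcpy.Parameter(
            displayName="Input Features",
            name="in_features",
            datatype="GPFeatureLayer",
            parameterType="Required",
            direction="Input",
        )
        return [in_features]

    def execute(self, parameters, messages):
        # Hand off immediately; the same function runs from pytest or a plain script.
        run_report(parameters[0].valueAsText, log=messages.addMessage)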
Cool tip on cuDF -- off to see what that entails now. My home computer has a fancy nvidia board. I need some excuse for having spent that $. 2500+ cores. My work laptop computer is newer and has... 2000. COOL. Oh, you kids today have no idea what it was like using punch cards and 9-track tapes.
* business logic = ugh! buzzword 🙂 I hate buzzwords. 🙂 Probably no longer in style and shows my age. Wait. The comment on punch cards? Strike that from the record.
Some tips for dealing with PYT breakage:
Use a wrapper/fallback class during import, and make sure to reload the modules during development or you'll need to restart the kernel constantly to get code changes loaded in (with a reload call, the toolbox refresh option in ArcPro picks up your code changes):
Wrapper:
def build_dev_error(label: str, desc: str):
    class Development(object):
        def __init__(self):
            """Placeholder tool for development tools"""
            self.category = "Tools in Development"
            self.label = label
            self.alias = self.label.replace(" ", "")
            self.description = desc
            return
    return Development
Reload Logic:
from importlib import reload, import_module
from traceback import format_exc

try:
    import tools.ToolModule
    reload(tools.ToolModule)
    from tools.ToolModule import ToolClass
except:
    ToolClass = build_dev_error("Tool Class", format_exc())


class Toolbox(object):
    def __init__(self):
        """Define the toolbox (the name of the toolbox is the name of the
        .pyt file)."""
        self.label = "My Toolbox"
        self.alias = "MyToolbox"
        # List of tool classes associated with this toolbox
        self.tools = [
            ToolClass,
        ]
This flow lets your PYT be just a list of imports, and it lets you throw a bunch of tools in and handle one of them failing without breaking the others. The `build_dev_error` function returns a stub class with a dynamically set description (in this case the traceback, for debugging purposes).
I can't go back to single-file PYTs anymore; this method really does wonders for maintainability and for keeping tools that may or may not error out on import from gumming up the whole toolbox.
In my use case I keep only one tool per file so it's easier to track changes and plug/play between branches, but you could have multiple tool classes in one tool module file.
Another important thing is to create a utility module that you put generic reusable code in, so you don't have to patch it in multiple places; you make the change in one module and get the update everywhere.
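A bare-bones sketch of what that can look like; the helper names here are just examples, not anything from a real project:

# utils.py -- shared helpers imported by every tool module
import arcpy


def field_names(table):
    """Return the field names of a table or feature class as a plain list."""
    return [f.name for f in arcpy.ListFields(table)]


def message(messages, text):
    """Log to the geoprocessing window when a messages object is available, else print."""
    if messages is not None:
        messages.addMessage(text)
    else:
        print(text)


# In any tool module:
#   from utils import field_names, message
# Fix a bug here once and every toolbox picks it up.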