Scaling resources to run arcpy scripts on very large datasets

09-03-2021 01:16 AM
HugoBouckaert1
New Contributor III

Hi 

I have several arcpy scripts that run on my desktop but take days to complete. So far I have spatially gridded the data and then used parallel processing to process the spatial grids simultaneously.
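For context, the per-grid parallelisation I am using at the moment looks roughly like the sketch below (the paths, feature class names and the Clip step are just placeholders for the real analysis, and the worker count is hardcoded):

import multiprocessing
import os
import arcpy

# Placeholder paths - the grid index, input features and output folder are hypothetical
GRID_FC = r"C:\data\grids.gdb\grid_cells"
INPUT_FC = r"C:\data\inputs.gdb\roads"
OUT_DIR = r"C:\data\outputs"

def process_cell(cell_oid):
    # Each worker writes to its own file geodatabase to avoid locking conflicts
    gdb_name = "cell_{}.gdb".format(cell_oid)
    out_gdb = os.path.join(OUT_DIR, gdb_name)
    if not arcpy.Exists(out_gdb):
        arcpy.management.CreateFileGDB(OUT_DIR, gdb_name)
    # Select one grid cell and clip the input features to it
    oid_field = arcpy.Describe(GRID_FC).OIDFieldName
    cell_layer = arcpy.management.MakeFeatureLayer(
        GRID_FC, "cell_{}".format(cell_oid), "{} = {}".format(oid_field, cell_oid))
    out_fc = os.path.join(out_gdb, "clipped")
    arcpy.analysis.Clip(INPUT_FC, cell_layer, out_fc)
    # ... the real (expensive) geoprocessing on out_fc goes here ...
    return out_fc

if __name__ == "__main__":
    cell_oids = [row[0] for row in arcpy.da.SearchCursor(GRID_FC, ["OID@"])]
    with multiprocessing.Pool(processes=8) as pool:  # roughly one worker per core
        results = pool.map(process_cell, cell_oids)
    print(results)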

We are looking for a better solution, however, because spatial gridding creates its own problems: features get "cut in half" at grid boundaries, and if they are kept in both grids, they end up duplicated.

Does GeoAnalytics Server support arcpy? It does not look like it, but if it does, that might be a solution.

Ideally we are looking at running this code in an Azure cloud environment. In Azure you can make use of clusters (multiple VMs) that can be scaled up for some powerful processing. However, on Azure clusters only the PySpark libraries are supported for distributed processing. To use arcpy you have to install a conda environment (e.g. the environment that comes with ArcGIS Pro), but any code you run that way cannot use the Azure cluster resources, so you are back to being limited to a single server.

Are there any solutions out there for running Python code that uses the arcpy libraries at scale? The compute resources needed are huge, so a single server, even a very powerful one, might not suffice. Note also that Python runs on a single core by default, so a single powerful machine, even with 64 cores, will not speed up the process because only one core gets used. This is why I spatially gridded the data and sent each grid to a separate core, but as I said, we would like to move away from that solution.

Any ideas or help would be most welcome. 

Thanks

Hugo 

