Hi,
Background:
I have an analysis that takes cross sections of a multipatch road model (around 10,000+ sections) to get some geometrical statistics. The statistics are computed per layer/feature type (road surface, road shoulder, new terrain, etc.) and consist of delta height, width, etc. As of now, everything works just fine.
However, it's a bit slow, as it needs 10,000 sections × #layers = approximately 200,000 SearchCursor requests, each with a where_clause.
Note: the layer in question is a single-point PointZ layer with around 1M points.
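The current pattern is roughly the sketch below (dataset path and field names are simplified placeholders, not the real ones):

```python
import arcpy

# Placeholder path and field names, just to illustrate the pattern.
points_fc = r"C:\data\roads.gdb\cross_section_points"   # ~1M single-point PointZ features
section_ids = range(10000)                               # ~10,000 cross sections
layer_types = ["road surface", "road shoulder", "new terrain"]  # etc.

for section_id in section_ids:
    for layer_type in layer_types:
        # One SearchCursor with a where_clause per section per layer type.
        where = "SECTION_ID = {0} AND LAYER_TYPE = '{1}'".format(section_id, layer_type)
        with arcpy.da.SearchCursor(points_fc, ["SHAPE@X", "SHAPE@Y", "SHAPE@Z"], where) as cur:
            zs = [z for x, y, z in cur]
        if zs:
            delta_height = max(zs) - min(zs)  # plus width and the other statistics
```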
Question:
Would there be any benefit to just loading the entire table with point geometries into pandas and doing all the where clauses there? Is there any information about the efficiency of the different cursors?
Does anyone have any thoughts on what is most efficient?
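In other words, would something along these lines be faster? (Again, path and field names are just placeholders.)

```python
import arcpy
import pandas as pd

points_fc = r"C:\data\roads.gdb\cross_section_points"              # placeholder path
fields = ["SECTION_ID", "LAYER_TYPE", "SHAPE@X", "SHAPE@Y", "SHAPE@Z"]  # placeholder field names

# Read the whole table once...
with arcpy.da.SearchCursor(points_fc, fields) as cur:
    df = pd.DataFrame(list(cur), columns=["section_id", "layer_type", "x", "y", "z"])

# ...then one groupby replaces the ~200,000 individual where_clause queries.
# Width etc. would be computed the same way from x/y.
stats = df.groupby(["section_id", "layer_type"]).agg(
    delta_height=("z", lambda s: s.max() - s.min()),
    n_points=("z", "size"),
)
```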
With the information provided, it is difficult to comment on whether changing your workflow to pandas would be more efficient. Are the feature classes in an enterprise geodatabase, file geodatabase, or something else? If a file or mobile geodatabase, are they stored locally or on a network share? Could you copy the feature classes into memory before processing? Are the fields you are querying indexed appropriately? There are likely more questions I could come up with if I spent a bit more time thinking about it.
Exporting a dataset into memory via pandas could definitely be faster, but not necessarily: there are lots of ways of processing data in pandas, and some are faster than others. If the dataset is very large and won't fit in memory, then having the Python process start paging will cause a performance hit regardless of the specific pandas workflow.
Given there are so many factors involved, the best approach is to do some testing oneself. Is it worth taking the time to test? Probably if you are here asking the question(s).
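If you do test, a minimal timing harness is probably enough. Something like the sketch below, where the path is a placeholder and the two functions stand in for your existing loop and a pandas variant; it also tries the memory-workspace copy I mentioned:

```python
import time
import arcpy

src_fc = r"C:\data\roads.gdb\cross_section_points"   # placeholder path
mem_fc = r"memory\cross_section_points"              # copy into the memory workspace
arcpy.management.CopyFeatures(src_fc, mem_fc)

def run_where_clause_version(fc):
    pass  # your existing per-section SearchCursor loop goes here

def run_pandas_version(fc):
    pass  # the load-once-and-groupby variant goes here

for label, func, fc in [
    ("where_clause, on disk", run_where_clause_version, src_fc),
    ("where_clause, in memory", run_where_clause_version, mem_fc),
    ("pandas groupby", run_pandas_version, src_fc),
]:
    t0 = time.perf_counter()
    func(fc)
    print(f"{label}: {time.perf_counter() - t0:.1f} s")
```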
Thanks for the reply, and yeah, I know it's kind of a hard question with a lot of variables.
Based on your response, it seems like this should be tested when I write the "a bit more serious implementation" code (also known as the next version), since rewriting the functions just for this first test of the analysis takes a bit of time. I will try to implement the other suggestions right away 🙂
So thanks.
I'm also wondering if the statistics you need to grab could be done efficiently in Python using geoprocessing tools like Frequency and Summary Statistics, both of which are in the Statistics toolset of the Analysis toolbox.
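As a rough sketch of what I mean, and assuming the Z values are available as an attribute field (the Add Z Information tool can write them if they are not), a single Summary Statistics call with a case field could cover all sections and layer types in one pass (path and field names below are placeholders):

```python
import arcpy

points_fc = r"C:\data\roads.gdb\cross_section_points"   # placeholder path
out_table = r"C:\data\roads.gdb\section_stats"

# MIN/MAX/COUNT of a Z attribute field per section and layer type;
# delta height is then MAX - MIN per output row.
arcpy.analysis.Statistics(
    points_fc,
    out_table,
    statistics_fields=[["Z", "MIN"], ["Z", "MAX"], ["Z", "COUNT"]],
    case_field=["SECTION_ID", "LAYER_TYPE"],
)
```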
Good question. I thought about it; however, the tools themselves still need to do the where clauses/filtering internally, so the gain depends on whether that is implemented in C or in some other optimized way (parallel, etc.). I could test that as well. I would think there should be some sort of information about the efficiency of the different analyses, or is there a best practice?
You have a challenging road ahead, focused on optimization. I'm guessing you mean the optimization of time, not space. They say space is cheap and time is money! Obviously a lot of variables have to be considered. For example, if you were to run with Python Pandas, you may get time X. But if you were to run with arcpy geoprocessing and have indexed the fields you're grouping on, you may get better time Y. Database A may perform better than database B. A Python script may work faster at a command line and slower in a Notebook.
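For example, if you stay with the where_clause approach, adding an attribute index on the fields in the query is a cheap thing to try (path and field names are placeholders, and the index name is arbitrary):

```python
import arcpy

# Index the fields used in the where_clause so each query can seek instead of scan.
points_fc = r"C:\data\roads.gdb\cross_section_points"
arcpy.management.AddIndex(points_fc, ["SECTION_ID", "LAYER_TYPE"], "section_layer_idx")
```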
All I'm saying is that I am not well-read in the area of GIS algorithm optimization. It might be a good idea to use Google Scholar to look for white papers, published articles, theses, and dissertations that address these concepts. For instance, a quick search just now for "binary search in GIS" pointed me to an article named "CudaGIS: report on the design and realization of a massive data parallel GIS on GPUs." But that's not stuff I read.
Yeah... I'm aware. It's kind of a can of worms. And I think the easiest for me is to just test out a couple of methods and see which runs fastest. I was mainly curious if there was any data on the subject and if there was a definitive solution or best practice 🙂