Run Python Script Tool: PySpark errors and performance

PedroCruzUdjat — Tue, 12 Jan 2021 23:54:27 GMT

Hi,

I'm using the Run Python Script Tool in GeoAnalytics Server, using the ArcGIS API for Python, where I load a CSV file in a BigData Store, with about 80k records.

The goal is to use PySpark to run a few joins between some spark dataframes, and store the result in a pre-existing hosted table in Portal.

As currently there is no way to update an existing hosted table using PySpark, I use ArcGIS API for Python to do the update. For that I do dataframe.rdd.collect() and I loop the rows to insert them in the hosted table. The dataframe.rdd.collect() results in the following error: "Job aborted due to stage failure: Task 56 in stage 15.0 failed 4 times, most recent failure: Lost task 56.3 in stage 15.0 (TID 1386, 10.221.254.134, executor 0): TaskResultLost (result lost from block manager)".

Several questions:

At this point, the PySpark dataframe in question has around 20k rows. What can cause this error, when in another test with a CSV file with 1k records, it runs fine with no errors? 20k rows is a very small set of data and it should be no problem to geoanalytics and PySpark to deal with.

Another previous error that was overcome was related to memory usage, ocorring again in the dataframe.rdd.collect() task: An error while executing the Python script: Job aborted due to stage failure: Total size of serialized results of 45 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB). This was fixed setting the spark.driver.maxResultSize to 8g (4g was still not enough).Why does a PySpark dataframe with 20k rows requires so much memory in collect()?

The time to perfom the joins between the PySpark dataframes, before it crashes in collect(), averages 60 seconds. Again, we are dealing with dataframes with few thousand rows, and I was expecting this distributed approach to run this tasks in much less time. I have a some dread when eventually we start to process larger datasets. Is there a better approach to speed up this pipeline?

Thank you for your help,

Pedro Cruz

topic Run Python Script Tool: PySpark errors and performance in ArcGIS GeoAnalytics Server Questions

Run Python Script Tool: PySpark errors and performance