I'm using the Run Python Script Tool in GeoAnalytics Server, using the ArcGIS API for Python, where I load a CSV file in a BigData Store, with about 80k records.
The goal is to use PySpark to run a few joins between some spark dataframes, and store the result in a pre-existing hosted table in Portal.
As currently there is no way to update an existing hosted table using PySpark, I use ArcGIS API for Python to do the update. For that I do dataframe.rdd.collect() and I loop the rows to insert them in the hosted table. The dataframe.rdd.collect() results in the following error: "Job aborted due to stage failure: Task 56 in stage 15.0 failed 4 times, most recent failure: Lost task 56.3 in stage 15.0 (TID 1386, 10.221.254.134, executor 0): TaskResultLost (result lost from block manager)".
- At this point, the PySpark dataframe in question has around 20k rows. What can cause this error, when in another test with a CSV file with 1k records, it runs fine with no errors? 20k rows is a very small set of data and it should be no problem to geoanalytics and PySpark to deal with.
- Another previous error that was overcome was related to memory usage, ocorring again in the dataframe.rdd.collect() task: An error while executing the Python script: Job aborted due to stage failure: Total size of serialized results of 45 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB). This was fixed setting the spark.driver.maxResultSize to 8g (4g was still not enough).Why does a PySpark dataframe with 20k rows requires so much memory in collect()?
- The time to perfom the joins between the PySpark dataframes, before it crashes in collect(), averages 60 seconds. Again, we are dealing with dataframes with few thousand rows, and I was expecting this distributed approach to run this tasks in much less time. I have a some dread when eventually we start to process larger datasets. Is there a better approach to speed up this pipeline?
Thank you for your help,