<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Run Python Script Tool: PySpark errors and performance in ArcGIS GeoAnalytics Server Questions</title>
    <link>https://community.esri.com/t5/arcgis-geoanalytics-server-questions/run-python-script-tool-pyspark-errors-and/m-p/1016209#M182</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm using the Run Python Script tool in GeoAnalytics Server, through the ArcGIS API for Python, where I load a CSV file in a BigData Store, with about 80k records.&lt;/P&gt;&lt;P&gt;The goal is to use PySpark to run a few joins between Spark dataframes and store the result in a pre-existing hosted table in Portal.&lt;/P&gt;&lt;P&gt;Since there is currently no way to update an existing hosted table using PySpark, I use the ArcGIS API for Python to do the update. To do that, I call dataframe.rdd.collect() and loop over the rows to insert them into the hosted table. The dataframe.rdd.collect() call fails with the following error: "Job aborted due to stage failure: Task 56 in stage 15.0 failed 4 times, most recent failure: Lost task 56.3 in stage 15.0 (TID 1386, 10.221.254.134, executor 0): TaskResultLost (result lost from block manager)".&lt;/P&gt;&lt;P&gt;Several questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;At this point, the PySpark dataframe in question has around 20k rows. What can cause this error, given that another test with a 1k-record CSV file runs fine with no errors? 20k rows is a very small dataset, and GeoAnalytics and PySpark should have no problem handling it.&lt;/LI&gt;&lt;LI&gt;An earlier error, which I have since overcome, was related to memory usage and also occurred in the dataframe.rdd.collect() step: "An error while executing the Python script: Job aborted due to stage failure: Total size of serialized results of 45 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)." This was fixed by setting spark.driver.maxResultSize to 8g (4g was still not enough). Why does a PySpark dataframe with 20k rows require so much memory in collect()?&lt;/LI&gt;&lt;LI&gt;The joins between the PySpark dataframes, before the crash in collect(), take 60 seconds on average. Again, we are dealing with dataframes of a few thousand rows, and I was expecting this distributed approach to run these tasks in much less time. I dread what will happen when we eventually start processing larger datasets. Is there a better approach to speed up this pipeline?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thank you for your help,&lt;/P&gt;&lt;P&gt;Pedro Cruz&lt;/P&gt;</description>
    <pubDate>Tue, 12 Jan 2021 23:54:27 GMT</pubDate>
    <dc:creator>PedroCruzUdjat</dc:creator>
    <dc:date>2021-01-12T23:54:27Z</dc:date>
    <item>
      <title>Run Python Script Tool: PySpark errors and performance</title>
      <link>https://community.esri.com/t5/arcgis-geoanalytics-server-questions/run-python-script-tool-pyspark-errors-and/m-p/1016209#M182</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm using the Run Python Script tool in GeoAnalytics Server, through the ArcGIS API for Python, where I load a CSV file in a BigData Store, with about 80k records.&lt;/P&gt;&lt;P&gt;The goal is to use PySpark to run a few joins between Spark dataframes and store the result in a pre-existing hosted table in Portal.&lt;/P&gt;&lt;P&gt;Since there is currently no way to update an existing hosted table using PySpark, I use the ArcGIS API for Python to do the update. To do that, I call dataframe.rdd.collect() and loop over the rows to insert them into the hosted table. The dataframe.rdd.collect() call fails with the following error: "Job aborted due to stage failure: Task 56 in stage 15.0 failed 4 times, most recent failure: Lost task 56.3 in stage 15.0 (TID 1386, 10.221.254.134, executor 0): TaskResultLost (result lost from block manager)".&lt;/P&gt;&lt;P&gt;Several questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;At this point, the PySpark dataframe in question has around 20k rows. What can cause this error, given that another test with a 1k-record CSV file runs fine with no errors? 20k rows is a very small dataset, and GeoAnalytics and PySpark should have no problem handling it.&lt;/LI&gt;&lt;LI&gt;An earlier error, which I have since overcome, was related to memory usage and also occurred in the dataframe.rdd.collect() step: "An error while executing the Python script: Job aborted due to stage failure: Total size of serialized results of 45 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)." This was fixed by setting spark.driver.maxResultSize to 8g (4g was still not enough). Why does a PySpark dataframe with 20k rows require so much memory in collect()?&lt;/LI&gt;&lt;LI&gt;The joins between the PySpark dataframes, before the crash in collect(), take 60 seconds on average. Again, we are dealing with dataframes of a few thousand rows, and I was expecting this distributed approach to run these tasks in much less time. I dread what will happen when we eventually start processing larger datasets. Is there a better approach to speed up this pipeline?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thank you for your help,&lt;/P&gt;&lt;P&gt;Pedro Cruz&lt;/P&gt;</description>
      <pubDate>Tue, 12 Jan 2021 23:54:27 GMT</pubDate>
      <guid>https://community.esri.com/t5/arcgis-geoanalytics-server-questions/run-python-script-tool-pyspark-errors-and/m-p/1016209#M182</guid>
      <dc:creator>PedroCruzUdjat</dc:creator>
      <dc:date>2021-01-12T23:54:27Z</dc:date>
    </item>
  </channel>
</rss>

