In this blog post we will walk through the process of setting up a local Apache Spark environment in Microsoft Windows to work with ArcGIS GeoAnalytics Engine. The GeoAnalytics Engine developer documentation provides great instructions for setting up Spark Local Mode; here I will add more detailed commentary to show what the setup looks like in practice on my local Windows machine.
I will start by collecting the relevant files, then set up my Python / PySpark environment, and finally create a script to simplify starting the Spark environment to work with GeoAnalytics Engine.
Let’s start by accessing the GeoAnalytics Engine files so we can use them when we set up our local Spark environment. The process will be a little bit different for the connected and disconnected license types.
If you are working with a GeoAnalytics Engine Connected license, you will download the relevant files via the GeoAnalytics Engine dashboard.
Log into the dashboard with your user credentials for GeoAnalytics Engine. Note that these are different than your ArcGIS Online user credentials.
From the dashboard, click on the Downloads tab.
From the Downloads tab, you will have access to the latest release. There are two downloads; both are *.tar.gz archives, and you will need to extract their contents. You may be able to simply right-click the downloaded file and select “Extract All…” to extract the GeoAnalytics Engine files.
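If “Extract All…” isn’t offered for *.tar.gz archives on your version of Windows, the tar command that ships with Windows 10 and later can extract them from a command prompt. The file name below is only a placeholder for whichever release you downloaded:
tar -xvzf geoanalytics-engine.tar.gz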
For a disconnected license, you will access the files for GeoAnalytics Engine using My Esri. From the My Esri home page, click on the Downloads tab and choose "All Products and Versions". Under "All Other Products" find ArcGIS GeoAnalytics Engine and click the "View Downloads" button. There are four downloads available for the latest release.
In addition to the GeoAnalytics Engine files, your local Spark environment will need Java, Python, Hadoop, and Spark. Since I like to work with Jupyter notebooks, I am going to include that setup in this example as well.
In this section, I’ll list the files you will need, with notes on where I procured the ones used in my example. In the following section we will set up our environment using these files.
In this section we’ll discuss the last steps to setting up your environment and then creating a startup script to get everything running.
Once all of the files you need are saved on your machine, there are a few last steps to take.
The first is to install PySpark within your Python environment. I have opted to use a virtual environment for my GeoAnalytics work, so I created one and installed PySpark there. The PySpark documentation also includes details on creating a new conda environment if you need assistance setting one up.
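As a rough sketch, creating the environment and installing PySpark from a command prompt might look like the commands below (the environment name matches the one I use later in this post, and you should pin the PySpark version to match the Spark release you download):
rem create and activate a conda environment, then install PySpark into it
conda create -n venv_geoanalytics_311 python=3.11
conda activate venv_geoanalytics_311
pip install pyspark==4.0.1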
Additionally, I installed Jupyter Lab and matplotlib using conda. To access the virtual environment in Jupyter Lab I also installed ipykernel and registered the environment as a Jupyter kernel like this:
python -m ipykernel install --user --name=my_virtual_env_name --display-name "Python (My New Venv)"
There are several environment variables that you will need to set so that Java, Spark, and Hadoop can be found when starting up your Spark environment. The GeoAnalytics Engine developer documentation has details on setting these individually from a command prompt. You can also do this with the Microsoft Windows GUI for editing system environment variables (via the Control Panel).
Because I occasionally switch between Spark versions, and don’t want all of my settings permanently in my environment, I have opted to set most of my environment variables (those related to Java, Spark, and Hadoop) as part of a startup script. However, there are two variables that I do set persistently: PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON. I set these permanently since the Python interpreter and driver will be the same regardless of which version of Spark I’m using, and because I want to use Jupyter notebooks to interact with GeoAnalytics Engine.
I set those as user variables, pointing specifically to the python.exe and jupyter-lab.exe in my virtual environment for GeoAnalytics Engine (venv_geoanalytics_311).
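If you prefer a command prompt over the Control Panel GUI, setx can set the same two user variables persistently. The paths below are only illustrative; point them at the python.exe and jupyter-lab.exe inside your own virtual environment:
rem illustrative paths only; substitute the location of your own virtual environment
setx PYSPARK_PYTHON "C:\Users\<you>\anaconda3\envs\venv_geoanalytics_311\python.exe"
setx PYSPARK_DRIVER_PYTHON "C:\Users\<you>\anaconda3\envs\venv_geoanalytics_311\Scripts\jupyter-lab.exe"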
Now that I have my files staged on my machine, and a few environment variables set up, I can start my Spark environment.
My preference is to automate the process instead of trying to remember all of the commands required to start Spark and load GeoAnalytics Engine, so I wrote a batch file (.bat) to do all of this for me.
In my case, it looks like this – so let’s walk through what this all does:
To start, I set all of my environment variables so that they point to the right version of Spark, Java, and Hadoop. By setting them in my batch script they are only set for this session. This makes it so that I can quickly switch between Spark versions if needed (I have a second batch script that uses Spark 3.x, so I can spin that environment up whenever I need it):
:: set environment variables and path for Spark 4
set SPARK_HOME=C:\Spark\spark-4.0.1-bin-hadoop3
set JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-17.0.16.8-hotspot
set HADOOP_HOME=C:\Hadoop
set PATH=%SPARK_HOME%\bin;%HADOOP_HOME%\bin;%JAVA_HOME%\bin;%PATH%
Next, I activate my virtual environment with PySpark installed. In this example my virtual environment is named venv_geoanalytics_311 (because it uses Python 3.11).
:: activate virtual environment, start up Spark
call activate venv_geoanalytics_311 &&^
Finally, we’ll set up the portion of the script that starts Spark with GeoAnalytics Engine enabled. This part of the script does a few things: it passes the GeoAnalytics Engine jars and Python files to PySpark, enables the GeoAnalytics plugin, configures the Kryo serializer and registrator, and sets the driver memory:
pyspark --jars C:\engine\geoanalytics_2.13-1.7.0.jar,C:\engine\geoanalytics-natives_2.13-1.7.0.jar ^
--py-files C:\engine\geoanalytics-1.7.0.zip ^
--conf spark.plugins=com.esri.geoanalytics.Plugin ^
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer ^
--conf spark.kryo.registrator=com.esri.geoanalytics.KryoRegistrator ^
--conf spark.driver.memory=5g
Note: If you are working with a Disconnected license for GeoAnalytics Engine, you will want to add an additional configuration setting to point to the license file. This would look something like this, but with the direct path to your license file specified:
--conf spark.geoanalytics.auth.license.file=license_file.ecp
Now that this is all put together, I can just run my batch file from the command prompt. When I run the script it should start up Spark, load the GeoAnalytics Engine jars, and open Jupyter Lab (because the PYSPARK_DRIVER_PYTHON environment variable points to jupyter-lab.exe).
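For example, if the script were saved as start_spark4_geoanalytics.bat (a hypothetical name; use whatever you named your file), starting everything up is a single command:
start_spark4_geoanalytics.bat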
Then I can confirm that everything works by trying some simple code in my Jupyter notebook:
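As a minimal sketch of what that might look like: the import and auth call below follow the pattern from the GeoAnalytics Engine documentation for a Connected license (swap in your own credentials, or skip the auth call if your license file is configured in the startup script), and spark is the session PySpark creates on startup:
import geoanalytics
from geoanalytics.sql import functions as ST

# authorize GeoAnalytics Engine (Connected license); replace with your own credentials
geoanalytics.auth(username="my_user", password="my_password")

# build a small DataFrame of coordinates and turn them into point geometries
df = spark.createDataFrame([(-117.19, 34.05), (-122.33, 47.61)], ["x", "y"])
df.select(ST.point("x", "y").alias("geometry")).show(truncate=False)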
I hope that this walkthrough of setting up and starting a local Spark environment has been useful! I look forward to hearing about the great analytics at scale that you are able to perform with GeoAnalytics Engine.