In this blog post we will walk through the process of setting up a local Apache Spark environment in Microsoft Windows to work with ArcGIS GeoAnalytics Engine. The GeoAnalytics Engine developer documentation provides great instructions for setting up Spark Local Mode; here I will add more detailed commentary to show what the setup looks like in practice on my local Windows machine.
I will start by collecting the relevant files, then set up my Python / PySpark environment, and finally create a script to simplify starting the Spark environment to work with GeoAnalytics Engine.
Let’s start by accessing the GeoAnalytics Engine files so we can use them when we set up our local Spark environment. The process will be a little bit different for the connected and disconnected license types.
If you are working with a GeoAnalytics Engine Connected license, you will download the relevant files via the GeoAnalytics Engine dashboard.
Log into the dashboard with your user credentials for GeoAnalytics Engine. Note that these are different than your ArcGIS Online user credentials.
From the dashboard, click on the Downloads tab.
From the Downloads tab, you will have access to the latest release. There are two downloads; both are *.tar.gz archives, and you will need to extract their contents. You may be able to simply right-click the downloaded file and select “Extract All…” to extract the GeoAnalytics Engine files.
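If “Extract All…” isn’t offered for *.tar.gz archives on your version of Windows, the tar command that ships with Windows 10 and later can extract them from a command prompt. The file name below is only a placeholder for whichever release you downloaded:
tar -xvzf geoanalytics-engine.tar.gz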
For a disconnected license, you will access the files for GeoAnalytics Engine using My Esri. From the My Esri home page, click on the Downloads tab and choose "All Products and Versions". Under "All Other Products" find ArcGIS GeoAnalytics Engine and click the "View Downloads" button. There are four downloads available for the latest release.
In addition to the GeoAnalytics Engine files, your local Spark environment will need Java, Python, Hadoop, and Spark. Since I like to work with Jupyter notebooks, I am going to include that setup in this example as well.
In this section, I’ll list the files you will need, with notes on where I procured the ones used in my example. In the following section we will set up our environment using these files.
In this section we’ll discuss the last steps to setting up your environment and then creating a startup script to get everything running.
Once all of the files you need are saved on your machine, there are a few last steps to take.
The first is to install PySpark within your Python environment. I have opted to use a virtual environment for my GeoAnalytics work, so I created one and installed PySpark there. The PySpark documentation also includes details on creating a new conda environment if you need assistance setting one up.
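As a rough sketch, creating the environment and installing PySpark from a command prompt might look like the commands below (the environment name matches the one I use later in this post, and you should pin the PySpark version to match the Spark release you download):
rem create and activate a conda environment, then install PySpark into it
conda create -n venv_geoanalytics_311 python=3.11
conda activate venv_geoanalytics_311
pip install pyspark==4.0.1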
Additionally, I installed Jupyter Lab and matplotlib using conda. To access the virtual environment in Jupyter Lab I also installed ipykernel and registered the environment as a Jupyter kernel like this:
python -m ipykernel install --user --name=my_virtual_env_name --display-name "Python (My New Venv)"
There are several environment variables that you will need to set so that Java, Spark, and Hadoop can be found when starting up your Spark environment. The GeoAnalytics Engine developer documentation has details on setting these individually from a command prompt. You can also do this with the Microsoft Windows GUI for editing system environment variables (via the Control Panel).
Because I occasionally switch between Spark versions, and don’t want all of my settings permanently in my environment, I have opted to set most of my environment variables (those related to Java, Spark, and Hadoop) as part of a startup script. However, there are two variables that I do set persistently: PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON. I set these permanently since the Python interpreter and driver will be the same regardless of which version of Spark I’m using, and because I want to use Jupyter notebooks to interact with GeoAnalytics Engine.
I set those as user variables, pointing specifically to the python.exe and jupyter-lab.exe in my virtual environment for GeoAnalytics Engine (venv_geoanalytics_311).
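If you prefer a command prompt over the Control Panel GUI, setx can set the same two user variables persistently. The paths below are only illustrative; point them at the python.exe and jupyter-lab.exe inside your own virtual environment:
rem illustrative paths only; substitute the location of your own virtual environment
setx PYSPARK_PYTHON "C:\Users\<you>\anaconda3\envs\venv_geoanalytics_311\python.exe"
setx PYSPARK_DRIVER_PYTHON "C:\Users\<you>\anaconda3\envs\venv_geoanalytics_311\Scripts\jupyter-lab.exe"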
Now that I have my files staged on my machine, and a few environment variables set up, I can start my Spark environment.
My preference is to automate the process instead of trying to remember all of the commands required to start Spark and load GeoAnalytics Engine, so I wrote a batch file (.bat) to do all of this for me.
In my case, it looks like this – so let’s walk through what this all does:
To start, I set all of my environment variables so that they point to the right version of Spark, Java, and Hadoop. By setting them in my batch script they are only set for this session. This makes it so that I can quickly switch between Spark versions if needed (I have a second batch script that uses Spark 3.x, so I can spin that environment up whenever I need it):
:: set environment variables and path for Spark 4
set SPARK_HOME=C:\Spark\spark-4.0.1-bin-hadoop3
set JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-17.0.16.8-hotspot
set HADOOP_HOME=C:\Hadoop
set PATH=%SPARK_HOME%\bin;%HADOOP_HOME%\bin;%JAVA_HOME%\bin;%PATH%
Next, I activate my virtual environment with PySpark installed. In this example my virtual environment is named venv_geoanalytics_311 (because it uses Python 3.11).
:: activate virtual environment, start up Spark
call activate venv_geoanalytics_311 &&^
Finally, we’ll set up the portion of the script that starts Spark with GeoAnalytics Engine enabled. This part of the script does a few things: it passes the GeoAnalytics Engine jars and Python files to PySpark, enables the GeoAnalytics plugin, configures the Kryo serializer and registrator, and sets the driver memory:
pyspark --jars C:\engine\geoanalytics_2.13-1.7.0.jar,C:\engine\geoanalytics-natives_2.13-1.7.0.jar ^
--py-files C:\engine\geoanalytics-1.7.0.zip ^
--conf spark.plugins=com.esri.geoanalytics.Plugin ^
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer ^
--conf spark.kryo.registrator=com.esri.geoanalytics.KryoRegistrator ^
--conf spark.driver.memory=5g
Note: If you are working with a Disconnected license for GeoAnalytics Engine, you will want to add an additional configuration setting to point to the license file. This would look something like this, but with the direct path to your license file specified:
--conf spark.geoanalytics.auth.license.file=license_file.ecp
Now that this is all put together, I can just run my batch file from the command prompt. When I run the script it should start up Spark, load the GeoAnalytics Engine jars, and open Jupyter Lab (because the PYSPARK_DRIVER_PYTHON environment variable points to jupyter-lab.exe).
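For example, if the script were saved as start_spark4_geoanalytics.bat (a hypothetical name; use whatever you named your file), starting everything up is a single command:
start_spark4_geoanalytics.bat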
Then I can confirm that everything works by trying some simple code in my Jupyter notebook:
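As a minimal sketch of what that might look like: the import and auth call below follow the pattern from the GeoAnalytics Engine documentation for a Connected license (swap in your own credentials, or skip the auth call if your license file is configured in the startup script), and spark is the session PySpark creates on startup:
import geoanalytics
from geoanalytics.sql import functions as ST

# authorize GeoAnalytics Engine (Connected license); replace with your own credentials
geoanalytics.auth(username="my_user", password="my_password")

# build a small DataFrame of coordinates and turn them into point geometries
df = spark.createDataFrame([(-117.19, 34.05), (-122.33, 47.61)], ["x", "y"])
df.select(ST.point("x", "y").alias("geometry")).show(truncate=False)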
I hope that this walkthrough of setting up and starting a local Spark environment has been useful! I look forward to hearing about the great analytics at scale that you are able to perform with GeoAnalytics Engine.