
Setting up a Linux-based local Spark environment for GeoAnalytics Engine

SBattersby, Esri Contributor

Co-authored with Randall Whitman, Lawrence Khadka, and Ashwin Shashidharan from the GeoAnalytics software development team

 

In this blog post we will walk through the process of setting up a local Apache Spark environment to work with ArcGIS GeoAnalytics Engine on a machine running GNU/Linux (henceforth “Linux”). While the GeoAnalytics Engine developer documentation provides great instructions for setting up Spark Local Mode, we will walk through the process with more detailed commentary to document how it looks in practice on a local Linux machine. This article was prepared using Ubuntu Linux with bash, xterm, Firefox, and Unity, but the process should be similar, with only minor differences, when using a different shell, terminal emulator, browser, desktop environment, or Linux distribution, or even another Unix-like operating system.

For information about setting up a local Spark environment for working with GeoAnalytics Engine on Windows, we have additional information in the GeoAnalytics Engine Community.

As we walk through the process for setting up an environment with Ubuntu Linux, we will start by collecting the relevant files, setting up a Python / PySpark environment, and finally creating a script to simplify the process of starting the Spark environment to work with GeoAnalytics Engine.

Collecting the relevant files

GeoAnalytics files

Start by accessing the GeoAnalytics Engine files so we can use them when we set up our local Spark environment. The process will be a little different for the Connected and Disconnected license types. We documented the process of obtaining the files in an earlier blog post focused on the setup of a local Spark environment on Microsoft Windows, and the process of obtaining the files is the same for Linux. Please refer to the earlier post on Setting up a local Spark environment (Windows) for more information on downloading the GeoAnalytics Engine files for Connected and Disconnected license types.

There will be two files, both .tar.gz archives, and you will need to extract their contents. You may be able to simply right-click the downloaded file and select “Extract [to …]” to extract the GeoAnalytics Engine files.
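Alternatively, the extraction can be done from a terminal with something like the following (the archive name here is only an example; use the name of the file you downloaded):

tar -xzf geoanalytics-engine.tar.gz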

[Screenshot: extracting the GeoAnalytics Engine files on Ubuntu]

The extracted folders will contain the following contents:

  • API – the core files for using GeoAnalytics Engine. Details on what is included can be found in the installation documentation.
  • Projection Engine Data (optional) – a set of data for the Projection Engine. If you are working with map projections or coordinate systems that aren’t included in the basic set in GeoAnalytics Engine, you will want these additional files to facilitate coordinate system transformations. More information about the projection engine can be found in the help documentation on coordinate systems and transformations.
  • Software Authorization Wizard (SAW) – for Disconnected licenses, you will also have access to the SAW.

Spark environment-related files

In addition to the GeoAnalytics Engine files, your local Spark environment will need Java, Python, and Spark files. We are also going to include files for Jupyter notebooks in this example.

In this section, we list the files that you will need including notes on where to obtain the files that we are using in this example. In the following section we will set up our environment using these files.

  • Java – You will need a version of Java that will work with your GeoAnalytics Engine version. A list of dependencies by GeoAnalytics version can be found here. We are working with GeoAnalytics Engine 1.7 and will be setting up a Spark 4-based environment, so we need to use Java 17.

    OpenJDK is generally available from the package manager in a Linux distro, and can be installed via either a GUI such as Synaptic or a command such as:
sudo apt-get install openjdk-17-jre

The process will look something like this:

[Screenshots of the Synaptic workflow: search for openjdk 17; mark openjdk 17 for installation; apply all marked changes; confirm applying changes]
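Once installed, you can confirm from a terminal that the expected Java version is available:

java -version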

  • Python – Again, you will need a version of Python that will work with your GeoAnalytics Engine version. A list of dependencies can be found here. We are working with GeoAnalytics Engine 1.7 with Spark 4 and PySpark 4, so we will use Python 3.9 or later based on the combined dependencies.

    The Linux distro generally has Python already installed; any reasonably recent distro will have a new enough Python 3. We also need an additional Python library, matplotlib, which is available via the package manager. Python and the library can be installed via GUI or from the command line, for example:
sudo apt-get install python3-minimal python3-matplotlib
sudo apt-get install jupyter-notebook

All of the packages listed in these three sections can be installed in a single apt-get command or Synaptic interaction.
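For example, all four of the packages used in this post could be installed at once:

sudo apt-get install openjdk-17-jre python3-minimal python3-matplotlib jupyter-notebook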

  • Spark – As of GeoAnalytics Engine 1.7, Spark 4 is supported. With 1.7 you have a choice of working with Spark 3 or 4; we will set up an environment with Spark 4.

You can download Apache Spark here. Make sure that the release you select supports the versions of Java and Python that you are using. In this example, we are using Spark 4.0.1 with the package pre-built for Apache Hadoop 3.4 and later. Note that the versions of Spark available through the link provided will change based on the current releases; you may only find more recent versions of Spark at this site. Check the GeoAnalytics Engine documentation for the latest information on the supported Spark versions to select from.

For simplicity, we saved the Spark files in $HOME/Spark40, but if multiple people use the same workstation, you might rather place Spark under something like /usr/local/ or /opt/.
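If you prefer the terminal to the browser, the download and staging steps might look something like the following (the URL and file name depend on the release you select from the Apache Spark downloads page):

wget https://dlcdn.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz
tar -xzf spark-4.0.1-bin-hadoop3.tgz
mv spark-4.0.1-bin-hadoop3 $HOME/Spark40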

Setup & Startup

In this section we’ll cover the last steps for setting up your environment and then create a startup script to get everything running.

Once all of the files you need are saved on your machine, there are a few last steps to take.

Set up Python environment

Install PySpark within your Python environment if it is not already included.
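If PySpark is not already available, a pip install matching your Spark version is one option (the version shown is an example; match it to your Spark download):

pip install pyspark==4.0.1

Note that the Spark download itself bundles PySpark, so if you launch pyspark from ${SPARK_HOME}/bin as we do below, a separate install may not be needed.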

Setting environment variables

There are several environment variables that you will need to set so that Java, Spark, and the GeoAnalytics Engine files can be found when starting up your Spark environment. The GeoAnalytics Engine developer documentation has details on setting these up individually from a terminal. If you would like the same variables to be set in every shell you run, you can update your .profile file with the environment variables discussed below (with the text editor of your choice).
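For example, lines like these could be appended to ~/.profile (the paths shown are the ones used later in this post; adjust them to match your installation):

export JAVA_HOME=/usr/lib/jvm/java-1.17.0-openjdk-amd64
export SPARK_HOME=${HOME}/Spark40
export PATH=${SPARK_HOME}/bin:${JAVA_HOME}/bin:${PATH}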

Because we occasionally switch between Spark versions, and don’t want all of the settings permanently in the PATH, we have opted to set most of the environment variables (related to Java and Spark) as part of a startup script.

Starting your Spark environment

Now that we have our files staged on the machine, and a few environment variables set up, we can start the Spark environment.

We prefer to automate the process instead of trying to remember all of the commands required to start Spark and load GeoAnalytics Engine, so we wrote a shell script (.sh) to do all of this.

In our case, it looks like this – so let’s walk through what this all does:

# geoax-spark4.sh
# set environment variables and path for Spark 4
export SPARK_HOME=${HOME}/Spark40
export JAVA_HOME=/usr/lib/jvm/java-1.17.0-openjdk-amd64
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON_OPTS='-m jupyter notebook'
export PATH=${SPARK_HOME}/bin:${JAVA_HOME}/bin:${PATH}
export ENGINE_HOME=${HOME}/Engine17
pyspark --jars ${ENGINE_HOME}/geoanalytics_2.13-1.7.0.jar,${ENGINE_HOME}/geoanalytics-natives_2.13-1.7.0.jar \
--py-files ${ENGINE_HOME}/geoanalytics-1.7.0.zip \
--conf spark.plugins=com.esri.geoanalytics.Plugin \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=com.esri.geoanalytics.KryoRegistrator \
--conf spark.driver.memory=5g \
--conf spark.geoanalytics.auth.license.file=license_file.ecp

To start, we set all of the environment variables so that they point to the right versions of Spark and Java. By setting them in a shell script, they are only set for this session, which makes it so that we can quickly switch between Spark versions if needed:

# set environment variables and path for Spark 4
export SPARK_HOME=${HOME}/Spark40
export JAVA_HOME=/usr/lib/jvm/java-1.17.0-openjdk-amd64
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON_OPTS='-m jupyter notebook'
export PATH=${SPARK_HOME}/bin:${JAVA_HOME}/bin:${PATH}
export ENGINE_HOME=${HOME}/Engine17

 

Finally, we’ll set up the portion of the script that starts up Spark with GeoAnalytics Engine enabled. In this section we do a few things:

  • --jars: Point to the GeoAnalytics Engine jars.
  • --py-files: Point to the GeoAnalytics .zip file with the Python files.
  • --conf: Set all of the configuration settings that you want/need for the Spark environment. In this example we have a very basic set of the required configurations, but you may opt to include other specific configuration settings based on your needs.
pyspark --jars ${ENGINE_HOME}/geoanalytics_2.13-1.7.0.jar,${ENGINE_HOME}/geoanalytics-natives_2.13-1.7.0.jar \
--py-files ${ENGINE_HOME}/geoanalytics-1.7.0.zip \
--conf spark.plugins=com.esri.geoanalytics.Plugin \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=com.esri.geoanalytics.KryoRegistrator \
--conf spark.driver.memory=5g

 

Note: If you are working with a Disconnected license for GeoAnalytics Engine, you will want to add an additional configuration setting to point to the license file. It will look something like the line below, but with the full path to your license file specified:

--conf spark.geoanalytics.auth.license.file=license_file.ecp

Now that this is all put together, we can just run the script from the shell/terminal. When we run the script it should start up Spark, load the GeoAnalytics Engine jars, and open a Jupyter notebook (based on the PYSPARK_DRIVER_PYTHON_OPTS environment variable).
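For example, assuming the script was saved as geoax-spark4.sh in the current directory:

chmod +x geoax-spark4.sh
./geoax-spark4.sh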

Then we can confirm that everything works by trying some simple code in a Jupyter notebook:

import geoanalytics
geoanalytics.version()

The notebook output shows a GeoAnalytics Engine version of 1.7.0.213.3871.

We can also create a basic DataFrame and plot a map to confirm that the functionality is all working as expected:
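A minimal sketch of that check might look like the following in the notebook, assuming the ST.point function and the st.plot accessor from the GeoAnalytics Engine API (the column names and coordinates here are arbitrary examples):

from geoanalytics.sql import functions as ST

# create a small DataFrame of longitude/latitude pairs
df = spark.createDataFrame(
    [(-122.68, 45.52), (-111.89, 40.76), (-73.97, 40.78)],
    ["longitude", "latitude"]
)

# add a point geometry column in WGS84 (SRID 4326), then plot it
df = df.withColumn("point", ST.point("longitude", "latitude", 4326))
df.st.plot()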

[Screenshot: a map plotted from a DataFrame with GeoAnalytics Engine on Ubuntu]

 

Conclusion

We hope that this walkthrough of the process to set up and start a local Spark environment has been useful! We look forward to hearing about the great analytics at scale that you are able to perform with GeoAnalytics Engine.
