In this post, we are going to walk through setting up and getting started with ArcGIS GeoAnalytics Engine using Amazon EMR. We will cover the basics of configuring the workspace with the GeoAnalytics Engine library, creating your environment, and offer a few examples to test out GeoAnalytics Engine and its functionality. While the general process is documented in the GeoAnalytics developer documentation, we’ll dig into more detail in this post. Note that this post is from August 2025 and user interfaces may change over time.
Before you can set up the EMR environment, you will need to set up your GeoAnalytics Engine account and access the files for the library. If you are still setting up your GeoAnalytics Engine account, you can refer to this additional blog post for more information about account creation and setup.
Once you have your account credentials set up, you can access the GeoAnalytics Engine files and set up your environment in EMR.
Let’s start by accessing the GeoAnalytics Engine files so we can bring them into the EMR environment. The process will be a little bit different for the connected and disconnected license types.
If you are working with a GeoAnalytics Engine Connected license, you will download the relevant files via the GeoAnalytics Engine dashboard.
Log into the dashboard with your user credentials for GeoAnalytics Engine. Note that these are different than your ArcGIS Online user credentials.
From the dashboard, click on the Downloads tab.
From the Downloads tab, you will have access to the latest releases. There are two files, both *.tar.gz archives, and you will need to extract their contents before uploading them.
For a disconnected license, you will access the files for GeoAnalytics Engine using MyEsri. From the MyEsri home page, click on the Downloads tab and choose "All Products and Versions". Under "All Other Products" find ArcGIS GeoAnalytics Engine and click the "View Downloads" button. There are four downloads available for the latest release.
Now let’s set up an environment to use GeoAnalytics Engine in AWS EMR.
First, you’ll need to sign into the AWS Management Console.
Next, you will need to Create a bucket in S3 or choose an existing one to stage your setup files. In this example we will be saving our GeoAnalytics Engine files in the Install_GeoAnalytics folder in our storage bucket. The files that you upload will be the extracted files (not the original *.tar.gz file) that you obtained from the GeoAnalytics Engine dashboard or MyEsri, depending on whether you are using a Connected or Disconnected version of the product.
Upload the jar and whl files to your S3 bucket. Depending on the analysis you plan to run, optionally upload the geoanalytics-natives jar (used for geocoding and network analysis) and the projection engine jars as well.
Note that including the geoanalytics-natives jar with EMR 6.x is not supported in GeoAnalytics Engine 1.5 or later versions.
Copy and paste the text below into a text editor, then update the script to point to your bucket and file names.
#!/bin/bash
# Update these values to match your S3 bucket and the file names you uploaded
BUCKET_PATH=s3://testbucket
WHEEL_NAME=geoanalytics-x.x.x-py3-none-any.whl
JAR_NAME=geoanalytics_2.12-x.x.x.jar
# Copy the GeoAnalytics Engine jar into the Spark jars directory and stage the wheel
sudo mkdir -p /home/geoanalytics/
sudo aws s3 cp $BUCKET_PATH/$JAR_NAME /usr/lib/spark/jars/
sudo aws s3 cp $BUCKET_PATH/$WHEEL_NAME /home/geoanalytics/$WHEEL_NAME
# Install the GeoAnalytics Engine Python package
sudo python3 -m pip install -U pip
sudo python3 -m pip install /home/geoanalytics/$WHEEL_NAME
You can create the script locally on your computer and upload it, or use a built-in editor such as AWS CloudShell. The script above is an example startup script for GeoAnalytics Engine 1.5.0; substitute the version numbers in the file names with those of the release you downloaded.
If you are using the supplemental projection data jars, you will also need to add a copy command for each projection data jar that you need. For instance:
sudo aws s3 cp $BUCKET_PATH/<esri-projection-name>.jar /usr/lib/spark/jars/
Note: To use ST_H3Bin or ST_H3Bins, you must copy the H3 jar to /usr/lib/spark/jars/ on your cluster using the setup script, as in the examples shown above. Additionally, you may need to pip install Python modules depending on the functionality you need; for example, for plotting you will need matplotlib.
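Alternatively, rather than adding pip install lines to the bootstrap script, Amazon EMR notebooks support notebook-scoped libraries that you can install at runtime from a notebook attached to the cluster. A minimal sketch, assuming your EMR release supports notebook-scoped libraries (sc is the SparkContext provided by the notebook session):
# Install a notebook-scoped Python package (for example, matplotlib for plotting)
sc.install_pypi_package("matplotlib")
# List the packages available to this notebook session
sc.list_packages()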
At this point, we are done with our S3 bucket. It has all of the content that we need – the GeoAnalytics Engine files as well as the startup script. So we can move on to creating our Spark pool.
In the AWS console, search for Amazon EMR
In the menu for your EMR environment, go to Clusters. Note that EMR Serverless is not supported.
Here is where you will clone or create a new cluster.
When you create a new cluster, you need to name it, pick your release version, and pick your application bundle settings.
Under Name and applications, for Amazon EMR release choose any supported EMR release. See About Amazon EMR Releases for details on release components, and make sure you are setting up an environment that uses a supported runtime. The list of supported runtimes is available here in the developer documentation.
For Application bundle select "Custom" and ensure that at least Spark is checked (it is required for GeoAnalytics Engine); if you plan to attach a notebook to the cluster as shown later in this post, also check Livy and JupyterEnterpriseGateway.
If you are going to use AWS Glue you will also want to check “Use for Spark table metadata.” You can find more information about working with AWS Glue and GeoAnalytics Engine here.
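As a quick illustration, once Glue is set as the Spark table metastore, tables registered in the AWS Glue Data Catalog can be queried directly with Spark SQL from a notebook attached to the cluster. The database and table names below are placeholders:
# Query a table registered in the AWS Glue Data Catalog (placeholder names)
glue_df = spark.sql("SELECT * FROM my_glue_database.my_glue_table")
glue_df.show(5)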
Under Cluster configuration, choose your preferred configuration.
This is determined by your organizational needs. While any instance type will work, we recommend using memory-optimized instances that have at least 8GB of memory per core for best performance.
Configure the Networking for your cluster.
Under Bootstrap Actions - optional, click "Add". For Script location, specify the path to the .sh script you created and uploaded to your S3 bucket earlier; this bootstrap action installs the jar and wheel files on each node when the cluster starts. Then click Add bootstrap action.
Under Software settings - optional select "Enter configuration" and copy and paste the text below into the text box. This sets some required configuration settings for the Spark environment.
[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.plugins": "com.esri.geoanalytics.Plugin",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.kryo.registrator": "com.esri.geoanalytics.KryoRegistrator"
    }
  }
]
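Once the cluster is running and you have a notebook attached (covered later in this post), you can confirm that these settings were applied with a quick check like the one below:
# Confirm the GeoAnalytics plugin and Kryo settings were picked up by Spark
print(spark.sparkContext.getConf().get("spark.plugins"))           # expected: com.esri.geoanalytics.Plugin
print(spark.sparkContext.getConf().get("spark.kryo.registrator"))  # expected: com.esri.geoanalytics.KryoRegistrator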
Set the Identity and Access Management (IAM) roles for your cluster.
If your organization requires it, set up the security configuration and EC2 key pair, and then select your service role. In this example we are using EMR_DefaultRole.
Accept the defaults for all other parameters in the previous steps or change them based on your needs.
Now you can click Create cluster. If cluster creation fails, check the cluster logs to diagnose the issue.
We can check that everything worked by running a quick script. To verify that your environment is ready to use:
Go to Notebooks and Git repos
Go to workspaces – create or use an existing workspace
Open a notebook so we can test that GeoAnalytics Engine is properly loaded and ready to use. Make sure your notebook is attached to the cluster created in the previous steps. Now you can try running this code to list the user functions available:
from geoanalytics.sql import functions as ST
spark.sql("show user functions like 'ST_*'").show()
We can also test reading in a feature service and plotting the results like this:
# read a feature service hosted in the Living Atlas of the World
myFS = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/USA_States_Generalized_Boundaries/FeatureServer/0"
df = spark.read.format('feature-service').load(myFS)
# plot a DataFrame with geometry from a feature service
df.st.plot(basemap="light", facecolor="yellow", edgecolor="black", alpha=0.5)
%matplot plt
We can also load data stored in an S3 bucket directly using PySpark without configuring Spark. Here is an example of loading a table from a file geodatabase:
path = "s3://<bucket_name>/<file_path>/<file_name>/"
df = spark.read.format("filegdb").options(gdbPath=path, gdbName="<table_name>").load()
When AWS credentials are required to access the S3 bucket, you can use the following code snippet to provide them:
# Configure the Spark session with AWS credentials for S3 access
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("awsAnalysis").getOrCreate()
spark._sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId', 'AWS_ACCESS_KEY')
spark._sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey', 'AWS_SECRET_KEY')
spark._sc._jsc.hadoopConfiguration().set('fs.s3a.impl', "org.apache.hadoop.fs.s3a.S3AFileSystem")
# Load the shapefile on Amazon S3 into a PySpark DataFrame
df = spark.read.format("shapefile").load("s3a://<bucket_name>/<file_path>/<file_name>")
In this blog post we walked through setting up GeoAnalytics Engine in an AWS EMR environment. The screen captures and content reflect the current process as of July 2025. We hope that this walkthrough on setting up GeoAnalytics Engine for use in AWS EMR has been helpful! If you have questions or comments, please feel free to leave a comment in the section below.