In this post, we are going to walk through setting up and getting started with ArcGIS GeoAnalytics Engine using Amazon EMR. We will cover the basics of configuring the workspace with the GeoAnalytics Engine library, creating your environment, and offer a few examples to test out GeoAnalytics Engine and its functionality. While the general process is documented in the GeoAnalytics developer documentation, we’ll dig into more detail in this post. Note that this post is from August 2025 and user interfaces may change over time.
Before you can set up the EMR environment, you will need to set up your GeoAnalytics Engine account and access the files for the library. If you are still setting up your GeoAnalytics Engine account, you can refer to this additional blog post for more information about account creation and setup.
Once you have your account credentials set up, you can access the GeoAnalytics Engine files and set up your environment in EMR.
Let’s start by accessing the GeoAnalytics Engine files so we can bring them into the EMR environment. The process will be a little bit different for the connected and disconnected license types.
If you are working with a GeoAnalytics Engine Connected license, you will download the relevant files via the GeoAnalytics Engine dashboard.
Log into the dashboard with your user credentials for GeoAnalytics Engine. Note that these are different than your ArcGIS Online user credentials.
From the dashboard, click on the Downloads tab.
From the Downloads tab, you will have access to the latest releases. There are two files, both *.tar.gz archives, and you will need to extract their contents before uploading them.
For a disconnected license, you will access the files for GeoAnalytics Engine using MyEsri. From the MyEsri home page, click on the Downloads tab and choose "All Products and Versions". Under "All Other Products" find ArcGIS GeoAnalytics Engine and click the "View Downloads" button. There are four downloads available for the latest release.
Now let’s set up an environment to use GeoAnalytics Engine in AWS EMR.
First, you’ll need to sign into the AWS Management Console.
Next, you will need to Create a bucket in S3 or choose an existing one to stage your setup files. In this example we will be saving our GeoAnalytics Engine files in the Install_GeoAnalytics folder in our storage bucket. The files that you upload will be the extracted files (not the original *.tar.gz file) that you obtained from the GeoAnalytics Engine dashboard or MyEsri, depending on whether you are using a Connected or Disconnected version of the product.
Upload the jar and whl files to your S3 bucket. Depending on the analysis you plan to run, optionally upload the geoanalytics-natives jar (used for geocoding and network analysis) and the projection engine jars as well.
Note that including the geoanalytics-natives jar with EMR 6.x is not supported in GeoAnalytics Engine 1.5 or later versions.
Copy and paste the text below into a text editor, then update the script to point to your bucket and file names.
#!/bin/bash
# Update these values to match your S3 bucket and the file names you uploaded
BUCKET_PATH=s3://testbucket
WHEEL_NAME=geoanalytics-x.x.x-py3-none-any.whl
JAR_NAME=geoanalytics_2.12-x.x.x.jar
# Copy the GeoAnalytics Engine jar into the Spark jars directory and stage the wheel
sudo mkdir -p /home/geoanalytics/
sudo aws s3 cp $BUCKET_PATH/$JAR_NAME /usr/lib/spark/jars/
sudo aws s3 cp $BUCKET_PATH/$WHEEL_NAME /home/geoanalytics/$WHEEL_NAME
# Install the GeoAnalytics Engine Python package
sudo python3 -m pip install -U pip
sudo python3 -m pip install /home/geoanalytics/$WHEEL_NAME
You can create the script locally on your computer and upload it, or use a built-in editor such as AWS CloudShell. The script above is an example startup script for GeoAnalytics Engine 1.5.0; substitute the version numbers in the file names with those of the release you downloaded.
If you are using the supplemental projection data jars, you will also need to add a copy command for each projection data jar that you need. For instance:
sudo aws s3 cp $BUCKET_PATH/<esri-projection-name>.jar /usr/lib/spark/jars/
Note: To use ST_H3Bin or ST_H3Bins, you must copy the H3 jar to /usr/lib/spark/jars/ on your cluster using the setup script, as in the examples shown above. Additionally, you may need to pip install Python modules depending on the functionality you need; for example, for plotting you will need matplotlib.
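Alternatively, rather than adding pip install lines to the bootstrap script, Amazon EMR notebooks support notebook-scoped libraries that you can install at runtime from a notebook attached to the cluster. A minimal sketch, assuming your EMR release supports notebook-scoped libraries (sc is the SparkContext provided by the notebook session):
# Install a notebook-scoped Python package (for example, matplotlib for plotting)
sc.install_pypi_package("matplotlib")
# List the packages available to this notebook session
sc.list_packages()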
At this point, we are done with our S3 bucket. It has all of the content that we need – the GeoAnalytics Engine files as well as the startup script. So we can move on to creating our Spark pool.
In the AWS console, search for Amazon EMR
In the menu for your EMR environment, go to Clusters. Note that EMR Serverless is not supported.
Here is where you will clone or create a new cluster.
When you create a new cluster, you need to name it, pick your release version, and pick your application bundle settings.
Under Name and applications, for Amazon EMR release choose any supported EMR release. See About Amazon EMR Releases for details on release components, and make sure you are setting up an environment that uses a supported runtime. The list of supported runtimes is available here in the developer documentation.
For Application bundle select "Custom" and ensure that at least Spark is checked (it is required for GeoAnalytics Engine); if you plan to attach a notebook to the cluster as shown later in this post, also check Livy and JupyterEnterpriseGateway.
If you are going to use AWS Glue you will also want to check “Use for Spark table metadata.” You can find more information about working with AWS Glue and GeoAnalytics Engine here.
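As a quick illustration, once Glue is set as the Spark table metastore, tables registered in the AWS Glue Data Catalog can be queried directly with Spark SQL from a notebook attached to the cluster. The database and table names below are placeholders:
# Query a table registered in the AWS Glue Data Catalog (placeholder names)
glue_df = spark.sql("SELECT * FROM my_glue_database.my_glue_table")
glue_df.show(5)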
Under Cluster configuration, choose your preferred configuration.
This is determined by your organizational needs. While any instance type will work, we recommend using memory-optimized instances that have at least 8GB of memory per core for best performance.
Configure the Networking for your cluster.
Under Bootstrap Actions - optional, click "Add". For Script location, specify the path to the .sh script you created and uploaded to your S3 bucket earlier; this bootstrap action installs the jar and wheel files on each node when the cluster starts. Then click Add bootstrap action.
Under Software settings - optional select "Enter configuration" and copy and paste the text below into the text box. This sets some required configuration settings for the Spark environment.
[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.plugins": "com.esri.geoanalytics.Plugin",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.kryo.registrator": "com.esri.geoanalytics.KryoRegistrator"
    }
  }
]
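Once the cluster is running and you have a notebook attached (covered later in this post), you can confirm that these settings were applied with a quick check like the one below:
# Confirm the GeoAnalytics plugin and Kryo settings were picked up by Spark
print(spark.sparkContext.getConf().get("spark.plugins"))           # expected: com.esri.geoanalytics.Plugin
print(spark.sparkContext.getConf().get("spark.kryo.registrator"))  # expected: com.esri.geoanalytics.KryoRegistrator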
Set the Identity and Access Management (IAM) roles for your cluster.
If your organization requires it, set up the security configuration and EC2 key pair, and then select your service role. In this example we are using EMR_DefaultRole.
Accept the defaults for all other parameters in the previous steps or change them based on your needs.
Now you can click Create cluster. If cluster creation fails, check the cluster logs to diagnose the issue.
We can check that everything worked by running a quick script. To verify that your environment is ready to use:
Go to Notebooks and Git repos
Go to workspaces – create or use an existing workspace
Open a notebook so we can test that GeoAnalytics Engine is properly loaded and ready to use. Make sure your notebook is attached to the cluster created in the previous steps. Now you can try running this code to list the user functions available:
from geoanalytics.sql import functions as ST
spark.sql("show user functions like 'ST_*'").show()
We can also test reading in a feature service and plotting the results like this:
# read a feature service hosted in the Living Atlas of the World
myFS = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/USA_States_Generalized_Boundaries/FeatureServer/0"
df = spark.read.format('feature-service').load(myFS)
# plot a DataFrame with geometry from a feature service
df.st.plot(basemap="light", facecolor="yellow", edgecolor="black", alpha=0.5)
%matplot plt
We can also load data stored in an S3 bucket directly using PySpark without configuring Spark. Here is an example of loading a table from a file geodatabase:
path = "s3://<bucket_name>/<file_path>/<file_name>/"
df = spark.read.format("filegdb").options(gdbPath=path, gdbName="<table_name>").load()
When AWS credentials are required to access the S3 bucket, you can use the following code snippet to provide them:
# Configure the Spark session with AWS credentials for S3 access
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("awsAnalysis").getOrCreate()
spark._sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId', 'AWS_ACCESS_KEY')
spark._sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey', 'AWS_SECRET_KEY')
spark._sc._jsc.hadoopConfiguration().set('fs.s3a.impl', "org.apache.hadoop.fs.s3a.S3AFileSystem")
# Load the shapefile on Amazon S3 into a PySpark DataFrame
df = spark.read.format("shapefile").load("s3a://<bucket_name>/<file_path>/<file_name>")
In this blog post we walked through setting up GeoAnalytics Engine in an AWS EMR environment. The screen captures and content reflect the current process as of July 2025. We hope that this walkthrough on setting up GeoAnalytics Engine for use in AWS EMR has been helpful! If you have questions or comments, please feel free to leave a comment in the section below.