
Working with Snowflake data with GeoAnalytics Engine in Databricks

SBattersby | Esri Contributor | 08-29-2024 09:00 AM

I feel like it’s rare to find an organization where all of the data is in one nice, tidy data warehouse or data lake.  More likely than not, your data lives in a variety of locations, but you still want to be able to work with it in a single environment like Databricks. 

Lately, I’ve had a number of questions about working with Snowflake data using the ArcGIS GeoAnalytics Engine library, so I set out to create a proof of concept to demonstrate how to bring Snowflake data into a Databricks dataframe and work with the spatial data from Snowflake using GeoAnalytics Engine.

Connecting to Snowflake data with geographic coordinates for use in GeoAnalytics Engine

In this post, we’ll look at connecting the environments to open up the door to bring your Snowflake data in as part of your spatial analysis workflow.

Throughout this post, I’ll reference useful documentation from both Snowflake and Databricks.

Snowflake environment

Let’s start with a description of the Snowflake data that we are working with for the demonstration.

For this example, I’m working with OpenStreetMap (OSM) data stored in Snowflake. Setting up some sample spatial OSM data is described in the Getting Started with Geospatial – Geography quickstart from Snowflake.

From this quickstart, I set up a database with a variety of OSM data provided by Sonra.

Database with tables of OpenStreetMap (OSM) data with geographic coordinates in Snowflake

Specifically, we will work with the OSM “Way” data – these are data that typically represent linear features such as roads, railways, or rivers.

Within this table, we have a column named coordinates that uses Snowflake’s GEOGRAPHY data type.

The coordinates field of "geography" data type in a Snowflake table

You can find more detail on Snowflake’s geospatial data types in the Snowflake documentation. In general, you’ll often see the data represented as a GeoJSON string, but Snowflake also supports Well-Known Text (WKT), Well-Known Binary (WKB), and extended WKT and WKB.

In the data table, the geography appears as GeoJSON strings:

Geographic coordinates in a GeoJSON format in a Snowflake table

Since GeoJSON is a format that GeoAnalytics Engine can read directly, working with the data is straightforward.

 

Connecting to Snowflake data from Databricks

To use the data in our Snowflake tables with GeoAnalytics Engine, we need access to it in our Spark environment.  For this, we’ll follow the guidance in the Databricks documentation to Read and write data from Snowflake.  

Databricks provides a Snowflake connector in the Databricks Runtime, and we can use it to bring our Snowflake data into a dataframe.

For the Snowflake connector, we will need to provide details on the location of our Snowflake data.  Specifically, we need to specify the host location, the user credentials (user name and password), and the specifics of the warehouse, database, schema, and table to connect to. 

With a connection to the database mentioned above, I reference the OSM_NEWYORK database and, specifically, the V_OSM_NY_WAY table within that database. 

 

snowflake_table = (spark.read
  .format("snowflake")
  .option("host", snowflake_host)
  .option("port", "443") # Optional - will use default port 443 if not specified.
  .option("user", snowflake_user)
  .option("password", snowflake_pass)
  .option("sfWarehouse", "COMPUTE_WH")
  .option("database", "OSM_NEWYORK")
  .option("schema", "NEW_YORK") # Optional - will use default schema "public" if not specified.
  .option("dbtable", "V_OSM_NY_WAY")
  .load()
)

 

In the example above, the host, user, and password values are read from variables populated from a secrets management system. When providing credentials in a notebook, I highly recommend storing and retrieving them with a secrets manager, such as Azure Key Vault or Databricks secret management, rather than hard-coding them.
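For example, if the credentials live in a Databricks secret scope, the variables used above could be populated like this (a minimal sketch; the scope and key names are hypothetical placeholders):

# Read Snowflake connection details from a Databricks secret scope.
# The scope name "snowflake-creds" and the key names below are hypothetical - substitute your own.
snowflake_host = dbutils.secrets.get(scope="snowflake-creds", key="host")
snowflake_user = dbutils.secrets.get(scope="snowflake-creds", key="user")
snowflake_pass = dbutils.secrets.get(scope="snowflake-creds", key="password")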

After the initial read of the data from Snowflake, we have access to our data in the table as a dataframe in Databricks.  In the case of this dataset, we are looking at about 5.7 million records with a variety of attributes, including the GeoJSON strings for the geography coordinates.

Geographic coordinates in GeoJSON format after being read into a dataframe in Databricks
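If you want to sanity-check what came across, a quick look at the row count and a few records is enough (a minimal sketch; display() assumes you’re working in a Databricks notebook):

# Confirm the row count and preview a few records, including the GeoJSON strings
print(snowflake_table.count())       # roughly 5.7 million rows for this dataset
display(snowflake_table.limit(5))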

Now that we have connected to our data, we need to convert the coordinates from a GeoJSON string into a geometry type that GeoAnalytics Engine can work with. For this, we use the ST_GeomFromGeoJSON function. In this example, I’m adding a new column to hold the geometry and specifying that the coordinate system for the data is WGS84 (SRID 4326):

 

from geoanalytics.sql import functions as ST  # GeoAnalytics Engine SQL functions

snowflake_table = snowflake_table\
    .withColumn("geometry", ST.geom_from_geojson("coordinates", 4326))

 

After adding that column, we now have both a GeoJSON string (“coordinates”) and a geometry (“geometry”) that we can work with directly in GeoAnalytics Engine.

Geographic coordinates from a Snowflake table in GeoJSON format and as geometry when converted with GeoAnalytics Engine
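To confirm the conversion, you could inspect the schema of the two columns (a minimal sketch):

# The coordinates column is still a GeoJSON string; the new geometry column
# uses GeoAnalytics Engine's geometry type
snowflake_table.select("coordinates", "geometry").printSchema()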

This new geometry column can be used with any of the GeoAnalytics Engine functions or tools! For instance, we can add columns with the geometry type and geodesic length of each record using the ST_GeometryType and ST_GeodesicLength functions. We use geodesic length because a planar length would be reported in the units of the coordinate system, and a length in decimal degrees isn’t meaningful; geodesic length returns the length in meters.

 

snowflake_table = snowflake_table\
    .withColumn("geom_type", ST.geometry_type("geometry"))\
    .withColumn("geom_length", ST.geodesic_length("geometry"))

snowflake_table.select("geometry", "geom_type", "geom_length").show()

 

Example of spatial operations using GeoAnalytics Engine

We can also plot the results to look at the data visually.  Since this particular dataset includes Linestrings and Polygons, we can separate those out to plot independently.

 

from pyspark.sql import functions as F  # Spark SQL functions for filtering

extent = (-73.98980002777058, 40.74753140314906, -73.98176056791401, 40.751731134711996)

# Plot the polygon features first and keep the returned axes
myplt = snowflake_table.filter(F.col("geom_type") == "Polygon")\
        .st.plot(basemap="streets",
                 geometry="geometry",
                 linewidth=0.5,
                 alpha=0.5,
                 edgecolor="black",
                 extent=extent)

# Add the linestring features to the same axes
snowflake_table.filter(F.col("geom_type") == "Linestring")\
        .st.plot(ax=myplt,
                 geometry="geometry",
                 alpha=0.5,
                 edgecolor="black",
                 extent=extent)

 

Spatial data from a Snowflake data table rendered in Databricks using GeoAnalytics Engine

Conclusion

That’s all there is to it! Using the Snowflake connector in Databricks, we can bring our Snowflake data into the Databricks environment and work with the geometries using GeoAnalytics Engine. The connector is a great way to pair new contextual and spatial data with other datasets so that we can take full advantage of Databricks’ Spark-based environment.

We’re excited to hear about what you’re doing with Snowflake, Databricks, and GeoAnalytics Engine to drive amazing analytic workflows.