
Open source geospatial data for testing parallel computing

04-13-2023 03:04 AM
by shizu6 (Deactivated User)

Hello,

I am running some tests with Dask GeoPandas but I'd like to run those tests with huge geospatial data. I have been looking around (probably not in the right places) but I cannot find anything that is properly huge. I would love to test Dask GeoPandas with a CSV (or other file type) that contains thousands of geospatial records. The geospatial element could be as simple as having lat and lon columns.

Any help would be much appreciated. Thanks!

1 Reply
VenkataKondepati
Occasional Contributor

 

You’re right. Most sample datasets are too small to really put Dask GeoPandas through its paces. A few good, openly available options for large-scale testing:

- NYC TLC Trip Record Data: monthly taxi trip files with millions of rows per month; the releases up to mid-2016 include pickup/dropoff lat/lon columns, and the monthly files map neatly onto Dask partitions.
- OpenAddresses: hundreds of millions of address points worldwide, distributed as plain CSVs with lon/lat columns.
- GDELT: a continuously updated global event dataset with lat/lon fields on its records, big enough for truly massive tests.

For example, with the TLC data you can read the monthly CSVs straight into Dask and build point geometries from the lat/lon columns:

import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd

# Read multiple monthly CSVs in parallel.
# Note: the pickup_longitude / pickup_latitude columns only exist in TLC trip
# records up to mid-2016; later releases use location-zone IDs instead (and are
# published as Parquet, which dd.read_parquet handles the same way).
df = dd.read_csv("yellow_tripdata_2015-*.csv", assume_missing=True)

# Convert lat/lon columns into point geometry, one pandas partition at a time
def add_geometry(pdf):
    return gpd.GeoDataFrame(
        pdf,
        geometry=gpd.points_from_xy(pdf.pickup_longitude, pdf.pickup_latitude),
        crs="EPSG:4326",
    )

# Apply to every partition and treat the result as a dask-geopandas GeoDataFrame
gdf = dg.from_dask_dataframe(df.map_partitions(add_geometry))

print(gdf.head())
 

That way you can test scaling behavior across millions (or even billions) of rows without having to hunt for obscure “huge shapefiles.”
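If you want a quick feel for the scaling itself, a rough sketch like the one below (the partition counts are arbitrary, purely illustrative) times a full pass over the data at a few different partition counts:

import time

# Purely illustrative: time a full pass over the data at a few arbitrary
# partition counts to see how throughput changes with more parallelism.
for n in (4, 16, 64):
    part = gdf.repartition(npartitions=n)
    start = time.time()
    rows = len(part)  # len() on a Dask (Geo)DataFrame computes across all partitions
    print(f"{n} partitions: {rows:,} rows in {time.time() - start:.1f}s")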

If you want “massive” benchmarks, OpenAddresses and GDELT are probably the best options, while TLC data is a nice, manageable starting point because it’s already partitioned by month.
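The same lat/lon-to-geometry pattern carries over to those as well. Here is a sketch for OpenAddresses, assuming the classic CSV layout with upper-case LON/LAT columns and using a placeholder file glob (adjust both to whatever you actually download):

import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd

# Placeholder glob: point this at the OpenAddresses CSVs you downloaded.
addr = dd.read_csv("openaddresses/*.csv", assume_missing=True)

def to_points(pdf):
    # LON/LAT are the column names in the classic OpenAddresses CSV layout
    return gpd.GeoDataFrame(
        pdf,
        geometry=gpd.points_from_xy(pdf.LON, pdf.LAT),
        crs="EPSG:4326",
    )

addr_gdf = dg.from_dask_dataframe(addr.map_partitions(to_points))
print(addr_gdf.head())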
