Hello,
I am running some tests with Dask GeoPandas, but I'd like to run those tests with huge geospatial data. I have been looking around (probably not in the right places), but I cannot find anything that is properly huge. I would love to test Dask GeoPandas with a CSV (or another file type) that contains thousands of geospatial records. The geospatial element could be as simple as having lat and lon columns.
Any help would be much appreciated. Thanks!
You’re right — most sample datasets are too small to really put Dask GeoPandas through its paces. A few good, openly available options that work well for large-scale testing:
NYC TLC Trip Data (CSV/Parquet):
Monthly taxi trip files, millions of rows each. Note that only the files through mid-2016 include pickup/dropoff lat/lon; later files use taxi-zone IDs instead, and the site now publishes everything as Parquet. Perfect for partitioned reads.
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
OpenAddresses (CSV):
Hundreds of millions of global address points with lat/lon, downloadable in regional CSV bundles.
https://openaddresses.io
GeoNames allCountries (TXT/TSV):
~11 million global place records with lat/lon in a single tab-separated file with no header row (see the loading sketch after this list).
https://download.geonames.org/export/dump/
GDELT Global Events Database (CSV):
Truly massive multi-TB event dataset with coordinates. Daily files (tab-delimited, despite the .csv extension) are also published if you just want a slice.
https://www.gdeltproject.org/data.html
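For GeoNames specifically, allCountries.txt ships without a header row, so you have to supply column names yourself. Here is a minimal loading sketch; the 19 column names below are taken from the dump's readme.txt, so double-check them against your copy:

import csv
import dask.dataframe as dd

# allCountries.txt is tab-separated, has no header row, and its fields are
# unquoted, so stray quote characters in place names must not be treated as
# quoting (hence quoting=csv.QUOTE_NONE)
cols = [
    "geonameid", "name", "asciiname", "alternatenames",
    "latitude", "longitude", "feature_class", "feature_code",
    "country_code", "cc2", "admin1_code", "admin2_code",
    "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]
places = dd.read_csv(
    "allCountries.txt",
    sep="\t",
    header=None,
    names=cols,
    quoting=csv.QUOTE_NONE,
    dtype={"admin1_code": "object", "admin2_code": "object"},  # codes mix ints and strings
    blocksize="64MB",  # split the single big file into many partitions
)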
Once you have one of these, you can create geometries from lat/lon directly in Dask GeoPandas. For example, with the older NYC TLC trip files (through mid-2016) that still carry pickup/dropoff lat/lon columns:
import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd

# Read multiple monthly CSVs in parallel (one or more partitions per file).
# Only TLC files through mid-2016 include lat/lon columns, hence the 2015 glob.
df = dd.read_csv("yellow_tripdata_2015-*.csv", assume_missing=True)

# Convert lat/lon columns into a point geometry, one partition at a time
def add_geometry(pdf):
    return gpd.GeoDataFrame(
        pdf,
        geometry=gpd.points_from_xy(pdf.pickup_longitude, pdf.pickup_latitude),
        crs="EPSG:4326",
    )

gdf = dg.from_dask_dataframe(df.map_partitions(add_geometry))
print(gdf.head())
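As an aside, recent dask-geopandas releases also provide a points_from_xy helper that skips the per-partition function entirely. A sketch, with the caveat that the exact signature can vary across versions, so check the docs for your install:

# Alternative: let dask-geopandas build the geometry column directly
geometry = dg.points_from_xy(df, "pickup_longitude", "pickup_latitude")
gdf = dg.from_dask_dataframe(df, geometry=geometry)
gdf = gdf.set_crs("EPSG:4326")  # coordinates are WGS84 lat/lon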
That way you can test scaling behavior across millions (or even billions) of rows without having to hunt for obscure “huge shapefiles.”
If you want “massive” benchmarks, OpenAddresses and GDELT are probably the best options, while TLC data is a nice, manageable starting point because it’s already partitioned by month.
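Whichever source you pick, a quick way to see the scaling in practice is a parallel bounding-box filter plus a reduction. A sketch against the gdf built above; the box coordinates are rough, illustrative values, not a precise Manhattan boundary:

from shapely.geometry import box

# Rough bounding box around Manhattan (illustrative coordinates only)
manhattan = box(-74.03, 40.70, -73.93, 40.88)

# The predicate is evaluated per partition in parallel; sum() reduces the
# boolean mask to a count, and nothing actually runs until compute()
n_inside = gdf.geometry.within(manhattan).sum().compute()
print(f"{n_inside} pickups fall inside the box")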