<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Open source geospatial data for testing parallel computing in ArcGIS GeoAnalytics Server Questions</title>
    <link>https://community.esri.com/t5/arcgis-geoanalytics-server-questions/open-source-geospatial-data-for-testing-parallel/m-p/1278167#M195</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am running some tests with Dask GeoPandas but I'd like to run those tests with huge geospatial data. I have been looking around (probably not in the right places) but I cannot find anything that is properly huge. I would love to test Dask GeoPandas with a CSV (or other file type) that contains thousands of geospatial records. The geospatial element could be as simple as having lat and lon columns.&lt;/P&gt;&lt;P&gt;Any help would be much appreciated. Thanks!&lt;/P&gt;</description>
    <pubDate>Thu, 13 Apr 2023 10:04:43 GMT</pubDate>
    <dc:creator>shizu6</dc:creator>
    <dc:date>2023-04-13T10:04:43Z</dc:date>
    <item>
      <title>Open source geospatial data for testing parallel computing</title>
      <link>https://community.esri.com/t5/arcgis-geoanalytics-server-questions/open-source-geospatial-data-for-testing-parallel/m-p/1278167#M195</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am running some tests with Dask GeoPandas but I'd like to run those tests with huge geospatial data. I have been looking around (probably not in the right places) but I cannot find anything that is properly huge. I would love to test Dask GeoPandas with a CSV (or other file type) that contains thousands of geospatial records. The geospatial element could be as simple as having lat and lon columns.&lt;/P&gt;&lt;P&gt;Any help would be much appreciated. Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 13 Apr 2023 10:04:43 GMT</pubDate>
      <guid>https://community.esri.com/t5/arcgis-geoanalytics-server-questions/open-source-geospatial-data-for-testing-parallel/m-p/1278167#M195</guid>
      <dc:creator>shizu6</dc:creator>
      <dc:date>2023-04-13T10:04:43Z</dc:date>
    </item>
    <item>
      <title>Re: Open source geospatial data for testing parallel computing</title>
      <link>https://community.esri.com/t5/arcgis-geoanalytics-server-questions/open-source-geospatial-data-for-testing-parallel/m-p/1650132#M198</link>
      <description>&lt;P&gt;You’re right: most sample datasets are too small to really put &lt;STRONG&gt;Dask GeoPandas&lt;/STRONG&gt; through its paces. A few good, openly available options that work well for large-scale testing:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;NYC TLC Trip Data (CSV/Parquet):&lt;/STRONG&gt;&lt;BR /&gt;Monthly taxi trip files with millions to over ten million rows each. Only files through June 2016 include pickup/dropoff lat/lon; later files use taxi zone IDs instead. Perfect for partitioned reads.&lt;BR /&gt;&lt;A href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page" target="_new" rel="noopener"&gt;https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page&lt;/A&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;OpenAddresses (CSV):&lt;/STRONG&gt;&lt;BR /&gt;Hundreds of millions of global address points with lat/lon, downloadable in regional CSV bundles.&lt;BR /&gt;&lt;A href="https://openaddresses.io" target="_new" rel="noopener"&gt;https://openaddresses.io&lt;/A&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;GeoNames allCountries (TXT/TSV):&lt;/STRONG&gt;&lt;BR /&gt;~11 million global place records with lat/lon in a single file.&lt;BR /&gt;&lt;A href="https://download.geonames.org/export/dump/" target="_new" rel="noopener"&gt;https://download.geonames.org/export/dump/&lt;/A&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;GDELT Global Events Database (CSV):&lt;/STRONG&gt;&lt;BR /&gt;Truly massive multi-TB event dataset with coordinates. Daily CSVs are also published if you just want a slice.&lt;BR /&gt;&lt;A href="https://www.gdeltproject.org/data.html" target="_new" rel="noopener"&gt;https://www.gdeltproject.org/data.html&lt;/A&gt;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Once you have one of these, you can create geometries from lat/lon directly in Dask GeoPandas.
For example, with the NYC TLC trips:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dask.dataframe as dd
import dask_geopandas as dg
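import geopandas as gpd

# Read multiple monthly CSVs. This assumes files from the period when TLC
# still published coordinates (through June 2016); later files carry taxi
# zone IDs instead of pickup_longitude / pickup_latitude.
df = dd.read_csv("yellow_tripdata_2015-*.csv", assume_missing=True)

# Convert lat/lon into point geometry, one pandas partition at a time.
# points_from_xy is vectorized, so it avoids building Point objects in a loop.
def add_geometry(pdf):
    return gpd.GeoDataFrame(
        pdf,
        geometry=gpd.points_from_xy(pdf.pickup_longitude, pdf.pickup_latitude),
        crs="EPSG:4326",
    )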

gdf = dg.from_dask_dataframe(df.map_partitions(add_geometry))

print(gdf.head())&lt;/LI-CODE&gt;&lt;P&gt;That way you can test scaling behavior across millions (or even billions) of rows without having to hunt for obscure “huge shapefiles.”&lt;/P&gt;&lt;P&gt;If you want “massive” benchmarks, &lt;STRONG&gt;OpenAddresses&lt;/STRONG&gt; and &lt;STRONG&gt;GDELT&lt;/STRONG&gt; are probably the best options, while &lt;STRONG&gt;TLC&lt;/STRONG&gt; data is a nice, manageable starting point because it’s already partitioned by month.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Sep 2025 17:06:32 GMT</pubDate>
      <guid>https://community.esri.com/t5/arcgis-geoanalytics-server-questions/open-source-geospatial-data-for-testing-parallel/m-p/1650132#M198</guid>
      <dc:creator>VenkataKondepati</dc:creator>
      <dc:date>2025-09-12T17:06:32Z</dc:date>
    </item>
  </channel>
</rss>

