What is the optimized way to execute spatial queries in Hive, using GIS Tools for Hadoop?

vigneshwaran — Sun, 04 Jun 2023 11:38:52 GMT

Dear Esri Team,

I am trying to implement spatial queries in Hive using GIS Tools for Hadoop, following the steps mentioned in the link below.

https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/point-in-polygon-aggregation-hive

The implementation of the queries are as follows

1.Circle range query
SELECT COUNT(*) FROM table_name WHERE
ST_Intersects(ST_Buffer(ST_Point(x1,y1), radius), ST_Point(x, y));

2. BBOX range query
SELECT COUNT(*) FROM table_name WHERE
ST_Contains(ST_GeomFromText('POLYGON ((x1 y1, x2 y2, .. , x1 y1))'), ST_Point(x, y));

3. k-NN query (k=10)
SELECT ST_Distance(ST_Point(x1, y1), ST_Point(x, y)) as distance, x, y FROM
table_name ORDER BY distance ASC LIMIT 10;

Kindly let me know the optimized way to implement the above queries (especially k-NN).

Thanks & Regards

Vigneswaran

Re: What is the optimized way to execute spatial queries in Hive, using GIS Tools for Hadoop?

VenkataKondepati — Fri, 12 Sep 2025 18:36:18 GMT

Hi @vigneshwaran,
Hive can do these, but you’ll get much better performance with a two-stage filter (cheap spatial prefilter → exact test) and a coarse spatial index (grid/geohash/H3) to avoid full scans especially for k-NN.

Below are practical patterns that work well with GIS Tools for Hadoop UDFs.

0) Table prep (one-time)

Store as ORC/Parquet, enable Tez, vectorization, and collect stats.
Add a point geometry column once, rather than building it per query.
Add a coarse spatial key (e.g., geohash/H3/quadkey) for pruning.

-- Example schema CREATE TABLE pts ( id BIGINT, x DOUBLE, -- lon y DOUBLE, -- lat geom STRING, -- WKT point gh STRING -- geohash / tile id ) STORED AS ORC; -- Populate (example) INSERT OVERWRITE TABLE pts SELECT id, x, y, ST_AsText(ST_Point(x, y)) AS geom, geohash(x, y, 7) AS gh -- use your geohash/H3 UDF FROM raw_pts; ANALYZE TABLE pts COMPUTE STATISTICS;

If you don’t have a geohash/H3 UDF handy, precompute a simple grid key (e.g., floor(x/Δ), floor(y/Δ)).

1) Circle range query (fast pattern)

Avoid buffering every row. Compute the circle once, prefilter by envelope, then run exact check.

-- Params WITH q AS ( SELECT ST_Point(%x1, %y1) AS qpt, %radius AS r ) SELECT COUNT(*) FROM pts JOIN q WHERE -- cheap prefilter: envelope test ST_EnvIntersects( ST_Envelope(ST_Buffer(q.qpt, q.r)), -- computed once ST_Envelope(ST_Point(x, y)) ) -- exact test AND ST_Distance(ST_Point(x, y), q.qpt) <= q.r;

Why faster? The envelope test prunes most rows before the expensive exact distance.

2) BBOX range query (fast pattern)

Use ST_EnvIntersects first, then (optionally) ST_Contains as exact test.

WITH bbox AS ( SELECT ST_GeomFromText('POLYGON((x1 y1, x2 y2, x3 y3, x4 y4, x1 y1))') AS poly ) SELECT COUNT(*) FROM pts JOIN bbox WHERE ST_EnvIntersects(bbox.poly, ST_Point(x, y)) AND ST_Contains(bbox.poly, ST_Point(x, y)); -- optional exact check

3) k-NN (k=10): scalable approach

Global ORDER BY distance LIMIT 10 forces a single reducer (slow). Instead, do candidate pruning with your spatial key, then a small sort.

3a) Precompute neighbors to search

For geohash/H3, get the cell containing (x1,y1) and its immediate neighbors (ring). If < k results, expand by another ring.

-- Example: find target cell + 8 neighbors WITH q AS ( SELECT %x1 AS qx, %y1 AS qy, geohash(%x1, %y1, 7) AS qgh ), nbrs AS ( SELECT q.qx, q.qy, n.gh AS ngh FROM q LATERAL VIEW explode(geohash_neighbors(q.qgh)) n AS gh -- UDF returns center+neighbors ) SELECT distance, id, x, y FROM ( SELECT ST_Distance(ST_Point(x, y), ST_Point(qx, qy)) AS distance, id, x, y FROM pts JOIN nbrs ON pts.gh = nbrs.ngh ) cand ORDER BY distance ASC LIMIT 10;

If your 9 cells don’t yield 10 points, expand to the next ring (second-order neighbors) and rerun. In production, do this iteratively in a procedure, or over-fetch (e.g., 2 rings) once and still be fast.

3b) No geohash? Use a bounding box window as a candidate filter

Start with a small window around (x1,y1), enlarge if needed.

WITH q AS ( SELECT %x1 AS qx, %y1 AS qy, %win AS w ) SELECT distance, id, x, y FROM ( SELECT ST_Distance(ST_Point(x, y), ST_Point(qx, qy)) AS distance, id, x, y FROM pts JOIN q WHERE x BETWEEN qx - w AND qx + w AND y BETWEEN qy - w AND qy + w ) cand ORDER BY distance LIMIT 10;

Pick w based on expected density; if < 10 results, double w and retry.

Extra performance tips

Cache constants: put ST_Buffer(...), ST_Point(x1,y1) in a CTE so they’re computed once.

Partition/Bucket on your spatial key (gh) to get partition pruning.

Use SORT BY distance LIMIT 10 only if you first reduce candidates; otherwise it still hits a single reducer.

Consider converting hot datasets to Parquet with sorted Z-order (lon,lat) or by spatial key to improve locality.

If polygons are small and numerous, pre-tile them (one row per tile) and join on tile id before exact ST_Intersects.

Regards,
Venkat

topic What is the optimized way to execute spatial queries in Hive, using GIS Tools for Hadoop? in ArcGIS GeoAnalytics Server Questions

What is the optimized way to execute spatial queries in Hive, using GIS Tools for Hadoop?

Re: What is the optimized way to execute spatial queries in Hive, using GIS Tools for Hadoop?

0) Table prep (one-time)