Dear Esri Team,
Hi @vigneshwaran,
Hive can do these, but you’ll get much better performance with a two-stage filter (cheap spatial prefilter → exact test) and a coarse spatial index (grid/geohash/H3) to avoid full scans especially for k-NN.
Below are practical patterns that work well with GIS Tools for Hadoop UDFs.
Store as ORC/Parquet, enable Tez, vectorization, and collect stats.
Add a point geometry column once, rather than building it per query.
Add a coarse spatial key (e.g., geohash/H3/quadkey) for pruning.
-- Example schema
CREATE TABLE pts (
id BIGINT,
x DOUBLE, -- lon
y DOUBLE, -- lat
geom STRING, -- WKT point
gh STRING -- geohash / tile id
)
STORED AS ORC;
-- Populate (example)
INSERT OVERWRITE TABLE pts
SELECT id, x, y,
ST_AsText(ST_Point(x, y)) AS geom,
geohash(x, y, 7) AS gh -- use your geohash/H3 UDF
FROM raw_pts;
ANALYZE TABLE pts COMPUTE STATISTICS;
If you don’t have a geohash/H3 UDF handy, precompute a simple grid key (e.g., floor(x/Δ), floor(y/Δ)).
1) Circle range query (fast pattern)
Avoid buffering every row. Compute the circle once, prefilter by envelope, then run exact check.
-- Params
WITH q AS (
SELECT ST_Point(%x1, %y1) AS qpt,
%radius AS r
)
SELECT COUNT(*)
FROM pts
JOIN q
WHERE
-- cheap prefilter: envelope test
ST_EnvIntersects(
ST_Envelope(ST_Buffer(q.qpt, q.r)), -- computed once
ST_Envelope(ST_Point(x, y))
)
-- exact test
AND ST_Distance(ST_Point(x, y), q.qpt) <= q.r;
Why faster? The envelope test prunes most rows before the expensive exact distance.
2) BBOX range query (fast pattern)
Use ST_EnvIntersects first, then (optionally) ST_Contains as exact test.
WITH bbox AS (
SELECT ST_GeomFromText('POLYGON((x1 y1, x2 y2, x3 y3, x4 y4, x1 y1))') AS poly
)
SELECT COUNT(*)
FROM pts
JOIN bbox
WHERE ST_EnvIntersects(bbox.poly, ST_Point(x, y))
AND ST_Contains(bbox.poly, ST_Point(x, y)); -- optional exact check
3) k-NN (k=10): scalable approach
Global ORDER BY distance LIMIT 10 forces a single reducer (slow). Instead, do candidate pruning with your spatial key, then a small sort.
3a) Precompute neighbors to search
For geohash/H3, get the cell containing (x1,y1) and its immediate neighbors (ring). If < k results, expand by another ring.
-- Example: find target cell + 8 neighbors
WITH q AS (
SELECT %x1 AS qx, %y1 AS qy, geohash(%x1, %y1, 7) AS qgh
),
nbrs AS (
SELECT q.qx, q.qy, n.gh AS ngh
FROM q
LATERAL VIEW explode(geohash_neighbors(q.qgh)) n AS gh -- UDF returns center+neighbors
)
SELECT distance, id, x, y
FROM (
SELECT
ST_Distance(ST_Point(x, y), ST_Point(qx, qy)) AS distance,
id, x, y
FROM pts
JOIN nbrs ON pts.gh = nbrs.ngh
) cand
ORDER BY distance ASC
LIMIT 10;
If your 9 cells don’t yield 10 points, expand to the next ring (second-order neighbors) and rerun. In production, do this iteratively in a procedure, or over-fetch (e.g., 2 rings) once and still be fast.
3b) No geohash? Use a bounding box window as a candidate filter
Start with a small window around (x1,y1), enlarge if needed.
WITH q AS (
SELECT %x1 AS qx, %y1 AS qy, %win AS w
)
SELECT distance, id, x, y
FROM (
SELECT ST_Distance(ST_Point(x, y), ST_Point(qx, qy)) AS distance, id, x, y
FROM pts JOIN q
WHERE x BETWEEN qx - w AND qx + w
AND y BETWEEN qy - w AND qy + w
) cand
ORDER BY distance
LIMIT 10;
Pick w based on expected density; if < 10 results, double w and retry.
Extra performance tips
Cache constants: put ST_Buffer(...), ST_Point(x1,y1) in a CTE so they’re computed once.
Partition/Bucket on your spatial key (gh) to get partition pruning.
Use SORT BY distance LIMIT 10 only if you first reduce candidates; otherwise it still hits a single reducer.
Consider converting hot datasets to Parquet with sorted Z-order (lon,lat) or by spatial key to improve locality.
If polygons are small and numerous, pre-tile them (one row per tile) and join on tile id before exact ST_Intersects.
Regards,
Venkat