Hadoop Questions

04-07-2015 09:58 AM
JeremyGould
New Contributor III

First, let me say that I am new to Hadoop and all things related.  However, here at KYTC, we have been approved for a 15-node Hadoop cluster (60TB).  I am looking to store our AVL data (salt trucks) in Hadoop via the ArcGIS Server GeoEvent Processor.  This could mean as many as 15 million records in a 24-hour period during a snow/ice event.   My questions are these.  With the current UDFs, would we be able to show hot spots of where salt is being applied?  We will have the GPS locations every 10 seconds for 1,400 trucks.  Will we be able to add additional information to the GPS points (such as county and route) by using a near function via the UDFs for Hadoop?  Finally, do you know if/when more Hadoop tools will be available and supported/delivered with ArcGIS or ArcGIS Pro?  I know this is currently R&D, but is it getting any closer to being out-of-the-box functionality?  We are very interested in using this.  Thanks.

8 Replies
RicardoBarros_Lourenco
New Contributor II

Hi Jeremy!

Well, I've been working with a dataset of 2.5 million records (2014 Divvy Data) in ArcGIS Desktop, and it took a fair amount of time to render. Considering that your load is 5x greater on a daily basis, you will need a storage back end like Hadoop. For loading and processing this data, though, you should do some form of data reduction...

Depending on your use, it can be done in a way that doesn't compromise your results much. In my case, I did the summarization before loading into ArcGIS Desktop. You could summarize by time and, once a certain period has elapsed, discard some of the data... Or keep everything, sum it all, and then apply a normalization for visualization...
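If you end up doing that reduction on the Hadoop side, it could be as simple as a per-truck, per-hour summary in Hive before bringing anything into ArcGIS Desktop. This is only a rough sketch with placeholder table and field names (avl_points, truck_id, fix_time, longitude, latitude), not something specific to your data:

-- minimal reduction sketch; assumes fix_time is a 'yyyy-MM-dd HH:mm:ss' string
SELECT truck_id,
       substr(fix_time, 1, 13) AS fix_hour,
       count(*)                AS fix_count,
       avg(longitude)          AS avg_lon,
       avg(latitude)           AS avg_lat
FROM avl_points
GROUP BY truck_id, substr(fix_time, 1, 13);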

JeremyGould
New Contributor III

Thanks Ricardo,

I will keep these things in mind.

SarahAmbrose
Esri Contributor

Hi Jeremy,

Congrats on getting a Hadoop cluster – you can read about our own experience here.

Currently you won’t be able to show hot spots, but will still be able to look for trends. I would recommended doing an aggregation by bins analysis (there is a new blog post and tutorial on this subject). Aggregating will produce a raster like result, showing the amount of salt being applied. The biggest difference between aggregation by bins and hot spot analysis is that hot spots shows statistical significance. From your question, it doesn't seem like that is very important, as you are more interested in finding information about the distribution of salt.
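If it helps, the bin aggregation in the tutorial boils down to a query along these lines, using the ST_Bin/ST_BinEnvelope UDFs from the spatial framework. This is only a sketch: the table and field names (avl_points, longitude, latitude) and the bin size are placeholders, not values from your data.

SELECT ST_BinEnvelope(0.01, bin_id) AS bin_shape,
       count(*)                     AS point_count
FROM (
  -- assign each GPS point to a square bin; 0.01 is a placeholder bin size
  -- in the data's coordinate units
  SELECT ST_Bin(0.01, ST_Point(longitude, latitude)) AS bin_id
  FROM avl_points
) binned
GROUP BY bin_id;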

With the current UDFs you are able to complete a near analysis by applying a buffer and then using a within or intersect operation (depending on your criteria).
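As a rough sketch of that buffer-then-intersect pattern (the roads and avl_points tables, their fields, and the buffer distance are placeholders, not from your data):

SELECT roads.route_id, avl_points.truck_id, avl_points.fix_time
FROM roads JOIN avl_points
-- buffer each road segment, then keep the GPS points that fall inside it;
-- 0.0005 is a placeholder distance in the data's coordinate units
WHERE ST_Intersects(ST_Buffer(roads.shape, 0.0005),
                    ST_Point(avl_points.longitude, avl_points.latitude));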

These GIS Tools for Hadoop need to be used with MapReduce, and thus need to be completed in Hadoop (not ArcGIS). Although the analysis is completed with Hadoop, you are able to move datasets between Hadoop and ArcGIS using the Geoprocessing Tools for Hadoop.

You are correct that Big Data solutions are currently in R&D, and we think you will be very excited about what is coming. We are continuing to work on out-of-the-box functionality, and we will keep the GeoNet community posted on new developments.

In the meantime please feel free to ask questions in GeoNet or submit issues on the GitHub site.

JeremyGould
New Contributor III

Thank you, Sarah, for the input.  We most certainly do look forward to any out-of-the-box functionality you can provide.  I suspect that in the next couple of weeks we will have our Hadoop cluster up and running.  I may be requesting assistance from this site if I run into issues or get confused.  Once I get in there and get comfortable, hopefully I can share some of our needs.  Thanks again for your support.

JeremyGould
New Contributor III

Hello Sarah,

We are just now getting our Hadoop cluster up and loading spatial data into it.  We successfully ran the aggregation of data by spatial bins against our own AVL data without issue.  Here is our next challenge.  We want to take the millions of AVL points and add attribute data to them from the nearest road segment.  How can this be done?  I guess I am asking how to do a spatial join, but if possible I would like to take it a step further and get the mile points of those GPS locations from our Measured Road Network.  Once we can add attributes from the road network to the points, we can then do summations on how much salt/money we spent on each road or county, or even between specific mile points if we can get that information.  I reviewed the current UDFs, and it appears that most of them only return True/False and do not allow you to add attributes.  Any help would be appreciated.

Thanks

Jeremy

SarahAmbrose
Esri Contributor

Hi Jeremy,

Glad to hear that the aggregation tutorial worked. In the point-in-polygon aggregation sample (https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/point-in-polygon-aggregation-hive) you grouped results (group by counties.name) and ended up with a count within each county. To join attributes, you will want to remove the group-by term, add the attribute fields you are interested in joining, and complete the join, similar to:

-- replace attributestojoinhere with the earthquake fields you want carried over
SELECT counties.name, earthquakes.attributestojoinhere FROM counties 
JOIN earthquakes 
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude));
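And if you later want the summations you mentioned (salt per road or per county), you could put a group-by back in once the attributes are joined. A sketch only, with placeholder names (roads, avl_points, salt_amount) and a placeholder buffer distance standing in for your own data:

SELECT roads.route_id,
       sum(avl_points.salt_amount) AS total_salt
FROM roads JOIN avl_points
WHERE ST_Contains(ST_Buffer(roads.shape, 0.0005),
                  ST_Point(avl_points.longitude, avl_points.latitude))
GROUP BY roads.route_id;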

- Sarah

JeremyGould
New Contributor III

Thanks Sarah,

That worked and we are moving forward.  Thanks for all your help so far.

JeremyGould
New Contributor III

This is working in Hive.  Do you have something similar for Spark?
