Working with millions of table rows

ChrisPedrezuela · ‎06-14-2015

Hi guys,

This might be an old discussion topic and hopefully you guys can be able to point me to the right direction.

I have a single sql table that has information of bus location logs for a given time period. I want to work out a flow that will run some calculations per bus per day. So it is a table with millions of records and I was thinking of breaking the table into separate ones per day to be manageable. I've queried a single day table and I got like 1k unique bus numbers. I am wondering if its ok to create a dictionary having the bus number as key and its value would be the list of all its location logs. Wondering if this is the most practical way to go about it.

Cheers!

RichardFairhurst · ‎06-14-2015

I would need more information to say whether or not that design is the most efficient. Partly it would depend on where these calculations will ultimately be stored and how many records need to be evaluated to do the calculations. Certainly you can design a dictionary to be structured as you have described, but given the size of your table this may risk using up all of the available memory if the entire table is loaded at once. But without knowing what the calculations involve or where they will be stored it is hard to say what the best approach would be if memory limitations become an issue.

ChrisPedrezuela · ‎06-14-2015

Hi Richard,

thanks for the response, my plan is, for each unique bus in a given day, calculate its travel time, speed in between its logged location for a specific route and store that in another table. then redoing the process for the same bus based on the next day until the last day of the month. then calculate a statistics for mean travel time & speed throughout the month.

Regards,

Chris

RichardFairhurst · ‎06-14-2015

Given that I don't work with bus transit routing I don't have a clear idea of how these logs are organized. What fields do your records currently contain? How does it record time of day and distance information for each stop? Are you looking to store an average for the total day of trips, average for the full round trip to a single location, average one way trips, etc. Are there any variations in route for a given bus that need to be accounted for or weekday verses weekend differences? Anyway, the dictionary key could be a tuple if you need to consolidate and organize data to include more information than just the bus, such as bus and date (regardless of time) or bus and stop pairs. It all depends on how much you want to summarize the data at each stage compared to its current organization.

JamesCrandall · ‎06-15-2015

Are there any spatial analysis requirements or is it just strictly a grouping and statistical analysis of attribute information? What's the output supposed to be?

If it's just attributes then I'd say just write the logic into a stored procedure on the database and depending upon the output requirements would determine how to make the results accessible. If you are working with millions of rows then it only makes sense to perform the statistics where the data resides.

If it's absolutely necessary to do this client side then the pandas and numpy libraries will be friendly for this type of grouping and calculating analytics.

ChrisPedrezuela · ‎06-15-2015

Thanks James and Richard,

I basically have zone polygons, and for each polygon, I want to extract all bus IDs that went through using a months data of XY logs. Once I have a subset of the big data, I need to create stats for individual bus IDs (i.e. average speed, length of travel, etc). I can go about the stat part but thinking of best way to generate data per zone based on a big sql table is something I am struggling with.

Cheers