How to count the number of days between events in a flat file (Python, R, or VBA)?

Anonymous User · ‎01-25-2018

Hello all!

So I am working on a project and I've hit a roadblock.

What I want to do is find the number of days that a parcel (identified by 'PREM_CODE') had water service during a two year period.

However, the structure of the data is giving me some issues. Each row in this table represents a water meter reading and whether the water was turned on or off. If the water was turned on, the value in the field 'ACTN_CODE' would be 'IN,' if the meter was being turned off, the value would be 'OUT.'

There are a few assumptions that I need to put into a logic to deal with these that are giving me trouble. Those assumptions are:

If a premises's (PREM_CODE) first chronological event is a turn-on event (i.e. 'ACTN_CODE'='IN'), then use that date should be used as the starting date (Below Example 89 and 99).

If a premises's (PREM_CODE) first chronological event is a turn-off event (i.e. 'ACTN_CODE'='OUT'), then subtract the number of days between the period's first date (7/1/2015) to the date of the event (ACTION_DATE) (Below Example 50,76, and 152).

Then, after that first logic is implemented, I need to count the days between any subsequent 'IN' and 'OUT' events.

If a PREM_CODE's last entry is an 'IN' event, then I need to subtract that 'IN' event's date from the end date, which is 8/1/2017.

And lastly, if a PREM_CODE's last entry is an 'OUT,' then no further calculations should be made.

Here is a sample of what this could look like:

**Start date = 7/1/2015
**End date = 8/1/2017

PREM_CODE	ACTN_CODE	ACTION_DATE
50	OUT	7/30/2016
50	IN	9/21/2016
89	IN	10/10/2015
76	OUT	12/1/2015
76	IN	1/1/2016
76	OUT	9/1/2016
76	IN	5/5/2017
99	IN	7/30/2015
152	OUT	7/1/2017

Each of these has its own unique situation.

For PREM_CODE=50, the calulcation would be: (7/1/2015-7/30/2016)+(9/21/2016-8/1/2017)
For PREM_CODE=89, the calculation would be: (10/10/2015-8/1/2017)
For PREM_CODE=76, the calculation would be:(7/1/2015-12/1/2015)+(1/1/2016-9/1/2016)+(5/5/2017-8/1/2017)
For PREM_CODE=99, the calculation would be: (7/30/2015-8/1/2017)
For PREM_CODE=152, the calculation would be: (7/1/2015-7/1/2017)

Any and all help would be appreciated!

JoshuaBixby · ‎01-26-2018

Are you getting that error with the first record or part-way through the records? (You can check by seeing if service_days is empty or not).

If a record has a NULL, the cursor returns a NoneType. Are there any gaps in values in the records or is every row of every column populated?

View solution in original post

JoshuaBixby · ‎01-25-2018

The following still feels a bit clunky to me, but it should get the job done. The result is a dictionary of service days by premises code. Since you have dates, and not dates and times, I wasn't sure how you wanted to handle partial service days. I assumed a partial service day counted as a full service day, e.g., getting hooked up on 07/01 and disconnected on 07/02 would be 2 days of service.

import arcpy
import datetime
import itertools


tbl = # path to table containing service events
fields = ["PREM_CODE", "ACTION_DATE", "ACTN_CODE"]
start_period = datetime.date(2015, 7, 1)
end_period = datetime.date(2017, 8, 1)

service_days = dict()
with arcpy.da.SearchCursor(tbl, fields) as cur:
    sorted_events = sorted(cur, key=lambda x: x[0:2])
    for k, g in itertools.groupby(sorted_events, key=lambda x: x[0]):
        prem, date, action = next(g)
        if action == "OUT":
            service_days[prem] = (date.date() - start_period).days + 1
        elif action == "IN":
            service_days[prem] = 0
            in_date = date.date()
            
        for prem, date, action in g:    
            if action == "OUT":
                service_days[prem] += (date.date() - in_date).days + 1
            elif action == "IN":
                in_date = date.date()
                
        if action == "IN":
            service_days[prem] += (end_period - date.date()).days + 1
            ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

If your tables are huge, you could have the cursor do the sorting first instead of dumping the cursor into a list using sorted, but most GIS users aren't really working with huge data sets.

Anonymous User · ‎01-25-2018

Awesome! I'll give this a try!

And the dataset is pretty big (300k rows)

I'll let you know how it goes!

JoshuaBixby · ‎01-25-2018

By modern computing standards, 300k isn't that large. My 3+ year-old laptop can probably process a table with 1 million records in a few seconds, maybe 10 seconds. And with machines having GBs and GBs of RAM, loading 300k records into memory usually is no problem unless the records have lots of large BLOB or verbose text fields.

Anonymous User · ‎01-25-2018

Fantastic, and do you foresee any difference in behavior between CSV, xlsx, xls, etc?

JoshuaBixby · ‎01-25-2018

My code is based on the assumption that ArcGIS is treating ACTION_DATE as a date field. Since CSV, XLS, and XLSX can be a bit loose with data types, I would just confirm the ACTION_DATE field is being treated as date.

Anonymous User · ‎01-25-2018

I'm getting a curious error that seems dependent on file type. Csv's get read but fail with a 'invalid column name error,' but xls and xlsx fail with a 'could not open C:\...'

Anonymous User · ‎01-25-2018

If it's worth anything, I've attached the .csv here as a downloadable item in AGOL: http://www.arcgis.com/home/item.html?id=67c5dd7000094f7ab0049e7d38764a24

JoshuaBixby · ‎01-25-2018

I would just swim downstream and copy the contents into a table in a file geodatabase. Then you can verify the field data types are going to be set correctly.

Anonymous User · ‎01-25-2018

There's the trick!

And does the output come as a total? Or is it broken out by PREM_CODE? I'm looking for the latter