How to count the number of days between events in a flat file (Python, R, or VBA)?

1823
15
Jump to solution
01-25-2018 06:09 AM
by Anonymous User
Not applicable

Hello all!

So I am working on a project and I've hit a roadblock.

What I want to do is find the number of days that a parcel (identified by 'PREM_CODE') had water service during a two year period.

However, the structure of the data is giving me some issues. Each row in this table represents a water meter reading and whether the water was turned on or off. If the water was turned on, the value in the field 'ACTN_CODE' would be 'IN,' if the meter was being turned off, the value would be 'OUT.'

There are a few assumptions that I need to put into a logic to deal with these that are giving me trouble. Those assumptions are:

If a premises's (PREM_CODE) first chronological event is a turn-on event (i.e. 'ACTN_CODE'='IN'), then use that date should be used as the starting date (Below Example 89 and 99).

If a premises's (PREM_CODE) first chronological event is a turn-off event (i.e. 'ACTN_CODE'='OUT'), then subtract the number of days between the period's first date (7/1/2015) to the date of the event (ACTION_DATE) (Below Example 50,76, and 152).

Then, after that first logic is implemented, I need to count the days between any subsequent 'IN' and 'OUT' events.

If a PREM_CODE's last entry is an 'IN' event, then I need to subtract that 'IN' event's date from the end date, which is 8/1/2017.

And lastly, if a PREM_CODE's last entry is an 'OUT,' then no further calculations should be made.

Here is a sample of what this could look like:

**Start date = 7/1/2015
**End date = 8/1/2017

PREM_CODEACTN_CODEACTION_DATE
50OUT7/30/2016
50IN9/21/2016
89IN10/10/2015
76OUT12/1/2015
76IN1/1/2016
76OUT9/1/2016
76IN5/5/2017
99IN7/30/2015
152OUT7/1/2017

Each of these has its own unique situation.

For PREM_CODE=50, the calulcation would be: (7/1/2015-7/30/2016)+(9/21/2016-8/1/2017)
For PREM_CODE=89, the calculation would be: (10/10/2015-8/1/2017)
For PREM_CODE=76, the calculation would be:(7/1/2015-12/1/2015)+(1/1/2016-9/1/2016)+(5/5/2017-8/1/2017)
For PREM_CODE=99, the calculation would be: (7/30/2015-8/1/2017)
For PREM_CODE=152, the calculation would be: (7/1/2015-7/1/2017)

Any and all help would be appreciated!

0 Kudos
1 Solution

Accepted Solutions
JoshuaBixby
MVP Esteemed Contributor

Are you getting that error with the first record or part-way through the records?  (You can check by seeing if service_days is empty or not).

If a record has a NULL, the cursor returns a NoneType.  Are there any gaps in values in the records or is every row of every column populated?

View solution in original post

15 Replies
JoshuaBixby
MVP Esteemed Contributor

The following still feels a bit clunky to me, but it should get the job done.  The result is a dictionary of service days by premises code.  Since you have dates, and not dates and times, I wasn't sure how you wanted to handle partial service days.  I assumed a partial service day counted as a full service day, e.g., getting hooked up on 07/01 and disconnected on 07/02 would be 2 days of service.

import arcpy
import datetime
import itertools


tbl = # path to table containing service events
fields = ["PREM_CODE", "ACTION_DATE", "ACTN_CODE"]
start_period = datetime.date(2015, 7, 1)
end_period = datetime.date(2017, 8, 1)

service_days = dict()
with arcpy.da.SearchCursor(tbl, fields) as cur:
    sorted_events = sorted(cur, key=lambda x: x[0:2])
    for k, g in itertools.groupby(sorted_events, key=lambda x: x[0]):
        prem, date, action = next(g)
        if action == "OUT":
            service_days[prem] = (date.date() - start_period).days + 1
        elif action == "IN":
            service_days[prem] = 0
            in_date = date.date()
            
        for prem, date, action in g:    
            if action == "OUT":
                service_days[prem] += (date.date() - in_date).days + 1
            elif action == "IN":
                in_date = date.date()
                
        if action == "IN":
            service_days[prem] += (end_period - date.date()).days + 1
            ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

If your tables are huge, you could have the cursor do the sorting first instead of dumping the cursor into a list using sorted, but most GIS users aren't really working with huge data sets.

by Anonymous User
Not applicable

Awesome! I'll give this a try!

And the dataset is pretty big (300k rows)

I'll let you know how it goes!

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

By modern computing standards, 300k isn't that large.  My 3+ year-old laptop can probably process a table with 1 million records in a few seconds, maybe 10 seconds.  And with machines having GBs and GBs of RAM, loading 300k records into memory usually is no problem unless the records have lots of large BLOB or verbose text fields.

0 Kudos
by Anonymous User
Not applicable

Fantastic, and do you foresee any difference in behavior between CSV, xlsx, xls, etc?

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

My code is based on the assumption that ArcGIS is treating ACTION_DATE as a date field.  Since CSV, XLS, and XLSX can be a bit loose with data types, I would just confirm the ACTION_DATE field is being treated as date.

0 Kudos
by Anonymous User
Not applicable

I'm getting a curious error that seems dependent on file type. Csv's get read but fail with a 'invalid column name error,' but xls and xlsx fail with a 'could not open C:\...'

0 Kudos
by Anonymous User
Not applicable

If it's worth anything, I've attached the .csv here as a downloadable item in AGOL: http://www.arcgis.com/home/item.html?id=67c5dd7000094f7ab0049e7d38764a24  

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

I would just swim downstream and copy the contents into a table in a file geodatabase.  Then you can verify the field data types are going to be set correctly.

0 Kudos
by Anonymous User
Not applicable

There's the trick!

And does the output come as a total? Or is it broken out by PREM_CODE? I'm looking for the latter

0 Kudos