Hello all!
So I am working on a project and I've hit a roadblock.
What I want to do is find the number of days that a parcel (identified by 'PREM_CODE') had water service during a two year period.
However, the structure of the data is giving me some issues. Each row in this table represents a water meter reading and whether the water was turned on or off. If the water was turned on, the value in the field 'ACTN_CODE' would be 'IN,' if the meter was being turned off, the value would be 'OUT.'
There are a few assumptions that I need to put into a logic to deal with these that are giving me trouble. Those assumptions are:
If a premises's (PREM_CODE) first chronological event is a turn-on event (i.e. 'ACTN_CODE'='IN'), then use that date should be used as the starting date (Below Example 89 and 99).
If a premises's (PREM_CODE) first chronological event is a turn-off event (i.e. 'ACTN_CODE'='OUT'), then subtract the number of days between the period's first date (7/1/2015) to the date of the event (ACTION_DATE) (Below Example 50,76, and 152).
Then, after that first logic is implemented, I need to count the days between any subsequent 'IN' and 'OUT' events.
If a PREM_CODE's last entry is an 'IN' event, then I need to subtract that 'IN' event's date from the end date, which is 8/1/2017.
And lastly, if a PREM_CODE's last entry is an 'OUT,' then no further calculations should be made.
Here is a sample of what this could look like:
**Start date = 7/1/2015
**End date = 8/1/2017
PREM_CODE | ACTN_CODE | ACTION_DATE |
---|---|---|
50 | OUT | 7/30/2016 |
50 | IN | 9/21/2016 |
89 | IN | 10/10/2015 |
76 | OUT | 12/1/2015 |
76 | IN | 1/1/2016 |
76 | OUT | 9/1/2016 |
76 | IN | 5/5/2017 |
99 | IN | 7/30/2015 |
152 | OUT | 7/1/2017 |
Each of these has its own unique situation.
For PREM_CODE=50, the calulcation would be: (7/1/2015-7/30/2016)+(9/21/2016-8/1/2017)
For PREM_CODE=89, the calculation would be: (10/10/2015-8/1/2017)
For PREM_CODE=76, the calculation would be:(7/1/2015-12/1/2015)+(1/1/2016-9/1/2016)+(5/5/2017-8/1/2017)
For PREM_CODE=99, the calculation would be: (7/30/2015-8/1/2017)
For PREM_CODE=152, the calculation would be: (7/1/2015-7/1/2017)
Any and all help would be appreciated!
Solved! Go to Solution.
Are you getting that error with the first record or part-way through the records? (You can check by seeing if service_days is empty or not).
If a record has a NULL, the cursor returns a NoneType. Are there any gaps in values in the records or is every row of every column populated?
The following still feels a bit clunky to me, but it should get the job done. The result is a dictionary of service days by premises code. Since you have dates, and not dates and times, I wasn't sure how you wanted to handle partial service days. I assumed a partial service day counted as a full service day, e.g., getting hooked up on 07/01 and disconnected on 07/02 would be 2 days of service.
import arcpy
import datetime
import itertools
tbl = # path to table containing service events
fields = ["PREM_CODE", "ACTION_DATE", "ACTN_CODE"]
start_period = datetime.date(2015, 7, 1)
end_period = datetime.date(2017, 8, 1)
service_days = dict()
with arcpy.da.SearchCursor(tbl, fields) as cur:
sorted_events = sorted(cur, key=lambda x: x[0:2])
for k, g in itertools.groupby(sorted_events, key=lambda x: x[0]):
prem, date, action = next(g)
if action == "OUT":
service_days[prem] = (date.date() - start_period).days + 1
elif action == "IN":
service_days[prem] = 0
in_date = date.date()
for prem, date, action in g:
if action == "OUT":
service_days[prem] += (date.date() - in_date).days + 1
elif action == "IN":
in_date = date.date()
if action == "IN":
service_days[prem] += (end_period - date.date()).days + 1
If your tables are huge, you could have the cursor do the sorting first instead of dumping the cursor into a list using sorted, but most GIS users aren't really working with huge data sets.
Awesome! I'll give this a try!
And the dataset is pretty big (300k rows)
I'll let you know how it goes!
By modern computing standards, 300k isn't that large. My 3+ year-old laptop can probably process a table with 1 million records in a few seconds, maybe 10 seconds. And with machines having GBs and GBs of RAM, loading 300k records into memory usually is no problem unless the records have lots of large BLOB or verbose text fields.
Fantastic, and do you foresee any difference in behavior between CSV, xlsx, xls, etc?
My code is based on the assumption that ArcGIS is treating ACTION_DATE as a date field. Since CSV, XLS, and XLSX can be a bit loose with data types, I would just confirm the ACTION_DATE field is being treated as date.
I'm getting a curious error that seems dependent on file type. Csv's get read but fail with a 'invalid column name error,' but xls and xlsx fail with a 'could not open C:\...'
If it's worth anything, I've attached the .csv here as a downloadable item in AGOL: http://www.arcgis.com/home/item.html?id=67c5dd7000094f7ab0049e7d38764a24
I would just swim downstream and copy the contents into a table in a file geodatabase. Then you can verify the field data types are going to be set correctly.
There's the trick!
And does the output come as a total? Or is it broken out by PREM_CODE? I'm looking for the latter