In my project I am reading a .csv which has two date values, these are at positions 5 and 14 in the row. The date values can either be null (indicated as '' in the .csv) or they can hold an actual date in the format of mm/dd/yyyy. In order to insert these rows into my geodatabase, I have to convert the '' value to a null, otherwise the value is not recognized as valid for a date field.
I can get the following code to work for each field:
empty = line[5]
if empty == '':
    line[5] = None
else:
    line[5] = empty
However, I was thinking I could implement a list comprehension to accomplish this more efficiently:
datelist = [line[5], line[14]]
datelist = [None if date == '' else date for date in datelist]
This doesn't seem to work; I get an error indicating a type mismatch with a date field.
Any thoughts on what I am misunderstanding, or how else I can approach this? Since I have a working solution it's not critical, so I am just trying to improve my skills and knowledge a bit here. Though since I am iterating over 20K records, I suppose every bit of efficiency helps.
Is your null value two single quotes or an empty string?
date1,date2
'',dd/mm/yyyy
-- OR --
date1,date2
,dd/mm/yyyy
The former will evaluate to "''" when read, so your check is incorrect:
>>> # version 1
>>> print(row[4])
''
>>> repr(row[4])
'"\'\'"'
>>> # version 2
>>> print(row[4])

>>> repr(row[4])
"''"
If it's the former, just modify your condition to check for the literal '' string:
datelist = [line[5],line[14]]
datelist = [None if date == "''" else date for date in datelist]
You can also do some fun stuff with the csv.DictReader if you want to avoid indexing:
from collections.abc import Iterator
from typing import Any
from pathlib import Path
from csv import DictReader

def format(line: str, *, sep: str = ',') -> list[str]:
    return line.strip().split(sep)

def get_dates(fl: Path) -> Iterator[tuple[str | None, str | None]]:
    with fl.open('rt') as data:
        headers = format(data.readline())
        for line in data:
            row = dict(zip(headers, format(line)))
            yield (
                None if (dt := row.get('date1')) == "''" else dt,
                None if (dt := row.get('date2')) == "''" else dt,
            )

def format_csv(fl: Path) -> Iterator[dict[str, Any]]:
    with fl.open('rt') as data:
        headers = format(data.readline())
        for row in DictReader(data, headers):
            row['date1'] = None if (dt := row.get('date1')) == "''" else dt
            row['date2'] = None if (dt := row.get('date2')) == "''" else dt
            yield row
This creates a generator for the CSV rows, which means only one row is loaded at a time, letting you filter rows before adding them to an in-memory list. If you have a ton of data, this prevents loading all of it into memory every time you want to check for something:
# Initialize the generator (nothing is loaded in)
reader = format_csv(Path('data.csv'))
# Create a list that contains only the rows of the
# csv where both date fields are populated
valid_rows = [
    row for row in reader
    if row['date1'] and row['date2']
]
Additional Note on Generators:
Once a generator is exhausted (it's been fully iterated by a comprehension or for loop), it can't be iterated again:
>>> x = (i for i in range(10)) # generator expression syntax
>>> list(x)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(x)
[]
>>>
@HaydenWelch thanks! this is great stuff, I'll see if the quoting example you provide works and report back.
The problem is that after building datelist, you still have to write the values back to line. Right now you're updating the new list but never putting the results back into the row, so the original line[5] and line[14] are unchanged and still hold ''.
datelist = [line[5], line[14]]
datelist = [None if date == '' else date for date in datelist]
# You're missing this:
line[5], line[14] = datelist
Try:
line[5], line[14] = (None if line[i] == '' else line[i] for i in (5, 14))
@TonyAlmeida I'll have to test this out - this pattern seems to work in another case where I am converting null coordinates to 0s, so it is unclear to me why it wouldn't work in this case also. For context perhaps it helps to understand that in reading the .csv I'm doing this:
readcsv = csv.reader(txtfile, delimiter=',')
next(readcsv)
for line in readcsv:
    new_line = []
    # ... code to update values ...
    new_line.extend(line)
    insertcursor.insertRow(new_line)
That's the issue. You're building new_line from line after your modifications, but if the list comprehension results aren't being written back to line before the extend, the original '' values are what get copied into new_line.
Since you're constructing new_line anyway, you don't need to write back to line at all. Just build the conversion into however you're populating new_line. The simplest approach:
for line in readcsv:
    line[5] = None if line[5] == '' else line[5]
    line[14] = None if line[14] == '' else line[14]
    new_line = []
    new_line.extend(line)
    insertcursor.insertRow(new_line)
Or if you want to keep it as a comprehension and avoid the writeback issue entirely, apply it directly when extending:
for line in readcsv:
    new_line = [None if (i in (5, 14) and line[i] == '') else line[i] for i in range(len(line))]
    insertcursor.insertRow(new_line)
That rebuilds the whole row in one pass and drops the extend entirely. Cleaner, and there's no ambiguity about whether the mutation happened before or after the copy.
If there's mutation of the line, I think my generator solution is probably the best bet. It splits the operation into multiple steps: you process each line, then convert the processed rows to a list:
def process(csv: Path) -> Iterator[list[Any]]:
    # ... do line processing
    yield processed_line

valid_lines = [l for l in process(Path('data.csv')) if <predicate>]
invalid_lines = [l for l in process(Path('data.csv')) if not <predicate>]
-- OR --
valid_lines = []
invalid_lines = []
for line in process(Path('data.csv')):
    if not <predicate>:
        invalid_lines.append(line)
        continue
    valid_lines.append(line)
Trying to mutate a list of arbitrary values in a comprehension is a pain. Ideally you want a mapping of values to fields, or just a direct list[i] = <some conversion of list[i]>.
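To sketch what that mapping of values to fields could look like (the converters dict and convert_row helper are hypothetical names, not anything from the thread):

```python
# Hypothetical mapping of column index -> converter function.
# Columns without an entry pass through unchanged.
converters = {
    5: lambda v: None if v == '' else v,
    14: lambda v: None if v == '' else v,
}

def convert_row(row: list) -> list:
    """Apply per-column converters, leaving all other columns untouched."""
    return [converters.get(i, lambda v: v)(v) for i, v in enumerate(row)]
```

This keeps the null-handling logic in one place, so adding another date column is just another dict entry rather than another comprehension clause.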
@HaydenWelch and @TonyAlmeida appreciate your help and suggestions... I'll play with these different approaches and see which seems to work best for me!
Just putting my two cents into this conversation, but it could also be that the mapping of some fields results in errors when populating a dbf or feature table in a database from a csv.
I would use the functions @HaydenWelch provided to extract the rows in the csv, while also doing the same for the table you want to insert records into. You can then compare both sets of values to see if there is any mismatch in datatypes, so you can map the required values.
It may also require the use of datetime to convert any text that indicates a time into an actual datetime value, if that is another issue you are running into. A datetime can also be validated using isinstance(value, datetime) to verify that the value really is a datetime.
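A minimal sketch of that kind of check, assuming a hypothetical schema mapping of field names to expected Python types (the schema and find_mismatches names are made up for illustration):

```python
from datetime import datetime

# Hypothetical expected schema for the target table
schema = {'name': str, 'date1': datetime, 'date2': datetime}

def find_mismatches(row: dict) -> list[str]:
    """Return the field names whose values don't match the expected type.

    None is treated as an acceptable (null) value for any field.
    """
    return [
        field for field, expected in schema.items()
        if row.get(field) is not None and not isinstance(row[field], expected)
    ]
```

Running this over a few rows before inserting would surface fields that are still raw strings when the table expects a date.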
If the date is stored as a string, you will need to convert it to a datetime object before inserting it into a date field, I believe, at least if it's not stored in an ISO format.