Select to view content in your preferred language

list comprehension to evaluate each element of a list

126
9
8 hours ago
clt_cabq
Frequent Contributor

In my project I am reading a .csv which has two date values, these are at positions 5 and 14 in the row. The date values can either be null (indicated as '' in the .csv) or they can hold an actual date in the format of mm/dd/yyyy. In order to insert these rows into my geodatabase, I have to convert the '' value to a null, otherwise the value is not recognized as valid for a date field.

I can get the following code to work for each field:

empty = line[5]
if empty == '';
  line[5] = None
else:
  line[5]=empty

However, I was thinking I could implement a list comprehension to accomplish this more efficiently by using the following:

datelist = [line[5],line[14]]
datelist = [None if date == '' else date for date in datelist]

This doesn't seem to work, I get an error indicating a type mismatch with a date field.

Any thoughts on what I am misunderstanding or how I can approach this. Since I have a working solution its not critical so I am just trying to improve my skills and knowledge a bit here. Though since I am iterating over 20K records, I suppose every bit of efficiency helps.  

0 Kudos
9 Replies
HaydenWelch
MVP Regular Contributor

Is your null value two single quotes or an empty string?

date1,date2
'',dd/mm/yyy

-- OR --
date1,date2
,dd/mm/yyy

The former will evaluate to "''" when read, so your check is incorrect:

>>> # version 1
>>> print(row[4])
''
>>> repr(row[4])
'"\'\'"'
>>> # version 2
>>> print(row[4])

>>> repr(row[4])
"''"

 

To fix it that way, just modify your condition to check for the literal '' string:

datelist = [line[5],line[14]]
datelist = [None if date == "''" else date for date in datelist]

 

You can also do some fun stuff with the csv.DictReader if you want to avoid indexing:

from collections.abc import Iterator
from typing import Any
from pathlib import Path
from csv import DictReader

def format(line: str, *, sep: str=',') -> list[str]:
    return line.strip().split(sep)

def get_dates(fl: Path) -> Iterator[tuple[str | None, str | None]]:
    with fl.open('rt') as data:
        headers = format(data.readline())
        for line in data:
            row = dict(zip(headers, format(line)))
            yield (
                None if (dt:=row.get('date1')) == "''" else dt, 
                None if (dt:=row.get('date2')) == "''" else dt,
            )

def format_csv(fl: Path) -> Iterator[dict[str, Any]]:
    with fl.open('rt') as data:
        headers = format(data.readline())
        for row in DictReader(data, headers):
            row['date1'] = None if (dt:=row.get('date1')) == "''" else dt
            row['date2'] = None if (dt:=row.get('date1')) == "''" else dt
            yield row

 

This creates a generator for the CSV rows, which means you are only loading in one row at a time, which lets you filter them before adding them to a memory list. If you have a ton of data this can prevent you from loading all of it into memory every time you want to check for something:

# Initialize the generator (nothing is loaded in)            
reader = format_csv(Path('data.csv'))

# Create a list that contains only the rows of the
# csv where both date fields are populated
valid_rows = [
    row for row in reader 
    if row['date1'] and row['date2']
]

 

Additional Note on Generators:

Once a generator is exhausted (it's been iterated with a comprehension/for loop), It can't be iterated again:

>>> x = (i for i in range(10)) # generator comprehension syntax
>>> list(x)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(x)
[]
>>>

 

clt_cabq
Frequent Contributor

@HaydenWelch thanks! this is great stuff, I'll see if the quoting example you provide works and report back.

TonyAlmeida
MVP Regular Contributor

The problem is that after building  datelist, you still have to write the values back to line. Right now you're updating the list but never putting the results back into the row, so the original line[5] and line[14] are unchanged and still hold ''

datelist = [line[5], line[14]]
datelist = [None if date == '' else date for date in datelist]

# You're missing this:
line[5], line[14] = datelist

 

Try:

line[5], line[14] = (None if line[i] == '' else line[i] for i in (5, 14))
clt_cabq
Frequent Contributor

@TonyAlmeida I'll have to test this out - this pattern seems to work in another case where I am converting null coordinates to 0s, so it is unclear to me why it wouldn't work in this case also. For context perhaps it helps to understand that in reading the .csv I'm doing this:

readcsv = csv.reader(txtfile,delimiter=',')
next(readcsv)
  for line in readcsv:
  new_line = []
... code to update values...
new_line.extend(line)
insertcursor.insertRow(new_line)

 

0 Kudos
TonyAlmeida
MVP Regular Contributor

That's the issue. You're building new_line from line after your modifications, but if the list comprehension results aren't being written back to line before the extend, the original '' values are what get copied into new_line.

Since you're constructing new_line anyway, you don't need to write back to line at all. Just build the conversion into however you're populating new_line. The simplest approach:

for line in readcsv:
    line[5]  = None if line[5]  == '' else line[5]
    line[14] = None if line[14] == '' else line[14]
    new_line = []
    new_line.extend(line)
    insertcursor.insertRow(new_line)

 

Or if you want to keep it as a comprehension and avoid the writeback issue entirely, apply it directly when extending:

for line in readcsv:
    new_line = [None if (i in (5, 14) and line[i] == '') else line[i] for i in range(len(line))]
    insertcursor.insertRow(new_line)

That rebuilds the whole row in one pass and drops the extend entirely. Cleaner, and no ambiguity about whether the mutation happened before or after the copy.

HaydenWelch
MVP Regular Contributor

If there's mutation of the line, I think my generator solution is probably the best bet. It turns the operation into multiple steps so you process the line, then convert those processed rows to a list:

def process(csv: Path) -> Iterator[list[Any]]:
    ... do line processing
    yield processed_line

valid_lines = [l for l in process(Path('data.csv')) if <predicate>]
invalid_lines = [l for l in process(Path('data.csv')) if not <predicate>]

-- OR --

valid_lines = []
invalid_lines = []
for line in process(Path('data.csv')):
    if not <predicate>:
        invalid_lines.append(line)
        continue
    valid_lines.append(line)

 

Trying to mutate a list of arbitrary values in a comprehension is a pain. Ideally you want a mapping of values to fields, or just a direct list[i] = <some conversion of list[i]>. 

0 Kudos
clt_cabq
Frequent Contributor

@HaydenWelch and @TonyAlmeida appreciate your help and suggestions... I'll play with these different approaches and see which seems to work best for me!

0 Kudos
RPGIS
by MVP Regular Contributor
MVP Regular Contributor

Just putting in my two cents into this conversation but it could also be that the mapping of some fields could result in some errors when populating from a csv to a dbf or feature table in a database.

I would utilize functions that @HaydenWelch provided to extract the rows in the csv while also doing the same for the table you want to insert records into. You can then compare both sets of values to see if there is any mismatch in datatypes so that you can map the required values.

It may also require the use of datetime to convert any text that indicates a time to an actual datetime value if that is another issue you are running into. datetime can also be validated using isinstance(value, datetime) to verify that the value is a datetime value.

 

HaydenWelch
MVP Regular Contributor

If the date is stored as a string, you will need to convert it to a datetime object before inserting it into a date field I believe. At least if it's not stored in an ISO format

0 Kudos