Compare Two CSV Files; Understanding DICT Statement

WilliamCole · ‎03-07-2021

Understanding Python dict Statement:

I am new to Python, so forgive me if I’m asking questions about what should be obvious. If this is posted in the wrong place, my apologies: please let me know where I should be posting this.

Using Python 2.7.

I am re-using and modifying 2014 code that was written by Martijn Pieters and posted to StackOverflow [message 5268929]. It compares two CSV files and reports on the differences between the master and host files. I thank him for this head start. I am comparing a file exported from ArcGIS using ArcPy. Here it is:

 1  #2/3/2021
 2  #Source: Stack Overflow "Compare two CSV files and search for similar items"
 3
 4  import csv
 5
 6  with open('masterlist.csv', 'rb') as master:
 7      master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))
 8
 9  with open('hosts.csv', 'rb') as hosts:
10      with open('results.csv', 'wb') as results:
11          reader = csv.reader(hosts)
12          writer = csv.writer(results)
13
14          for row in reader:
15              index = master_indices.get(row[3])
16              if index is not None:
17                  message = 'FOUND in master list (row {})'.format(index)
18              else:
19                  message = 'NOT FOUND in master list'
20              writer.writerow(row + [message])

I sort of understand most of it save one section of line 7 shown in the figure below (though not sure I’ve parsed [the colors] the components correctly):

The part highlighted in green is the one element of the dict entry I simply do not know how to read/interpret [what is ‘r’?; what is ‘i’ and what does ‘i) for i’ do; etc.]. If someone could point me to a place that I could learn about dict at an introductory level, hopefully with enough specificity that would help me to sort this out; that would be great. Just trying to get my head around this.

Thank you very much.

DavidPike · ‎03-07-2021

So the enumerate function hands back 2 components on each pass, 'i' and 'r' (these are just 'made up' names which could be anything; we just give them simple names).

In this case, i is the index of the csv row, r is the entire row. r[1] is the 2nd cell in the CSV row (indexes start at 0!).
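A small sketch of what enumerate hands back for each row. The sample rows are made up for illustration and stand in for what csv.reader would yield; it is written for Python 3, though the loop itself should behave the same in 2.7.

```python
# Hypothetical rows, standing in for what csv.reader(master) would yield.
sample_rows = [
    ["id", "hostname", "owner"],
    ["1", "alpha", "ops"],
    ["2", "bravo", "dev"],
]

seen = []
for i, r in enumerate(sample_rows):
    # i is the 0-based position of the row, r is the whole row (a list),
    # and r[1] is the second cell of that row.
    seen.append((i, r[1]))
    print(i, r, r[1])
```

Note that the header row counts as row 0 here, since nothing skips it.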

enumerate returns an iterator of (index, row) pairs, and the generator expression around it maps each r[1] value to its row index so that dict() can collect the pairs into a dictionary.

dict() can be used to construct key-value pair dictionary values:

e.g. dict([('a', 1), ('b', 2)])

So the big sequence of r[1] values and their corresponding row indexes goes into the dictionary constructor.
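To make the constructor concrete, here is a minimal sketch of dict() being fed key-value pairs, first from a list and then from a generator expression. The names pairs, from_list and from_genexp are only for illustration.

```python
# dict() accepts any iterable of (key, value) pairs.
pairs = [("a", 1), ("b", 2)]
from_list = dict(pairs)

# A generator expression works the same way, without building a list first.
from_genexp = dict((letter, number) for letter, number in pairs)

print(from_list)
print(from_list == from_genexp)
```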

Not explained very well at all I think 😞

Anonymous User · ‎03-07-2021

On the surface, r would be the values (data) returned in the csv.reader(master)

i is the enumerated value (0,1, 2, 3, 4 etc) of that data.

i, r is the tupled (index, data) result of the enumerate(csv.reader(master)) function.

dict((r[1], i) for i, r in enumerate(csv.reader(master))) passes a generator expression (a close cousin of a list comprehension) to dict(), which builds a dictionary from the values it yields.

edit to add:

(r[1], i) for i, r in enumerate(csv.reader(master)) produces a stream of tuples, one per row, which dict() turns into a dictionary.

The value at position r[1] becomes the key and i becomes the value.
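For comparison, the same dictionary can be built with an ordinary loop. This is only a sketch for Python 3: the CSV text is made up and held in memory with io.StringIO instead of being read from masterlist.csv.

```python
import csv
import io

# Made-up CSV content standing in for masterlist.csv (assumption for the demo).
master = io.StringIO("id,hostname\n1,alpha\n2,bravo\n")

master_indices = {}
for i, r in enumerate(csv.reader(master)):
    # Same effect as the (r[1], i) pair in the one-liner:
    # the key is the second cell, the value is the row number.
    master_indices[r[1]] = i

print(master_indices.get("bravo"))    # row number when the key exists
print(master_indices.get("charlie"))  # None when the key is absent
```

The .get() call at the end is what the script relies on to decide between FOUND and NOT FOUND.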

JoshuaBixby · ‎03-07-2021

Just for learning purposes, it is helpful to look to the documentation: 5.5 Dictionaries -- Data Structures -- Python 3.9.2 documentation

The dict() constructor builds dictionaries directly from sequences of key-value pairs

Using an annotated, multi-line format to represent the single-line of code:

iterable = csv.reader(master)                     # CSV reader object is an iterable

dictionary = dict(                                # dictionary constructor
    (                                             # generator expression
        (value[1], count)                         # key-value pair
        for count, value in                       # for comprehension
        enumerate(                                # enumerating function
            iterable                              # iterable or sequence
        )
    )
)

master_indices = dictionary                       # master_indices is a dictionary

jcarlson · ‎03-07-2021

The green part is basically a for-loop, though formatted differently. The two bits of code below are essentially equivalent.

new_list = []

for item in list_of_items:
    new_list.append(str(item))

new_list = [str(item) for item in list_of_items]

The first feels a lot clearer and easier to follow, especially if you're new to Python, but the second is more compact and usually a bit faster. Still, they both end at the same point.

But supposing you're using that variable new_list simply as an input to another function, now the first code sequence creates a useless variable that's going to hang out for the rest of your code, taking up memory. The second makes it possible to avoid this.
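A rough sketch of that difference, using a made-up list_of_items and str.join as the receiving function (both are just assumptions for the example):

```python
list_of_items = [1, 2, 3]  # hypothetical input

# With a named intermediate list that lingers afterwards:
new_list = [str(item) for item in list_of_items]
joined_via_list = ", ".join(new_list)

# Passing the expression straight to the function, with no extra name:
joined_direct = ", ".join(str(item) for item in list_of_items)

print(joined_via_list == joined_direct)
```

The dict(...) call in the original script follows the second pattern: the pairs go straight into dict() without ever being given a name.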

@Anonymous User and @DavidPike have explained the rest of it well, I think, but list comprehensions can get very complex, and if you're not familiar with them, I'd hardly blame you for being a bit thrown by them. I still don't use them right half the time.

Check out some of the examples in the Python docs to get a clearer picture of what they are and how they can be used.

- Josh Carlson
Kendall County GIS

DavidPike · ‎03-07-2021

@jcarlson Indeed! The generator expression inside the dict() call did need a long think. I also think a lot of my own stuff could be put in a one-liner, but I feel I follow what's going on in the code more logically when I make it verbose, and I 'trust' it more. It's something I need to start making an effort to understand/implement if I want to get better.

WilliamCole · ‎03-09-2021

Thank you all for being so kind as to take the time to reply. I'm working my way through your responses and already seeing some daylight. I'll post later when I get my head around this in a useful way. I'll do a couple of simple scripts and work up from there.