Python Implementation Question

850
13
Jump to solution
12-23-2019 02:46 PM
SamLee1
New Contributor

Hello I am trying to figure out the if there is a way to compare two excel file's dates/time?

I wrote a python script, attached below, that downloads a excel file from a website and when I run it again

it pushed the file that was downloaded first to an archived folder and the new file that gets downloads foes to where the old was was originally at and it becomes am excel to csv file in the end.

Excel File Link: https://www.gosolarcalifornia.ca.gov/equipment/documents/Grid_Support_Inverter_List_Full_Data.xlsm 

Website Link: Inverters 

I got this idea so far:import pandas as pd
#Reading two Excel Sheets

sheet1 = pd.read_excel(r'Book1.xlsx')
sheet2 = pd.read_excel(r'Book2.xlsx')

# Iterating the Columns Names of both Sheets
for i,j in zip(sheet1,sheet2):

# Creating empty lists to append the columns values
a,b =[],[]

# Iterating the columns values
for m, n in zip(sheet1,sheet2:(

# Appending values in lists
a.append(m)
b.append(n)

# Sorting the lists
a.sort()
b.sort()

# Iterating the list's values and comparing them
for m, n in zip(range(len(a)), range(len(b))):
if a != b:
print('Column name : \'{}\' and Row Number : {}'.format(i,m))

Is there a way I can do this logic:
if(file1 != file2):
   run script

else

   do nothing

It's the conditional statement that I am stuck on.

*file 1 is the excel on the website and file 2 is the download excel file on the computer

Tags (2)
0 Kudos
1 Solution

Accepted Solutions
BlakeTerhune
MVP Regular Contributor

Line 4 will return a -1 if fileNameHTML is not found in indexHTML. So, line 5 checks for the presence of a <td> tag with the filename. In the event it is found (the value is not -1), it then checks for the first ">" and the first "<" in indexHTML on lines 6 and 7. Using the returned indexes, it slices indexHTML, strips the leading and trailing white space, and formats the text into a datetime object.

Yes, that's exactly right. That was the simplest way I could think (with builtin libraries) to get the text value of the tag in the HTML table. This does assume the filenames are all unique, which I think is safe. Once it finds the file name, it assumes the next tag will contain the date. I simplified this to just look for the first > (indiciating the last character of the opening tag) and the next < after that (indicating the first character of the closing tag). The value of the tag (the date) will be between those two characters. Hopefully that is clear and you can comment that section in a way that makes sense to you.

View solution in original post

0 Kudos
13 Replies
BlakeTerhune
MVP Regular Contributor

If I understand correctly, you want a way to compare the created or modified date of two files? Here's one solution I found.

0 Kudos
Arne_Gelfert
Regular Contributor

Sam, are you just trying to see if the files are the same? If so, how about doing something like:

>>> import filecmp
>>> file1 = r"c:\temp\foo.txt"
>>> file2 = r"c:\temp\bar.txt"
>>> filecmp.cmp(file1, file2)
True
>>> 

Of course, you can't compare XLS and CSV, you'd have to be comparing apples to apples.

0 Kudos
SamLee1
New Contributor

Hello Arne,

Thanks for the help.

How about if the excel file is on the website, how can I compare an excel file on my desktop with an excel file on a website without downloading it?

0 Kudos
BlakeTerhune
MVP Regular Contributor

Here is kind of a hacky solution that just parses the file index page for the file you want and gets the date listed in the next cell of the same table row. You can then compare this date with the date of your local file to decide if you want to download it.

import datetime
import os
import requests
import time

indexHTML = requests.get("https://www.gosolarcalifornia.ca.gov/equipment/documents/").text
fileName = "Grid_Support_Inverter_List_Full_Data.xlsm"
fileNameHTML = '<td><a href="{0}">{0}</a></td>'.format(fileName)
fileNameHTMLIndex = indexHTML.find(fileNameHTML)
if fileNameHTMLIndex != -1:
    dateStartIndex = indexHTML.find(">", fileNameHTMLIndex+len(fileNameHTML)) + 1
    dateEndIndex = indexHTML.find("<", dateStartIndex)
    webFileDate = datetime.datetime.strptime(indexHTML[dateStartIndex:dateEndIndex].strip(), "%Y-%m-%d %H:%M")
else:
    raise Exception("Unable to find fileName {}".format(fileName))

localFilePath = r"your\local\path\here"
localFileDate = os.path.getctime(localFilePath)
localFileDate = datetime.datetime.strptime(time.ctime(localFileDate), "%a %b %d %H:%M:%S %Y")
if webFileDate > localFileDate:
    print("File on the web is newer!")
    # download web file to replace local file‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Alternatively, just download the file and use the method Arne Gelfert‌ mentioned to compare them for differences. The file isn't large so it shouldn't take excessively long to download and the logic is much simpler.

0 Kudos
SamLee1
New Contributor

Hello Blake,

Thanks, I believe it works because I added an else statement, since the file on the website hasn't been updated yet.

To clarify, we are comparing the date of those two files despite one of them being a csv file which is the file on the local

drive and not the website, right?

0 Kudos
BlakeTerhune
MVP Regular Contributor

Correct. My method is only comparing file dates so it doesn't matter what the file actually is.

0 Kudos
SamLee1
New Contributor

Is it comparing the modified date or just date in general?

0 Kudos
BlakeTerhune
MVP Regular Contributor

My code sample is comparing the date listed for the file on the website (not sure what date it is) with the created date for the file on your local drive. You can adjust the line with getctime() to getmtime() if you want the modified date instead.

You might want to do some testing to make sure the file you download from the website has a created date either when it was downloaded or that matches what's listed on the website.

0 Kudos
SamLee1
New Contributor

Hello Blake,

May you explain a little bit more about the code, specifically this part of the code:

indexHTML = requests.get("https://www.gosolarcalifornia.ca.gov/equipment/documents/").text
fileName = "Grid_Support_Inverter_List_Full_Data.xlsm"
fileNameHTML = '<td><a href="{0}">{0}</a></td>'.format(fileName)
fileNameHTMLIndex = indexHTML.find(fileNameHTML)
if fileNameHTMLIndex != -1:
    dateStartIndex = indexHTML.find(">", fileNameHTMLIndex+len(fileNameHTML)) + 1
    dateEndIndex = indexHTML.find("<", dateStartIndex)
    webFileDate = datetime.datetime.strptime(indexHTML[dateStartIndex:dateEndIndex].strip(), "%Y-%m-%d %H:%M")
else:
    raise Exception("Unable to find fileName {}".format(fileName))‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Line 4 will return a -1 if fileNameHTML is not found in indexHTML. So, line 5 checks for the presence of a <td> tag with the filename. In the event it is found (the value is not -1), it then checks for the first ">" and the first "<" in indexHTML on lines 6 and 7. Using the returned indexes, it slices indexHTML, strips the leading and trailing white space, and formats the text into a datetime object.

0 Kudos