
BeautifulSoup Detect Change Trigger

04-04-2020 04:24 PM
JaredPilbeam2
MVP Regular Contributor

I have a script that uses bs4 to scrape a webpage and grab a string such as "Last Updated 4/3/2020, 8:28 p.m.". I then assign this string to a variable and send it in an email. The script is scheduled to run once a day, but the date and time on the website only change every other day. So, instead of emailing every time I run the script, I'd like to set up a trigger so that it sends only when the date is different. How do I configure the script to detect that change?

'''Checks municipal websites for changes in meal assistance during C19 pandemic'''

# Import requests (to download the page)
import requests
# Import BeautifulSoup (to parse what we download)
from bs4 import BeautifulSoup
# Import win32com (Python module that controls Outlook)
import win32com.client
from win32com.client import Dispatch, constants
import re
import urllib
import urllib.request

#list of urls
urls = ['http://www.vofil.com/covid19_updates']

#set the headers like we are a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

#download the homepage
response = requests.get(urls[0], headers=headers)
#parse the downloaded homepage and grab all text
soup = BeautifulSoup(response.text, "lxml")
#Find string
last_update_fra = soup.findAll(string=re.compile("Last Updated"))
print(last_update_fra)

#Put something here (if else..?) to trigger an email.

#I left off email block...

Here's where the string is in the HTML:

<h1>COVID-19 News Updates</h1>


<div><br></div><div>Last Updated 4/3/2020, 12:08 p.m.</div>

Accepted Solutions
JoshuaBixby
MVP Esteemed Contributor

Your current script is "stateless" in that it has no memory/awareness of whether it was run before nor what those results were from previous runs.  If you want to only have an e-mail sent when the date changes, then the script has to be able to determine when the last update happened.  I recommend having your script save the previous date in a file on the local machine, then it can open the file and read the date, and then check it against the current date.
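A minimal sketch of that idea, assuming the scraped text is already a plain string. The function name `date_changed` and the state-file handling are illustrative, not something from the original script:

```python
from pathlib import Path


def date_changed(current: str, state_file: Path) -> bool:
    """Return True when `current` differs from the value stored in `state_file`.

    The stored value is replaced with `current` whenever a change is detected.
    """
    current = current.strip()
    # Read the value saved by the previous run, if there was one.
    previous = None
    if state_file.exists():
        previous = state_file.read_text(encoding="utf-8").strip()
    if current == previous:
        return False
    # Overwrite the stored value only after the comparison has been made.
    state_file.write_text(current, encoding="utf-8")
    return True
```

The scheduled script would call something like `date_changed(scraped_text, Path(r"C:\Users\jpilbeam\LastUpdate.txt"))` and send the email only when it returns True. Note that in this sketch the very first run also returns True, because there is no stored value yet.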


8 Replies
RandyBurton
MVP Alum

Perhaps if "last update" was done in the last 24 hours (or other time period), then send an email.  You may also need to consider the time zone in the time calculations.  The basic idea:

import re
from dateutil import parser
from datetime import datetime

pattern = re.compile(r"\d[\d\/,:. AaPpMm]{4,24}")

last_update_fra = """<h1>COVID-19 News Updates</h1>

<div><br></div><div>Last Updated 4/3/2020, 12:08 p.m.</div>"""

# assumes datetime will be last element, and a datetime was found
dt = parser.parse(pattern.findall(last_update_fra)[-1])
updated = datetime.now() - dt

if updated.days < 1:  # updated in the past day
    print("process -", dt)
else:
    print("ignore -", dt)

# prints: ignore - 2020-04-03 12:08:00

JaredPilbeam2
MVP Regular Contributor

Joshua,

I have a good grasp on what you're saying, thanks. I'm at the point where I'm checking the two dates against each other.

Both the "last_update_fra" and "txt" variables print out the same, but when I do if == there is no result.

#Find string
last_update_fra = soup.findAll(string=re.compile("Last Updated"))
print(last_update_fra)

#write "Last Update" to file
txt = open(r"C:\Users\jpilbeam\LastUpdate.txt", "w")
#to write to file parameter has to be a string
txt.write(str(last_update_fra))
txt.close()
#open and read the file
txt = open(r"C:\Users\jpilbeam\LastUpdate.txt", "r")
print(txt.read())
if txt == last_update_fra:
    print("good")
>>> 
['Last Updated 4/3/2020, 8:28 p.m.']
['Last Updated 4/3/2020, 8:28 p.m.']
>>>


Edit: They're not the same. Now I have to figure out why.

if txt != last_update_fra:
    print("good")

#prints good
RandyBurton
MVP Alum

Although they look identical, txt is a file object, and last_update_fra appears to be a list object.

>>> last_update_fra = ['Last Updated 4/3/2020, 8:28 p.m.']
>>> txt = open(r"C:\Users\jpilbeam\LastUpdate.txt", "r")
>>> type(txt)
<type 'file'>
>>> t = txt.read()
>>> type(t)
<type 'str'>
>>> t
"['Last Updated 4/3/2020, 8:28 p.m.']"
>>> type(last_update_fra)
<type 'list'>
>>> tt = str(last_update_fra)
>>> tt
"['Last Updated 4/3/2020, 8:28 p.m.']"
>>> if t == tt:
... 	print "match"
... 	
match
>>> 
JaredPilbeam2
MVP Regular Contributor

Randy,

Thanks. I'm still working on this. I had another method going based on an answer which isn't working either after testing. It seems like I'm comparing the same thing, so the email never sent when the website eventually changed: python - BeautifulSoup Detect Change Trigger - Stack Overflow 

So, back here now.

I was running the snippets in the interpreter as you've done. When I read the txt file it came back as something different (date has changed as the website has since updated):

>>> last_update_fra = ['Last Updated 4/3/2020, 8:28 p.m.']
>>> txt = open(r"C:\Users\jpilbeam\LastUpdate.txt", "r")
>>> type(txt)
<class '_io.TextIOWrapper'>
>>> t = txt.read()
>>> type(t)
<class 'str'>
>>> t
"['Last Updated 4/6/2020, 4:45 p.m.']"
>>> type(last_update_fra)
<class 'list'>
>>> tt = str(last_update_fra)
>>> tt
"['Last Updated 4/3/2020, 8:28 p.m.']"
>>> if t == tt:
...     print("match")
... 
JoshuaBixby
MVP Esteemed Contributor

You are comparing a text representation of a list to an actual list, in effect comparing string to list so of course they won't be equal.  Ditch the list by writing only the string of the date in the list and then comparing it to the string of the date you just retrieved.
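To illustrate the difference, here is a small sketch. The list literal `found` is a stand-in for what `soup.findAll(...)` returned in the thread, and the variable names are made up for the example:

```python
# Stand-in for the findAll result: a list holding one matching string.
found = ['Last Updated 4/3/2020, 8:28 p.m.']

# str() on the list keeps the brackets and quotes.
as_list_text = str(found)
# Taking the first element gives just the text (empty string if nothing matched).
as_plain_text = str(found[0]).strip() if found else ""

print(as_list_text)    # ['Last Updated 4/3/2020, 8:28 p.m.']
print(as_plain_text)   # Last Updated 4/3/2020, 8:28 p.m.
print(as_list_text == as_plain_text)   # False
```

Writing `as_plain_text` to the file, and comparing against it on the next run, keeps both sides of the comparison as plain strings.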

JaredPilbeam2
MVP Regular Contributor

"Ditch the list by writing only the string of the date in the list and then comparing it to the string of the date you just retrieved."

Joshua,

I'm thinking I did just that. But, it sent the email even though they appear to be both strings now?

#Find string in webpage
last_update_fra = soup.findAll(string=re.compile("Last Updated"))
#use ''.join to change list object to string
last_update_fra_string = ''.join(last_update_fra)
print(last_update_fra_string)

#write to text file
txt = open(r"C:\Users\jpilbeam\LastUpdate.txt", "w")
txt.write(last_update_fra_string)
txt.close()

txt = open(r"C:\Users\jpilbeam\LastUpdate.txt", "r")
print(txt.read())

if last_update_fra_string == txt:
    print("no change")
else:
    print("send email")
>>> 
Last Updated 4/6/2020, 4:45 p.m.
Last Updated 4/6/2020, 4:45 p.m.
send email
>>>
JoshuaBixby
MVP Esteemed Contributor

In your code, you are comparing a string to a file object, which will never return as equal.  Read the text file into a variable and then compare it.
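A short sketch of the difference, using a temporary file in place of LastUpdate.txt and a hard-coded `scraped` value in place of the text pulled from the page (both are assumptions for the example):

```python
import os
import tempfile

# Hypothetical stand-in for the string scraped from the website.
scraped = "Last Updated 4/6/2020, 4:45 p.m."

# Temporary file standing in for LastUpdate.txt.
path = os.path.join(tempfile.mkdtemp(), "LastUpdate.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(scraped)

with open(path, "r", encoding="utf-8") as f:
    file_matches = (f == scraped)   # file object vs. str: never equal
    stored = f.read()               # the file's contents as a str

content_matches = (stored == scraped)   # str vs. str
print(file_matches, content_matches)    # False True
```

One more thing to watch in the posted script: it writes the new value to the file before reading it back, so once the comparison is fixed the two strings will always match. The stored value probably needs to be read and compared first, and the file overwritten only afterwards.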