BeautifulSoup count occurrences of string

08-19-2020 08:15 AM
JaredPilbeam2
MVP Regular Contributor

I think this is giving me the length of the string "COVID-19" rather than a count of its occurrences, because it prints 8.

import requests
from bs4 import BeautifulSoup
import re

url = r'https://www.bolingbrook.com/coronavirus'

# Request the webpage and parse it
soup = BeautifulSoup(requests.get(url).content, "lxml")
# Find the text nodes that contain the string
print(len(soup.find_all(string=re.compile("COVID-19"))))

# prints
>>> 8

When I do a CTRL+F for "COVID-19" on the webpage I get a count of 5 occurrences. When I do a CTRL+F for "COVID-19" in the Developer tools I get 15.

I'm trying to get the count for the total occurrences of the string "COVID-19". How can I set up the code to do that?


Accepted Solutions
JoshuaSharp-Heward
Occasional Contributor III

Seems like the easiest way to find out what it's doing is to just print out the results of 

soup.find_all(string=re.compile("COVID-19"))

right? Seems that the function returns a list of all matches, so you could just do something like the following:

matches = soup.find_all(string=re.compile("COVID-19"))
for match in matches:
    print(match)

Also, I had a bit of a peek at that webpage, and I think in some places it references "Covid-19" and in others "COVID-19". Your regular expression is case-sensitive, so it would only match the latter.
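A minimal sketch of a case-insensitive count, assuming a small made-up HTML sample (the `html` string below is for illustration only and is not the Bolingbrook page). Note that `find_all(string=...)` returns one entry per matching text node, so a node that mentions the term twice still counts once; summing `re.findall` over those nodes gives the total number of occurrences.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the real page
html = """
<html><body>
  <h1>COVID-19 Message</h1>
  <p>Covid-19 cases are rising. See the COVID-19 update.</p>
  <p>No mention here.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# re.IGNORECASE matches "COVID-19", "Covid-19", "covid-19", and so on
pattern = re.compile(r"covid-19", re.IGNORECASE)

# One entry per text node that contains at least one match
nodes = soup.find_all(string=pattern)
node_count = len(nodes)

# Total occurrences across those nodes
occurrence_count = sum(len(pattern.findall(node)) for node in nodes)

print(node_count, occurrence_count)
```

Which of the two numbers is the "right" one depends on whether you care about how many pieces of text mention the term or how many times it is written.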


5 Replies
JoshuaBixby
MVP Esteemed Contributor

I can't say I fully understand what your COVID-19 searches are meant to reflect, but I am quite certain the results from the Developer tools are not what you are after. HTML is stylized content, and there are plenty of reasons COVID-19 could show up in parts of the HTML that are not what you are trying to capture with your search.

In order to extract meaning from scraping HTML pages, you need a decent understanding ahead of time of the HTML structure of the site/pages. Most web pages today are dynamically generated from content management systems with fairly organized structure, i.e., there are set banners, sidebars, footers, ..., and main content. Which part of the page you want to analyze and how you analyze it varies from question to question.
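As a rough illustration of limiting the search to one part of the page, here is a sketch that assumes a hypothetical layout with a banner, a main content area, and a footer. The `id` values are invented for the example; a real site would need its own selectors, found by inspecting its HTML.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page layout: a banner, a main content area, and a footer
html = """
<html><body>
  <div id="banner">COVID-19 Message</div>
  <div id="main-content">
    <p>Village COVID-19 Update</p>
    <p>Will County COVID-19 Cases</p>
  </div>
  <div id="footer">COVID-19 HELP</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"covid-19", re.IGNORECASE)

# Whole page: banner and footer text are included
page_matches = soup.find_all(string=pattern)

# Main content only; "main-content" is an assumed id, not taken from a real site
main = soup.find(id="main-content")
main_matches = main.find_all(string=pattern) if main else []

print(len(page_matches), len(main_matches))
```

Searching only the main container keeps repeated banner or footer text from inflating the count.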

JaredPilbeam2
MVP Regular Contributor

Apologies if I wasn't very clear. I was more or less wondering if the len() function was doing what I intended it to, which was to count the occurrences of "COVID-19", not the length of the string.

The script is meant to find the number of times "COVID-19" occurs in that URL. If that number changes the next time the script runs (through Task Scheduler), it triggers an email to one of the GIS staff members here (I left those parts out for clarity).

I see what you mean, though. You have to really study the HTML in order to get the script to accurately find things. It's a little difficult to find the time to know a website that well when you're looking at every municipal and school district website in the county, however. They change constantly.
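The change check described above (compare this run's count with the previous run's) could be sketched roughly like this. It is not the actual script: the function name `count_changed` and the state file name are hypothetical, and the email step is left as a placeholder.

```python
import tempfile
from pathlib import Path


def count_changed(current_count: int, state_file: Path) -> bool:
    """Return True if current_count differs from the count saved by the previous run."""
    previous = None
    if state_file.exists():
        text = state_file.read_text().strip()
        if text.isdigit():
            previous = int(text)
    # Save the current count for the next scheduled run
    state_file.write_text(str(current_count))
    # A first run has nothing to compare against, so it reports no change
    return previous is not None and previous != current_count


# Demo in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "covid_count.txt"  # hypothetical file name
    first = count_changed(8, state)   # no previous run
    second = count_changed(8, state)  # same count as last time
    third = count_changed(9, state)   # count moved from 8 to 9

print(first, second, third)
```

In a scheduled job, the email would be sent wherever `count_changed` returns True. Keep in mind that a count can stay the same even when the page text changes.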

JaredPilbeam2
MVP Regular Contributor

You're right. That's exactly what I did. I put the print statement in, and simply counted the results. There are 8 of them. I was thinking it was counting the length of "COVID-19" because there are 8 characters there too.

#printed this
>>>
COVID-19 Message
Village of Bolingbrook COVID-19 Update 05.06.20
COVID-19 Message
Village of Bolingbrook COVID-19 Update 05.06.20
The Village of Bolingbrook has been in constant communication with both Amita Bolingbrook Hospital and Edward Hospital, along with the Will County and Illinois Department of Public Health to monitor the possible spread of the Coronavirus (COVID-19) in the Bolingbrook area. We are following the most current recommendations for treatment and preventing the potential spread of infection. Protective equipment and procedures are in place to provide emergency medical treatment and safe transport of individuals to the hospital. 
COVID-19 HELP
Will County COVID-19 Cases
Here in Northern Illinois, we have had 41 blood drives canceled due to coronavirus concerns, resulting in approximately 1,500 uncollected blood donations.  As the number of COVID-19 cases grow, we do expect that number to increase unfortunately.  That’s why we are asking organizations to please keep their blood drives and for donors to continue to give.  Together, we must ensure a readily available blood supply for patients who are counting on us.

Also, good point. CTRL+F in the developer tools seems to not be case-sensitive. I'll go ahead and mark your answer correct since it was the closest thing to my not-so-clear question.
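For anyone comparing the different counts, here is a small sketch of why a source search, a page search, and a case-sensitive regex can disagree. The markup is made up for the example; the comparison to browser search is only approximate, since Developer tools search the live DOM rather than the raw response.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the term also appears in an attribute and in a script
html = """
<html><head>
  <script>var topic = "covid-19";</script>
</head><body>
  <a href="/covid-19-help" title="COVID-19 HELP">COVID-19 HELP</a>
  <p>Covid-19 testing sites</p>
</body></html>
"""

pattern_exact = re.compile(r"COVID-19")
pattern_any_case = re.compile(r"covid-19", re.IGNORECASE)

soup = BeautifulSoup(html, "html.parser")

# Drop script and style elements so only text a reader would see remains
for tag in soup(["script", "style"]):
    tag.decompose()
visible_text = soup.get_text()

# Roughly like searching the page source: attributes and scripts count too
source_any_case = len(pattern_any_case.findall(html))
# Roughly like CTRL+F on the rendered page, ignoring case
visible_any_case = len(pattern_any_case.findall(visible_text))
# Case-sensitive, like the regular expression in the original script
visible_exact = len(pattern_exact.findall(visible_text))

print(source_any_case, visible_any_case, visible_exact)
```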

JoshuaSharp-Heward
Occasional Contributor III

I also noticed for the first time today that CTRL+F isn't case-sensitive! Glad I could help out though, mate.
