SearchCursor and BeautifulSoup

03-27-2020 08:31 AM
JaredPilbeam2
MVP Regular Contributor

I'd like to have this script loop through a list of URLs and use a web scraper to search for certain things on each of those websites. I have my cursor set up as well as BeautifulSoup, but I'm wondering how to identify each item in the list. Can I attach an index number to each one somehow? Here's what I have, followed by its output and a rough sketch of what I'm picturing. If I run this it prints the URLs. I've used BeautifulSoup to find things in HTML before, but I'm not sure how to find things from URLs in a list.

from bs4 import BeautifulSoup
import urllib
import urllib.request
import os, arcpy
import time


#the hosted layer with website urls in 'Website' field
fc = r'https://services.arcgis.com/fGsbyIOAuxHnF97m/arcgis/rest/services/Grab_Go_School_Meals_Location_(View)/FeatureServer/0?token=fbpCfA34sTJ4rzWO2TQn_c38B4TfGWOZ6jTMeFL1m7CNKd9_odI1t_t_hL-YvvePbE3M428FRT-zW-bISRYrGdJ2CnloKrHoHAfMnbGXpJ-5-zZBU6ONK1u0hMv5D-Vy-fnRpqpQP3aiQEke8L9d9jxDVBKWPamqCa0z0ko4IZX3xpIpHPSEKpmwpcJEaK7Z_rai3IBsT5-tqfMKIxnGCwe4SZZED8bDZM9j1T55-LggpjCgpwqWODs4vpj58iMy'
#query the webpage and return the html to the variable 'soup'
html = urllib.request.urlopen(url)
#parse the downloaded homepage and grab all text
soup = BeautifulSoup(html, 'html.parser')

#use current time to detect change
t = time.ctime()

#Search Cursor
#fc field where URLs are stored
field = ["Website"]
with arcpy.da.SearchCursor(fc, field) as cursor:
     for row in cursor:
        print(row)

##BeautifulSoup
##count the number of '<h2>' tags in the HTML
n = len(soup.find_all('h2'))
print(n)
#the text of the 21st '<h2>' tag (index 20)
atts = soup.find_all('h2')[20].text
('http://www.manhattan114.org/index.php/download_file/view/2776/1/',)
('http://www2.nlsd122.org/files/district/parentsandstudents/message_from_superintendent/2019-2020/mfts_031720.pdf',)
('https://www.peotoneschools.org/UserFiles/Servers/Server_266769/File/COVID-19%20Email%203.15.20.pdf',)
('https://manteno5.org/news/what_s_new/c_o_v_i_d-19_updates',)
('https://www.joliet86.org/student-grab-and-go-meals-available/',)
('https://www.joliet86.org/student-grab-and-go-meals-available/',)
('https://www.joliet86.org/student-grab-and-go-meals-available/',)
('https://www.joliet86.org/student-grab-and-go-meals-available/',)
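
And here's the rough sketch I mentioned, untested, assuming the Website field always holds a URL string. My thinking is that enumerate() would attach the index number to each row coming out of the cursor:

urls = []
with arcpy.da.SearchCursor(fc, field) as cursor:
    for i, row in enumerate(cursor):
        print(i, row[0])        #row is a one-item tuple; row[0] is the URL string
        urls.append(row[0])     #urls[i] is now the i-th website

Is that the right way to tie an index to each item, and would I then pass each urls[i] to urllib and BeautifulSoup?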
8 Replies
JoshuaBixby
MVP Esteemed Contributor

Can you take a step back and explain your overall objective? From looking at your code, I am not understanding the end part involving bs4.

JaredPilbeam2
MVP Regular Contributor

Joshua,

Thanks for the reply. My overall objective is two-part. I want to use the SearchCursor to create a list of URLs from the Website field of a hosted feature layer; that part I've done. Secondly, I want to use bs4 to find certain things on each of the webpages in that list. I'm stuck on how to connect the two blocks of code.

Sorry, that end part wasn't even set up for this script yet. It's from something else. This might be a better reference:

from bs4 import BeautifulSoup
import urllib.request
import arcpy
import time

#the hosted layer with website urls in 'Website' field
fc = r'https://services.arcgis.com/fGsbyIOAuxHnF97m/arcgis/rest/services/Grab_Go_School_Meals_Location_(View)/FeatureServer/0?token=fbpCfA34sTJ4rzWO2TQn_c38B4TfGWOZ6jTMeFL1m7CNKd9_odI1t_t_hL-YvvePbE3M428FRT-zW-bISRYrGdJ2CnloKrHoHAfMnbGXpJ-5-zZBU6ONK1u0hMv5D-Vy-fnRpqpQP3aiQEke8L9d9jxDVBKWPamqCa0z0ko4IZX3xpIpHPSEKpmwpcJEaK7Z_rai3IBsT5-tqfMKIxnGCwe4SZZED8bDZM9j1T55-LggpjCgpwqWODs4vpj58iMy'
#use current time to detect change
t = time.ctime()

#Search Cursor
#fc field where URLs are stored
field = ["Website"]
with arcpy.da.SearchCursor(fc, field) as cursor:
     for row in cursor:
        print(row)

###Below still under construction###

#BeautifulSoup
#query the webpage and return the html to the variable 'soup'
html = urllib.request.urlopen(fc)
#parse the downloaded homepage and grab all text
soup = BeautifulSoup(html, 'html.parser')
#print(soup.prettify())

#Count the number of '<h2>' tags in the HTML
n = len(soup.find_all('h2'))
#The text of the 2nd '<h2>' tag (index 1)
##atts = soup.find_all('h2')[1].text
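
What I think the under-construction part needs to become is roughly this, just a sketch on my end (assuming each Website value points at an HTML page rather than a PDF): move the bs4 calls inside the cursor loop and open row[0], the URL string from each record, instead of fc:

with arcpy.da.SearchCursor(fc, field) as cursor:
    for row in cursor:
        url = row[0]                              #the URL string from the 'Website' field
        html = urllib.request.urlopen(url)        #open the school's page, not the feature service
        soup = BeautifulSoup(html, 'html.parser')
        print(url, len(soup.find_all('h2')))      #e.g. count the <h2> tags on each page

Does that look like the right way to connect the two blocks?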
DanPatterson_Retired
MVP Emeritus

so, is there some output? or an error? or it does nothing?

JaredPilbeam2
MVP Regular Contributor

Dan,

I've only tested the top half, so down to line 11. It prints the URLs just fine.

('http://www.manhattan114.org/index.php/download_file/view/2776/1/',)
('http://www2.nlsd122.org/files/district/parentsandstudents/message_from_superintendent/2019-2020/mfts_031720.pdf',)
('https://www.peotoneschools.org/UserFiles/Servers/Server_266769/File/COVID-19%20Email%203.15.20.pdf',)
('https://manteno5.org/news/what_s_new/c_o_v_i_d-19_updates',)
('https://www.joliet86.org/student-grab-and-go-meals-available/',)
('https://www.joliet86.org/student-grab-and-go-meals-available/',)
('https://www.joliet86.org/student-grab-and-go-meals-available/',)
('https://www.joliet86.org/student-grab-and-go-meals-available/',)
JoshuaBixby
MVP Esteemed Contributor

bs4 won't parse PDF files, I believe, so you will have to figure out some intermediate step to download the PDF and extract the text. Even after the text is extracted, it won't have any HTML structure tags for bs4 to search.
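
If it helps, something along these lines could at least tell the two cases apart. This is only a sketch, and the helper name is a placeholder: it checks the response's Content-Type and only hands HTML to bs4, while the PDF branch would still need a separate download-and-extract step that isn't shown here.

import urllib.request
from bs4 import BeautifulSoup

def scrape_if_html(url):
    """Return the text of each <h2> on an HTML page, or None for PDFs/other content."""
    response = urllib.request.urlopen(url)
    content_type = response.headers.get('Content-Type', '')
    if 'text/html' not in content_type:
        #e.g. 'application/pdf': bs4 can't parse it, and even extracted PDF text
        #has no HTML tags to search, so handle those URLs some other way
        return None
    soup = BeautifulSoup(response, 'html.parser')
    return [h2.text for h2 in soup.find_all('h2')]

In theory the .pdf links above would come back as None and only the regular web pages would get parsed.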

DanPatterson_Retired
MVP Emeritus

sorry... only do soup with lunch

no examples on the web or the beautifulsoup doc site?

Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation 

at crummy.com... got to love it

JaredPilbeam2
MVP Regular Contributor

I dropped the idea of using a cursor. Now I'm just using BeautifulSoup, but I'm having some additional trouble that seems to be caused by the if/else statement. I posted another question: https://community.esri.com/thread/250821-beautifulsoup-if-else-statement

JoshuaBixby
MVP Esteemed Contributor

I marked this thread as Assumed Answered because your issue isn't really with cursors but with BeautifulSoup, and you've dropped the cursor approach and started a new question about BeautifulSoup.
