Gather data from a regularly formatted webpage (Tax Parcels)

AlfredBaldenweck · 08-01-2022

Hi all,

We frequently use the tax parcel layer published by the State of Hawaii to help us plan projects.

In the last few years, the State stopped publishing the layer with the ownership information, opting instead to include, as an attribute, a link to a webpage featuring ownership, taxes, etc.

Example here, with Hawai'i Volcanoes National Park: qPublic.net - Hawai'i County, HI - Report: 980010010000 (schneidercorp.com)

I'd like to be able to populate a copy of the layer (filtered to be relevant to us) with attributes from the webpage, especially the ownership information.

Does anyone have any tips as to how this might be done? It does not need to update dynamically; a one-time pull is fine.

Thanks!

I_AM_ERROR · 08-01-2022

Since the URL contains the TMK # of the parcel, you could use that with the requests library: retrieve the page, parse the response, then repeat for each record of interest.
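A minimal sketch of that loop, under a few assumptions. REPORT_URL is a placeholder, not the real qPublic address; copy the actual pattern from the link attribute in the layer (or simply loop over the link values themselves). The names fetch_with_requests and collect_pages are made up for illustration, and the two-second pause between requests is an arbitrary choice.

```python
import time

# Placeholder pattern (assumption): replace with the real URL taken from
# the layer's link attribute, keeping {tmk} where the TMK number appears.
REPORT_URL = "https://example.com/report?KeyValue={tmk}"


def fetch_with_requests(url):
    """Download one page and return its HTML as text."""
    # Third-party package; imported here so collect_pages can be tried
    # with a different fetch function when requests is not installed.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def collect_pages(tmks, fetch=fetch_with_requests, delay_seconds=2.0):
    """Return {tmk: html} for each TMK, pausing between requests."""
    pages = {}
    for index, tmk in enumerate(tmks):
        if index > 0:
            time.sleep(delay_seconds)  # be gentle with the server
        pages[tmk] = fetch(REPORT_URL.format(tmk=tmk))
    return pages
```

With the real URL pattern in place, something like collect_pages(["980010010000"]) would return the raw HTML keyed by TMK, ready to be parsed in a separate step.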

Anonymous User · 08-01-2022

Looking at the site's robots.txt file, I see it disallows all user agents (web crawlers and automated scrapers) for /Application.aprx/, so be respectful and careful about how you go about your data extraction.
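If you want to check a URL against robots.txt from a script, Python's standard library has urllib.robotparser. The sketch below parses a sample rule set written to mirror the description above; SAMPLE_ROBOTS, the example.com URLs, and the "my-script" agent name are all illustrative assumptions, not the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (assumption): modeled on the description above,
# not copied from the live robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /Application.aprx/
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# can_fetch(useragent, url) returns True when the rules permit the request.
report_allowed = parser.can_fetch(
    "my-script", "https://example.com/Application.aprx/report"
)
other_allowed = parser.can_fetch("my-script", "https://example.com/other")
print(report_allowed, other_allowed)
```

Against a live site you would instead call parser.set_url() with the address of its robots.txt and then parser.read() before asking can_fetch().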

You can use the Python package BeautifulSoup to extract items and text from web pages. There are plenty of tutorials online showing how it can be done.
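As a rough sketch of what that looks like: the snippet below pulls label/value pairs out of table rows. SAMPLE_HTML is invented for illustration (I have not checked how the real report page is marked up, so the tags and the selector logic will almost certainly need adjusting), and extract_labeled_values is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented markup (assumption): inspect the real page in your browser's
# developer tools and adapt the tag names below to match it.
SAMPLE_HTML = """
<table>
  <tr><th>Parcel Number</th><td>980010010000</td></tr>
  <tr><th>Owner Name</th><td> EXAMPLE OWNER </td></tr>
</table>
"""


def extract_labeled_values(html):
    """Return {label: value} for table rows that have both a <th> and a <td>."""
    soup = BeautifulSoup(html, "html.parser")
    values = {}
    for row in soup.find_all("tr"):
        label = row.find("th")
        value = row.find("td")
        if label is not None and value is not None:
            # get_text(strip=True) trims surrounding whitespace
            values[label.get_text(strip=True)] = value.get_text(" ", strip=True)
    return values


record = extract_labeled_values(SAMPLE_HTML)
print(record.get("Owner Name"))
```

The resulting dictionary can then be written back to the matching parcel in your copy of the layer, keyed on the TMK.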