How to use beautiful soup to remove HTML tags from ArcGIS Metadata

JaredPilbeam2 · ‎12-22-2017

I have a script that uses a Python package called arcpy_metadata. It basically allows you to get at ArcGIS metadata.

The script writes the metadata to a text file and runs without errors, but the HTML markup used to format the Description and Limitation items gets written too, which hurts the readability of the text file. I contacted the author of the package and he suggested BeautifulSoup, which leads to my question. I have this so far, but am at a loss as to how to implement it:

from bs4 import BeautifulSoup 
cleantext = BeautifulSoup(raw_html).text

And here is my Metadata2Txt script:

import arcpy
import arcpy_metadata as md
import re


ws = r'Database Connections\ims to Plainfield.sde\gisedit.DBO.Tax_Map_LY\gisedit.DBO.Tax_Map_Parcels_LY'
metadata = md.MetadataEditor(ws)
path = r'\\gisfile\GISstaff\Jared\Python Scripts\Test\Parcels'
##cleantext = BeautifulSoup(raw_html).text

def meta2txt():
    title = metadata.title
    tags = metadata.tags
    purpose = metadata.purpose
    abstract = metadata.abstract
    credits = metadata.credits
    citation = metadata.citation
    limitation = metadata.limitation
    extent_description = metadata.extent_description
    desc = arcpy.Describe(ws)
    sr = desc.spatialReference
    tf = open(path + " Metadata.txt", "w")
    tf.write("Metadata Content:" + "\n")
    tf.write("----------------------------------------------" + "\n")

    if title:
        print('Title:\n{}\n'.format(title))
        tf.write('Title:\n{}\n'.format(title) + '\n')
    else:
        print('Title: \nThere is no title.\n')
        tf.write('Title: \nThere is no title.\n' + '\n')
        
    if tags:
        print('Tags:\n{}\n'.format(tags))
        tf.write('Tags:\n{}\n'.format(tags) + '\n')
    else:
        print("Tags: \nThere are no tags.\n")
        tf.write('Tags: \nThere are no tags.\n' + '\n')

    if purpose:
        print('Summary:\n{}\n'.format(purpose))
        tf.write('Summary:\n{}\n'.format(purpose) + '\n')
    else:
        print('Summary: \nThere is no summary.\n' + '\n')
        tf.write('Summary: \nThere is no summary.\n' + '\n')

    if abstract:
        print('Description:\n{}\n'.format(abstract))
        tf.write('Description:\n{}\n'.format(abstract) + '\n')
    else:
        print('Description: \nThere is no description.\n')
        tf.write('Description: \nThere is no description.\n' + '\n')

    if credits:
        print('Credits:\n{}\n'.format(credits))
        tf.write('Credits:\n{}\n'.format(credits) + '\n')
    else:
        print('Credits: \nThere are no credits.\n')
        tf.write('Credits: \nThere are no credits.\n' + '\n')

    if citation:
        print('Citation:\n{}\n'.format(citation))
        tf.write('Citation:\n{}\n'.format(citation) + '\n')
    else:
        print('Citation: \nThere is no citation.\n')
        tf.write('Citation: \nThere is no citation.\n' + '\n')

    if limitation:
        print('Limitation:\n{}\n'.format(limitation))
        tf.write('Limitation:\n{}\n'.format(limitation) + '\n')
    else:
        print('Limitation: \nThere is no limitation.\n')
        tf.write('Limitation: \nThere is no limitation.\n' + '\n')

    if extent_description:
        print('Extent:\n{}\n'.format(extent_description))
        tf.write('Extent:\n{}\n'.format(extent_description) + '\n')
    else:
        print('Extent: \nThere is no extent.\n')
        tf.write('Extent: \nThere is no extent.\n' + '\n')

    if sr:
        print('Spatial Reference:\n{}\n'.format(sr.name))
        tf.write('Spatial Reference:\n{}\n'.format(sr.name) + '\n')
    else:
        print('Spatial Reference: \nThere is no spatial reference.\n')
        tf.write('Spatial Reference: \nThere is no spatial reference.\n' + '\n')

    tf.close()  # flush and release the output file

meta2txt()

Here's how the Description item of this particular feature class looks in the text file after running the script:

Description:
<DIV STYLE="text-align:Left;"><DIV><DIV><P><SPAN>The tax map parcels layer is published 
every year normally during the spring through the Will County Clerk Tax Extension. This
layer contains various parcels within Will County. The tax map parcels entered is only
digitized based upon plats and recorded documents received by the Tax Extension within 
the Will County Clerk Office. </SPAN></P></DIV></DIV></DIV>

ShaunWalbridge · ‎12-27-2017

Jared,

BeautifulSoup expects an input that's an HTML page or fragment. Usually, it's doing the top-level parsing, but here, you have arcpy_metadata to do the primary parsing, then want to filter the results through BeautifulSoup. If you'd like to learn how to use BeautifulSoup, I recommend their documentation -- it's quite good.

In your case, you want to pass elements of your results through BeautifulSoup for further filtering before they are output. A simple approach would be to write a function which passes the elements in question on to BeautifulSoup:

def strip_html(value):
    # Name the parser explicitly; bs4 warns when it has to guess one.
    return BeautifulSoup(value, "html.parser").text

Then use that function for those elements which contain HTML.
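
As a rough illustration of that suggestion, here is a minimal Python 3 sketch of such a function applied to a string. The sample string is a shortened version of the Description output shown earlier in the thread. The `_TextOnly` helper and the standard-library fallback are assumptions added here so the sketch still runs when bs4 is not installed; they are not part of the original answer.

```python
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
except ImportError:
    BeautifulSoup = None  # use the standard-library fallback below


class _TextOnly(HTMLParser):
    """Hypothetical fallback: keep text nodes, ignore every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return value with HTML tags removed and whitespace collapsed."""
    if not value:
        return ""
    if BeautifulSoup is not None:
        text = BeautifulSoup(value, "html.parser").get_text()
    else:
        parser = _TextOnly()
        parser.feed(value)
        parser.close()  # flush any text still buffered by the parser
        text = "".join(parser.parts)
    # Collapse runs of spaces and newlines left behind by the markup.
    return " ".join(text.split())


# Shortened stand-in for what metadata.abstract returned in the question.
sample = (
    '<DIV STYLE="text-align:Left;"><DIV><DIV><P><SPAN>The tax map parcels layer '
    'is published every year.</SPAN></P></DIV></DIV></DIV>'
)
print(strip_html(sample))
```

In the original script, the call would presumably wrap the values that carry markup, for example `strip_html(abstract)` and `strip_html(limitation)`, before they are passed to `print` and `tf.write`.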

JaredPilbeam2 · ‎12-27-2017

Shaun,

Thanks for the help. I put the same question to Stack Exchange and got at least one alternative to BeautifulSoup: python - remove BeautifulSoup tags from Text file - Stack Overflow. This uses the w3lib library, and it seems to have done the trick.

For the time being, I'm going with this:

import arcpy
import arcpy_metadata as md
from w3lib.html import remove_tags

ws = r'Database Connections\ims to Plainfield.sde\gisedit.DBO.Tax_Map_LY\gisedit.DBO.Tax_Map_Parcels_LY'
metadata = md.MetadataEditor(ws)
path = r'\\gisfile\GISstaff\Jared\Python Scripts\Test\Parcels'

def meta2txt():
    abstract = metadata.abstract
    if abstract:
        new_abstract = remove_tags(abstract)
        print('Description:\n{}\n'.format(new_abstract)) 
    else:
        print('Description: \nThere is no description.\n')

meta2txt()
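
The snippet above only prints the Description. One possible way to fold tag stripping back into the full text-file report is sketched below in Python 3. Everything in it is an assumption made for illustration: `write_sections`, `crude_strip`, and the sample values are hypothetical, and the regex cleaner is only a crude stand-in so the sketch runs without third-party packages. In the real script, `remove_tags` from w3lib (or a BeautifulSoup-based function) could be passed as `clean`, and the values would come from the `metadata` object.

```python
import io
import re


def crude_strip(value):
    """Crude stand-in for remove_tags: drop anything that looks like <...>."""
    return re.sub(r"<[^>]+>", "", value)


def write_sections(out, sections, clean=crude_strip):
    """Write each (label, value, empty_message) section to the file-like `out`."""
    out.write("Metadata Content:\n")
    out.write("-" * 46 + "\n")
    for label, value, empty_message in sections:
        # Clean only when there is a value; otherwise use the placeholder text.
        body = clean(value).strip() if value else empty_message
        out.write("{}:\n{}\n\n".format(label, body))


# Hypothetical values standing in for metadata.title, metadata.abstract, etc.
sections = [
    ("Title", "Tax Map Parcels", "There is no title."),
    (
        "Description",
        "<DIV><P><SPAN>Parcels within Will County.</SPAN></P></DIV>",
        "There is no description.",
    ),
    ("Credits", None, "There are no credits."),
]

buffer = io.StringIO()  # swap for a file opened with open(..., "w") in real use
write_sections(buffer, sections)
report = buffer.getvalue()
print(report)
```

Driving the output from a list of sections keeps the cleaning step in one place, so every item is written through the same function instead of repeating an if/else block per field.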