I have a script that uses a Python package called arcpy_metadata, which gives you access to ArcGIS metadata.
The script writes the metadata to a text file and runs without errors, but the HTML markup used to format the Description and Limitation items gets written as well, which hurts the readability of the text file. I contacted the author of the package and he suggested BeautifulSoup, which leads to my question. I have this so far, but am at a loss as to how to implement it:
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html).text
And here is my Metadata2Txt script:
import arcpy
import arcpy_metadata as md
import re
ws = r'Database Connections\ims to Plainfield.sde\gisedit.DBO.Tax_Map_LY\gisedit.DBO.Tax_Map_Parcels_LY'
metadata = md.MetadataEditor(ws)
path = r'\\gisfile\GISstaff\Jared\Python Scripts\Test\Parcels'
##cleantext = BeautifulSoup(raw_html).text
def meta2txt():
    # Pull each metadata item from the arcpy_metadata editor
    title = metadata.title
    tags = metadata.tags
    purpose = metadata.purpose
    abstract = metadata.abstract
    credits = metadata.credits
    citation = metadata.citation
    limitation = metadata.limitation
    extent_description = metadata.extent_description
    desc = arcpy.Describe(ws)
    sr = desc.spatialReference
    tf = open(path + " " + "{}".format("Metadata.txt"), "w")
    tf.write("Metadata Content:" + "\n")
    tf.write("----------------------------------------------" + "\n")
    if title:
        print('Title:\n{}\n'.format(title))
        tf.write('Title:\n{}\n'.format(title) + '\n')
    else:
        print('Title: \nThere is no title.\n')
        tf.write('Title: \nThere is no title.\n' + '\n')
    if tags:
        print('Tags:\n{}\n'.format(tags))
        tf.write('Tags:\n{}\n'.format(tags) + '\n')
    else:
        print("Tags: \nThere are no tags.\n")
        tf.write('Tags: \nThere are no tags.\n' + '\n')
    if purpose:
        print('Summary:\n{}\n'.format(purpose))
        tf.write('Summary:\n{}\n'.format(purpose) + '\n')
    else:
        print('Summary: \nThere is no summary.\n' + '\n')
        tf.write('Summary: \nThere is no summary.\n' + '\n')
    if abstract:
        print('Description:\n{}\n'.format(abstract))
        tf.write('Description:\n{}\n'.format(abstract) + '\n')
    else:
        print('Description: \nThere is no description.\n')
        tf.write('Description: \nThere is no description.\n' + '\n')
    if credits:
        print('Credits:\n{}\n'.format(credits))
        tf.write('Credits:\n{}\n'.format(credits) + '\n')
    else:
        print('Credits: \nThere are no credits.\n')
        tf.write('Credits: \nThere are no credits.\n' + '\n')
    if citation:
        print('Citation:\n{}\n'.format(citation))
        tf.write('Citation:\n{}\n'.format(citation) + '\n')
    else:
        print('Citation: \nThere is no citation.\n')
        tf.write('Citation: \nThere is no citation.\n' + '\n')
    if limitation:
        print('Limitation:\n{}\n'.format(limitation))
        tf.write('Limitation:\n{}\n'.format(limitation) + '\n')
    else:
        print('Limitation: \nThere is no limitation.\n')
        tf.write('Limitation: \nThere is no limitation.\n' + '\n')
    if extent_description:
        print('Extent:\n{}\n'.format(extent_description))
        tf.write('Extent:\n{}\n'.format(extent_description) + '\n')
    else:
        print('Extent: \nThere is no extent.\n')
        tf.write('Extent: \nThere is no extent.\n' + '\n')
    if sr:
        print('Spatial Reference:\n{}\n'.format(sr.name))
        tf.write('Spatial Reference:\n{}\n'.format(sr.name) + '\n')
    else:
        print('Spatial Reference: \nThere is no spatial reference.\n')
        tf.write('Spatial Reference: \nThere is no spatial reference.\n' + '\n')
    # Close the file so everything is flushed to disk
    tf.close()
meta2txt()
Here's how the Description item of this particular feature class looks in the text file after running the script:
Description:
<DIV STYLE="text-align:Left;"><DIV><DIV><P><SPAN>The tax map parcels layer is published
every year normally during the spring through the Will County Clerk Tax Extension. This
layer contains various parcels within Will County. The tax map parcels entered is only
digitized based upon plats and recorded documents received by the Tax Extension within
the Will County Clerk Office. </SPAN></P></DIV></DIV></DIV>
Jared,
BeautifulSoup expects an input that's an HTML page or fragment. Usually, it's doing the top level parsing, but here, you have arcpy_metadata to do the primary parsing, then want to filter the results through BeautifulSoup. If you'd like to learn how to use BeautifulSoup, I recommend their documentation -- it's quite good.
In your case, you want to pass elements of your results through BeautifulSoup for further filtering before they are output. A simple approach is to write a function that passes the elements in question on to BeautifulSoup:
def strip_html(input):
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    return BeautifulSoup(input, "html.parser").text
Then use that function for those elements which contain HTML.
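To make that concrete, here is a minimal sketch of how the function could be applied to the Description and Limitation values before they are printed or written. It assumes those properties come back as plain strings (or empty). The sample string is a shortened copy of the Description posted above, and the empty-value check and explicit "html.parser" argument are my additions, not something arcpy_metadata requires.

```python
from bs4 import BeautifulSoup


def strip_html(value):
    """Return the text of an HTML fragment; leave empty values untouched."""
    if not value:
        return value
    # Name the parser explicitly so bs4 does not have to guess one.
    return BeautifulSoup(value, "html.parser").get_text().strip()


# Sample shaped like the Description shown earlier in the thread (shortened).
sample = ('<DIV STYLE="text-align:Left;"><DIV><DIV><P><SPAN>The tax map parcels '
          'layer is published every year.</SPAN></P></DIV></DIV></DIV>')

abstract = strip_html(sample)
limitation = strip_html(None)  # an empty field stays empty

print('Description:\n{}\n'.format(abstract))
```

In the script itself that would presumably mean assigning abstract = strip_html(metadata.abstract) and limitation = strip_html(metadata.limitation) near the top of meta2txt(), leaving the rest of the if/else blocks as they are.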
Shaun,
Thanks for the help. I put the same question to Sack Exchange and was answered by at least one alternative to BeautifulSoup: python - remove BeautifulSoup tags from Text file - Stack Overflow This uses the w3lib
library and it seems to have done the trick.
For the time being, I'm going with this:
import arcpy
import arcpy_metadata as md
from w3lib.html import remove_tags
ws = r'Database Connections\ims to Plainfield.sde\gisedit.DBO.Tax_Map_LY\gisedit.DBO.Tax_Map_Parcels_LY'
metadata = md.MetadataEditor(ws)
path = r'\\gisfile\GISstaff\Jared\Python Scripts\Test\Parcels'
def meta2txt():
    abstract = metadata.abstract
    if abstract:
        new_abstract = remove_tags(abstract)
        print('Description:\n{}\n'.format(new_abstract))
    else:
        print('Description: \nThere is no description.\n')
meta2txt()
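For anyone who wants to carry this back into the full text-file script, one possible way to avoid repeating the if/else block for every item is a small helper that takes an optional cleaning function. This is only a sketch under a few assumptions: write_field and strip_tags are hypothetical names of mine, strip_tags is a standard-library stand-in so the example runs without w3lib (remove_tags or a BeautifulSoup-based function could be passed in instead), io.StringIO stands in for the open text file, and the code is written for Python 3.

```python
import io
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores tags (stdlib stand-in for remove_tags)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(value):
    """Return the text content of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(value)
    parser.close()
    return "".join(parser.parts).strip()


def write_field(out, label, value, empty_message, clean=None):
    """Write one labelled metadata item, cleaning it first if a cleaner is given."""
    if value:
        text = clean(value) if clean else value
    else:
        text = empty_message
    out.write('{}:\n{}\n\n'.format(label, text))


out = io.StringIO()  # stands in for the open text file
write_field(out, 'Description', '<DIV><P><SPAN>Parcels layer.</SPAN></P></DIV>',
            'There is no description.', clean=strip_tags)
write_field(out, 'Limitation', None, 'There is no limitation.', clean=strip_tags)
print(out.getvalue())
```

Only the items that actually contain HTML need the clean argument; Title, Tags and the rest can be written with the same helper and no cleaner.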