Scenario: I have a wide variety of PDF and I am using a Notebook script I created in python to extract specific information. The specific information I need is extracted to console, and PDFs that do not have the specific values I am looking for are labeled as "Unprocessed".
I essentially want all the "Processed" files and their data to be uploaded to a feature layer under the data tab.
Is this possible? Or, are there any other workarounds such as an XLSX file being created from notebooks with the extracted information.
Thanks.
You would need a python module to process PDF files e.g. PyPDF2
It isn't installed by default in Notebook so excel files might be the better method.
You can read excel files using the following code:
# Import libraries
from arcgis.gis import GIS
gis = GIS("home")
from openpyxl import load_workbook# File location
directory = r'/arcgis/home/'
myexcelfile = directory + r'MyFile.xlsx'## Read excel file
wb = load_workbook(myexcelfile)
ws = wb['Sheet_Name']
# Iterate the loop to read the cell values
for col in range(1, ws.max_column+1):
for row in range(1, ws.max_row+1):
print(ws.cell(column=col,row=row).value)If you are using arcgis online notebook you need to add your excel files into the notebook directory. It's kind of weird but this post explains how to add files into your direcotry. https://www.esri.com/arcgis-blog/products/arcgis-online/announcements/new-user-workspace-and-file-ma...
I have the PYPDF reader, but I was wondering if it was possible if I can also develop script that then can repopulate the excel file with more data that was read from a file that was parsed.
Sure this isn't a problem. You just parse the PDF file using something like this:
from openpyxl import load_workbook
from pypdf import PdfReader
## Read excel file
myexcelfile = r"MyExcelFile.xlsx"
wb = load_workbook(myexcelfile)
ws = wb['Sheet_Name']
last_row = ws.max_row + 1 # Get the last row in the sheet
# Create a dictionary of column names
column_names = dict()
count = 1
for col in ws.iter_cols(1, ws.max_column):
print(col[0].value)
column_names[col[0].value] = count
count += 1
## Read the PDF
reader = PdfReader(r"myPDFFile.pdf")
page = reader.pages[0] # Choose the page
text = page.extract_text()
## Read the text line by line
for line in text.split("\n"):
#print(line.split(" "))
# Split your lines into values, I used space
column_1 = line.split(" ")[0]
column_2 = line.split(" ")[1]
column_3 = line.split(" ")[2]
# Write to the last cell specifying row, column and value
ws.cell(column=column_names["Column 1"],row=last_row, value=column_1)
ws.cell(column=column_names["Column 2"],row=last_row, value=column_2)
ws.cell(column=column_names["Column 3"],row=last_row, value=column_3)
# increase last row
last_row += 1
# Save the changes to the excel file
wb.save(filename=myexcelfile)
wb.close()
This should add your data to the excel file. You spreadsheet should have column names for your data. eg.