
Extract Text from PDF maps

DavidAnderson_1701 — Sun, 30 Jul 2023 23:21:22 GMT

Hello,

I want to extract some of the text strings in a PDF map document that was exported from a layout. For example, I want to extract the string AZ-FTA-000574 that is stored in a PDF.

I tried the PyPDF2 that comes with ArcGIS Pro. It returns the georeferencing information but no text. The text is present, since the PDF is searchable for labels, as described in this post:

https://support.esri.com/en-us/knowledge-base/problem-unable-to-search-for-text-in-an-exported-pdf-fr-000027716

The PyPDF2 included is version 1.26 which appears to be a circa 2016 package. A bit out of date.

I'd like to do this with the out of the box tools shipped with Pro, rather than installing ReportLab or other Python PDF tools.

Re: Extract Text from PDF maps

DavidAnderson_1701 — Sun, 30 Jul 2023 23:23:22 GMT

No answers there though.

Re: Extract Text from PDF maps

Anonymous User — Mon, 31 Jul 2023 01:05:48 GMT

Do you have a sample pdf you can share?

Re: Extract Text from PDF maps

DavidAnderson_1701 — Mon, 31 Jul 2023 03:49:56 GMT

Here is a sample file.

It is a two-page PDF. The first page is the one that has the information to be extracted.

https://ftp.wildfire.gov/public/incident_specific_data/southwest/GACC_Incidents/2023/2023_CottonwoodRidge/GIS/Maps/20230730/Pilot_and_Table_11x17_Land_20230729_2148_Cottonwood%20Ridge_AZFTA000555_0730day.pdf

Re: Extract Text from PDF maps

Anonymous User — Mon, 31 Jul 2023 12:44:26 GMT

It looks like PyPDF2 (deprecated since December 2022) fails to grab any text from that PDF, but pypdf, the package from the same maintainer that the developer now recommends, gets it. It is hard to say why PyPDF2 is still in the base environment, other than that December 2022 is relatively recent and it is probably installed as a dependency of something else.

You can have the script install pypdf if it is not installed already...

import re

try:
    from pypdf import PdfReader
except ImportError:
    # pypdf is missing: ask conda to install it, then try the import again
    import json
    from subprocess import run

    proc = run(["conda", "install", "pypdf", "-q", "-y", "--json"],
               text=True, capture_output=True)
    res = json.loads(proc.stdout)
    if res.get('stderr'):
        print(res['stderr'])
    from pypdf import PdfReader

# creating a pdf reader object
reader = PdfReader(
    r'C:\Users\...\Pilot_and_Table_11x17_Land_20230729_2148_Cottonwood Ridge_AZFTA000555_0730day.pdf')

# printing number of pages in pdf file
print(f'pages in pdf: {len(reader.pages)}')

# getting a specific page from the pdf file
page = reader.pages[0]

# extracting text from page
text = page.extract_text()

# use regex to get the string: r"(?:\w*-\w*-\d*)"
#   (?:...)  non-capturing group
#   \w  matches any word character (equivalent to [a-zA-Z0-9_])
#   *   matches the previous token zero to unlimited times, as many as possible (greedy)
#   -   matches the character "-" literally (code point 45 decimal, 2D hex, 55 octal)
#   \d  matches a digit (equivalent to [0-9])
comp = re.compile(r'(?:\w*-\w*-\d*)')  # raw string so the backslashes reach the regex engine
match = comp.search(text)
print(f'extracted: {match.group(0)}')

result:

pages in pdf: 2
extracted: AZ-FTA-000555

out of all the text that is returned:

2
51DIV A
DIV G
DIV DCopyright:© 2013 National Geographic Society, i-cubed
34°14.5'N 34°14'N 34°13.5'N 34°13'N 34°12.5'N 34°12'N 34°11.5'N 34°11'N 34°10.5'N 34°10'N 34°9.5'N 34°8.99'N 34°8.5'N 34°8'N 34°7.5'N34°14'N 34°13.5'N 34°13'N 34°12.5'N 34°12'N 34°11.5'N 34°11'N 34°10.5'N 34°10'N 34°9.5'N 34°8.99'N 34°8.5'N 34°8'N110°4'W 110°4.5'W 110°5'W 110°5.5'W 110°6.01'W 110°6.5'W 110°7'W 110°7.5'W 110°8'W 110°8.5'W 110°9'W 110°9.5'W 110°10'W 110°10.5'W 110°11'W 110°11.5'W 110°12'W 110°12.5'W 110°13'W 110°13.51'W 110°14'W 110°14.5'W 110°15'W
110°4'W 110°4.5'W 110°5'W 110°5.5'W 110°6.01'W 110°6.5'W 110°7'W 110°7.5'W 110°8'W 110°8.5'W 110°9'W 110°9.5'W 110°10'W 110°10.5'W 110°11'W 110°11.5'W 110°12'W 110°12.5'W 110°13'W 110°13.51'W 110°14'W 110°14.5'W 110°15'W/Helispot
Division Break
Wildfire Daily Fire Perimeter
Temporary Flight Restriction
Contained
Uncontained
7/29/2023 2249
Acres from IR and GPS Acres from
IR and GPS Acres from IR and GPS
Acres from IR and GPS346Cottonwood Ridge
AZ-FTA-000555
07/29/2023 dayPilot
0 1 2
Miles
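
One caveat on the pattern above: \w*-\w*-\d* allows every part to be empty, so it would also accept something like "--", and search() returns whichever hyphenated token comes first on the page. If that ever picks up the wrong string, a tighter pattern might help. The sketch below assumes the ID is always two capital letters, a 3 to 4 character unit code, and six digits. That shape is only inferred from the two IDs mentioned in this thread (AZ-FTA-000574 and AZ-FTA-000555), and find_incident_ids is a made-up helper name, so adjust both to your data.

```python
import re

# Assumed ID shape (inferred from AZ-FTA-000574 / AZ-FTA-000555, not from any spec):
# two capital letters, a 3-4 character unit code, then exactly six digits.
# The lookarounds keep the match from starting inside a run of capitals
# or from cutting a longer digit run short.
INCIDENT_ID = re.compile(r"(?<![A-Z])[A-Z]{2}-[A-Z0-9]{3,4}-\d{6}(?!\d)")


def find_incident_ids(text):
    """Return each ID-like string found in text, in order, without duplicates."""
    found = []
    for match in INCIDENT_ID.finditer(text):
        if match.group(0) not in found:
            found.append(match.group(0))
    return found


# A few lines copied from the extracted text shown above
sample = (
    "Acres from IR and GPS346Cottonwood Ridge\n"
    "AZ-FTA-000555\n"
    "07/29/2023 dayPilot\n"
)
print(find_incident_ids(sample))
```

Returning a list instead of the first hit also makes it easy to notice when a page unexpectedly has zero or several candidates.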

Re: Extract Text from PDF maps

DavidAnderson_1701 — Mon, 31 Jul 2023 22:52:55 GMT

You had me at re.

This does work the way I expected it to. I too am not sure why a 7-year-old version of PyPDF2 is in the packages, other than as a dependency.

I was trying to avoid setting up a new environment for this, which would be needed to add pypdf.
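
For anyone in the same spot, a script can at least check which PDF package is importable before it tries to install anything. This is a minimal sketch using only the standard library; pick_pdf_backend is a hypothetical helper name, and the candidate order (pypdf first, PyPDF2 as fallback) is just an assumption based on this thread.

```python
import importlib.util


def pick_pdf_backend(candidates=("pypdf", "PyPDF2")):
    """Return the first importable module name from candidates, or None."""
    for name in candidates:
        # find_spec returns None when a top-level module cannot be found
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = pick_pdf_backend()
if backend is None:
    print("No PDF package found; one would need to be installed in a cloned environment.")
else:
    print(f"Using {backend}")
```

Nothing is imported or installed by the check itself, so it is safe to run in the default environment.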

Anyhow, thanks for the well-crafted solution.