Extract Text from PDF maps

DavidAnderson_1701 · ‎07-30-2023

Hello,

I want to extract some of the text strings from a PDF map document that was exported from a layout. For example, I want to extract the string AZ-FTA-000574 that is stored in a PDF.

I tried the PyPDF2 package that comes with ArcGIS Pro. It returns the georeferencing information but no text. The text is present, since the PDF is searchable for labels, as per this post.

https://support.esri.com/en-us/knowledge-base/problem-unable-to-search-for-text-in-an-exported-pdf-f...

The included PyPDF2 is version 1.26, which appears to date from around 2016. A bit out of date.

I'd like to do this with the out of the box tools shipped with Pro, rather than installing ReportLab or other Python PDF tools.

Anonymous User · ‎07-31-2023

It looks like PyPDF2 (deprecated since December 2022) fails to extract any text from that PDF, but pypdf, the package its maintainer now recommends, does. It's hard to say why PyPDF2 is still in the base environment, other than that December 2022 is relatively recent and it is probably installed as a dependency.

You can have the script install pypdf if it is not installed already...

import re

try:
    from pypdf import PdfReader
except ImportError:
    # pypdf is missing: try to install it into the active conda environment
    from subprocess import run

    proc = run(["conda", "install", "pypdf", "-q", "-y"], text=True, capture_output=True)

    if proc.returncode != 0:
        # stop here; without pypdf the rest of the script cannot run
        raise RuntimeError(f'conda install failed:\n{proc.stderr}')

    from pypdf import PdfReader

# creating a pdf reader object
reader = PdfReader(
    r'C:\Users\...\Pilot_and_Table_11x17_Land_20230729_2148_Cottonwood Ridge_AZFTA000555_0730day.pdf')

# printing number of pages in pdf file
print(f'pages in pdf: {len(reader.pages)}')

# getting a specific page from the pdf file
page = reader.pages[0]

# extracting text from page
text = page.extract_text()

# use regex to get the string:
# r"(?:\w*-\w*-\d*)"
# Non-capturing group (?:\w*-\w*-\d*)
# \w matches any word character (equivalent to [a-zA-Z0-9_])
# * matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
# - matches the character - (code point 45 decimal, 2D hex, 55 octal) literally (case sensitive)
# \w matches any word character (equivalent to [a-zA-Z0-9_])
# \d matches a digit (equivalent to [0-9])

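# raw string so the backslashes reach the regex engine unchanged
comp = re.compile(r'(?:\w*-\w*-\d*)')
res = comp.search(text)

# search() returns None when nothing matches, so check before calling group()
if res:
    print(f'extracted: {res.group(0)}')
else:
    print('no match found')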

result:

pages in pdf: 2
extracted: AZ-FTA-000555

out of all the text that is returned:

2
51DIV A
DIV G
DIV DCopyright:© 2013 National Geographic Society, i-cubed
34°14.5'N 34°14'N 34°13.5'N 34°13'N 34°12.5'N 34°12'N 34°11.5'N 34°11'N 34°10.5'N 34°10'N 34°9.5'N 34°8.99'N 34°8.5'N 34°8'N 34°7.5'N34°14'N 34°13.5'N 34°13'N 34°12.5'N 34°12'N 34°11.5'N 34°11'N 34°10.5'N 34°10'N 34°9.5'N 34°8.99'N 34°8.5'N 34°8'N110°4'W 110°4.5'W 110°5'W 110°5.5'W 110°6.01'W 110°6.5'W 110°7'W 110°7.5'W 110°8'W 110°8.5'W 110°9'W 110°9.5'W 110°10'W 110°10.5'W 110°11'W 110°11.5'W 110°12'W 110°12.5'W 110°13'W 110°13.51'W 110°14'W 110°14.5'W 110°15'W
110°4'W 110°4.5'W 110°5'W 110°5.5'W 110°6.01'W 110°6.5'W 110°7'W 110°7.5'W 110°8'W 110°8.5'W 110°9'W 110°9.5'W 110°10'W 110°10.5'W 110°11'W 110°11.5'W 110°12'W 110°12.5'W 110°13'W 110°13.51'W 110°14'W 110°14.5'W 110°15'W/Helispot
Division Break
Wildfire Daily Fire Perimeter
Temporary Flight Restriction
Contained
Uncontained
7/29/2023 2249
Acres from IR and GPS Acres from
IR and GPS Acres from IR and GPS
Acres from IR and GPS346Cottonwood Ridge
AZ-FTA-000555
07/29/2023 dayPilot
0 1 2
Miles
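
The pattern above uses `*` everywhere, so it would also accept a bare `--` if one appeared earlier in the page text. A minimal sketch of a stricter match, assuming the IDs always look like the one example in this thread (two uppercase letters, a short alphanumeric unit code, six digits); that shape is a guess from AZ-FTA-000555, not a documented format, and `find_incident_ids` is a hypothetical helper name:

```python
import re

# Assumed ID shape, inferred only from the example AZ-FTA-000555:
# two uppercase letters, 2-5 uppercase letters/digits, six digits.
INCIDENT_ID = re.compile(r"\b[A-Z]{2}-[A-Z0-9]{2,5}-\d{6}\b")


def find_incident_ids(text):
    """Return every ID-shaped string in text, in order of appearance, without duplicates."""
    found = []
    for match in INCIDENT_ID.finditer(text):
        if match.group(0) not in found:
            found.append(match.group(0))
    return found


# A few lines copied from the extracted page text above
sample = "Acres from IR and GPS346Cottonwood Ridge\nAZ-FTA-000555\n07/29/2023 dayPilot"
print(find_incident_ids(sample))  # ['AZ-FTA-000555']
```

With real data, `sample` would be the string returned by `page.extract_text()`. If the IDs can take other shapes, the quantifiers would need loosening.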

DavidAnderson_1701 · ‎07-30-2023

This is a similar question to https://community.esri.com/t5/geoprocessing-questions/extracting-annotation-from-pdf-maps/m-p/127753...

No answers there though.

Anonymous User · ‎07-30-2023

Do you have a sample pdf you can share?

DavidAnderson_1701 · ‎07-30-2023

Here is a sample file.

It is a two-page PDF. The first page is the one that has the information to be extracted.

https://ftp.wildfire.gov/public/incident_specific_data/southwest/GACC_Incidents/2023/2023_Cottonwood...
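
Since only one page of a multi-page PDF carries the ID, one option is to walk the pages and stop at the first match instead of hard-coding a page index. A rough sketch, assuming page objects that expose `extract_text()` the way pypdf pages do; `first_id_in_pages` and `FakePage` are hypothetical names, and `FakePage` only stands in for a real page so the example runs without a PDF on disk:

```python
import re

# Like the ID pattern used in this thread, but with + so empty parts cannot match
ID_PATTERN = re.compile(r"\w+-\w+-\d+")


def first_id_in_pages(pages, pattern=ID_PATTERN):
    """Return (page_index, matched_text) for the first page with a match, else None."""
    for index, page in enumerate(pages):
        text = page.extract_text() or ""  # treat a page with no text as empty
        match = pattern.search(text)
        if match:
            return index, match.group(0)
    return None


class FakePage:
    """Hypothetical stand-in for a PDF page object."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("Cottonwood Ridge\nAZ-FTA-000555\nMiles"), FakePage("table page")]
print(first_id_in_pages(pages))  # (0, 'AZ-FTA-000555')
```

With pypdf installed, passing `PdfReader(path).pages` in place of the fake list should work the same way.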

DavidAnderson_1701 · ‎07-31-2023

You had me at re.

This does work the way I expected it to. I too am not sure why a 7-year-old version of PyPDF2 is in the packages, other than as a dependency.

I was trying to avoid setting up a new environment for this, which would be needed to add pypdf.

Anyhow thanks for the well crafted solution.