Extract Text from PDF maps

07-30-2023 04:21 PM
DavidAnderson_1701
Frequent Contributor

Hello,

I want to extract some of the text strings in a PDF map document that was exported from a layout. For example, I want to extract the string AZ-FTA-000574 that is stored in a PDF.

I tried the PyPDF2 package that comes with ArcGIS Pro. It returns the georeferencing information but no text. The text is present, because the PDF is searchable for labels as described in this post:

https://support.esri.com/en-us/knowledge-base/problem-unable-to-search-for-text-in-an-exported-pdf-f...

The included PyPDF2 is version 1.26, which appears to date from around 2016, so it is a bit out of date.

I'd like to do this with the out-of-the-box tools shipped with Pro, rather than installing ReportLab or other Python PDF tools.

 

 

 


Accepted Solutions
by Anonymous User
Not applicable

It looks like PyPDF2 (deprecated since December 2022) fails to extract any text from that PDF, but pypdf, the package from the same maintainer that the developer now recommends, gets it. It is hard to say why PyPDF2 is still in the base environment; December 2022 is relatively recent, and it is probably installed as a dependency of another package.
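To see which of the two packages a given environment actually provides, a quick check along these lines may help. This is only a sketch: `installed_version` is a hypothetical helper name, and it assumes the packages were installed with standard metadata so that `importlib.metadata` can find them.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report what the active environment provides for each package name
for name in ("pypdf", "PyPDF2"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If only PyPDF2 shows up, the environment is in the state described in the question.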

You can have the script install pypdf if it is not installed already...

 

import re

try:
    from pypdf import PdfReader
except ImportError:
    # pypdf is missing: try to install it into the active conda environment
    from subprocess import run

    proc = run(["conda", "install", "pypdf", "-q", "-y"], text=True, capture_output=True)

    if proc.returncode != 0:
        # stop here, otherwise PdfReader would be undefined further down
        raise SystemExit(f'conda install failed:\n{proc.stderr or proc.stdout}')

    from pypdf import PdfReader

# create a PDF reader object
reader = PdfReader(
    r'C:\Users\...\Pilot_and_Table_11x17_Land_20230729_2148_Cottonwood Ridge_AZFTA000555_0730day.pdf')

# print the number of pages in the PDF file
print(f'pages in pdf: {len(reader.pages)}')

# get the first page, which holds the string to extract
page = reader.pages[0]

# extract the text from the page
text = page.extract_text()

# use a regex to get the string:
# r"(?:\w+-\w+-\d+)"
# (?:...) is a non-capturing group
# \w matches any word character (letters, digits and underscore)
# + matches the previous token one or more times, as many times as possible (greedy)
# - matches the hyphen character (45 decimal, 2D hex, 55 octal) literally
# \d matches a digit

comp = re.compile(r'(?:\w+-\w+-\d+)')
match = comp.search(text)

# search() returns None when nothing matches, so check before calling group()
if match:
    print(f'extracted: {match.group(0)}')
else:
    print('no matching string found')

 

 

result:

pages in pdf: 2
extracted: AZ-FTA-000555

out of all the text that is returned:

2
51DIV A
DIV G
DIV DCopyright:© 2013 National Geographic Society, i-cubed
34°14.5'N 34°14'N 34°13.5'N 34°13'N 34°12.5'N 34°12'N 34°11.5'N 34°11'N 34°10.5'N 34°10'N 34°9.5'N 34°8.99'N 34°8.5'N 34°8'N 34°7.5'N34°14'N 34°13.5'N 34°13'N 34°12.5'N 34°12'N 34°11.5'N 34°11'N 34°10.5'N 34°10'N 34°9.5'N 34°8.99'N 34°8.5'N 34°8'N110°4'W 110°4.5'W 110°5'W 110°5.5'W 110°6.01'W 110°6.5'W 110°7'W 110°7.5'W 110°8'W 110°8.5'W 110°9'W 110°9.5'W 110°10'W 110°10.5'W 110°11'W 110°11.5'W 110°12'W 110°12.5'W 110°13'W 110°13.51'W 110°14'W 110°14.5'W 110°15'W
110°4'W 110°4.5'W 110°5'W 110°5.5'W 110°6.01'W 110°6.5'W 110°7'W 110°7.5'W 110°8'W 110°8.5'W 110°9'W 110°9.5'W 110°10'W 110°10.5'W 110°11'W 110°11.5'W 110°12'W 110°12.5'W 110°13'W 110°13.51'W 110°14'W 110°14.5'W 110°15'W/Helispot
Division Break
Wildfire Daily Fire Perimeter
Temporary Flight Restriction
Contained
Uncontained
7/29/2023 2249
Acres from IR and GPS Acres from
IR and GPS Acres from IR and GPS
Acres from IR and GPS346Cottonwood Ridge
AZ-FTA-000555
07/29/2023 dayPilot
0 1 2
Miles
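Since the returned text is fairly noisy, a stricter pattern may be worth considering if the loose one ever picks up something else. The sketch below assumes the ID is always two capital letters, a three or four character unit code, and six digits; that shape is a guess based on AZ-FTA-000555, and `find_incident_ids` is a hypothetical helper name. It works on any text string, so it can be fed the output of `page.extract_text()`.

```python
import re

# Assumed ID shape: two capital letters, 3-4 character unit code, six digits
INCIDENT_ID = re.compile(r"\b[A-Z]{2}-[A-Z0-9]{3,4}-\d{6}\b")


def find_incident_ids(text):
    """Return the unique IDs found in text, in order of first appearance."""
    found = []
    for candidate in INCIDENT_ID.findall(text):
        if candidate not in found:
            found.append(candidate)
    return found


# A few lines taken from the extracted text above
sample = "Acres from IR and GPS346Cottonwood Ridge\nAZ-FTA-000555\n07/29/2023 dayPilot"
print(find_incident_ids(sample))
```

Note that the word boundaries mean an ID fused directly onto a preceding word (as happens elsewhere in this text, e.g. "GPS346Cottonwood") would not be found, so the pattern may need loosening for other layouts.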

 

by Anonymous User
Not applicable

Do you have a sample pdf you can share?

DavidAnderson_1701
Frequent Contributor

Here is a sample file.

It is a two-page PDF. The first page is the one that has the information to be extracted.

 

https://ftp.wildfire.gov/public/incident_specific_data/southwest/GACC_Incidents/2023/2023_Cottonwood...

DavidAnderson_1701
Frequent Contributor

You had me at re.

This does work the way I expected it to. I too am not sure why a 7-year-old version of PyPDF2 is in the packages, other than as a dependency.

I was trying to avoid setting up a new environment for this, which would be needed to add pypdf. 

Anyhow thanks for the well crafted solution.  
