Python has many libraries for reading text from image-based PDFs, but I found that pytesseract outperforms the others. Let's see how to create a quick Python script that reads an image PDF, extracts the required parameters or contents, and saves them as a CSV.

sudo apt-get update
sudo apt install python3-pip
sudo apt install unzip
pip3 install pandas
pip3 install pytesseract
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
sudo apt-get install -y poppler-utils
pip3 install pdf2image
pip3 install "PyPDF2<3.0"

Make sure you have a clean environment by running the commands above. They get your system ready to read and scan the necessary PDFs.
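Before going further, it can help to confirm that the command-line tools behind the Python packages are actually on your PATH. The sketch below is only an illustration: `missing_tools` is a hypothetical helper name, and it assumes the binaries are called `tesseract` (from tesseract-ocr) and `pdftoppm` (from poppler-utils, which pdf2image calls under the hood).

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are both available")
```

If anything is reported missing, re-run the matching apt install command before running the OCR script.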

Import the necessary packages in the script. We will be using all of these packages from Python.

import pytesseract
from pdf2image import convert_from_path
import glob
import re
import pandas as pd
from PyPDF2 import PdfFileWriter, PdfFileReader

Read your folder and collect the paths of the PDF files in it.

pdfs = glob.glob(r"/home/ubuntu/pdffiles/*.PDF")
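One thing to watch: on Linux, glob patterns are case-sensitive, so `*.PDF` will skip files that end in `.pdf`. If your folder mixes both, a small helper along these lines may be handy. This is a sketch under that assumption; `find_pdfs` is a hypothetical name, not part of glob or pathlib.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return sorted paths of files in `folder` ending in .pdf, in any letter case."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Example usage (the folder must exist):
# pdfs = find_pdfs("/home/ubuntu/pdffiles")
```

The result is a plain list of path strings, so it can stand in for the `pdfs` list wherever the script loops over it.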

Then write the script:

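for pdf_path in pdfs:
    # CSV path derived from the PDF path (assumed naming), e.g. a.PDF -> a.csv
    filename = re.sub(r"\.pdf$", ".csv", pdf_path, flags=re.IGNORECASE)
    with open(pdf_path, "rb") as f:
        maxPages = PdfFileReader(f).numPages
    phonenumber = []
    # Convert and OCR the PDF in batches of 100 pages to limit memory use
    for page in range(1, maxPages + 1, 100):
        pages = convert_from_path(pdf_path, first_page=page, last_page=min(page + 100 - 1, maxPages))
        for pageNum, imgBlob in enumerate(pages):
            text = pytesseract.image_to_string(imgBlob, lang='eng')
            # Match 10-digit numbers whose first digit is 5-9
            phone = re.findall(r'\b[56789]\d{9}\b', text, flags=0)
            if phone:
                phonenumber.extend(phone)
    if phonenumber:
        df = pd.DataFrame({'PhoneNumber': phonenumber})
        df.to_csv(filename, index=False)
        print("successfully Generated {}".format(filename))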
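The phone-number pattern is easy to try on its own, without running any OCR. Note that it needs to be a raw string (`r'...'`), otherwise Python reads `\b` as a backspace character instead of a word boundary. The snippet below is a small sketch: `extract_phone_numbers` and the sample text are made up for illustration, and removing duplicates is my own assumption about what you want in the CSV.

```python
import re

# Same pattern as in the script: exactly 10 digits, first digit 5-9,
# with word boundaries so longer digit runs do not match.
PHONE_PATTERN = re.compile(r"\b[56789]\d{9}\b")


def extract_phone_numbers(text):
    """Return unique matches in order of first appearance."""
    return list(dict.fromkeys(PHONE_PATTERN.findall(text)))


sample = "Call 9876543210 or 9876543210. Ref 1234567890, id 98765432101."
print(extract_phone_numbers(sample))  # ['9876543210']
```

The number starting with 1 and the 11-digit number are both rejected, and the repeated number appears only once.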

Basically, what the script does is run through the list of files from the folder and convert 100 pages at a time to images, then scan each page for the specific text and extract the matches into a list.
It then converts that list into a pandas DataFrame and finally creates a CSV with a unique filename for each PDF.
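The "100 pages at a time" part is where off-by-one mistakes creep in, because pdf2image counts pages from 1 and treats `last_page` as inclusive. Here is a rough sketch of just the batching arithmetic; `page_batches` is a hypothetical helper, not something pdf2image provides.

```python
def page_batches(total_pages, batch_size=100):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    # The stop value is total_pages + 1 so the final page is not skipped
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


print(list(page_batches(250)))  # [(1, 100), (101, 200), (201, 250)]
```

Each pair can be passed as `first_page` and `last_page` to `convert_from_path`, so only one batch of page images is held in memory at a time.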