In this tutorial, we will learn how to extract text from a PDF file and save it to a text file using Python.
Before we dive into the tutorial, you will need to install the PyPDF2 library (pip install PyPDF2).
Source Code:
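from PyPDF2 import PdfFileReader

# Note: PdfFileReader, numPages, getPage and extractText are the PyPDF2 1.x names.
# PyPDF2 3.x replaces them with PdfReader, reader.pages and extract_text().
file_path = 'Lecture.pdf'
pdf = PdfFileReader(file_path)

# The with block closes the file automatically, so no f.close() is needed.
with open('Lecture Note.txt', 'w', encoding='utf-8') as f:
    for page_num in range(pdf.numPages):
        # print('Page: {0}'.format(page_num))
        pageObj = pdf.getPage(page_num)
        try:
            txt = pageObj.extractText()
            print(''.center(100, '-'))
        except Exception:
            # Skip pages whose text cannot be extracted.
            pass
        else:
            f.write('Page {0}\n'.format(page_num+1))
            f.write(''.center(100, '-') + '\n')
            f.write(txt)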
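Newer PyPDF2 releases rename the classes and methods used above, so the same idea can be sketched with the newer names. This is a minimal sketch, not the original author's code: it assumes PyPDF2 2.0 or later is installed, and the helper names format_pages and pdf_to_text are hypothetical, introduced only for illustration. The formatting step is kept separate from the PDF reading so that it can be checked without a PDF at hand.

```python
def format_pages(page_texts, rule_width=100):
    """Join per-page text into one string with a 'Page N' header and a dashed rule.

    page_texts holds one entry per page; None marks a page whose text
    could not be extracted, and such pages are skipped.
    """
    rule = '-' * rule_width
    chunks = []
    for page_num, txt in enumerate(page_texts, start=1):
        if txt is None:
            continue
        chunks.append('Page {0}\n{1}\n{2}\n'.format(page_num, rule, txt))
    return ''.join(chunks)


def pdf_to_text(pdf_path, txt_path):
    """Extract text from every page of pdf_path and write it to txt_path."""
    # Imported here so format_pages can be used without PyPDF2 installed.
    # Assumption: PyPDF2 >= 2.0, which provides PdfReader and extract_text().
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    page_texts = []
    for page in reader.pages:
        try:
            page_texts.append(page.extract_text())
        except Exception:
            # Mirror the tutorial: skip pages that fail to extract.
            page_texts.append(None)

    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(format_pages(page_texts))
```

With these helpers, the tutorial's example would be run as pdf_to_text('Lecture.pdf', 'Lecture Note.txt'). Writing with encoding='utf-8' is a deliberate choice here, since extracted text often contains characters that the platform's default encoding cannot represent.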
Thank you, this source code is very helpful.
Please explain how to extract images from a PDF as well.
It's urgent!
Computer Vision.. opencv2 library
This does not work for this PDF.
See the Google Drive link:
https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing
I get this error on the pdf = PdfFileReader(file_path) line:
fileobj = open(stream, 'rb')
OSError: [Errno 22] Invalid argument:
Hi,
I've tested your code, and it works.
But I get different results from yours.
I am testing with a journal paper.
In the output, there is always a line break between each word.
Can you suggest a way to solve this problem?
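One possible workaround for the one-word-per-line output, offered as a sketch rather than a fix tested on that paper: post-process the extracted text so that single line breaks become spaces while blank lines (paragraph breaks) are kept. The function name join_broken_lines is hypothetical, and it assumes paragraphs are separated by at least one blank line in the extracted text, which may not hold for every PDF.

```python
import re


def join_broken_lines(text):
    """Collapse single line breaks into spaces, keeping blank-line paragraph breaks."""
    # Split on blank lines (two newlines, possibly with whitespace between them).
    paragraphs = re.split(r'\n\s*\n', text)
    # Within each paragraph, turn every run of whitespace into one space.
    cleaned = [' '.join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and rejoin with a single blank line.
    return '\n\n'.join(p for p in cleaned if p)
```

In the tutorial's loop this would be applied just before writing, for example f.write(join_broken_lines(txt)).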
I am getting this error while executing the code above. Kindly help.
TypeError: latin_1_encode() argument 1 must be str, not bytes