TL;DR: I wrote a small program that removes the redundant pages from a PDF.
2023/08/19: Some PDFs are encoded, and this program does not work well on that kind of PDF.
Code: https://colab.research.google.com/drive/1RtjjD9TcNi3HFvfVdE4owegIfqYWxXa4?usp=sharing
When I was in graduate school, I kept running into the same problem. PDF presentations cannot have animations, so some speakers split a single slide into several pages. The only difference between these pages is one more or one less piece of text, which gives an animation-like effect. For students, though, this kind of PDF is painful, because a document that might originally be a dozen or so pages turns into more than thirty. For instance, in one of the lecture handouts for Introduction to Statistical Learning, a single slide is split into 4 pages like this:
So I wrote some code, which currently lives in Colab. For now it only handles PDFs that have page numbers; other features can come later.
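As a rough illustration of how page numbers can drive the cleanup, here is a minimal sketch. It is not necessarily what the Colab notebook does. It assumes that the printed page number is the last non-empty line of each page's extracted text, and that the last page of each run is the complete one. The names `printed_page_number` and `pages_to_keep` are hypothetical.

```python
import re
from typing import List, Optional


def printed_page_number(text: str) -> Optional[int]:
    """Return the printed page number of a page, or None if none is found.

    Assumption: the page number is the last non-empty line of the page text.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    match = re.fullmatch(r"\d+", lines[-1])
    return int(match.group()) if match else None


def pages_to_keep(texts: List[str]) -> List[int]:
    """Return 0-based indices of the pages to keep.

    Consecutive pages that share a printed page number are treated as
    animation steps, and only the last page of each run is kept.
    Pages with no detectable number are always kept.
    """
    numbers = [printed_page_number(t) for t in texts]
    keep = []
    for i, num in enumerate(numbers):
        is_last_page = i == len(numbers) - 1
        if num is None or is_last_page or numbers[i + 1] != num:
            keep.append(i)
    return keep


# Made-up page texts: slide 2 is split into three animation steps.
texts = [
    "Title\n1",
    "Idea A\n2",
    "Idea A\nIdea B\n2",
    "Idea A\nIdea B\nIdea C\n2",
    "Summary\n3",
]
print(pages_to_keep(texts))  # [0, 3, 4]
```

Keeping the last page of each run is a design choice that fits slides where text is added step by step. If a speaker removes text instead, the first page of the run would be the better one to keep.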
That said, I want to share how to handle PDFs with Python, because PyPDF2 is now at version 3.0.0 but most of the tutorials online are still written for the previous version. PDFs are now read with PdfReader. The features I use are loading the PDF, getting the number of pages, getting a single page, and printing the text content of the first page.
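from PyPDF2 import PdfReader

path = "slides.pdf"  # placeholder: path to the input PDF
reader = PdfReader(path)  # load the PDF
page_len = len(reader.pages)  # number of pages
page = reader.pages[0]  # first page (pages are 0-indexed)
print(page.extract_text())  # print the text content of the first page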
For PDF output I use PdfWriter. The features I use are creating a writer object, adding pages to it, and saving the PDF.
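from PyPDF2 import PdfWriter

output_path = "output.pdf"  # placeholder: where to save the new PDF
writer = PdfWriter()  # create an empty writer
writer.add_page(page)  # add a page object obtained from a PdfReader
with open(output_path, "wb") as fp:
    writer.write(fp)  # save the PDF to disk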
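Putting the reader and the writer together, a small helper can copy only selected pages into a new file. This is a sketch under assumptions, not the code from the Colab notebook: `copy_pages` and its parameters are hypothetical names, and the list of page indices to keep has to come from somewhere else.

```python
from typing import Iterable


def copy_pages(src_path: str, dst_path: str, keep: Iterable[int]) -> int:
    """Copy the pages at the given 0-based indices from src_path to dst_path.

    Returns the number of pages written.
    """
    # Imported inside the function so that defining it does not require PyPDF2.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    count = 0
    for index in keep:
        writer.add_page(reader.pages[index])  # raises IndexError if out of range
        count += 1
    with open(dst_path, "wb") as fp:
        writer.write(fp)
    return count
```

For example, `copy_pages("slides.pdf", "slides_short.pdf", [0, 3, 4])` would write a three-page PDF, where both file names are placeholders.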