I am trying to convert a PDF into two lists, titles and content, but I find that this function does not work for the last page of the PDF.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

# pdf --> title list and content list
def extract_title_content(path):
    title = []
    content = []
    a = ""
    b = ""
    # check_size is not shown here
    mode, minn = check_size(path)
    for page_layout in extract_pages(path):
        title.append(a)
        content.append(b)
        a = ""
        b = ""
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            if character.size > mode:
                                a += character.get_text()
                            elif character.size > minn:
                                b += character.get_text()
                            else:
                                pass
    return title, content
Answer:
In your outer loop, you first append the most recently extracted larger text in a to title and the medium text in b to content, then clear a and b, and only then extract new text into a and b:
for page_layout in extract_pages(path):
    title.append(a)
    content.append(b)
    a = ""
    b = ""
    [... extract into a and b ...]
Thus, what you extract from the last page is never added to title and content.
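The effect can be reproduced without a PDF. The following is a minimal sketch in which a plain list of strings stands in for the pages; the name `collect_append_first` and the sample data are made up for illustration. It uses the same append-then-extract order as the loop in the question:

```python
def collect_append_first(pages):
    """Mimic the loop in the question: append the previous result, then extract."""
    collected = []
    current = ""
    for page_text in pages:
        collected.append(current)  # appends what the previous iteration extracted
        current = ""
        current += page_text       # stand-in for extracting into a / b
    return collected


pages = ["page 1", "page 2", "page 3"]
result = collect_append_first(pages)
print(result)  # ['', 'page 1', 'page 2']
```

The text of the last page is still sitting in `current` when the loop ends, and the first entry is the empty string that was appended before anything had been extracted.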
To fix this, either move the appending of a and b to title and content so that it happens after a and b have been filled:
for page_layout in extract_pages(path):
    [... extract into a and b ...]
    title.append(a)
    content.append(b)
    a = ""
    b = ""
Or, if you append before filling for a reason, explicitly append once more after the loop:
for page_layout in extract_pages(path):
    title.append(a)
    content.append(b)
    a = ""
    b = ""
    [... extract into a and b ...]
title.append(a)
content.append(b)
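
Another possibility, sketched here under assumptions rather than taken from the question, is to keep the accumulators local to each page so that nothing has to be carried across iterations. To keep the example runnable without a PDF, each page is a plain list of (text, size) pairs standing in for the LTChar objects. The function name `split_by_size`, the sample sizes, and the thresholds are hypothetical; `mode` and `minn` play the role of the values returned by check_size in the question.

```python
def split_by_size(pages, mode, minn):
    """Return (title, content) with one entry per page.

    pages: iterable of pages, each an iterable of (text, size) pairs.
    Characters larger than `mode` go to the title; characters larger
    than `minn` (but not larger than `mode`) go to the content.
    """
    title = []
    content = []
    for page in pages:
        a_parts = []  # title characters of this page only
        b_parts = []  # content characters of this page only
        for text, size in page:
            if size > mode:
                a_parts.append(text)
            elif size > minn:
                b_parts.append(text)
        # Append after extraction, so the last page is included too.
        title.append("".join(a_parts))
        content.append("".join(b_parts))
    return title, content


# Made-up sample data: sizes 14 (heading), 10 (body), 6 (footnote).
pages = [
    [("A", 14), ("b", 10), ("c", 10), ("x", 6)],
    [("D", 14), ("e", 10)],
]
title, content = split_by_size(pages, mode=12, minn=8)
print(title)    # ['A', 'D']
print(content)  # ['bc', 'e']
```

With pdfminer, the inner loop over (text, size) pairs would be replaced by the nested traversal from the question, using character.get_text() and character.size.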
