I can't extract the last page's content. Can someone debug this?

Time:01-08

I am trying to convert a PDF into two lists, titles and content, but I find that this function does not work for the PDF's last page.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer,LTChar
#pdf--> title list and content list 
def extract_title_content(path):
    title=[]
    content=[]
    a=""
    b=""   
    mode,minn= check_size(path)
    for page_layout in extract_pages(path):
        title.append(a)
        content.append(b)
        a=""
        b=""           
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:               
                    for character in text_line:
                        if isinstance(character, LTChar):                       
                            if character.size > mode:
                                a =character.get_text()
                            elif character.size> minn:
                                b =character.get_text()
                            else:
                                pass  
    return title,content

User response:

In your outer loop, you first append what was extracted on the previous pass (the larger text in a to title, the medium text in b to content), then clear a and b, and only then extract new text into a and b:

for page_layout in extract_pages(path):
    title.append(a)
    content.append(b)
    a=""
    b=""           
    [... extract into a and b ...]

Thus, what you extract from the last page is never added to title and content. For the same reason, the first entry of each list is just the empty initial value of a and b.
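
The shift can be reproduced without a PDF. This is only a minimal sketch: each page is reduced to a plain string, only the title list is tracked, and the names `collect_before_extract` and the sample page strings are made up for illustration.

```python
def collect_before_extract(pages):
    """Mimic the loop from the question: append first, extract afterwards."""
    title = []
    a = ""
    for page in pages:
        title.append(a)  # stores the PREVIOUS page's text ("" on the first pass)
        a = ""
        a = page         # stand-in for the pdfminer extraction of this page
    return title         # whatever is still in `a` is lost here


result = collect_before_extract(["page 1", "page 2", "page 3"])
print(result)  # ['', 'page 1', 'page 2']
```

The list has the right length, which makes the bug easy to miss: every entry is shifted by one page and "page 3" never shows up.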

To fix this, either move the appends to title and content after the code that fills a and b:

for page_layout in extract_pages(path):
    [... extract into a and b ...]
    title.append(a)
    content.append(b)
    a=""
    b=""           

Or, if you append before filling for a reason, explicitly append once more after the loop:

for page_layout in extract_pages(path):
    title.append(a)
    content.append(b)
    a=""
    b=""           
    [... extract into a and b ...]
title.append(a)
content.append(b)
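
As a rough, self-contained illustration of the first option, the sketch below uses plain (size, text) tuples in place of pdfminer's LTChar objects so that it runs without a PDF. The function name `split_pages`, the thresholds, and the sample data are hypothetical. It also assumes that the text of a page should be accumulated with `+=`, which the original code does not do.

```python
def split_pages(pages, title_size, min_size):
    """Return one title string and one content string per page."""
    titles, contents = [], []
    for page in pages:
        title_text, content_text = "", ""  # fresh buffers for every page
        for size, text in page:
            if size > title_size:
                title_text += text         # assumption: accumulate, not overwrite
            elif size > min_size:
                content_text += text
            # anything at or below min_size is ignored
        # Append AFTER extraction, so the last page is included too.
        titles.append(title_text)
        contents.append(content_text)
    return titles, contents


# Hypothetical sample data: two pages of (font size, text) pairs.
pages = [
    [(18, "Intro"), (11, "first body")],
    [(18, "End"), (11, "last body"), (6, "footnote")],
]
titles, contents = split_pages(pages, title_size=12, min_size=8)
print(titles)    # ['Intro', 'End']
print(contents)  # ['first body', 'last body']
```

Creating the buffers inside the loop also removes the need to reset a and b by hand, and the lists no longer start with an empty entry.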