Extracting Information from a List of Documents

Time:01-14

I have extracted a large number of .txt documents from folders (using glob). I have then appended each document into a list called MaxDoc by doing this:

import glob

Documents1 = glob.glob('path*.txt')

MaxDoc = []

for file in Documents1:
    # readlines() returns a list of lines, so each MaxDoc item is a list of strings
    with open(file, 'r') as f:
        MaxDoc.append(f.readlines())

Now, what I want to do is this: each item in that list is a whole document. Each document has a section that says "Date of Last Revision: mm/dd/yyyy" and also "Revision No: xx".

Here is a snapshot of the part of the doc that has the info: (image not available)

I have been trying to see if I could iterate over the list and use regex to find the strings and extract the info. Once I extract it, I need to save it as a variable because I need to delete all the top portion of this document (the later portion of the doc is a table, and I want to convert that to a df). TIA for any advice!

CodePudding user response:

If each item in the list is the whole document (I assume a string), then you can run a regex against the multi-line string.

I do not have your text (not given in the question) but I will show you an example with other text and you can use that as a start for your own code.

import re

txt='''
This is a part text
With multi line

It could be a text 
document.
Revision No.: 18
Date: 12/27/2021

Other text

table,table,table
table,table,table
'''

# \s+ and \d+ match one or more whitespace characters / digits
reg_revision_no = re.compile(r'(Revision\sNo\.:\s+\d+)')
reg_date = re.compile(r'(Date:\s+\d{2}\/\d{2}\/\d{4})')

files = [txt, txt]
for file in files:
    # A compiled pattern's search() takes (string, pos, endpos), not flags,
    # so no re.M argument here; these patterns do not need it anyway.
    revision_no = reg_revision_no.search(file).group()
    date = reg_date.search(file).group()
    print(f"Revision No.: {revision_no}\nDate: {date}")

Output:

Revision No.: Revision No.: 18
Date: Date: 12/27/2021
Revision No.: Revision No.: 18
Date: Date: 12/27/2021
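
Note that the question builds MaxDoc with readlines(), so each item is likely a list of lines rather than one string. A minimal sketch of how the same idea might be adapted to that case follows. The names doc_lines, REVISION_RE, DATE_RE and extract_header_info are made up for illustration, and the patterns assume the labels look like "Revision No: xx" / "Revision No.: xx" and "Date of Last Revision: mm/dd/yyyy" / "Date: mm/dd/yyyy".

```python
import re

# Hypothetical stand-in for one MaxDoc item: f.readlines() returns a list of lines.
doc_lines = [
    "Some header text\n",
    "Revision No.: 18\n",
    "Date: 12/27/2021\n",
    "\n",
    "table,table,table\n",
]

# Capture groups keep only the values, not the labels.
# Assumed label shapes: "Revision No:" or "Revision No.:", and "Date...:" on one line.
REVISION_RE = re.compile(r"Revision\s+No\.?:\s*(\d+)")
DATE_RE = re.compile(r"Date[^:\n]*:\s*(\d{2}/\d{2}/\d{4})")


def extract_header_info(lines):
    """Return (revision, date); each is None if its label is not found."""
    text = "".join(lines)  # rebuild the whole document as one string
    rev = REVISION_RE.search(text)
    date = DATE_RE.search(text)
    return (rev.group(1) if rev else None, date.group(1) if date else None)


revision, date = extract_header_info(doc_lines)
print(revision, date)  # prints: 18 12/27/2021
```

Checking for None before calling group() avoids an AttributeError on documents that lack one of the two lines. Over the real list, something like `info = [extract_header_info(doc) for doc in MaxDoc]` should give one (revision, date) pair per document.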

CodePudding user response:

You could just look for the key words and then extract only the content:

import re

txt = '''
This is a part text
With multi line

It could be a text 
document.
Revision No.: 18
Date: 12/27/2021

Other text

table,table,table
table,table,table
'''

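# (\d+) captures the revision number, ([\d\/]+) the date; re.S lets '.' span line breaks,
# and the non-greedy .*? stops at the first "Date:" after the revision line
results = re.findall(r'Revision No\.:\s*(\d+).*?Date:\s*([\d\/]+)', txt, re.S)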

for i in results:
    print('Revision:', i[0])
    print('Date:', i[1])
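
For the second half of the question (dropping the top portion and keeping the table), one possible approach is to cut the text at the end of the date line. This is only a sketch under two assumptions that the question does not confirm: the date line is the last header line before the table, and the table is comma-separated. The names header_end and split_header and the sample column names are invented for the example.

```python
import csv
import io
import re

# Made-up sample document; real files may be laid out differently.
txt = """Intro text
Revision No.: 18
Date: 12/27/2021

col_a,col_b,col_c
1,2,3
4,5,6
"""

# Assumption: the date line is the last header line before the table.
header_end = re.compile(r"Date[^:\n]*:\s*[\d/]+[^\n]*\n")


def split_header(text):
    """Return (header, body), split just after the date line."""
    m = header_end.search(text)
    if m is None:
        return "", text  # no header found: treat everything as body
    return text[:m.end()], text[m.end():]


header, body = split_header(txt)
# csv.reader yields an empty list for blank lines, so filter those out.
rows = [row for row in csv.reader(io.StringIO(body)) if row]
print(rows)
```

If pandas is available, the same body string could probably be loaded with `pandas.read_csv(io.StringIO(body))` instead of the csv module, provided the table really is comma-separated.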