I have extracted a large number of .txt documents from folders (using glob). I have then appended each document into a list called MaxDoc by doing this:
import glob

Documents1 = glob.glob('path*.txt')
MaxDoc = []
for file in Documents1:
    f = open(file, 'r')
    MaxDoc.append(f.readlines())
    f.close()
Now, what I want to do is this: each item in that list is a whole document. Each document has a section that says "Date of Last Revision: mm/dd/yyyy" and also "Revision No: xx".
Here is a snapshot of the part of the document that contains this info:

I have been trying to see if I could iterate over the list and use regex to find the strings and extract the info. Once I extract it, I need to save it as a variable because I need to delete all the top portion of this document (the later portion of the doc is a table, and I want to convert that to a df). TIA for any advice!
CodePudding user response:
If each item in the list is the whole document (I assume a string), then you can run a regex search over the multi-line string.
I do not have your text (it was not given in the question), but I will show you an example with other text that you can use as a starting point for your own code.
import re

txt = '''
This is a part text
With multi line
It could be a text
document.
Revision No.: 18
Date: 12/27/2021
Other text
table,table,table
table,table,table
'''

# \s+ and \d+ match one or more whitespace characters / digits
reg_revision_no = re.compile(r'Revision\sNo\.:\s+\d+')
reg_date = re.compile(r'Date:\s+\d{2}/\d{2}/\d{4}')

files = [txt, txt]
for file in files:
    # group() returns the whole match, label included.
    # Note: the second argument of a compiled pattern's search() is a start
    # position, not a flag, so re.M must not be passed here.
    revision_no = reg_revision_no.search(file).group()
    date = reg_date.search(file).group()
    print(f"{revision_no}\n{date}")

Output:
Revision No.: 18
Date: 12/27/2021
Revision No.: 18
Date: 12/27/2021
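Adapting this to the labels from the question might look like the sketch below. It assumes the header lines read exactly "Date of Last Revision: mm/dd/yyyy" and "Revision No: xx" (I have not seen the real files), and that each document is one string, i.e. read with f.read() rather than f.readlines(), which returns a list of lines. The name extract_revision_info and the sample text are made up for illustration.

```python
import re

# Assumed label formats, taken from the wording in the question.
DATE_RE = re.compile(r'Date of Last Revision:\s*(\d{1,2}/\d{1,2}/\d{4})')
REV_RE = re.compile(r'Revision No:\s*(\d+)')


def extract_revision_info(text):
    """Return (date, revision_no) as strings; None where a label is missing."""
    date_match = DATE_RE.search(text)
    rev_match = REV_RE.search(text)
    date = date_match.group(1) if date_match else None
    revision_no = rev_match.group(1) if rev_match else None
    return date, revision_no


# Hypothetical document text standing in for one item of MaxDoc.
sample = (
    "Some header\n"
    "Date of Last Revision: 03/15/2020\n"
    "Revision No: 07\n"
    "col1,col2\n"
    "1,2\n"
)
print(extract_revision_info(sample))  # ('03/15/2020', '07')
```

Checking for None before calling group() avoids an AttributeError on documents where one of the labels is missing.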
CodePudding user response:
You could just look for the key words and then extract only the content:
import re

txt = '''
This is a part text
With multi line
It could be a text
document.
Revision No.: 18
Date: 12/27/2021
Other text
table,table,table
table,table,table
'''

# re.S lets '.' match newlines; '.*?' stops at the first 'Date:' after the revision line
results = re.findall(r'Revision No\.:\s*(\d+).*?Date:\s*([\d/]+)', txt, re.S)
for i in results:
    print('Revision:', i[0])
    print('Date:', i[1])
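To drop the top portion once the values are saved, one option is to slice the string at the end of the match. This is only a rough sketch: it reuses the sample labels from this answer and assumes the table starts on the line right after the date, which may not hold for the real files. split_header is a hypothetical helper name.

```python
import re

# Matches from the revision line through the end of the date line.
HEADER_RE = re.compile(
    r'Revision No\.:\s*(\d+).*?Date:\s*([\d/]+)[^\n]*\n', re.S
)


def split_header(text):
    """Return (revision, date, rest); rest is the text after the date line."""
    match = HEADER_RE.search(text)
    if match is None:
        # No header found: leave the text untouched.
        return None, None, text
    return match.group(1), match.group(2), text[match.end():]


# Made-up sample with the table directly below the date line.
sample = "intro\nRevision No.: 18\nDate: 12/27/2021\na,b,c\n1,2,3\n"
revision, date, table_text = split_header(sample)
print(revision, date)
print(table_text)
```

If the remaining text is plain delimited data, it could then be wrapped in io.StringIO and passed to pandas.read_csv to get the DataFrame, though the right separator depends on how the table is actually laid out.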
