I have the following test string:
================================================================================\nCorporate Participants\n================================================================================\n * Kirk Walters\n Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n * Beth Messmore\n Merrill Lynch - Analyst\n * Troy Ward\n A.G. Edwards - Analyst\n * Lori Hasiner\n FBR - Analyst\n * Tom Doheny\n Sandler O\'Neill - Analyst\n * Gerard Cassidy\n RBC Capital Markets - Analyst\n * Faye Elliott-Gurney\n Lehman Brothers - Analyst\n\n================================================================================
I would like to get * Kirk Walters\n and
* Beth Messmore\n Merrill Lynch - Analyst\n * Troy Ward\n A.G. Edwards - Analyst\n * Lori Hasiner\n FBR - Analyst\n * Tom Doheny\n Sandler O\'Neill - Analyst\n * Gerard Cassidy\n RBC Capital Markets - Analyst\n * Faye Elliott-Gurney\n Lehman Brothers - Analyst\n
My code so far for the first one is
participants_corp = re.findall('Corporate Participants\n================================================================================\n (.*)\n\n================================================================================\nConference Call Participants', str)
I think it has something to do with the backslashes in the newline escapes. I tried using four backslashes instead of one, but that didn't change anything. Can you give advice?
CodePudding user response:
Without regex, you can use:
import io
buf = io.StringIO(text)
data = []
for line in buf:
    line = line.strip()
    if line.startswith('*'):
        line1 = next(buf).strip().split('-')
        data.append({'name': line[1:].strip(),
                     'company': line1[0].strip(),
                     'job': line1[1].strip()})
print(data)
# Output
[{'name': 'Kirk Walters',
  'company': 'Chittenden Corporation',
  'job': 'EVP and Chief Financial Officer and Treasurer and CTC'},
 {'name': 'Beth Messmore', 'company': 'Merrill Lynch', 'job': 'Analyst'},
 {'name': 'Troy Ward', 'company': 'A.G. Edwards', 'job': 'Analyst'},
 {'name': 'Lori Hasiner', 'company': 'FBR', 'job': 'Analyst'},
 {'name': 'Tom Doheny', 'company': "Sandler O'Neill", 'job': 'Analyst'},
 {'name': 'Gerard Cassidy',
  'company': 'RBC Capital Markets',
  'job': 'Analyst'},
 {'name': 'Faye Elliott-Gurney',
  'company': 'Lehman Brothers',
  'job': 'Analyst'}]
Setup:
text = """\
================================================================================
Corporate Participants
================================================================================
* Kirk Walters
Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC
================================================================================
Conference Call Participants
================================================================================
* Beth Messmore
Merrill Lynch - Analyst
* Troy Ward
A.G. Edwards - Analyst
* Lori Hasiner
FBR - Analyst
* Tom Doheny
Sandler O'Neill - Analyst
* Gerard Cassidy
RBC Capital Markets - Analyst
* Faye Elliott-Gurney
Lehman Brothers - Analyst
================================================================================"""
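If the two groups need to stay separate, as the question asks, the same line-by-line idea can also track which heading it is currently under. This is only a rough sketch: the `parse_sections` helper and the shortened `sample` are made up for illustration, and it assumes every non-empty line that is neither a rule of `=` characters nor a `*` entry is a section heading.

```python
import io

# Shortened sample in the same layout as the transcript (assumed format:
# a heading between two "=" rules, then "* name" / "company - job" pairs).
sample = """\
================
Corporate Participants
================
 * Kirk Walters
 Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC

================
Conference Call Participants
================
 * Beth Messmore
 Merrill Lynch - Analyst
 * Faye Elliott-Gurney
 Lehman Brothers - Analyst

================
"""


def parse_sections(text):
    """Group participants under the heading they appear beneath."""
    sections = {}
    current = None
    buf = io.StringIO(text)
    for raw in buf:
        line = raw.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and "=====" rules
        if line.startswith("*"):
            # The next line holds "company - job"; split on the first " - " only.
            company, _, job = next(buf).strip().partition(" - ")
            sections.setdefault(current, []).append(
                {"name": line[1:].strip(), "company": company, "job": job}
            )
        else:
            current = line  # any other non-empty line is treated as a heading
    return sections


result = parse_sections(sample)
print(result)
```

Splitting on `" - "` (with the surrounding spaces) rather than a bare `-` keeps a hyphenated company name in one piece.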
CodePudding user response:
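The backslashes are not the problem: in a normal string '\n' is already a newline character, and in a raw string the regex engine itself interprets \n as a newline. The problem is that the group (.*) matches any character except line terminators, while the text you want spans more than one line. Use [\S\s]* instead, which matches newlines as well (passing the re.DOTALL flag so that . also matches newlines has the same effect):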
import re
your_string = "================================================================================\nCorporate Participants\n================================================================================\n * Kirk Walters\n Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n * Beth Messmore\n Merrill Lynch - Analyst\n * Troy Ward\n A.G. Edwards - Analyst\n * Lori Hasiner\n FBR - Analyst\n * Tom Doheny\n Sandler O\'Neill - Analyst\n * Gerard Cassidy\n RBC Capital Markets - Analyst\n * Faye Elliott-Gurney\n Lehman Brothers - Analyst\n\n================================================================================"
participants_corp = re.findall(r"Corporate Participants\n================================================================================\n ([\S\s]*)\n\n================================================================================\nConference Call Participants", your_string)
print(participants_corp)
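To get both blocks in one pass, a variant worth trying is `re.DOTALL` with a lazy group, so the capture stops at the first blank line followed by a rule. This is a sketch, not a drop-in replacement: the `RULE`, `sample`, `pattern`, and `blocks` names are made up here, and it assumes each block ends with an empty line followed by a line of `=` characters, as in the test string.

```python
import re

RULE = "=" * 80  # the separator line used in the transcript

# Shortened version of the test string from the question.
sample = (
    f"{RULE}\nCorporate Participants\n{RULE}\n"
    " * Kirk Walters\n"
    " Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n"
    f"{RULE}\nConference Call Participants\n{RULE}\n"
    " * Beth Messmore\n Merrill Lynch - Analyst\n"
    " * Troy Ward\n A.G. Edwards - Analyst\n\n"
    f"{RULE}"
)

# Heading, rule, then lazily capture everything up to "blank line + next rule".
# re.DOTALL lets "." match newlines, so the group can span several lines.
pattern = re.compile(
    r"(Corporate Participants|Conference Call Participants)\n=+\n(.*?)\n\n=+",
    re.DOTALL,
)

blocks = dict(pattern.findall(sample))
print(blocks["Corporate Participants"])
print(blocks["Conference Call Participants"])
```

The lazy `.*?` matters: a greedy `.*` with `re.DOTALL` would run on to the last rule in the string and swallow both sections into one match.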
CodePudding user response:
>>> p = re.compile(r"\* (.*)\n", re.MULTILINE)
>>> p.findall(given_text)
['Kirk Walters', 'Beth Messmore', 'Troy Ward', 'Lori Hasiner', 'Tom Doheny', 'Gerard Cassidy', 'Faye Elliott-Gurney']
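Compile the pattern and call findall() on the text. Note that re.MULTILINE only changes how ^ and $ match, so this pattern, which uses neither, returns the same result without the flag.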
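If the company and job title are wanted along with each name, one option is to anchor the pattern with `^` and `$`, which is the case where `re.MULTILINE` actually makes a difference. This is a minimal sketch: the `sample`, `pair`, and `rows` names are made up for illustration, and it assumes every `* name` line is directly followed by a `company - job` line.

```python
import re

# Two entries in the same layout as the transcript.
sample = (
    " * Kirk Walters\n"
    " Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n"
    "\n"
    " * Faye Elliott-Gurney\n"
    " Lehman Brothers - Analyst\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# not just at the start and end of the whole string.
pair = re.compile(r"^[ \t]*\* (.+)\n[ \t]*(.+?) - (.+)$", re.MULTILINE)

rows = pair.findall(sample)
for name, company, job in rows:
    print(name, "|", company, "|", job)
```

The company group is lazy so it stops at the first `" - "`, while the hyphen in a name such as Faye Elliott-Gurney is unaffected because it has no surrounding spaces.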
