Regex findall with \n in text

I have the following test string:

================================================================================\nCorporate Participants\n================================================================================\n   *  Kirk Walters\n      Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n   *  Beth Messmore\n      Merrill Lynch - Analyst\n   *  Troy Ward\n      A.G. Edwards - Analyst\n   *  Lori Hasiner\n      FBR - Analyst\n   *  Tom Doheny\n      Sandler O\'Neill - Analyst\n   *  Gerard Cassidy\n      RBC Capital Markets - Analyst\n   *  Faye Elliott-Gurney\n      Lehman Brothers - Analyst\n\n================================================================================

I would like to get * Kirk Walters\n and

*  Beth Messmore\n      Merrill Lynch - Analyst\n   *  Troy Ward\n      A.G. Edwards - Analyst\n   *  Lori Hasiner\n      FBR - Analyst\n   *  Tom Doheny\n      Sandler O\'Neill - Analyst\n   *  Gerard Cassidy\n      RBC Capital Markets - Analyst\n   *  Faye Elliott-Gurney\n      Lehman Brothers - Analyst\n

My code so far for the first one is

participants_corp = re.findall('Corporate Participants\n================================================================================\n   (.*)\n\n================================================================================\nConference Call Participants', str)

I think it has something to do with the backslashes for the newline characters. I tried using four backslashes instead of one, but that didn't change anything. Can you give advice?

CodePudding user response：

Without regex, you can use:

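import io

buf = io.StringIO(text)  # lets us iterate over the text line by line
data = []
for line in buf:
    line = line.strip()
    if line.startswith('*'):
        # The line after each '*  Name' line holds 'Company - Job title'.
        # Split on the first ' - ' only, so hyphens inside a title stay intact.
        company, job = next(buf).strip().split(' - ', 1)
        data.append({'name': line[1:].strip(),
                     'company': company.strip(),
                     'job': job.strip()})
print(data)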

# Output
[{'name': 'Kirk Walters',
  'company': 'Chittenden Corporation',
  'job': 'EVP and Chief Financial Officer and Treasurer and CTC'},
 {'name': 'Beth Messmore', 'company': 'Merrill Lynch', 'job': 'Analyst'},
 {'name': 'Troy Ward', 'company': 'A.G. Edwards', 'job': 'Analyst'},
 {'name': 'Lori Hasiner', 'company': 'FBR', 'job': 'Analyst'},
 {'name': 'Tom Doheny', 'company': "Sandler O'Neill", 'job': 'Analyst'},
 {'name': 'Gerard Cassidy',
  'company': 'RBC Capital Markets',
  'job': 'Analyst'},
 {'name': 'Faye Elliott-Gurney',
  'company': 'Lehman Brothers',
  'job': 'Analyst'}]

Setup:

text = """\
================================================================================
Corporate Participants
================================================================================
   *  Kirk Walters
      Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC

================================================================================
Conference Call Participants
================================================================================
   *  Beth Messmore
      Merrill Lynch - Analyst
   *  Troy Ward
      A.G. Edwards - Analyst
   *  Lori Hasiner
      FBR - Analyst
   *  Tom Doheny
      Sandler O'Neill - Analyst
   *  Gerard Cassidy
      RBC Capital Markets - Analyst
   *  Faye Elliott-Gurney
      Lehman Brothers - Analyst

================================================================================"""
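
If the two groups need to stay separate, as in the question, the same line-by-line approach can also track which section each name falls under. This is only a minimal sketch, assuming every section title sits between two ruler lines made up solely of '=' characters. group_by_section is a hypothetical helper name, and the sample is a shortened copy of the text above.

```python
import io

# Shortened sample in the same layout as the text above (assumed format:
# a ruler of '=' characters, a section title, another ruler, then entries).
RULER = "=" * 80
sample = "\n".join([
    RULER,
    "Corporate Participants",
    RULER,
    "   *  Kirk Walters",
    "      Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC",
    "",
    RULER,
    "Conference Call Participants",
    RULER,
    "   *  Beth Messmore",
    "      Merrill Lynch - Analyst",
    "   *  Faye Elliott-Gurney",
    "      Lehman Brothers - Analyst",
    "",
    RULER,
])


def group_by_section(raw):
    """Return {section title: [participant names]} (hypothetical helper)."""
    sections = {}
    current = None
    after_ruler = False
    for line in io.StringIO(raw):
        line = line.strip()
        if line and set(line) == {"="}:
            # A ruler line: the next non-empty, non-entry line is a title.
            after_ruler = True
            continue
        if after_ruler and line and not line.startswith("*"):
            current = line
            sections[current] = []
        elif line.startswith("*") and current is not None:
            sections[current].append(line[1:].strip())
        after_ruler = False
    return sections


grouped = group_by_section(sample)
print(grouped)
```

Each key is a section title and each value is the list of names found under it, so the corporate and conference call participants can be read out separately.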

CodePudding user response：

The problem with your regex is that the group (.*) matches any character except line terminators, while the text you want to capture spans several lines. A character class such as [\S\s] matches every character, newlines included, so you could try this regex instead:

import re

your_string = "================================================================================\nCorporate Participants\n================================================================================\n   *  Kirk Walters\n      Chittenden Corporation - EVP and Chief Financial Officer and Treasurer and CTC\n\n================================================================================\nConference Call Participants\n================================================================================\n   *  Beth Messmore\n      Merrill Lynch - Analyst\n   *  Troy Ward\n      A.G. Edwards - Analyst\n   *  Lori Hasiner\n      FBR - Analyst\n   *  Tom Doheny\n      Sandler O\'Neill - Analyst\n   *  Gerard Cassidy\n      RBC Capital Markets - Analyst\n   *  Faye Elliott-Gurney\n      Lehman Brothers - Analyst\n\n================================================================================"
participants_corp = re.findall(r"Corporate Participants\n================================================================================\n   ([\S\s]*)\n\n================================================================================\nConference Call Participants", your_string)
print(participants_corp)
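
As an alternative sketch, the re.DOTALL flag makes '.' match newlines as well, so the original (.*) idea works once the flag is set. The sample below is a shortened stand-in for the full string, and =+ replaces the spelled-out ruler; both are assumptions made for brevity. It also shows that the backslashes were not the issue: "\n" and r"\n" in a pattern match the same newline.

```python
import re

# Shortened, assumed stand-in for the test string in the question.
RULER = "=" * 80
sample = (
    f"{RULER}\nCorporate Participants\n{RULER}\n"
    "   *  Kirk Walters\n      Chittenden Corporation - EVP\n\n"
    f"{RULER}\nConference Call Participants\n{RULER}\n"
    "   *  Beth Messmore\n      Merrill Lynch - Analyst\n\n"
    f"{RULER}"
)

# A literal newline ("\n") and the regex escape (r"\n") find the same matches.
plain_newline = re.findall("s\n=", sample)
raw_newline = re.findall(r"s\n=", sample)

# Lazy (.*?) stops at the first blank line that precedes the next section.
pattern = r"Corporate Participants\n=+\n\s*(.*?)\n\n=+\nConference Call Participants"

no_flag = re.findall(pattern, sample)                     # '.' cannot cross '\n'
with_flag = re.findall(pattern, sample, flags=re.DOTALL)  # '.' matches '\n' too

print(no_flag)
print(with_flag)
```

Without the flag the list is empty; with re.DOTALL the single captured group holds both lines of the corporate block.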

CodePudding user response：

>>> import re
>>> # given_text is the test string from the question
>>> p = re.compile(r"\*  (.*)\n", re.MULTILINE)
>>> p.findall(given_text)
['Kirk Walters', 'Beth Messmore', 'Troy Ward', 'Lori Hasiner', 'Tom Doheny', 'Gerard Cassidy', 'Faye Elliott-Gurney']

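Compile the pattern with re.compile and call findall() on the compiled object. Because '.' does not match a newline, each (.*) stops at the end of the name line. Note that re.MULTILINE only changes how ^ and $ behave, so this particular pattern returns the same list without the flag.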
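
For comparison, here is a small sketch of where re.MULTILINE does make a difference: once the pattern is anchored with ^ and $. The given_text below is a shortened, assumed stand-in for the test string, and the anchored pattern is just one possible variant.

```python
import re

# Shortened, assumed stand-in for the test string in the question.
given_text = (
    "Corporate Participants\n"
    "   *  Kirk Walters\n"
    "      Chittenden Corporation - EVP\n"
    "Conference Call Participants\n"
    "   *  Beth Messmore\n"
    "      Merrill Lynch - Analyst\n"
)

# Unanchored pattern: no flag needed.
unanchored = re.findall(r"\*  (.*)\n", given_text)

# Anchored pattern: ^ and $ only match at every line when MULTILINE is set.
anchored = r"^[ \t]*\*[ \t]+(.+)$"
without_flag = re.findall(anchored, given_text)
with_flag = re.findall(anchored, given_text, flags=re.MULTILINE)

print(unanchored)
print(without_flag)
print(with_flag)
```

The anchored pattern finds nothing without the flag, because ^ then only matches at the very start of the string, and finds both names with it.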