Regex finding all substrings between markers keeps extra characters

Time:01-26

I'm really confused because I don't think those are special characters. Either way, I tried escaping them with a backslash. I have a big text file that's basically HTML code, and I want to extract the text between some tags. I cropped a piece below:

b282yb keod5gw0 nxhoafnm aigsh9s9 d3f4x2em iv3no6db jq4qci2q a3bd9o3v lrazzd5p
 bwm1u5wc" dir="auto"><span >Text #1</span></a></div><div ></span></span></div>
</div><div  dir="auto">Text #2</span></a></div>
<div ><span aahdfvyu">', f)

but it comes back with

['<span >Text #1', '</span></div></div><div dir="auto">Text #2']

so it doesn't remove everything before the string. Why?

CodePudding user response:

import re

text = """b282yb keod5gw0 nxhoafnm aigsh9s9 d3f4x2em iv3no6db jq4qci2q a3bd9o3v
lrazzd5pbwm1u5wc" dir="auto"><span >Text #1</span></a></div><div ></span></span></div>
</div><div  dir="auto">Text #2</span></a></div><div ><span """

# [^<]+ captures one or more characters that are not '<', so no tag can end up in the group
re.findall(r'>([^<]+)</span></a></div><div >', text)

result

['Text #1', 'Text #2']
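
As for the "why": the pattern from the question is cut off in the post, so the following is only a guess at the cause, using a hypothetical pattern and a shortened copy of the sample text. A lazy group such as (.*?) still begins at the leftmost place where the pattern can start matching, so if the pattern is anchored on an earlier marker (here assumed to be dir="auto">), everything between that marker and the text is captured too. A negated class like [^<]+ cannot cross a '<', which forces the group to begin right after the last '>' before the text.

```python
import re

# Shortened sample modelled on the snippet in the question.
html = ('x" dir="auto"><span >Text #1</span></a></div><div ></span></span></div>\n'
        '</div><div  dir="auto">Text #2</span></a></div><div ><span ')

# Hypothetical pattern (the asker's real one is not shown in full):
# a lazy group anchored on the dir="auto" attribute.
lazy = re.findall(r'dir="auto">(.*?)</span></a></div>', html)

# Negated character class: the group may not contain '<',
# so it can only start directly after the tag that precedes the text.
strict = re.findall(r'>([^<]+)</span></a></div>', html)

print(lazy)    # ['<span >Text #1', 'Text #2']
print(strict)  # ['Text #1', 'Text #2']
```

The lazy version keeps the leading <span > because laziness only limits how far the group extends to the right; it does not move the start of the match.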

