I have a 300-page document, and I want to remove all the notes I wrote in it, which are enclosed within "[(" and ")]". Since I sometimes nest notes, as in "[(blah [(blah [(blah)])] )]", I need to make sure I don't remove just "[(blah [(blah [(blah)]" and leave the trailing ")] )]" behind.
I'm not sure what the most efficient way to do that is, and this is a large job. What occurs to me is that I could check that there aren't two consecutive "[(" with a ".*" between them, and remove only the simple cases of "[(...)]". I hope there is a better way than this, though.
I think the two regex codes I use would be something like "/(?<=[[(])[\s\S]*(?=[(])/gi" and "/(?![\s\S][[(][\s\S][[(]).*/gi". Something like that? I'm sorry, I'm still trying to figure out these things.
Also, can I write a python program to open an OpenOffice (odt) file and edit it? The "open(r'C:\Users\Blah\Documents\Blah.odt', 'rw').read()" will work for that too, right?
CodePudding user response:
Try it this way:
import re
a = '[(blah [(blah [(blah)])] )]'
# Lazily match from a '[' to the nearest ']' (this does not account for nesting)
x = re.compile(r'([\[])(.*?)([\]])')
remove_text = re.sub(x, r'', a)
CodePudding user response:
One way is to repeatedly remove (replace with empty strings) the innermost clauses, meaning those that do not themselves contain a clause, until a pass makes no replacements. If the maximum number of nesting levels is n, this takes n + 1 iterations: n that remove something, plus a final one that finds nothing left to remove. The regular expression to match is as follows:
\[\((?:(?!\[\().)*?\)\]
Consider the string:
begin [(Mary [(had [(a )]lil' [(lamb [(whose [(fleece )])])])])]was [(white [( as )])]snow
      1      2     3   3      3      4       5        5 4 3 2 1     1       2     2 1
As shown, this has five nesting levels. After the first replacement we obtain:
begin [(Mary [(had lil' [(lamb [(whose )])])])]was [(white )]snow
      1      2          3      4       4 3 2 1     1       1
After the second replacement:
begin [(Mary [(had lil' [(lamb )])])]was snow
      1      2          3      3 2 1
After the third replacement:
begin [(Mary [(had lil' )])]was snow
      1      2          2 1
After the fourth replacement:
begin [(Mary )]was snow
      1      1
After the fifth replacement:
begin was snow
After the next attempted replacement:
begin was snow
As no replacements were made at the last step we are finished.
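In Python, the repeated replacement walked through above could be sketched roughly as follows. The names NOTE and strip_notes are only illustrative and do not come from the original post, and re.DOTALL is an added assumption so that a note spanning several lines is still matched. re.subn returns the new string together with the number of replacements made, which gives the stopping condition.

```python
import re

# Innermost note: '[(' ... ')]' with no further '[(' inside it.
# re.DOTALL is an assumption added here so '.' also matches newlines.
NOTE = re.compile(r"\[\((?:(?!\[\().)*?\)\]", re.DOTALL)


def strip_notes(text):
    """Remove innermost notes repeatedly until a pass changes nothing.

    Returns the cleaned text and the number of passes that were run.
    """
    passes = 0
    while True:
        text, count = NOTE.subn("", text)
        passes += 1
        if count == 0:
            return text, passes


example = ("begin [(Mary [(had [(a )]lil' [(lamb [(whose [(fleece )])])])])]"
           "was [(white [( as )])]snow")
cleaned, passes = strip_notes(example)
print(cleaned)  # begin was snow
print(passes)   # 6 (five nesting levels plus one pass that finds nothing)
```

An unmatched "[(" or ")]" is simply left in place by this loop, which may be a useful way to spot notes that were not closed properly.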
The regular expression can be broken down as follows.
\[\(          # match '[('
(?:           # begin non-capture group
  (?!\[\()    # negative lookahead asserts that the next two chars are not '[('
  .           # match any char
)*?           # end non-capture group and execute zero or more times lazily
\)\]          # match ')]'
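If this ends up in a Python script, the same annotated layout can be kept inside the pattern itself with the re.VERBOSE flag, which ignores whitespace and # comments in the pattern. A small sketch (the name NOTE_VERBOSE is only illustrative):

```python
import re

# Same expression as above, written in free-spacing mode.
NOTE_VERBOSE = re.compile(r"""
    \[\(          # match '[('
    (?:           # begin non-capture group
      (?!\[\()    # negative lookahead: the next two chars are not '[('
      .           # match any char
    )*?           # end group; repeat zero or more times, lazily
    \)\]          # match ')]'
""", re.VERBOSE)

# Only the innermost note matches on a single pass.
print(NOTE_VERBOSE.findall("[(outer [(inner)] text)]"))  # ['[(inner)]']
```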
The regular expression employs a technique called the tempered greedy token solution.
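As for the .odt part of the question: a plain open(...).read() would not give you the document text, and 'rw' is not a valid mode string in Python 3 in any case. An .odt file is a ZIP archive, and the document body is stored in a member named content.xml. A minimal sketch for reading it with the standard zipfile module, assuming the XML is UTF-8 encoded (the function name read_odt_content is hypothetical):

```python
import zipfile


def read_odt_content(path):
    """Return the raw content.xml of an .odt file as a string.

    Assumes the file is a regular ODF package with UTF-8 encoded XML.
    """
    with zipfile.ZipFile(path) as odt:
        return odt.read("content.xml").decode("utf-8")
```

Because that string is XML, a note that crosses paragraph or formatting boundaries will have tags in the middle of it, so running the regex directly on it may damage the markup. Exporting the document to plain text first, or using an ODF-aware library, is probably the safer route.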
