I have a very large text file that contains duplicate text patterns. In the sample below, the pattern "Path": "/home/downloads/file" appears 3 times. I want to append a running count to the end of each Path pattern according to its position. When the code finds the first Path pattern, it should append 1, giving "Path": "/home/downloads/file/1"; for the second it should append 2, giving "Path": "/home/downloads/file/2", and so on. My current code counts the patterns but doesn't append the count correctly. Below are my code, its current output, and the desired output, using a small chunk of the text.
from io import StringIO
import re
file = StringIO("""{
"title": "Pilot",
"image": [
{
"Path": "/home/downloads/file"
"Path": "/home/downloads/file",
"Path": "/home/downloads/file"
}
],
"content": "<p>The wing man ...</p>"
}""")
text = file.read()
patterns = r'"Path": "(.*?)"'
count = 0
for match in re.finditer(patterns, text):
    count += 1
    replace = '"Path": "\\1/' + str(count) + '"'
    text = re.sub(patterns, replace, text)
print(text)
Current output of the code is:
{
"title": "Pilot",
"image": [
{
"Path": "/home/downloads/file/1/2/3"
"Path": "/home/downloads/file/1/2/3",
"Path": "/home/downloads/file/1/2/3"
}
],
"content": "<p>The wing man ...</p>"
}
Desired output is:
{
"title": "Pilot",
"image": [
{
"Path": "/home/downloads/file/1"
"Path": "/home/downloads/file/2",
"Path": "/home/downloads/file/3"
}
],
"content": "<p>The wing man ...</p>"
}
CodePudding user response:
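Each call to re.sub replaces every match, so every path picks up every counter. You have to limit the number of replacements that re.sub makes on each pass with count=1, and target the exact text of the current match so that paths which are already numbered are skipped (this assumes a numbered path never coincides with the original text of another entry):
for cnt, match in enumerate(re.finditer(patterns, text), 1):
    # Build the numbered entry from the path captured by this match
    replace = '"Path": "' + match.group(1) + '/' + str(cnt) + '"'
    # Replace only the first entry that is still unnumbered
    text = re.sub(re.escape(match.group(0)), lambda m: replace, text, count=1)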
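For reference, here is a minimal sketch of what the count argument does, using a made-up string rather than the question's data. The last check applies the question's pattern to an already numbered path, which is worth keeping in mind when replacing one match at a time. The variable names are only for illustration.

```python
import re

sample = "x x x"  # made-up input, not from the question

# Without count, every match is replaced in a single call.
all_replaced = re.sub("x", "y", sample)
print(all_replaced)    # y y y

# With count=1, only the leftmost match is replaced.
first_replaced = re.sub("x", "y", sample, count=1)
print(first_replaced)  # y x x

# A path that already carries a number still matches the question's pattern.
pattern = r'"Path": "(.*?)"'
numbered = '"Path": "/home/downloads/file/1"'
still_matches = re.search(pattern, numbered) is not None
print(still_matches)   # True
```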
CodePudding user response:
You can use re.sub with a function as the replacement; the function is called once for each non-overlapping occurrence of the pattern, as follows.
Code
text = file.read()
patterns = r'"Path": "(.*?)"'
def repl(m):
    global count
    count += 1  # increment the counter on each detection of the pattern
    return m.group(0).replace('file', f'file/{count}')  # Desired substitution
count = 0
text = re.sub(patterns, repl, text) # applies function repl to each detection of pattern
print(text)
Output
{
"title": "Pilot",
"image": [
{
"Path": "/home/downloads/file/1"
"Path": "/home/downloads/file/2",
"Path": "/home/downloads/file/3"
}
],
"content": "<p>The wing man ...</p>"
}
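If you'd rather avoid the global, one possible variant is to keep the counter in a closure with itertools.count and rebuild the match from the captured group instead of replacing the literal 'file'. This is only a sketch: the names make_numberer, number_paths and number_paths_in_lines are made up here, and the line-by-line version assumes each Path entry sits on a single line, as in the sample.

```python
import re
from itertools import count

# Same pattern as in the question: captures the path between the quotes.
PATH_PATTERN = re.compile(r'"Path": "(.*?)"')


def make_numberer(start=1):
    """Return a re.sub callback that appends a running number to each path."""
    counter = count(start)  # shared by every call to the callback

    def repl(m):
        # Rebuild the whole match from the captured path plus the next number.
        return f'"Path": "{m.group(1)}/{next(counter)}"'

    return repl


def number_paths(text):
    """Number every Path entry in a string that fits in memory."""
    return PATH_PATTERN.sub(make_numberer(), text)


def number_paths_in_lines(lines):
    """Number Path entries lazily, one line at a time (hypothetical helper)."""
    repl = make_numberer()  # one counter for the whole stream
    for line in lines:
        yield PATH_PATTERN.sub(repl, line)


sample = (
    '"Path": "/home/downloads/file"\n'
    '"Path": "/home/downloads/file",\n'
    '"Path": "/home/downloads/file"\n'
)
print(number_paths(sample))
```

For a very large file, you could pass the open file object to number_paths_in_lines and write out each line as it is yielded, so the whole text never has to be in memory at once.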
