Python regex - replace group with character

Time:01-09

I'm trying to find the best approach to replace a specific pattern with a single character in Python.

For example if I have the text "prop1": "val1","prop2": "val2" "abcdefg": "hijklmn" "1234": "5678"

But I want the string: "prop1": "val1","prop2": "val2","abcdefg": "hijklmn","1234": "5678"

Testing on regex101, I found that this pattern seems to capture the space between the sets of quotes correctly:

'"\S*"(\s{1})"\S*"'


But when I use this in Python, re.sub does not seem to replace just the group; it replaces the entire match, or behaves in some other way I don't expect.

Code:

import re

testStr = '"prop1": "val1","prop2": "val2" "abcdefg": "hijklmn" "1234": "5678"'
testMatch = re.search(r'"\S*"(\s{1})"\S*"', testStr)
print(f'Full match: {testMatch.group(0)}')
testGroupMatch = testMatch.group(1)
print(f'Group match: {testGroupMatch}')

print(f'Test string before replace: {testStr}')
# re.sub replaces the whole match, not only the captured group
testStrReplaced = re.sub(r'"\S*"(\s{1})"\S*"', ',', testStr)
print(f'Test string after replace: {testStrReplaced}')

Output:

Full match: "val2" "abcdefg"
Group match:  
Test string before replace: "prop1": "val1","prop2": "val2" "abcdefg": "hijklmn" "1234": "5678"
Test string after replace: "prop1": "val1","prop2": ,: ,: "5678"

Does anyone know if this is the right approach for this kind of scenario? If so does the regex expression look correct to target the pattern I'm trying to replace?

Does anyone know how I would replace the matched group? Most of the examples I've found mention backreferencing the groups, however, this seems to be if I want to replace something with a group I've already matched. In this case I simply want to replace the matched group, which from my test output is just the space, with a single character such as a comma.

Thanks!

CodePudding user response:

So, what you want is to find key and value (in form of "...": "..."), and add a comma after it if there is no comma (except for the last key-and-value group).

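You could replace every match of (".*?"\s*:\s*".*?")\s*,?(?!$) with \1, (that is, group 1 followed by a comma).

The idea is to find each "key": "value" pair, together with any whitespace and optional comma that follows it, and replace it with the same "key": "value" pair followed by a comma.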

Demo: https://regex101.com/r/OX6HH0/1

(".*?"\s*:\s*".*?")\s*,?(?!$)
(                                start of group 1
 "                               double quote
  .*?                            reluctant match of any number of any char
                                   (i.e. match as few chars as
                                   possible)
     "                           double quote
      \s*                        any amount of whitespace
         :                       colon
          \s*                    any amount of whitespace
             ".*?"               similar to key part: double quote, followed
                                   by reluctant match of any char, followed
                                   by double quote
                  )              end of group 1
                   \s*,?         followed by any space, with optional comma
                        (?!$)    negative lookahead: not followed by end of
                                   line (i.e. do not match if it is the last
                                   key-and-value)

and replace the above match with group 1, followed by a comma.
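
For reference, here is a minimal sketch of how that replacement might look with re.sub in Python. The variable names are only illustrative, and the sample text is the string from the question:

```python
import re

# Sample text from the question; variable names are illustrative only.
text = '"prop1": "val1","prop2": "val2" "abcdefg": "hijklmn" "1234": "5678"'

# Group 1 captures one "key": "value" pair; the trailing whitespace and
# optional comma are matched but not captured, so they get dropped.
pattern = r'(".*?"\s*:\s*".*?")\s*,?(?!$)'

# \1 puts the captured pair back, followed by a single comma.
result = re.sub(pattern, r'\1,', text)
print(result)
# "prop1": "val1","prop2": "val2","abcdefg": "hijklmn","1234": "5678"
```

The last pair is left alone because the negative lookahead rejects a match that ends at the end of the string.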

CodePudding user response:

Regex is meant to find specific texts, and when you capture a part of a match, you usually want to get (or keep when replacing) this part.
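
To illustrate that point: re.sub always replaces the whole match, so the parts you want to keep either have to be captured and put back, or must not be consumed at all. The following is only a sketch on the sample string from the question, with illustrative variable names; it shows both options and is not robust against values containing whitespace or escaped quotes:

```python
import re

text = '"prop1": "val1","prop2": "val2" "abcdefg": "hijklmn" "1234": "5678"'

# Option 1: capture both quoted neighbours and put them back around the comma.
with_groups = re.sub(r'("\S*")\s("\S*")', r'\1,\2', text)

# Option 2: lookarounds only assert the surrounding quotes without consuming
# them, so the single whitespace character is the entire match.
with_lookarounds = re.sub(r'(?<=")\s(?=")', ',', text)

print(with_groups)
print(with_lookarounds)
```

On this input both calls give the string the question asks for.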

Your approach is not going to work in many cases, and I would suggest matching all occurrences of "...":"..." and then simply joining them with a comma.

See the Python demo:

import re
text = r'"prop1": "val1","prop2": "va\" l2" "abcdefg": "h ij kl mn""1234": "5678"'
rx = r'"[^"\\]*(?:\\.[^"\\]*)*"\s*:\s*"[^"\\]*(?:\\.[^"\\]*)*"'
print( ', '.join(re.findall(rx, text, re.S)) )
# => "prop1": "val1", "prop2": "va\" l2", "abcdefg": "h ij kl mn", "1234": "5678"

The regex is

"[^"\\]*(?:\\.[^"\\]*)*"\s*:\s*"[^"\\]*(?:\\.[^"\\]*)*"

See the regex demo. Details:

  • "[^"\\]*(?:\\.[^"\\]*)*" - a ", zero or more chars other than " and \ and then zero or more occurrences of any escaped char and then zero or more chars other than " and \ and then a " char (a string between two double quotation marks that can contain any escape sequences inside)
  • \s*:\s* - a colon enclosed with any zero or more whitespaces
  • "[^"\\]*(?:\\.[^"\\]*)*" - see above.
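
Since the quoted-string part appears twice, the pattern can also be assembled from a single sub-pattern. This is just a sketch: the names QUOTED, PAIR and join_pairs are made up for illustration, and the separator defaults to a plain comma to match the output requested in the question:

```python
import re

# Sub-pattern for one double-quoted string that may contain escape sequences.
QUOTED = r'"[^"\\]*(?:\\.[^"\\]*)*"'

# A key, a colon surrounded by optional whitespace, and a value.
PAIR = re.compile(rf'{QUOTED}\s*:\s*{QUOTED}', re.S)

def join_pairs(text, sep=','):
    """Find every "key": "value" pair and join the pairs with sep."""
    # The pattern has no capturing groups, so findall returns full matches.
    return sep.join(PAIR.findall(text))

sample = r'"a": "x\"y" "b": "z"'
print(join_pairs(sample))
# "a": "x\"y","b": "z"
```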