Removing all characters on right after a specific character identified in string within list

Time:01-31

I used Beautiful Soup's findAll function in Python to retrieve the following elements from a website as a list:

['1.97M\u202c 1601.31%', '10.43M\u202c 429.12%', '25.28M\u202c 142.42%', '47.90M\u202c 89.45%', '64.90M\u202c 35.50%', '48.30M\u202c−25.58%', '54.70M\u202c 13.25%', '54.70M\u202c0.00%', '3.39M\u202c 28533.18%', '18.60M\u202c 448.06%', '46.71M\u202c 151.06%', '87.79M\u202c 87.96%', '113.80M\u202c 29.63%', '80.50M\u202c−29.26%', '90.90M\u202c 12.92%', '90.90M\u202c0.00%', '−1.42M\u202c', '−8.17M\u202c', '−21.42M\u202c', '−39.89M\u202c', '−48.90M\u202c', '−32.20M\u202c', '−36.20M\u202c', '−36.20M\u202c', '−7.40M\u202c', '−22.37M\u202c', '−31.29M\u202c', '−75.66M\u202c', '−65.90M\u202c', '−49.60M\u202c', '−43.10M\u202c', '−43.10M\u202c']

I want to remove \u202c and every character after it from each element in the list above. I tried the replace, regex, and split functions without success, as I'm relatively new to Python. The output should remain a list. Is there a way to do this?

Desired output:

['1.97M', '10.43M', '25.28M', '47.90M', '64.90M', '48.30M', '54.70M', '54.70M', '3.39M', '18.60M', '46.71M', '87.79M', '113.80M', '80.50M', '90.90M', '90.90M', '−1.42M', '−8.17M', '−21.42M', '−39.89M', '−48.90M', '−32.20M', '−36.20M', '−36.20M', '−7.40M', '−22.37M', '−31.29M', '−75.66M', '−65.90M', '−49.60M', '−43.10M', '−43.10M']

CodePudding user response:

This works with split() and a list comprehension:

your_list = ['1.97M\u202c 1601.31%', '10.43M\u202c 429.12%', '25.28M\u202c 142.42%', '47.90M\u202c 89.45%', '64.90M\u202c 35.50%', '48.30M\u202c−25.58%', '54.70M\u202c 13.25%', '54.70M\u202c0.00%']

sep = '\u202c'
# Keep only the part before the first separator in each string
result = [x.split(sep)[0] for x in your_list]

Output:

['1.97M', '10.43M', '25.28M', '47.90M', '64.90M', '48.30M', '54.70M', '54.70M']
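
As a variation on the same idea, here is a minimal sketch using str.partition(), which splits at the first occurrence only and returns the whole string as its first element when the separator is absent. The helper name strip_after and the shortened sample list are illustrative, not part of the original answer:

```python
def strip_after(items, sep='\u202c'):
    """Return a new list with sep and everything after it removed from each string."""
    # partition() returns (before, sep, after); if sep is missing,
    # 'before' is the whole string, so such items pass through unchanged.
    return [item.partition(sep)[0] for item in items]

# A few values taken from the question, including one with nothing after \u202c
sample = ['1.97M\u202c 1601.31%', '48.30M\u202c−25.58%', '54.70M\u202c0.00%', '−1.42M\u202c']
cleaned = strip_after(sample)
print(cleaned)  # ['1.97M', '48.30M', '54.70M', '−1.42M']
```

The result is still a list, as the question asks.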

CodePudding user response:

I would use re.sub() to replace \u202c and everything after it with an empty string. Strings that do not contain \u202c are returned unchanged:

import re

items = ['1.97M\u202c 1601.31%', '10.43M', '25.28M\u202c 142.42%']

rx = re.compile(r'\u202c.*')              # match \u202c and everything after
result = [rx.sub('', s) for s in items]   # replace it with nothing
print(result)

# ['1.97M', '10.43M', '25.28M']
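
One caveat with this pattern: by default `.` does not match a newline, so text after a line break would survive the substitution. The sketch below assumes a scraped string that contains a newline (a hypothetical case, not one from the question's data) and compares the default behaviour with re.DOTALL:

```python
import re

rx_default = re.compile(r'\u202c.*')            # '.' stops at a newline
rx_dotall = re.compile(r'\u202c.*', re.DOTALL)  # '.' also matches newlines

# Hypothetical scraped value with a line break after the percentage
s = '1.97M\u202c 1601.31%\nextra'

default_result = rx_default.sub('', s)
dotall_result = rx_dotall.sub('', s)

print(repr(default_result))  # '1.97M\nextra'
print(repr(dotall_result))   # '1.97M'
```

For the single-line values shown in the question, both patterns give the same result.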

CodePudding user response:

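import re

# Compile once instead of on every call.
# Non-greedy group: capture everything before the first \u202c.
PATTERN = re.compile("(.*?)\\u202c")

def extract(input_str):
    result = PATTERN.search(input_str)
    if result is not None:
        return result.group(1)
    return input_str  # no \u202c found: return the string unchanged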

def transform(input_list):
    return list(map(extract, input_list))

Calling transform with the list of strings gives you the desired output.
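
For illustration, a usage sketch follows. It restates the two functions in compact form so that it runs on its own, and the sample list is a shortened, assumed subset of the question's data plus one value without \u202c:

```python
import re

def extract(input_str):
    # Capture what precedes the first \u202c; fall back to the input if absent.
    match = re.search("(.*?)\\u202c", input_str)
    return match.group(1) if match else input_str

def transform(input_list):
    # Apply extract to every element and return a new list.
    return list(map(extract, input_list))

sample = ['1.97M\u202c 1601.31%', '54.70M\u202c0.00%', '−1.42M\u202c', '10.43M']
output = transform(sample)
print(output)  # ['1.97M', '54.70M', '−1.42M', '10.43M']
```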
