Removing signs and repeating numbers-CodePudding

I want to remove all signs from my dataframe to leave it in either one of the two formats: 100-200 or 200

So the salaries should either have a single hyphen between them if a range of salaries if given, otherwise a clean single number.

I have the following data:

import pandas as pd
import re
df = {'salary':['£26,768 - £30,136/annum Attractive benefits package',
           '£26,000 - £28,000/annum plus bonus',
           '£21,000/annum',
           '£26,768 - £30,136/annum Attractive benefits package',
           '£33/hour', 
           '£18,500 - £20,500/annum Inc Bonus - Study Support   Bens',
           '£27,500 - £30,000/annum £27,500 to £30,000   Study',
           '£35,000 - £40,000/annum',
           '£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
           '£19,000 - £24,000/annum Study Support',
           '£30,000 - £35,000/annum', 
           '£44,000 - £66,000/annum   15% Bonus   Excellent Benefits. L',
           '£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df)

Here's what I have tried to remove some of the signs:

salary = []
for i in data.salary:
    space = re.sub(" ",'',i)
    lower = re.sub("[a-z]",'',space)
    upper = re.sub("[A-Z]",'',lower)
    bracket = re.sub("/",'',upper)
    comma = re.sub(",", '', bracket)
    plus = re.sub("\ ",'',comma)
    percentage = re.sub("\%",'', plus)
    dot = re.sub("\.",'', percentage)
    bracket1 = re.sub("\(",'',dot)
    bracket2 = re.sub("\)",'',bracket1)
    salary.append(bracket2)

Which gives me:

'£26768-£30136',
 '£26000-£28000',
 '£21000',
 '£26768-£30136',
 '£33',
 '£18500-£20500-',
 '£27500-£30000£27500£30000',
 '£35000-£40000',
 '£24000-£27000',
 '£19000-£24000',
 '£30000-£35000',
 '£44000-£6600015',
 '£75-£90£75-£90'

However, I have some repeating numbers, essentially I want anything after the first range of values removed, and any sign besides the hyphen between the two numbers.

Expected output:

 '26768-30136',
 '26000-28000',
 '21000',
 '26768-30136',
 '33',
 '18500-20500',
 '27500-30000',
 '35000-40000',
 '24000-27000',
 '19000-24000',
 '30000-35000',
 '44000-66000',
 '75-90

CodePudding user response：

Another way using pandas.Series.str.partition with replace:

data["salary"].str.partition("/")[0].str.replace("[£ ,]", "", regex=True)

Output:

0     26768-30136
1     26000-28000
2           21000
3     26768-30136
4              33
5     18500-20500
6     27500-30000
7     35000-40000
8     24000-27000
9     19000-24000
10    30000-35000
11    44000-66000
12          75-90
Name: 0, dtype: object

Explain:

It assumes that you are only interested in the parts upto /; it extracts everything until /, than removes unit, space and comma.

CodePudding user response：

You can do it in only two regex passes. First extract the monetary amounts with a regex, then remove the thousands separators, finally, join the output by group keeping only the first two occurrences per original row:

(data['salary'].str.extractall(r'£([,\d] )')[0]     # extract £123,456 digits
 .str.replace(r'\D', '', regex=True)                # remove separator
 .groupby(level=0).apply(lambda x: '-'.join(x[:2])) # join first two occurrences
)

output:

0     26768-30136
1     26000-28000
2           21000
3     26768-30136
4              33
5     18500-20500
6     27500-30000
7     35000-40000
8     24000-27000
9     19000-24000
10    30000-35000
11    44000-66000
12          75-90

CodePudding user response：

You can use

data['salary'].str.split('/', n=1).str[0].replace('[^\d-] ','', regex=True)
# 0     26768-30136
# 1     26000-28000
# 2           21000
# 3     26768-30136
# 4              33
# 5     18500-20500
# 6     27500-30000
# 7     35000-40000
# 8     24000-27000
# 9     19000-24000
# 10    30000-35000
# 11    44000-66000
# 12          75-90

Here,

.str.split('/', n=1) - splits into two parts with the first / char
.str[0] - gets the first item
.replace('[^\d-] ','', regex=True) - removes all chars other than digits and hyphens.