I have a dataframe with one column like this:
| Locations |
|---|
| Germany:city_Berlin |
| France:town_Montpellier |
| Italy:village_Amalfi |
I would like to get rid of the substrings: 'city_', 'town_', 'village_', etc.
So the output should be:
| Locations |
|---|
| Germany:Berlin |
| France:Montpellier |
| Italy:Amalfi |
I can get rid of one of them this way:
F.regexp_replace('Locations', 'city_', '')
Is there a similar way to pass several substrings to remove from the original column?
Ideally I'm looking for a one-line solution that doesn't require separate functions or anything convoluted.
CodePudding user response:
I wouldn't map. It looks to me like you want to remove the lowercase prefix immediately to the right of : that ends with _. If so, use a regex. Code below:
from pyspark.sql.functions import regexp_replace
df.withColumn('new_Locations', regexp_replace('Locations', r'(?<=:)[a-z_]+', '')).show(truncate=False)
+---+-----------------------+------------------+
|id |Locations              |new_Locations     |
+---+-----------------------+------------------+
|1  |Germany:city_Berlin    |Germany:Berlin    |
|2  |France:town_Montpellier|France:Montpellier|
|4  |Italy:village_Amalfi   |Italy:Amalfi      |
+---+-----------------------+------------------+
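The pattern above can be checked quickly outside Spark. This is only a sketch using Python's built-in re module rather than Spark's Java regex engine; the two agree on this particular lookbehind and character class, but the helper name strip_prefix is made up for illustration.

```python
import re

# Sample values taken from the question.
locations = ["Germany:city_Berlin", "France:town_Montpellier", "Italy:village_Amalfi"]

# One or more lowercase letters or underscores directly after the colon.
PREFIX_AFTER_COLON = re.compile(r"(?<=:)[a-z_]+")

def strip_prefix(value: str) -> str:
    """Remove the lowercase prefix that follows the colon (hypothetical helper)."""
    return PREFIX_AFTER_COLON.sub("", value)

cleaned = [strip_prefix(v) for v in locations]
print(cleaned)
```

Note that [a-z_]+ stops at the first character that is neither a lowercase letter nor an underscore, so it relies on the place name starting with a capital letter. A value such as Italy:village_amalfi would lose everything after the colon.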
CodePudding user response:
F.regexp_replace('Locations', r'(?<=:).*_', '')
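.* matches any sequence of characters. Here it sits between (?<=:) and _, so the match must start right after a colon and end with an underscore.
_ is the literal character that must follow whatever .* matched. Because .* is greedy, the match extends to the last underscore after the colon.
(?<=:) is the syntax for a "positive lookbehind". It is not part of the match, but it ensures that a : appears immediately before the text matched by .*_.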
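If the set of prefixes is known in advance, an explicit alternation is another option and avoids the greedy behaviour of .*. The sketch below uses Python's re module to compare both patterns. The prefixes list and the sample value France:town_Saint_Tropez are made up for illustration and do not come from the question's data.

```python
import re

# Prefixes to strip; extend this list as needed (illustrative values).
prefixes = ["city_", "town_", "village_"]

# Equivalent to (?<=:)(?:city_|town_|village_): one listed prefix right after the colon.
explicit = re.compile(r"(?<=:)(?:" + "|".join(re.escape(p) for p in prefixes) + ")")

# The greedy pattern from the answer above, for comparison.
greedy = re.compile(r"(?<=:).*_")

# Made-up value with an underscore inside the place name.
sample = "France:town_Saint_Tropez"

explicit_result = explicit.sub("", sample)  # removes only "town_"
greedy_result = greedy.sub("", sample)      # removes everything up to the last "_"
print(explicit_result)
print(greedy_result)
```

In PySpark the same pattern string could be passed directly, for example F.regexp_replace('Locations', r'(?<=:)(?:city_|town_|village_)', ''), assuming Spark's Java regex engine handles it the same way, which holds for this lookbehind and alternation syntax.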
