Guys I have the input string 00000000995510.32 and I need to remove the sign and the leading zeros, my output number should be: 995510.32.
Is there a regular expression to do this in regexp_replace?
My current code:
df.withColumn("vl_fat",regexp_replace(col("vl_fat"),"^([0-9]|[1-9][0-9])$ ", ""))
but that didn't work
CodePudding user response:
if you want to practise regex, try: https://regex101.com/. The pattern you describe is that it starts with one and then with a zero to many amount of 0, which in python regex would be [ ][0]*. You also need to consider the look ahead feature of regex that can get a little weird. This should work however:
(?![ ])(?![0]).*
CodePudding user response:
you can use this regex "\ 0 " to catch the leading 000...
explanation from regex101 :
\matches the characterliterally (case sensitive)0matches the character0literally (case sensitive)matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
CodePudding user response:
My two cents: you can use regex_extract (that seems to suite better your use case) and convert the input string into a float:
from pyspark.sql import functions as F, types as T
df = spark.createDataFrame(
[(' 00000000995510.32',),
('34.32',),
(' 00000.34',),
(' 0444444',),
('9.',)],
T.StructType([
T.StructField('input_string', T.StringType())
])
)
df.withColumn('parsed_float',
F.regexp_extract('input_string', '^(\ 0 |)(\d (\.\d*|))$', 2).cast(T.FloatType()))
This is what you get:
------------------ ------------
| input_string|parsed_float|
------------------ ------------
| 00000000995510.32| 995510.3|
| 34.32| 34.32|
| 00000.34| 0.34|
| 0444444| 444444.0|
| 9.| 9.0|
------------------ ------------
For the regex:
(\ 0 |): this captures the initial (optional)followed by one or more0(\d (\.\d*|)): this captures the whole figure, described as a sequence of numbers followed by a (optional) sequence composed of a.followed by any number of decimals
The second argument of regex_extract is the group you are interested into; in this case is the second one, i.e., (\d (\.\d*|)).
CodePudding user response:
Instead of regex you might like to use TRIM. I find this easier to read and it better conveys the intention of the code. Note this code will also remove any signs directly after your leading zeros.
import pyspark.sql.functions as F
df = spark.createDataFrame([(' 00000000995510.32',)], ['number'])
df.withColumn('trimmed', F.expr("TRIM(LEADING ' 0' FROM number)")).show()
------------------ ---------
| number| trimmed|
------------------ ---------
| 00000000995510.32|995510.32|
------------------ ---------
Or if you want an actual number, you could simply cast it to float (or decimal). Note any value which cannot be cast will become NULL.
df.withColumn('trimmed', F.col('number').cast('float')).show()
------------------ --------
| number| trimmed|
------------------ --------
| 00000000995510.32|995510.3|
------------------ --------
