regular expression to remove only leading zeros in pyspark?-CodePudding

Guys I have the input string 00000000995510.32 and I need to remove the sign and the leading zeros, my output number should be: 995510.32.

Is there a regular expression to do this in regexp_replace?

My current code:

df.withColumn("vl_fat",regexp_replace(col("vl_fat"),"^([0-9]|[1-9][0-9])$ ", ""))

but that didn't work

CodePudding user response：

if you want to practise regex, try: https://regex101.com/. The pattern you describe is that it starts with one and then with a zero to many amount of 0, which in python regex would be [ ][0]*. You also need to consider the look ahead feature of regex that can get a little weird. This should work however:

(?![ ])(?![0]).*

CodePudding user response：

you can use this regex "\ 0 " to catch the leading 000...

explanation from regex101 :

\ matches the character literally (case sensitive)
0 matches the character 0 literally (case sensitive)
matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)

CodePudding user response：

My two cents: you can use regex_extract (that seems to suite better your use case) and convert the input string into a float:

from pyspark.sql import functions as F, types as T

df = spark.createDataFrame(
    [(' 00000000995510.32',),
     ('34.32',),
     (' 00000.34',),
     (' 0444444',),
     ('9.',)],
    T.StructType([
        T.StructField('input_string', T.StringType())
    ])
)

df.withColumn('parsed_float',
              F.regexp_extract('input_string', '^(\ 0 |)(\d (\.\d*|))$', 2).cast(T.FloatType()))

This is what you get:

 ------------------ ------------ 
|      input_string|parsed_float|
 ------------------ ------------ 
| 00000000995510.32|    995510.3|
|             34.32|       34.32|
|          00000.34|        0.34|
|           0444444|    444444.0|
|                9.|         9.0|
 ------------------ ------------

For the regex:

(\ 0 |): this captures the initial (optional) followed by one or more 0
(\d (\.\d*|)): this captures the whole figure, described as a sequence of numbers followed by a (optional) sequence composed of a . followed by any number of decimals

The second argument of regex_extract is the group you are interested into; in this case is the second one, i.e., (\d (\.\d*|)).

CodePudding user response：

Instead of regex you might like to use TRIM. I find this easier to read and it better conveys the intention of the code. Note this code will also remove any signs directly after your leading zeros.

import pyspark.sql.functions as F

df = spark.createDataFrame([(' 00000000995510.32',)], ['number'])
df.withColumn('trimmed', F.expr("TRIM(LEADING ' 0' FROM number)")).show()

 ------------------ --------- 
|            number|  trimmed|
 ------------------ --------- 
| 00000000995510.32|995510.32|
 ------------------ ---------

Or if you want an actual number, you could simply cast it to float (or decimal). Note any value which cannot be cast will become NULL.

df.withColumn('trimmed', F.col('number').cast('float')).show()

 ------------------ -------- 
|            number| trimmed|
 ------------------ -------- 
| 00000000995510.32|995510.3|
 ------------------ --------