I'm trying to do the following but for a column in pyspark but no luck. Any idea on isolating just the lowercase characters in column of a spark df?
''.join('x' if x.islower() else 'X' if x.isupper() else x for x in text)
CodePudding user response:
Using the following dataframe as an example
----------
| value|
----------
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
----------
You can use a pyspark.sql function called regexpr_replace to isolate the lowercase letters in the column with the following code
from pyspark.sql import functions
df = (df.withColumn("value",
functions.regexp_replace("value", r'[A-Z]|[0-9]|[,.;@#?!&$]', "")))
df.show()
-----
|value|
-----
| lgt|
| hzsh|
|zcfwu|
| qyhn|
| b|
| vxy|
| foi|
| |
|xywek|
| merw|
-----
CodePudding user response:
You can directly use regex_replace to substitute the lowercase values to any desired value -
In your case you will have to chain regex_replace to get the final output -
Data Preparation
inp_string = """
lRQWg2IZtB
hVzsJhPVH0
YXzc4fZDwu
qRyOUhT5Hn
b85O0H41RE
vOxPLFPWPy
fE6o5iMJ6I
918JI00EC7
x3yEYOCwek
m1eWY8rZwO
""".strip().split()
df = pd.DataFrame({
'value':inp_string
})
sparkDF = sql.createDataFrame(df)
sparkDF.show()
----------
| value|
----------
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
----------
Regex Replace
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value", r'[a-z]', "x"))
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value_modified", r'[A-Z]', "X"))
sparkDF.show()
---------- --------------
| value|value_modified|
---------- --------------
|lRQWg2IZtB| xXXXx2XXxX|
|hVzsJhPVH0| xXxxXxXXX0|
|YXzc4fZDwu| XXxx4xXXxx|
|qRyOUhT5Hn| xXxXXxX5Xx|
|b85O0H41RE| x85X0X41XX|
|vOxPLFPWPy| xXxXXXXXXx|
|fE6o5iMJ6I| xX6x5xXX6X|
|918JI00EC7| 918XX00XX7|
|x3yEYOCwek| x3xXXXXxxx|
|m1eWY8rZwO| x1xXX8xXxX|
---------- --------------
