I have a table, approx. 1K rows and two columns. The first row is the emp_id and the second is tel_num. The tel_num column is not formatted all the same...some examples are (555) 555-9876, 18763334455, 505-999-888x222, some have no values...and so on. The goal is to format them all the same 10 digits without the leading 1s or any extensions.
The table looks like the following
| emp_id | tel_num |
|---|---|
| Jon Doe | 18763334455 |
| Cal Foe | 505-999-8888x222 |
| Ho Moe | nan |
| GI joe | 676.909.4321 |
trying to make this...
| Column A | tel_format |
|---|---|
| Jon Doe | (876) 333-4455 |
| Cal Foe | (505) 999-8888 |
| Ho Moe | nan |
| GI Joe | (679) 909-4321 |
I tried this format... I'm using databricks. The current process i tried is somewhat like this...
def formatphone(ph_var):
...some process
return formatted_ph
df = df.withColumn('tel_format', formatphone(df.tel_num))
I can't get it to work.
CodePudding user response:
You can use the following function, assuming that all possible formats are shown in your sample data.
To use this function in withColumn(), you need to create a UDF from it.
@F.udf(returnType=F.StringType())
def format_telephone_number(phone_number):
if phone_number is None:
return None
if phone_number=='nan':
return None
if phone_number[0]== ' ':
return '(' phone_number[2:5] ') ' phone_number[5:8] '-' phone_number[8:12]
if '-' in phone_number:
return '(' phone_number[0:3] ') ' phone_number[4:7] '-' phone_number[8:12]
if '.' in phone_number:
return '(' phone_number[0:3] ') ' phone_number[4:7] '-' phone_number[8:12]
else:
return None
