I have a dataframe with a column that contains addresses. I would like to split each address so that its last word goes into a column Ending and everything before the last word goes into a separate column Beginning. The addresses vary in length, e.g.:
- Main Street
- Jon Smith Close
- The Rovers Avenue
After searching different resources I came up with the following:
new_address_df['begining'], new_address_df['ending'] = new_address_df['street'].str.split().str[:-1].apply(lambda x: ' '.join(map(str, x))), new_address_df['street'].str.split().str[-1]
The code works, but I am not sure whether it is the right way to write this in Python. Another option would have been to convert the column to a list, modify the data in list form, and then convert it back to a dataframe, but I guess that might not be the best approach.
Is there a way to improve the above code if it is not Pythonic?
CodePudding user response:
There are certainly a lot of ways of doing this :) I would go for the str accessor and rpartition. rpartition splits your string at the last occurrence of the separator into 3 components: the part before the separator, the separator itself, and the part after the separator. If you just take the first and the last part (columns 0 and 2) you should be done.
df[["begining", "ending"]]=df.street.str.rpartition(" ")[[0,2]]
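As a minimal sketch of what rpartition does, the snippet below applies it first to a plain Python string and then to a small DataFrame. The DataFrame is sample data built from the street names in the question, not the asker's real data, and the intermediate name split is only there for illustration.

```python
import pandas as pd

# Plain Python: split once, at the LAST space.
# The result is a 3-tuple: (text before, separator, text after).
parts = "The Rovers Avenue".rpartition(" ")
print(parts)

# Sample data taken from the examples in the question.
df = pd.DataFrame({"street": ["Main Street", "Jon Smith Close", "The Rovers Avenue"]})

# Series.str.rpartition returns a DataFrame with columns 0, 1 and 2,
# matching the three elements of the tuple above.
split = df["street"].str.rpartition(" ")
df["begining"] = split[0]  # everything before the last space
df["ending"] = split[2]    # the last word
print(df)
```

Assigning the two columns one at a time, as here, avoids relying on how a multi-column assignment lines up the column labels 0 and 2 with the new names.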
CodePudding user response:
You could use a regular expression for this, as follows:
import pandas as pd
df = pd.DataFrame({"street":["Main Street","Jon Smith Close","The Rovers Avenue"]})
df2 = df.street.str.extract(r"(?P<Beginning>.+)\s(?P<Ending>\S+)")
df = pd.concat([df,df2],axis=1)
print(df)
output
              street   Beginning  Ending
0        Main Street        Main  Street
1    Jon Smith Close   Jon Smith   Close
2  The Rovers Avenue  The Rovers  Avenue
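Explanation: I used named capturing groups, which makes str.extract return a pandas.DataFrame with columns of those names; I then concat it with the original df using axis=1. In the pattern, the two groups are separated by a single whitespace character (\s). Any character is allowed in group Beginning, while only non-whitespace (\S) characters are allowed in group Ending. Because the Beginning group is greedy, it takes as much as it can, so the split falls at the last whitespace in the string.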
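To check that behaviour in isolation, here is a small sketch that runs the same pattern through Python's re module on one string and then through str.extract on a short Series. "Broadway" is a made-up single-word street name, added only to show what happens when there is no whitespace to split on.

```python
import re
import pandas as pd

pattern = r"(?P<Beginning>.+)\s(?P<Ending>\S+)"

# The greedy first group backtracks only as far as the last whitespace,
# so Ending receives just the final word.
m = re.match(pattern, "The Rovers Avenue")
print(m.group("Beginning"), "|", m.group("Ending"))

# "Broadway" is a hypothetical single-word address: the pattern needs
# at least one whitespace character, so it does not match at all.
s = pd.Series(["Jon Smith Close", "Broadway"])
out = s.str.extract(pattern)
print(out)
```

For the row that does not match, str.extract fills both columns with NaN, so single-word addresses would need separate handling if they occur in the data.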
