I have a column in python Name that reads in football team names and a number such as New Orleans Saints 2
I parse the data into 2 separate & additional columns as Name2: New Orleans Saints & Name3: 2
I do this by using this line of code: NameSeparate=dt['Name'].str.split(r'\s(?:([ -]?\d{,100}\.?\d{1,100}?))', expand=True)
Then dt[Name2] = NameSeparate[0] & dt[Name3] = NameSeparate[1] separates and creates the 2 new columns
I have an issue when the code pulls in San Francisco 49ers 5 as the code parses the 49 instead of the 5
I've tried writing a check loop in but can't get the script to pass through the 49 and just pull the 5
The name will always come in as San Francisco 49ers however the 2nd piece can have either a or - sign in addition to any number following, for example it may come in as: San Francisco 49ers 5, San Francisco 49ers -5, San Francisco 49ers 8.5, San Francisco 49ers -8.5, etc.
Can someone please help me write a check for this issue? Thank you!
CodePudding user response:
In case there is always a or a - in front of the number which should be in the second column, you should remove the ? after [ -] in your regex for splitting the original column:
NameSeparate=dt['Name'].str.split(r'\s(?:([ -]\d{,100}\.?\d{1,100}?))', expand=True)
Otherwise, you also select numbers where no such sign is placed before, as you already noticed.
If there are also entries where the number has no or - before it, you can add an $ to the very end of your regex string, which denotes the end of the string. Since the number seems to always come last, this also should work:
NameSeparate=dt['Name'].str.split(r'\s(?:([ -]?\d{,100}\.?\d{1,100}?))$', expand=True)
