I have a data frame containing a column with users' email addresses. The format of the email address could be anything. I need to create a new column called 'agency' with just the domain of the user's email (in other words, extract the value between '@' and the last '.').
Example:
- '[email protected]' becomes 'mydomain'
- '[email protected]' becomes 'yourdomain'
I don't seem to be able to tackle the syntax to get there...
So far the best I could do was to eliminate the part before @:
Azure_table <- Azure_table %>%
  mutate(
    agency = gsub(".*@", "", userPrincipalName)
  )
Which gives me the following result:

How do I eliminate the text after the last dot (.com, .ca, etc)? Is there a better way of doing this?
Thanks in advance!
Answer:
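The following pattern, used with str_extract from the stringr package, should suit your needs. Instead of replacing text with an empty string, it extracts the desired part directly: the lookbehind (?<=@) starts the match right after the '@', and the lookahead stops it just before the final dot and the letters that follow it.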
pattern <- "(?<=@).*(?=\\.[a-zA-Z]+$)"
Test cases:
s1 <- "[email protected]"
s2 <- "[email protected]"
s3 <- "[email protected]"
s4 <- "[email protected]"
str_extract(s1, pattern)
[1] "subtel"
str_extract(s2, pattern)
[1] "subtel"
str_extract(s3, pattern)
[1] "hello.something"
str_extract(s4, pattern)
[1] "example.applestore.apple"
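To tie this back to the original data frame, here is a minimal sketch of the same pattern used inside mutate. The sample addresses below are made up for illustration, and I'm assuming the column is named userPrincipalName as in the question.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data: these addresses are invented for illustration only.
Azure_table <- data.frame(
  userPrincipalName = c("alice@mydomain.com", "bob@dept.yourdomain.co", "no-at-sign"),
  stringsAsFactors = FALSE
)

# Everything after '@' up to (but not including) the last dot and trailing letters.
pattern <- "(?<=@).*(?=\\.[a-zA-Z]+$)"

Azure_table <- Azure_table %>%
  mutate(agency = str_extract(userPrincipalName, pattern))

print(Azure_table$agency)
# [1] "mydomain"        "dept.yourdomain" NA
```

Note that str_extract returns NA when a value has no '@' or no trailing dot followed by letters, so malformed addresses are easy to spot afterwards with is.na(agency).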
