There are many related questions here about this issue, particulalry using left_join from dplyr , but I still can't figure it out.
All I want to do is return LanguageClean in Lookup based on a match to the Language column in df. If there is no match, simply return NA. I want LanguageClean added as a new column to df.
I can see that my code below is replicating the ID, but I don't want it to. The ID column is irrelevant for my purposes here, although I need to retain it in the final dataframe.
df <- structure(list(ID = structure(c(18L, 89L, 42L, 161L, 88L, 71L,
175L, 181L, 133L, 56L, 18L, 89L, 42L, 161L, 88L, 71L, 175L, 181L,
133L, 56L, 18L, 89L, 42L, 161L, 88L, 71L, 175L, 181L, 133L, 56L
), .Dim = c(10L, 3L)), Language = c("en", "", "lv", "en", "en",
"de", "en", "ms", "", "en"), Geo = c("us", "", "-", "us",
"us", "gb", "ca", "us", "-", "us")), class = "data.frame", row.names = c(NA,
-10L))
lookup <- structure(list(Language = c("af", "ar", "ar", "ar", "ar", "ar",
"ar", "ar", "ar", "eu", "be", "zh", "zh", "hr", "da", "nl", "en",
"en", "en", "en", "en", "en", "fo", "fi", "fr", "fr", "gd", "de",
"de", "de", "he", "hu", "id", "it", "ko", "lv", "mk", "mt", "no",
"pt", "rm", "ro", "ru", "sr", "sk", "sb", "es", "es", "es", "es",
"es", "es", "es", "es", "es", "sx", "sv", "ts", "tr", "ur", "vi",
"ji", "sq", "ar", "ar", "ar", "ar", "ar", "ar", "ar", "ar", "bg",
"ca", "zh", "zh", "cs", "nl", "en", "en", "en", "en", "en", "en",
"et", "fa", "fr", "fr", "fr", "ga", "de", "de", "el", "hi", "is",
"it", "ja", "ko", "lt", "ms", "no", "pl", "pt", "ro", "ru", "sz",
"sr", "sl", "es", "es", "es", "es", "es", "es", "es", "es", "es",
"es", "sv", "th", "tn", "uk", "ve", "xh", "zu"), LanguageClean = c("Afrikaans",
"Arabic", "Arabic", "Arabic", "Arabic", "Arabic", "Arabic", "Arabic",
"Arabic", "Basque", "Belarusian", "Chinese", "Chinese", "Croatian",
"Danish", "Dutch", "English", "English", "English", "English",
"English", "English", "Faeroese", "Finnish", "French", "French",
"Gaelic", "German", "German", "German", "Hebrew", "Hungarian",
"Indonesian", "Italian", "Korean", "Latvian", "Macedonian", "Maltese",
"Norwegian", "Portuguese", "Rhaeto-Romanic", "Romanian", "Russian",
"Serbian", "Slovak", "Sorbian", "Spanish", "Spanish", "Spanish",
"Spanish", "Spanish", "Spanish", "Spanish", "Spanish", "Spanish",
"Sutu", "Swedish", "Tsonga", "Turkish", "Urdu", "Vietnamese",
"Yiddish", "Albanian", "Arabic", "Arabic", "Arabic", "Arabic",
"Arabic", "Arabic", "Arabic", "Arabic", "Bulgarian", "Catalan",
"Chinese", "Chinese", "Czech", "Dutch", "English", "English",
"English", "English", "English", "English", "Estonian", "Farsi",
"French", "French", "French", "Irish", "German", "German", "Greek",
"Hindi", "Icelandic", "Italian", "Japanese", "Korean", "Lithuanian",
"Malaysian", "Norwegian", "Polish", "Portuguese", "Romanian",
"Russian", "Sami", "Serbian", "Slovenian", "Spanish", "Spanish",
"Spanish", "Spanish", "Spanish", "Spanish", "Spanish", "Spanish",
"Spanish", "Spanish", "Swedish", "Thai", "Tswana", "Ukrainian",
"Venda", "Xhosa", "Zulu")), class = "data.frame", row.names = c(NA,
-124L))
df <- left_join(df, lookup, by="Language")
CodePudding user response:
The issue is that your lookup table contains multiple entries for some languages. Hence you end up with multiple matches. So solve your issue you could filter out the distinct or unique combinations from your lookup using dplyr::distinct:
library(dplyr)
df <- left_join(df, distinct(lookup, Language, LanguageClean), by = "Language")
df
#> ID.1 ID.2 ID.3 Language Geo LanguageClean
#> 1 18 18 18 en us English
#> 2 89 89 89 <NA>
#> 3 42 42 42 lv - Latvian
#> 4 161 161 161 en us English
#> 5 88 88 88 en us English
#> 6 71 71 71 de gb German
#> 7 175 175 175 en ca English
#> 8 181 181 181 ms us Malaysian
#> 9 133 133 133 - <NA>
#> 10 56 56 56 en us English
