As an exercise in Pandas, I am trying to analyse the best strategies for wordle. I found a list of 5 letter words and I am trying to create a dataFrame with the words and the number of each letter these words have.
I have manage to do this using this code:
import pandas as pd
import numpy as np
import string
words_df = pd.read_csv('https://raw.githubusercontent.com/charlesreid1/five-letter-words/master/sgb-words.txt', sep="\n", header=None)
words_df.rename(columns={0:'words'}, inplace=True)
alphabet_string = string.ascii_lowercase
alphabet_list = list(alphabet_string)
for letter in alphabet_list:
words_df[letter] = 0
letter_count_list = []
for row_index, word in enumerate(words_df['words']):
letter_count_list.append([])
for column_index, letter in enumerate(alphabet_list):
letter_count = word.count(letter)
letter_count_list[row_index].append(letter_count)
words_df.iloc[:,1:] = letter_count_list
words_df
I get exactly what I want, which is the following dataframe:
words a b c d e f g h i j k l m n o p q r s t u v w x y z
0 which 0 0 1 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 there 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
2 their 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
3 about 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0
4 would 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5752 osier 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0
5753 roble 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0
5754 rumba 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
5755 biffy 0 1 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
5756 pupal 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 0 0
The code works perfectly, however, my main goal here is to learn Pandas and I know that using a for loop is not the best way to achieve this result. I have tried using the .apply() method, but I could not get it to work. What is the "pandas way" of achieving the same result?
Extra: I know that my Python skills are not the best, I would also appreciate comments on my code.
CodePudding user response:
EDIT: this doesn't work perfectly, as the counts don't increment as there are more of the same letter in a word, e.g. for row 5755, f should be 2 because there are two f's in biffy, but it's only 1.
Original answer:
There's actually a really concise way to do exactly this what you're trying to do: <column>.str.get_dummies(). It takes a separator string, and splits all the strings in that column by the separator, and creates a one-hot encoded dataframe out of them (exactly what you're trying to do):
>>> df['words'].str.get_dummies('')
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
2 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
3 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0
... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
5752 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0
5753 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0
5754 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
5755 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
5756 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
To add that back to the original dataframe:
tmp_df = df['words'].str.get_dummies('')
df[tmp_df.columns] = tmp_df
Output:
>>> df
words a b c d e f g h i j k l m n o p q r s t u v w x y z
0 which 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 there 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
2 their 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
3 about 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0
4 would 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0
... ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
5752 osier 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0
5753 roble 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0
5754 rumba 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
5755 biffy 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
5756 pupal 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
[5757 rows x 27 columns]
CodePudding user response:
You can use the built-in collections.Counter function, which returns a dict containing unique elements of a string/list and their counts, then convert it to a dataframe, fill NaNs, and assign it back to the original dataframe:
from collections import Counter
letter_cols = df.columns[1:]
df[letter_cols] = pd.DataFrame(df['words'].apply(Counter).tolist()).fillna(0).astype(int)[letter_cols]
Output:
>>> df
words a b c d e f g h i j k l m n o p q r s t u v w x y z
0 which 0 0 1 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 there 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
2 their 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
3 about 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0
4 would 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0
... ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
5752 osier 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0
5753 roble 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0
5754 rumba 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
5755 biffy 0 1 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
5756 pupal 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 0 0
[5757 rows x 27 columns]
CodePudding user response:
This how I will go about it and avaoid lambda because it is an anticlimax in pandas
- Add temp column which is a list of each letter in words
- Explode the df on column 1 above
- Apply pd.get_dummies on column 2
- 3 above gives frequency of each letter in 2
- groupby outcome in 4 and sum, gives a tally of each letter per index
- Leverage pd.join to pin 5 back to original df.
- drop temp column from the df
Code below
words_df['words1']=(list(map(','.join, words_df['words'])))#disintegrate words into letters
s= (words_df.drop(columns='words1')#drop new added column
.join(#Join original df to computed outcome
pd.get_dummies(#pd.get_dummies method
words_df.assign(words1=words_df['words1'].str.split(',')).explode('words1')['words1'],prefix='', prefix_sep='')#raw breakdown by letters
.groupby(level=0).sum().reset_index()#groupby index and sum letters per index
)
)
