How to use apply() in pandas with other columns and rows?-CodePudding

As an exercise in Pandas, I am trying to analyse the best strategies for wordle. I found a list of 5 letter words and I am trying to create a dataFrame with the words and the number of each letter these words have.

I have manage to do this using this code:

import pandas as pd
import numpy as np
import string

words_df = pd.read_csv('https://raw.githubusercontent.com/charlesreid1/five-letter-words/master/sgb-words.txt', sep="\n", header=None)

words_df.rename(columns={0:'words'}, inplace=True)

alphabet_string = string.ascii_lowercase
alphabet_list = list(alphabet_string)


for letter in alphabet_list:
  words_df[letter] = 0

letter_count_list = []
for row_index, word in enumerate(words_df['words']):

    letter_count_list.append([])

    for column_index, letter in enumerate(alphabet_list):

        letter_count = word.count(letter)
        letter_count_list[row_index].append(letter_count)

words_df.iloc[:,1:] = letter_count_list
words_df

I get exactly what I want, which is the following dataframe:

    words   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
0   which   0   0   1   0   0   0   0   2   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   there   0   0   0   0   2   0   0   1   0   0   0   0   0   0   0   0   0   1   0   1   0   0   0   0   0   0
2   their   0   0   0   0   1   0   0   1   1   0   0   0   0   0   0   0   0   1   0   1   0   0   0   0   0   0
3   about   1   1   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   1   1   0   0   0   0   0
4   would   0   0   0   1   0   0   0   0   0   0   0   1   0   0   1   0   0   0   0   0   1   0   1   0   0   0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5752    osier   0   0   0   0   1   0   0   0   1   0   0   0   0   0   1   0   0   1   1   0   0   0   0   0   0   0
5753    roble   0   1   0   0   1   0   0   0   0   0   0   1   0   0   1   0   0   1   0   0   0   0   0   0   0   0
5754    rumba   1   1   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   1   0   0   1   0   0   0   0   0
5755    biffy   0   1   0   0   0   2   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
5756    pupal   1   0   0   0   0   0   0   0   0   0   0   1   0   0   0   2   0   0   0   0   1   0   0   0   0   0

The code works perfectly, however, my main goal here is to learn Pandas and I know that using a for loop is not the best way to achieve this result. I have tried using the .apply() method, but I could not get it to work. What is the "pandas way" of achieving the same result?

Extra: I know that my Python skills are not the best, I would also appreciate comments on my code.

CodePudding user response：

EDIT: this doesn't work perfectly, as the counts don't increment as there are more of the same letter in a word, e.g. for row 5755, f should be 2 because there are two f's in biffy, but it's only 1.

Original answer:

There's actually a really concise way to do exactly this what you're trying to do: <column>.str.get_dummies(). It takes a separator string, and splits all the strings in that column by the separator, and creates a one-hot encoded dataframe out of them (exactly what you're trying to do):

>>> df['words'].str.get_dummies('')
      a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
0     0  0  1  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
1     0  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
2     0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
3     1  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  1  0  0  0  0  0
4     0  0  0  1  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  1  0  1  0  0  0
...  .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
5752  0  0  0  0  1  0  0  0  1  0  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0
5753  0  1  0  0  1  0  0  0  0  0  0  1  0  0  1  0  0  1  0  0  0  0  0  0  0  0
5754  1  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0
5755  0  1  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
5756  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0  1  0  0  0  0  0

To add that back to the original dataframe:

tmp_df = df['words'].str.get_dummies('')
df[tmp_df.columns] = tmp_df

Output:

>>> df
      words  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
0     which  0  0  1  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
1     there  0  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
2     their  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
3     about  1  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  1  0  0  0  0  0
4     would  0  0  0  1  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  1  0  1  0  0  0
...     ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
5752  osier  0  0  0  0  1  0  0  0  1  0  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0
5753  roble  0  1  0  0  1  0  0  0  0  0  0  1  0  0  1  0  0  1  0  0  0  0  0  0  0  0
5754  rumba  1  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0
5755  biffy  0  1  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
5756  pupal  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0  1  0  0  0  0  0

[5757 rows x 27 columns]

CodePudding user response：

You can use the built-in collections.Counter function, which returns a dict containing unique elements of a string/list and their counts, then convert it to a dataframe, fill NaNs, and assign it back to the original dataframe:

from collections import Counter

letter_cols = df.columns[1:]
df[letter_cols] = pd.DataFrame(df['words'].apply(Counter).tolist()).fillna(0).astype(int)[letter_cols]

Output:

>>> df
      words  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
0     which  0  0  1  0  0  0  0  2  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
1     there  0  0  0  0  2  0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
2     their  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
3     about  1  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  1  0  0  0  0  0
4     would  0  0  0  1  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  1  0  1  0  0  0
...     ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
5752  osier  0  0  0  0  1  0  0  0  1  0  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0
5753  roble  0  1  0  0  1  0  0  0  0  0  0  1  0  0  1  0  0  1  0  0  0  0  0  0  0  0
5754  rumba  1  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0
5755  biffy  0  1  0  0  0  2  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
5756  pupal  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  2  0  0  0  0  1  0  0  0  0  0

[5757 rows x 27 columns]

CodePudding user response：

This how I will go about it and avaoid lambda because it is an anticlimax in pandas

Add temp column which is a list of each letter in words
Explode the df on column 1 above
Apply pd.get_dummies on column 2
3 above gives frequency of each letter in 2
groupby outcome in 4 and sum, gives a tally of each letter per index
Leverage pd.join to pin 5 back to original df.
drop temp column from the df

Code below

words_df['words1']=(list(map(','.join, words_df['words'])))#disintegrate words into letters
s= (words_df.drop(columns='words1')#drop new added column
    .join(#Join original df to computed outcome
        pd.get_dummies(#pd.get_dummies method
            words_df.assign(words1=words_df['words1'].str.split(',')).explode('words1')['words1'],prefix='', prefix_sep='')#raw breakdown by letters
        .groupby(level=0).sum().reset_index()#groupby index and sum letters per index
    )
   )