Frequency of string (comma separated) in Python

I'm trying to find the frequency of strings from the field "Select Investors" on this website https://www.cbinsights.com/research-unicorn-companies

Is there a way to pull out the frequency of each of the comma separated strings?

For example, how often does the term "Sequoia Capital China" show up?

CodePudding user response：

The solution provided by @Mazhar checks whether a certain term is a substring of a string delimited by commas. As a consequence, the number of occurrences of 'Sequoia Capital' returned by this approach is the sum of the occurrences of all the strings that contain 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids that issue:

import pandas as pd
from collections import defaultdict

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]

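# Count each investor name, case-insensitively
freqs = defaultdict(int)
for group in df['Select Investors']:
    # Skip missing values (NaN is a float and has no .lower method)
    if hasattr(group, 'lower'):
        for investor in group.lower().split(','):
            freqs[investor.strip()] += 1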

Demo

In [57]: freqs['sequoia capital']
Out[57]: 41

In [58]: freqs['sequoia capital china']
Out[58]: 46

In [59]: freqs['sequoia capital india']
Out[59]: 25

In [60]: freqs['sequoia capital israel']
Out[60]: 2

In [61]: freqs['and sequoia capital china']
Out[61]: 1

The sum of occurrences is 115, which coincides with the frequency returned for 'sequoia capital' by the currently accepted solution.
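
To see the difference between the two approaches without downloading the table, here is a minimal sketch on a small made-up sample. The three rows below are hypothetical stand-ins for the 'Select Investors' column, not data from the website:

```python
from collections import Counter

# Hypothetical sample rows standing in for the 'Select Investors' column.
rows = [
    "Sequoia Capital China, SIG Asia Investments, Sina Weibo",
    "Sequoia Capital, Founders Fund",
    "Softbank Group, Sequoia Capital India",
]

# Substring matching: counts every row that merely contains the term.
substring_count = sum("sequoia capital" in row.lower() for row in rows)

# Exact matching: split on commas, normalise, then count whole names.
exact_counts = Counter(
    name.strip().lower()
    for row in rows
    for name in row.split(",")
)

print(substring_count)                  # 3
print(exact_counts["sequoia capital"])  # 1
```

The substring check reports 3 because 'Sequoia Capital China' and 'Sequoia Capital India' also contain the term, while the exact count reports only the single row that lists 'Sequoia Capital' itself.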

CodePudding user response：

Here is a correct, more Pythonic way to do it:

import itertools
import collections
import pandas as pd


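def fun(x):
    # Split a comma-separated cell into whitespace-stripped names.
    # Note: str(x) turns a missing value (NaN) into the literal string 'nan'.
    return map(lambda y: y.strip(), str(x).split(','))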


# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]

# Process: split each row into names, then flatten into a single list
investor = first_df['Select Investors'].apply(fun)
investor = list(itertools.chain.from_iterable(investor))

# Organize
final_data = collections.Counter(investor).items()
final_df = pd.DataFrame(final_data, columns=['Investor', 'Frequency'])
final_df

Output:

    Investor                                        Frequency
0   Sequoia Capital China                           46
1   SIG Asia Investments                            3
2   Sina Weibo                                      2
3   Softbank Group                                  9
4   Founders Fund                                   16
...     ...     ...
1187    Motive Partners. Apollo Global Management   1
1188    JBV Capital                                 1
1189    Array Ventures                              1
1190    AWZ Ventures                                1
1191    Endiya Partners                             1
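
The same split, flatten and count steps can probably also be expressed with pandas string methods alone. The sketch below is one possible variant, not the method used above. It assumes pandas 0.25 or later (for Series.explode) and uses a small hypothetical DataFrame in place of the scraped table:

```python
import pandas as pd

# Hypothetical stand-in for the table returned by pd.read_html(url)[0].
df = pd.DataFrame({
    "Select Investors": [
        "Sequoia Capital China, SIG Asia Investments",
        "Founders Fund, Sequoia Capital China",
        None,  # missing cell, dropped below instead of being counted
    ]
})

counts = (
    df["Select Investors"]
    .dropna()          # ignore rows with no investors listed
    .str.split(",")    # each cell becomes a list of names
    .explode()         # one name per row
    .str.strip()       # remove the space left after each comma
    .value_counts()    # frequency of each exact name
)

# Reshape into the same two-column layout as the output above.
final_df = counts.rename_axis("Investor").reset_index(name="Frequency")
print(final_df)
```

Dropping missing cells before splitting avoids counting a spurious 'nan' entry.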