I'm trying to find the frequency of strings from the field "Select Investors" on this website https://www.cbinsights.com/research-unicorn-companies
Is there a way to pull out the frequency of each of the comma separated strings?
For example, how often does the term "Sequoia Capital China" show up?
CodePudding user response:
The solution provided by @Mazhar checks whether a certain term is a substring of a string delimited by commas. As a consequence, the number of occurrences of 'Sequoia Capital' returned by this approach is the sum of the occurrences of all the strings that contain 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids that issue:
import pandas as pd
from collections import defaultdict
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
freqs = defaultdict(int)
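for group in df['Select Investors']:
    # Skip missing values (NaN is a float and has no .lower method)
    if hasattr(group, 'lower'):
        for investor in group.lower().split(','):
            # Increment the count for the exact, whitespace-stripped name
            freqs[investor.strip()] += 1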
Demo
In [57]: freqs['sequoia capital']
Out[57]: 41
In [58]: freqs['sequoia capital china']
Out[58]: 46
In [59]: freqs['sequoia capital india']
Out[59]: 25
In [60]: freqs['sequoia capital israel']
Out[60]: 2
In [61]: freqs['and sequoia capital china']
Out[61]: 1
The sum of these occurrences is 115 (41 + 46 + 25 + 2 + 1), which coincides with the frequency returned for 'sequoia capital' by the currently accepted solution.
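To make the difference between substring matching and exact matching concrete, here is a minimal sketch on a few made-up rows. The `rows` list is a hypothetical stand-in for the 'Select Investors' column, not data taken from the site.

```python
from collections import Counter

# Hypothetical sample rows standing in for the 'Select Investors' column
rows = [
    "Sequoia Capital China, SIG Asia Investments, Sina Weibo",
    "Sequoia Capital, Founders Fund",
    "Softbank Group, Sequoia Capital India",
]

term = "sequoia capital"

# Substring matching: counts every investor whose name contains the term
substring_count = sum(
    term in investor.lower()
    for row in rows
    for investor in row.split(",")
)

# Exact matching: counts only investors whose whole name equals the term
exact_counts = Counter(
    investor.strip().lower()
    for row in rows
    for investor in row.split(",")
)

print(substring_count)     # 3
print(exact_counts[term])  # 1
```

On this sample the substring approach reports 3 because it also picks up 'Sequoia Capital China' and 'Sequoia Capital India', while the exact count for 'sequoia capital' is 1.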
CodePudding user response:
Here is a correct, more Pythonic way to do it:
import itertools
import collections
import pandas as pd
def fun(x):
    # Split a comma-separated cell into whitespace-stripped investor names
    return map(lambda y: y.strip(), str(x).split(','))
# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
# Process
investor = first_df['Select Investors'].apply(lambda x: fun(x))
investor = investor.values.flatten()
investor = list(itertools.chain(*investor))
# Organize
final_data = collections.Counter(investor).items()
final_df = pd.DataFrame(final_data, columns=['Investor', 'Frequency'])
final_df
Output:
                                       Investor  Frequency
0                         Sequoia Capital China         46
1                          SIG Asia Investments          3
2                                    Sina Weibo          2
3                                Softbank Group          9
4                                 Founders Fund         16
...                                         ...        ...
1187  Motive Partners. Apollo Global Management          1
1188                                JBV Capital          1
1189                             Array Ventures          1
1190                               AWZ Ventures          1
1191                            Endiya Partners          1
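As a possible alternative, the same table can probably be built with pandas string methods alone. This is only a sketch: the small `df` below is a hypothetical stand-in for `pd.read_html(url)[0]`, and the `dropna()` step is there on the assumption that the column may contain missing values, which `str(x)` would otherwise turn into the string 'nan' and count as an investor.

```python
import pandas as pd

# Hypothetical sample standing in for the table returned by pd.read_html(url)[0]
df = pd.DataFrame({
    "Select Investors": [
        "Sequoia Capital China, SIG Asia Investments",
        "Sequoia Capital China, Founders Fund",
        None,  # missing value, dropped below
    ]
})

freq = (
    df["Select Investors"]
    .dropna()          # ignore rows with no investors listed
    .str.split(",")    # one list of names per row
    .explode()         # one name per row
    .str.strip()       # remove surrounding whitespace
    .value_counts()    # exact-match frequency of each name
)

# Same two-column layout as final_df above
final_df = freq.rename_axis("Investor").reset_index(name="Frequency")
print(final_df)
```

Unlike the `Counter` version, `value_counts()` returns the names sorted by descending frequency, which may or may not be what you want.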
