Access rows where the string in a dataframe column contains 2 or more spaces between words using Pandas

I am learning Python on what are perhaps real-case scenarios, and I got a task to filter the names of companies that contain 3 or more words. The names are in the column named "Company Name" and the dataframe is called "data". I managed to get them into a list and eventually into a dataframe as well. However, in the resulting dataframe the rows appear where the columns should be, and the columns where the rows should be. It feels like I am walking around the problem.

a,b = data.shape
required_data = []

for i in range(a):
    if data["Company Name"][i].count(" ") >= 2:
        required_data.append(data.iloc[i])
    else:
        pass

required_data1 = pd.concat(required_data, axis=1, ignore_index = True)

required_data1

I would go for the axis=0 argument, but it returns a rather odd list of items from the dataframe. I am not sure whether this is the right approach, so I decided to ask for help. Many thanks!

CodePudding user response：

Use str.split to split the company names into words, count the length of each resulting list, then select the matching rows:

data = pd.DataFrame({'Company Name': ['American Telephone and Telegraph', 
                                      'America Online',
                                      'Capsule Computer',
                                      'International Business MachinesHP']})

required_data1 = data[data['Company Name'].str.split(r'\s+').str.len().ge(3)]
print(required_data1)

# Output
                        Company Name
0   American Telephone and Telegraph
3  International Business MachinesHP
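
If the criterion is taken literally as "2 or more spaces between words", a variation is to count the whitespace runs directly rather than splitting. This is only a sketch on the same sample data, and it assumes that any run of whitespace counts as one gap between words:

```python
import pandas as pd

data = pd.DataFrame({'Company Name': ['American Telephone and Telegraph',
                                      'America Online',
                                      'Capsule Computer',
                                      'International Business MachinesHP']})

# Strip leading/trailing whitespace, then count runs of whitespace per name.
# Two or more runs means the name has three or more words.
mask = data['Company Name'].str.strip().str.count(r'\s+') >= 2

required_data1 = data[mask]
print(required_data1)
```

Indexing with a boolean mask keeps whole rows and the original column layout, so no concat step is needed.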

CodePudding user response：

You can find the answer here: How do I select rows from a DataFrame based on column values?

In your case, we can use enumerate and .iloc like this:

required_data1 = data["Company Name"].iloc[[i for i, x in enumerate(data["Company Name"]) if x.count(" ") >= 2]]
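
As for the swapped rows and columns in the question: each data.iloc[i] is a Series whose index holds the column labels, so pd.concat with axis=1 places every selected row side by side as a column, while axis=0 stacks all the values into one long Series. The sketch below reproduces this on a small made-up frame (the "Id" column and its values are hypothetical, added only so there is more than one column) and shows one possible way to get the rows back as rows:

```python
import pandas as pd

# Hypothetical sample frame; "Id" exists only to give the frame a second column.
data = pd.DataFrame({'Company Name': ['American Telephone and Telegraph',
                                      'America Online',
                                      'International Business MachinesHP'],
                     'Id': [1, 2, 3]})

# Same selection as in the question: collect each matching row as a Series.
rows = [data.iloc[i] for i in range(len(data))
        if data['Company Name'].iloc[i].count(' ') >= 2]

# axis=1 puts each row Series side by side, so column labels end up in the index.
wide = pd.concat(rows, axis=1, ignore_index=True)

# axis=0 stacks every value of every row into a single long Series.
stacked = pd.concat(rows, axis=0)

# Building a DataFrame from the list of Series keeps one row per Series.
fixed = pd.DataFrame(rows)

print(wide)
print(fixed)
```

Transposing the axis=1 result with .T should give the same layout as the last step, although boolean indexing on the whole frame avoids the loop entirely.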