I'm trying to scrape data from investing.com. My code works except for the table header. My "columns" variable holds the header cells, which carry the names in the form data-col-name="abc", but I don't know how to extract them as column_names.
table_rows = soup.find("tbody").find_all("tr")
table = []
for i in table_rows:
    td = i.find_all("td")
    row = [cell.string for cell in td]
    table.append(row)
columns = soup.find("thead").find_all("th")
column_names =
df_temp = pd.DataFrame(data=table, columns=column_names)
df_dji = pd.concat([df_dji, df_temp])  # DataFrame.append was removed in pandas 2.0
CodePudding user response:
You have to use .text instead of .string:
columns = soup.find("thead").find_all("th")
#print(columns)
column_names = [cell.text for cell in columns]
print(column_names)
Alternatively, use .get_text() or even .get_text(strip=True), which also trims surrounding whitespace:
column_names = [cell.get_text() for cell in columns]
print(column_names)
The official documentation shows .string (.text is a property that behaves like get_text(), although the documentation mostly describes get_text()), but .string doesn't work here, probably because the <th> contains more than one child, such as a nested <span>. In that case .string returns None, whereas get_text() collects the strings from all elements inside the <th> and joins them into one string.
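To illustrate the difference, here is a minimal sketch. The HTML snippet is made up for the example (it is an assumption, not the real investing.com markup); it only mimics a header cell that holds more than one child element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <th> has several children (two <span>
# elements and a whitespace string), the first has a single string.
html = (
    '<table><thead><tr>'
    '<th data-col-name="date">Date</th>'
    '<th data-col-name="price"><span>Price</span> <span class="sort"></span></th>'
    '</tr></thead></table>'
)

soup = BeautifulSoup(html, "html.parser")
columns = soup.find("thead").find_all("th")

# .string only works when the tag has exactly one child; otherwise it is None.
string_values = [cell.string for cell in columns]

# get_text() gathers every string inside the tag; strip=True trims whitespace.
text_values = [cell.get_text(strip=True) for cell in columns]

print(string_values)
print(text_values)
```

If .string gives you None for some header cells on the real page, this is the most likely reason.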
EDIT:
If you want to get the value from data-col-name= then use any of
cell['data-col-name']
cell.get('data-col-name')
cell.attrs['data-col-name']
cell.attrs.get('data-col-name')
(and the same is with cell['id'] or cell['class'])
column_names = [cell['data-col-name'] for cell in columns]
column_names = [cell.get('data-col-name') for cell in columns]
# etc.
attrs is a normal dictionary, so you can use attrs.get(key, default_value), attrs.keys(), attrs.items(), attrs.values(), or iterate over it with a for-loop like any other dictionary.
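As a rough sketch of why the .get() variants can be handy: cell['data-col-name'] raises KeyError when a header cell lacks the attribute, while .get() lets you supply a fallback. The HTML below is again invented for the example, and falling back to the visible text is just one possible choice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <th> has no data-col-name attribute.
html = (
    '<table><thead><tr>'
    '<th data-col-name="date" class="first left">Date</th>'
    '<th>Volume</th>'
    '</tr></thead></table>'
)

soup = BeautifulSoup(html, "html.parser")
columns = soup.find("thead").find_all("th")

# Use the attribute when present, otherwise fall back to the cell's text.
column_names = [
    cell.get("data-col-name", cell.get_text(strip=True)) for cell in columns
]

# attrs is a plain dict; note that class is returned as a list of values.
first_attrs = dict(columns[0].attrs)

print(column_names)
print(first_attrs)
```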
