I am new to coding, so take it easy on me! I recently started a pet project that scrapes data from a table and writes it to a CSV file. I believe I have pulled the data successfully, but putting it into a dataframe raises the error "Shape of passed values is (31719, 1), indices imply (31719, 23)". I have checked the lengths of my headers and my rows, and those numbers look correct, but the dataframe appears to receive only one column. Again, I am very new to all of this and would appreciate any help! Code below:
from bs4 import BeautifulSoup
from pandas.core.frame import DataFrame
import requests
import pandas as pd
url = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
#pulling table from HTML
Table1 = soup.find('table', id = 'LeaderBoard1_dg1_ctl00')
#finding and filling table columns
headers = []
for i in Table1.find_all('th'):
    title = i.text
    headers.append(title)
#finding and filling table rows
rows = []
for j in Table1.find_all('td'):
    data = j.text
    rows.append(data)
#filling dataframe
df = pd.DataFrame(rows, columns = headers)
#show dataframe
print(df)
CodePudding user response:
You are asking for a dataframe with 23 columns, but rows is a flat, one-dimensional list: the text of every td was appended as its own element. pandas therefore sees 31719 rows of 1 column, while columns=headers implies 23 columns, so the shape of the passed values does not match the indices and the constructor fails.
If you only want to build a dataframe from the list exactly as it is, pass a single column name, which puts every cell into that one column:
df=pd.DataFrame(rows, columns=headers[1:2])
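To get all 23 columns instead, the flat list has to be regrouped into one list per table row. Below is a minimal sketch, assuming the number of cells is an exact multiple of the number of headers. The helper `chunk_cells` and the sample values are hypothetical stand-ins, not taken from the real page. On the real page, extra cells (pagination, for instance) may break that assumption, in which case collecting the `td` cells per `tr` is likely the safer route.

```python
def chunk_cells(cells, width):
    """Split a flat list of cell texts into rows of `width` cells each."""
    if width <= 0:
        raise ValueError("width must be positive")
    if len(cells) % width != 0:
        # Extra cells (e.g. pagination or footer cells) would misalign every row.
        raise ValueError(
            f"{len(cells)} cells cannot be split evenly into rows of {width}"
        )
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Hypothetical stand-ins for the scraped `headers` and `rows` lists.
headers = ["Name", "Team", "HR"]
rows = ["Player A", "AAA", "10", "Player B", "BBB", "20"]

# One inner list per table row, each as long as `headers`.
table_rows = chunk_cells(rows, len(headers))
print(table_rows)
```

Passing the regrouped list to pd.DataFrame(table_rows, columns=headers) should then give matching shapes, since each inner list has one value per header.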
CodePudding user response:
Alternatively, you can achieve your goal directly with pandas.read_html, which parses the HTML table for you (using lxml or BeautifulSoup under the hood):
pd.read_html(url, attrs={'id':'LeaderBoard1_dg1_ctl00'}, header=[1])[0].iloc[:-1]
attrs={'id':'LeaderBoard1_dg1_ctl00'} selects the table by id
header=[1] adjusts the header, because there are multiple header rows
.iloc[:-1] removes the table footer with the pagination
Example
import pandas as pd
pd.read_html('https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500',
             attrs={'id':'LeaderBoard1_dg1_ctl00'},
             header=[1])[0]\
    .iloc[:-1]
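Since the original goal was a CSV file, here is a small sketch of the last two steps: dropping a footer row with .iloc[:-1] and writing the result out with to_csv. The table below is made-up placeholder data (the final row stands in for a pagination footer), and the in-memory buffer is only an assumption to keep the example self-contained. A file path such as 'batting.csv' (hypothetical name) can be passed to to_csv instead.

```python
import io

import pandas as pd

# Hypothetical table: the last row stands in for a pagination footer.
df = pd.DataFrame(
    [["Player A", "10"], ["Player B", "20"], ["Page size: 30", None]],
    columns=["Name", "HR"],
)

# Drop the final (footer) row, as .iloc[:-1] does in the answer above.
clean = df.iloc[:-1]

# Write CSV text without the numeric index column.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

index=False keeps the row numbers out of the file, which is usually what you want for scraped tables.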
