Having trouble putting data into a pandas dataframe-CodePudding

I am new to coding, so take it easy on me! I recently started a pet project which scrapes data from a table and will create a csv of the data for me. I believe I have successfully pulled the data, but trying to put it into a dataframe returns the error "Shape of passed values is (31719, 1), indices imply (31719, 23)". I have tried looking at the length of my headers and my rows and those numbers are correct, but when I try to put it into a dataframe it appears that it is only pulling one column into the dataframe. Again, I am very new to all of this but would appreciate any help! Code below

from bs4 import BeautifulSoup
from pandas.core.frame import DataFrame
import requests
import pandas as pd
url = 'https://www.fangraphs.com/leaders.aspx? pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
#pulling table from HTML
Table1 = soup.find('table', id = 'LeaderBoard1_dg1_ctl00')
#finding and filling table columns
headers = []
for i in Table1.find_all('th'):
    title = i.text
    headers.append(title)
#finding and filling table rows
rows = []
for j in Table1.find_all('td'):
    data = j.text
    rows.append(data)
#filling dataframe
df = pd.DataFrame(rows, columns = headers)
#show dataframe
print(df)

CodePudding user response：

You are creating a dataframe with 692 rows with 23 columns as a new dataframe. However looking at the rows array, you only have 1 dimensional array so shape of passed values is not matching with indices. You are passing 692 x 1 to a dataframe with 692 x 23 which won't work.

If you want to create with the data you have, you should just use:

df=pd.DataFrame(rows, columns=headers[1:2])

CodePudding user response：

Alternativly you can achieve your goal directly by using pandas.read_html that processe the data by BeautifulSoup for you:

pd.read_html(url, attrs={'id':'LeaderBoard1_dg1_ctl00'}, header=[1])[0].iloc[:-1]

attrs={'id':'LeaderBoard1_dg1_ctl00'} selects table by id
header=[1] adjusts the header cause there are multiple headers
.iloc[:-1] removes the table footer with pagination

Example

import pandas as pd

pd.read_html('https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500',
            attrs={'id':'LeaderBoard1_dg1_ctl00'},
            header=[1])[0]\
            .iloc[:-1]