Webpage Parsing-CodePudding

I am trying to parse a webpage and store the result in a table. However, I can't proceed any further as the parse result has some unwanted text at the top, middle and bottom of the csv. http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"


response = requests.get(URL)
mywebpage = response.text
mysoup = BeautifulSoup(mywebpage, "html.parser")

print(mysoup)

gives...

<html>
<head>
<title> Euro Millions Winning Numbers</title>
<body>
<pre> Euro Millions Winning Numbers

No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1500, Fri, 4,Feb,2022, 03,25,38,43,49,03,07, 109915000,    1
1499, Tue, 1,Feb,2022, 01,19,36,38,49,06,09,  52442757,    0
1498, Fri,28,Jan,2022, 10,25,29,34,45,09,10,  42779117,    0
...
1451, Tue,17,Aug,2021, 12,31,41,42,47,04,06,  14502700,    0
1450, Fri,13,Aug,2021, 06,12,44,47,49,08,12,  96295864,    1
<hr/><b>All lotteries below have exceeded the 180 days expiry date</b><hr/>No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1449, Tue,10,Aug,2021, 09,37,47,48,49,02,07,  80768518,    0
1448, Fri, 6,Aug,2021, 07,14,21,26,32,04,12,  71953143,    0
...
3, Fri,27,Feb,2004, 14,18,19,31,37,04,05,  11880304,    0
  2, Fri,20,Feb,2004, 07,13,39,47,50,02,05,  10111500,    0
  1, Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1

This page shows all the draws that used any machine and any ball set in any year.

Data obtained from http://lottery.merseyworld.com/Euro/
</pre>
</body></head></html>

I just want to be able to extract the 1500 rows in a pandas dataframe and get rid of the text that occurs at the beginning, middle and end! Any help is appreciated. Thanks!

CodePudding user response：

This should give you the desired effect:

# Access the <pre> tag
mysoup = mysoup.pre.text.split("\n")

# Create a DataFrame object
df = pd.DataFrame(mysoup)

# Split column 0 based on the seperator ","
df = df[0].str.split(',', expand=True)

# Strip each cell by using a lambda function
df = df.apply(lambda row: row.str.strip(), axis=1)

# Exclude all rows which do not have an entry in column 1
df = df[~df[1].isnull()]

# Set the column names based on the first row
df.columns = df.iloc[0]

# drop the first row since it is the same as our column names
df = df.drop(df.index[0])

CodePudding user response：

You can split() your result and create a dataframe:

df = pd.DataFrame([x.split(',') for x in mysoup.pre.text.split('\n') if x.count(',') == 13])

Set the first row as your column headers:

df.columns = df.iloc[0]

Convert all values of column "No." into numbers:

df['No.'] = pd.to_numeric(df['No.'],errors='coerce').fillna(0).astype(int)

Filter only the rows where "No." is not "0":

df[df['No.'] != 0]

Output

No.	Day	DD	MMM	YYYY	N1	N2	N3	N4	N5	L1	L2	Jackpot	Wins
1500	Fri	4	Feb	2022	03	25	38	43	49	03	07	109915000	1
1499	Tue	1	Feb	2022	01	19	36	38	49	06	09	52442757	0
1498	Fri	28	Jan	2022	10	25	29	34	45	09	10	42779117	0
1497	Tue	25	Jan	2022	06	16	18	39	47	02	04	30632619	0
1496	Fri	21	Jan	2022	05	14	35	36	39	07	12	21857937	0
1495	Tue	18	Jan	2022	03	12	19	24	30	02	05	56246458	0
1494	Fri	14	Jan	2022	06	17	25	31	46	09	12	46600782	0
1493	Tue	11	Jan	2022	04	23	32	33	44	09	12	34392715	0
1492	Fri	7	Jan	2022	02	14	37	44	50	07	11	25451971	0
1491	Tue	4	Jan	2022	18	28	35	36	41	06	11	14189900	0
1490	Fri	31	Dec	2021	07	22	25	43	49	06	07	14788260	1
...