Home > Net >  Webpage Parsing
Webpage Parsing

Time:02-08

I am trying to parse a webpage and store the result in a table. However, I can't proceed any further as the parse result has some unwanted text at the top, middle and bottom of the csv. http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"


response = requests.get(URL)
mywebpage = response.text
mysoup = BeautifulSoup(mywebpage, "html.parser")

print(mysoup)

gives...

<html>
<head>
<title> Euro Millions Winning Numbers</title>
<body>
<pre> Euro Millions Winning Numbers

No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1500, Fri, 4,Feb,2022, 03,25,38,43,49,03,07, 109915000,    1
1499, Tue, 1,Feb,2022, 01,19,36,38,49,06,09,  52442757,    0
1498, Fri,28,Jan,2022, 10,25,29,34,45,09,10,  42779117,    0
...
1451, Tue,17,Aug,2021, 12,31,41,42,47,04,06,  14502700,    0
1450, Fri,13,Aug,2021, 06,12,44,47,49,08,12,  96295864,    1
<hr/><b>All lotteries below have exceeded the 180 days expiry date</b><hr/>No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1449, Tue,10,Aug,2021, 09,37,47,48,49,02,07,  80768518,    0
1448, Fri, 6,Aug,2021, 07,14,21,26,32,04,12,  71953143,    0
...
3, Fri,27,Feb,2004, 14,18,19,31,37,04,05,  11880304,    0
  2, Fri,20,Feb,2004, 07,13,39,47,50,02,05,  10111500,    0
  1, Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1

This page shows all the draws that used any machine and any ball set in any year.

Data obtained from http://lottery.merseyworld.com/Euro/
</pre>
</body></head></html>

I just want to be able to extract the 1500 rows in a pandas dataframe and get rid of the text that occurs at the beginning, middle and end! Any help is appreciated. Thanks!

CodePudding user response:

This should give you the desired effect:

# Access the <pre> tag
mysoup = mysoup.pre.text.split("\n")

# Create a DataFrame object
df = pd.DataFrame(mysoup)

# Split column 0 based on the seperator ","
df = df[0].str.split(',', expand=True)

# Strip each cell by using a lambda function
df = df.apply(lambda row: row.str.strip(), axis=1)

# Exclude all rows which do not have an entry in column 1
df = df[~df[1].isnull()]

# Set the column names based on the first row
df.columns = df.iloc[0]

# drop the first row since it is the same as our column names
df = df.drop(df.index[0])

CodePudding user response:

You can split() your result and create a dataframe:

df = pd.DataFrame([x.split(',') for x in mysoup.pre.text.split('\n') if x.count(',') == 13])

Set the first row as your column headers:

df.columns = df.iloc[0]

Convert all values of column "No." into numbers:

df['No.'] = pd.to_numeric(df['No.'],errors='coerce').fillna(0).astype(int)

Filter only the rows where "No." is not "0":

df[df['No.'] != 0]

Output

No. Day DD MMM YYYY N1 N2 N3 N4 N5 L1 L2 Jackpot Wins
1500 Fri 4 Feb 2022 03 25 38 43 49 03 07 109915000 1
1499 Tue 1 Feb 2022 01 19 36 38 49 06 09 52442757 0
1498 Fri 28 Jan 2022 10 25 29 34 45 09 10 42779117 0
1497 Tue 25 Jan 2022 06 16 18 39 47 02 04 30632619 0
1496 Fri 21 Jan 2022 05 14 35 36 39 07 12 21857937 0
1495 Tue 18 Jan 2022 03 12 19 24 30 02 05 56246458 0
1494 Fri 14 Jan 2022 06 17 25 31 46 09 12 46600782 0
1493 Tue 11 Jan 2022 04 23 32 33 44 09 12 34392715 0
1492 Fri 7 Jan 2022 02 14 37 44 50 07 11 25451971 0
1491 Tue 4 Jan 2022 18 28 35 36 41 06 11 14189900 0
1490 Fri 31 Dec 2021 07 22 25 43 49 06 07 14788260 1
...
  •  Tags:  
  • Related