I am trying to parse a webpage and store the result in a table. However, I can't proceed any further as the parse result has some unwanted text at the top, middle and bottom of the csv. http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = "http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"
response = requests.get(URL)
mywebpage = response.text
mysoup = BeautifulSoup(mywebpage, "html.parser")
print(mysoup)
gives...
<html>
<head>
<title> Euro Millions Winning Numbers</title>
<body>
<pre> Euro Millions Winning Numbers
No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
1500, Fri, 4,Feb,2022, 03,25,38,43,49,03,07, 109915000, 1
1499, Tue, 1,Feb,2022, 01,19,36,38,49,06,09, 52442757, 0
1498, Fri,28,Jan,2022, 10,25,29,34,45,09,10, 42779117, 0
...
1451, Tue,17,Aug,2021, 12,31,41,42,47,04,06, 14502700, 0
1450, Fri,13,Aug,2021, 06,12,44,47,49,08,12, 96295864, 1
<hr/><b>All lotteries below have exceeded the 180 days expiry date</b><hr/>No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
1449, Tue,10,Aug,2021, 09,37,47,48,49,02,07, 80768518, 0
1448, Fri, 6,Aug,2021, 07,14,21,26,32,04,12, 71953143, 0
...
3, Fri,27,Feb,2004, 14,18,19,31,37,04,05, 11880304, 0
2, Fri,20,Feb,2004, 07,13,39,47,50,02,05, 10111500, 0
1, Fri,13,Feb,2004, 16,29,32,36,41,07,09, 10143000, 1
This page shows all the draws that used any machine and any ball set in any year.
Data obtained from http://lottery.merseyworld.com/Euro/
</pre>
</body></head></html>
I just want to be able to extract the 1500 rows in a pandas dataframe and get rid of the text that occurs at the beginning, middle and end! Any help is appreciated. Thanks!
CodePudding user response:
This should give you the desired effect:
# Access the <pre> tag
mysoup = mysoup.pre.text.split("\n")
# Create a DataFrame object
df = pd.DataFrame(mysoup)
# Split column 0 based on the seperator ","
df = df[0].str.split(',', expand=True)
# Strip each cell by using a lambda function
df = df.apply(lambda row: row.str.strip(), axis=1)
# Exclude all rows which do not have an entry in column 1
df = df[~df[1].isnull()]
# Set the column names based on the first row
df.columns = df.iloc[0]
# drop the first row since it is the same as our column names
df = df.drop(df.index[0])
CodePudding user response:
You can split() your result and create a dataframe:
df = pd.DataFrame([x.split(',') for x in mysoup.pre.text.split('\n') if x.count(',') == 13])
Set the first row as your column headers:
df.columns = df.iloc[0]
Convert all values of column "No." into numbers:
df['No.'] = pd.to_numeric(df['No.'],errors='coerce').fillna(0).astype(int)
Filter only the rows where "No." is not "0":
df[df['No.'] != 0]
Output
| No. | Day | DD | MMM | YYYY | N1 | N2 | N3 | N4 | N5 | L1 | L2 | Jackpot | Wins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1500 | Fri | 4 | Feb | 2022 | 03 | 25 | 38 | 43 | 49 | 03 | 07 | 109915000 | 1 |
| 1499 | Tue | 1 | Feb | 2022 | 01 | 19 | 36 | 38 | 49 | 06 | 09 | 52442757 | 0 |
| 1498 | Fri | 28 | Jan | 2022 | 10 | 25 | 29 | 34 | 45 | 09 | 10 | 42779117 | 0 |
| 1497 | Tue | 25 | Jan | 2022 | 06 | 16 | 18 | 39 | 47 | 02 | 04 | 30632619 | 0 |
| 1496 | Fri | 21 | Jan | 2022 | 05 | 14 | 35 | 36 | 39 | 07 | 12 | 21857937 | 0 |
| 1495 | Tue | 18 | Jan | 2022 | 03 | 12 | 19 | 24 | 30 | 02 | 05 | 56246458 | 0 |
| 1494 | Fri | 14 | Jan | 2022 | 06 | 17 | 25 | 31 | 46 | 09 | 12 | 46600782 | 0 |
| 1493 | Tue | 11 | Jan | 2022 | 04 | 23 | 32 | 33 | 44 | 09 | 12 | 34392715 | 0 |
| 1492 | Fri | 7 | Jan | 2022 | 02 | 14 | 37 | 44 | 50 | 07 | 11 | 25451971 | 0 |
| 1491 | Tue | 4 | Jan | 2022 | 18 | 28 | 35 | 36 | 41 | 06 | 11 | 14189900 | 0 |
| 1490 | Fri | 31 | Dec | 2021 | 07 | 22 | 25 | 43 | 49 | 06 | 07 | 14788260 | 1 |
| ... |
