Home > database >  Merge a list to new column dataframe
Merge a list to new column dataframe

Time:01-11

I want to merge a id from a list to a new column in a dataframe, based on other columns values

href id

'/soccer/england/premier-league/everton-arsenal/jkor5WO3/',
'/soccer/italy/serie-a/empoli-udinese/YoCoqht3/',
'/soccer/italy/serie-a/cagliari-torino/feDspYQd/',
'/soccer/spain/laliga/getafe-ath-bilbao/8xCPNQo5/',
'/soccer/spain/laliga2/zaragoza-eibar/xW0HLT6q/',
'/soccer/spain/laliga2/cartagena-sad-tenerife/veaBCjUS/',
'/soccer/spain/laliga2/girona-leganes/OSzr7Uij/',
'/soccer/france/ligue-2/niort-toulouse/8MqRAkTq/',
'/soccer/netherlands/eerste-divisie/jong-ajax-fc-emmen/2J3pyvSc/',
'/soccer/netherlands/eerste-divisie/jong-az-alkmaar-excelsior/tA2lzbs4/'

So, for example, the second href id has 'empoli' and 'udinese' in the url. The href id should be in the row where columns has 'empoli' and 'udinese' or similar values.

dataframe

df = pd.DataFrame({'HomeTeam':['Empoli','Getafe','Cagliari','Everton'],'AwayTeam':['Udinese','Athletic Bilbao','Torino','Arsenal']})

Expected Output

   HomeTeam         AwayTeam                                                 ID
0    Empoli          Udinese     /soccer/italy/serie-a/empoli-udinese/YoCoqht3/
1    Getafe  Athletic Bilbao   /soccer/spain/laliga/getafe-ath-bilbao/8xCPNQo5/
2  Cagliari           Torino    /soccer/italy/serie-a/cagliari-torino/feDspYQd/
3   Everton          Arsenal  /soccer/england/premier-league/everton-arsenal...

CodePudding user response:

As suggested, I have used an implementation involving fuzzywuzzy to make the matches, then afterward I inserted the new column into an existing DataFrame.

Edit: to add in scraping code provided by poster. Also, hardcoded in another team name to show that the proper link matches.

Edit2: changed default scorer parameter in extractOne to be partial ratio instead.

import pandas as pd
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
import requests
from bs4 import BeautifulSoup

def monthToNum(month):
    return {
            'January': 1,
            'February': 2,
            'March': 3,
            'April': 4,
            'May': 5,
            'June': 6,
            'July': 7,
            'August': 8,
            'September': 9, 
            'October': 10,
            'November': 11,
            'December': 12
    }[month]

dates_unique = ['Monday 6 December 2021']
href_ids=list()
for a in dates_unique:
    try:
      r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    except requests.exceptions.ConnectionError:
      r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    
    soup = BeautifulSoup(r.text, 'html.parser')
    f = soup.find_all('td', class_="table-main__tt")
    for h in f:
      href_ids.append(h.find('a')['href'])

infos = href_ids


df= pd.DataFrame({'HomeTeam':['Empoli','Getafe','Cagliari','Everton', 'Buxton'],'AwayTeam':['Udinese','Athletic Bilbao','Torino','Arsenal', 'Scarborough']})

dict_ids ={}


for info in infos:
    team_name_pieces = " ".join(info.split("/")[4].split("-"))
    
    dict_ids[info] = team_name_pieces

id_results_column = []
for row in df.iterrows():
    result = process.extractOne(row[1][0]   " "  row[1][1], dict_ids.keys(),scorer=fuzz.partial_ratio)
    id_results_column.append(result[0])


df['ID'] = id_results_column

print(df)

Output: enter image description here

  •  Tags:  
  • Related