Remove duplicates in one list and average the corresponding entries of another list

I have a fairly large dataset, so I need something reasonably efficient. My data contains the release years of albums by various artists in one list and the average song length of each album in another list.

As an example, here is some made-up data. The song length is given in minutes.

release_year=[2017,2017,2019,2020,2020,2021]
avg_songlength=[3,5,3,4,2,3]

I want to remove the duplicates from the release_year list and, for every duplicated year, average the song lengths again. So the result I want is:

years_without_duplicates=[2017,2019,2020,2021]
avg_length_of_year=[(3+5)/2,3,(4+2)/2,3]

I found set() to be efficient for removing duplicates, but I don't know how to combine the entries in the other list. What's an easy way to do this?

CodePudding user response：

One option is to use itertools.groupby:

release_year=[2017,2017,2019,2020,2020,2021] 
avg_songlength=[3,5,3,4,2,3]

from itertools import groupby
from statistics import mean

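# groupby only merges adjacent items, so sort the (year, length) pairs first
pairs = sorted(zip(release_year, avg_songlength))
years_without_duplicates, avg_length_of_year = zip(*(
    (year, mean(length for _, length in group))
    for year, group in groupby(pairs, key=lambda pair: pair[0])
))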

years_without_duplicates, avg_length_of_year
# ((2017, 2019, 2020, 2021), (4, 3, 3, 3))
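
To show why the sort matters, here is a small sketch with the same made-up data in shuffled order. The names pairs and year_means are just mine for illustration. Without sorting, groupby starts a new group every time the year changes, so a year can show up more than once.

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean

# Same made-up data as in the question, deliberately shuffled
release_year = [2020, 2017, 2019, 2017, 2020, 2021]
avg_songlength = [4, 3, 3, 5, 2, 3]

# Unsorted input: groupby only merges neighbours, so years repeat
unsorted_keys = [year for year, _ in groupby(release_year)]

# Sorting by year first puts equal years next to each other
pairs = sorted(zip(release_year, avg_songlength), key=itemgetter(0))
year_means = {
    year: mean(length for _, length in group)
    for year, group in groupby(pairs, key=itemgetter(0))
}

print(unsorted_keys)  # every year is its own group here
print(year_means)
```

The sort costs O(n log n), so on a really large dataset the dictionary-based approaches, which only need one pass, may be the better fit.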

Or use collections.defaultdict:

from collections import defaultdict

out = defaultdict(lambda : [0, 0]) # sum / count

for year, sl in zip(release_year, avg_songlength):
    out[year][0] += sl  # add length
    out[year][1] += 1   # increment counter of occurrences
    
d = {k: v[0]/v[1] for k,v in out.items()} # avg = sum / count
years_without_duplicates, avg_length_of_year = zip(*d.items())
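
A variation on the same idea, in case it reads more easily: collect the lengths per year in a defaultdict(list) and let statistics.fmean (Python 3.8+) do the averaging. This is only a sketch; lengths_by_year and year_means are names I made up, and keeping every value in memory costs more than the running sum/count above.

```python
from collections import defaultdict
from statistics import fmean

release_year = [2017, 2017, 2019, 2020, 2020, 2021]
avg_songlength = [3, 5, 3, 4, 2, 3]

# Collect all lengths that belong to the same year
lengths_by_year = defaultdict(list)
for year, length in zip(release_year, avg_songlength):
    lengths_by_year[year].append(length)

# fmean always returns a float
year_means = {year: fmean(lengths) for year, lengths in lengths_by_year.items()}
print(year_means)
```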

CodePudding user response：

Here's a simple way to go about this in plain Python. The idea is to store the years we've seen in a dictionary and keep track of the running total of the lengths as well as the number of entries that contributed to that total. At the end we go over the dictionary and convert each total/count pair to an average. Using a dictionary also keeps the data a little more structured than two separate lists.

release_year=[2017,2017,2019,2020,2020,2021]
avg_songlength=[3,5,3,4,2,3]

year_averages = dict()
for year, length in zip(release_year, avg_songlength):
    if year in year_averages:
        year_averages[year][0] += length  # running total for this year
        year_averages[year][1] += 1       # number of entries for this year
    else:
        year_averages[year] = [length, 1]

year_averages = {year: lst[0]/lst[1] for year, lst in year_averages.items()}
print(year_averages)

Outputs:

{2017: 4.0, 2019: 3.0, 2020: 3.0, 2021: 3.0}
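
If you need two lists back, as in the question, the same single pass can be wrapped in a small helper. This is a rough sketch and average_by_year is a hypothetical name, not something from a library. It sorts the unique years at the end, which is cheap as long as there are far fewer distinct years than rows.

```python
def average_by_year(years, lengths):
    """Return (unique sorted years, mean length per year) in one pass."""
    totals = {}  # year -> (sum of lengths, number of entries)
    for year, length in zip(years, lengths):
        total, count = totals.get(year, (0, 0))
        totals[year] = (total + length, count + 1)

    unique_years = sorted(totals)
    averages = [totals[y][0] / totals[y][1] for y in unique_years]
    return unique_years, averages


years_without_duplicates, avg_length_of_year = average_by_year(
    [2017, 2017, 2019, 2020, 2020, 2021],
    [3, 5, 3, 4, 2, 3],
)
print(years_without_duplicates)
print(avg_length_of_year)
```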

CodePudding user response：

Convert the data to a pandas DataFrame, group by release_year, and aggregate with the mean:

import pandas as pd

df = pd.DataFrame({"release_year":[2017,2017,2019,2020,2020,2021],"avg_song_length":[3,5,3,4,2,3]})

print(df)

# Passing the string "mean" uses pandas' built-in group mean
print(df.groupby("release_year",as_index=False).agg(avg_length_of_year=("avg_song_length","mean")))

CodePudding user response：

This is a simple approach using one dictionary to store the sum of the values for each year, and another to count how many values have been added.

avg_dict = {}
count_dict = {}

for i in range(len(release_year)):
    key = str(release_year[i])  # years are stored as string keys
    if key in avg_dict:
        avg_dict[key] = avg_dict[key] + avg_songlength[i]
        count_dict[key] = count_dict[key] + 1
    else:
        avg_dict[key] = avg_songlength[i]
        count_dict[key] = 1

for key in avg_dict:
    avg_dict[key] = avg_dict[key] / count_dict[key]

print(avg_dict) # {'2017': 4.0, '2019': 3.0, '2020': 3.0, '2021': 3.0}