Combine multiple and repeating values in Pandas DataFrame

I'm retrieving data from a DB, and so far I have a list of dictionaries. In each dict I'm interested in the keys title, authors, publishedAt and tickers. Some dicts have an empty tickers list, so I ignore those. Some dicts may have multiple tickers and multiple authors; in that case I'd like my DataFrame to have one row for each combination of author and ticker, with title and publishedAt repeated throughout those rows.

Example of dictionary:

publication = {'title': 'Report',
               'authors': [{'name': 'Mike'}, {'name': 'John'}],
               'tickers': [{'code': 'GOOGL'}, {'code': 'AAPL'}],
               'publishedAt': '2022-03-31'
               }

The desired outcome would be something like:

Author	Date	Title	Ticker
Mike	2022-03-31	Report	GOOGL
Mike	2022-03-31	Report	AAPL
John	2022-03-31	Report	GOOGL
John	2022-03-31	Report	AAPL

I was trying to do something like this, where publications is my list of dicts:

df = pd.DataFrame(columns=['Author','Date','Title','Ticker'])

for publication in publications:
    qty_tickers = len(publication['tickers'])
    qty_authors = len(publication['authors'])
    qty_total = qty_tickers * qty_authors
    if qty_total >= 1:
        start = len(df) + 1
        df.iloc[start:start + qty_total, 1:2] = [publication['publishedAt'], publication['title']]

This last bit already fails with ValueError: could not broadcast input array from shape (2,) into shape (0,1). And I don't even know how to start with the authors and tickers.

CodePudding user response：

One way to get this result is:

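for publication in publications:
    # One row per (author, ticker) pair; an empty list yields no rows.
    for author in publication['authors']:
        for ticker in publication['tickers']:
            # All values are scalars, so the one-row frame needs an explicit index.
            row = pd.DataFrame({
                'Author': author['name'],
                'Date': publication['publishedAt'],
                'Title': publication['title'],
                'Ticker': ticker['code']
            }, index=[0])
            # ignore_index=True renumbers the rows instead of repeating label 0.
            df = pd.concat([df, row], ignore_index=True)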

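pd.concat() (see the pandas documentation) concatenates the existing DataFrame df with a new one-row DataFrame built from a dict. Because that dict contains only scalar values, pd.DataFrame needs an explicit index, which is why index=[0] is passed. Unless ignore_index=True is given to pd.concat(), every appended row keeps the index label 0. You can of course adjust that to your liking.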
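
As a small illustration of the index behaviour, the sketch below concatenates two one-row frames with and without ignore_index=True. The sample values are made up for the demonstration.

```python
import pandas as pd

# Two one-row frames built from all-scalar dicts; index=[0] is required here.
first = pd.DataFrame({'Author': 'Mike', 'Ticker': 'GOOGL'}, index=[0])
second = pd.DataFrame({'Author': 'Mike', 'Ticker': 'AAPL'}, index=[0])

# Default: the original labels are kept, so label 0 appears twice.
kept = pd.concat([first, second])

# ignore_index=True: the rows are renumbered 0, 1, ...
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))        # [0, 0]
print(list(renumbered.index))  # [0, 1]
```

Duplicate labels are legal in pandas, but they make label-based lookups such as df.loc[0] return several rows, so renumbering is usually the less surprising choice here.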

CodePudding user response：

Instead of growing the DataFrame row by row (every pd.concat() call copies the rows collected so far), you should first perform the data wrangling and only then construct the DataFrame. In this case, it makes sense to build a list of records (dictionaries, one per DataFrame row) from publications:

import pandas as pd 

publications = [
    {
    'title':'Report',
    'authors':[{'name': 'Mike'},{'name':'John'}],
    'tickers':[{'code':'GOOGL'},{'code':'AAPL'}],
    'publishedAt':'2022-03-31'
    }, 
    {
    'title':'Something',
    'authors':[{'name': 'Someone'},{'name':'Other'}],
    'tickers':[{'code':'ABC'},{'code':'DEF'}],
    'publishedAt':'2022-03-31'
    }
]
    
records = []

for publication in publications:
    for author in publication['authors']:
        for ticker in publication['tickers']:
            rec = {
                'Author': author['name'],
                'Date': publication['publishedAt'],
                'Title': publication['title'],
                'Ticker': ticker['code']
            }
            records.append(rec)
        
df = pd.DataFrame(records)

Output:

>>> df

    Author        Date      Title Ticker
0     Mike  2022-03-31     Report  GOOGL
1     Mike  2022-03-31     Report   AAPL
2     John  2022-03-31     Report  GOOGL
3     John  2022-03-31     Report   AAPL
4  Someone  2022-03-31  Something    ABC
5  Someone  2022-03-31  Something    DEF
6    Other  2022-03-31  Something    ABC
7    Other  2022-03-31  Something    DEF
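
The nested loops amount to a Cartesian product of authors and tickers for each publication. A more compact variant can be sketched with itertools.product from the standard library. The second sample publication, with an empty tickers list, is an assumption added here to show that such entries produce no rows, as the question requires.

```python
from itertools import product

import pandas as pd

publications = [
    {
        'title': 'Report',
        'authors': [{'name': 'Mike'}, {'name': 'John'}],
        'tickers': [{'code': 'GOOGL'}, {'code': 'AAPL'}],
        'publishedAt': '2022-03-31',
    },
    {
        # Hypothetical entry with no tickers; it should contribute no rows.
        'title': 'No tickers',
        'authors': [{'name': 'Someone'}],
        'tickers': [],
        'publishedAt': '2022-03-31',
    },
]

# product() yields every (author, ticker) pair; an empty list yields none.
records = [
    {
        'Author': author['name'],
        'Date': publication['publishedAt'],
        'Title': publication['title'],
        'Ticker': ticker['code'],
    }
    for publication in publications
    for author, ticker in product(publication['authors'], publication['tickers'])
]

# Naming the columns keeps the layout stable even if records is empty.
df = pd.DataFrame(records, columns=['Author', 'Date', 'Title', 'Ticker'])
print(df)
```

No explicit check for empty tickers is needed: the product over an empty list is empty, so those publications are skipped automatically.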