I'm retrieving data from a database, and so far I have a list of dictionaries. In each dict I'm interested in the keys title, authors, publishedAt and tickers. Some dicts have an empty tickers list, so I ignore those. Some dicts may have multiple tickers and multiple authors; in that case I'd like my DataFrame to have one row for each author and ticker combination, with the title and publishedAt repeated throughout those rows.
Example of dictionary:
dict = {'title':'Report',
'authors':[{'name': 'Mike'},{'name':'John'}],
'tickers':[{'code':'GOOGL'},{'code':'AAPL'}],
'publishedAt':'2022-03-31'
}
The desired outcome would be something like:
| Author | Date | Title | Ticker |
|---|---|---|---|
| Mike | 2022-03-31 | Report | GOOGL |
| Mike | 2022-03-31 | Report | AAPL |
| John | 2022-03-31 | Report | GOOGL |
| John | 2022-03-31 | Report | AAPL |
I was trying to do something like this, where publications is my list of dicts:
df = pd.DataFrame(columns=['Author','Date','Title','Ticker'])
for publication in publications:
    qty_tickers = len(publication['tickers'])
    qty_authors = len(publication['authors'])
    qty_total = qty_tickers * qty_authors
    if qty_total >= 1:
        start = len(df) + 1
        df.iloc[start:start + qty_total,1:2] = [publication['publishedAt'],publication['title']]
This last line already fails with ValueError: could not broadcast input array from shape (2,) into shape (0,1).
And I don't even know how to start with the authors and tickers.
CodePudding user response:
One way to get this result, inside the loop over publications, would be:
for author in publication['authors']:
    for ticker in publication['tickers']:
        df = pd.concat([
            df,
            pd.DataFrame({
                'Author': author['name'],
                'Date': publication['publishedAt'],
                'Title': publication['title'],
                'Ticker': ticker['code']
            }, index=[0])
        ])
This uses pd.concat() to append a newly created one-row DataFrame, built from a dict, to the original DataFrame df. Because every value in the dict is a scalar, pandas needs an explicit index to build the row, so I passed index=[0]. You can of course adjust that to your liking.
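One thing to watch with the snippet above: every one-row frame is created with index=[0], so every row of the concatenated result ends up with the label 0. Below is a minimal sketch of one way around that, assuming publications has the structure shown in the question. The sample data (including the 'Untagged' entry) and the frames list are made up for illustration. It collects the one-row frames in a list and concatenates them once with ignore_index=True.

```python
import pandas as pd

# Sample data shaped like the question's example (illustrative only).
publications = [
    {
        'title': 'Report',
        'authors': [{'name': 'Mike'}, {'name': 'John'}],
        'tickers': [{'code': 'GOOGL'}, {'code': 'AAPL'}],
        'publishedAt': '2022-03-31',
    },
    {
        # No tickers: the inner loop never runs, so this entry is skipped.
        'title': 'Untagged',
        'authors': [{'name': 'Ann'}],
        'tickers': [],
        'publishedAt': '2022-04-01',
    },
]

frames = []
for publication in publications:
    for author in publication['authors']:
        for ticker in publication['tickers']:
            frames.append(pd.DataFrame({
                'Author': author['name'],
                'Date': publication['publishedAt'],
                'Title': publication['title'],
                'Ticker': ticker['code'],
            }, index=[0]))

# ignore_index=True renumbers the rows 0..n-1 instead of repeating label 0.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Note that pd.concat raises a ValueError when given an empty list, so if it is possible that no publication has tickers, that case would need a guard.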
CodePudding user response:
Instead of populating the DataFrame dynamically, you should first perform the data wrangling and only then construct the DataFrame from the result. In this case, it makes sense to build a list of records (dictionaries, one per row of the DataFrame) based on publications:
import pandas as pd

publications = [
    {
        'title':'Report',
        'authors':[{'name': 'Mike'},{'name':'John'}],
        'tickers':[{'code':'GOOGL'},{'code':'AAPL'}],
        'publishedAt':'2022-03-31'
    },
    {
        'title':'Something',
        'authors':[{'name': 'Someone'},{'name':'Other'}],
        'tickers':[{'code':'ABC'},{'code':'DEF'}],
        'publishedAt':'2022-03-31'
    }
]

records = []
for publication in publications:
    for author in publication['authors']:
        for ticker in publication['tickers']:
            rec = {
                'Author': author['name'],
                'Date': publication['publishedAt'],
                'Title': publication['title'],
                'Ticker': ticker['code']
            }
            records.append(rec)

df = pd.DataFrame(records)
Output:
>>> df
    Author        Date      Title  Ticker
0     Mike  2022-03-31     Report   GOOGL
1     Mike  2022-03-31     Report    AAPL
2     John  2022-03-31     Report   GOOGL
3     John  2022-03-31     Report    AAPL
4  Someone  2022-03-31  Something     ABC
5  Someone  2022-03-31  Something     DEF
6    Other  2022-03-31  Something     ABC
7    Other  2022-03-31  Something     DEF
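The nested loops above amount to a Cartesian product of authors and tickers for each publication, so the same records can also be built with itertools.product. This is only a sketch of that variant, not a different algorithm. The sample data (including the 'Untagged' entry with no tickers) is made up to show that publications with an empty tickers list produce no rows, which is the behaviour the question asks for.

```python
from itertools import product

import pandas as pd

# Sample data shaped like the question's example (illustrative only).
publications = [
    {
        'title': 'Report',
        'authors': [{'name': 'Mike'}, {'name': 'John'}],
        'tickers': [{'code': 'GOOGL'}, {'code': 'AAPL'}],
        'publishedAt': '2022-03-31',
    },
    {
        # Empty tickers: product() yields nothing, so no rows are created.
        'title': 'Untagged',
        'authors': [{'name': 'Ann'}],
        'tickers': [],
        'publishedAt': '2022-04-01',
    },
]

# One record per (author, ticker) pair of each publication.
records = [
    {
        'Author': author['name'],
        'Date': pub['publishedAt'],
        'Title': pub['title'],
        'Ticker': ticker['code'],
    }
    for pub in publications
    for author, ticker in product(pub['authors'], pub['tickers'])
]

# Passing columns keeps the headers even if records turns out to be empty.
df = pd.DataFrame(records, columns=['Author', 'Date', 'Title', 'Ticker'])
print(df)
```

Whether this reads better than the explicit loops is a matter of taste; the loops are arguably easier to extend if more per-row logic is needed later.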
