So, I have a directory with .csv files. For example:
a.csv
id,name
1,john
2,mary
3,alex
b.csv
id,birth
1,01.01.2001
2,05.06.1990
c.csv
id,death
2,01.02.2020
1,-
The result should be one dict where the key is id (int) and value is a dict of all the different values across the the files(dict of dicts). Something like this:
{
1: {"id": 1, "name": "john", "birth": "01.01.2001", "death": -},
2: {"id": 2, "name": "mary", "birth": "05.06.1990",
"death": "01.02.2020"},
3: {"id": 3, "name": "alex", "birth": None, "death": None},
}
So far I've tried to merge all files into one dataframe:
from pathlib import Path
import os
import pandas as pd
files = Path(r'path').rglob('*.csv')
# read in all the csv files
all_csvs = [pd.read_csv(file) for file in files]
# lump into one table
all_csvs = pd.concat(all_csvs, axis=1)
But as a result I get a dataframe where 'id' is repeated in three columns.
Any help will be appreciated!
CodePudding user response:
You want merge and not concat. Since you need to merge multiple DataFrames, you can do:
import os
from functools import reduce
all_csvs = [pd.read_csv(file) for file in os.listdir() if file.endswith(".csv")]
df = reduce(lambda left, right: pd.merge(left, right, how="outer", on="id"), all_csvs)
>>> df
id name birth death
0 1 john 01.01.2001 NaN
1 2 mary 05.06.1990 01.02.2020
2 3 alex NaN NaN
#for dictionary output replacing nan with None
my_dict = df.where(df.notnull(), None).set_index("id", drop=False).to_dict(orient="index")
>>> my_dict
{1: {'id': 1, 'name': 'john', 'birth': '01.01.2001', 'death': None},
2: {'id': 2, 'name': 'mary', 'birth': '05.06.1990', 'death': '01.02.2020'},
3: {'id': 3, 'name': 'alex', 'birth': None, 'death': None}}
CodePudding user response:
You can even do it without pandas if you're so inclined. First, create a defaultdict that'll hold all your csv data. Let the default element of this dict be a dictionary representing a "default" person, i.e. with all keys having a value of None.
import collections
def default_person():
return {'id': None, 'name': None, 'birth': None, 'death': None}
all_csvs = collections.defaultdict(default_person)
The keys in this dict will be the id field, and values will be the dict containing all the information you want.
Next, read each file using csv.DictReader. DictReader reads each row of your csv file as a dictionary, with keys coming from the file's header. Then for each row in each file, update the values of the dictionary at the correct id in the defaultdict we just created:
import csv
files = Path(r'path').rglob('*.csv')
for file in files:
with open(file, "r") as f_in:
reader = csv.DictReader(f_in)
for row_dict in reader:
p_id = row_dict['id'] = int(row_dict['id']) # Convert `id` to integer
all_csvs[p_id].update(row_dict)
Now, all_csvs looks like so:
defaultdict(<function __main__.default_person()>,
{
1: {'id': 1, 'name': 'john', 'birth': '01.01.2001', 'death': '-'},
2: {'id': 2, 'name': 'mary', 'birth': '05.06.1990', 'death': '01.02.2020'},
3: {'id': 3, 'name': 'alex', 'birth': None, 'death': None}
})
