I have some data loaded from a .dat file that looks like this in a .csv file.
A B C
X Y Z X Y Z X Y Z
1 4 7 10 13 16 19 22 25
2 5 8 11 14 17 20 23 26
3 6 9 12 15 18 21 24 27
I am using the following code to import this data
df = read_csv('data.csv' , delimeter = '\t' , header = [0,1])
the output i get from this is:
A un1 un2 B un3 un4 C un5 un6
X Y Z X Y Z X Y Z
1 4 7 10 13 16 19 22 25
2 5 8 11 14 17 20 23 26
3 6 9 12 15 18 21 24 27
note: un1 is Unnamed 1_Level_0 and so on
I would like to be able to multiindex so I could read print(df['A']['Z']) so that i can get the output:
7
8
9
I imagine that there is a way that I could fill the unnamed columns with the previous column name if they contain the str 'Unnamed' but this seems like a bad way to do it. Is there any way in pandas to use multiindexing to help solve this problem?
CodePudding user response:
You can do that by first replacing the Unnamed: * columns with NA, then call ffill to propagate the previous index value:
# Some sample data
from io import StringIO
string = """
A B C
X Y Z X Y Z X Y Z
1 4 7 10 13 16 19 22 25
2 5 8 11 14 17 20 23 26
3 6 9 12 15 18 21 24 27
"""
df = pd.read_csv(StringIO(string), header=[0,1], sep='\s{3}', engine='python')
# The solution
level0 = df.columns.get_level_values(0) # Get the A, B, C level, which include the Unnamed
level1 = df.columns.get_level_values(1) # Get the X, Y, Z level
level0 = (
level0.where(~level0.str.startswith('Unnamed:')) # replace Unnamed with NA
.to_series().ffill() # convert to Series to get
# the ffill method
)
# Putting the new columns together
df.columns = [level0, level1]
CodePudding user response:
You can create the MultiIndex separately:
from io import StringIO
data = """A B C
X Y Z X Y Z X Y Z
1 4 7 10 13 16 19 22 25
2 5 8 11 14 17 20 23 26
3 6 9 12 15 18 21 24 27"""
n_levels = 2
df = pd.read_csv(StringIO(data), delimiter = '\t', skiprows=n_levels, header=None)
columns = pd.read_csv(StringIO(data), delimiter = '\t', nrows=n_levels, header=None)\
.T.fillna(method='ffill')
df.columns = pd.MultiIndex.from_frame(columns)
df
# A B C
#1 X Y Z X Y Z X Y Z
#0 1 4 7 10 13 16 19 22 25
#1 2 5 8 11 14 17 20 23 26
#2 3 6 9 12 15 18 21 24 27
