I need to create a pandas dataframe based on 4 txt files with comments (to skip while reading) based on the following structure:
# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=Tondeuse
# Propriétés générales
hauteur=0.5
masse=20.0
prix=110.00
# Propriétés du moteur
impulsion specifique=80
and
# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=Civic VTEC
# Propriétés générales
hauteur=2.0
masse=3000.0
prix=2968.00
# Propriétés du moteur
impulsion specifique=205
and
# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=VelociRAPTOR
# Propriétés générales
hauteur=4.0
masse=2000.0
prix=6000.00
# Propriétés du moteur
impulsion specifique=250
and
# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=La Puissance
# Propriétés générales
hauteur=12.0
masse=15000.0
prix=39000.00
# Propriétés du moteur
impulsion specifique=295
That's the result I need to have:
nom hauteur masse prix impulsion specifique
0 Tondeuse 0.5 20.0 110.0 80
1 Civic VTEC 2.0 3000.0 2968.0 205
2 VelociRAPTOR 4.0 2000.0 6000.0 250
3 La Puissance 12.0 15000.0 39000.0 295
I don't know if it's possible, but that's what i was asked to do
CodePudding user response:
welcome to Stackoverflow! :)
If your txt files have their content like you just showed, you could read them in using pandas as a CSV file.
The pandas.read_csv function has multiple things that will help you here:
- It outputs a dataframe, which is the format you would like to end up with
- Has a
commentinput argument, with which you can define lines that are to be ignored - You can use the
=sign as a separator, which will make you able to split up your data in the wanted sections
Now, let's try to read one of your files using the read_csv function:
import pandas as pd
df = pd.read_csv(file, comment='#', sep='=', header=None)
df
nom Tondeuse
0 hauteur 0.5
1 masse 20.0
2 prix 110.0
3 impulsion specifique 80.0
We're not completely there yet. We want to remove that index column that gives no info, and we want to transpose the dataframe (rows <-> columns) to be able to concatenate all dataframes together. Let's do it!
import pandas as pd
df = pd.read_csv(file, comment='#', sep='=', header=None, index_col=0).T
df
0 nom hauteur masse prix impulsion specifique
1 Tondeuse 0.5 20.0 110.00 80
That's looking way better! Putting index_col=0 makes the lefternmost column be the index column, and the .T at the end transposes your dataframe. Now we just need to put this inside of a loop and make a complete script out of it!
import pandas as pd
import glob
import os
files = glob.glob(os.path.join(path, '*.csv'))
all_dfs = []
for file in files:
current_df = pd.read_csv(file, comment='#', sep='=', header=None, index_col=0).T
all_dfs.append(current_df)
total_df = pd.concat(all_dfs)
total_df
0 nom hauteur masse prix impulsion specifique
1 La Puissance 12.0 15000.0 39000.00 295
1 Civic VTEC 2.0 3000.0 2968.00 205
1 VelociRAPTOR 4.0 2000.0 6000.00 250
1 Tondeuse 0.5 20.0 110.00 80
Notice that you still have that lefternmost column with the index number, I did not clean it out because I wasn't sure of what you wanted there.
Also, importantly, you need to be aware that if there is a slight difference in the names of the columns in your files (e.g. impulsion specifique vs impulsion spécifique) this will bring errors. You will need to create error handling procedures for these. Or maybe enforcing a certain schema, but that is out of the scope of this question.
I hope this helps!
CodePudding user response:
Your data files look very close to configuration files. You can use configparser to generate a dictionary from each file:
from pathlib import Path
from configparser import ConfigParser
data = []
for file in Path("data").glob("*.txt"):
parser = ConfigParser()
# INI file requires a section header. Yours don't have one.
# So let's give it one called DEFAULT
parser.read_string("[DEFAULT]\n" file.read_text())
data.append(dict(parser.items("DEFAULT")))
df = pd.DataFrame(data)
