I want to read into three omics datasets - methylation, cnv, and mRNA expression.
First, I removed duplicated columns and sort by index for all 3 dataframes mrna, meth, and cna separately. These dataframes are then concatenated to form one dataframe common.
Eventually, I want the common dataframe to only contain rows that are the same in all three dataframes mrna, meth, and cna.
import re
dfs = [mrna, meth, cna]
common = pd.concat(dfs, join='inner')
Now, I only want to keep the rows in common if the row value is the same across all the three original dataframes mrna, cna, and meth. Using the answer provided here (pandas - filter dataframe by another dataframe by row elements), this is my attempt:
# Keep only rows that are common in all three dataframes
mrna_keys = list(mrna.columns.values)
cna_keys = list(cna.columns.values)
meth_keys = list(meth.columns.values)
i_common_mrna = common.set_index(mrna_keys).index
i_common_cna = common.set_index(cna_keys).index
i_common_meth = common.set_index(meth_keys).index
i_mrna_common = mrna.set_index(mrna_keys).index
i_cna_common = cna.set_index(cna_keys).index
i_meth_common = meth.set_index(meth_keys).index
common = common[~i_common_mrna.isin(i_mrna_common) & ~i_common_cna.isin(i_cna_common) & ~i_common_meth.isin(i_meth_common)]
common
Traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-215-c824a7d76e25> in <module>()
15 meth_keys = list(meth.columns.values)
16
---> 17 i_common_mrna = common.set_index(mrna_keys).index
18 i_common_cna = common.set_index(cna_keys).index
19 i_common_meth = common.set_index(meth_keys).index
1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
5449
5450 if missing:
-> 5451 raise KeyError(f"None of {missing} are in the columns")
5452
5453 if inplace:
Second attempt: Based on @lmielke's suggestion, I've tried the following alternative.
import re
dfs = [mrna, meth, cna]
# Keep only rows that are common in all three dataframes
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()]) for df in dfs]), columns=dfs[0].columns)
common
However, it returns 0 rows.
Example of the first few rows/columns of mrna dataframe (as a dictionary):
{'Hugo_Symbol': {0: 'AACS',
1: 'FSTL1',
2: 'ELMO2',
3: 'CREB3L1',
4: 'RPS11',
5: 'PNMA1',
6: 'MMP2',
7: 'SAMD4A',
8: 'SMARCD3',
9: 'A4GNT'},
'TCGA-02-0001-01': {0: -0.5909,
1: -0.9099,
2: -2.3351,
3: 0.2216,
4: 0.6798,
5: -2.48,
6: 0.7912,
7: -1.4578,
8: -3.8009,
9: 3.4868},
'TCGA-02-0003-01': {0: -0.5154,
1: 0.0896,
2: -1.17,
3: 0.1255,
4: 0.2374,
5: -3.2629,
6: 1.2846,
7: -1.474,
8: -2.9891,
9: -0.1511},
'TCGA-02-0004-01': {0: 1.0314,
1: 1.923,
2: 0.0792,
3: 2.5117,
4: -0.741,
5: -0.4146,
6: 2.5185,
7: 0.0058,
8: 0.032,
9: -1.7115},
'TCGA-02-0007-01': {0: 0.6932,
1: -5.6511,
2: -2.8365,
3: 2.0026,
4: -0.6326,
5: -1.3741,
6: -3.437,
7: -1.047,
8: -4.185,
9: 2.1816}
Example of the first few rows/columns of cna dataframe (as a dictionary):
{'Hugo_Symbol': {0: 'ACAP3',
1: 'ACTRT2',
2: 'AGRN',
3: 'ANKRD65',
4: 'ATAD3A',
5: 'ATAD3B',
6: 'ATAD3C',
7: 'AURKAIP1',
8: 'B3GALT6',
9: 'C1orf159'},
'TCGA-02-0001-01': {0: 0.242,
1: 0.242,
2: 0.242,
3: 0.242,
4: 0.242,
5: 0.242,
6: 0.242,
7: 0.242,
8: 0.242,
9: 0.242},
'TCGA-02-0003-01': {0: 0.007,
1: 0.007,
2: 0.007,
3: 0.007,
4: 0.007,
5: 0.007,
6: 0.007,
7: 0.007,
8: 0.007,
9: 0.007},
'TCGA-02-0006-01': {0: 0.005,
1: 0.005,
2: 0.005,
3: 0.005,
4: 0.005,
5: 0.005,
6: 0.005,
7: 0.005,
8: 0.005,
9: 0.005},
'TCGA-02-0007-01': {0: -0.072,
1: -0.072,
2: -0.072,
3: -0.072,
4: -0.072,
5: -0.072,
6: -0.072,
7: -0.072,
8: -0.072,
9: -0.072}}
Example of the first few rows/columns of meth dataframe (as a dictionary):
{'Hugo_Symbol': {0: 'MEOX2',
1: 'COX8C',
2: 'IMPA2',
3: 'TTC8',
4: 'TMEM186',
5: 'RETSAT',
6: 'TULP1',
7: 'TAF15',
8: 'CCDC88B',
9: 'EPB41L3'},
'TCGA-02-0001-01': {0: 0.0414512366035448,
1: 0.981055336789946,
2: 0.0185800932177329,
3: 0.0378532148581555,
4: 0.85883476246704,
5: 0.0289562030712872,
6: 0.895705096206116,
7: 0.0241634415603257,
8: 0.645029762467676,
9: 0.0255462091070292},
'TCGA-02-0003-01': {0: 0.103889979051244,
1: 0.989162923386702,
2: 0.127842916524696,
3: 0.0614877612251458,
4: 0.718748599381741,
5: 0.0370521775128637,
6: 0.875983369162448,
7: 0.0202500492785966,
8: 0.338662158729623,
9: 0.87970085305538},
'TCGA-02-0006-01': {0: 0.0607886115184704,
1: 0.979538805487233,
2: 0.0196988790233788,
3: 0.0617298630873257,
4: 0.887325606857566,
5: 0.0287770828033724,
6: 0.936288398598192,
7: 0.0212839350099898,
8: 0.540677138952204,
9: 0.167646943077876},
'TCGA-02-0007-01': {0: 0.0255189003468858,
1: 0.985665537780805,
2: 0.0184844499535323,
3: 0.08404493340112,
4: 0.862945344666792,
5: 0.0299693670464541,
6: 0.930821357939767,
7: 0.0223801816866076,
8: 0.319356999493038,
9: 0.0193433290278056}}
CodePudding user response:
Given that your dfs contain hashable values, you might try something like this:
dfs = [pd.DataFrame(np.random.randint(low=100, high=103, size=(5, 2)), columns=['A', 'B']) for _ in range(3)]
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()])\
for df in dfs]), columns=dfs[0].columns)
print(f"common: \n{common}")
CodePudding user response:
What exactly would the ideal output look like, is this what you're trying to do? I feel like there's some confusion between columns and rows...
dfs = [mrna, meth]
def do_stuff(df):
df = pd.DataFrame(df).T
df.columns = df.iloc[0]
df = df[1:]
return df
mrna, meth = map(do_stuff, dfs)
print(pd.concat([mrna, meth], axis=1).dropna())
Output:
Hugo_Symbol AACS FSTL1 ELMO2 CREB3L1 RPS11 PNMA1 MMP2 \
TCGA-02-0001-01 -0.5909 -0.9099 -2.3351 0.2216 0.6798 -2.48 0.7912
TCGA-02-0003-01 -0.5154 0.0896 -1.17 0.1255 0.2374 -3.2629 1.2846
TCGA-02-0007-01 0.6932 -5.6511 -2.8365 2.0026 -0.6326 -1.3741 -3.437
Hugo_Symbol SAMD4A SMARCD3 A4GNT MEOX2 COX8C IMPA2 \
TCGA-02-0001-01 -1.4578 -3.8009 3.4868 0.041451 0.981055 0.01858
TCGA-02-0003-01 -1.474 -2.9891 -0.1511 0.10389 0.989163 0.127843
TCGA-02-0007-01 -1.047 -4.185 2.1816 0.025519 0.985666 0.018484
Hugo_Symbol TTC8 TMEM186 RETSAT TULP1 TAF15 CCDC88B \
TCGA-02-0001-01 0.037853 0.858835 0.028956 0.895705 0.024163 0.64503
TCGA-02-0003-01 0.061488 0.718749 0.037052 0.875983 0.02025 0.338662
TCGA-02-0007-01 0.084045 0.862945 0.029969 0.930821 0.02238 0.319357
Hugo_Symbol EPB41L3
TCGA-02-0001-01 0.025546
TCGA-02-0003-01 0.879701
TCGA-02-0007-01 0.019343
