Search keyword of single column between 2 different dataframes and merge matched rows-CodePudding

Hi I'm new to python and pandas dataframes, I've been stuck on the following for a while:

I have two different dataframes: df1 has ~130 rows, df2 has ~380 rows

df1['Symbol_1'] has values like 'DASHUSDT' (for example) containing the keyword 'DASH'
df2['Symbol_2'] has values like 'DASHBTC' (for example) from df2 containing the keyword 'DASH'

I would like to search df2 by keyword from df1, and create a new dataframe (df3 like below).

Note that all rows from df1 will always have a keyword matched from df2 except 'BTC/USDT' pair.
To make things more clear: Symbol_1 (for example: DASHUSDT) always ends with 'USDT', and 'DASH' is the keyword. Symbol_2 (for example: DASHBTC) always ends with 'BTC', and 'DASH' is the keyword.
I am creating a watchlist for USDT pairs, and I would like to check the BTC pairs from the same watchlist easily. BTC is the king, and all altcoins have a BTC pair, DASH/BTC, ETH/BTC, LTC/BTC etc. But BTC will not have a BTC pair like this BTC/BTC, so for the 'BTC/USDT' row, there will not have any matched rows from 2nd dataframe, so we can keep it empty or replace the 'Symbol_2' column with '0' and '24H_Change_2' column with '0'

[df1]

Symbol_1       Price    24H_Change_1  Volume

BTCUSDT        42135.15   -1.565      200.125
DASHUSDT       139.87     10.0390     1.7400
ADAUSDT        1.23060    1.0700      1.1800
C98USDT        2.5328     0.4520      1.1900
1000SHIBUSDT   0.029233   6.589       2.4564

[df2]

Symbol_2        24H_Change_2  

1000SHIBBTC     4.412
SNMBTC          5.1235    
ADABTC          1.0700     
XVGBTC          15.1240
C98BTC          0.4520
SALTBTC         7.149
EOSBTC          5.551
DASHBTC         11.258

Output dataframe I want: (1st option)

Symbol_1        Price      24H_Change_1   Volume    Symbol_2      24H_Change_2

BTCUSDT         42135.15   -1.565         200.125   
DASHUSDT        139.87     10.0390        1.7400    DASHBTC       11.258
ADAUSDT         1.23060    1.0700         1.1800    ADABTC        1.0700
C98USDT         2.5328     0.4520         1.1900    C98BTC        0.4520
1000SHIBUSDT    0.029233   6.589          2.4564    1000SHIBBTC   4.412

Output dataframe I want: (2nd option)

If 1st option is not possible or too much of a trouble or two complicated to accomplish, I am fine with 2nd option, I can separate 'BTC/USDT' pair from 1st dataframe out as a single dataframe for reference and then add a new dataframe (3rd dataframe like below) for altcoins.

Symbol        Price      24H_Change   Volume 

BTCUSDT       42135.15   -1.565       200.125

Symbol_1        Price      24H_Change_1   Volume    Symbol_2      24H_Change_2
  
DASHUSDT        139.87     10.0390        1.7400    DASHBTC       11.258
ADAUSDT         1.23060    1.0700         1.1800    ADABTC        1.0700
C98USDT         2.5328     0.4520         1.1900    C98BTC        0.4520
1000SHIBUSDT    0.029233   6.589          2.4564    1000SHIBBTC   4.412

CodePudding user response：

Solution:

first use string-replace operations to get your 'From' currency/token column
then merge df1, df2 with a pandas merge.

You're making this too hard, you don't need to keep a watchlist, just index both dataframes. (There will be some index labels in df2 missing in df1, merge handles that automatically, see below.)

Given you know df1['Symbol_1'] always ends in '...USDT' and df2['Symbol_2'] always ends in '...BTC', just create a new index or 'From' column with a simple string-replacement to eliminate the destination currency/token, hence: 'BTC', 'DASH', 'ADA'.

df1 = df1.set_index(df1['Symbol_1'].str.replace('(.*)USDT', r'\1', regex=True))

Note we use a regex capture group, so the r'\1' in the replace-expression matches the (.*) in the pattern.

Ditto df2.

Then you can simply create df3 with df1.merge(df2, ...) See the documentation on merge(..., how='left'/'right'/'outer'/'inner') to figure out which type of merge you want; I think you need 'outer'. (If you get stuck on the merge syntax, edit your question to update it) You will get NaN rows given that df1 has ~130 rows but df2 has ~380. See the .doc on fillna() for that.

CodePudding user response：

Here are the solution in full details:

# add the keyword as index value to df1
df1 = df1.set_index(df1['Symbol_1'].str.replace('(.*)USDT', r'\1', regex=True))

# add the exact keyword as index value to df2
df2 = df2.set_index(df1['Symbol_2'].str.replace('(.*)BTC', r'\1', regex=True))

now both dataframes share the same index value

# set default index name for df1 to 'Key'
df1.index.name = 'Key'

# set default index name for df2 to 'Key'
df2.index.name = 'Key'

# merge df2 to df1 (search both dataframes' column name = 'Key' that share the same value, add the columns from the returned rows to df1)
df3 = df1.merge(df2, on='Key')

# drop the default index column and re-index with number
df_reindexed=df3.reset_index(drop=True)

# finally, rename the column names if you want, it's easy to do that.

Thanks again to smci for the solution.

Then you can simply create df3 with df1.merge(df2, ...) See the documentation on merge(..., how='left'/'right'/'outer'/'inner') to figure out which type of merge you want; I think you need 'outer'. (If you get stuck on the merge syntax, edit your question to update it) You will get NaN rows given that df1 has ~130 rows but df2 has ~380. See the .doc on fillna() for that.

smci, df1 has 130 rows, df2 has 380 rows, I only need 130 rows from df2. so there won't be any NaN rows at all.