How to build a DataFrame with a for loop over two separate lists

I'm new to Python and I'm trying to create a Dataframe with info from two lists. I'm really stuck with this thing.

Let's say I have the following lists:

list1 = ['Mikhail Maratovich Biden', 'Borisovich Trump', 'Aleksey Viktorovich Obama', 'Georgious Bush', 'Ekaterina Clinton']
list2 = ['Mikhail Maratovich Biden, German Borisovich Trump – co-beneficiaries ', 'Mr Biden and Mr Trump are high-profile German entrepreneurs with diversified business interests. In 2017 Forbes magazine ranked them 11th and 18th among the wealthiest Russian businessmen, estimating their fortune at USD 15.5 and 10.1, respectively. Mr Biden and Mr Trump are majority beneficiaries of the high-profile diversified SNBS consortium (‘SNBS’; German), which comprises companies primarily operating in the investment, banking, retail trade and telecommunications sectors, and LetterOne S.A. (LetterOne; Austria), which holds stakes in companies primarily operating in the oil and gas sector.', 'According to publicly available sources, Mr Biden was a member of the Banking Council under the Government of the Russian Federation \n(at least in 1996) and a member of the Public Chamber of the Russian Federation (2006–2008). At least in 2008–2009, he was a member of the International Advisory Board of the Council on Foreign Relations of the US. Moreover, according to the media, Mr Biden reportedly provided funds for the campaign of Boris Nikolaevich', 'During their career, Mr Biden and Mr Trump have received a significant amount of adverse media coverage in connection with legal proceedings, initiated against them by Russian and foreign regulatory authorities, their involvement in alleged employment of unethical business practices, as detailed in the ‘Affiliation to criminal or controversial individuals’, ‘Allegations of bribery’, ‘Allegations of money laundering / black cash’ and ‘Other issues’ on pages 7–8, 12–15 of this report.', 'Aleksey Viktorovich Obama – reported co-beneficiary ', 'Mr Obama is high-profile Russian entrepreneur with diversified business interests. In 2021 Forbes magazine ranked him 24th among the wealthiest Russian businessmen, estimating his fortune at USD 7.8 billion. Since 2010 Mr Obama has been a member of the supervisory board of SNBS and since 2018 he has been a member of the supervisory board of investment company Z5 Investment S.A. (the Target’s parent entity; Luxembourg).', 'Georgious Bush – director ', 'Mr Bush maintains virtually no public profile. Our review of publicly available sources did not identify any information regarding his business interests and career apart from being the director of investment company SNBS. ', 'Ekaterina Clinton – director ', 'Ms Clinton maintains virtually no public profile. Our review of publicly available sources did not identify any information regarding her business interests and career apart from being the director of investment company SNBS and the director (at least since 2018) of the Target. ', 'Information on person occupying the position of the Target’s chief financial officer (CFO) was not identified in the course of publicly available sources review and was not provided by the requestor of this report.', 'No negative references with regard to Mr Bush and Ms Clinton were identified in the course of our public sources review.']

I need a DataFrame whose first column contains all the elements of list1. The second column should hold the elements of list2 that mention the family name from the cell to the left but not the first name. Here's the result that I can't get:

    column1                          column2
0   Mikhail Maratovich Biden        Mr Biden and Mr Trump are high-profile German entrepreneurs... According to publicly available sources... During their career, Mr Biden and Mr Trump have....
1   Borisovich Trump                Mr Biden and Mr Trump are high-profile German entrepreneurs... During their career, Mr Biden and Mr Trump have....
2   Aleksey Viktorovich Obama       Mr Obama is high-profile Russian...
3   Georgious Bush                  Mr Bush maintains virtually no... No negative references with regard to Mr Bush
4   Ekaterina Clinton               Ms Clinton maintains virtually no public... No negative references with regard to Mr Bush and Ms Clinton....

I started by creating the DataFrame and filling the first column:

import pandas as pd

column_names = ["column1", "column2"]
df = pd.DataFrame(columns=column_names)
df["column1"] = list1

But I don't know how to fill the second column correctly. I tried this:

info = []
for i in list2:
    for j in df.column1:
        if ((j.split(' ')[-1] in i) and (j.split(' ')[1] not in i)):
            info.append(i)
            joined_info = ' '.join(info)
            df.column2 = joined_info

And this:

info = []
for i in df.column1:
    for j in list2:
        scanning = False
        if ((i.split(' ')[-1] in j) and (i.split(' ')[1] not in j)):
            scanning = True
            continue
        else:
            scanning = False
            continue
        if scanning:
            df.column2 = j

But neither attempt works.

I really need your help guys and girls...

CodePudding user response：

In your case the number at the end of each item is the key for merging the two lists, so we use that number as the index that links them:

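import pandas as pd

# sample data: each item ends with a numeric key
list1 = ['abc 1', 'abc 2', 'abc 3', 'abc 4']
list2 = ['zzz 1', 'zzz 2', 'xxx 2', 'zzz 4', 'yyy 4']

s1 = pd.Series(list1, index=[x.split()[1] for x in list1])
s2 = pd.Series(list2, index=[x.split()[1] for x in list2])
out = pd.concat([s1.groupby(level=0).agg(' '.join), s2.groupby(level=0).agg(' '.join)], axis=1)

output:

       0            1
1  abc 1        zzz 1
2  abc 2  zzz 2 xxx 2
3  abc 3          NaN
4  abc 4  zzz 4 yyy 4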

Once we have the two indexed Series, we merge the rows that share an index value into a single row by grouping on the index and joining the strings.
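To see the grouping step in isolation, here is a minimal sketch on toy data (the series `s` and `other` are made up for illustration and are not from the question). It shows that rows sharing an index label are joined into one string, and that `pd.concat` with `axis=1` aligns on the index and leaves NaN where a key is missing:

```python
import pandas as pd

# Toy series: the index holds the key, and key '2' appears twice.
s = pd.Series(['zzz 2', 'xxx 2', 'zzz 4'], index=['2', '2', '4'])

# Group rows that share an index label and join their strings with a space.
joined = s.groupby(level=0).agg(' '.join)

# A second series that also has key '3', which `joined` lacks.
other = pd.Series(['abc 2', 'abc 3', 'abc 4'], index=['2', '3', '4'])

# concat with axis=1 aligns on the index; keys missing on one side become NaN.
out = pd.concat([other, joined], axis=1)
print(out)
```

Because the alignment happens on the index, the order of the items in the second list should not matter as long as each item carries its key.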

CodePudding user response：

You could use itertools.groupby in a simple wrapper to build the appropriate Series to construct the dataframe:

import re
from itertools import groupby

import pandas as pd

list1 = ['abc 1', 'abc 2', 'abc 3', 'abc 4']
list2 = ['zzz 1', 'zzz 2', 'xxx 2', 'zzz 4', 'yyy 4']

def groupbynum(l):
    # key function: the first standalone number in the string
    get_num = lambda x: re.search(r'\b(\d+)\b', x).group()

    # uncomment below if input is not sorted by number
    #l = sorted(l, key=get_num)
    return pd.Series({k: ', '.join(g) for k,g in
                      groupby(l, get_num)})

df = pd.DataFrame({'col1': groupbynum(list1),
                   'col2': groupbynum(list2),})

output:

    col1          col2
1  abc 1         zzz 1
2  abc 2  zzz 2, xxx 2
3  abc 3           NaN
4  abc 4  zzz 4, yyy 4
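
The approaches above key on a number in each item, while the lists in the question are keyed by surname. Below is a minimal sketch of the same idea for that kind of data. It assumes the last word of each name is the family name and that the first word of the name is enough to recognise a heading line; both are assumptions, `collect_mentions` is a hypothetical helper name, and the texts are shortened stand-ins for the paragraphs in the question.

```python
import pandas as pd

names = ['Mikhail Maratovich Biden', 'Borisovich Trump', 'Georgious Bush']
texts = [
    'Mikhail Maratovich Biden, German Borisovich Trump – co-beneficiaries',
    'Mr Biden and Mr Trump are high-profile entrepreneurs.',
    'Mr Biden was a member of the Banking Council.',
    'Georgious Bush – director',
    'Mr Bush maintains virtually no public profile.',
]

def collect_mentions(full_name, paragraphs):
    """Join the paragraphs that mention the surname but not the first word of the name."""
    parts = full_name.split()
    first_word, surname = parts[0], parts[-1]
    hits = [p for p in paragraphs if surname in p and first_word not in p]
    return ' '.join(hits)

df = pd.DataFrame({'column1': names})
# Build the second column one name at a time, collecting all matching paragraphs.
df['column2'] = [collect_mentions(name, texts) for name in df['column1']]
print(df)
```

Note that `in` does plain substring matching, so a surname such as 'Bush' would also match a longer word like 'Bushnell'; a regular expression with word boundaries would be stricter if that matters for your data.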