I have a csv file that has a column called "Authors". In that column, each row has a couple of authors separated by commas. In the code below the function, getAuthorNames gets all the author names in that column and returns an array with all their names.
Then the function authCount counts how many times an individual name appears in the Author column. At first, I was doing it with a couple of hundred rows and had no issues. Now I am trying to do it with 20,000 rows and it has taken a couple of hours and still no results. I believe it is the nested for loops and if statement that is causing it to take so long. Any advice on how to speed up the process would help. Should I be using lambda? Is there a built it pandas function that could help?
This is what the input data looks like: |Title|Authors|ID| |:----|:-----:|-:| |XXX|Wang J, Wang H|XXX| |XXX|Wang J,Han H|XXX|
And this is what the output would look like |Author|Count| |:----|-----:| |Wang J| 2 | |Wang H| 1 | |Han H| 1 |
Here is the code:
import pandas as pd
df = pd.read_csv (r'C:\Users\amos.epelman\Desktop\Pubmedpull3GC.csv')
def getAuthorNames(dataFrame):
arrayOfAuthors = []
numRows = dataFrame.shape[0]
cleanDF = dataFrame.fillna("0")
for i in range (0,numRows):
miniArray = cleanDF.at[i,"Authors"].split(",")
arrayOfAuthors = miniArray
return arrayOfAuthors
def authCount(dataFrame):
authArray = getAuthorNames(dataFrame)
numAuthors = len(authArray)
countOfAuth = [0] * numAuthors
newDF = pd.DataFrame({"Author Name": authArray, "Count": countOfAuth})
refDF = dataFrame.fillna("0")
numRows= refDF.shape[0]
for i in range (0,numAuthors):
for j in range (0,numRows):
if newDF.at[i, "Author Name"] in refDF.at[j,"Authors"]:
newDF.at[i,"Count"] = 1
sortedDF = newDF.sort_values(["Count"], ascending = False)
noDupsDF = sortedDF.drop_duplicates(subset ="Author Name", keep = False)
return noDupsDF
finalDF = authCount(df)
file_name = 'GC Pubmed Pull3 Author Names with Count.xlsx'
finalDF.to_excel(file_name)
CodePudding user response:
you could try using Counter and a lambda function to eliminate your nested for loop over two dataframes, which seems like it would be a slow way to add a new column
from collections import Counter
Then to get the "Counts" column
author_counts = Counter(list(refDF["Authors"]))
newDF["Count"] = newDF.apply(lambda r: author_counts[r["Author Name"]], axis=1)
CodePudding user response:
# take series of authors and split at comma and expand into dataframe
authors = df[‘authors’].str.split(pat=“,”, expand=True)
# get unique values of authors
authors = pd.unique(authors)
