Let's say I have a column "OU":
OU
CORP:Jenny Smith:
STORE:Mary Poppins:
STORE:Tony Stark:
STORE:Carmen Sandiego:
NEWS:Peter Parker:
NEWS:Clark Kent:
I want to parse this column up to the first ":" and keep only the text before it. Any value that repeats should then appear only once. The finished data should look like this:
OU
CORP
STORE
NEWS
Would I need to do something in pandas.read_csv(file, usecols=['OU']) when I read the original CSV file?
In reference to an answer below, this is also how that one row looks in a text editor:
OU
CORP:Jenny Smith:
"CORP:John Smith:,John Smith:"
STORE:Mary Poppins:
STORE:Tony Stark:
STORE:Carmen Sandiego:
NEWS:Peter Parker:
NEWS:Clark Kent:
CodePudding user response:
You can use the colon as the separator, skip the header row of the CSV file, and supply the column name manually. Then call drop_duplicates:
import pandas as pd
pd.read_csv(file, sep=":", header=None, skiprows=1, usecols=[0], names=['OU']).drop_duplicates()
Result:
OU
0 CORP
1 STORE
4 NEWS
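If the quoted row from the question's edit is in the file, splitting on ":" at read time may leave that row as one unsplit field, since the quotes protect everything inside them. A possible alternative, offered only as a sketch, is to read the OU column with the default comma separator and split the strings afterwards. The inline csv_text and the use of io.StringIO below are assumptions made so the example is self-contained; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Sample data copied from the question, including the quoted row.
# (Inline text is an assumption for illustration; normally you would pass a file path.)
csv_text = '''OU
CORP:Jenny Smith:
"CORP:John Smith:,John Smith:"
STORE:Mary Poppins:
STORE:Tony Stark:
STORE:Carmen Sandiego:
NEWS:Peter Parker:
NEWS:Clark Kent:
'''

# Read only the OU column. With the default comma separator each value stays whole,
# and the quotes keep the comma inside the second row from starting a new field.
df = pd.read_csv(io.StringIO(csv_text), usecols=['OU'])

# Keep the text before the first ":" and drop repeated values.
df['OU'] = df['OU'].str.split(':', n=1).str[0]
result = df.drop_duplicates().reset_index(drop=True)
print(result)
```

Here reset_index(drop=True) just renumbers the remaining rows; leave it out if you want to keep the original row positions as in the result above.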
