I want to change value of my duplicate. Here is my dataframe:
sku
FAT-001
FAT-001
FAT-001
FAT-002
FAT-002
my expected data frame will be look like this
sku
FAT-001 #don't want to change first value of duplicate
FAT-001-01
FAT-001-02
FAT-002
FAT-002-01
CodePudding user response:
df["number"] = df.groupby("sku").cumcount()
df.apply(lambda x: x.sku ("" if x.number == 0 else "-" str(x.number).rjust(2,"0")), axis = 1)
or:
df["number"] = "-" df.groupby("sku").cumcount().astype(str).str.rjust(2, '0')
df.number[df["number"].eq("-" "0" * 2)] = ""
df.sku df.number
my output:
0 FAT-001
1 FAT-001-01
2 FAT-001-02
3 FAT-002
4 FAT-002-01
dtype: object
explanation:
what is groupby? it is a sql-inspired command that give you some element base on every unique element... for example: length, max, list or other...
df = pd.DataFrame([
[1,2],
[1,3],
[1,4],
[2,5],
[2,6],
], columns=["id","number"])
df.groupby("id").agg({"number": len})
give you:
number
id
1 3
2 2
the number of every unique elements, or
df.groupby("id").agg({"number": list})
give you
number
id
1 [2, 3, 4]
2 [5, 6]
and you can try max or min or first...
using agg, you can specify what you want for every column... if you have more than one (except what was groupby), you can specify something for each column...
except agg, there are other methods for goupped dataframes: like cumcount, that set index every row for each group, I mean it reset index for every group:
df.groupby("sku").cumcount()
output:
0 0
1 1
2 2
3 0
4 1
your first FAT-001 get index: 0, next: 1, ... and for FAT-002, first get index 0 again...
so, we have two part that you want, now... therefore we must find a way to join them for every row: axis:1 in apply means for every row
so, you have one exception: you don't want index for every row of each group... so change it to "", empty:
df.apply(lambda x: "" if x.number == 0 else str(x.number), axis = 1)
:
0
1 1
2 2
3
4 1
dtype: object
your 0 and 3 rows are for new-group...
next, your desired format is: 01,02,... a 0 for every index. pandas has a method that convert every string to string with desired length with an arbitrary char: rjust(desird_length, arbitrary_char)
how works: if you call that as rjust(2,"0"), it dos not change "22" or other 2-char, 3-char, ... strings, but, if your string length be 1 like 1 will be converted to 01 and ... (notice that there are a method called ljust too :))
df["number"] = df.groupby("sku").cumcount()
df.apply(lambda x: "" if x.number == 0 else str(x.number).rjust(2,"0"), axis = 1)
0
1 01
2 02
3
4 01
dtype: object
and if statement can written as:
if x.number == 0:
return ""
else:
return "-" str(x.number).rjust(2,"0")
and some points:
- what is
astype(str): it convert every element to string, and works likestr(x), but for every element. why? to add "-" before it and usingrjust. - what is
eq? it isis-equal?and returnTruefor each row if be equal to value, andFalseotherwise. - why
df.number[df["number"].eq("-" "0" * 2)] = ""? because we convert all first elements of each group to"" - why
"-" "0" * 2? because we add"-"and ljust in previous line so we must use correct value:"-00". and why"0" * 2? because you can use every number forljust lengthlike 10 and set it there, too
CodePudding user response:
Similar to @MoRe's answer, use groupby.cumcount to create groups; then you could use str.zfill to pad 0s and mask the first elements of each group:
groups = df.groupby('sku').cumcount()
df['new'] = df['sku'] ('-' groups.astype('string').str.zfill(2)).mask(groups.eq(0), '')
Output:
sku new
0 FAT-001 FAT-001
1 FAT-001 FAT-001-01
2 FAT-001 FAT-001-02
3 FAT-002 FAT-002
4 FAT-002 FAT-002-01
CodePudding user response:
For an alternative one-liner:
df.sku = df.sku df.groupby('sku').cumcount().apply(lambda x: f"-{x:02d}" if x > 0 else '')
Output:
sku
0 FAT-001
1 FAT-001-01
2 FAT-001-02
3 FAT-002
4 FAT-002-01
