I'm trying to reformat the family IDs in a .fam file in which the sample and family IDs are identical and coded in the following way:
Continent_Breed_Ind-ID
The idea is to transform column 1 so that it contains only the continent and breed, while keeping the other columns unchanged.
Mock dataset:
Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9
Desired outcome:
Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9
I have tried using sed as follows:
sed -r 's/_[^_]*//2g' file.fam
But that only gives me the first column.
Any ideas?
CodePudding user response:
You may use this simple sed command:
sed 's/_[^_]* / /' file
Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9
Here:
- _[^_]* followed by a space: matches _ followed by 0 or more non-_ characters, followed by a space
- We replace this match with a single space to restore the separator between the first and second columns
PS: Note that there is no global flag used here.
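To see why the missing global flag matters, here is a minimal sketch that runs the same substitution with and without g on two of the mock rows from the question. The variable names (input, first_only, with_g) are only illustrative.

```shell
# Mock rows copied from the question; column 1 is the family ID.
input='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9'

# Without g: only the first "_xxx " match (the tail of column 1) is replaced.
first_only=$(printf '%s\n' "$input" | sed 's/_[^_]* / /')

# With g: the pattern is applied again further along the line,
# so the sample ID in column 2 is altered as well.
with_g=$(printf '%s\n' "$input" | sed 's/_[^_]* / /g')

printf 'without g:\n%s\n' "$first_only"
printf 'with g:\n%s\n' "$with_g"
```

Note that [^_]* can also match spaces, so the command relies on each match being cut short by the next underscore or the end of the line. It works on the sample data because column 2 contains underscores of its own.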
CodePudding user response:
1st solution: With your shown samples, please try the following sed command. The -E option enables ERE (extended regular expressions) here.
sed -E 's/^([^_]*)(_[^_]*)_[^[:space:]]+(.*$)/\1\2\3/' Input_file
2nd solution: With GNU awk, using the capturing-group support of its match function, try the following:
awk 'match($0,/^([^_]*)(_[^_]*)_[^[:space:]]+(.*$)/,arr){print arr[1] arr[2] arr[3]}' Input_file
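As a quick check of how the three capture groups divide the line, the sketch below applies the sed -E form to one mock row. It assumes the quantifier after [^[:space:]] is + (one or more non-whitespace characters); the variable names are illustrative.

```shell
# One mock row from the question.
row='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9'

# Group 1: the continent (everything before the first _).
# Group 2: _ plus the breed.
# Then _ and the individual ID are matched but not captured.
# Group 3: the rest of the line, starting at the first whitespace.
fixed=$(printf '%s\n' "$row" |
  sed -E 's/^([^_]*)(_[^_]*)_[^[:space:]]+(.*$)/\1\2\3/')

printf '%s\n' "$fixed"
```

Because group 3 starts at the whitespace after column 1, the original separator is carried over as it is rather than being re-inserted.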
CodePudding user response:
gawk 'sub("_[^_]+$",_,$!_)_'
mawk 'sub("_[^_]+ "," ")_'
Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9
CodePudding user response:
You can use
awk '{sub(/_[^_]*$/, "", $1)}1' file > newfile
sed 's/^\([^_ ]*_[^_ ]*\)_[^_ ]*/\1/' file > newfile
Details:
- The awk solution finds and removes a _ char followed by zero or more chars other than _ till the end of the string (with sub(/_[^_]*$/, "", $1)) in the first field, and 1 prints the result
- The sed solution finds:
  - ^ - start of string
  - \([^_ ]*_[^_ ]*\) - Group 1 (\1 in RHS refers to this value): zero or more chars other than space and _, an underscore, and then again zero or more chars other than space and _
  - _ - an underscore
  - [^_ ]* - zero or more chars other than space and _.
And the match is replaced with Group 1 value.
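The two commands can be compared side by side with a small sketch like the one below. The last part goes beyond the answer and assumes a hypothetical tab-delimited row: assigning to $1 makes awk rebuild the line with its output separator, so FS and OFS are set to a tab there to keep the delimiters.

```shell
# Mock rows copied from the question.
input='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9'

# awk: strip the last _segment from field 1 only, then print the record.
awk_out=$(printf '%s\n' "$input" | awk '{sub(/_[^_]*$/, "", $1)}1')

# sed: keep "continent_breed" (group 1) and drop the third _segment.
sed_out=$(printf '%s\n' "$input" | sed 's/^\([^_ ]*_[^_ ]*\)_[^_ ]*/\1/')

printf 'awk:\n%s\nsed:\n%s\n' "$awk_out" "$sed_out"

# Hypothetical tab-delimited row: keep tabs by setting FS and OFS.
tab_out=$(printf 'A_B_C\tA_B_C\t0\n' |
  awk 'BEGIN{FS=OFS="\t"} {sub(/_[^_]*$/, "", $1)}1')
printf '%s\n' "$tab_out"
```

With the default separators, the awk version writes single spaces between fields, which is fine for the space-delimited sample shown in the question.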
CodePudding user response:
This might work for you (GNU sed):
sed 's/_/\n/2;s/\n\S*//' file
Replace the second _ by a newline and then remove the newline and any non-white space following it.
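The two steps can be made visible with a short sketch. In the first command a | character stands in for the newline marker purely for display; the second command is the one from the answer and relies on GNU sed extensions (\n in the replacement and \S), so it may not behave the same with other sed implementations.

```shell
# One mock row from the question.
row='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9'

# Step 1 alone, with "|" in place of the newline so the marker is visible:
# the numeric flag 2 targets the second _ on the line.
marked=$(printf '%s\n' "$row" | sed 's/_/|/2')

# Both steps: mark the second _ with a newline, then delete the marker
# together with the non-whitespace characters that follow it.
result=$(printf '%s\n' "$row" | sed 's/_/\n/2;s/\n\S*//')

printf '%s\n%s\n' "$marked" "$result"
```

A newline is a convenient marker because it cannot otherwise occur inside a line that sed has just read.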
