I want to generate a new file from this one, but with the gene version removed from the second column, i.e. the dot and any number that comes after it. This is the file content:
chr gene_id gene_name start end gene_type
1 ENSG00000223972.4 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232.3 WASH7P 14363 29806 pseudogene
The output should look like:
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
I tried this command, but it did not work: sed $2 's/ *..*//' gene_annot.parsed.txt > gene1.txt
CodePudding user response:
In the simplest of possible variants:
awk '{gsub(/\.[0-9]+ /, " ", $0)}1' genes
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
1 ENSG00000243485 MIR1302-11 29554 31109 antisense
1 ENSG00000221311 MIR1302-11 30366 30503 miRNA
1 ENSG00000237613 FAM138A 34554 36081 protein_coding
1 ENSG00000240361 OR4G11P 62948 63887 pseudogene
1 ENSG00000186092 OR4F5 69091 70008 protein_coding
Should other fields further down in the file also contain a . followed by digits, this may have undesirable results, because the substitution is applied to the whole line rather than to the second column only.
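To make that caveat concrete, here is a small sketch comparing a whole-line substitution with one restricted to the second field. The third sample row, whose gene name contains a dot, is hypothetical and only there to trigger the problem; the temporary file and variable names are likewise just for illustration.

```shell
# Sample data: the third row is a hypothetical example, not taken from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
chr gene_id gene_name start end gene_type
1 ENSG00000223972.4 DDX11L1 11869 14412 pseudogene
1 ENSG00000238009.2 RP11-34P13.7 89295 133566 lincRNA
EOF

# Whole-line substitution: also strips ".7" from the gene name in column 3.
whole_line=$(awk '{gsub(/\.[0-9]+ /, " ")} 1' "$tmp")

# Field-restricted substitution: only a trailing ".digits" in column 2 is removed.
field_only=$(awk '{sub(/\.[0-9]+$/, "", $2)} 1' "$tmp")

printf '%s\n' "$whole_line" "$field_only"
rm -f "$tmp"
```

Restricting the substitution to $2 leaves dots in every other column alone, at the cost of awk rebuilding the line with single spaces between fields.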
CodePudding user response:
Assuming that . is never present before the 2nd column, you might use GNU sed for this as follows. Let the content of file.txt be
chr gene_id gene_name start end gene_type
1 ENSG00000223972.4 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232.3 WASH7P 14363 29806 pseudogene
then
sed 's/\.[0-9]*//' file.txt
output
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
Explanation: on each line, replace a literal . (note that \ is required, as . has a special meaning in regular expressions) followed by zero or more (*) digits ([0-9]) with an empty string (i.e. remove it). Since there is no g flag, only the first match on each line is replaced.
If you need to use GNU AWK AT ANY PRICE, then to get the same effect do
awk '{sub(/\.[0-9]*/,"");print}' file.txt
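If the assumption about dots before the 2nd column does not hold, the substitution can be anchored to the second column instead. The following is a minimal sketch, assuming whitespace-separated columns and a sed that accepts -E (GNU and BSD sed do); the sample row, with a dot in its first column, is hypothetical.

```shell
# Hypothetical row: the first column itself contains a dot.
line='GL000192.1 ENSG00000000001.2 GENE1 100 200 pseudogene'

# Unanchored: removes the first ".digits" on the line, which sits in column 1 here.
unanchored=$(printf '%s\n' "$line" | sed 's/\.[0-9]*//')

# Anchored: capture column 1, the separator and the column 2 stem,
# then drop the ".digits" that immediately follows.
anchored=$(printf '%s\n' "$line" |
  sed -E 's/^([^[:space:]]+[[:space:]]+[^[:space:].]+)\.[0-9]+/\1/')

printf '%s\n' "$unanchored" "$anchored"
```

The anchored pattern leaves lines without a version suffix in column 2, such as the header, unchanged.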
CodePudding user response:
With awk it could be:
awk 'NR > 1 && index($2,".") {sub(/\.[[:digit:]]*/,"",$2)} 1' file
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
- double condition: skip the header, i.e. NR > 1, and make sure field 2 contains a dot character, i.e. index($2,".").
- if true, then the action: remove the dot and digit(s) from field 2. And finally print, via the trailing 1.
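One side effect worth knowing: when awk modifies $2, it rebuilds the line using OFS, which is a single space by default. If the real file is tab-separated (an assumption, since the question does not state the delimiter), setting both FS and OFS to a tab keeps the delimiter intact. A minimal sketch with made-up, shortened sample rows:

```shell
# Assumption: tab-delimited input; the question does not state the delimiter.
input=$(printf 'chr\tgene_id\tgene_name\n1\tENSG00000223972.4\tDDX11L1\n')

# Skip the header, strip a trailing ".digits" from column 2, keep tabs on output.
result=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = OFS = "\t" } NR > 1 { sub(/\.[0-9]+$/, "", $2) } 1')

printf '%s\n' "$result"
```

With space-separated input the BEGIN block can simply be dropped.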
CodePudding user response:
$ awk '{sub(/\..*/,"",$2)} 1' file
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
or if you prefer visual alignment:
$ awk '{sub(/\..*/,"",$2)} 1' file | column -t
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
