I have a FASTA file containing DNA sequences with headers of the form:
>EGRPR088-07|Tetrapturus belone
AGTC...
and need to split after the | delimiter.
The final output should be a character matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
"EGRPR088-07|" "Tetrapturus belone" "A" "G" "T" "C" ...
Using the ape package in R, to read in a FASTA file (using read.FASTA()) and then converting it via as.character.DNAbin(as.matrix.DNAbin(x)), where x is the FASTA file, I've managed to get
[,1] [,2] [,3] [,4]
EGRPR088-07|Tetrapturus belone "a" "g" "t" "c" ...
which is close to what I need (conversion of nucleotides to lower case is irrelevant). However, there's no way to access the header.
Am I missing something here? Perhaps this is easier in the terminal? If so, what is the specific bash sequence to accomplish this?
CodePudding user response:
Would you please try the bash script:
#!/bin/bash
#
# split the sequence into characters, quote, and print
#
flush() {
if [[ -n $seq ]]; then
mapfile -t ary < <(fold -w 1 <<< "$seq" | sed -E 's/(.)/"&"/')
echo "${ary[*]}"
seq= # clear the seq variable
fi
}
while IFS= read -r line; do
if [[ $line = ">"* ]]; then # header line
flush
echo -n "$line" | sed -E 's/^>([^|] \|)(.*)/"\1" "\2" /'
else # sequence line(s)
seq="$seq$line" # consider multi-line sequence
fi
done < file.fasta
flush
Example of input fasta file:
>EGRPR088-07|Tetrapturus belone
AGTC
AGGC
>EGRPR088-08|Tetrapturus belone
GGAG
AAAT
Output:
"EGRPR088-07|" "Tetrapturus belone" "A" "G" "T" "C" "A" "G" "G" "C"
"EGRPR088-08|" "Tetrapturus belone" "G" "G" "A" "G" "A" "A" "A" "T"
If the fasta file is large, bash script may take a long time to execute. Then please consider other language, such as perl:
perl -e '
use strict;
use warnings;
my @a;
sub flush {
if (@a) {
print(join(" ", map {"\"" . $_ . "\""} @a), "\n");
@a = ();
}
}
while (<>) {
chop;
if (/^>/) {
flush(@a);
@a = /^>([^|] \|)(.*)/;
} else {
push(@a, split("", $_));
}
}
flush(@a);
' file.fasta
[Edit]
As the bash script outputs the result on multiple lines of the script, a single
redirection within the script may not work enough.
If you want to redirect the output to a file, put a command:
exec > out.txt
at the beginning (right after the #!/bin/bash line) of the script.
Alternatively, you can save the script file such as myscript.sh, then
invoke with:
bash myscript.sh > out.txt
