Splitting a FASTA file - separating ID from species name and outputting a character matrix-CodePudding

I have a FASTA file containing DNA sequences with headers of the form:

>EGRPR088-07|Tetrapturus belone
AGTC...

and need to split after the | delimiter.

The final output should be a character matrix:

          [,1]                 [,2] [,3] [,4] [,5] [,6]
"EGRPR088-07|" "Tetrapturus belone" "A"  "G"  "T"  "C"  ...

Using the ape package in R, to read in a FASTA file (using read.FASTA()) and then converting it via as.character.DNAbin(as.matrix.DNAbin(x)), where x is the FASTA file, I've managed to get

                                  [,1] [,2] [,3] [,4]
EGRPR088-07|Tetrapturus belone    "a"  "g"  "t"  "c"  ...

which is close to what I need (conversion of nucleotides to lower case is irrelevant). However, there's no way to access the header.

Am I missing something here? Perhaps this is easier in the terminal? If so, what is the specific bash sequence to accomplish this?

CodePudding user response：

Would you please try the bash script:

#!/bin/bash

#
# split the sequence into characters, quote, and print
#
flush() {
    if [[ -n $seq ]]; then
        mapfile -t ary < <(fold -w 1 <<< "$seq" | sed -E 's/(.)/"&"/')
        echo "${ary[*]}"
        seq=                            # clear the seq variable
    fi
}

while IFS= read -r line; do
    if [[ $line = ">"* ]]; then         # header line
        flush
        echo -n "$line" | sed -E 's/^>([^|] \|)(.*)/"\1" "\2" /'
    else                                # sequence line(s)
        seq="$seq$line"                 # consider multi-line sequence
    fi
done < file.fasta
flush

Example of input fasta file:

>EGRPR088-07|Tetrapturus belone
AGTC
AGGC
>EGRPR088-08|Tetrapturus belone
GGAG
AAAT

Output:

"EGRPR088-07|" "Tetrapturus belone" "A" "G" "T" "C" "A" "G" "G" "C"
"EGRPR088-08|" "Tetrapturus belone" "G" "G" "A" "G" "A" "A" "A" "T"

If the fasta file is large, bash script may take a long time to execute. Then please consider other language, such as perl:

perl -e '
    use strict;
    use warnings;

    my @a;
    sub flush {
        if (@a) {
            print(join(" ", map {"\"" . $_ . "\""} @a), "\n");
            @a = ();
        }
    }

    while (<>) {
        chop;
        if (/^>/) {
            flush(@a);
            @a = /^>([^|] \|)(.*)/;
        } else {
            push(@a, split("", $_));
        }
    }
    flush(@a);
' file.fasta

[Edit]
As the bash script outputs the result on multiple lines of the script, a single redirection within the script may not work enough. If you want to redirect the output to a file, put a command:

exec > out.txt

at the beginning (right after the #!/bin/bash line) of the script.
Alternatively, you can save the script file such as myscript.sh, then invoke with:

bash myscript.sh > out.txt