Home > database >  Splitting a FASTA file - separating ID from species name and outputting a character matrix
Splitting a FASTA file - separating ID from species name and outputting a character matrix

Time:01-05

I have a FASTA file containing DNA sequences with headers of the form:

>EGRPR088-07|Tetrapturus belone
AGTC...

and need to split after the | delimiter.

The final output should be a character matrix:

          [,1]                 [,2] [,3] [,4] [,5] [,6]
"EGRPR088-07|" "Tetrapturus belone" "A"  "G"  "T"  "C"  ...

Using the ape package in R, to read in a FASTA file (using read.FASTA()) and then converting it via as.character.DNAbin(as.matrix.DNAbin(x)), where x is the FASTA file, I've managed to get

                                  [,1] [,2] [,3] [,4]
EGRPR088-07|Tetrapturus belone    "a"  "g"  "t"  "c"  ...

which is close to what I need (conversion of nucleotides to lower case is irrelevant). However, there's no way to access the header.

Am I missing something here? Perhaps this is easier in the terminal? If so, what is the specific bash sequence to accomplish this?

CodePudding user response:

Would you please try the bash script:

#!/bin/bash

#
# split the sequence into characters, quote, and print
#
flush() {
    if [[ -n $seq ]]; then
        mapfile -t ary < <(fold -w 1 <<< "$seq" | sed -E 's/(.)/"&"/')
        echo "${ary[*]}"
        seq=                            # clear the seq variable
    fi
}

while IFS= read -r line; do
    if [[ $line = ">"* ]]; then         # header line
        flush
        echo -n "$line" | sed -E 's/^>([^|] \|)(.*)/"\1" "\2" /'
    else                                # sequence line(s)
        seq="$seq$line"                 # consider multi-line sequence
    fi
done < file.fasta
flush

Example of input fasta file:

>EGRPR088-07|Tetrapturus belone
AGTC
AGGC
>EGRPR088-08|Tetrapturus belone
GGAG
AAAT

Output:

"EGRPR088-07|" "Tetrapturus belone" "A" "G" "T" "C" "A" "G" "G" "C"
"EGRPR088-08|" "Tetrapturus belone" "G" "G" "A" "G" "A" "A" "A" "T"

If the fasta file is large, bash script may take a long time to execute. Then please consider other language, such as perl:

perl -e '
    use strict;
    use warnings;

    my @a;
    sub flush {
        if (@a) {
            print(join(" ", map {"\"" . $_ . "\""} @a), "\n");
            @a = ();
        }
    }

    while (<>) {
        chop;
        if (/^>/) {
            flush(@a);
            @a = /^>([^|] \|)(.*)/;
        } else {
            push(@a, split("", $_));
        }
    }
    flush(@a);
' file.fasta

[Edit]
As the bash script outputs the result on multiple lines of the script, a single redirection within the script may not work enough. If you want to redirect the output to a file, put a command:

exec > out.txt

at the beginning (right after the #!/bin/bash line) of the script.
Alternatively, you can save the script file such as myscript.sh, then invoke with:

bash myscript.sh > out.txt
  •  Tags:  
  • Related