How to split an HMM database (Pfam-A.hmm) into individual files?

Time:01-21

I have downloaded the Pfam database, but to proceed with my work I need to split it into individual files, one per HMM. I first tried the hmmfetch command:

Usage: hmmfetch [options] -f <hmmfile> <keyfile>  (retrieves all HMMs in <keyfile>)

With this procedure I can retrieve some HMMs, but I have to list each name in the keyfile. That is not convenient, because I need to retrieve every HMM present in the original file.
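For reference, the names for such a keyfile do not have to be typed by hand, since every profile carries a NAME line in its header. A minimal sketch, assuming exactly one NAME line per profile; the function name list_hmm_names is made up for illustration:

```shell
#!/bin/bash

# list_hmm_names FILE
# Print the second field of every NAME header line in FILE, one per line.
# Assumes each profile has a single header line whose first field is "NAME".
list_hmm_names() {
    awk '$1 == "NAME" { print $2 }' "$1"
}
```

Running list_hmm_names Pfam-A.hmm > names.txt would give a file usable as <keyfile>. Note that hmmfetch -f still writes all fetched profiles to a single output, so this alone does not produce one file per HMM.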

Next, I tried splitting the original file into individual ones with the following command:

csplit --digits=2  --quiet --prefix=hmm Pfam-A.hmm "/\/\//+1" "{*}"

This split the file into individual ones without problems. The only thing I could not figure out is how to give each file the name of its HMM. Each HMM file looks like this:

HMMER3/f [3.1b2 | February 2015]
NAME  120_Rick_ant
ACC   PF12574.11
DESC  120 KDa Rickettsia surface antigen
LENG  238
ALPH  amino
RF    no
MM    no
CONS  yes
CS    no
MAP   yes
DATE  Tue Oct 12 02:07:11 2021
NSEQ  2
EFFN  0.449219
CKSUM 3984216663
GA    25 25;
TC    39.8 39.6;
NC    23.6 21.2;
BM    hmmbuild HMM.ann SEED.ann
SM    hmmsearch -Z 61295632 -E 1000 --cpu 4 HMM pfamseq
STATS LOCAL MSV      -10.8956  0.70336
STATS LOCAL VITERBI  -11.6161  0.70336
STATS LOCAL FORWARD   -5.3029  0.70336
HMM          A        C        D        E        F        G        H        I        K        L        M        N        P        Q        R        S        T        V        W        Y   
            m->m     m->i     m->d     i->m     i->i     d->m     d->d
  COMPO   2.48852  4.43316  2.82069  2.56851  3.39369  2.73712  3.79297  2.89060  2.54228  2.53662  3.76796  3.01951  3.39446  3.08353  3.05948  2.67787  2.83658  2.66102  4.89473  3.44979
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.03268  3.83303  4.55537  0.61958  0.77255  0.00000        *
      1   3.11165  4.58599  4.12585  3.76620  3.12182  3.93147  4.43434  2.32453  3.53431  0.92536  3.15834  4.04543  4.37407  3.91210  3.71656  3.49871  3.40796  2.35149  4.98612  3.70011      1 l - - -
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.03268  3.83303  4.55537  0.61958  0.77255  0.48576  0.95510
      2   1.07216  4.17353  3.42348  3.21371  4.01396  2.99897  4.24029  3.13365  3.22896  3.01700  4.05375  3.37300  3.73453  3.57391  3.48180  2.52446  2.79912  2.79493  5.44509  4.24110      2 a - - -
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.03268  3.83303  4.55537  0.61958  0.77255  0.48576  0.95510
      3   2.91965  5.02079  2.47306  1.08285  4.36227  3.24954  3.83381  3.80837  2.70946  3.43216  4.40865  2.91254  3.85246  3.05076  3.11366  2.90651  3.22382  3.49656  5.54134  4.26436      3 e - - -
...
//

With my approach this file is called "hmm01", but I would like it to be named "120_Rick_ant.hmm". Does anyone know something that could do the trick? Thanks in advance!

Answer:

A basic solution using GNU/BSD awk:

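#!/bin/bash

# awk prints "<HMM name> <split file>" for each hmm* file, then skips to the
# next file; the loop turns every pair into a rename command.
# The echo makes this a dry run: remove it once the printed commands look right.
while read -r id filename
do
    echo mv -- "$filename" "$id.hmm"
done < <(awk '$1 == "NAME" {print $2, FILENAME; nextfile}' hmm*)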
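The split and the rename can also be done in one pass, without the intermediate hmmNN files. The following is only a sketch under a few assumptions: every record ends with a line containing just "//", the first NAME line of a record holds the profile name, and names contain no "/" characters. The function name split_hmm and the output-directory argument are made up for illustration.

```shell
#!/bin/bash

# split_hmm FILE [OUTDIR]
# Write each record of FILE to OUTDIR/<NAME>.hmm (OUTDIR defaults to ".").
split_hmm() {
    local src=$1 outdir=${2:-.}
    mkdir -p "$outdir"
    awk -v outdir="$outdir" '
        # Remember the first NAME seen in the current record.
        $1 == "NAME" && name == "" { name = $2 }

        # Buffer every line of the current record.
        { buf = buf $0 "\n" }

        # "//" ends a record: flush the buffer to <name>.hmm and reset.
        $0 == "//" {
            if (name != "") {
                out = outdir "/" name ".hmm"
                printf "%s", buf > out
                close(out)   # avoid running out of open file handles
            }
            buf = ""
            name = ""
        }
    ' "$src"
}
```

For example, split_hmm Pfam-A.hmm hmms would write hmms/120_Rick_ant.hmm for the record shown in the question. Records without a NAME line are skipped rather than written to a file called ".hmm".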