Create a csv file based on variables in AWK

Time:01-30

This probably looks straightforward to some people, but I have spent a lot of time on it and it still doesn't work. I want to create a comma-delimited csv file from the names of the fastq files in the lists below (M1 for the fastq_1 column, M2 for the fastq_2 column) plus two variables. The header should be sample, fastq_1, fastq_2, strandedness, and each value must land in the column that matches its header name.

fastq folder

S1_1.fastq.gz
S1_2.fastq.gz
S2_1.fastq.gz
S2_2.fastq.gz 
S3_1.fastq.gz
S3_2.fastq.gz
S4_1.fastq.gz
S4_2.fastq.gz

# variables
sample="mouse"
M1=$(ls *_1.fastq.gz)
M2=$(ls *_2.fastq.gz)
strandedness="paired"

#code
awk '
BEGIN      { OFS=",";
             print "sample", "fastq_1", "fastq_2", "strandedness"
           }
FNR==NR    {
             print $sample, $M1, $M2, $strandedness
           }' > output.csv

Desired output

sample, fastq_1, fastq_2, strandedness  #header
mouse, S1_1.fastq.gz, S1_2.fastq.gz, paired #values
mouse, S2_1.fastq.gz, S2_2.fastq.gz, paired #values
mouse, S3_1.fastq.gz, S3_2.fastq.gz, paired #values
mouse, S4_1.fastq.gz, S4_2.fastq.gz, paired #values

I would be grateful if someone could help me solve this problem.

CodePudding user response:

Pure bash might be easier than awk for that:

#!/bin/bash

sample=mouse
strandedness=paired
fastq_folder=./
{
    # header
    printf '%s, %s, %s, %s\n' sample fastq_1 fastq_2 strandedness

    # values
    for fastq_1 in "$fastq_folder"/*_1.fastq.gz
    do
        fastq_2="${fastq_1%_1.fastq.gz}_2.fastq.gz"

        [[ -f $fastq_2 ]] || continue # you may display an error message

        printf '%s, %s, %s, %s\n' \
            "$sample" \
            "${fastq_1##*/}" \
            "${fastq_2##*/}" \
            "$strandedness"
    done
} > output.csv

output.csv:

sample, fastq_1, fastq_2, strandedness
mouse, S1_1.fastq.gz, S1_2.fastq.gz, paired
mouse, S2_1.fastq.gz, S2_2.fastq.gz, paired
mouse, S3_1.fastq.gz, S3_2.fastq.gz, paired
mouse, S4_1.fastq.gz, S4_2.fastq.gz, paired

Remark: adding a space after the commas may look prettier, but in CSV terms it adds a space character to the data.
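If the spaces are not wanted in the data, the same loop can be written with a bare comma as the separator. This is only a sketch: make_samplesheet is a hypothetical function name, and it assumes the same *_1.fastq.gz / *_2.fastq.gz naming as above.

```shell
# Sketch: same pairing loop as above, but with no space after the commas.
# make_samplesheet is a hypothetical helper name, not part of the original script.
# Arguments: <fastq folder> <sample name> <strandedness>
make_samplesheet() {
    dir=$1
    sample=$2
    strandedness=$3

    # header
    printf '%s,%s,%s,%s\n' sample fastq_1 fastq_2 strandedness

    # values
    for fastq_1 in "$dir"/*_1.fastq.gz
    do
        fastq_2="${fastq_1%_1.fastq.gz}_2.fastq.gz"

        # skip unpaired files; this also skips the literal pattern
        # that the shell leaves in place when nothing matches
        [ -f "$fastq_1" ] && [ -f "$fastq_2" ] || continue

        printf '%s,%s,%s,%s\n' \
            "$sample" \
            "${fastq_1##*/}" \
            "${fastq_2##*/}" \
            "$strandedness"
    done
}
```

It would be called as: make_samplesheet . mouse paired > output.csv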

CodePudding user response:

$ ls fastq_folder
S1_1.fastq.gz  S2_1.fastq.gz  S3_1.fastq.gz  S4_1.fastq.gz
S1_2.fastq.gz  S2_2.fastq.gz  S3_2.fastq.gz  S4_2.fastq.gz

$ cat tst.awk
BEGIN {
    OFS=","
    print "sample", "fastq_1", "fastq_2", "strandedness"
    for (i=1; i<ARGC; i++) {
        sub(".*/","",ARGV[i])
        file1 = file2 = ARGV[i]
        sub(/_1\.fastq\.gz$/,"_2.fastq.gz",file2)
        print sample, file1, file2, strandedness
    }
    exit
}

$ awk -v sample="$sample" -v strandedness="$strandedness" -f tst.awk fastq_folder/*_1.fastq.gz
sample,fastq_1,fastq_2,strandedness
mouse,S1_1.fastq.gz,S1_2.fastq.gz,paired
mouse,S2_1.fastq.gz,S2_2.fastq.gz,paired
mouse,S3_1.fastq.gz,S3_2.fastq.gz,paired
mouse,S4_1.fastq.gz,S4_2.fastq.gz,paired

The above assumes the files are always paired, as you stated in a comment, and that there aren't so many files as to exceed the system's ARG_MAX limit.
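If the number of files could get close to that limit, one possible workaround is to feed the names to awk on stdin instead of on the command line. The sketch below is an illustration rather than part of the answer above: names_to_csv is a hypothetical function name, and it assumes that no file name contains a newline and that every *_1 file has a mate.

```shell
sample=mouse
strandedness=paired

# names_to_csv is a hypothetical helper: it reads one file path per line
# on stdin instead of taking the paths as command-line arguments.
names_to_csv() {
    awk -v sample="$sample" -v strandedness="$strandedness" '
        BEGIN {
            OFS = ","
            print "sample", "fastq_1", "fastq_2", "strandedness"
        }
        {
            sub(".*/", "")                               # drop the directory part
            file1 = file2 = $0
            sub(/_1\.fastq\.gz$/, "_2.fastq.gz", file2)  # derive the mate name
            print sample, file1, file2, strandedness
        }
    '
}
```

It would be called as: printf '%s\n' fastq_folder/*_1.fastq.gz | names_to_csv > output.csv

In bash, printf is a builtin, so the expanded glob is not passed through an exec call and is not subject to ARG_MAX.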
