This is a follow-up of a previous question about using a Python dictionary to generate a list of files to include as input for a single step. In this case, I'm interested in merging BAM files for a single sample that have been generated by mapping FASTQ files from multiple runs.
I am running into an error in my rule combine_bams only for a single sample:
InputFunctionException in line 116 of /oak/stanford/scg/lab_jandr/walter/tb/mtb/workflow/Snakefile:
Error:
WildcardError:
No values given for wildcard 'samp'.
Wildcards:
samp=10561-7352-culture_S24
mapper=bwa
ref=H37Rv
Traceback:
File "/oak/stanford/scg/lab_jandr/walter/tb/mtb/workflow/Snakefile", line 118, in <lambda>
It seems that samp is defined correctly in the wildcards list, so I'm not sure why an error is called. Any suggestions would be great, my snakemake file is below. Thank you!
# Define samples:
RUNS, SAMPLES = glob_wildcards(config['fastq_dir'] "{run}/{samp}_L001_R1_001.fastq.gz")
# Create sample dictionary so that each sample (key) has list of runs (values) associated with it.
sample_dict = {}
for key, val in zip(SAMPLES,RUNS):
sample_dict.setdefault(key, []).append(val)
#print(sample_dict)
# Constrain mapper and filter wildcards.
wildcard_constraints:
mapper="[a-zA-Z2] ",
filter="[a-zA-Z2] ",
run = '|'.join([re.escape(x) for x in RUNS]),
samp = '|'.join([re.escape(x) for x in SAMPLES]),
ref = '|'.join([re.escape(x) for x in config['ref']])
# Define a rule for running the complete pipeline.
rule all:
input:
trim = expand(['results/{samp}/{run}/trim/{samp}_trim_1.fq.gz'], zip, run = RUNS, samp = SAMPLES),
kraken=expand('results/{samp}/{run}/kraken/{samp}_trim_kr_1.fq.gz', zip, run = RUNS, samp = SAMPLES),
bams=expand('results/{samp}/{run}/bams/{samp}_{mapper}_{ref}_sorted.bam', zip, run = RUNS, samp = SAMPLES, ref = config['ref']*len(RUNS), mapper = config['mapper']*len(RUNS)), # When using zip, need to use vectors of equal lengths for all wildcards.
per_samp_run_stats = expand('results/{samp}/{run}/stats/{samp}_{mapper}_{ref}_combined_stats.csv', zip, run = RUNS, samp = SAMPLES, ref = config['ref']*len(RUNS), mapper = config['mapper']*len(RUNS)),
combined_bams=expand('results/{samp}/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam', samp = np.unique(SAMPLES),ref=config['ref'], mapper=config['mapper'])
# Trim reads for quality.
rule trim_reads:
input:
p1='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz',
p2='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R2_001.fastq.gz'
output:
trim1='results/{samp}/{run}/trim/{samp}_trim_1.fq.gz',
trim2='results/{samp}/{run}/trim/{samp}_trim_2.fq.gz'
log:
'results/{samp}/{run}/trim/{samp}_trim_reads.log'
shell:
'workflow/scripts/trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'
# # Filter reads taxonomically with Kraken.
rule taxonomic_filter:
input:
trim1='results/{samp}/{run}/trim/{samp}_trim_1.fq.gz',
trim2='results/{samp}/{run}/trim/{samp}_trim_2.fq.gz'
output:
kr1='results/{samp}/{run}/kraken/{samp}_trim_kr_1.fq.gz',
kr2='results/{samp}/{run}/kraken/{samp}_trim_kr_2.fq.gz',
kraken_report='results/{samp}/{run}/kraken/{samp}_kraken.report',
kraken_stats = 'results/{samp}/{run}/kraken/{samp}_kraken_stats.csv'
log:
'results/{samp}/{run}/kraken/{samp}_kraken.log'
threads: 8
shell:
'workflow/scripts/run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_report} &>> {log}'
# Map reads.
rule map_reads:
input:
ref_path='/labs/jandr/walter/tb/data/refs/{ref}.fa',
kr1='results/{samp}/{run}/kraken/{samp}_trim_kr_1.fq.gz',
kr2='results/{samp}/{run}/kraken/{samp}_trim_kr_2.fq.gz'
output:
bam='results/{samp}/{run}/bams/{samp}_{mapper}_{ref}_sorted.bam'
params:
mapper='{mapper}'
log:
'results/{samp}/{run}/bams/{samp}_{mapper}_{ref}_map.log'
threads: 8
shell:
"workflow/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"
# Combine reads and remove duplicates (per sample).
rule combine_bams:
input:
bams = lambda wildcards: expand('results/{samp}/{run}/bams/{samp}_{mapper}_{ref}_sorted.bam', run=sample_dict[wildcards.samp])
output:
combined_bam = 'results/{samp}/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
log:
'results/{samp}/bams/{samp}_{mapper}_{ref}_merge_bams.log'
threads: 8
shell:
"sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"
CodePudding user response:
In rule combine_bams, when using lambda expression you will need to provide the values of all {} wildcards. Right now there is only run information provided. One way to fix this is to include kwarg allow_missing=True to expand:
bams = lambda wildcards: expand(
"results/{samp}/{run}/bams/{samp}_{mapper}_{ref}_sorted.bam",
run=sample_dict[wildcards.samp],
allow_missing=True,
)
This will be the same as:
bams = lambda wildcards: expand(
"results/{samp}/{run}/bams/{samp}_{mapper}_{ref}_sorted.bam",
run=sample_dict[wildcards.samp],
samp="{samp}",
mapper="{mapper}",
ref="{ref}",
)
