Channel: Active questions tagged config - Stack Overflow

Is it possible to use wildcards in config files for a Snakemake pipeline?

I'm new to building Snakefiles, and for my bioinformatics research I'm trying to loop my rules over multiple samples. I looked for similar questions and answers, but I can't seem to fix this problem. It may be because I still don't really understand how Snakemake works. If you can help me out, that would be great.

At the moment I have multiple rules, which currently work for one sample:

# variables for every species
SAMPLE = "SRR8528338"
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

configfile: "./envs/contigs/" + SAMPLE + ".yaml"

var_variables = expand("results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var", nr = config["contig_nrs"])
#make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)

rule all:
    input:
        var_variables
#        make_contig_consensus

rule convert_to_fBAM:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/sam/Contig{nr}_AT.sam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        "src/read_var.py"
    output:
        "results/5_consensus_contigs/{sample}"
    params:
        "{sample}"
    shell:
        "python3 {input} {params}"

The config file differs for every sample (the number of contigs varies). For SRR8528338, it looks like this:

contig_nrs:
    1: ./results/4_mapped_contigs/SRR8528338/var/Contig1_AT_sort.var
    2: ./results/4_mapped_contigs/SRR8528338/var/Contig2_AT_sort.var
    3: ./results/4_mapped_contigs/SRR8528338/var/Contig3_AT_sort.var
    ...
    2146: ./results/4_mapped_contigs/SRR8528338/var/Contig2146_AT_sort.var

However, I want to loop all these rules over multiple samples, as listed in the "SAMPLES" variable. I tried using double braces before, which worked for multiple samples (changing every 'SAMPLE' to {{sample}} and adding: , sample = SAMPLES). My code then looks like this:

# variables for every species
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

for sample in SAMPLES:
    configfile: "./envs/contigs/" + sample + ".yaml"

var_variables = expand("results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var", sample = SAMPLES, nr = config["contig_nrs"])
make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)

rule all:
    input:
        var_variables
#        make_contig_consensus

rule convert_to_fBAM:
    input:
        "results/4_mapped_contigs/{{sample}}/sam/Contig{nr}_AT.sam"
    output:
        "results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
    output:
        "results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        "src/read_var.py"
    output:
        "results/5_consensus_contigs/{sample}"
    params:
        "{sample}"
    shell:
        "python3 {input} {params}"

However, when I run this I get an error. I'm not exactly sure, but I think it is because of the for loop (sample in SAMPLES):

Missing input files for rule all:
results/4_mapped_contigs/SRR8528338/var/Contig1266_AT_sort.var
results/4_mapped_contigs/SRR8528338/var/Contig1299_AT_sort.var
...
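
One thing that may be contributing here, as far as I understand Snakemake: double braces are an escape for strings that pass through `expand` (which formats them much like Python's `str.format`), where they leave a literal `{sample}` behind. In a rule's own input/output strings the wildcard would normally be written with single braces. The plain-Python sketch below only shows the escaping itself and does not call Snakemake; the contig number 7 is arbitrary.

```python
# Inside a formatted string, doubled braces collapse to literal single braces,
# while single-brace fields are filled in.
pattern = "results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var"

formatted = pattern.format(nr=7)
print(formatted)

# The result still contains a single-brace {sample} placeholder that a later
# step can fill in (here simply a second call to format).
final = formatted.format(sample="SRR8528338")
print(final)
```

So `{{sample}}` is useful inside an `expand` call that should leave `{sample}` open, but it is probably not what the rule patterns themselves need.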

Now I was wondering: is there a way to expand the config file by using wildcards? Something like:

configfile: expand("./envs/contigs/{sample}.yaml", sample = SAMPLES)

Doing this will give me the error:

TypeError in line 4
expected str, bytes or os.PathLike object, not list

Or do you have other solutions for this problem?
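
On the for loop around `configfile`: my understanding is that every config file is loaded into the same global `config` dictionary, so files that all use the key `contig_nrs` end up sharing one entry, and `expand` then pairs every sample with every number in it. The sketch below imitates this in plain Python with made-up contig numbers. `FAKE_FILES` and `load_all` are hypothetical stand-ins, and the plain `dict.update` is a simplification of however Snakemake actually merges config files.

```python
from itertools import product

# Hypothetical stand-ins for the per-sample YAML files (contig numbers made up).
FAKE_FILES = {
    "SRR8528338": {"contig_nrs": [1, 2, 3]},
    "SRR8528339": {"contig_nrs": [1, 2]},
}


def load_all(samples):
    """Load every per-sample config into one shared dict (simplified merge)."""
    config = {}
    for sample in samples:
        config.update(FAKE_FILES[sample])
    return config


samples = list(FAKE_FILES)
config = load_all(samples)

# Only one "contig_nrs" entry survives; the per-sample information is gone.
print(config)

# Pairing every sample with every number, as a Cartesian product does.
pairs = list(product(samples, config["contig_nrs"]))
print(pairs)

# Keeping the numbers keyed by sample avoids the mismatch.
per_sample = {s: FAKE_FILES[s]["contig_nrs"] for s in samples}
exact_pairs = [(s, nr) for s in samples for nr in per_sample[s]]
print(exact_pairs)
```

A single config file keyed by sample name would give each sample its own list of contig numbers without any loop over `configfile`.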

Thank you!


Update:

I've been trying some things out and I think it would be useful to change the config file into a nested dictionary instead of separate ones. It should look something like this:

    contigs:
        SRR8528336:
            - 1
            - 2
            - ...
            - 2113
        SRR8528337:
            - 1
            ...
        ...
    exons:
        SRR8528336:
            - 1
            ...
            - 1827
        SRR8528337:
            - 1
            ...
            - 1826
        ...

So, for example, if I want to run the samples SRR8528338 through SRR8528340, I give this as input:

SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

and call the contigs by sample name:

var_variables = expand("results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var", nr = config["contigs"][wildcards.sample])

or exons by:

expand("results/7_exons/{{sample}}/var/exon{nr}_AT_sort.var", nr = config["exons"][wildcards.sample])

How exactly does 'wildcards.sample' work if I only want to obtain its value?
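
As far as I understand it, `wildcards` only exists while Snakemake is resolving a concrete job, so `wildcards.sample` cannot be read at the top level of the Snakefile. The usual pattern is to give the rule a function as its input, which Snakemake calls with the matched wildcards. The sketch below imitates that call in plain Python: `contig_vars` is a hypothetical input function, `SimpleNamespace` stands in for the wildcards object, and the contig numbers are made up.

```python
from types import SimpleNamespace

# Made-up excerpt of the nested config described above.
config = {
    "contigs": {
        "SRR8528338": [1, 2, 3],
        "SRR8528339": [1, 2],
    }
}


def contig_vars(wildcards):
    """Hypothetical input function: list the .var files for one sample."""
    return [
        f"results/4_mapped_contigs/{wildcards.sample}/var/Contig{nr}_AT_sort.var"
        for nr in config["contigs"][wildcards.sample]
    ]


# Stand-in for the object Snakemake would pass for a job with sample=SRR8528339.
fake_wildcards = SimpleNamespace(sample="SRR8528339")
files = contig_vars(fake_wildcards)
print(files)
```

In a Snakefile, such a function would be passed by name as the input of a rule whose output contains {sample}, so the lookup happens per sample rather than once globally.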


Solution (and next problem) 31/7/2020

I changed my code as suggested by bli, and it now works:

# variables for every species
SAMPLES = "SRR8528347 SRR8528355 SRR8528356".split()

configfile: "./envs/config_contigs.yaml"

# bam = []
# sort_bam = []
# fpileup = []
var_variables = []
make_contig_consensus = []
blat_variables = []
extract_hits_psl = []

for sample in SAMPLES:
    contig_nrs = config[sample]
    for nr in contig_nrs:
        # bam.append("results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam".format(sample=sample, nr=nr))
        # sort_bam.append("results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam".format(sample=sample, nr=nr))
        # fpileup.append("results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup".format(sample=sample, nr=nr))
        var_variables.append("results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var".format(sample=sample, nr=nr))
        make_contig_consensus.append("results/A05_consensus_contigs/{sample}/Contig{nr}.fasta".format(sample=sample, nr=nr))
        blat_variables.append("results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl".format(sample=sample, nr=nr))
        extract_hits_psl.append("results/A07_mapped_exons/{sample}/".format(sample=sample, nr=nr))

rule all:
    input:
        # bam,
        # sort_bam,
        # fpileup,
        var_variables,
        make_contig_consensus,
        blat_variables,
        extract_hits_psl

rule convert_to_fBAM:
    input:
        "results/A04_mapped_contigs/{sample}/sam/Contig{nr}_AT.sam"
    output:
        "results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
    output:
        "results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        script = "src/read_var.py",
        file = "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
    output:
        "results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
    params:
        "{sample}"
    shell:
        "python3 {input.script} {params}"

rule BLAT_assembled:
    input:
        "data/exons/exons_AT.fasta",
        "results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
    output:
        "results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
    shell:
        "blat "
        "-t=dnax "
        "-q=dnax "
        "-stepSize=5 "
        "-repMatch=2253 "
        "-minScore=0 "
        "-minIdentity=0 "
        "{input} {output}"

rule extract_hits_psl:
    input:
        script = "src/extract_hits_psl.py"
        # file = "results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
    output:
        "results/A07_mapped_exons/{sample}/"
    params:
        "{sample}"
    shell:
        "python {input.script} {params}"

config_contigs.yaml:

SRR8528347:
    - 1
    - ...
    - 5
SRR8528348:
    - 1
    - ...
    - 5
...

Calling them from the .yaml file now works, but the rules should run in the order in which they are written (from top to bottom). When I run this, the rules run in a different order, which causes an error because the files don't exist yet. I read that the output of one rule should match the input of the next, but that is not working.
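
As far as I understand it, Snakemake does not run rules in the order they are written: it works backwards from the requested targets and runs a rule once the files listed as its input exist. A rule whose real input is commented out, as with 'file' in extract_hits_psl above, therefore has nothing forcing it to wait. The sketch below imitates this with Python's `graphlib` (Python 3.9+). The `RULES` table is a simplified, hypothetical summary of the rules, not something Snakemake exposes.

```python
from graphlib import TopologicalSorter

# Simplified, hypothetical summary: rule -> (input kinds, output kind).
RULES = {
    "convert_to_fBAM": (["sam"], "bam"),
    "sort_fBAM": (["bam"], "sorted_bam"),
    "convert_to_fpileup": (["sorted_bam"], "pileup"),
    "SNP_calling": (["pileup"], "var"),
    "make_contig_consensus": (["var"], "fasta"),
    "BLAT_assembled": (["fasta"], "psl"),
    # The psl input is commented out in the Snakefile, so no dependency is declared.
    "extract_hits_psl": ([], "mapped_exons"),
}


def dependencies(rules):
    """Map each rule to the rules that produce its declared inputs."""
    producer = {out: name for name, (_, out) in rules.items()}
    return {
        name: {producer[i] for i in inputs if i in producer}
        for name, (inputs, _) in rules.items()
    }


deps = dependencies(RULES)
order = list(TopologicalSorter(deps).static_order())
print(order)

# With no declared inputs, extract_hits_psl depends on nothing.
print(deps["extract_hits_psl"])

# Declaring the psl file as input ties it to BLAT_assembled.
fixed = dict(RULES, extract_hits_psl=(["psl"], "mapped_exons"))
print(dependencies(fixed)["extract_hits_psl"])
```

If this is what is happening, listing the files a script actually reads as rule inputs (rather than only the script itself) might be enough to get the expected order.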

