YAML Tool Specification
This document describes how to create custom tool definitions for ShennongTools using YAML configuration files.
Overview
ShennongTools uses structured YAML files to define bioinformatics tools and their commands. This approach enables:
- Standardized tool interfaces: Consistent parameter handling across all tools
- Automatic environment management: Conda dependencies specified in YAML
- Template-based execution: Flexible shell and Python script templates
- Built-in validation: Parameter types and requirements enforced automatically
- Easy extensibility: Add new tools by creating YAML files
Basic Structure
A tool YAML file contains these top-level sections:
tool_name: string              # Unique tool identifier
description: string            # Short summary of the tool
citation: string               # DOI or citation information
environment:                   # Conda environment specification
  channels: [channel1, ...]
  dependencies:
    - pkg=version
    - pip:
        - pip_pkg
commands:                      # Individual tool commands
  command_name:
    # Command specification
Required Fields
Environment Specification
Define conda environment dependencies:
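As a sketch of what this section can look like in practice, the fragment below reuses the `environment` keys from the Basic Structure outline. The channels and the samtools pin are taken from the complete example later in this document; the pip entry (`multiqc`) is only an illustrative assumption and is not required by any tool described here.

```yaml
environment:
  channels:
    - bioconda
    - conda-forge
  dependencies:
    - samtools=1.19.2      # conda packages are pinned as pkg=version
    - pip:                 # optional nested list for pip-installed packages
        - multiqc          # illustrative assumption only
```

Pinning exact versions keeps the generated environment reproducible across machines.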
Command Definition
Each command represents a callable unit within a tool:
commands:
  view:
    description: "Extract alignments in SAM or BAM format"
    binary: samtools
    help_flag: "view --help"
    inputs:
      input:
        datatype: [bam, sam, cram]
        required: true
        description: "Input alignment file"
    outputs:
      output:
        datatype: [bam, sam]
        required: false
        description: "Output file (stdout if not specified)"
    params:
      flags:
        datatype: string
        default: ""
        description: "Additional samtools view flags"
      threads:
        datatype: integer
        default: 1
        description: "Number of threads"
    shell: >
      {{ binary }} view {{ flags }}
      {% if threads > 1 %}-@ {{ threads }}{% endif %}
      {{ input }}
      {% if output %} -o {{ output }}{% endif %}
Parameter Types
params
Additional parameters and options:
params:
  quality_threshold:
    datatype: integer
    default: 20
    description: "Minimum mapping quality"
  enable_splicing:
    datatype: boolean
    default: true
    description: "Allow spliced alignments"
  similarity:
    datatype: numeric
    default: 0.95
    description: "Similarity threshold"
  sample_name:
    datatype: string
    default: "sample"
    description: "Sample identifier"
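To illustrate how these four datatypes might be consumed, here is a hypothetical `shell` template that references the parameters above. The flags `-q`, `--no-spliced`, `--min-similarity`, and `--sample` are invented for illustration and do not belong to any real tool. The sketch also assumes that boolean parameters reach the template as true/false values that can be used in an `{% if %}` condition.

```yaml
# Hypothetical command fragment; all flag names are invented for illustration.
# Assumes boolean params are passed to the template as true/false.
shell: >
  {{ binary }}
  -q {{ quality_threshold }}
  {% if not enable_splicing %}--no-spliced{% endif %}
  --min-similarity {{ similarity }}
  --sample {{ sample_name }}
```

Because every parameter has a default, a template like this can render even when the caller supplies none of them.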
Datatype System
ShennongTools uses a unified datatype system for validation and example generation.
Global Datatypes
Defined in inst/config/datatypes.yaml:
file_types:
  fastq:
    description: "FASTQ sequence files"
    extensions: [".fastq", ".fq", ".fastq.gz", ".fq.gz"]
    example_value: "reads.fastq.gz"
  bam:
    description: "Binary Alignment Map files"
    extensions: [".bam"]
    example_value: "alignment.bam"
value_types:
  integer:
    description: "Integer numbers"
    example_value: 10
  string:
    description: "Text strings"
    example_value: "sample_name"
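Other sections of this document refer to datatypes such as `fasta`, `sam`, `numeric`, and `boolean`. Assuming they follow the same schema as the entries above, their definitions might look like the sketch below. The descriptions, extensions, and example values are assumptions and may differ from the shipped inst/config/datatypes.yaml.

```yaml
# Sketch only: entries assumed to mirror the schema shown above.
file_types:
  fasta:
    description: "FASTA sequence files"
    extensions: [".fasta", ".fa", ".fna"]
    example_value: "reference.fasta"
  sam:
    description: "Sequence Alignment Map files"
    extensions: [".sam"]
    example_value: "alignment.sam"
value_types:
  numeric:
    description: "Floating-point numbers"
    example_value: 0.95
  boolean:
    description: "Logical true/false values"
    example_value: true
```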
Template System
ShennongTools uses the jinjar templating engine, which provides Jinja-style syntax in R, to generate commands from templates.
Template Rules
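The templates in this document rely on three constructs: `{{ name }}` for value substitution, `{% if %}...{% endif %}` for optional arguments, and `{% if %}...{% else %}...{% endif %}` for choosing between alternatives. The fragment below collects them in one hypothetical command. The binary `mytool`, its flags, and the `paired` parameter are placeholders rather than a real tool, and the `inputs`, `outputs`, and `params` blocks are omitted for brevity.

```yaml
# Hypothetical command; mytool, its flags, and `paired` are placeholders.
commands:
  run:
    description: "Illustrates the template constructs used in this document"
    binary: mytool
    shell: >
      {{ binary }}
      {% if paired %}--paired{% else %}--single{% endif %}
      {% if threads > 1 %}-t {{ threads }}{% endif %}
      -i {{ input }}
      {% if output %}-o {{ output }}{% endif %}
```

The folded block scalar (`>`) lets the template span several lines in the YAML file while still describing a single command.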
Complete Example
Here’s a complete YAML file for a hypothetical alignment tool:
tool_name: myaligner
description: "Fast sequence alignment tool"
citation: "doi:10.1000/example"
environment:
  channels:
    - bioconda
    - conda-forge
  dependencies:
    - myaligner=2.1.0
    - samtools=1.19.2
commands:
  build_index:
    description: "Build alignment index from reference"
    binary: myaligner-build
    help_flag: "--help"
    inputs:
      reference:
        datatype: fasta
        required: true
        description: "Reference genome FASTA file"
    outputs:
      index:
        datatype: string
        required: true
        description: "Index name prefix"
    params:
      threads:
        datatype: integer
        default: 4
        description: "Number of threads"
      extras:
        datatype: string
        default: ""
        description: "Additional build arguments"
    shell: >
      {{ binary }} {{ reference }} {{ index }}
      -t {{ threads }} {{ extras }}
  align:
    description: "Align reads to reference"
    binary: myaligner
    help_flag: "--help"
    inputs:
      index:
        datatype: string
        required: true
        description: "Index name prefix"
      read1:
        datatype: fastq
        required: true
        description: "Forward reads"
      read2:
        datatype: fastq
        required: false
        description: "Reverse reads"
    outputs:
      output:
        datatype: sam
        required: true
        description: "Output alignment file"
    params:
      threads:
        datatype: integer
        default: 4
        description: "Number of threads"
      sensitivity:
        datatype: string
        default: "sensitive"
        description: "Alignment sensitivity (fast|sensitive|very-sensitive)"
      extras:
        datatype: string
        default: ""
        description: "Additional alignment arguments"
    shell: >
      {{ binary }} -x {{ index }}
      {% if read2 %}-1 {{ read1 }} -2 {{ read2 }}{% else %}-U {{ read1 }}{% endif %}
      -S {{ output }} -p {{ threads }}
      --{{ sensitivity }} {{ extras }}
Testing Your YAML
Before using a custom tool, test the YAML configuration:
library(ShennongTools)

# Test template rendering
template <- "{{ binary }} -i {{ input }} -o {{ output }} -t {{ threads }}"
params <- list(
  binary = "mytool",
  input = "test.fastq",
  output = "result.bam",
  threads = 4
)
rendered <- sn_test_template(template, params)
print(rendered)
# Output: mytool -i test.fastq -o result.bam -t 4
Best Practices
- Use descriptive parameter names: input1/input2 instead of in1/in2
- Provide sensible defaults: Users shouldn’t need to specify every parameter
- Include comprehensive descriptions: Help users understand each parameter
- Follow datatype conventions: Use standard datatypes when possible
- Test templates thoroughly: Use sn_test_template() to verify rendering
- Keep templates readable: Use proper indentation and spacing
- Handle optional parameters gracefully: Use conditional logic appropriately
Contributing Tool Definitions
To contribute a new tool to ShennongTools:
- Create a YAML file following this specification
- Test thoroughly with representative data
- Submit a pull request to the GitHub repository
- Include example usage in the pull request description
Your contribution will help expand the ShennongTools ecosystem for the entire bioinformatics community!