Generate mock biological data files based on specified datatypes and parameters. This function creates realistic but minimal test files for bioinformatics workflows without requiring large example datasets in the package.
Usage
sn_generate_mockdata(
  datatype,
  output_file = NULL,
  size = "minimal",
  n_records = NULL,
  seed = 123,
  compress = NULL,
  options = list()
)
Arguments
- datatype
Character. The type of file to generate (from datatypes.yaml).
- output_file
Character. Path where the generated file should be saved. If NULL, a temporary file will be created in tempdir().
- size
Character. Size category: "minimal" (default), "small", "medium", "large".
- n_records
Integer. Number of records to generate (overrides size if specified).
- seed
Integer. Random seed for reproducible generation (default: 123).
- compress
Logical. Whether to compress the output file. If NULL (the default), compression is auto-detected from the file extension. For FASTQ files, compression is enabled by default; for FASTA and GTF files, it is disabled by default.
- options
List. Additional options specific to the datatype. For FASTQ files, supported options include:
  - `read_length`: Integer or character. Read length in bp. Common values: 50, 75, 100, 150, 250, 300. Shortcuts are also accepted: "short" (50bp), "medium" (100bp), "long" (150bp), "extra_long" (250bp), "ultra_long" (300bp). Default: 150.
  - `adapters`: Character or logical. Adapter type: "none", "illumina"/"truseq", "nextera", or TRUE/FALSE for backward compatibility. Default: "illumina".
  - `adapter_contamination_rate`: Numeric. Fraction of reads with adapter sequences (0-1). Default: 0.35 (realistic for fresh sequencing data).
  - `min_quality`: Integer. Minimum Phred quality score. Default: 25.
  - `max_quality`: Integer. Maximum Phred quality score. Default: 40.
  - `error_rate`: Numeric. Sequencing error rate (0-1). Default: 0.015.
  - `read_type`: Character. "single", "R1", or "R2" for paired-end. Default: "single".
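To make the FASTQ options more concrete, here is a minimal base-R sketch of how the documented `read_length` shortcuts and Phred quality bounds could be interpreted. The helper names `resolve_read_length()`, `phred_to_ascii()`, and `phred_to_prob()` are hypothetical and not part of the package. The Phred+33 quality encoding is the common FASTQ convention and is assumed here; the source does not state which encoding the function uses.

```r
# Hypothetical helper: map the documented shortcuts to read lengths in bp.
resolve_read_length <- function(read_length = 150) {
  shortcuts <- c(short = 50, medium = 100, long = 150,
                 extra_long = 250, ultra_long = 300)
  if (is.character(read_length)) {
    if (!read_length %in% names(shortcuts)) {
      stop("Unknown read_length shortcut: ", read_length)
    }
    return(unname(shortcuts[read_length]))
  }
  as.numeric(read_length)
}

# Hypothetical helper: encode Phred scores as a FASTQ quality string
# (Phred+33 offset assumed).
phred_to_ascii <- function(q) intToUtf8(as.integer(q) + 33L)

# Error probability implied by a Phred score: p = 10^(-Q / 10).
phred_to_prob <- function(q) 10^(-q / 10)

resolve_read_length("extra_long")  # 250
phred_to_ascii(c(25, 40))          # ":I"
phred_to_prob(c(25, 40))           # approximately 0.00316 and 0.0001
```

Under these assumptions, the default quality range of 25 to 40 corresponds to per-base error probabilities between roughly 0.3% and 0.01%.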
Details
The function supports generating mock files for all datatypes defined in the global datatypes.yaml configuration. Common supported datatypes include:
**Sequence Files:**
- `fasta`: Nucleotide or protein sequences (properly formatted with line breaks)
- `fastq`: Raw sequencing reads with quality scores and adapters

**Alignment Files:**
- `sam`: Sequence alignment (text format)
- `bam`: Binary alignment (requires samtools for realistic headers)

**Annotation Files:**
- `gtf`: Gene annotation in GTF format
- `gff`: Gene annotation in GFF format
- `bed`: Genomic intervals in BED format

**Variant Files:**
- `vcf`: Variant calls in VCF format

**Data Tables:**
- `csv`, `tsv`, `txt`: Tabular data files
- `json`: JSON structured data
- `yaml`: YAML configuration files

**Realistic Data Generation:** The function automatically ensures that related biological files use consistent reference information:
- FASTQ reads include realistic error rates and adapters
- GTF/GFF annotations use matching chromosome names with the reference
- VCF/SAM files reference the same genome coordinates
- Multiple genes per chromosome for realistic annotations
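One way to verify the chromosome-name consistency described here is to compare the sequence names in a FASTA file with the first column of a GTF file. The sketch below uses only base R and small hand-written stand-in files; the helper names `fasta_seqnames()` and `gtf_seqnames()` and the file contents are illustrative assumptions, not output of `sn_generate_mockdata()`.

```r
# Hypothetical helper: sequence names from FASTA header lines
# (the text after ">" up to the first whitespace).
fasta_seqnames <- function(path) {
  lines <- readLines(path)
  headers <- lines[startsWith(lines, ">")]
  sub("^>([^[:space:]]+).*$", "\\1", headers)
}

# Hypothetical helper: unique seqname values (column 1) from a GTF file,
# skipping empty lines and "#" comment lines.
gtf_seqnames <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  fields <- strsplit(lines, "\t", fixed = TRUE)
  unique(vapply(fields, function(x) x[1], character(1)))
}

# Stand-in files for illustration only.
fa <- tempfile(fileext = ".fa")
gtf <- tempfile(fileext = ".gtf")
writeLines(c(">chr1 mock", "ACGTACGT", ">chr2 mock", "TTGGCCAA"), fa)
writeLines(c(
  "#!genome-build mock",
  "chr1\tmock\tgene\t1\t8\t.\t+\t.\tgene_id \"g1\";",
  "chr2\tmock\tgene\t2\t7\t.\t-\t.\tgene_id \"g2\";"
), gtf)

# Every chromosome used by the annotation should exist in the reference.
absent <- setdiff(gtf_seqnames(gtf), fasta_seqnames(fa))
length(absent) == 0  # TRUE for these stand-in files
```

The same check could be pointed at files produced by the generator, assuming they follow the standard FASTA and GTF layouts.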
See also
Other mock data generation: sn_cleanup_mockdata_examples(), sn_generate_mockdata_batch(), sn_generate_rnaseq_dataset(), sn_get_example_value_with_mockdata()
Examples
if (FALSE) { # \dontrun{
# Generate minimal mock files
sn_generate_mockdata("fasta", "test_genome.fa")
sn_generate_mockdata("fastq", "test_reads.fastq", size = "small")
sn_generate_mockdata("gtf", "test_annotation.gtf", n_records = 100)
# Generate temporary files
temp_fasta <- sn_generate_mockdata("fasta", NULL, size = "small")
temp_fastq <- sn_generate_mockdata("fastq", NULL, size = "medium")
# Generate compressed files (auto-detected or explicit)
sn_generate_mockdata("fastq", "reads.fastq.gz") # Auto-compressed
sn_generate_mockdata("vcf", "variants.vcf", compress = TRUE)
# Generate FASTQ with custom sequencing parameters
sn_generate_mockdata("fastq", "reads_150bp.fastq.gz",
  options = list(
    read_length = 150, adapters = "illumina",
    adapter_contamination_rate = 0.1, min_quality = 30
  )
)
# Generate FASTQ with different read lengths
sn_generate_mockdata("fastq", "short_reads.fq.gz",
  options = list(read_length = "short")
) # 50bp
sn_generate_mockdata("fastq", "long_reads.fq.gz",
  options = list(read_length = "extra_long")
) # 250bp
# Generate FASTQ without adapters (clean reads)
sn_generate_mockdata("fastq", "clean_reads.fq.gz",
  options = list(adapters = "none")
)
# Test adapter trimming with fastp (need to specify adapter sequences explicitly)
# sn_run("fastp", "trim", input1 = "reads_R1.fq.gz", input2 = "reads_R2.fq.gz",
# output1 = "trimmed_R1.fq.gz", output2 = "trimmed_R2.fq.gz",
# extras = "--adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
# --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT")
# Generate with custom options
sn_generate_mockdata("fasta", "proteins.fa",
  options = list(sequence_type = "protein", max_length = 200)
)
# Compatible files are generated automatically
ref <- sn_generate_mockdata("fasta", "reference.fa", size = "small")
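# Set read_type explicitly so the two files are generated as R1 and R2 mates
reads_r1 <- sn_generate_mockdata("fastq", "reads_R1.fastq.gz",
  size = "medium", options = list(read_type = "R1")
)
reads_r2 <- sn_generate_mockdata("fastq", "reads_R2.fastq.gz",
  size = "medium", options = list(read_type = "R2")
)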
genes <- sn_generate_mockdata("gtf", "genes.gtf", size = "small")
# These files will use consistent chromosome names and coordinates
} # }
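To check whether an output file was actually written compressed (for example, after relying on extension-based auto-detection), one option is to inspect the gzip magic bytes. This base-R sketch writes its own stand-in files rather than calling the package, and `is_gzipped()` is a hypothetical helper name, not a package function.

```r
# Hypothetical helper: TRUE if the file starts with the gzip magic bytes 0x1f 0x8b.
is_gzipped <- function(path) {
  # raw = TRUE asks R to read the bytes as stored, without decompressing.
  con <- file(path, "rb", raw = TRUE)
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 2)
  identical(magic, as.raw(c(0x1f, 0x8b)))
}

# Stand-in FASTQ record, written once compressed and once as plain text.
record <- c("@read1", "ACGT", "+", "IIII")
gz_path <- tempfile(fileext = ".fastq.gz")
txt_path <- tempfile(fileext = ".fastq")

con <- gzfile(gz_path, "w")
writeLines(record, con)
close(con)
writeLines(record, txt_path)

is_gzipped(gz_path)   # TRUE
is_gzipped(txt_path)  # FALSE
```

Checking the file contents rather than the file name avoids relying on any assumption about how the extension and the `compress` argument interact.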