
Generate mock biological data files based on specified datatypes and parameters. This function creates realistic but minimal test files for bioinformatics workflows without requiring large example datasets in the package.

Usage

sn_generate_mockdata(
  datatype,
  output_file = NULL,
  size = "minimal",
  n_records = NULL,
  seed = 123,
  compress = NULL,
  options = list()
)

Arguments

datatype

Character. The type of file to generate (from datatypes.yaml).

output_file

Character. Path where the generated file should be saved. If NULL, a temporary file will be created in tempdir().

size

Character. Size category: "minimal" (default), "small", "medium", "large".

n_records

Integer. Number of records to generate (overrides size if specified).

seed

Integer. Random seed for reproducible generation (default: 123).

compress

Logical. Whether to compress the output file. If NULL (default), compression is auto-detected from the file extension. For fastq files, compression is enabled by default. For fasta and gtf files, compression is disabled by default.

options

List. Additional options specific to the datatype. For FASTQ files, supported options include:

- `read_length`: Integer or character. Read length in bp. Common values: 50, 75, 100, 150, 250, 300. Can also use shortcuts: "short" (50bp), "medium" (100bp), "long" (150bp), "extra_long" (250bp), "ultra_long" (300bp). Default: 150.
- `adapters`: Character or logical. Adapter type: "none", "illumina"/"truseq", "nextera", or TRUE/FALSE for backward compatibility. Default: "illumina".
- `adapter_contamination_rate`: Numeric. Fraction of reads with adapter sequences (0-1). Default: 0.35 (realistic for fresh sequencing data).
- `min_quality`: Integer. Minimum Phred quality score. Default: 25.
- `max_quality`: Integer. Maximum Phred quality score. Default: 40.
- `error_rate`: Numeric. Sequencing error rate (0-1). Default: 0.015.
- `read_type`: Character. "single", "R1", or "R2" for paired-end. Default: "single".
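As a rough illustration of how the `read_length` shortcuts and the `min_quality`/`max_quality` bounds could translate into a FASTQ record, here is a minimal base-R sketch. It is not the package's implementation: the helper names `resolve_read_length()` and `mock_fastq_record()` are hypothetical, and Phred+33 quality encoding is assumed.

```r
# Hypothetical helpers for illustration only; not part of the package API.

# Map the documented read_length shortcuts to a length in base pairs.
resolve_read_length <- function(read_length = 150) {
  shortcuts <- c(short = 50L, medium = 100L, long = 150L,
                 extra_long = 250L, ultra_long = 300L)
  if (is.character(read_length)) {
    if (!read_length %in% names(shortcuts)) {
      stop("Unknown read_length shortcut: ", read_length)
    }
    return(unname(shortcuts[read_length]))
  }
  as.integer(read_length)
}

# Build one four-line FASTQ record with random bases and qualities.
mock_fastq_record <- function(id, read_length = 150,
                              min_quality = 25, max_quality = 40) {
  n <- resolve_read_length(read_length)
  bases <- sample(c("A", "C", "G", "T"), n, replace = TRUE)
  # sample.int() avoids the sample(x:x) pitfall when min equals max.
  span <- as.integer(max_quality - min_quality + 1)
  quals <- as.integer(min_quality) + sample.int(span, n, replace = TRUE) - 1L
  c(
    paste0("@", id),
    paste(bases, collapse = ""),
    "+",
    # Assumed Phred+33 encoding: ASCII code = quality score + 33.
    intToUtf8(quals + 33L)
  )
}

set.seed(123)
record <- mock_fastq_record("read_1", read_length = "short", min_quality = 30)
writeLines(record)
```

Decoding the fourth line with `utf8ToInt(record[4]) - 33` recovers the quality scores, which is a quick way to check that a generated file respects the requested bounds.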

Value

Character. Path to the generated file, returned invisibly.
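Because the path is returned invisibly, nothing is printed when the function is called at the console, but the value can still be assigned. The base-R sketch below mimics that behaviour, together with the temporary-file fallback for `output_file = NULL`. `write_mock_file()` is a hypothetical stand-in, not a package function.

```r
# Hypothetical stand-in that mimics the documented return behaviour.
write_mock_file <- function(output_file = NULL) {
  if (is.null(output_file)) {
    # Fall back to a temporary file inside tempdir().
    output_file <- tempfile(fileext = ".txt")
  }
  writeLines("mock", output_file)
  invisible(output_file)
}

write_mock_file()          # no path is echoed at the console
path <- write_mock_file()  # assignment still captures the path
file.exists(path)
```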

Details

The function supports generating mock files for all datatypes defined in the global datatypes.yaml configuration. Common supported datatypes include:

**Sequence Files:**

- `fasta`: Nucleotide or protein sequences (properly formatted with line breaks)
- `fastq`: Raw sequencing reads with quality scores and adapters

**Alignment Files:**

- `sam`: Sequence alignment (text format)
- `bam`: Binary alignment (requires samtools for realistic headers)

**Annotation Files:**

- `gtf`: Gene annotation in GTF format
- `gff`: Gene annotation in GFF format
- `bed`: Genomic intervals in BED format

**Variant Files:**

- `vcf`: Variant calls in VCF format

**Data Tables:**

- `csv`, `tsv`, `txt`: Tabular data files
- `json`: JSON structured data
- `yaml`: YAML configuration files

**Realistic Data Generation:** The function automatically ensures that related biological files use consistent reference information:

- FASTQ reads include realistic error rates and adapters
- GTF/GFF annotations use matching chromosome names with the reference
- VCF/SAM files reference the same genome coordinates
- Multiple genes per chromosome for realistic annotations
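To make the idea of consistent reference information concrete, the base-R sketch below derives both a line-wrapped FASTA and a small GTF from one shared contig table, so the chromosome names and coordinates agree by construction. This only illustrates the principle. The contig table, the 60-character line width, the gene length of 100 bp, and the variable names are all assumptions, and the package may work differently internally.

```r
# Hypothetical sketch: both files are derived from one shared contig table.
set.seed(123)
contigs <- data.frame(name = c("chr1", "chr2"), length = c(500L, 300L))

# FASTA: one record per contig, sequence wrapped at 60 characters per line.
fasta_lines <- unname(unlist(lapply(seq_len(nrow(contigs)), function(i) {
  seq_chars <- sample(c("A", "C", "G", "T"), contigs$length[i], replace = TRUE)
  chunks <- split(seq_chars, ceiling(seq_along(seq_chars) / 60))
  c(paste0(">", contigs$name[i]),
    vapply(chunks, paste, character(1), collapse = ""))
})))

# GTF: two 100 bp genes per contig, placed inside the contig boundaries.
genes_per_contig <- 2L
gtf <- do.call(rbind, lapply(seq_len(nrow(contigs)), function(i) {
  starts <- sort(sample.int(contigs$length[i] - 100L, genes_per_contig))
  data.frame(
    seqname = contigs$name[i], source = "mock", feature = "gene",
    start = starts, end = starts + 99L,
    score = ".", strand = "+", frame = ".",
    attribute = sprintf('gene_id "%s_gene%d";', contigs$name[i],
                        seq_len(genes_per_contig))
  )
}))

# Collapse the nine GTF columns into tab-separated lines.
gtf_lines <- do.call(paste, c(gtf, sep = "\t"))

writeLines(head(fasta_lines, 3))
writeLines(gtf_lines)
```

Keeping a single source of truth for contig names and lengths means a downstream tool never encounters an annotation that points outside the reference.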

Examples

if (FALSE) { # \dontrun{
# Generate minimal mock files
sn_generate_mockdata("fasta", "test_genome.fa")
sn_generate_mockdata("fastq", "test_reads.fastq", size = "small")
sn_generate_mockdata("gtf", "test_annotation.gtf", n_records = 100)

# Generate temporary files
temp_fasta <- sn_generate_mockdata("fasta", NULL, size = "small")
temp_fastq <- sn_generate_mockdata("fastq", NULL, size = "medium")

# Generate compressed files (auto-detected or explicit)
sn_generate_mockdata("fastq", "reads.fastq.gz") # Auto-compressed
sn_generate_mockdata("vcf", "variants.vcf", compress = TRUE)

# Generate FASTQ with custom sequencing parameters
sn_generate_mockdata("fastq", "reads_150bp.fastq.gz",
  options = list(
    read_length = 150, adapters = "illumina",
    adapter_contamination_rate = 0.1, min_quality = 30
  )
)

# Generate FASTQ with different read lengths
sn_generate_mockdata("fastq", "short_reads.fq.gz",
  options = list(read_length = "short")
) # 50bp
sn_generate_mockdata("fastq", "long_reads.fq.gz",
  options = list(read_length = "extra_long")
) # 250bp

# Generate FASTQ without adapters (clean reads)
sn_generate_mockdata("fastq", "clean_reads.fq.gz",
  options = list(adapters = "none")
)

# Test adapter trimming with fastp (need to specify adapter sequences explicitly)
# sn_run("fastp", "trim", input1 = "reads_R1.fq.gz", input2 = "reads_R2.fq.gz",
#        output1 = "trimmed_R1.fq.gz", output2 = "trimmed_R2.fq.gz",
#        extras = "--adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
#                  --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT")

# Generate with custom options
sn_generate_mockdata("fasta", "proteins.fa",
  options = list(sequence_type = "protein", max_length = 200)
)

# Compatible files are generated automatically
ref <- sn_generate_mockdata("fasta", "reference.fa", size = "small")
reads_r1 <- sn_generate_mockdata("fastq", "reads_R1.fastq.gz",
  size = "medium", options = list(read_type = "R1")
)
reads_r2 <- sn_generate_mockdata("fastq", "reads_R2.fastq.gz",
  size = "medium", options = list(read_type = "R2")
)
genes <- sn_generate_mockdata("gtf", "genes.gtf", size = "small")
# These files will use consistent chromosome names and coordinates
} # }