| Title: | Tidy Multilocus Amplicon Genotypes |
|---|---|
| Description: | Variant determination and genotyping from high throughput sequences from multilocus amplicon libraries, typically sequenced in Illumina MiSeq or similar. It provides a set of core functions for the central steps: demultiplex by locus, truncate reads, variant calling, and genotype calling. Additionally, it provides a set of functions for diagnosis and estimation of best running parameters and multiple extensions for genotype/variants manipulation and reformatting. Output variants and genotypes are output in 'tidy' format, thus facilitating reformatting, manipulation and potential connection to other R packages. |
| Authors: | Miguel Camacho [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-6385-7963>), Jennifer Leonard [fnd] |
| Maintainer: | Miguel Camacho <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.8 |
| Built: | 2026-05-15 12:28:24 UTC |
| Source: | https://github.com/csmiguel/tidygenr |
add allele_no to genotypes
add_allele_no(gen)add_allele_no(gen)
gen |
genotypes |
Performs pairwise alignments between DNA sequence variants contained in a
tidyGenR variants table and a reference FASTA database using
DECIPHER::AlignPairs(). The best matching reference sequence(s) for
each variant are returned and appended as additional columns to the variants
table.
align_variants_ref(variants, ref_fasta, n_best = 1, reduced = TRUE, ...)align_variants_ref(variants, ref_fasta, n_best = 1, reduced = TRUE, ...)
variants |
A tidyGenR variants table containing DNA sequence variants. |
ref_fasta |
Path to a FASTA file containing reference DNA sequences. |
n_best |
Integer specifying the number of top-scoring alignments to
retain for each query sequence. Defaults to |
reduced |
Logical. If |
... |
Additional arguments passed to
|
This function is intended for exploratory annotation of sequence variants, such as matching alleles against reference haplotypes, species databases, or known locus variants.
Reference sequences are read using
Biostrings::readDNAStringSet(). Variant sequences are extracted from
the tidyGenR variants object using tidy2sequences() and aligned
against all reference sequences using pairwise alignment.
Query sequences are internally identified using MD5 hashes to facilitate unambiguous tracking across the workflow.
A tidy variants table with additional columns containing alignment information for the best matching reference sequence(s).
data("variants") ref_al <- system.file("extdata/reference_alleles.fasta", package = "tidyGenR") align_variants_ref(variants, ref_al)data("variants") ref_al <- system.file("extdata/reference_alleles.fasta", package = "tidyGenR") align_variants_ref(variants, ref_al)
Reads AmpliSAT results as tidy variants.
amplisas2tidy(fp)amplisas2tidy(fp)
fp |
Character vector with paths to 'txt' files from AmpliSAT results. E.x. from 'allseqs', 'filtered', 'clustered'. |
Converts text files in folder with AmpliSAS (Sebastian et al. 2016) results into a tidy variants dataframe. MD5 is regenerated as the one in AmpliSAT does not coincide with the MD5 associated with the DNA sequence.
Tidy variants (see 'variant_call()”).
Sebastian et al. (2016). AMPLISAS: a web server for multilocus genotyping using next‐generation amplicon sequencing data. Molecular Ecology Resources, 16(2), 498-510.
fp <- list.files(system.file("extdata", "amplisas", package = "tidyGenR"), full.names = TRUE ) amplisas2tidy(fp)fp <- list.files(system.file("extdata", "amplisas", package = "tidyGenR"), full.names = TRUE ) amplisas2tidy(fp)
Check input raw FASTA/Q files
check_raw_reads(freads, rreads = NULL, low_readcount = 10)check_raw_reads(freads, rreads = NULL, low_readcount = 10)
freads |
Character vector with file paths to forward reads. |
rreads |
Character vector with file paths to reverse reads. |
low_readcount |
Numeric threshold to warn on number of reads in FASTA (DEFAULT = 10). |
Multiple checks on raw input FASTQ:
Reads in FASTQ above threshold.
Unique sample names.
Not orphan F/R files for paired reads.
For single-end data check 3 is excluded and NULL is returned in checks[4].
List with 4 elements:
'f_reads', character vector with basename of forward reads.
'r_reads', character vector with basename of reads reads or NULL for single-end experiments.
'snames', character vector with sample names.
'checks', Logical vector with passed checks 1-3.
freads <- list.files(system.file("extdata", "raw", package = "tidyGenR"), pattern = "1.fastq.gz", full.names = TRUE) check_raw_reads(freads)freads <- list.files(system.file("extdata", "raw", package = "tidyGenR"), pattern = "1.fastq.gz", full.names = TRUE) check_raw_reads(freads)
All FASTQ have a number of reads above threshold.
check1_lowcount(list_fp, low_readcount = low_readcount)check1_lowcount(list_fp, low_readcount = low_readcount)
list_fp |
List of length 1 or 2 character vectors with absolute F (and R) file paths. |
low_readcount |
'low_readcount' from 'check_raw_reads()'. |
Logical, TRUE if test is passed. Warning if FALSE.
Sample names are unique.
check2_unique_snames(snames = snames)check2_unique_snames(snames = snames)
snames |
List of length 1 or 2 character vectors with sample names. |
Logical, TRUE if test is passed. STOPs if FALSE.
Not orphan F/R FASTQ files.
check3_non_orphan(list_fp, snames = snames)check3_non_orphan(list_fp, snames = snames)
list_fp |
List of length 1 or 2 character vectors with absolute F (and R) file paths. |
snames |
List of length 1 or 2 character vectors with sample names. |
Logical, TRUE if test is passed. Warning if FALSE. NULL for single end reads.
Comparision between multiple tidy variants or genotypes.
compare_calls(td, dest_file = FALSE, creads = FALSE)compare_calls(td, dest_file = FALSE, creads = FALSE)
td |
List of named tidy dataframes with at least 'sample', 'locus', 'sequence'. |
dest_file |
Path to EXCEL file to write results. Default (FALSE), no file is written. |
creads |
Account for reads. dataframes in 'td' must have 'reads' column with the number of reads supporting each variant. If FALSE (default) distances between calls and plots are produced on the basis of presence (1)/absense (0) of the variant. If TRUE, read count are accounted for estimating distances and producing the plots. Default (FALSE). |
A list of 4 elements:
tidy_dat: combined tidy number of reads for all elements in 'td'. If 'creads = FALSE', cells are coded as '1' (presence) and '0' (absence).
variants: list with dataframes in wide format with all variants as columns. Variants are named 'locus_{first 6 characters of md5 of DNA seq}'. Zeroes '0' are included where no variant was found. If 'creads = FALSE', cells are coded as '1' (presence) and '0' (absence)
comp: T/F matrix. samples x variants. TRUE, the cellij across all matrices has the same value; FALSE, any of them has a different value.
dist: pairwise distances between matrices (analogue to Manhattan distance).
plot1: heatmap with frequency of calls (>1 == 1) across elements compared.
plot2: heatmap with calls for all elements compared. Black tiles indicate calls absent in the element but present in another. element.
plot3: boxplot of reads ordered by frecuency. (output only if 'creads' == TRUE).
plot4: MDS plot (output only when length(td) > 2).
(it writes an EXCEL file with the results to 'dest_file'.)
data("variant_calls") compare_calls(variant_calls, creads = TRUE)data("variant_calls") compare_calls(variant_calls, creads = TRUE)
Create cutadapt command
cutadapt_command( mode, interpreter, samples_file, cutadapt, extraArgs, overlap, e, g, G, o, p, readsfw, readsrv, log_out )cutadapt_command( mode, interpreter, samples_file, cutadapt, extraArgs, overlap, e, g, G, o, p, readsfw, readsrv, log_out )
mode |
"pe", paired-end; "se", single-end; or "linked" (beta, non-tested), for linked primers. |
interpreter |
Path to interpreter. |
samples_file |
Text file with sample names. |
cutadapt |
Path to cutadapt executable. |
extraArgs |
Other arguments for cutadapt (eg "–max-n 0 –max-expected-errors 3 –minimum-length=20"). |
overlap |
–overlap (see cutadapt documentation). |
e |
-e (see cutadapt documentation). |
g |
Fasta file with forward primers. |
G |
Fasta file with reverse primers. |
o |
Path to demultiplexed forward reads. |
p |
Path to demultiplexed reverse reads. |
readsfw |
Path to input forward reads. |
readsrv |
Path to input reverse reads. |
log_out |
Path to write cutadapt log file. |
cutadapt command based on the 'mode" ('pe', 'se', 'linked').
Sequencing reads are demultiplexed by locus into separate FASTQ files
(locus.sample.[1|2].fastq.gz) based on the locus-specific primer
sequences using cutadapt.
demultiplex( interpreter = "/bin/bash", cutadapt = system2("which", "cutadapt", stdout = TRUE), freads, rreads = NULL, primers = NULL, sh_out = tempfile(fileext = ".sh"), outdir = tempdir(), log_out = tempfile(fileext = ".log"), temp_folder = tempdir(), mode = "pe", overlap = 15, e = 0.15, extraArgs = "", run = TRUE )demultiplex( interpreter = "/bin/bash", cutadapt = system2("which", "cutadapt", stdout = TRUE), freads, rreads = NULL, primers = NULL, sh_out = tempfile(fileext = ".sh"), outdir = tempdir(), log_out = tempfile(fileext = ".log"), temp_folder = tempdir(), mode = "pe", overlap = 15, e = 0.15, extraArgs = "", run = TRUE )
interpreter |
Path to interpreter. |
cutadapt |
Path to cutadapt executable. |
freads |
Character vector with file paths to forward reads. |
rreads |
Character vector with file paths to reverse reads. |
primers |
Dataframe with primers |
sh_out |
File name for cutadapt command. |
outdir |
Directory to write demultiplexed FASTQ files. Created if it does not exist. |
log_out |
Path to write cutadapt log file. |
temp_folder |
Directory to save temp files. |
mode |
"pe", paired-end; "se", single-end; or "linked" (beta, non-tested), for linked primers. |
overlap |
–overlap (see cutadapt documentation). |
e |
-e (see cutadapt documentation). |
extraArgs |
Other arguments for cutadapt (eg "–max-n 0 –max-expected-errors 3 –minimum-length=20"). |
run |
T/F, whether to run the script (T) or just write it (F). |
This function creates a bash script to run cutadapt. By turning on running 'run = T', the function will try to execute the bash script produced (written to 'sh_out'). If the function produces any errors it is recommended to turn off running 'run = FALSE' and inspect the bash script produced to detect mis-specified cutadapt arguments or erroneous paths, and to solve them using the cutadapt documentation.
Demultiplexed 'sample.locus.[12].fastq.gz'.
data("primers") freads <- list.files(system.file("extdata", "raw", package = "tidyGenR"), pattern = "1.fastq.gz", full.names = TRUE) rreads <- list.files(system.file("extdata", "raw", package = "tidyGenR"), pattern = "2.fastq.gz", full.names = TRUE) demultiplex( freads = freads, rreads = rreads, primers = primers, run = FALSE)data("primers") freads <- list.files(system.file("extdata", "raw", package = "tidyGenR"), pattern = "1.fastq.gz", full.names = TRUE) rreads <- list.files(system.file("extdata", "raw", package = "tidyGenR"), pattern = "2.fastq.gz", full.names = TRUE) demultiplex( freads = freads, rreads = rreads, primers = primers, run = FALSE)
Reads FASTQ files and computes frequency of unique sequences per sample. Depth for unique sequences are organized in samples x unique sequences tables. Unique Ssequences are ordered in descending frequency.
dereplicate( fs, min_sam_fr = 2, min_loc_fr = 0.001, by = "_([a-zA-Z0-9]*_[F|R])", out_xlsx = NULL )dereplicate( fs, min_sam_fr = 2, min_loc_fr = 0.001, by = "_([a-zA-Z0-9]*_[F|R])", out_xlsx = NULL )
fs |
Character vector with paths to all FASTQ to de-replicate. |
min_sam_fr |
Numeric. Minimum number of sequence counts in a sample to be retained (cell number). |
min_loc_fr |
Numeric. Minimum frequency of de-replicated sequence in the group to be retained.
If |
by |
Regex pattern to group FASTQ files in the list. Passed to |
out_xlsx |
File name to write tables with de-replicated sequences (Default: NULL; no file is written). |
The by parameter allows flexible grouping of files in the list. However, the results are not added within each group; individual results for each sample are always returned. For instance, given 3 files s1_loc1_F.fq, s1_loc1_R.fq and s2_loc1_F.fq:
"([a-zA-Z0-9]*_[a-zA-Z0-9]*)", returns s1_loc1 and s2_loc2.
"_([a-zA-Z0-9]*)_", returns loc1 and loc2.
"([a-zA-Z0-9]*_[F|R])", returns loc1_F, loc1_R and loc2_F.
The min_sam_fr and min_loc_fr filters drop data not passing the filters, so they will become
zero or absent when combined.
If a path to an EXCEL file is set, each element in the list is written to a
different sheet in the workbook.
List of extracted groups (see 'by'). Each element is the list is a dataframe with:
column 1: 'sequence', DNA sequence of read.
column 2: 'md5', md5 hash of DNA sequence.
column >= 3: frequency (integers) of sequences per sample passing 'min_sam_fr' and 'min_loc_fr' filters.
fq <- list.files(system.file("extdata", "truncated", package = "tidyGenR"), pattern = "fastq.gz", full.names = TRUE) dereplicate(fq)fq <- list.files(system.file("extdata", "truncated", package = "tidyGenR"), pattern = "fastq.gz", full.names = TRUE) dereplicate(fq)
Computed as the sum of absolute difference across cells. Analogue to Manhattan distance.
dist_m(m1, m2, prop = FALSE)dist_m(m1, m2, prop = FALSE)
m1 |
Matrix 1. |
m2 |
Matrix 2. |
prop |
If TRUE, it divides number of differences by length(matrix). |
Numeric distance, analogue to Manhattan distance.
Requires a list of matrices in which the first column are samples, and col2...coln are numeric. Ex. number of observed variants in each cell.
dist_m_all(ld)dist_m_all(ld)
ld |
List of dataframes. |
Dataframe with pairwise distances between all combinations. Absolute distance 'dist_Manh' and relative 'dist_Manhp'.
DADA2 is run with a set of desired parameters for a set of FASTQ files. The element
'clustering' output by dada2::dada() containing all relevant statistics
from cluster formation are output in tidy format as the first element. These
tidy results are plotted in 4 different ways.
explore_dada( fs, sample_locus = "(^[a-zA-Z0-9]*)_([a-zA-Z0-9]*)", value_na = 10, reduced = TRUE, omega_a = 0.9, band_size = 16, pool = FALSE, vline = NULL, hline_fr = NULL, p_titles = NULL )explore_dada( fs, sample_locus = "(^[a-zA-Z0-9]*)_([a-zA-Z0-9]*)", value_na = 10, reduced = TRUE, omega_a = 0.9, band_size = 16, pool = FALSE, vline = NULL, hline_fr = NULL, p_titles = NULL )
fs |
Character vector with full paths to FASTQ files. |
sample_locus |
Regex expression with groups to extract sample (group 1) and loci (group 2) from "(^[a-zA-Z0-9])_([a-zA-Z0-9])". |
value_na |
Numeric to replace 'NA' or infinite values assigned to 'pval' or 'birth_pval' in clustering element from 'dada-class'. |
reduced |
If TRUE, a reduced number of columns is returned. If FALSE, all columns from from 'dada-class' clustering are returned. |
omega_a |
"OMEGA_A" passed to |
band_size |
"BAND_SIZE" passed to |
pool |
Passed to |
vline |
Numeric x-intersection to annotate in plots p1 and p3. |
hline_fr |
Numeric y-intersection to annotate 'frequency' in plots. |
p_titles |
character(4) with plot names. Passed to 'ggtitle()' in plots p1:p4. |
'OMEGA_A', 'BAND_SIZE' and 'pool = T/F' parameters can have a strong
effect in the variants called by dada2::dada(). This function explores the
relation of read count, relative frequency of the variants and the
'birth_pval' assigned to the clusters for any given values of starting
parameters. Critical parameters 'pool' and 'BAND_SIZE' can be specified as
arguments. Additional parameters can be set with dada2::setDadaOpt().
The 'log(-log(birth_pval))' proposed in Rosen et al. (2012) is computed and
represented in plots. For represenation purposes, a virtually infinite value
of 'log(-log(birth_pval)) = 10', is assigned by default to the first cluster
and to 'birth_pval = 0'. Variants in each, sample/locus combination are
ranked by abundance and plotted in the legend. Ranks >= 3 are
named as "3". For biallelic markers, a rank >= 3 implies a likely false
positive. This, visual representation can be used to decide to tune
variant_call().
omega_a: threshold for variants to be significant overabundant
'log(-log(birth_pval))' (see Rosen et al. 2012). For exploration, it is
recommended to run explore_dada() with a large omega_a.
band_size: positive numbers set a band size in Needleman-Wunsch alignments. In this context, ends free alignment is performed. Zero turns off banding, triggering full Needleman-Wunsch alignments, in which gapless alignment is performed (see issue).
pool: calling variants pooling samples can increase sensitivity (see dicussion).
List with tidy 'dada-class' clustering element and plots.
tidy_dada: tidy 'clustering' element from 'dada-class' merged across loci.
p1: plot 1, frequency of variants (sample x locus) against 'log(-log(birth_pval))'.
p2: plot 2, read count of variants against their frequency.
p3: plot 3, p1 facetted by locus.
p4: plot 4, p2 facetted by locus.
Rosen et al. (2012). Denoising PCR-amplified metagenome data. BMC Bioinformatics, 13(1).
fq <- list.files(system.file("extdata", "truncated", package = "tidyGenR"), pattern = "F_filt.fastq.gz", full.names = TRUE) explore_dada(fq, value_na = 10, reduced = TRUE, pool = FALSE, vline = 2, hline_fr = 0.1, omega_a = 0.9, band_size = 16 )fq <- list.files(system.file("extdata", "truncated", package = "tidyGenR"), pattern = "F_filt.fastq.gz", full.names = TRUE) explore_dada(fq, value_na = 10, reduced = TRUE, pool = FALSE, vline = 2, hline_fr = 0.1, omega_a = 0.9, band_size = 16 )
Filter variants based on frequency and depth
filter_variants(tidy_var, maf = 0, ad = 0, invert = FALSE)filter_variants(tidy_var, maf = 0, ad = 0, invert = FALSE)
tidy_var |
Dataframe or tibble with tidy variants. |
maf |
Miminum variant frequency. For each sample, variants with a proportion of number of reads across all variants > maf are retained. |
ad |
Allele depth. Minimum number of reads supporting a variant. Variants supported by > ad reads are retained. |
invert |
FALSE (default) returns variants passing filters; TRUE inverts the selection. |
Tidy tibble with filtered variants.
dataframe with filtered variants/alleles. Variants are named so column names in df are "locus", "sample", "sequence", "reads", "allele".
data("variants") filter_variants(variants, ad = 100, maf = 0.2)data("variants") filter_variants(variants, ad = 100, maf = 0.2)
Performs straight genotyping from variants based on ploidy, ADt.
genotype(variants, ploidy = 2, ADt = 10) geno_direct(variants, ploidy, ADt)genotype(variants, ploidy = 2, ADt = 10) geno_direct(variants, ploidy, ADt)
variants |
Tidy variants. |
ploidy |
'1', force haploid genotypes; '2', force diploid genotypes; 'poly', all variants are assumed to be real alleles. |
ADt |
Threshold of AD to discriminate hemi/homo-zygotes. |
For 'ploidy':
'1', force haploid genotypes.
'2', force diploid genotypes. If number of alleles is == 1, ADt criterium is applied to differentiate between "homozygotes" and "hemizygotes".
'poly', all variants are assumed to be real alleles. No ADt filter is applied.
Tidy genotypes. Dataframe with tidy genotypes. 'sample, 'locus', 'allele', 'allele_no', 'reads', 'nt', 'md5', 'sequence'. reads, supporting allele copies are corrected: 'reads' from tidy variants are divided between resulting alleles in homozygotes and hemizygotes.
data("variants") genotype(variants[1:100,], ADt = 10, ploidy = 2 )data("variants") genotype(variants[1:100,], ADt = 10, ploidy = 2 )
Conversion of genotype data between formats.
gen_tidy2compact, from tidy to compact.
gen_compact2wide, from compact to wide.
gen_tidy2wide, from tidy into wide.
gen_wide2structure, from wide to STRUCTURE-formatted text file.
gen_tidy2genalex, from tidy to GENALEX-formatted xlsx or text file.
gen_tidy2integers, recode 'allele' in tidy genotypes as integers.
gen_tidy2compact(gen, delim = "/") gen_compact2wide(gen) gen_tidy2wide(gen) gen_wide2structure(gen, write_out, popdata = FALSE, delim = "/") gen_tidy2integers(gen) gen_tidy2genalex(gen, popdata, write_out, data_name = "dataset1")gen_tidy2compact(gen, delim = "/") gen_compact2wide(gen) gen_tidy2wide(gen) gen_wide2structure(gen, write_out, popdata = FALSE, delim = "/") gen_tidy2integers(gen) gen_tidy2genalex(gen, popdata, write_out, data_name = "dataset1")
gen |
Genotypes: 'tidy' in 'gen_tidy2compact()', 'gen_tidy2wide()', 'gen_tidy2genalex()', and gen_tidy2integers()'; 'compact' in 'gen_compact2wide()'; 'wide' in 'gen_wide2structure()'. |
delim |
Allele delimiter in genotype calls. Default "/". E.x. "AA/BB". |
write_out |
File to write structure formatted genotypes (gen_wide2structure), or GENALEX-formatted genotypes (in 'gen_tidy2genalex'). In 'gen_tidy2genalex', genotypes are written to tab-delimited "txt" or "xlsx", depending on the extension ("txt", "xlsx"). |
popdata |
Dataframe with 'sample' and 'population' columns. Populations are added to STRUCTURE output. If FALSE (Default), popdata is not added to STRUCTURE output. Mandatory in 'gen_tidy2genalex'. |
data_name |
Name of dataset to print to GENALEX xlsx. |
Genotypes from genotype() are returned as tidy data. Tidy data implies one data point per row. Each row from tidy genotypes represents an allele call for a given sample and locus. Rows with missing data are excluded. Thus, tidy genotypes contain at least the three columns: sample, locus and allele; bit often contain more columns (eg, reads, allele_no, nt, sequence, md5, population, etc.). The tidy format can be expanded column-wise and row-wise without altering the data structure, and it can be easily manipulated. Some of the more handy functions here-included involve genotype conversion between different formats:
compact, each row correspond to the genotype call for a given locus in a sample. Therefore, it only contains three columns, sample, locus and genotype. Default separator for alleles under genotype is '/'. For diploid data, hemizygotes are recoded as "allele1" instead of "allele1/NA". All other column metadata is lost.
wide, genotypes are recoded in samples (rows) x loci (columns) dataframe. The first column samples, contain sample names. All other columns contain loci names. Each cell contains genotypes in the A/B format. Diploid genotypes are coded as A/B. Cells with missing data have NA_character.
For the STRUCTURE format, '-9' are introduced as missing data in STRUCTURE output. In hemizygous calls A, the missing allele is encoded as missing data -9. Ploidy is retrieved from attributes. Only, diploid genotypes are allowed.
gen_tidy2compact, compact genotypes with at least 'sample' ,'locus' and 'genotype' columns.
gen_compact2wide, wide genotypes.
gen_tidy2wide, wide genotypes.
gen_wide2structure, plain text file formatted for STRUCTURE.
gen_tidy2genalex, 'xls' file with genotype data for GENALEX.
gen_tidy2integers, tidy genotypes with alleles recoded as integers (1..n).
data("genotypes") gencom <- gen_tidy2compact(genotypes) gen_compact2wide(gencom) gen_tidy2wide(genotypes) gen_tidy2integers(genotypes) # read metadata for sample populations gen_str <- tempfile(fileext = ".str") meta <- read.csv(system.file("extdata/metadata.csv", package = "tidyGenR")) gen_wide2structure(gen_tidy2wide(genotypes), write_out = gen_str, popdata = meta ) genalex_txt <- tempfile(fileext = ".txt") gen_tidy2genalex(genotypes, popdata = meta, write_out = genalex_txt, )data("genotypes") gencom <- gen_tidy2compact(genotypes) gen_compact2wide(gencom) gen_tidy2wide(genotypes) gen_tidy2integers(genotypes) # read metadata for sample populations gen_str <- tempfile(fileext = ".str") meta <- read.csv(system.file("extdata/metadata.csv", package = "tidyGenR")) gen_wide2structure(gen_tidy2wide(genotypes), write_out = gen_str, popdata = meta ) genalex_txt <- tempfile(fileext = ".txt") gen_tidy2genalex(genotypes, popdata = meta, write_out = genalex_txt, )
Filtered genotypes of 20 loci and 91 samples of Rattus baluensis used in Camacho-Sanchez et al. (2026).
tidy dataframe or tibble.
Sample name.
Locus name.
Allele name.
Allele ordinal number.
Number of reads supporting allele.
Length in number of nucleotides of allele sequence.
MD5 hash of allele sequence.
DNA sequence of allele.
Plot MDS from distances between compared calls
mds_comp(df)mds_comp(df)
df |
Dataframe with pairwise distances from 'dist_m()'. One row per unique comparison. |
It takes a multiple sequence alignment and data to create a NEXUS file which can be read by PopART (https://popart.maths.otago.ac.nz).
out_popart( msa, x, blocks = c("DATA", "TRAITS"), sname, xgroups, outnex = tempfile(fileext = ".nex") )out_popart( msa, x, blocks = c("DATA", "TRAITS"), sname, xgroups, outnex = tempfile(fileext = ".nex") )
msa |
Multiple sequence alignment as |
x |
Dataframe with trait data. |
blocks |
NEXUS blocks to add to file. Either |
sname |
Column name in 'x' with sequence names which must be
identical to sequence names in |
xgroups |
Column name in |
outnex |
Path to write NEXUS file. |
No return value. Called for its side effect of writing a NEXUS file to
outnex. Invisibly returns NULL.
data("genotypes") x <- dplyr::filter(genotypes, locus == "abcg8") nr <- seq_len(nrow(x)) y <- dplyr::mutate(x, sname = paste0("seq_", nr)) msa <- tidy2sequences(y, fasta_header = "{sname}") |> DECIPHER::AlignSeqs() out_popart(msa, y, sname = "sname", xgroups = "sample", blocks = c("DATA", "TRAITS") )data("genotypes") x <- dplyr::filter(genotypes, locus == "abcg8") nr <- seq_len(nrow(x)) y <- dplyr::mutate(x, sname = paste0("seq_", nr)) msa <- tidy2sequences(y, fasta_header = "{sname}") |> DECIPHER::AlignSeqs() out_popart(msa, y, sname = "sname", xgroups = "sample", blocks = c("DATA", "TRAITS") )
A dataframe containing 27 primer pairs used to amplify intron loci in Rattus baluensis (Igea et al. 2010; Camacho-Sanchez et al. 2018).
A data frame with 27 rows and 3 variables.
Locus name.
5' to 3' forward primer sequence.
5' to 3' reverse primer sequence.
Igea et al. (2010); Camacho-Sanchez et al. (2018).
Camacho-Sanchez et al. (2018). Interglacial refugia on tropical mountains: novel insights from the summit rat (Rattus baluensis), a Borneo mountain endemic. Diversity and Distributions.
Igea et al. (2010) Novel intron markers to study the phylogeny of closely related mammalian species. BMC Evolutionary Biology.
Creates dataframe with reads distribution across samples (rows) and loci (columns).
reads_loci_samples( path, pattern_fq = "1.fastq.gz", sample_locus = "(^[a-zA-Z0-9]*).([a-zA-Z0-9]*)", all.variants = FALSE, var_id = "md5" )reads_loci_samples( path, pattern_fq = "1.fastq.gz", sample_locus = "(^[a-zA-Z0-9]*).([a-zA-Z0-9]*)", all.variants = FALSE, var_id = "md5" )
path |
Folder with FASTQ files or tidy variants. |
pattern_fq |
Pattern to select FASTQ files from the folder |
sample_locus |
Sample and locus patterns to extract from file names. Regex group 1 is assumed to be sample name and group 2 locus name. By default, both groups are a alphanumeric string of any length separated by any non-alphanumeric character. |
all.variants |
(T/F) TRUE, one column per variant; F, one column per locus. Only for tidy variants. |
var_id |
'c('allele', 'md5', 'sequence')' column to use for naming variant. |
Dataframe with columns being loci and rows being samples. Samples are in first column.
reads_loci_samples( path = system.file("extdata", "truncated", package = "tidyGenR"), pattern_fq = "F_filt.fastq.gz") # with variants dataframe data("variants") reads_loci_samples(path = variants)reads_loci_samples( path = system.file("extdata", "truncated", package = "tidyGenR"), pattern_fq = "F_filt.fastq.gz") # with variants dataframe data("variants") reads_loci_samples(path = variants)
It generates a dataframe with tracked reads per samples per locus. through the pipeline, from dataframes or FASTQ files.
reads_track(l, sample_pattern = "^[A-Za-z0-9]*")reads_track(l, sample_pattern = "^[A-Za-z0-9]*")
l |
List of elements to retrieved reads from. |
sample_pattern |
Pattern used to extract sample names from FASTQ files. |
'l' is a named list. It can be either a dataframe with two mandatory fields ('sample' and 'reads'), or a character vector with 2 elements: 1, the path to a directory containing FASTQ files; and ,2, a pattern matching desired FASTQ files. By, default, sample names matching 'sample_pattern' '^[A-Za-z0-9]*' are extracted from FASTQ files. 'left_join()' is used iteratively from the first to the last element in 'l', so the elements in 'l' need to be nested from element 1 to the last.
Dataframe with tracked reads through the pipeline. The first column is named 'sample' and the following
# from folder with FASTQ files path2truncated <- system.file("extdata", "truncated", package = "tidyGenR") l <- list( truncated = c(path2truncated, "F_filt.fastq.gz")) reads_track(l)# from folder with FASTQ files path2truncated <- system.file("extdata", "truncated", package = "tidyGenR") l <- list( truncated = c(path2truncated, "F_filt.fastq.gz")) reads_track(l)
Removes all rows with hemizygous alleles from tidy genotypes.
remove_hemizygotes(gen)remove_hemizygotes(gen)
gen |
Tidy diploid genotypes. |
If "A/B" -> "A/B"; if "A" -> NULL. It only is applicable to diploid data.
Tidy genotypes without hemizygous alleles.
data("genotypes") remove_hemizygotes(genotypes)data("genotypes") remove_hemizygotes(genotypes)
Removes monomorphic (or polymorphic) loci from tidy genotypes.
remove_monomorphic(gen, invert = FALSE)remove_monomorphic(gen, invert = FALSE)
gen |
Tidy genotypes (output from 'genotype'). |
invert |
T/F, if T, polymorphic loci are removed. |
Tidy genotypes with mono/poly-morphic loci.
data(genotypes) remove_monomorphic(genotypes)data(genotypes) remove_monomorphic(genotypes)
Removes FASTA/FASTQ files with a number of reads less than or equal to
min_reads from a specified directory.
remove_poor_fastq(path2fastq, pattern = "fastq.gz", min_reads = 0)remove_poor_fastq(path2fastq, pattern = "fastq.gz", min_reads = 0)
path2fastq |
Character. Path to the directory containing FASTQ files. |
pattern |
Character. Pattern used to identify FASTQ files in
|
min_reads |
Integer. FASTQ files with a number of reads less than or equal to this threshold are removed. |
No return value. Called for its side effects: FASTA/FASTQ files in
path2fastq with a number of reads less than or equal to
min_reads are removed from disk. Invisibly returns NULL.
tmp <- tempdir() fq <- file.path(tmp, "no_reads.fastq") file.create(fq) remove_poor_fastq(tmp, pattern = "fastq", min_reads = 0)tmp <- tempdir() fq <- file.path(tmp, "no_reads.fastq") file.create(fq) remove_poor_fastq(tmp, pattern = "fastq", min_reads = 0)
Rename alleles in a dataframe based on reference alleles
rename_alleles(df_alleles, ref_alleles, replacements = FALSE)rename_alleles(df_alleles, ref_alleles, replacements = FALSE)
df_alleles |
Dataframe with at least the following columns: 'allele' 'sequence'. |
ref_alleles |
Dataframe with at least the same columns as 'df_alleles', or path multifasta FASTA with allele names in fasta headers. |
replacements |
T/F, if TRUE, a dataframe with the replacements correspondence is returned. Optionally, a path to a multifasta with fasta headers being allele names. |
Dataframe as 'df_alleles' with renamed alleles as in 'ref_alleles'.
data("genotypes") ref_al <- system.file("extdata/reference_alleles.fasta", package = "tidyGenR") rename_alleles(df_alleles = genotypes, ref_alleles = ref_al)data("genotypes") ref_al <- system.file("extdata/reference_alleles.fasta", package = "tidyGenR") rename_alleles(df_alleles = genotypes, ref_alleles = ref_al)
Detect N strings and remove them from aligned BStringSet MSA.
rm_n_msa( msa, pattern_n = paste0(rep("N", 10), collapse = ""), outfasta = FALSE )rm_n_msa( msa, pattern_n = paste0(rep("N", 10), collapse = ""), outfasta = FALSE )
msa |
Multiple sequence alignment stored as a "DNAMultipleAlignment" or "DNAStringSet". |
pattern_n |
N string to match. |
outfasta |
Character vector with path to write FASTA file with MSA. |
When using 'mergePairs(justConcatenate = T)', a string of 10 Ns is introduced between paired-end concatenated reads. This function detects the coordinates of the N string in the MSA and removes it. It uses 'matchPattern()' to detect patterns. WARNING: it has not been tested for multiple hits of the pattern. It could potentially work with other patterns in addition to N strings, but it has not been tested.
Filtered BStringSet MSA.
Write FASTA and/or return BioString object with selected sequences.
tidy2sequences( td, fasta_header = "{locus}_{allele}", filename = FALSE, seq = "sequence" )tidy2sequences( td, fasta_header = "{locus}_{allele}", filename = FALSE, seq = "sequence" )
td |
Tidy variants or genotypes. Dataframe with at least one column with DNA sequences. |
fasta_header |
Naming of FASTA headers. The string is passed to 'glue' for forming the FASTA headers and selecting distinct rows. |
filename |
Path to write FASTA file, or FALSE, to prevent writing DNA sequences to a FASTA file. |
seq |
Name of the column in 'td' with DNA sequences. |
'fasta_header' is a flexible selector and constructor for FASTA headers. The variables included are used to filter distinct rows in 'td'. Examples of outputs:
"{locus}_{allele}", all different allele sequences per locus.
"{sample}{locus}{variant}", all sequences for all the variants for all samples, thus repeated sequences from the same variant corresponding to different samples.
"{md5}", all different DNA sequences.
"{sample}", one sequence per sample. Since one sample matches to many sequences, the first occurrence in the dataframe is selected.
'DNAStringSet' object with selected sequences.
data("genotypes") tidy2sequences( td = genotypes, fasta_header = "{locus}_{allele}", filename = FALSE )data("genotypes") tidy2sequences( td = genotypes, fasta_header = "{locus}_{allele}", filename = FALSE )
Truncation lengths for forward and reverse reads for the dataset containing 30 primer pairs used to amplify intron loci in Rattus baluensis (Camacho-Sanchez et al. 2026).
Data frame with three columns:
Locus name.
Truncation length for forward reads.
Truncation length for reverse reads.
Truncates reads using 'filterAndTrim()' from DADA2. It automatically detects single-end or paired-end sequencing files.
trunc_amp( loci = NULL, in_dir, trunc_fr, fw_pattern, rv_pattern = NULL, outdir = tempdir(), filt_name = "_F_filt.fastq.gz", mode_trun = "pe", multithread = FALSE, max_ee = 3, trunc_q = 2 )trunc_amp( loci = NULL, in_dir, trunc_fr, fw_pattern, rv_pattern = NULL, outdir = tempdir(), filt_name = "_F_filt.fastq.gz", mode_trun = "pe", multithread = FALSE, max_ee = 3, trunc_q = 2 )
loci |
Character vector with loci. If NULL, all loci detected by 'check_names_demultiplexed()' are processed. |
in_dir |
Path to folder with demultiplexed reads "sample.locus. [1|2].fastq.gz". |
trunc_fr |
Flexible argument that sets truncation length for F (and R) reads.
|
fw_pattern |
Pattern matching specific extension of forward reads. |
rv_pattern |
Pattern matching specific extension of reverse reads. |
outdir |
Path where filtered reads are written. Created if it does not exist. |
filt_name |
Pattern to append to FASTA names: '{sample}_{locus}{filt_name}'. |
mode_trun |
'se': single-end; 'pe', paired-end. |
multithread |
T/F, see 'filterAndTrim()'. |
max_ee |
Maximum expected errors. See 'filterAndTrim()'. |
trunc_q |
Trim low-quality bases from 3' end. See 'filterAndTrim()'. |
List with dataframes of number of reads before and after truncation. It writes truncated sequence to "outdir"
dem <- system.file("extdata", "demultiplexed", package = "tidyGenR") # single end trunc_amp( mode_trun = "se", in_dir = dem, fw_pattern = "1.fastq.gz", trunc_fr = 200 ) # paired-end data("trunc_fr") trunc_amp( mode_trun = "se", loci = c("chrna9", "nfkbia"), in_dir = dem, fw_pattern = "1.fastq.gz", rv_pattern = "2.fastq.gz", trunc_fr = trunc_fr, max_ee = 3, trunc_q = 2 )dem <- system.file("extdata", "demultiplexed", package = "tidyGenR") # single end trunc_amp( mode_trun = "se", in_dir = dem, fw_pattern = "1.fastq.gz", trunc_fr = 200 ) # paired-end data("trunc_fr") trunc_amp( mode_trun = "se", loci = c("chrna9", "nfkbia"), in_dir = dem, fw_pattern = "1.fastq.gz", rv_pattern = "2.fastq.gz", trunc_fr = trunc_fr, max_ee = 3, trunc_q = 2 )
Convert sequence-based variants to length-based
var_seq2len(variants)var_seq2len(variants)
variants |
Dataframe with tidy variants. |
Recodes dataframe with sequence-based variants, such as the output from 'variant_call' to length-based variants. The sequence length 'nt', instead of the nucleotide sequence, is used for determining variants. Each variant is named after its number of nucleotides. Number of reads from previous variants with equal lengths are aggreated. Legacy for microsatellite data.
Dataframe with length-based variants and the variables: 'locus' 'sample' 'variant' 'reads'.
data("variants") var_seq2len(variants)data("variants") var_seq2len(variants)
'variant_call' Call variants for multiple loci.
variant_call( loci = NULL, in_dir, fw_pattern = "_F_filt.fastq.gz", rv_pattern = NULL, sample_locus = "(^[a-zA-Z0-9]*).([a-zA-Z0-9]*)", c_unmerged = FALSE, pool = FALSE, error_function = loessErrfun, multithread = FALSE, chim_rm = "consensus", ad = 1, maf = 0, omega_a_f = getDadaOpt()$OMEGA_A, omega_a_r = getDadaOpt()$OMEGA_A, band_size = getDadaOpt()$BAND_SIZE )variant_call( loci = NULL, in_dir, fw_pattern = "_F_filt.fastq.gz", rv_pattern = NULL, sample_locus = "(^[a-zA-Z0-9]*).([a-zA-Z0-9]*)", c_unmerged = FALSE, pool = FALSE, error_function = loessErrfun, multithread = FALSE, chim_rm = "consensus", ad = 1, maf = 0, omega_a_f = getDadaOpt()$OMEGA_A, omega_a_r = getDadaOpt()$OMEGA_A, band_size = getDadaOpt()$BAND_SIZE )
loci |
Character vector with loci to detect from 'sample_locus' in sample names. If NULL, all loci detected according to the pattern in the target directory are used. |
in_dir |
Path to folder with truncated files. |
fw_pattern |
Pattern matching files with F reads. It has to match everything after locus name. eg see parenthesis in "samplex_locus(_2_filt.fastq.gz)" |
rv_pattern |
Pattern matching files with R reads. If left NULL, single-end sequencing will be assumed. |
sample_locus |
Patterns to extract from FASTQ file names.
Group 1 captures
sample name and group 2 captures locus name.
(DEFAULT: |
c_unmerged |
F/R sequences that were not merged in mergePairs are concatenated using a stretch of 10 N's. |
pool |
Passed to 'dada()'. Denoising is done in pooled samples (T) or by sample (F). |
error_function |
Use default 'loessErrfun' for regular Illumina quality codification and 'loess_err_mod4' for binned NovaSeq qualities. |
multithread |
T/F, passed to 'multithread' in 'dada' and 'learnErrors()'. |
chim_rm |
If FALSE, no chimera removal is performed. If == "character", it is passed to 'method' in 'removeBimeraDenovo()' (DEFAULT = 'consensus'). |
ad |
Allele Depth. Passed to filter_variants. |
maf |
Minimum Allele Frequency. Passed to filter_variants. |
omega_a_f |
"OMEGA_A" passed to 'dada' in forward reads (Default: getDadaOpt()$OMEGA_A). |
omega_a_r |
"OMEGA_A" passed to 'dada' in reverse reads (Default: getDadaOpt()$OMEGA_A). |
band_size |
"BAND_SIZE" passed to 'dada' (DEFAULT: getDadaOpt()$BAND_SIZE). |
Allows single-end and paired-end data. Be careful with the use of 'c_unmerged'. It will trigger the 'justConcatenate' argument in 'mergePairs', and 10 N's will be used to concatenate non-overlapping F and R reads. Use 'c_unmerged' carefully, as it will generate artificial variant sequences. Default is deactivated. Variants are reformatted and to a "tibble" and filtered according to 'maf' and 'ad', which are added as attributes to output. Variables of output are 'sample', 'locus', 'sequence' (DNA sequence), 'variant' (name of variant), 'reads' (number of reads supporting that variant), 'nt' (sequence length), 'md5' (md5 checksum).
tidy tibble with locus, sample, sequence, variant (name), nt(sequence length), md5.
# truncated fastq truncated <- system.file("extdata", "truncated", package = "tidyGenR") # variant calling variant_call(in_dir = truncated)# truncated fastq truncated <- system.file("extdata", "truncated", package = "tidyGenR") # variant calling variant_call(in_dir = truncated)
'variant_call_dada()' Call variants for a single locus using DADA2. It is nested in 'variant_call'.
'loess_err_mod4()' Replaces default 'loessErrfun' when estimating error matrices for NovaSeq binned qualities.
'chimera_removal()' Wrapping function to remove chimeras using 'removeBimeraDenovo()'.
variant_call_dada( locus, in_dir, fw_pattern = "_F_filt.fastq.gz", rv_pattern = NULL, sample_locus = "(^[a-zA-Z0-9]*)_([a-zA-Z0-9]*)", c_unmerged = FALSE, pool = FALSE, error_function = loessErrfun, multithread = FALSE, chim_rm = "consensus", omega_a_f = getDadaOpt()$OMEGA_A, omega_a_r = getDadaOpt()$OMEGA_A, band_size = getDadaOpt()$BAND_SIZE ) loess_err_mod4(trans) chimera_removal(seqtab, chim_rm) dada2list(dada_object, names)variant_call_dada( locus, in_dir, fw_pattern = "_F_filt.fastq.gz", rv_pattern = NULL, sample_locus = "(^[a-zA-Z0-9]*)_([a-zA-Z0-9]*)", c_unmerged = FALSE, pool = FALSE, error_function = loessErrfun, multithread = FALSE, chim_rm = "consensus", omega_a_f = getDadaOpt()$OMEGA_A, omega_a_r = getDadaOpt()$OMEGA_A, band_size = getDadaOpt()$BAND_SIZE ) loess_err_mod4(trans) chimera_removal(seqtab, chim_rm) dada2list(dada_object, names)
locus |
Locus name. |
in_dir |
Path to folder with truncated files. |
fw_pattern |
Pattern matching files with F reads. |
rv_pattern |
Pattern matching files with R reads. If left NULL, single-end sequencing will be assumed. |
sample_locus |
Patterns to extract from FASTQ file names.
Group 1 captures
sample name and group 2 captures locus name.
(DEFAULT: |
c_unmerged |
F/R sequences that were not merged in mergePairs are concatenated using a stretch of 10 N's. |
pool |
Passed to 'dada()'. Denoising is done in pooled samples (T) or by sample (F). |
error_function |
Use default 'loessErrfun()' for regular Illumina quality codification and 'loess_err_mod4()' for binned NovaSeq qualities. Passed to 'dada(errorEstimationFunction)'. |
multithread |
T/F, passed to 'dada(multithread)' and 'learnErrors(multithread)'. |
chim_rm |
If FALSE, no chimera removal is performed. If chim_rm is "character", it is passed to 'removeBimeraDenovo(method)'. |
omega_a_f |
"OMEGA_A" passed to 'dada' in forward reads (Default: 'getDadaOpt()$OMEGA_A'). |
omega_a_r |
"OMEGA_A" passed to 'dada' in reverse reads (Default: 'getDadaOpt()$OMEGA_A'). |
band_size |
"BAND_SIZE" passed to 'dada' (DEFAULT: 'getDadaOpt()$BAND_SIZE'). |
trans |
See loessErrfun for details. |
seqtab |
Features table from dada2. |
dada_object |
Object from |
names |
Character vector to name the elements in the list. By default, DADA2 names are basenames from FAST(Q). |
'variant_call_dada()' Allows single-end and paired-end data. 'c_unmerged' will trigger the 'justConcatenate' argument in 'mergePairs', and 10 N's will be used to concatenate non-overlapping F and R reads. Use 'concateNotMerged' carefully, as it will generate artificial variant sequences. Default is deactivated.
'loess_err_mod4()' This is a beta function which has been shown to work in my experience and other users' experience. See discussion in https://github.com/benjjneb/dada2/issues/1307
'dada2list()' DADA2 functions return a dataframe whenever the length of the object is == 1 or a list if length > 1. To smooth prevent from bugs, this function standardizes the output of dada2 to lists when length == 1. Unaltered when input is a list.
'variant_call_dada()' Numeric array. Column names are variant sequences; row names are sample names and each number in the array indicate the number of reads supporting variant 'i' in sample 'j'.
'chimera_removal()' 'seqtab' like object (see dada2).
'dada2list()' List with dada2 object/s.
List of 5 tidy variants dataframes with variant calls using different strategies.
A list of 5 dataframes.
pool = F; band_size = 16
pool = F; band_size = 0
pool = T; band_size = 16
pool = T; band_size = 0
AmpliSAT
Filtered tidy variants of Rattus baluensis from Trusmadi (Camacho-Sanchez et al. 2026), corresponding to 27 loci from 44 samples.
tidy dataframe or tibble.
Sample name.
Locus name.
Variant name.
Number of reads supporting allele.
Length in number of nucleotides of allele sequence.
MD5 checksum of allele sequence.
DNA sequence of allele.