| Title: | Microbiome Integrated Reconstruction and Estimation |
|---|---|
| Description: | An integrated framework for microbiome reconstruction from sequencing data. It leverages tools like Kraken2 for taxonomic classification and combines cell barcodes, UMIs, and k-mer-based quantification to reconstruct microbial signals. Designed for both bulk and single-cell sequencing data, the package enables taxonomic and quantitative profiling of microbial communities. |
| Authors: | Yun Peng [aut, cre] (ORCID: <https://orcid.org/0000-0003-2801-3332>) |
| Maintainer: | Yun Peng <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.1.9000 |
| Built: | 2026-05-13 08:47:49 UTC |
| Source: | https://github.com/WangLabCSU/mire |
True taxa are detected on multiple barcodes and with a proprotional number of
total and unique k-mer sequences across barcodes, measured as a significant
Spearman correlation between the number of total and unique k-mers across
barcodes. (padj < 0.05)
blsd( kmer, method = "spearman", ..., min_kmer_len = 3L, min_cells = 3L, p.adjust = "BH" )blsd( kmer, method = "spearman", ..., min_kmer_len = 3L, min_cells = 3L, p.adjust = "BH" )
kmer |
kmer file returned by |
method |
A character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated. |
... |
Other arguments passed to cor.test. |
min_kmer_len |
An integer, the minimal number of kmer to filter taxa.
SAHMI use |
min_cells |
An integer, the minimal number of cell per taxid. SAHMI use
|
p.adjust |
Pvalue correction method, a character string. Can be abbreviated. Details see p.adjust. |
A polars DataFrame
https://github.com/sjdlabgroup/SAHMI
This function integrates three signal denoising strategies—cell-line-based contaminant filtering, sample-level correlation filtering, and barcode-level correlation testing— to identify reliable microbial taxa and quantify their abundance across samples. It returns a table of taxa counts (per barcode), after filtering out likely contaminants and false positives.
denoise_counts( kreports, kmers, umis, samples = NULL, taxon = c("d__Bacteria", "d__Fungi", "d__Viruses"), drop_unmatched_taxa = TRUE, quantile = 0.95, alpha = 0.05, min_reads = 3L, min_minimizer_n_unique = 3L, min_samples = 3L, min_kmer_len = 3L, min_cells = 3L, cor_threshold = 0, p_threshold = 0.05, padj_threshold = 0.05, ..., study = "current study", alternative = "greater", p.adjust = "BH", method = "spearman" )denoise_counts( kreports, kmers, umis, samples = NULL, taxon = c("d__Bacteria", "d__Fungi", "d__Viruses"), drop_unmatched_taxa = TRUE, quantile = 0.95, alpha = 0.05, min_reads = 3L, min_minimizer_n_unique = 3L, min_samples = 3L, min_kmer_len = 3L, min_cells = 3L, cor_threshold = 0, p_threshold = 0.05, padj_threshold = 0.05, ..., study = "current study", alternative = "greater", p.adjust = "BH", method = "spearman" )
kreports |
A character vector of paths to Kraken2 report files,
typically output by |
kmers |
A character vector of paths to k-mer quantification files,
typically output by |
umis |
A character vector of paths to UMI quantification files,
typically output by |
samples |
A character of sample identifier for each element in |
drop_unmatched_taxa |
A boolean value, indicates whether taxa not found
in celllines data should be dropped. Default: |
quantile |
Probabilities with values in |
alpha |
Level of significance. |
min_reads |
An integer, the minimal number of the total reads to filter
taxa. SAHMI use |
min_minimizer_n_unique |
An integer, the minimal number of the unique
number of minimizer to filter taxa. SAHMI use |
min_samples |
An integer, the minimal number of samples per taxid. SAHMI
use |
min_kmer_len |
An integer, the minimal number of kmer to filter taxa.
SAHMI use |
min_cells |
An integer, the minimal number of cell per taxid. SAHMI use
|
cor_threshold |
Minimum correlation coefficient required in sample-level
filtering. Default is |
p_threshold |
Significance threshold for sample-level correlation
p-values. Default is |
padj_threshold |
Adjusted p-value threshold for barcode-level denoising.
Default is |
... |
Other arguments passed to cor.test. |
study |
A string of the study name, used to differentiate with cell line data. |
alternative |
A string specifying the alternative hypothesis, must be one of "two.sided", "greater" (default) or "less". You can specify just the initial letter. |
p.adjust |
Pvalue correction method, a character string. Can be abbreviated. Details see p.adjust. |
method |
A character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated. |
This function processes Kraken2 output and associated FASTQ files, extracting relevant taxonomic information and preparing them for further downstream analysis.
koutreads( kreport, koutput, reads, ofile, tag_ranges1 = NULL, tag_ranges2 = NULL, taxonomy = c("D__Bacteria", "D__Fungi", "D__Viruses"), exclude = c("9606"), koutput_batch = NULL, fastq_batch = NULL, chunk_bytes = NULL, compression_level = 4L, nqueue = NULL, threads = NULL, odir = NULL ) tag(tag, ranges)koutreads( kreport, koutput, reads, ofile, tag_ranges1 = NULL, tag_ranges2 = NULL, taxonomy = c("D__Bacteria", "D__Fungi", "D__Viruses"), exclude = c("9606"), koutput_batch = NULL, fastq_batch = NULL, chunk_bytes = NULL, compression_level = 4L, nqueue = NULL, threads = NULL, odir = NULL ) tag(tag, ranges)
kreport |
Path to the Kraken2 report file. |
koutput |
Path to the Kraken2 output file. |
reads |
A character vector of FASTQ file paths, either the original reads used as input to Kraken2 or the classified output reads (recommended for efficiency as they are smaller). Accepts one file for single-end or two files for paired-end. If only one file is used in Kraken2 to generate the koutput file, only the second read sequence will be extracted to match the koutput's Lowest Common Ancestor (LCA). This is commonly observed in data such as 10x Genomics, where the first read contains only UMI and barcode information, while the second read carries the biological sequence. You can always change the input files to treat the actual biological read as read1 if needed. |
ofile |
A character string. Path to the output file that will store the
matched reads extracted based on Kraken2 classification. The output is
compressed if the extension is |
tag_ranges1, tag_ranges2
|
A list of sequence ranges for extracting tags
from the first/second read in single-end/paired-end data. If |
taxonomy |
A character vector. The set of taxonomic groups to include
(default: |
exclude |
A character vector of taxids to exclude sequences from usage.
Typically used to exclude the host taxid (e.g., |
koutput_batch, fastq_batch
|
Integer. Number of FASTQ records/Koutput
lines to accumulate before dispatching a chunk to worker threads for
processing. This controls the granularity of parallel work and affects
memory usage and performance.
Default is |
chunk_bytes |
Integer specifying the size in bytes used for compressing
and writing records in batches to disk. Default is |
compression_level |
Integer from 1 to 12 (default: |
nqueue |
Integer. Maximum number of buffers per thread, controlling the
amount of in-flight data awaiting writing. Default: |
threads |
Integer. Number of threads to use. Default: |
odir |
A string of directory to save the output files. Please see
|
tag |
An character label used to label the extracted content. |
ranges |
A range or a list of ranges specifying the subsequence(s) to
process. Must be created using the |
This function filters Kraken2 classification output (koutput) by taxonomic
identifiers derived from a Kraken2 report (kreport). It extracts lines
matching the desired taxonomy, ranks, taxa, taxids, and descendants
and writes the filtered results to an output file.
kractor_koutput( kreport, koutput, ofile, taxonomy = c("D__Bacteria", "D__Fungi", "D__Viruses"), ranks = NULL, taxa = NULL, taxids = NULL, exclude = NULL, descendants = TRUE, batch_size = NULL, chunk_bytes = NULL, compression_level = 4L, nqueue = NULL, threads = NULL, odir = NULL )kractor_koutput( kreport, koutput, ofile, taxonomy = c("D__Bacteria", "D__Fungi", "D__Viruses"), ranks = NULL, taxa = NULL, taxids = NULL, exclude = NULL, descendants = TRUE, batch_size = NULL, chunk_bytes = NULL, compression_level = 4L, nqueue = NULL, threads = NULL, odir = NULL )
kreport |
Path to the Kraken2 report file. |
koutput |
Path to the Kraken2 output file. |
ofile |
A character string. Path to the output file storing the filtered
Kraken2 output lines that pass taxonomic and exclusion filters. If the
filename ends with |
taxonomy |
Character vector. The set of taxonomic groups to include
(default: |
ranks |
Character vector. The taxonomic ranks to filter by (optional). |
taxa |
Character vector. Specific taxa to include (optional). |
taxids |
Character vector. A list of taxid values to filter by (optional). |
exclude |
A character vector of taxids to exclude sequences from usage. |
descendants |
Logical. Whether to include descendants of the selected
taxa (default: |
chunk_bytes |
Integer specifying the size in bytes used for compressing
and writing records in batches to disk. Default is |
compression_level |
Integer from 1 to 12 (default: |
nqueue |
Integer. Maximum number of buffers per thread, controlling the
amount of in-flight data awaiting writing. Default: |
threads |
Integer. Number of threads to use. Default: |
odir |
A string of directory to save the output files. Please see
|
None. The function generates a filtered Kraken2 output file
containing entries corresponding to the specified taxonomy, ranks,
taxa, taxids, and descendants extracted from the input koutput.
This function extracts reads corresponding to selected classifications from a
Kraken2 output file (koutput). Only reads classified to selected taxa will
be extracted from the provided sequence file (reads).
kractor_reads( koutput, reads, ofile1 = NULL, ofile2 = NULL, batch_size = NULL, chunk_bytes = NULL, compression_level = 4L, nqueue = NULL, threads = NULL, odir = NULL )kractor_reads( koutput, reads, ofile1 = NULL, ofile2 = NULL, batch_size = NULL, chunk_bytes = NULL, compression_level = 4L, nqueue = NULL, threads = NULL, odir = NULL )
koutput |
Path to the Kraken2 output file. |
reads |
A character vector of FASTQ file paths. Accepts one file for single-end or two files for paired-end. |
ofile1 |
Output FASTQ file path for the first read ( |
ofile2 |
Optional path to the output FASTQ file for |
batch_size |
Integer. Number of FASTQ records to accumulate before
dispatching a chunk to worker threads for processing. This controls the
granularity of parallel work and affects memory usage and performance.
Default is |
chunk_bytes |
Integer specifying the size in bytes used for compressing
and writing records in batches to disk. Default is |
compression_level |
Integer from 1 to 12 (default: |
nqueue |
Integer. Maximum number of buffers per thread, controlling the
amount of in-flight data awaiting writing. Default: |
threads |
Integer. Number of threads to use. Default: |
odir |
A string of directory to save the output files. Please see
|
This function runs the Kraken2 classifier on one or two FASTQ files (single-
or paired-end). It wraps around the blit::kraken2() command interface and
adds optional support for environment setup (e.g., Conda or custom PATH).
kraken2( reads, ..., db = NULL, kreport = "kraken_report.txt", koutput = "kraken_output.txt", classified_out = "classified.fq", unclassified_out = NULL, kraken2 = NULL, envpath = NULL, conda = NULL, condaroot = NULL, odir = NULL )kraken2( reads, ..., db = NULL, kreport = "kraken_report.txt", koutput = "kraken_output.txt", classified_out = "classified.fq", unclassified_out = NULL, kraken2 = NULL, envpath = NULL, conda = NULL, condaroot = NULL, odir = NULL )
reads |
A character vector of FASTQ files used as input to Kraken2. Can be one file (single-end) or two files (paired-end). |
... |
<dynamic dots> Additional arguments passed to |
db |
Path to the Kraken2 database. You can download prebuilt databases from https://benlangmead.github.io/aws-indexes/k2, or build your own by following the instructions at https://github.com/DerrickWood/kraken2/wiki/Manual#kraken-2-databases. |
classified_out |
A string of path to save classified sequences, which should be a fastq file. |
unclassified_out |
A string of path to save unclassified sequences, which should be a fastq file. |
kraken2 |
Optional. Path to the Kraken2 binary if not in the system
|
envpath |
Optional. Additional directory to prepend to the system |
conda |
Optional. Name of the Conda environment to activate before running Kraken2. |
condaroot |
A string specifying the path to the conda root prefix. If not provided, the function searches for the root in the following order:
|
odir |
A string of path to the output directory. |
None. This function is called for its side effects.
It produces the following output files in odir (or the working directory
if odir is NULL):
kreport: Kraken2 classification report
koutput: Kraken2 raw classification output
classified_out: FASTQ file of classified reads
unclassified_out: FASTQ file of unclassified reads (if specified)
This function counts total and unique k-mers per taxon across cell barcodes, using both the cell barcode and unique molecular identifier (UMI) to resolve read identity at the single-cell level. It aggregates k-mer counts for each taxonomic rank of interest (by default, genus and species), including all descendant taxa within those ranks.
krcount( koutreads, kreport, umi_tag = NULL, barcode_tag = NULL, taxonomy = c("D__Bacteria", "D__Fungi", "D__Viruses"), batch_size = NULL, nqueue = NULL )krcount( koutreads, kreport, umi_tag = NULL, barcode_tag = NULL, taxonomy = c("D__Bacteria", "D__Fungi", "D__Viruses"), batch_size = NULL, nqueue = NULL )
koutreads |
Path to the output file produced by |
kreport |
Path to the Kraken2 report file. |
umi_tag |
(Optional) A string specifying the tag used to extract unique
molecular identifiers (UMIs) from each read. If |
barcode_tag |
(Optional) A string specifying the tag used to extract the
cell barcode from each read. If |
taxonomy |
A character vector. The set of taxonomic groups to include
(default: |
nqueue |
Integer. Maximum number of buffers per thread, controlling the
amount of in-flight data awaiting writing. Default: |
Parse kraken report file
read_kreport(kreport, taxonomy = NULL)read_kreport(kreport, taxonomy = NULL)
kreport |
The path to kraken report file. |
taxonomy |
A character vector. The set of taxonomic groups to include. |
A data frame.
https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
This function detects likely contaminant or false-positive microbial taxa by comparing read abundance (in reads per million microbiome reads, RPMM) from input Kraken2 reports with a curated reference of known contaminants derived from cell lines. It performs a one-sample quantile test on each taxon and reports taxa with significant enrichment above reference levels.
rpmm_quantile( kreports, study = "current study", taxon = c("d__Bacteria", "d__Fungi", "d__Viruses"), quantile = 0.95, alpha = 0.05, alternative = "greater", drop_unmatched_taxa = TRUE )rpmm_quantile( kreports, study = "current study", taxon = c("d__Bacteria", "d__Fungi", "d__Viruses"), quantile = 0.95, alpha = 0.05, alternative = "greater", drop_unmatched_taxa = TRUE )
kreports |
A character of path to all kraken report files. |
study |
A string of the study name, used to differentiate with cell line data. |
quantile |
Probabilities with values in |
alpha |
Level of significance. |
alternative |
A string specifying the alternative hypothesis, must be one of "two.sided", "greater" (default) or "less". You can specify just the initial letter. |
drop_unmatched_taxa |
A boolean value, indicates whether taxa not found
in celllines data should be dropped. Default: |
A polars DataFrame with following attributes:
pvalues: Quantile test pvalue.
unmatched_taxa: taxids in current study but not found in cellline data.
significant: significant taxids with pvalues < alpha.
truly: truly taxids based on alpha and unmatched_taxa. If
drop_unmatched_taxa isn't TRUE, this should be the union of
unmatched_taxa and significant, otherwise, this should be the same
with significant.
## Not run: # `paths` should be the output directory for each sample from # `blit::kraken2()` quantile_test <- remove_contaminants( kreports = file.path(paths, "kraken_report.txt"), quantile = 0.99, drop_unmatched_taxa = TRUE ) ggplot2::autoplot(quantile_test) ## End(Not run)## Not run: # `paths` should be the output directory for each sample from # `blit::kraken2()` quantile_test <- remove_contaminants( kreports = file.path(paths, "kraken_report.txt"), quantile = 0.99, drop_unmatched_taxa = TRUE ) ggplot2::autoplot(quantile_test) ## End(Not run)
Constructs a mire_seq_range object representing a UMI/Barcode range with
optional start and end positions.
seq_range(start = NULL, end = NULL)seq_range(start = NULL, end = NULL)
start |
Integer. Start position (1-based). Optional if |
end |
Integer. End position (1-based). Optional if |
A mire_seq_range or mire_seq_ranges object.
# Create a single range seq_range(start = 100, end = 200) # If `end` is NULL, the range goes to the end of the sequence seq_range(start = 100) # If `start` is NULL, the range starts from the beginning of the sequence seq_range(end = 200) # Combine multiple ranges r1 <- seq_range(100, 200) r2 <- seq_range(300, 400) r3 <- seq_range(end = 500) # Combine them into a multi-range object using `c()` c(r1, r2, r3) # Nested combinations are also supported c(c(r1, r2), r3)# Create a single range seq_range(start = 100, end = 200) # If `end` is NULL, the range goes to the end of the sequence seq_range(start = 100) # If `start` is NULL, the range starts from the beginning of the sequence seq_range(end = 200) # Combine multiple ranges r1 <- seq_range(100, 200) r2 <- seq_range(300, 400) r3 <- seq_range(end = 500) # Combine them into a multi-range object using `c()` c(r1, r2, r3) # Nested combinations are also supported c(c(r1, r2), r3)
This function refines one or two FASTQ files by applying trimming and
embedding operations defined via user-specified actions such as embed(),
trim(), or embed_trim(). It supports single-end and paired-end reads,
processes data in chunks, and uses multithreading for performance.
seq_refine( reads, ofile1 = NULL, ofile2 = NULL, umi_action1 = NULL, umi_action2 = NULL, barcode_action1 = NULL, barcode_action2 = NULL, extra_actions1 = NULL, extra_actions2 = NULL, batch_size = NULL, chunk_bytes = NULL, compression_level = 4L, nqueue = NULL, threads = NULL, odir = NULL )seq_refine( reads, ofile1 = NULL, ofile2 = NULL, umi_action1 = NULL, umi_action2 = NULL, barcode_action1 = NULL, barcode_action2 = NULL, extra_actions1 = NULL, extra_actions2 = NULL, batch_size = NULL, chunk_bytes = NULL, compression_level = 4L, nqueue = NULL, threads = NULL, odir = NULL )
reads |
A character vector of FASTQ file paths. Accepts one file for single-end or two files for paired-end. |
ofile1 |
Output FASTQ file path for the first read ( |
ofile2 |
Optional path to the output FASTQ file for |
umi_action1, umi_action2
|
Sequence action for extracting or trimming UMI
from |
barcode_action1, barcode_action2
|
Sequence action for extracting or
trimming barcodes from |
extra_actions1, extra_actions2
|
Additional sequence actions to apply to
|
batch_size |
Integer. Number of FASTQ records to accumulate before
dispatching a chunk to worker threads for processing. This controls the
granularity of parallel work and affects memory usage and performance.
Default is |
chunk_bytes |
Integer specifying the size in bytes used for compressing
and writing records in batches to disk. Default is |
compression_level |
Integer from 1 to 12 (default: |
nqueue |
Integer. Maximum number of buffers per thread, controlling the
amount of in-flight data awaiting writing. Default: |
threads |
Integer. Number of threads to use. Default: |
odir |
A string of directory to save the output files. Please see
|
Actions define what to do with sequence ranges specified using
seq_range().
UMI/barcode actions must be a single range (no lists) and default to
embed_trim() if not specified.
Extra actions can be multiple ranges (use a list) and default to trim()
if not specified.
Use embed(), trim(), or embed_trim() to specify the behavior.
None. Outputs processed FASTQ files as specified by ofile1 and
ofile2.
In the low-microbiome biomass setting, real microbes also exhibit a
proportional number of total k-mers, number of unique k-mers, as well as
number of total assigned sequencing reads across samples; i.e. the following
three Spearman correlations are significant when tested using sample-level
data provided in Kraken reports: cor(minimizer_len, minimizer_n_unique),
cor(minimizer_len, total_reads) and cor(total_reads, minimizer_n_unique).
(r1>0 & r2>0 & r3>0 & p1<0.05 & p2<0.05 & p3<0.05).
slsd( kreports, method = "spearman", ..., min_reads = 3L, min_minimizer_n_unique = 3L, min_samples = 3L )slsd( kreports, method = "spearman", ..., min_reads = 3L, min_minimizer_n_unique = 3L, min_samples = 3L )
kreports |
kreport files returned by |
method |
A character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated. |
... |
Other arguments passed to cor.test. |
min_reads |
An integer, the minimal number of the total reads to filter
taxa. SAHMI use |
min_minimizer_n_unique |
An integer, the minimal number of the unique
number of minimizer to filter taxa. SAHMI use |
min_samples |
An integer, the minimal number of samples per taxid. SAHMI
use |
A polars DataFrame of correlation
coefficient and pvalue for cor(minimizer_len, minimizer_n_unique) (r1 and
p1), cor(minimizer_len, total_reads) (r2 and p2) and cor(total_reads, minimizer_n_unique) (r3 and p3).
These functions annotate a range or a list of ranges to indicate how the specified subsequences (e.g. UMI or barcode) should be handled during FASTQ processing.
embed(tag, ranges) trim(ranges) embed_trim(tag, ranges)embed(tag, ranges) trim(ranges) embed_trim(tag, ranges)
tag |
An character label used when embedding sequence content
into the FASTQ header (used with For UMI and barcode actions, the tag will be assigned automatically as
For other types of actions, you must explicitly specify a |
ranges |
A range or a list of ranges specifying the subsequence(s) to
process. Must be created using the |
embed() extracts the sequence and appends it to the read header without
modifying the sequence.
trim() removes the specified subsequence from the read but does not store
it.
embed_trim() both extracts the subsequence for the header and removes it
from the read.
Each function wraps the input range object in a new class to indicate its behavior downstream.
An annotated mire_seq_range or mire_seq_ranges object. object
with behavior-specific class:
mire_embed — for embedding only
mire_trim — for trimming only
mire_embed_trim — for embedding and trimming