| Title: | Tools for Accessing and Working with the Gene Expression Omnibus (GEO) |
|---|---|
| Description: | Provides a tidy and fast R interface to the NCBI Gene Expression Omnibus (GEO) database. Functions are included to query, download, and parse GEO metadata and expression data, making it easier to integrate GEO datasets into downstream analyses and workflows. |
| Authors: | Yun Peng [aut, cre] |
| Maintainer: | Yun Peng <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.1.9000 |
| Built: | 2026-06-09 10:41:10 UTC |
| Source: | https://github.com/WangLabCSU/geokit |
Determine the type of a GEO accession ID (e.g. DataSet, Series, Sample, Platform). This function inspects the accession prefix and returns its corresponding GEO type, optionally in an abbreviated form.
geo_gtype(accession, abbre = FALSE)
accession |
A character of GEO accession IDs. Examples:
|
abbre |
A logical scalar indicating whether to abbreviate the GEO type
in the return value. If |
A character vector of GEO accession types.
geo_gtype("GSE10")
geo_gtype("gpl98")
geo_gtype("GSM1")
geo_gtype("GDS10")
# use abbreviation
geo_gtype("GSE10", TRUE)
geo_gtype("gpl98", TRUE)
geo_gtype("GSM1", TRUE)
geo_gtype("GDS10", TRUE)
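The prefix inspection described above can be sketched in base R. This is a minimal illustration and not the code used by geo_gtype(): the helper name classify_accession() and the exact labels it returns are assumptions, taken from the type names listed in the description.

```r
# Hypothetical helper for illustration only; not part of geokit.
# Assumption: the type is decided from the first three letters of the
# accession, compared case-insensitively.
classify_accession <- function(accession, abbre = FALSE) {
  types <- c(GSE = "Series", GPL = "Platform", GSM = "Sample", GDS = "DataSet")
  prefix <- toupper(substr(accession, 1L, 3L))
  if (abbre) {
    # Return the upper-case prefix itself, NA for unknown prefixes.
    ifelse(prefix %in% names(types), prefix, NA_character_)
  } else {
    # Look up the long label, NA for unknown prefixes.
    unname(types[prefix])
  }
}

classify_accession(c("GSE10", "gpl98", "GSM1", "GDS10"))
classify_accession("gpl98", abbre = TRUE)
```

Returning NA for an unrecognized prefix is a design choice of this sketch; the package may instead raise an error.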
The function downloads and parses the relevant Series Matrix files, optionally mapping platform IDs to Bioconductor annotation packages.
geo_matrix(
  accession,
  add_gpl = NULL,
  pdata_from_soft = FALSE,
  ftp_over_https = NULL,
  handle_opts = list(),
  odir = getwd()
)
accession |
A character of GEO accession IDs. Examples:
|
add_gpl |
Logical or |
pdata_from_soft |
Logical. Specifies whether to derive |
ftp_over_https |
Logical scalar. If |
handle_opts |
A list of named options / headers to be set in the
|
odir |
Destination directory for downloads. Defaults to the current working directory. |
An ExpressionSet or a list of
ExpressionSets, one per Series Matrix file.
if (require("Biobase")) {
  eset <- geo_matrix("GSE10", odir = tempdir())
}
This function is useful for combining with geo_search() to filter
results, as geo_search() cannot retrieve the full metadata for all GEO
identifiers. By default, this function uses the soft format for GDS and GSE
entities, and the full amount of text format data for GPL and GSM
entities.
geo_meta(
  accession,
  famount = NULL,
  scope = NULL,
  ftp_over_https = NULL,
  handle_opts = list(),
  odir = getwd()
)
accession |
A character of GEO accession IDs. Examples:
|
famount |
A character specifying either:
See |
scope |
A character specifying which GEO accessions to include (Only applicable to Accession Display Bar access).
|
ftp_over_https |
Logical scalar. If |
handle_opts |
A list of named options / headers to be set in the
|
odir |
Destination directory for downloads. Defaults to the current working directory. |
A data frame containing the metadata for all accession IDs.
geo_meta("GSE10", odir = tempdir())
Create a QueryChat object for exploring GEO
metadata with an LLM. Use geo_qc() to create the chat object, geo_shiny()
to launch the Shiny app, and geo_chat() to start a console chat.
geo_qc(client, data_source, table_name = NULL, ..., instructions = NULL)

geo_shiny(...)

geo_chat(...)
client |
Optional chat client. Can be:
|
data_source |
A data.frame or a database connection containing GEO
metadata, typically from |
table_name |
A string specifying the table name to use in SQL queries.
If |
... |
Arguments passed on to
|
instructions |
Optional single string with additional instructions to append to the default GEO metadata assistant instructions. |
geo_qc() intentionally does not serialize all rows or build a large data
prompt. Instead, it delegates schema summarization, SQL querying, and
dashboard filtering to QueryChat.
The three exported helpers differ only in how far they take the
QueryChat workflow:
geo_qc() creates and returns the QueryChat object. Use it when you want
to inspect the generated prompt, customize the object, embed it in another
Shiny app, or launch the app/console later with qc$app() or
qc$console().
geo_shiny() creates the same QueryChat object and immediately launches
its Shiny app. Use it for interactive browser-based filtering and
exploration.
geo_chat() creates the same QueryChat object and immediately starts
its console chat. Use it for command-line exploration without opening a
Shiny app.
The default instructions guide the assistant to query and filter GEO
metadata, identify relevant studies, generate reproducible R code when
asked, preserve explicit accession IDs, and explain GEO accession types
(GSE, GSM, GPL, and GDS) when useful.
The first argument is the LLM client. Use client = NULL or pass NULL as
the first positional argument to let querychat choose a client from its
options or environment variables. Additional context such as
data_description, greeting, tools, categorical_threshold, and
cleanup can be passed through ... to querychat::querychat().
prompt_template is intentionally not forwarded because geo_qc() supplies
GEO-specific instructions through extra_instructions.
A QueryChat object configured with
data_source, an LLM client, and GEO-specific instructions.
geo_meta(), geo_search(), QueryChat,
ellmer::chat_openai()
if (requireNamespace("querychat", quietly = TRUE) &&
  requireNamespace("duckdb", quietly = TRUE)) {
  records <- data.frame(
    Accession = c("GSE1", "GSE2"),
    Title = c("human diabetes study", "mouse liver study"),
    Type = c("Expression profiling by array", "RNA-seq"),
    Samples = c(12L, 8L)
  )
  qc <- geo_qc(NULL, records, table_name = "geo_records", cleanup = TRUE)
  qc$cleanup()
}
Search the GDS database and return search results as a data frame.
geo_search(query, step = 500L, interval = NULL)
query |
A character string with the search term. The NCBI uses a
fielded search syntax. For example, |
step |
Integer. Number of records to fetch per request. Use a smaller value if requests fail. |
interval |
Numeric. Time interval (in seconds) between successive
requests. Defaults to |
NCBI allows a higher request rate (10 requests per second) when an API key is used.
You can set this key for the current R session with
rentrez::set_entrez_key() or by setting the ENTREZ_KEY environment variable
via Sys.setenv(). To make it persistent across sessions, define ENTREZ_KEY
in your .Renviron file.
Once set, rentrez will automatically use this key for all NCBI requests.
See the rentrez tutorial
for details.
A data frame containing the search results.
# Ensure you have an active internet connection before running the search.
# The `geo_search` function queries NCBI Entrez, which may have network
# restrictions and limited bandwidth usage for large queries.
out <- geo_search("diabetes[ALL] AND Homo sapiens[ORGN] AND GSE[ETYP]")
head(out)
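How step and interval interact with the NCBI rate limit can be sketched as follows. This is an illustration under stated assumptions, not the package's logic: request_interval() and batch_starts() are hypothetical helpers, the 3-requests-per-second figure for keyless access comes from NCBI's published E-utilities guidelines, and the actual default of interval in geo_search() is not shown in this manual.

```r
# Hypothetical helpers for illustration only; not part of geokit.

# Seconds to wait between requests: 10 requests/second with an API key,
# 3 requests/second without one (NCBI E-utilities guideline).
request_interval <- function(api_key = Sys.getenv("ENTREZ_KEY")) {
  if (nzchar(api_key)) 1 / 10 else 1 / 3
}

# Zero-based offsets of each batch when `total` records are fetched
# `step` at a time.
batch_starts <- function(total, step = 500L) {
  if (total < 1L) {
    return(integer(0))
  }
  seq.int(0L, total - 1L, by = step)
}

request_interval("my-key")
batch_starts(1200L, step = 500L)
```

A smaller step produces more batches, so the total waiting time grows with the number of requests rather than with the number of records.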
Construct a GEO landing page and open it directly in the system's default web
browser (or a user-specified browser). By default, this function uses the
brief amount of html format data for all entities.
geo_show(
  accession,
  famount = NULL,
  scope = NULL,
  ftp_over_https = NULL,
  browser = getOption("browser")
)
accession |
A character of GEO accession IDs. Examples:
|
famount |
A character specifying either:
See |
scope |
A character specifying which GEO accessions to include (Only applicable to Accession Display Bar access).
|
ftp_over_https |
Logical scalar. If |
browser |
a non-empty character string giving the name of the program to be used as the HTML browser. It should be in the PATH, or a full path specified. Alternatively, an R function to be called to invoke the browser. Under Windows |
See browseURL()
if (interactive()) {
  geo_show("gpl96")
}
By default, this function uses the soft format for GDS and GSE entities,
and the full amount of text format data for GPL and GSM entities.
geo_soft(
  accession,
  famount = NULL,
  scope = NULL,
  ftp_over_https = NULL,
  handle_opts = list(),
  odir = getwd()
)
accession |
A character of GEO accession IDs. Examples:
|
famount |
A character specifying either:
See |
scope |
A character specifying which GEO accessions to include (Only applicable to Accession Display Bar access).
|
ftp_over_https |
Logical scalar. If |
handle_opts |
A list of named options / headers to be set in the
|
odir |
Destination directory for downloads. Defaults to the current working directory. |
The Gene Expression Omnibus (GEO) from NCBI serves as a public repository for a wide range of high-throughput experimental data. These data include single and dual channel microarray-based experiments measuring mRNA, genomic DNA, and protein abundance, as well as non-array techniques such as serial analysis of gene expression (SAGE), and mass spectrometry proteomic data. At the most basic level of organization of GEO, there are three entity types that may be supplied by users: Platforms, Samples, and Series. Additionally, there is a curated entity called a GEO dataset.
A Platform record describes the list of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs, antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE tags, peptides). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx). A Platform may reference many Samples/Series that have been submitted by multiple submitters.
A Sample record describes the conditions under which an individual Sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it. Each Sample record is assigned a unique and stable GEO accession number (GSMxxx). A Sample entity must reference only one Platform and may be included in multiple Series.
A Series record defines a set of related Samples considered to be part of a group, how the Samples are related, and if and how they are ordered. A Series provides a focal point and description of the experiment as a whole. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each Series record is assigned a unique and stable GEO accession number (GSExxx).
GEO DataSets (GDSxxx) are curated sets of GEO Sample data. A GDS record represents a collection of biologically and statistically comparable GEO Samples and forms the basis of GEO's suite of data display and analysis tools. Samples within a GDS refer to the same Platform, that is, they share a common set of probe elements. Value measurements for each Sample within a GDS are assumed to be calculated in an equivalent manner, that is, considerations such as background processing and normalization are consistent across the dataset. Information reflecting experimental design is provided through GDS subsets.
A GEOSoft object
gse <- geo_soft("GSE10", odir = tempdir())
gpl <- geo_soft("gpl98", odir = tempdir())
gsm <- geo_soft("GSM1", odir = tempdir())
gds <- geo_soft("GDS10", odir = tempdir())
NCBI GEO allows supplemental files to be attached to GEO Series (GSE), GEO platforms (GPL), and GEO samples (GSM). This function 'knows' how to get these files based on the GEO accession. No parsing of the downloaded files is attempted, since the file format is not generally knowable.
geo_suppl(
  accession,
  pattern = NULL,
  ftp_over_https = TRUE,
  handle_opts = list(),
  odir = getwd()
)
accession |
A character of GEO accession IDs. Examples:
|
pattern |
character string containing a regular expression to be matched in the supplementary file names. |
ftp_over_https |
Logical scalar. If |
handle_opts |
A list of named options / headers to be set in the
|
odir |
Destination directory for downloads. Defaults to the current working directory. |
A list (or an atomic character vector if only one accession is
provided) of the full file paths of the downloaded files.
geo_suppl("GSM1137", odir = tempdir())
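The effect of pattern can be previewed offline with base R regular expressions. The file names below are made up for illustration, and how geo_suppl() applies the pattern internally is an assumption; the manual only states that it is a regular expression matched against supplementary file names.

```r
# Made-up supplementary file names, for illustration only.
files <- c(
  "GSE0000_RAW.tar",
  "GSE0000_counts.txt.gz",
  "GSE0000_readme.txt"
)

# Keep only gzip-compressed text files. The dots are escaped so they match
# literally, and `$` anchors the match at the end of the file name.
pattern <- "\\.txt\\.gz$"
selected <- grep(pattern, files, value = TRUE)
selected
```

Testing a pattern this way before calling geo_suppl() avoids downloading large archives that were not wanted.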
Construct and resolve URLs for GEO (Gene Expression Omnibus) resources. This
function provides a unified interface for accessing GEO data either via
Accession Display Bar of GEO database or directly from GEO FTP/HTTPS servers.
Depending on the accession type or requested format and amount, it
automatically generates the correct URL.
geo_url(
  accession,
  format = NULL,
  amount = NULL,
  scope = NULL,
  ftp_over_https = NULL
)
accession |
A character of GEO accession IDs. Examples:
|
format |
A character specifying file format type requested. GEO data can be accessed through two sites:
|
amount |
A character specifying the amount of data (Only applicable to Accession Display Bar access):
|
scope |
A character specifying which GEO accessions to include (Only applicable to Accession Display Bar access).
|
ftp_over_https |
Logical scalar. If |
A character vector of GEO URLs.
geo_url("GSE10")
geo_url("gpl98")
geo_url("GSM1")
geo_url("GDS10")
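The two access routes described above can be sketched in base R. This is a minimal illustration, not the code behind geo_url(): both helper names are hypothetical, and mapping format, amount, and scope to the form, view, and targ query parameters of the Accession Display Bar is an assumption based on NCBI's public URL conventions.

```r
# Hypothetical helpers for illustration only; not part of geokit.

# Accession Display Bar URL (acc.cgi) with explicit query parameters.
acc_display_url <- function(accession, form = "text", view = "brief",
                            targ = "self") {
  sprintf(
    "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=%s&targ=%s&form=%s&view=%s",
    toupper(accession), targ, form, view
  )
}

# Directory on the GEO file server. Accessions are grouped into folders
# where the last (up to) three digits are replaced by "nnn".
ftp_dir_url <- function(accession) {
  accession <- toupper(accession)
  subdir <- c(GSE = "series", GPL = "platforms", GSM = "samples",
              GDS = "datasets")
  stub <- sub("[0-9]{1,3}$", "nnn", accession)
  sprintf(
    "https://ftp.ncbi.nlm.nih.gov/geo/%s/%s/%s/",
    unname(subdir[substr(accession, 1L, 3L)]), stub, accession
  )
}

acc_display_url("GSM1")
ftp_dir_url("GSE1122")
```

The sketch always uses HTTPS for the file server; whether geo_url() returns an ftp:// or https:// address depends on ftp_over_https.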
Checks whether the input data is already log-transformed; if not, applies a log2 transformation. This helps ensure comparability of expression values across datasets.
log_trans(data, pseudo = 1, ...)

## S3 method for class 'matrix'
log_trans(data, pseudo = 1, ...)

## S3 method for class 'ExpressionSet'
log_trans(data, pseudo = 1, ...)
data |
A matrix-like data object. |
pseudo |
A numeric value added before transformation to avoid
taking log of zero. For example, |
... |
Additional arguments passed to methods. |
The function heuristically determines whether data has already been
log-transformed, following the methodology used in
GEO2R. If it has not, the function
applies log2() with the specified pseudo offset.
A matrix or an ExpressionSet with
log2-transformed expression values.
NCBI GEO2R: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE1122
sample_means <- 2^runif(2, 2, 10)
sample_disp <- 100 / sample_means + 0.5
data <- matrix(
  rnbinom(4, mu = sample_means, size = 1 / sample_disp),
  nrow = 2
)
log_trans(data)
log_trans(log2(data))
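The kind of heuristic referred to above can be sketched with the quantile test that appears in GEO2R-generated R scripts. This is an illustration and not the source of log_trans(): needs_log2() and log2_if_needed() are hypothetical names, and adding pseudo before the transformation follows the argument description rather than GEO2R, which sets non-positive values to NaN instead.

```r
# Hypothetical helpers for illustration only; not part of geokit.

# Quantile test in the style of GEO2R scripts: values look untransformed
# when the 99th percentile exceeds 100, or when the range exceeds 50 and
# the first quartile is positive.
needs_log2 <- function(x) {
  qx <- as.numeric(
    quantile(x, c(0, 0.25, 0.5, 0.75, 0.99, 1), na.rm = TRUE)
  )
  (qx[5] > 100) || (qx[6] - qx[1] > 50 && qx[2] > 0)
}

# Apply log2 with a pseudo offset only when the data look untransformed.
log2_if_needed <- function(x, pseudo = 1) {
  if (needs_log2(x)) log2(x + pseudo) else x
}

raw <- matrix(c(0, 10, 200, 5000), nrow = 2)
logged <- log2_if_needed(raw)
needs_log2(raw)
needs_log2(logged)
```

Because the test is a heuristic, data with an unusual scale (for example, small raw counts) can be misclassified, so inspecting the value range before relying on it is advisable.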
Many GSEs now store key-value annotation pairs in "characteristics_ch*"
metadata headers. In that case, this function cleans up the GEOSoft
@metadata slot, turning the keys into column names and the values
into column values.
parse_sample_data(x, ...)

## S3 method for class 'GEOSeries'
parse_sample_data(x, ...)

## S3 method for class 'data.frame'
parse_sample_data(x, ..., fields = NULL, sep = ":")

## S3 method for class 'list'
parse_sample_data(x, ...)
x |
A GEOSeries object, a list of
GEOSoft from the |
... |
Additional arguments passed on to methods. |
fields |
A character vector specifying which fields should be parsed. |
sep |
A single-byte string defining the pairing separator. |
A data.frame whose rows are samples and whose columns are the sample information fields.
gse201530_soft <- geo_soft("GSE201530", odir = tempdir())
head(parse_sample_data(gse201530_soft))
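The key-to-column transformation can be sketched in base R on toy input. This is not the implementation of parse_sample_data(): the sample names and characteristics below are made up, parse_pairs() is a hypothetical helper, and splitting each entry at the first occurrence of sep is an assumption.

```r
# Made-up "characteristics_ch*" style entries, one character vector per
# sample, for illustration only.
chars <- list(
  GSM_A = c("tissue: liver", "age: 12 weeks"),
  GSM_B = c("tissue: kidney", "age: 10 weeks")
)

# Hypothetical helper: split each "key<sep>value" string at the first
# separator and return the values named by their keys.
parse_pairs <- function(x, sep = ":") {
  pos <- regexpr(sep, x, fixed = TRUE)
  keys <- trimws(substr(x, 1L, pos - 1L))
  values <- trimws(substr(x, pos + nchar(sep), nchar(x)))
  stats::setNames(values, keys)
}

rows <- lapply(chars, parse_pairs)
all_keys <- unique(unlist(lapply(rows, names)))

# One row per sample, one column per key; keys missing from a sample
# become NA.
mat <- do.call(rbind, lapply(rows, function(r) unname(r[all_keys])))
colnames(mat) <- all_keys
sample_data <- as.data.frame(mat, stringsAsFactors = FALSE)
sample_data
```

Splitting at the first separator only keeps values that themselves contain the separator (such as times written as "10:30") intact.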