If you do bioinformatics research, especially with GEO (Gene
Expression Omnibus) data, you may have used the GEOmetadb package. It
was once a great tool, but it has not been updated for years.
GEOmetadb is also static: it relies on the developer to update it
manually, which often leaves the data incomplete or outdated.
Why Build Your Own GEOmetadb?
With geokit, you can use familiar tools like regular expressions
or database queries (e.g., grep, SQLite) to search freely and meet
various needs. geokit offers an efficient and straightforward method,
with core operations implemented in Rust, allowing you to
build your own metadata database in just a few minutes. For example,
processing 654 records takes around 30.5 seconds if the
data is already downloaded.
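
As a rough illustration of the regular-expression search mentioned above, the sketch below filters a small, made-up metadata table with base R's `grepl()`. The data frame `toy_meta`, its column names, and the accession values are hypothetical stand-ins for illustration, not geokit output.

```r
# Hypothetical metadata table; the columns only mimic what a GEO series
# record might hold and are not produced by geokit.
toy_meta <- data.frame(
  accession = c("GSE0001", "GSE0002", "GSE0003"),
  title = c(
    "Single-cell RNA-seq of colon biopsies",
    "Bulk microarray of serum samples",
    "scRNA profiling of inflamed mucosa"
  ),
  stringsAsFactors = FALSE
)

# Case-insensitive regular expression, similar in spirit to `grep -i -E`.
pattern <- "single[- ]cell|scRNA"
hits <- toy_meta[grepl(pattern, toy_meta$title, ignore.case = TRUE), ]

# Accessions whose title matches the pattern.
print(hits$accession)
```

Because the metadata lives in an ordinary data frame, the same filter could be written with any other tool that understands regular expressions.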
uc_gse <- uc_gse |>
  dplyr::mutate(
    number_of_samples = str_match(Contains, "(\\d+) Samples?")[
      , 2L,
      drop = TRUE
    ],
    number_of_samples = as.integer(number_of_samples)
  )
# Quick Statistics
max_samples <- max(uc_gse$number_of_samples, na.rm = TRUE)
median_samples <- median(uc_gse$number_of_samples, na.rm = TRUE)
dplyr::slice_max(uc_gse, number_of_samples)
#> Title
#> 1 Prediction of tissue-of-origin of early-stage cancers using serum miRNomes
#> Summary
#> 1 Large-scale serum miRNomics in combination with machine learning could lead to the development of a blood-based cancer classification system.
#> Organism Type
#> 1 Homo sapiens Non-coding RNA profiling by array
#> FTP download
#> 1 GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE211nnn/GSE211692/
#> ID SRA Run Selector Contains Datasets Platforms Series Accession
#> 1 200211692 <NA> 16190 Samples <NA> GPL21263 GSE211692
#> number_of_samples
#> 1 16190

uc_gse_sc <- dplyr::filter(
  uc_gse_meta,
  dplyr::if_any(
    c(Series_summary, Series_title, Series_overall_design),
    str_detect,
    pattern = regex("single[- ]cell|scRNA", ignore_case = TRUE)
  )
) |>
  dplyr::mutate(
    number_of_samples = lengths(
      strsplit(Series_sample_id, "; ", fixed = TRUE)
    )
  )
# Output Results and Statistics
dplyr::slice_max(uc_gse_sc, number_of_samples)$Series_geo_accession
max(uc_gse_sc$number_of_samples, na.rm = TRUE)
median(uc_gse_sc$number_of_samples, na.rm = TRUE)
writexl::write_xlsx(uc_gse_sc, "uc_gse_sc.xlsx")

By following these steps, you have successfully created your own
GEOmetadb!
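
The code above derives `number_of_samples` in two different ways: by parsing text such as "16190 Samples", and by counting the "; "-separated IDs in `Series_sample_id`. The base R sketch below shows both approaches on made-up strings. The vectors `contains` and `sample_ids` are hypothetical inputs, and `sub()` is used here in place of `str_match()` only to keep the example free of dependencies.

```r
# Made-up inputs mimicking the two representations used above.
contains <- c("16190 Samples", "1 Sample", NA)
sample_ids <- c("GSM1; GSM2; GSM3", "GSM4")

# Approach 1: keep only the digits from strings like "16190 Samples".
# NA inputs stay NA; a string that does not match the pattern would be
# returned unchanged by sub() and become NA (with a warning) in as.integer().
n_from_text <- as.integer(sub("^([0-9]+) Samples?$", "\\1", contains))

# Approach 2: split each ID list on the literal separator "; " and count
# the pieces per element.
n_from_ids <- lengths(strsplit(sample_ids, "; ", fixed = TRUE))

print(n_from_text)
print(n_from_ids)
```

One caveat with the second approach: `strsplit()` returns a length-one result for an `NA` input, so a missing ID list would be counted as one sample unless it is handled separately.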
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] stringr_1.6.0 geokit_0.0.1.9000 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] jsonlite_2.0.0 dplyr_1.2.1 compiler_4.6.0
#> [4] tidyselect_1.2.1 Biobase_2.73.1 xml2_1.5.2
#> [7] rentrez_1.2.4 jquerylib_0.1.4 yaml_2.3.12
#> [10] fastmap_1.2.0 R6_2.6.1 generics_0.1.4
#> [13] curl_7.1.0 knitr_1.51 BiocGenerics_0.59.7
#> [16] XML_3.99-0.23 tibble_3.3.1 maketools_1.3.2
#> [19] bslib_0.11.0 pillar_1.11.1 R.utils_2.13.0
#> [22] rlang_1.2.0 cachem_1.1.0 stringi_1.8.7
#> [25] xfun_0.59 sass_0.4.10 sys_3.4.3
#> [28] otel_0.2.0 cli_3.6.6 withr_3.0.3
#> [31] magrittr_2.0.5 digest_0.6.39 lifecycle_1.0.5
#> [34] R.oo_1.27.1 R.methodsS3_1.8.2 vctrs_0.7.3
#> [37] data.table_1.18.4 evaluate_1.0.5 glue_1.8.1
#> [40] codetools_0.2-20 buildtools_1.0.0 httr_1.4.8
#> [43] tools_4.6.0 pkgconfig_2.0.3 htmltools_0.5.9