---
title: "Build Your Own GEOmetadb"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Build Your Own GEOmetadb}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

If you do bioinformatics research, especially with GEO (Gene Expression Omnibus) data, you may have used the [GEOmetadb](https://github.com/zhujack/GEOmetadb) package. It was once a great tool, but it has not been updated for years. The `GEOmetadb` database is also static: it relies on the developer to update it manually, which often leaves the data incomplete or outdated.

**Why Build Your Own GEOmetadb?**

- **Focus on Your Research Area**: For example, diabetes, liver cancer, or urothelial carcinoma. You can aggregate and annotate all relevant Series, which simplifies later bulk downloads, filtering, and reproducible analysis.
- **Flexible Offline Querying**: GEO's web search is not flexible enough. After building your own database with `geokit`, you can search it freely with familiar tools such as regular expressions or database queries (e.g., grep, SQLite).
- **Dynamic Updates and Incremental Sync**: Schedule scripts to update the database regularly, so the data stays current without manual updates.
- **Efficient Data Access**: Once the data is stored locally, queries no longer depend on the network, which saves both time and bandwidth, especially with large GEO Series datasets.

`geokit` offers an efficient and straightforward approach, with core operations implemented in Rust, so you can build your own metadata database in just a few minutes. For example, processing 654 records takes around **30.5 seconds** if the data is already downloaded.

The workflow looks like this:

- Search the GEO database using **NCBI Eutils** and extract relevant metadata.
- Fetch the relevant **metadata** from the GEO database and save it locally to build your own `GEOmetadb`.
- Use R's regular expressions and filtering functions, or tools like Excel and SQLite, to quickly search and analyze Series (GSE) and sample information.

# Search and Filter Single-Cell Studies of Urothelial Carcinoma

```{r setup}
library(geokit)
library(stringr)
```

## 1. Search for Related Series by Keywords (e.g., Bladder/Urothelial Cancer)

The search uses **NCBI Eutils** query syntax. For details, see [Querying GEO DataSets](https://www.ncbi.nlm.nih.gov/geo/info/qqtutorial.html).

```{r}
# Run one search per keyword, restricted to human Series (GSE) entries
uc_gse <- list(
  geo_search("bladder cancer[ALL] AND Homo sapiens[ORGN] AND GSE[ETYP]"),
  geo_search("urothelial cancer[ALL] AND Homo sapiens[ORGN] AND GSE[ETYP]")
)
# Combine both result sets and drop Series returned by both searches
uc_gse <- unique(dplyr::bind_rows(uc_gse))
```

## 2. Extract Sample Count from the "Contains" Field

```{r}
uc_gse <- uc_gse |>
  dplyr::mutate(
    # Capture the digits that precede "Sample" or "Samples"
    number_of_samples = str_match(Contains, "(\\d+) Samples?")[
      , 2L,
      drop = TRUE
    ],
    number_of_samples = as.integer(number_of_samples)
  )

# Quick statistics
max_samples <- max(uc_gse$number_of_samples, na.rm = TRUE)
median_samples <- median(uc_gse$number_of_samples, na.rm = TRUE)
dplyr::slice_max(uc_gse, number_of_samples)
```

## 3. Fetch Series Metadata (Parallel Processing and Batch Saving to Local Directory)

```{r, eval=FALSE}
uc_gse_meta <- geo_meta(
  uc_gse[["Series Accession"]],
  odir = "gse_urothelial_cancer"
)
```

## 4. Filter Possible Single-Cell Studies (Search for Keywords in Summary/Title/Design)

```{r, eval=FALSE}
# Keep Series whose summary, title, or overall design mentions single-cell work
uc_gse_sc <- dplyr::filter(
  uc_gse_meta,
  dplyr::if_any(
    c(Series_summary, Series_title, Series_overall_design),
    str_detect,
    pattern = regex("single[- ]cell|scRNA", ignore_case = TRUE)
  )
) |>
  dplyr::mutate(
    # Count samples by splitting the "; "-separated sample ID string
    number_of_samples = lengths(
      strsplit(Series_sample_id, "; ", fixed = TRUE)
    )
  )

# Output results and statistics
dplyr::slice_max(uc_gse_sc, number_of_samples)$Series_geo_accession
max(uc_gse_sc$number_of_samples, na.rm = TRUE)
median(uc_gse_sc$number_of_samples, na.rm = TRUE)
writexl::write_xlsx(uc_gse_sc, "uc_gse_sc.xlsx")
```

By following these steps, you have successfully created your own `GEOmetadb`!

# Session Information

```{r}
sessionInfo()
```
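
# Appendix: A Sketch of Incremental Updates

The introduction mentions incremental sync, but the steps above always fetch every Series. The chunk below is a minimal sketch of one way to fetch only what is new, using base R only. The helper `pending_accessions()`, the vectors `stored` and `found`, and the `GSE_A`-style identifiers are hypothetical placeholders and not part of `geokit`. How you record the accessions that are already saved (for example, from a previously written file) is left open here.

```{r}
# Hypothetical helper: accessions found by a new search that are not yet
# stored locally. setdiff() also removes duplicates from `found`.
pending_accessions <- function(found, stored) {
  setdiff(found, stored)
}

# Placeholder identifiers standing in for real Series accessions
stored <- c("GSE_A", "GSE_B")          # saved by an earlier run
found  <- c("GSE_B", "GSE_C", "GSE_C") # returned by a fresh search

to_fetch <- pending_accessions(found, stored)
to_fetch
```

In a real update script, `found` would come from the `"Series Accession"` column of a fresh search as in step 1, and `to_fetch` could then be passed to `geo_meta()` as in step 3, so that only the new Series are downloaded.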