---
title: "Search GEO"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Search GEO}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
library(geokit)
library(stringr)
```
NCBI search terms can be restricted to a specific search field by appending the
field name in square brackets. For instance, `"Homo sapiens[ORGN]"` searches for
`Homo sapiens` in the `"Organism"` field; see the NCBI documentation for details
on the query syntax. The same terms can be passed to `geo_search()`, which
parses the search results and returns a `data.frame` containing all records that
match the term. Internally, `geo_search()` relies on the
[`rentrez`](https://github.com/ropensci/rentrez) package, which provides
functions for working with the [NCBI
Eutils](http://www.ncbi.nlm.nih.gov/books/NBK25500/) API, so an NCBI API key can
be used to speed up searching; see the Eutils documentation for details.
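
The field-tag syntax described above can be assembled programmatically. The sketch below is only an illustration: `field_term()` and `combine_terms()` are hypothetical helpers written for this example and are not part of `geokit` or `rentrez`. The commented-out line shows one way an API key could be registered with `rentrez`, assuming `geo_search()` picks it up through `rentrez`.

```r
# Hypothetical helper (not part of geokit): tag a value with an NCBI search field.
field_term <- function(value, field) {
  sprintf("%s[%s]", value, field)
}

# Hypothetical helper: join several tagged terms with a boolean operator.
combine_terms <- function(..., op = "AND") {
  paste(c(...), collapse = sprintf(" %s ", op))
}

term <- combine_terms(
  field_term("diabetes", "ALL"),
  field_term("Homo sapiens", "ORGN"),
  field_term("GSE", "ETYP")
)
term

# Assumption: a key registered with rentrez is honoured by geo_search().
# rentrez::set_entrez_key("<your-api-key>")
```

The resulting string can then be passed to `geo_search()` in place of a hand-written term.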

Suppose we want ***GSE*** GEO records related to ***human diabetes***. The following code retrieves them; the returned object is a `data.frame`:
```{r diabetes_gse_records, cache = TRUE}
diabetes_gse_records <- geo_search(
  "diabetes[ALL] AND Homo sapiens[ORGN] AND GSE[ETYP]"
)
head(diabetes_gse_records[1:5])
```
Once you have the search results, you can filter them based on specific
criteria. For instance, to keep expression profiling GSE datasets on diabetic
nephropathy that contain at least 6 samples, use the following code:
```{r filtered_diabetes_gse_records}
diabetes_nephropathy_gse_records <- diabetes_gse_records |>
  dplyr::mutate(
    number_of_samples = str_match(Contains, "(\\d+) Samples?")[
      , 2L,
      drop = TRUE
    ],
    number_of_samples = as.integer(number_of_samples)
  ) |>
  dplyr::filter(
    dplyr::if_any(
      c(Title, Summary),
      ~ str_detect(.x, regex("diabetes|diabetic", ignore_case = TRUE))
    ),
    dplyr::if_any(
      c(Title, Summary),
      ~ str_detect(.x, regex("nephropathy", ignore_case = TRUE))
    ),
    str_detect(Type, regex("expression profiling", ignore_case = TRUE)),
    number_of_samples >= 6L
  )
head(diabetes_nephropathy_gse_records[1:5])
```
After applying the filter, we obtain `r nrow(diabetes_nephropathy_gse_records)`
candidate datasets. This filtering step significantly reduces the time spent
manually reviewing summary records.
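
The sample-count extraction used in the filter can be checked in isolation on a small table. The sketch below uses made-up `Title` and `Contains` values (assumed to resemble what `geo_search()` returns, not real GEO records) and applies the same `str_match()` pattern and threshold.

```r
library(stringr)

# Made-up records for illustration only; not real GEO entries.
toy <- data.frame(
  Title = c(
    "Diabetic nephropathy kidney biopsies",
    "Type 2 diabetes pancreatic islets",
    "Nephropathy in diabetes, pilot study"
  ),
  Contains = c("22 Samples", "1 Sample", "4 Samples"),
  stringsAsFactors = FALSE
)

# Column 1 of the match matrix is the full match, column 2 the captured digits.
matched <- str_match(toy$Contains, "(\\d+) Samples?")
toy$number_of_samples <- as.integer(matched[, 2L])

# Keep nephropathy records with at least 6 samples.
keep <- str_detect(toy$Title, regex("nephropathy", ignore_case = TRUE)) &
  toy$number_of_samples >= 6L
toy[keep, ]
```

The optional `s` in `Samples?` lets the pattern match both the singular and plural forms, so single-sample records are parsed rather than turned into `NA`.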

You can also use `geo_meta()` to build, on the fly, a metadata database focused
on your own field of interest. See `vignette("geometadb")` for details.
## Session Information
```{r}
sessionInfo()
```