This tutorial demonstrate how to download GEO gene expression matrix quickly and easily using R. Gene expression matrix is a table with genes as row and samples as the column. Each cell value, gij is the gene expression value correspond to the gene,‘gi’ for the sample Sj
GEO is a public repository that archives and freely distributes high-throughput genomics data including NGS, and microarray. GEO data are organized as different entities and provide unique accession numbers for each of these entities asfollows
If we have the accession numbers, we can readily retrieve the data using R, facilitating downstream analyses such as differential expression and gene enrichment analysis. Below, I illustrate how to download an expression matrix (series) containing official gene symbols and sample names, rather than solely probe names and sample numbers.We are downloading the series with accession number GSE122541. The work is related to Shift work disrupts circadian regulation of the transcriptome in hospital nurses.
R program installed
library GEOquery, stringr
Load the required library.
library(GEOquery)
Let’s access GSE122541
gset <- getGEO("GSE122541", GSEMatrix =TRUE, getGPL=FALSE)
## Found 1 file(s)
## GSE122541_series_matrix.txt.gz
Let see what type of the data we have:
class(gset)
## [1] "list"
gset is a list. Check what information it has:
list(gset)
## [[1]]
## [[1]]$GSE122541_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 47323 features, 44 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM3473465 GSM3473466 ... GSM3473508 (44 total)
## varLabels: title geo_accession ... work shift:ch1 (40 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## pubMedIds: 30712475
## Annotation: GPL10558
This Gene expression data have 47323 features (genes), and 44 samples. The sample accession numbers are GSM3473465, GSM3473466. Its platform accession number is GPL10558. We can access the series GSM3473465 as follows:
gsm<-getGEO('GSM3473465')
Table(gsm)[1:5,1:3] # See the first 5 rows and 3 columns
## ID_REF VALUE Detection Pval
## 1 ILMN_1762337 108.54140 0.28312
## 2 ILMN_2055271 123.14240 0.02468
## 3 ILMN_1736007 108.85779 0.26623
## 4 ILMN_2383229 96.48084 0.87143
## 5 ILMN_1806310 123.14240 0.02468
ID_REF column is the probe ID for a a particular gene and the VALUE is the corresponding gene expression for the sample GSM3473465. Now check what sample it is:
head(Meta(gsm))
## $channel_count
## [1] "1"
##
## $characteristics_ch1
## [1] "Sex: Female" "group: Fasted" "work shift: Day-Shift"
## [4] "time of sample: 9"
##
## $contact_address
## [1] "1720 7th Ave S"
##
## $contact_city
## [1] "Birmingham"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Psychiatry"
The variable $characteristics_ch1 contains the sample information. The sample is a female, fasted, day-shift subject, and was taken at 9 o’clock. Instead of simply labeling it as GSM3473465, one can assign a more meaningful name to this sample. For instance, if focusing on work shift and time, it could be named as Day-shift_9. Now we are trying to extract the gene expression for all the samples.
gset<-gset[[1]]
gset_expr<-exprs(gset)
gset_expr[1:5,1:5] # show 5 row and columns of the expression data
## GSM3473465 GSM3473466 GSM3473467 GSM3473468 GSM3473469
## ILMN_1343291 15919.7828 16777.45479 15607.7791 15919.7828 18113.3214
## ILMN_1343295 4238.9839 3620.28804 3668.5229 3001.0864 3500.4735
## ILMN_1651199 104.5345 99.34228 103.1815 104.0592 93.2272
## ILMN_1651209 132.7999 123.13354 126.2113 121.1003 107.1228
## ILMN_1651210 118.8960 113.87468 111.5030 125.2639 142.5366
We can visualize the distribution of the expression data:
hist(gset_expr)
The probe IDs are the row names of the expression data.
Probe_ID<- as.character(rownames(gset_expr) ) #get the probe ID
gset_expr_ProbID<-as.data.frame(cbind(Probe_ID, gset_expr)) # Bind Probe ID and EXpr data
We can now try to assign gene symbols to probe ID.Get the platform information using platform access number GPL10558
gpl<-getGEO("GPL10558")
gpl_table<-Table(gpl)
gpl_table[1:10, 1:7]
## ID Species Source Search_Key Transcript ILMN_Gene
## 1 ILMN_1343048
## 2 ILMN_1343049
## 3 ILMN_1343050
## 4 ILMN_1343052
## 5 ILMN_1343059
## 6 ILMN_1343061
## 7 ILMN_1343062
## 8 ILMN_1343063
## 9 ILMN_1343064
## 10 ILMN_1343291 Homo sapiens RefSeq NM_001402.4 ILMN_5311 EEF1A1
## Source_Reference_ID
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10 NM_001402.5
Extract the columns ID and ILMN_Gene and combine it with gene expression data
ID_Gene<- dplyr::select(gpl_table,'ID', 'ILMN_Gene') # select the ID and gene name
gset_expr_ProbID<-(subset(gset_expr_ProbID, Probe_ID %in% ID_Gene$ID)) # subset matching IDs
gset_exp_GeneName<-dplyr::right_join(x = ID_Gene,y=gset_expr_ProbID, by = c("ID" = "Probe_ID")) # Combine gene name with gene expression
gset_exp_GeneName<-gset_exp_GeneName[-1]
colnames(gset_exp_GeneName)[1]<-'Gene_symbol'
gset_exp_GeneName[1:5,1:5]
## Gene_symbol GSM3473465 GSM3473466 GSM3473467 GSM3473468
## 1 EEF1A1 15919.78282 16777.45479 15607.77906 15919.78282
## 2 GAPDH 4238.98393 3620.288036 3668.522876 3001.086433
## 3 LOC643334 104.5345278 99.34228171 103.1814704 104.0591587
## 4 SLC35E2 132.7999304 123.1335399 126.2112916 121.1003337
## 5 DUSP22 118.8960318 113.8746775 111.5030407 125.2638584
To access the sample information from GEO Omnibus, I have created the following function.
GetSampleInfo <- function(gset){
tmp <- pData(gset)[,grep("characteristics",colnames(pData(gset)))]
samps <- rownames(tmp)
longChar <- do.call(rbind,lapply(tmp,function(z){
z<-as.character(z)
z1 <- substr(z,1,regexpr(": ",z)-1)
z2 <- substr(z,regexpr(": ",z)+2,nchar(z))
cbind(samps,z1,z2)[z1!="",]
}))
chars <- unique(longChar[,2])
out <- matrix(nrow=length(samps),nc=length(chars),dimnames=list(samps,chars))
out[longChar[,1:2]] <- longChar[,3]
colnames(out) <- make.names(colnames(out))
out <- as.data.frame(out)
out <- out[colnames(exprs(gset)),]
return(out)
}
Call GetSampleInfo() function with gset as the passing argument for sample information
sample_Info <- GetSampleInfo(gset)
head(sample_Info)
## Sex group work.shift time.of.sample
## GSM3473465 Female Fasted Day-Shift 9
## GSM3473466 Female <NA> Day-Shift 12
## GSM3473467 Female <NA> Day-Shift 15
## GSM3473468 Female <NA> Day-Shift 18
## GSM3473469 Female <NA> Day-Shift 21
## GSM3473470 Female <NA> Day-Shift 24
We are interested on work shift and time as the sample name. So combine these two string.
library('stringr') # Load the required library for concatenate strings.
## Warning: package 'stringr' was built under R version 4.3.2
sample_Info$sample_name <- str_c(sample_Info$work.shift,'_', sample_Info$time.of.sample)
head(sample_Info)
## Sex group work.shift time.of.sample sample_name
## GSM3473465 Female Fasted Day-Shift 9 Day-Shift_9
## GSM3473466 Female <NA> Day-Shift 12 Day-Shift_12
## GSM3473467 Female <NA> Day-Shift 15 Day-Shift_15
## GSM3473468 Female <NA> Day-Shift 18 Day-Shift_18
## GSM3473469 Female <NA> Day-Shift 21 Day-Shift_21
## GSM3473470 Female <NA> Day-Shift 24 Day-Shift_24
Now we can replace the sample number with newly created sample name as follows:
sample_name<-c('Gene_symbol', t(sample_Info$sample_name))
Expression_matrix<-rbind(sample_name, gset_exp_GeneName)
colnames(Expression_matrix)<-Expression_matrix[1,]
Expression_matrix<-Expression_matrix[-1,]
row.names(Expression_matrix)<-NULL
Expression_matrix[1:5,1:5] # show first 5 genes and samples
## Gene_symbol Day-Shift_9 Day-Shift_12 Day-Shift_15 Day-Shift_18
## 1 EEF1A1 15919.78282 16777.45479 15607.77906 15919.78282
## 2 GAPDH 4238.98393 3620.288036 3668.522876 3001.086433
## 3 LOC643334 104.5345278 99.34228171 103.1814704 104.0591587
## 4 SLC35E2 132.7999304 123.1335399 126.2112916 121.1003337
## 5 DUSP22 118.8960318 113.8746775 111.5030407 125.2638584
Hope that everyone can follow it. If you have queries please contact me at shijusisobhan@gmail.com