Download and process GEO Gene expression matrix using R

This tutorial demonstrate how to download GEO gene expression matrix quickly and easily using R. Gene expression matrix is a table with genes as row and samples as the column. Each cell value, g_ij is the gene expression value correspond to the gene,‘g_i’ for the sample S_j

Gene Expression Omnibus (GEO) data organization

GEO is a public repository that archives and freely distributes high-throughput genomics data including NGS, and microarray. GEO data are organized as different entities and provide unique accession numbers for each of these entities asfollows

Platform - GPLxxx
Samples - GSMxxx
Series - GSExxx
Dataset - GDSxxx

If we have the accession numbers, we can readily retrieve the data using R, facilitating downstream analyses such as differential expression and gene enrichment analysis. Below, I illustrate how to download an expression matrix (series) containing official gene symbols and sample names, rather than solely probe names and sample numbers.We are downloading the series with accession number GSE122541. The work is related to Shift work disrupts circadian regulation of the transcriptome in hospital nurses.

Prerequisite

R program installed
library GEOquery, stringr

Load the required library.

library(GEOquery)

Let’s access GSE122541

gset <- getGEO("GSE122541", GSEMatrix =TRUE, getGPL=FALSE)

## Found 1 file(s)

## GSE122541_series_matrix.txt.gz

Let see what type of the data we have:

class(gset)

## [1] "list"

gset is a list. Check what information it has:

list(gset)

## [[1]]
## [[1]]$GSE122541_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 47323 features, 44 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM3473465 GSM3473466 ... GSM3473508 (44 total)
##   varLabels: title geo_accession ... work shift:ch1 (40 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 30712475 
## Annotation: GPL10558

This Gene expression data have 47323 features (genes), and 44 samples. The sample accession numbers are GSM3473465, GSM3473466. Its platform accession number is GPL10558. We can access the series GSM3473465 as follows:

gsm<-getGEO('GSM3473465')
Table(gsm)[1:5,1:3] # See the first 5 rows and 3 columns

##         ID_REF     VALUE Detection Pval
## 1 ILMN_1762337 108.54140        0.28312
## 2 ILMN_2055271 123.14240        0.02468
## 3 ILMN_1736007 108.85779        0.26623
## 4 ILMN_2383229  96.48084        0.87143
## 5 ILMN_1806310 123.14240        0.02468

ID_REF column is the probe ID for a a particular gene and the VALUE is the corresponding gene expression for the sample GSM3473465. Now check what sample it is:

head(Meta(gsm))

## $channel_count
## [1] "1"
## 
## $characteristics_ch1
## [1] "Sex: Female"           "group: Fasted"         "work shift: Day-Shift"
## [4] "time of sample: 9"    
## 
## $contact_address
## [1] "1720 7th Ave S"
## 
## $contact_city
## [1] "Birmingham"
## 
## $contact_country
## [1] "USA"
## 
## $contact_department
## [1] "Psychiatry"

The variable $characteristics_ch1 contains the sample information. The sample is a female, fasted, day-shift subject, and was taken at 9 o’clock. Instead of simply labeling it as GSM3473465, one can assign a more meaningful name to this sample. For instance, if focusing on work shift and time, it could be named as Day-shift_9. Now we are trying to extract the gene expression for all the samples.

gset<-gset[[1]]
gset_expr<-exprs(gset)
gset_expr[1:5,1:5] # show 5 row and columns of the expression data

##              GSM3473465  GSM3473466 GSM3473467 GSM3473468 GSM3473469
## ILMN_1343291 15919.7828 16777.45479 15607.7791 15919.7828 18113.3214
## ILMN_1343295  4238.9839  3620.28804  3668.5229  3001.0864  3500.4735
## ILMN_1651199   104.5345    99.34228   103.1815   104.0592    93.2272
## ILMN_1651209   132.7999   123.13354   126.2113   121.1003   107.1228
## ILMN_1651210   118.8960   113.87468   111.5030   125.2639   142.5366

We can visualize the distribution of the expression data:

hist(gset_expr)

The probe IDs are the row names of the expression data.

Probe_ID<- as.character(rownames(gset_expr) ) #get the probe ID
gset_expr_ProbID<-as.data.frame(cbind(Probe_ID, gset_expr)) # Bind Probe ID and EXpr data

Assign gene symbol to probe ID

We can now try to assign gene symbols to probe ID.Get the platform information using platform access number GPL10558

gpl<-getGEO("GPL10558")
gpl_table<-Table(gpl)
gpl_table[1:10, 1:7]

##              ID      Species Source  Search_Key Transcript ILMN_Gene
## 1  ILMN_1343048                                                     
## 2  ILMN_1343049                                                     
## 3  ILMN_1343050                                                     
## 4  ILMN_1343052                                                     
## 5  ILMN_1343059                                                     
## 6  ILMN_1343061                                                     
## 7  ILMN_1343062                                                     
## 8  ILMN_1343063                                                     
## 9  ILMN_1343064                                                     
## 10 ILMN_1343291 Homo sapiens RefSeq NM_001402.4  ILMN_5311    EEF1A1
##    Source_Reference_ID
## 1                     
## 2                     
## 3                     
## 4                     
## 5                     
## 6                     
## 7                     
## 8                     
## 9                     
## 10         NM_001402.5

Extract the columns ID and ILMN_Gene and combine it with gene expression data

ID_Gene<- dplyr::select(gpl_table,'ID', 'ILMN_Gene') # select the ID and gene name
gset_expr_ProbID<-(subset(gset_expr_ProbID, Probe_ID %in% ID_Gene$ID)) # subset matching IDs
gset_exp_GeneName<-dplyr::right_join(x = ID_Gene,y=gset_expr_ProbID,  by = c("ID" = "Probe_ID")) # Combine gene name with gene expression
gset_exp_GeneName<-gset_exp_GeneName[-1]
colnames(gset_exp_GeneName)[1]<-'Gene_symbol'
gset_exp_GeneName[1:5,1:5]

##   Gene_symbol  GSM3473465  GSM3473466  GSM3473467  GSM3473468
## 1      EEF1A1 15919.78282 16777.45479 15607.77906 15919.78282
## 2       GAPDH  4238.98393 3620.288036 3668.522876 3001.086433
## 3   LOC643334 104.5345278 99.34228171 103.1814704 104.0591587
## 4     SLC35E2 132.7999304 123.1335399 126.2112916 121.1003337
## 5      DUSP22 118.8960318 113.8746775 111.5030407 125.2638584

Assign meaningful sample name to sample number

To access the sample information from GEO Omnibus, I have created the following function.

GetSampleInfo <- function(gset){
  tmp <- pData(gset)[,grep("characteristics",colnames(pData(gset)))]
  samps <- rownames(tmp)
  longChar <- do.call(rbind,lapply(tmp,function(z){
    z<-as.character(z)
    z1 <- substr(z,1,regexpr(": ",z)-1)
    z2 <- substr(z,regexpr(": ",z)+2,nchar(z))
    cbind(samps,z1,z2)[z1!="",]
  }))
  chars <- unique(longChar[,2])
  out <- matrix(nrow=length(samps),nc=length(chars),dimnames=list(samps,chars))
  out[longChar[,1:2]] <- longChar[,3]
  colnames(out) <- make.names(colnames(out))
  out <- as.data.frame(out)
  out <- out[colnames(exprs(gset)),]
  return(out)   
}

Call GetSampleInfo() function with gset as the passing argument for sample information

sample_Info <- GetSampleInfo(gset)
head(sample_Info)

##               Sex  group work.shift time.of.sample
## GSM3473465 Female Fasted  Day-Shift              9
## GSM3473466 Female   <NA>  Day-Shift             12
## GSM3473467 Female   <NA>  Day-Shift             15
## GSM3473468 Female   <NA>  Day-Shift             18
## GSM3473469 Female   <NA>  Day-Shift             21
## GSM3473470 Female   <NA>  Day-Shift             24

We are interested on work shift and time as the sample name. So combine these two string.

library('stringr') # Load the required library for concatenate strings.

## Warning: package 'stringr' was built under R version 4.3.2

sample_Info$sample_name <- str_c(sample_Info$work.shift,'_', sample_Info$time.of.sample)
head(sample_Info)

##               Sex  group work.shift time.of.sample  sample_name
## GSM3473465 Female Fasted  Day-Shift              9  Day-Shift_9
## GSM3473466 Female   <NA>  Day-Shift             12 Day-Shift_12
## GSM3473467 Female   <NA>  Day-Shift             15 Day-Shift_15
## GSM3473468 Female   <NA>  Day-Shift             18 Day-Shift_18
## GSM3473469 Female   <NA>  Day-Shift             21 Day-Shift_21
## GSM3473470 Female   <NA>  Day-Shift             24 Day-Shift_24

Now we can replace the sample number with newly created sample name as follows:

sample_name<-c('Gene_symbol', t(sample_Info$sample_name))
Expression_matrix<-rbind(sample_name, gset_exp_GeneName)
colnames(Expression_matrix)<-Expression_matrix[1,]
Expression_matrix<-Expression_matrix[-1,]
row.names(Expression_matrix)<-NULL
Expression_matrix[1:5,1:5] # show first 5 genes and samples

##   Gene_symbol Day-Shift_9 Day-Shift_12 Day-Shift_15 Day-Shift_18
## 1      EEF1A1 15919.78282  16777.45479  15607.77906  15919.78282
## 2       GAPDH  4238.98393  3620.288036  3668.522876  3001.086433
## 3   LOC643334 104.5345278  99.34228171  103.1814704  104.0591587
## 4     SLC35E2 132.7999304  123.1335399  126.2112916  121.1003337
## 5      DUSP22 118.8960318  113.8746775  111.5030407  125.2638584

Hope that everyone can follow it. If you have queries please contact me at shijusisobhan@gmail.com

Download and process GEO Gene expression matrix using R

Shiju Sisobhan

2024-03-07

Gene Expression Omnibus (GEO) data organization

Prerequisite

Assign gene symbol to probe ID

Assign meaningful sample name to sample number