In the following workflow, we walk the reader through training a OC FTSEC scores signature using FTSEC samples as training set and applying it to score TCGA tumor samples (testing set).
The the data is organized in a data frame with 387 samples (190 normals and 197 TCGA) as columns and 21071 exprerssed genes as rows, named as combat_alln_mean
. Sample annotation is organized in another data frame NormalSamplesTable
.
load("RData/Normal-197-TestSet-TCGA-combatGCnorm_all_data.rda")
176_p6 | 179_Lp6 | 181_Lp7 | 185_p5 | 189_Lp6 | |
---|---|---|---|---|---|
WASH7P | 3.358975 | 2.749803 | -2.419971 | -2.419971 | -2.419971 |
OR4F5 | -4.988889 | -4.988889 | -4.988889 | -4.988889 | -4.988889 |
OR4F29 | -4.960010 | -4.960010 | -4.960010 | -4.960010 | -4.960010 |
OR4F16 | -4.986294 | -4.986294 | -4.986294 | -4.986294 | -4.986294 |
LINC00115 | -3.989873 | 2.737894 | 1.204892 | -3.989873 | -3.989873 |
load("RData/NormalsSamplesTableV3.RData")
head(NormalSamplesTable)
## Samples CellType Colors SP Batch
## 1 176_p6 NOSE #377EB8 NO B1
## 2 179_Lp6 NOSE #377EB8 NO B1
## 3 181_Lp7 NOSE #377EB8 NO B1
## 4 185_p5 NOSE #377EB8 NO B1
## 5 189_Lp6 NOSE #377EB8 NO B1
## 6 191_Lp5 NOSE #377EB8 NO B1
Add TCGA Sample annotation in Normals sample table annotation. After the data is splited into 3 groups (NFTE, NOSE, HGSOC).
nTCGAsamples = 197
TCGAsamplesTable = data.frame(Samples = colnames(combat_alln_mean)[191:387], CellType = c(rep("HGSOC", nTCGAsamples)), Colors=c(rep("green", nTCGAsamples)), SP="None", Batch="None" )
SamplesTable = rbind(NormalSamplesTable, TCGAsamplesTable)
all_expr = combat_alln_mean
normals_expr = all_expr[,c(1:190)]
FTSEC_expr = normals_expr[,which(SamplesTable$CellType == "NFTE")]
dim(FTSEC_expr)
## [1] 21071 71
OSEC_expr = normals_expr[,which(SamplesTable$CellType == "NOSE")]
dim(OSEC_expr)
## [1] 21071 119
TCGA_expr = all_expr[, which(SamplesTable$CellType == "HGSOC")]
dim(TCGA_expr)
## [1] 21071 197
Find the mean center by subtracting the mean of each gene from the entire data. The mean of each gene just be in a numeric vector the same size as the number of probes, in this case 21071.
m1 <- apply(normals_expr, 1, mean )
normals_expr <- normals_expr - m1
Now we can begin to train the the one-class model with the gelnet
function. The gelnet function can be used for Linear Regression, Binary Classification and One class Problems by using an iterative method called coordinated descent (Sokolov et al. 2016). It has four main arguments described below:
X: n by p matrix => transpose(X.r) y: NULL for one class models l1: coefficient for the L1-norm penalty => 0 l2: coefficient for the L2-norm penalty => 1
X1.tr <- FTSEC_expr
X1.bk <- OSEC_expr
library( gelnet )
library( dplyr )
set.seed(13)
## Train a one-class model and store the signature in mm1
mm1 <- gelnet( t(X1.tr), NULL, 0, 1 )
## Training a one-class model
## Iteration 1 : f = 0.6931472
## Iteration 2 : f = 0.1757746
## Iteration 3 : f = 0.08471829
## Iteration 4 : f = 0.04146866
## Iteration 5 : f = 0.02035861
## Iteration 6 : f = 0.009690802
## Iteration 7 : f = 0.004332091
## Iteration 8 : f = 0.001867173
## Iteration 9 : f = 0.0008424111
## Iteration 10 : f = 0.0003518755
## Iteration 11 : f = 0.0001872333
## Iteration 12 : f = 0.0001656636
## Iteration 13 : f = 0.0001655145
To test how the model’s performance by using leave one out cross-validation. This process has three steps:
Train model on non-left-out data
Score the left-out sample against the background
AUC = P( left-out sample is scored above the background )
## Perform leave-one-out cross-validation
auc <- c()
for( i in 1:ncol(X1.tr) )
{
## Train a model on non-left-out data
cat( "Current: ", i, " of ", ncol(X1.tr), "\n")
one <- X1.tr[,-i]
one_res <- gelnet( t(one), NULL, 0, 1 )
## Score the left-out sample against the background
score1.bk <- apply( X1.bk[rownames(X1.tr),], 2, function(z) {cor( one_res$w, z, method="sp" )} )
score1 <- cor( one_res$w, X1.tr[,i], method="sp" )
## AUC = P( left-out sample is scored above the background )
auc[i] <- sum( score1 > score1.bk ) / length(score1.bk)
cat( "Current AUC: ", auc[i], "\n" )
cat( "Average AUC: ", mean(auc), "\n" )
}
Now we can obtain the auc mean of all iteractions.
head(mean(auc))
## [1] 0.2240896
In this step we will use the signature that was created in the previous steps and stored as mm1
to score the TCGA data.
w1 = mm1$w
X1 <- TCGA_expr[names(w1),] #54 9318
X1 <- as.matrix(X1)
## Score via Spearman correlation
s1 <- apply( X1, 2, function(z) {cor( z, w1, method="sp", use="complete.obs" )} )
Scale the scores into a ratio from 0 to 1 and store as data frame.
## Scale the scores to be between 0 and 1
s1 <- s1 - min(s1)
s1 <- s1 / max(s1)
s1 <- as.data.frame(t(s1))
TCGA-24-1471-01A-01R-1566-13 | TCGA-61-2008-01A-02R-1568-13 | TCGA-23-1111-01A-01R-1567-13 | TCGA-23-1028-01A-01R-1564-13 | TCGA-59-2348-01A-01R-1569-13 |
---|---|---|---|---|
0.9072112 | 0.7678116 | 0.8180536 | 0.6836016 | 0.7431097 |
We demonstrated how to derive a gene signature from FTSEC cell type samples. The robustness of the signature was estimated through leave-one-out cross-validation. The same procedures can now be applied to OSEC samples following the same steps by changuing the test and background sets.