Machine learning analysis

Introduction

In the following workflow, we walk the reader through training a OC FTSEC scores signature using FTSEC samples as training set and applying it to score TCGA tumor samples (testing set).

Data loading

The the data is organized in a data frame with 387 samples (190 normals and 197 TCGA) as columns and 21071 exprerssed genes as rows, named as combat_alln_mean. Sample annotation is organized in another data frame NormalSamplesTable.

load("RData/Normal-197-TestSet-TCGA-combatGCnorm_all_data.rda")

	176_p6	179_Lp6	181_Lp7	185_p5	189_Lp6
WASH7P	3.358975	2.749803	-2.419971	-2.419971	-2.419971
OR4F5	-4.988889	-4.988889	-4.988889	-4.988889	-4.988889
OR4F29	-4.960010	-4.960010	-4.960010	-4.960010	-4.960010
OR4F16	-4.986294	-4.986294	-4.986294	-4.986294	-4.986294
LINC00115	-3.989873	2.737894	1.204892	-3.989873	-3.989873

load("RData/NormalsSamplesTableV3.RData")
head(NormalSamplesTable)

##   Samples CellType  Colors SP Batch
## 1  176_p6     NOSE #377EB8 NO    B1
## 2 179_Lp6     NOSE #377EB8 NO    B1
## 3 181_Lp7     NOSE #377EB8 NO    B1
## 4  185_p5     NOSE #377EB8 NO    B1
## 5 189_Lp6     NOSE #377EB8 NO    B1
## 6 191_Lp5     NOSE #377EB8 NO    B1

Add TCGA Sample annotation in Normals sample table annotation. After the data is splited into 3 groups (NFTE, NOSE, HGSOC).

nTCGAsamples = 197
TCGAsamplesTable = data.frame(Samples = colnames(combat_alln_mean)[191:387], CellType = c(rep("HGSOC", nTCGAsamples)), Colors=c(rep("green", nTCGAsamples)), SP="None", Batch="None" )

SamplesTable = rbind(NormalSamplesTable, TCGAsamplesTable)

all_expr = combat_alln_mean
normals_expr = all_expr[,c(1:190)]

FTSEC_expr = normals_expr[,which(SamplesTable$CellType == "NFTE")]
dim(FTSEC_expr)

## [1] 21071    71

OSEC_expr = normals_expr[,which(SamplesTable$CellType == "NOSE")]
dim(OSEC_expr)

## [1] 21071   119

TCGA_expr = all_expr[, which(SamplesTable$CellType == "HGSOC")]
dim(TCGA_expr)

## [1] 21071   197

Mean-center the data

Find the mean center by subtracting the mean of each gene from the entire data. The mean of each gene just be in a numeric vector the same size as the number of probes, in this case 21071.

m1 <- apply(normals_expr, 1, mean )
normals_expr <- normals_expr - m1

Train model

Now we can begin to train the the one-class model with the gelnet function. The gelnet function can be used for Linear Regression, Binary Classification and One class Problems by using an iterative method called coordinated descent (Sokolov et al. 2016). It has four main arguments described below:

X: n by p matrix => transpose(X.r) y: NULL for one class models l1: coefficient for the L1-norm penalty => 0 l2: coefficient for the L2-norm penalty => 1

X1.tr <- FTSEC_expr
X1.bk <- OSEC_expr

library( gelnet )
library( dplyr )

set.seed(13)

## Train a one-class model and store the signature in mm1
mm1 <- gelnet( t(X1.tr), NULL, 0, 1 )

## Training a one-class model
## Iteration 1 : f = 0.6931472 
## Iteration 2 : f = 0.1757746 
## Iteration 3 : f = 0.08471829 
## Iteration 4 : f = 0.04146866 
## Iteration 5 : f = 0.02035861 
## Iteration 6 : f = 0.009690802 
## Iteration 7 : f = 0.004332091 
## Iteration 8 : f = 0.001867173 
## Iteration 9 : f = 0.0008424111 
## Iteration 10 : f = 0.0003518755 
## Iteration 11 : f = 0.0001872333 
## Iteration 12 : f = 0.0001656636 
## Iteration 13 : f = 0.0001655145

Leave one out Cross-validation

To test how the model’s performance by using leave one out cross-validation. This process has three steps:

Train model on non-left-out data

Score the left-out sample against the background

AUC = P( left-out sample is scored above the background )

## Perform leave-one-out cross-validation
auc <- c()
for( i in 1:ncol(X1.tr) )
{
  ## Train a model on non-left-out data
  cat( "Current: ", i, " of ", ncol(X1.tr), "\n")
  one <- X1.tr[,-i]
  one_res <- gelnet( t(one), NULL, 0, 1 )

  ## Score the left-out sample against the background
  score1.bk <- apply( X1.bk[rownames(X1.tr),], 2, function(z) {cor( one_res$w, z, method="sp" )} )
  score1 <- cor( one_res$w, X1.tr[,i], method="sp" )

  ## AUC = P( left-out sample is scored above the background )
  auc[i] <- sum( score1 > score1.bk ) / length(score1.bk)
  cat( "Current AUC: ", auc[i], "\n" )
  cat( "Average AUC: ", mean(auc), "\n" )
}

Now we can obtain the auc mean of all iteractions.

head(mean(auc))

## [1] 0.2240896

Predict

In this step we will use the signature that was created in the previous steps and stored as mm1 to score the TCGA data.

w1 = mm1$w

X1 <- TCGA_expr[names(w1),] #54 9318 
X1 <- as.matrix(X1)

## Score via Spearman correlation
s1 <- apply( X1, 2, function(z) {cor( z, w1, method="sp", use="complete.obs" )} )

Scale the scores into a ratio from 0 to 1 and store as data frame.

## Scale the scores to be between 0 and 1
s1 <- s1 - min(s1)
s1 <- s1 / max(s1)
s1 <- as.data.frame(t(s1))

TCGA-24-1471-01A-01R-1566-13	TCGA-61-2008-01A-02R-1568-13	TCGA-23-1111-01A-01R-1567-13	TCGA-23-1028-01A-01R-1564-13	TCGA-59-2348-01A-01R-1569-13
0.9072112	0.7678116	0.8180536	0.6836016	0.7431097

Conclusion

We demonstrated how to derive a gene signature from FTSEC cell type samples. The robustness of the signature was estimated through leave-one-out cross-validation. The same procedures can now be applied to OSEC samples following the same steps by changuing the test and background sets.