The cachexia concentration data were read and stored in a dataset. This is a list of five elements: the data itself (a matrix of metabolites and their concentrations); the metadata (a data frame, here with a single variable, Muscle.loss); a string describing the type of the data ("concentrations"); a short description; and the labels information.
Before running any code, make sure that you set your working directory, with the function setwd, to the base folder of the package.
setwd("~/Dropbox")
library(metabolomicsUM)
#create dataset list
cachexia.metadata.file = "Datasets/Cachexia/metadata/metadata_cachexia.csv"
cachexia.data.file = "Datasets/Cachexia/data/human_cachexia.csv"
label.x = "compound"
label.val = "concentration"
cachexia.ds = read.dataset.csv(cachexia.data.file, cachexia.metadata.file, description = "Human cachexia", header.row.meta = T, type = "concentrations", label.x = label.x, label.values = label.val)
sum.dataset(cachexia.ds)
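The list structure described above can be sketched with plain base R; the element names used here (data, metadata, type, description, labels) are illustrative assumptions, as the package's internal slot names may differ:

```r
# Illustrative sketch of the five-element dataset list (toy values).
toy.data <- matrix(c(1.2, 0.8, 2.5, 3.1), nrow = 2,
                   dimnames = list(c("Glucose", "Creatine"),      # metabolites
                                   c("sample1", "sample2")))      # samples
toy.metadata <- data.frame(Muscle.loss = factor(c("cachexic", "control")),
                           row.names = c("sample1", "sample2"))
toy.ds <- list(data        = toy.data,                  # concentration matrix
               metadata    = toy.metadata,              # sample metadata
               type        = "concentrations",          # data type string
               description = "Toy concentrations example",
               labels      = list(x = "compound", val = "concentration"))
toy.ds$type
```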
This dataset does not contain any missing values. In this pipeline, no filters were applied; normalization was done by median, the data were log transformed, and autoscaling was performed.
PREPROCESSING
The dataset contains 77 samples and 63 variables (metabolites).
Log transformation was applied:
logtransform.cachexia = transform.data(cachexia.ds, "log")
And it was scaled using the autoscaling method:
autoscaling.cachexia = scaling(logtransform.cachexia, "auto")
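Both preprocessing steps correspond to standard operations that can be sketched in base R on a toy matrix (rows as samples, columns as metabolites); whether the package applies a base-10 or natural logarithm is an assumption here:

```r
# Sketch of the two preprocessing steps with base R (toy 4x3 matrix).
set.seed(1)
X <- matrix(rexp(12, rate = 0.5), nrow = 4,
            dimnames = list(paste0("s", 1:4), paste0("m", 1:3)))
X.log  <- log10(X)       # log transformation (base 10 assumed)
X.auto <- scale(X.log)   # autoscaling: mean-center, divide by column sd
colMeans(X.auto)         # each column now has mean ~0
apply(X.auto, 2, sd)     # and standard deviation 1
```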
UNIVARIATE TESTS
t-tests were performed on the dataset; the top 10 results, ordered by p-value, are shown below:
ttest.cachexia = tTests.dataset(autoscaling.cachexia, "Muscle.loss")
ttest.cachexia[1:10,]
## p.value -log10 fdr
## Quinolinate 3.452427e-06 5.461876 0.0002175029
## Glucose 1.644303e-05 4.784018 0.0002757628
## 3-Hydroxyisovalerate 1.883505e-05 4.725033 0.0002757628
## Leucine 1.955487e-05 4.708745 0.0002757628
## Succinate 2.860838e-05 4.543507 0.0002757628
## Valine 3.050168e-05 4.515676 0.0002757628
## N.N-Dimethylglycine 3.372739e-05 4.472017 0.0002757628
## Adipate 3.501750e-05 4.455715 0.0002757628
## myo-Inositol 3.981624e-05 4.399940 0.0002787137
## Acetate 6.945398e-05 4.158303 0.0004148572
plot.ttests(cachexia.ds, ttest.cachexia, 0.0001)
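Conceptually, the table above is one two-sample t-test per metabolite followed by FDR adjustment; a base-R sketch on simulated data (whether tTests.dataset uses Welch or pooled-variance tests is an assumption):

```r
# One t-test per column, then Benjamini-Hochberg FDR correction (toy data).
set.seed(2)
X <- matrix(rnorm(40), nrow = 10,
            dimnames = list(NULL, paste0("m", 1:4)))
group <- factor(rep(c("cachexic", "control"), each = 5))
p.values <- apply(X, 2, function(v) t.test(v ~ group)$p.value)
result <- data.frame(p.value   = p.values,
                     neg.log10 = -log10(p.values),
                     fdr       = p.adjust(p.values, method = "fdr"))
result[order(result$p.value), ]   # ordered by p-value, as above
```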
Fold change was calculated on the dataset and the top 10 results, ordered by the absolute value of the log2 fold change, are shown below:
foldchange.cachexia = fold.change(cachexia.ds, "Muscle.loss", "control")
foldchange.cachexia[1:10,]
## FoldChange log2(FC)
## Glucose 5.868549 2.553004
## Adipate 3.871520 1.952900
## Creatine 3.396091 1.763875
## Lactate 3.310075 1.726864
## cis-Aconitate 3.009305 1.589431
## 3-Hydroxybutyrate 2.956018 1.563655
## myo-Inositol 2.902838 1.537464
## Trigonelline 2.751901 1.460428
## Sucrose 2.699269 1.432569
## Succinate 2.668888 1.416239
plot.fold.change(cachexia.ds, foldchange.cachexia, 3)
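The fold change is presumably the ratio of the group means relative to the control group; a base-R sketch on toy concentrations:

```r
# Fold change as ratio of group means, plus its log2 (toy data).
conc <- matrix(c(10, 2, 20, 3, 2, 2, 2, 3), nrow = 4,
               dimnames = list(paste0("s", 1:4), c("Glucose", "Citrate")))
group <- factor(c("cachexic", "cachexic", "control", "control"))
fc <- colMeans(conc[group == "cachexic", ]) /
      colMeans(conc[group == "control", ])
data.frame(FoldChange = fc, log2.FC = log2(fc))
```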
The volcano plot combining the t-test and fold change results is shown below:
both = volcano.plot.fc.tt(cachexia.ds, foldchange.cachexia, ttest.cachexia, 3, 0.0001)
both
## [1] "Adipate" "Creatine" "Glucose"
A heatmap with the correlations between all variables is shown below:
correlation.cachexia = correlations.dataset(autoscaling.cachexia, method = "pearson")
heatmap(correlation.cachexia, col = topo.colors(256))
CLUSTERING
Hierarchical clustering with Euclidean distance and complete linkage was performed on the data, and the resulting dendrogram is shown below:
hc.result = clustering(autoscaling.cachexia, method = "hc", distance = "euclidean")
dendrogram.plot.col(autoscaling.cachexia, hc.result, "Muscle.loss")
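The same clustering can be reproduced with base R on toy data: Euclidean distances between samples followed by complete-linkage agglomeration:

```r
# Hierarchical clustering with base R (toy matrix; rows = samples).
set.seed(3)
X <- matrix(rnorm(30), nrow = 6,
            dimnames = list(paste0("s", 1:6), paste0("m", 1:5)))
hc <- hclust(dist(X, method = "euclidean"), method = "complete")
plot(hc)            # dendrogram
cutree(hc, k = 2)   # cluster membership when cutting into two groups
```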
K-means was also performed on the data, with 2 and 4 centers. The results are shown below, together with plots giving, for each cluster, the median of its samples in blue and the values of all samples in that cluster in grey.
With 2 centers:
kmeans.result.2 = clustering(autoscaling.cachexia, method = "kmeans", num.clusters = 2)
kmeans.plot(autoscaling.cachexia, kmeans.result.2)
kmeans.df.2 = kmeans.result.df(kmeans.result.2, 2)
kmeans.df.2
## cluster
## 1 1
## 2 2
## samples
## 1 PIF_178 PIF_087 PIF_090 NETL_005_V1 PIF_115 PIF_110 NETL_019_V1 PIF_154 NETL_022_V1 NETL_022_V2 NETL_008_V1 PIF_146 PIF_162 PIF_160 PIF_113 PIF_143 NETCR_007_V1 NETCR_007_V2 PIF_137 PIF_094 PIF_132 PIF_163 NETL_028_V1 NETCR_013_V1 NETL_020_V2 NETCR_012_V2 PIF_089 NETCR_002_V1 PIF_114 NETCR_006_V1 PIF_141 NETCR_025_V1 NETCR_025_V2 NETCR_016_V1 PIF_164 NETCR_015_V1 PIF_102 NETL_001_V1 NETCR_015_V2 NETCR_005_V1 PIF_171 NETL_002_V1 NETL_002_V2 NETCR_009_V1 NETCR_019_V2
## 2 NETCR_014_V1 NETCR_014_V2 PIF_119 PIF_099 PIF_100 NETL_004_V1 NETCR_003_V1 NETL_028_V2 NETL_020_V1 PIF_192 NETCR_012_V1 PIF_179 PIF_116 PIF_191 NETL_013_V1 PIF_188 PIF_195 NETL_010_V1 NETL_010_V2 PIF_111 NETCR_008_V1 NETCR_008_V2 NETL_017_V1 NETL_017_V2 PIF_190 NETCR_009_V2 NETL_007_V1 PIF_112 NETL_012_V1 NETL_012_V2 NETL_003_V1 NETL_003_V2
With 4 centers:
kmeans.result.4 = clustering(autoscaling.cachexia, method = "kmeans", num.clusters = 4)
kmeans.plot(autoscaling.cachexia, kmeans.result.4)
kmeans.df.4 = kmeans.result.df(kmeans.result.4, 4)
kmeans.df.4
## cluster
## 1 1
## 2 2
## 3 3
## 4 4
## samples
## 1 PIF_178 PIF_087 PIF_090 NETL_005_V1 PIF_115 PIF_110 PIF_154 NETL_022_V2 PIF_143 NETCR_007_V2 PIF_137 PIF_132 PIF_163 NETL_028_V1 PIF_089 PIF_114 NETCR_006_V1 NETCR_025_V2 NETCR_016_V1 PIF_164 PIF_102 NETL_002_V2 NETCR_009_V1
## 2 NETL_019_V1 NETL_022_V1 NETL_008_V1 PIF_146 PIF_162 PIF_160 PIF_113 NETCR_007_V1 PIF_094 NETCR_013_V1 NETL_020_V2 NETCR_012_V2 NETCR_002_V1 PIF_141 NETCR_025_V1 NETCR_015_V1 NETL_001_V1 NETCR_015_V2 NETCR_005_V1 PIF_171 NETL_002_V1 NETCR_019_V2
## 3 NETCR_014_V2 NETL_004_V1 NETL_028_V2 NETL_020_V1 PIF_192 NETCR_012_V1 PIF_179 PIF_111 NETCR_008_V2 NETL_017_V1 PIF_190 NETL_007_V1 NETL_012_V1 NETL_012_V2 NETL_003_V1 NETL_003_V2
## 4 NETCR_014_V1 PIF_119 PIF_099 PIF_100 NETCR_003_V1 PIF_116 PIF_191 NETL_013_V1 PIF_188 PIF_195 NETL_010_V1 NETL_010_V2 NETCR_008_V1 NETL_017_V2 NETCR_009_V2 PIF_112
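The clustering step above presumably wraps base R's kmeans; a self-contained sketch on toy data (the nstart value and other defaults used by clustering() are assumptions):

```r
# k-means with base R, then grouping sample names per cluster (toy data).
set.seed(4)
X <- matrix(rnorm(40), nrow = 8,
            dimnames = list(paste0("s", 1:8), paste0("m", 1:5)))
km <- kmeans(X, centers = 2, nstart = 10)   # nstart is an assumption
km$cluster                      # cluster assignment per sample
split(rownames(X), km$cluster)  # samples grouped by cluster, as in the tables
```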
PCA
Principal component analysis (PCA) was performed on the data and some of the resulting plots are shown below:
pca.analysis.result = pca.analysis.dataset(autoscaling.cachexia)
pca.pairs.plot(autoscaling.cachexia, pca.analysis.result, "Muscle.loss")
pca.screeplot(pca.analysis.result)
pca.scoresplot2D(autoscaling.cachexia, pca.analysis.result, "Muscle.loss", ellipses = T)
pca.kmeans.plot2D(autoscaling.cachexia, pca.analysis.result, kmeans.result = kmeans.result.2, ellipses = T, labels = TRUE)
pca.kmeans.plot2D(autoscaling.cachexia, pca.analysis.result, kmeans.result = kmeans.result.4, ellipses = T)
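The PCA itself can be sketched with base R's prcomp (toy data; whether pca.analysis.dataset wraps prcomp or another SVD-based routine is an assumption):

```r
# PCA with base R: scores feed the 2D scores plots, the explained-variance
# proportions feed the scree plot (toy data).
set.seed(5)
X <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("s", 1:10), paste0("m", 1:6)))
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)$importance[2, 1:3]   # proportion of variance for PC1-PC3
head(pca$x[, 1:2])                # PC1/PC2 scores per sample
```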
MACHINE LEARNING
For classification models and prediction, the following parameters were used:
- models: PLS, J48, JRip, SVM and Random Forests
- validation method: repeated cross-validation
- number of folds: 10
- number of repeats: 10
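Repeated 10-fold cross-validation simply means re-partitioning the samples ten times; a base-R sketch of the fold bookkeeping (caret's trainControl handles this internally):

```r
# Ten repeats of a random 10-fold partition of the 77 samples.
set.seed(6)
n <- 77                                  # samples in the cachexia dataset
folds <- replicate(10,
                   sample(rep(1:10, length.out = n)),  # fold label per sample
                   simplify = FALSE)
table(folds[[1]])   # fold sizes for the first repeat (7 or 8 samples each)
length(folds)       # 10 repeats
```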
Below are some results with the best tune for each model:
performances = train.models.performance(autoscaling.cachexia, c("pls","J48", "JRip","svmLinear","rf"), "Muscle.loss", "repeatedcv", metric = "ROC")
performances$performance
## Accuracy Kappa ROC AccuracySD KappaSD ROCSD
## pls 0.7596429 0.4866126 0.7791667 0.1281282 0.2776553 0.1647767
## J48 0.6194643 0.1934978 0.6216667 0.1574585 0.3209776 0.1734196
## JRip 0.6753571 0.3141137 0.6579167 0.1505928 0.3174972 0.1594057
## svmLinear 0.6764286 0.2789688 0.6793333 0.1456102 0.3274145 0.1970193
## rf 0.7223214 0.4066567 0.7879167 0.1424818 0.2996158 0.1558861
And the variable importance with respect to the muscle loss classes, for all models:
summary.var.importance(performances, 10)
## $pls
## Overall Mean
## Uracil 100.00000 100.00000
## Hypoxanthine 85.83624 85.83624
## Pantothenate 83.55670 83.55670
## Acetate 82.75249 82.75249
## 3_Hydroxyisovalerate 75.16639 75.16639
## Quinolinate 72.47683 72.47683
## Sucrose 71.35651 71.35651
## Glucose 70.56961 70.56961
## Creatine 66.31554 66.31554
## myo_Inositol 65.74540 65.74540
##
## $J48
## cachexic control Mean
## Quinolinate 100.00000 100.00000 100.00000
## Glucose 99.85935 99.85935 99.85935
## Adipate 97.18706 97.18706 97.18706
## N.N_Dimethylglycine 94.51477 94.51477 94.51477
## Valine 93.53024 93.53024 93.53024
## Leucine 91.84248 91.84248 91.84248
## 3_Hydroxyisovalerate 89.87342 89.87342 89.87342
## Betaine 89.59212 89.59212 89.59212
## Creatine 88.04501 88.04501 88.04501
## myo_Inositol 87.76371 87.76371 87.76371
##
## $JRip
## Overall Mean
## myo_Inositol 100 100
## 1.6_Anhydro_beta_D_glucose 0 0
## 1_Methylnicotinamide 0 0
## 2_Aminobutyrate 0 0
## 2_Hydroxyisobutyrate 0 0
## 2_Oxoglutarate 0 0
## 3_Aminoisobutyrate 0 0
## 3_Hydroxybutyrate 0 0
## 3_Hydroxyisovalerate 0 0
## 3_Indoxylsulfate 0 0
##
## $svmLinear
## cachexic control Mean
## Quinolinate 100.00000 100.00000 100.00000
## Glucose 99.85935 99.85935 99.85935
## Adipate 97.18706 97.18706 97.18706
## N.N_Dimethylglycine 94.51477 94.51477 94.51477
## Valine 93.53024 93.53024 93.53024
## Leucine 91.84248 91.84248 91.84248
## 3_Hydroxyisovalerate 89.87342 89.87342 89.87342
## Betaine 89.59212 89.59212 89.59212
## Creatine 88.04501 88.04501 88.04501
## myo_Inositol 87.76371 87.76371 87.76371
##
## $rf
## Overall Mean
## 3_Hydroxyisovalerate 100.00000 100.00000
## Creatine 98.73737 98.73737
## Quinolinate 98.69689 98.69689
## Acetate 93.09589 93.09589
## Glucose 89.79306 89.79306
## Methylamine 75.59841 75.59841
## N.N_Dimethylglycine 73.56760 73.56760
## Sucrose 69.84848 69.84848
## Succinate 69.04002 69.04002
## myo_Inositol 57.83869 57.83869
FEATURE SELECTION
Recursive feature elimination was used, testing subsets of various sizes with a random forest classifier. The results are shown below:
feature.selection.result = feature.selection(autoscaling.cachexia, "Muscle.loss", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 2 0.6461 0.2387 0.1587 0.3265
## 4 0.6529 0.2594 0.1791 0.3699
## 8 0.6925 0.3426 0.1739 0.3589
## 16 0.7029 0.3682 0.1663 0.3470
## 32 0.7100 0.3861 0.1662 0.3431 *
## 63 0.7050 0.3711 0.1761 0.3669
##
## The top 5 variables (out of 32):
## Glucose, Creatine, Quinolinate, X3.Hydroxyisovalerate, Sucrose
plot(feature.selection.result, type=c("g","o"))
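The idea behind recursive feature elimination can be shown in a minimal base-R loop that repeatedly drops the least important variable; here the absolute t statistic stands in for the random-forest importance used above:

```r
# Minimal RFE loop: rank variables, eliminate the weakest, repeat (toy data).
set.seed(7)
X <- matrix(rnorm(200), nrow = 20)
colnames(X) <- paste0("m", 1:10)
y <- factor(rep(c("a", "b"), each = 10))
keep <- colnames(X)
while (length(keep) > 4) {
  # Importance proxy: absolute two-sample t statistic per remaining variable.
  score <- sapply(keep, function(v) abs(t.test(X[, v] ~ y)$statistic))
  keep <- keep[-which.min(score)]   # drop the least important variable
}
keep   # surviving subset of 4 variables
```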
Selection by filter was also used, with the results shown below:
feature.selection.result2 = feature.selection(autoscaling.cachexia, "Muscle.loss", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance:
##
## Accuracy Kappa AccuracySD KappaSD
## 0.7214 0.4015 0.1625 0.3548
##
## Using the training set, 54 variables were selected:
## X1.6.Anhydro.beta.D.glucose, X2.Aminobutyrate, X2.Hydroxyisobutyrate, X2.Oxoglutarate, X3.Hydroxybutyrate...
##
## During resampling, the top 5 selected variables (out of a possible 60):
## Acetate (100%), Adipate (100%), Alanine (100%), Asparagine (100%), Betaine (100%)
##
## On average, 51.4 variables were selected (min = 44, max = 59)