Cachexia pipeline 2

The cachexia concentrations data were read and stored in a dataset. This consists of a list with five elements: the data itself (a matrix with the metabolites and their concentrations); the metadata (a data frame, in this case with a single variable, Muscle.loss); a string describing the type of the data ("concentrations"); a short description; and the labels information.
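The five-element structure described above can be sketched in base R. Note that the element names used here (data, metadata, type, description, labels) are illustrative assumptions, not necessarily the names used internally by metabolomicsUM:

```r
# Illustrative sketch of the 5-element dataset list (names are hypothetical)
example.ds = list(
  data = matrix(c(1.2, 3.4, 2.1, 4.3), nrow = 2,
                dimnames = list(c("Creatine", "Glucose"), c("S1", "S2"))),
  metadata = data.frame(Muscle.loss = factor(c("cachexic", "control")),
                        row.names = c("S1", "S2")),
  type = "concentrations",
  description = "Human cachexia",
  labels = list(x = "compound", values = "concentration")
)
str(example.ds, max.level = 1)  # one line per top-level element
```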

Before running any code, make sure to set your working directory to the base folder of the package, using the function setwd.

setwd("~/Dropbox")
library(metabolomicsUM)

#create dataset list
cachexia.metadata.file = "Datasets/Cachexia/metadata/metadata_cachexia.csv"
cachexia.data.file = "Datasets/Cachexia/data/human_cachexia.csv"
label.x = "compound"
label.val = "concentration"
cachexia.ds = read.dataset.csv(cachexia.data.file, cachexia.metadata.file, description = "Human cachexia", type = "concentrations", label.x = label.x, label.values = label.val)

This dataset does not contain any missing values. In this pipeline, no filtering or normalization was performed.
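The absence of missing values can be confirmed with a base-R check such as `sum(is.na(cachexia.ds$data))` (the $data element name is an assumption about how the package stores the matrix); a minimal self-contained sketch on a toy matrix:

```r
# Count missing entries in a matrix; applied to the dataset's data matrix,
# this should return 0 for the cachexia data
toy = matrix(c(1.2, NA, 3.4, 5.6), nrow = 2)
n.missing = sum(is.na(toy))
```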

PREPROCESSING

The dataset contains 77 samples and 63 variables (metabolites).

UNIVARIATE TESTS

t-tests were performed on the dataset, and the top 10 results, ordered by p-value, are shown below:

ttest.cachexia = tTests.dataset(cachexia.ds, "Muscle.loss")
ttest.cachexia[1:10,]
##                          p.value   -log10         fdr
## Quinolinate         0.0001185108 3.926242 0.003264126
## Valine              0.0001394238 3.855663 0.003264126
## N.N-Dimethylglycine 0.0001554346 3.808452 0.003264126
## Leucine             0.0002695046 3.569434 0.004244697
## Dimethylamine       0.0004460069 3.350658 0.004616827
## Pyroglutamate       0.0004845363 3.314674 0.004616827
## Creatinine          0.0005129808 3.289899 0.004616827
## Glutamine           0.0010607678 2.974380 0.007467813
## Alanine             0.0011788609 2.928537 0.007467813
## 3-Hydroxybutyrate   0.0011853671 2.926147 0.007467813
plot.ttests(cachexia.ds, ttest.cachexia, 0.001)

Fold change was calculated on the dataset, and the top 10 results, ordered by the absolute value of the log2 fold change, are shown below:

foldchange.cachexia = fold.change(cachexia.ds, "Muscle.loss", "cachexic")
foldchange.cachexia[1:10,]
##                   FoldChange  log2(FC)
## Glucose            0.1703999 -2.553004
## Adipate            0.2582965 -1.952900
## Creatine           0.2944562 -1.763875
## Lactate            0.3021080 -1.726864
## cis-Aconitate      0.3323026 -1.589431
## 3-Hydroxybutyrate  0.3382929 -1.563655
## myo-Inositol       0.3444904 -1.537464
## Trigonelline       0.3633852 -1.460428
## Sucrose            0.3704707 -1.432569
## Succinate          0.3746879 -1.416239
plot.fold.change(cachexia.ds, foldchange.cachexia, 3)
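The log2(FC) column above is simply log2 of the ratio of group means. A minimal base-R sketch with made-up values (this is an assumption about the computation, not the package's actual implementation):

```r
# Fold change as ratio of group means, on toy vectors (illustrative values)
cachexic = c(10.2, 8.4, 12.1)   # concentrations in the cachexic group
control  = c(55.0, 61.3, 58.7)  # concentrations in the control group
fc = mean(cachexic) / mean(control)
log2.fc = log2(fc)              # negative when the cachexic mean is lower
```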

The volcano plot combining the t-test and fold change results is shown below:

both = volcano.plot.fc.tt(cachexia.ds, foldchange.cachexia, ttest.cachexia, 3, 0.001)

both
## NULL

A heatmap of the correlations between all the variables is shown below:

correlation.cachexia = correlations.dataset(cachexia.ds, method = "pearson")
heatmap(correlation.cachexia, col =  topo.colors(256))

CLUSTERING

Hierarchical clustering with Euclidean distance and complete linkage was performed on the data, and the resulting dendrogram is shown below:

hc.result = clustering(cachexia.ds, method = "hc", distance = "euclidean")
dendrogram.plot.col(cachexia.ds, hc.result)

dendrogram.plot.col(cachexia.ds, hc.result, lab.cex = 0.4)
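The same kind of clustering can be reproduced with base R's dist and hclust on a plain numeric matrix (samples as rows); a minimal sketch with random data:

```r
# Base-R equivalent: Euclidean distance + complete linkage
set.seed(1)
toy = matrix(rnorm(40), nrow = 8)   # 8 samples x 5 variables
hc = hclust(dist(toy, method = "euclidean"), method = "complete")
plot(hc)                            # base-R dendrogram
```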

K-means clustering was performed on the data with 2 and 4 centers. The results are shown below, together with plots showing, for each cluster, the median of its samples in blue and the values of all samples in that cluster in grey.

With 2 centers:

kmeans.result.2 = clustering(cachexia.ds, method = "kmeans", num.clusters = 2)
kmeans.plot(cachexia.ds, kmeans.result.2)

kmeans.df.2 = kmeans.result.df(kmeans.result.2, 2)
kmeans.df.2
##   cluster
## 1       1
## 2       2
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   samples
## 1                                                                                                                                                                                                                                                                                                    PIF_178 PIF_087 PIF_090 NETL_005_V1 PIF_110 NETL_008_V1 PIF_146 PIF_143 NETCR_007_V2 PIF_137 PIF_132 PIF_163 NETL_028_V1 NETCR_013_V1 NETCR_012_V2 PIF_089 PIF_114 NETCR_006_V1 NETCR_025_V1 NETCR_025_V2 NETCR_016_V1 PIF_164 NETCR_005_V1 NETL_002_V2 NETCR_009_V1
## 2 PIF_115 NETL_019_V1 NETCR_014_V1 NETCR_014_V2 PIF_154 NETL_022_V1 NETL_022_V2 PIF_119 PIF_099 PIF_162 PIF_160 PIF_113 NETCR_007_V1 PIF_100 NETL_004_V1 PIF_094 NETCR_003_V1 NETL_028_V2 NETL_020_V1 NETL_020_V2 PIF_192 NETCR_012_V1 NETCR_002_V1 PIF_179 PIF_141 PIF_116 PIF_191 NETL_013_V1 PIF_188 PIF_195 NETCR_015_V1 PIF_102 NETL_010_V1 NETL_010_V2 NETL_001_V1 NETCR_015_V2 PIF_111 PIF_171 NETCR_008_V1 NETCR_008_V2 NETL_017_V1 NETL_017_V2 NETL_002_V1 PIF_190 NETCR_009_V2 NETL_007_V1 PIF_112 NETCR_019_V2 NETL_012_V1 NETL_012_V2 NETL_003_V1 NETL_003_V2

With 4 centers:

kmeans.result.4 = clustering(cachexia.ds, method = "kmeans", num.clusters = 4)
kmeans.plot(cachexia.ds, kmeans.result.4)

kmeans.df.4 = kmeans.result.df(kmeans.result.4, 4)
kmeans.df.4
##   cluster
## 1       1
## 2       2
## 3       3
## 4       4
##                                                                                                                                                                                                                                                                                                                        samples
## 1                                 PIF_115 NETL_019_V1 NETCR_014_V2 PIF_154 NETL_022_V1 NETL_022_V2 PIF_146 PIF_162 PIF_160 PIF_113 PIF_143 NETCR_007_V1 PIF_094 PIF_163 NETCR_013_V1 NETL_020_V2 NETCR_002_V1 PIF_141 NETCR_015_V1 PIF_102 NETL_001_V1 NETCR_015_V2 NETCR_005_V1 PIF_171 NETCR_008_V2 NETL_002_V1 NETCR_019_V2
## 2 NETCR_014_V1 PIF_119 PIF_099 PIF_100 NETL_004_V1 NETCR_003_V1 NETL_028_V2 NETL_020_V1 PIF_192 NETCR_012_V1 PIF_179 PIF_116 PIF_191 NETL_013_V1 PIF_188 PIF_195 NETL_010_V1 NETL_010_V2 PIF_111 NETCR_008_V1 NETL_017_V1 NETL_017_V2 PIF_190 NETCR_009_V2 NETL_007_V1 PIF_112 NETL_012_V1 NETL_012_V2 NETL_003_V1 NETL_003_V2
## 3                                                                                                                                                                PIF_178 PIF_087 PIF_110 NETL_008_V1 NETCR_007_V2 PIF_137 NETCR_012_V2 PIF_089 PIF_114 NETCR_006_V1 NETCR_025_V1 NETCR_016_V1 PIF_164 NETL_002_V2 NETCR_009_V1
## 4                                                                                                                                                                                                                                                                         PIF_090 NETL_005_V1 PIF_132 NETL_028_V1 NETCR_025_V2

PCA

Principal component analysis (PCA) was performed on the data and some plots are shown below:

pca.analysis.result = pca.analysis.dataset(cachexia.ds)

pca.pairs.plot(cachexia.ds, pca.analysis.result, "Muscle.loss")

pca.screeplot(pca.analysis.result)

pca.scoresplot2D(cachexia.ds, pca.analysis.result, "Muscle.loss", ellipses = TRUE)

pca.kmeans.plot2D(cachexia.ds, pca.analysis.result, kmeans.result = kmeans.result.2, ellipses = TRUE, labels = TRUE)

pca.kmeans.plot2D(cachexia.ds, pca.analysis.result, kmeans.result = kmeans.result.4, ellipses = TRUE)

MACHINE LEARNING

For classification models and prediction, the following parameters were used:

- models: PLS, J48, JRip, SVM and Random Forests
- validation method: repeated cross-validation
- number of folds: 10
- number of repeats: 10
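Since the feature selection step later in this pipeline uses caret objects (rfFuncs, rfSBF), train.models.performance presumably wraps caret's training machinery; under that assumption, the validation setup above corresponds to a caret trainControl like the following sketch:

```r
library(caret)
# Repeated 10-fold cross-validation, repeated 10 times; class probabilities
# and twoClassSummary are needed for the ROC metric on a two-class problem
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
                    classProbs = TRUE, summaryFunction = twoClassSummary)
```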

Below are some results with the best tuning for each model:

performances = train.models.performance(cachexia.ds, c("pls","J48", "JRip","svmLinear","rf"), "Muscle.loss", "repeatedcv", metric = "ROC")
performances$performance
##            Accuracy     Kappa       ROC AccuracySD   KappaSD     ROCSD
## pls       0.6810714 0.3371415 0.7650000  0.1635654 0.3301522 0.1670282
## J48       0.6078571 0.1590593 0.6087500  0.1622585 0.3292922 0.1852152
## JRip      0.6796429 0.3264800 0.6602500  0.1587851 0.3356477 0.1706011
## svmLinear 0.6157143 0.1222838 0.7031667  0.1496509 0.3317783 0.2175738
## rf        0.7237500 0.4030278 0.7983333  0.1476351 0.3236571 0.1596415

The variable importance for the muscle loss classes, for all models, is shown below:

summary.var.importance(performances, 10)
## $pls
##                           Overall       Mean
## Creatinine             100.000000 100.000000
## Glucose                 57.703402  57.703402
## Hippurate               54.619548  54.619548
## Citrate                 28.895949  28.895949
## Trimethylamine_N_oxide  19.721237  19.721237
## Glycine                 14.267857  14.267857
## Lactate                 10.976540  10.976540
## Glutamine               10.550499  10.550499
## myo_Inositol             9.989980   9.989980
## Creatine                 9.271334   9.271334
## 
## $J48
##                       cachexic   control      Mean
## Quinolinate          100.00000 100.00000 100.00000
## Glucose               99.85935  99.85935  99.85935
## Adipate               97.18706  97.18706  97.18706
## N.N_Dimethylglycine   94.51477  94.51477  94.51477
## Valine                93.53024  93.53024  93.53024
## Leucine               91.84248  91.84248  91.84248
## 3_Hydroxyisovalerate  89.87342  89.87342  89.87342
## Betaine               89.59212  89.59212  89.59212
## Creatine              88.04501  88.04501  88.04501
## myo_Inositol          87.76371  87.76371  87.76371
## 
## $JRip
##                            Overall Mean
## Creatine                       100  100
## 1.6_Anhydro_beta_D_glucose       0    0
## 1_Methylnicotinamide             0    0
## 2_Aminobutyrate                  0    0
## 2_Hydroxyisobutyrate             0    0
## 2_Oxoglutarate                   0    0
## 3_Aminoisobutyrate               0    0
## 3_Hydroxybutyrate                0    0
## 3_Hydroxyisovalerate             0    0
## 3_Indoxylsulfate                 0    0
## 
## $svmLinear
##                       cachexic   control      Mean
## Quinolinate          100.00000 100.00000 100.00000
## Glucose               99.85935  99.85935  99.85935
## Adipate               97.18706  97.18706  97.18706
## N.N_Dimethylglycine   94.51477  94.51477  94.51477
## Valine                93.53024  93.53024  93.53024
## Leucine               91.84248  91.84248  91.84248
## 3_Hydroxyisovalerate  89.87342  89.87342  89.87342
## Betaine               89.59212  89.59212  89.59212
## Creatine              88.04501  88.04501  88.04501
## myo_Inositol          87.76371  87.76371  87.76371
## 
## $rf
##                        Overall      Mean
## 3_Hydroxyisovalerate 100.00000 100.00000
## Creatine              84.18837  84.18837
## Adipate               77.73108  77.73108
## N.N_Dimethylglycine   72.33048  72.33048
## myo_Inositol          71.46391  71.46391
## Betaine               66.05395  66.05395
## Sucrose               62.77924  62.77924
## Methylamine           61.88298  61.88298
## Acetate               60.47007  60.47007
## Glucose               58.29863  58.29863

FEATURE SELECTION

Recursive feature elimination was performed, evaluating various subset sizes with a random forest classifier. The results are shown below:

feature.selection.result = feature.selection(cachexia.ds, "Muscle.loss", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          2   0.6400 0.2345     0.1376  0.2949         
##          4   0.6546 0.2802     0.1722  0.3585         
##          8   0.6954 0.3572     0.1532  0.3297         
##         16   0.6832 0.3253     0.1525  0.3222         
##         32   0.7036 0.3692     0.1501  0.3168         
##         63   0.7143 0.3888     0.1503  0.3200        *
## 
## The top 5 variables (out of 63):
##    Creatine, Glucose, X3.Hydroxyisovalerate, Quinolinate, Sucrose
plot(feature.selection.result, type=c("g","o"))

Selection by filter was also applied, with the results shown below:

feature.selection.result2 = feature.selection(cachexia.ds, "Muscle.loss", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2
## 
## Selection By Filter
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance:
## 
##  Accuracy Kappa AccuracySD KappaSD
##    0.7186 0.396     0.1625  0.3452
## 
## Using the training set, 40 variables were selected:
##    X2.Aminobutyrate, X2.Hydroxyisobutyrate, X3.Hydroxybutyrate, X3.Hydroxyisovalerate, X3.Indoxylsulfate...
## 
## During resampling, the top 5 selected variables (out of a possible 52):
##    Acetate (100%), Alanine (100%), Asparagine (100%), Betaine (100%), cis.Aconitate (100%)
## 
## On average, 38.7 variables were selected (min = 31, max = 45)