The cachexia concentration data were read and stored in a dataset. This is a list of five elements: the data itself (a matrix of metabolites and their concentrations); the metadata (a data frame, here with a single variable, Muscle.loss); a string describing the type of the data ("concentrations"); a short description; and the labels information.
Before running any code, make sure that you set your working directory, with the function setwd, to the base folder of the package.
setwd("~/Dropbox")
library(metabolomicsUM)
#create dataset list
cachexia.metadata.file = "Datasets/Cachexia/metadata/metadata_cachexia.csv"
cachexia.data.file = "Datasets/Cachexia/data/human_cachexia.csv"
label.x = "compound"
label.val = "concentration"
cachexia.ds = read.dataset.csv(cachexia.data.file, cachexia.metadata.file, description = "Human cachexia", header.row.meta = T, type = "concentrations", label.x = label.x, label.values = label.val)
sum.dataset(cachexia.ds)
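The list structure described above can be sketched with plain base R; the element names used here (data, metadata, type, description, labels) are illustrative assumptions, as the package's internal slot names may differ:

```r
# Illustrative sketch of the five-element dataset list (toy values).
toy.data <- matrix(c(1.2, 0.8, 2.5, 3.1), nrow = 2,
                   dimnames = list(c("Glucose", "Creatine"),      # metabolites
                                   c("sample1", "sample2")))      # samples
toy.metadata <- data.frame(Muscle.loss = factor(c("cachexic", "control")),
                           row.names = c("sample1", "sample2"))
toy.ds <- list(data        = toy.data,                  # concentration matrix
               metadata    = toy.metadata,              # sample metadata
               type        = "concentrations",          # data type string
               description = "Toy concentrations example",
               labels      = list(x = "compound", val = "concentration"))
toy.ds$type
```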
This dataset does not contain any missing values. In this pipeline, no filters were applied; normalization was done by median, the data were log transformed, and autoscaling was performed.
PREPROCESSING
The dataset contains 77 samples and 63 variables (metabolites).
Log transformation was applied:
logtransform.cachexia = transform.data(cachexia.ds, "log")
And it was scaled using the autoscaling method:
autoscaling.cachexia = scaling(logtransform.cachexia, "auto")
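Both preprocessing steps correspond to standard operations that can be sketched in base R on a toy matrix (rows as samples, columns as metabolites); whether the package applies a base-10 or natural logarithm is an assumption here:

```r
# Sketch of the two preprocessing steps with base R (toy 4x3 matrix).
set.seed(1)
X <- matrix(rexp(12, rate = 0.5), nrow = 4,
            dimnames = list(paste0("s", 1:4), paste0("m", 1:3)))
X.log  <- log10(X)       # log transformation (base 10 assumed)
X.auto <- scale(X.log)   # autoscaling: mean-center, divide by column sd
colMeans(X.auto)         # each column now has mean ~0
apply(X.auto, 2, sd)     # and standard deviation 1
```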
UNIVARIATE TESTS
t-tests were performed on the dataset; the top 10 results, ordered by p-value, are shown below:
ttest.cachexia = tTests.dataset(autoscaling.cachexia, "Muscle.loss")
ttest.cachexia[1:10,]
## p.value -log10 fdr
## Quinolinate 3.452427e-06 5.461876 0.0002175029
## Glucose 1.644303e-05 4.784018 0.0002757628
## 3-Hydroxyisovalerate 1.883505e-05 4.725033 0.0002757628
## Leucine 1.955487e-05 4.708745 0.0002757628
## Succinate 2.860838e-05 4.543507 0.0002757628
## Valine 3.050168e-05 4.515676 0.0002757628
## N.N-Dimethylglycine 3.372739e-05 4.472017 0.0002757628
## Adipate 3.501750e-05 4.455715 0.0002757628
## myo-Inositol 3.981624e-05 4.399940 0.0002787137
## Acetate 6.945398e-05 4.158303 0.0004148572
plot.ttests(cachexia.ds, ttest.cachexia, 0.0001)
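Conceptually, the table above is one two-sample t-test per metabolite followed by FDR adjustment; a base-R sketch on simulated data (whether tTests.dataset uses Welch or pooled-variance tests is an assumption):

```r
# One t-test per column, then Benjamini-Hochberg FDR correction (toy data).
set.seed(2)
X <- matrix(rnorm(40), nrow = 10,
            dimnames = list(NULL, paste0("m", 1:4)))
group <- factor(rep(c("cachexic", "control"), each = 5))
p.values <- apply(X, 2, function(v) t.test(v ~ group)$p.value)
result <- data.frame(p.value   = p.values,
                     neg.log10 = -log10(p.values),
                     fdr       = p.adjust(p.values, method = "fdr"))
result[order(result$p.value), ]   # ordered by p-value, as above
```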
Fold change was calculated on the dataset and the top 10 results, ordered by the absolute value of the log2 fold change, are shown below:
foldchange.cachexia = fold.change(cachexia.ds, "Muscle.loss", "control")
foldchange.cachexia[1:10,]
## FoldChange log2(FC)
## Glucose 5.868549 2.553004
## Adipate 3.871520 1.952900
## Creatine 3.396091 1.763875
## Lactate 3.310075 1.726864
## cis-Aconitate 3.009305 1.589431
## 3-Hydroxybutyrate 2.956018 1.563655
## myo-Inositol 2.902838 1.537464
## Trigonelline 2.751901 1.460428
## Sucrose 2.699269 1.432569
## Succinate 2.668888 1.416239
plot.fold.change(cachexia.ds, foldchange.cachexia, 3)
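The fold change is presumably the ratio of the group means relative to the control group; a base-R sketch on toy concentrations:

```r
# Fold change as ratio of group means, plus its log2 (toy data).
conc <- matrix(c(10, 2, 20, 3, 2, 2, 2, 3), nrow = 4,
               dimnames = list(paste0("s", 1:4), c("Glucose", "Citrate")))
group <- factor(c("cachexic", "cachexic", "control", "control"))
fc <- colMeans(conc[group == "cachexic", ]) /
      colMeans(conc[group == "control", ])
data.frame(FoldChange = fc, log2.FC = log2(fc))
```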
The volcano plot combining the t-test and fold change results is shown below:
both = volcano.plot.fc.tt(cachexia.ds, foldchange.cachexia, ttest.cachexia, 3, 0.0001)
both
## [1] "Adipate" "Creatine" "Glucose"
A heatmap with the correlations between all variables is shown below:
correlation.cachexia = correlations.dataset(autoscaling.cachexia, method = "pearson")
heatmap(correlation.cachexia, col = topo.colors(256))
CLUSTERING
Hierarchical clustering with Euclidean distance and complete linkage was performed on the data, and the resulting dendrogram is shown below:
hc.result = clustering(autoscaling.cachexia, method = "hc", distance = "euclidean")
dendrogram.plot.col(autoscaling.cachexia, hc.result, "Muscle.loss")
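The same clustering can be reproduced with base R on toy data: Euclidean distances between samples followed by complete-linkage agglomeration:

```r
# Hierarchical clustering with base R (toy matrix; rows = samples).
set.seed(3)
X <- matrix(rnorm(30), nrow = 6,
            dimnames = list(paste0("s", 1:6), paste0("m", 1:5)))
hc <- hclust(dist(X, method = "euclidean"), method = "complete")
plot(hc)            # dendrogram
cutree(hc, k = 2)   # cluster membership when cutting into two groups
```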
K-means was also performed on the data, with 2 and 4 centers. The results are shown below, together with plots giving, for each cluster, the median of its samples in blue and the values of all samples in that cluster in grey.
With 2 centers:
kmeans.result.2 = clustering(autoscaling.cachexia, method = "kmeans", num.clusters = 2)
kmeans.plot(autoscaling.cachexia, kmeans.result.2)
kmeans.df.2 = kmeans.result.df(kmeans.result.2, 2)
kmeans.df.2
## cluster
## 1 1
## 2 2
## samples
## 1 PIF_178 PIF_087 PIF_090 NETL_005_V1 PIF_115 PIF_110 NETL_019_V1 PIF_154 NETL_022_V1 NETL_022_V2 NETL_008_V1 PIF_146 PIF_162 PIF_160 PIF_113 PIF_143 NETCR_007_V1 NETCR_007_V2 PIF_137 PIF_094 PIF_132 PIF_163 NETL_028_V1 NETCR_013_V1 NETL_020_V2 NETCR_012_V2 PIF_089 NETCR_002_V1 PIF_114 NETCR_006_V1 PIF_141 NETCR_025_V1 NETCR_025_V2 NETCR_016_V1 PIF_164 NETCR_015_V1 PIF_102 NETL_001_V1 NETCR_015_V2 NETCR_005_V1 PIF_171 NETL_002_V1 NETL_002_V2 NETCR_009_V1 NETCR_019_V2
## 2 NETCR_014_V1 NETCR_014_V2 PIF_119 PIF_099 PIF_100 NETL_004_V1 NETCR_003_V1 NETL_028_V2 NETL_020_V1 PIF_192 NETCR_012_V1 PIF_179 PIF_116 PIF_191 NETL_013_V1 PIF_188 PIF_195 NETL_010_V1 NETL_010_V2 PIF_111 NETCR_008_V1 NETCR_008_V2 NETL_017_V1 NETL_017_V2 PIF_190 NETCR_009_V2 NETL_007_V1 PIF_112 NETL_012_V1 NETL_012_V2 NETL_003_V1 NETL_003_V2
With 4 centers:
kmeans.result.4 = clustering(autoscaling.cachexia, method = "kmeans", num.clusters = 4)
kmeans.plot(autoscaling.cachexia, kmeans.result.4)
kmeans.df.4 = kmeans.result.df(kmeans.result.4, 4)
kmeans.df.4
## cluster
## 1 1
## 2 2
## 3 3
## 4 4
## samples
## 1 PIF_178 PIF_087 PIF_090 NETL_005_V1 PIF_115 PIF_110 PIF_154 NETL_022_V2 PIF_143 NETCR_007_V2 PIF_137 PIF_132 PIF_163 NETL_028_V1 PIF_089 PIF_114 NETCR_006_V1 NETCR_025_V2 NETCR_016_V1 PIF_164 PIF_102 NETL_002_V2 NETCR_009_V1
## 2 NETL_019_V1 NETL_022_V1 NETL_008_V1 PIF_146 PIF_162 PIF_160 PIF_113 NETCR_007_V1 PIF_094 NETCR_013_V1 NETL_020_V2 NETCR_012_V2 NETCR_002_V1 PIF_141 NETCR_025_V1 NETCR_015_V1 NETL_001_V1 NETCR_015_V2 NETCR_005_V1 PIF_171 NETL_002_V1 NETCR_019_V2
## 3 NETCR_014_V2 NETL_004_V1 NETL_028_V2 NETL_020_V1 PIF_192 NETCR_012_V1 PIF_179 PIF_111 NETCR_008_V2 NETL_017_V1 PIF_190 NETL_007_V1 NETL_012_V1 NETL_012_V2 NETL_003_V1 NETL_003_V2
## 4 NETCR_014_V1 PIF_119 PIF_099 PIF_100 NETCR_003_V1 PIF_116 PIF_191 NETL_013_V1 PIF_188 PIF_195 NETL_010_V1 NETL_010_V2 NETCR_008_V1 NETL_017_V2 NETCR_009_V2 PIF_112
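The clustering step above presumably wraps base R's kmeans; a self-contained sketch on toy data (the nstart value and other defaults used by clustering() are assumptions):

```r
# k-means with base R, then grouping sample names per cluster (toy data).
set.seed(4)
X <- matrix(rnorm(40), nrow = 8,
            dimnames = list(paste0("s", 1:8), paste0("m", 1:5)))
km <- kmeans(X, centers = 2, nstart = 10)   # nstart is an assumption
km$cluster                      # cluster assignment per sample
split(rownames(X), km$cluster)  # samples grouped by cluster, as in the tables
```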
PCA
Principal component analysis (PCA) was performed on the data and some of the resulting plots are shown below:
pca.analysis.result = pca.analysis.dataset(autoscaling.cachexia)
pca.pairs.plot(autoscaling.cachexia, pca.analysis.result, "Muscle.loss")
pca.screeplot(pca.analysis.result)
pca.scoresplot2D(autoscaling.cachexia, pca.analysis.result, "Muscle.loss", ellipses = T)
pca.kmeans.plot2D(autoscaling.cachexia, pca.analysis.result, kmeans.result = kmeans.result.2, ellipses = T, labels = TRUE)
pca.kmeans.plot2D(autoscaling.cachexia, pca.analysis.result, kmeans.result = kmeans.result.4, ellipses = T)
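The PCA itself can be sketched with base R's prcomp (toy data; whether pca.analysis.dataset wraps prcomp or another SVD-based routine is an assumption):

```r
# PCA with base R: scores feed the 2D scores plots, the explained-variance
# proportions feed the scree plot (toy data).
set.seed(5)
X <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("s", 1:10), paste0("m", 1:6)))
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)$importance[2, 1:3]   # proportion of variance for PC1-PC3
head(pca$x[, 1:2])                # PC1/PC2 scores per sample
```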
MACHINE LEARNING
For classification models and prediction, the following parameters were used:
- models: PLS, J48, JRip, SVM and Random Forests
- validation method: repeated cross-validation
- number of folds: 10
- number of repeats: 10
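Repeated 10-fold cross-validation simply means re-partitioning the samples ten times; a base-R sketch of the fold bookkeeping (caret's trainControl handles this internally):

```r
# Ten repeats of a random 10-fold partition of the 77 samples.
set.seed(6)
n <- 77                                  # samples in the cachexia dataset
folds <- replicate(10,
                   sample(rep(1:10, length.out = n)),  # fold label per sample
                   simplify = FALSE)
table(folds[[1]])   # fold sizes for the first repeat (7 or 8 samples each)
length(folds)       # 10 repeats
```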
Below are some results with the best tune for each model:
performances = train.models.performance(autoscaling.cachexia, c("pls","J48", "JRip","svmLinear","rf"), "Muscle.loss", "repeatedcv", metric = "ROC")
performances$performance
## Accuracy Kappa ROC AccuracySD KappaSD ROCSD
## pls 0.7596429 0.4866126 0.7791667 0.1281282 0.2776553 0.1647767
## J48 0.6194643 0.1934978 0.6216667 0.1574585 0.3209776 0.1734196
## JRip 0.6753571 0.3141137 0.6579167 0.1505928 0.3174972 0.1594057
## svmLinear 0.6764286 0.2789688 0.6793333 0.1456102 0.3274145 0.1970193
## rf 0.7223214 0.4066567 0.7879167 0.1424818 0.2996158 0.1558861
And the variable importance with respect to the muscle loss classes, for all models:
summary.var.importance(performances, 10)
## $pls
## Overall Mean
## Uracil 100.00000 100.00000
## Hypoxanthine 85.83624 85.83624
## Pantothenate 83.55670 83.55670
## Acetate 82.75249 82.75249
## 3_Hydroxyisovalerate 75.16639 75.16639
## Quinolinate 72.47683 72.47683
## Sucrose 71.35651 71.35651
## Glucose 70.56961 70.56961
## Creatine 66.31554 66.31554
## myo_Inositol 65.74540 65.74540
##
## $J48
## cachexic control Mean
## Quinolinate 100.00000 100.00000 100.00000
## Glucose 99.85935 99.85935 99.85935
## Adipate 97.18706 97.18706 97.18706
## N.N_Dimethylglycine 94.51477 94.51477 94.51477
## Valine 93.53024 93.53024 93.53024
## Leucine 91.84248 91.84248 91.84248
## 3_Hydroxyisovalerate 89.87342 89.87342 89.87342
## Betaine 89.59212 89.59212 89.59212
## Creatine 88.04501 88.04501 88.04501
## myo_Inositol 87.76371 87.76371 87.76371
##
## $JRip
## Overall Mean
## myo_Inositol 100 100
## 1.6_Anhydro_beta_D_glucose 0 0
## 1_Methylnicotinamide 0 0
## 2_Aminobutyrate 0 0
## 2_Hydroxyisobutyrate 0 0
## 2_Oxoglutarate 0 0
## 3_Aminoisobutyrate 0 0
## 3_Hydroxybutyrate 0 0
## 3_Hydroxyisovalerate 0 0
## 3_Indoxylsulfate 0 0
##
## $svmLinear
## cachexic control Mean
## Quinolinate 100.00000 100.00000 100.00000
## Glucose 99.85935 99.85935 99.85935
## Adipate 97.18706 97.18706 97.18706
## N.N_Dimethylglycine 94.51477 94.51477 94.51477
## Valine 93.53024 93.53024 93.53024
## Leucine 91.84248 91.84248 91.84248
## 3_Hydroxyisovalerate 89.87342 89.87342 89.87342
## Betaine 89.59212 89.59212 89.59212
## Creatine 88.04501 88.04501 88.04501
## myo_Inositol 87.76371 87.76371 87.76371
##
## $rf
## Overall Mean
## 3_Hydroxyisovalerate 100.00000 100.00000
## Creatine 98.73737 98.73737
## Quinolinate 98.69689 98.69689
## Acetate 93.09589 93.09589
## Glucose 89.79306 89.79306
## Methylamine 75.59841 75.59841
## N.N_Dimethylglycine 73.56760 73.56760
## Sucrose 69.84848 69.84848
## Succinate 69.04002 69.04002
## myo_Inositol 57.83869 57.83869
FEATURE SELECTION
Recursive feature elimination was used, testing subsets of various sizes with a random forest classifier. The results are shown below:
feature.selection.result = feature.selection(autoscaling.cachexia, "Muscle.loss", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 2 0.6461 0.2387 0.1587 0.3265
## 4 0.6529 0.2594 0.1791 0.3699
## 8 0.6925 0.3426 0.1739 0.3589
## 16 0.7029 0.3682 0.1663 0.3470
## 32 0.7100 0.3861 0.1662 0.3431 *
## 63 0.7050 0.3711 0.1761 0.3669
##
## The top 5 variables (out of 32):
## Glucose, Creatine, Quinolinate, X3.Hydroxyisovalerate, Sucrose
plot(feature.selection.result, type=c("g","o"))
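The idea behind recursive feature elimination can be shown in a minimal base-R loop that repeatedly drops the least important variable; here the absolute t statistic stands in for the random-forest importance used above:

```r
# Minimal RFE loop: rank variables, eliminate the weakest, repeat (toy data).
set.seed(7)
X <- matrix(rnorm(200), nrow = 20)
colnames(X) <- paste0("m", 1:10)
y <- factor(rep(c("a", "b"), each = 10))
keep <- colnames(X)
while (length(keep) > 4) {
  # Importance proxy: absolute two-sample t statistic per remaining variable.
  score <- sapply(keep, function(v) abs(t.test(X[, v] ~ y)$statistic))
  keep <- keep[-which.min(score)]   # drop the least important variable
}
keep   # surviving subset of 4 variables
```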
Selection by filter was also used, with the results shown below:
feature.selection.result2 = feature.selection(autoscaling.cachexia, "Muscle.loss", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance:
##
## Accuracy Kappa AccuracySD KappaSD
## 0.7214 0.4015 0.1625 0.3548
##
## Using the training set, 54 variables were selected:
## X1.6.Anhydro.beta.D.glucose, X2.Aminobutyrate, X2.Hydroxyisobutyrate, X2.Oxoglutarate, X3.Hydroxybutyrate...
##
## During resampling, the top 5 selected variables (out of a possible 60):
## Acetate (100%), Adipate (100%), Alanine (100%), Asparagine (100%), Betaine (100%)
##
## On average, 51.4 variables were selected (min = 44, max = 59)