Cachexia concentration data was read and stored in a dataset. This consists of a list containing 5 elements: the data itself (a matrix with the metabolites and their concentrations); the metadata (a data frame, in this case with a single variable, Muscle.loss); a string describing the type of the data (“concentrations”); a short description; and the labels information.
Before running any code, make sure that you set your working directory, with the function setwd, to the base folder of the package.
setwd("~/Dropbox")
library(metabolomicsUM)
# create the dataset (a list with data, metadata, type, description and labels)
cachexia.metadata.file = "Datasets/Cachexia/metadata/metadata_cachexia.csv"
cachexia.data.file = "Datasets/Cachexia/data/human_cachexia.csv"
label.x = "compound"
label.val = "concentration"
cachexia.ds = read.dataset.csv(cachexia.data.file, cachexia.metadata.file, description = "Human cachexia", type = "concentrations", label.x = label.x, label.values = label.val)
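The resulting structure can be quickly checked with base R; a minimal sketch, where the element names data and metadata are assumptions about the layout of the list described above:
# inspect the 5 elements of the dataset list (element names assumed)
str(cachexia.ds, max.level = 1)
dim(cachexia.ds$data)                    # data matrix dimensions
table(cachexia.ds$metadata$Muscle.loss)  # class counts for Muscle.loss
sum(is.na(cachexia.ds$data))             # number of missing values (0 expected)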
This dataset does not contain any missing values. In this pipeline, no filtering or normalization was performed.
PREPROCESSING
The dataset contains 77 samples and 63 variables (metabolites).
UNIVARIATE TESTS
t-tests were performed on the dataset and the top 10 results ordered by p-value are shown below:
ttest.cachexia = tTests.dataset(cachexia.ds, "Muscle.loss")
ttest.cachexia[1:10,]
## p.value -log10 fdr
## Quinolinate 0.0001185108 3.926242 0.003264126
## Valine 0.0001394238 3.855663 0.003264126
## N.N-Dimethylglycine 0.0001554346 3.808452 0.003264126
## Leucine 0.0002695046 3.569434 0.004244697
## Dimethylamine 0.0004460069 3.350658 0.004616827
## Pyroglutamate 0.0004845363 3.314674 0.004616827
## Creatinine 0.0005129808 3.289899 0.004616827
## Glutamine 0.0010607678 2.974380 0.007467813
## Alanine 0.0011788609 2.928537 0.007467813
## 3-Hydroxybutyrate 0.0011853671 2.926147 0.007467813
plot.ttests(cachexia.ds, ttest.cachexia, 0.001)
Fold change was calculated on the dataset and the top 10 results, ordered by the absolute value of log2(fold change), are shown below:
foldchange.cachexia = fold.change(cachexia.ds, "Muscle.loss", "cachexic")
foldchange.cachexia[1:10,]
## FoldChange log2(FC)
## Glucose 0.1703999 -2.553004
## Adipate 0.2582965 -1.952900
## Creatine 0.2944562 -1.763875
## Lactate 0.3021080 -1.726864
## cis-Aconitate 0.3323026 -1.589431
## 3-Hydroxybutyrate 0.3382929 -1.563655
## myo-Inositol 0.3444904 -1.537464
## Trigonelline 0.3633852 -1.460428
## Sucrose 0.3704707 -1.432569
## Succinate 0.3746879 -1.416239
plot.fold.change(cachexia.ds, foldchange.cachexia, 3)
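The same ordering can be applied to the full table with base R if more than the top 10 entries are needed; a sketch assuming the second column of the result holds log2(FC):
# order the fold change table by |log2(FC)|, largest first
fc.order = order(abs(foldchange.cachexia[, 2]), decreasing = TRUE)
head(foldchange.cachexia[fc.order, ], 10)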
The volcano plot combining the t-tests and fold changes is shown below:
both = volcano.plot.fc.tt(cachexia.ds, foldchange.cachexia, ttest.cachexia, 3, 0.001)
both
## NULL
A heatmap with the correlations between all the variables is shown below:
correlation.cachexia = correlations.dataset(cachexia.ds, method = "pearson")
heatmap(correlation.cachexia, col = topo.colors(256))
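The correlation matrix can also be queried directly; a minimal base-R sketch listing the most strongly correlated metabolite pairs (the 0.9 cutoff is an arbitrary choice for illustration):
# keep each pair only once by masking the upper triangle and the diagonal
cor.mat = correlation.cachexia
cor.mat[upper.tri(cor.mat, diag = TRUE)] = NA
strong = which(abs(cor.mat) >= 0.9, arr.ind = TRUE)
data.frame(metabolite.1 = rownames(cor.mat)[strong[, 1]],
           metabolite.2 = colnames(cor.mat)[strong[, 2]],
           correlation = cor.mat[strong])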
CLUSTERING
Hierarchical clustering with Euclidean distance and complete linkage was performed on the data and the resulting dendrogram is shown below:
hc.result = clustering(cachexia.ds, method = "hc", distance = "euclidean")
dendrogram.plot.col(cachexia.ds, hc.result)
dendrogram.plot.col(cachexia.ds, hc.result, lab.cex = 0.4)  # repeat with smaller sample labels
K-means was performed on the data with 2 and 4 centers. The results are shown below, together with plots giving, for each cluster, the median of its samples in blue and the values of all samples in that cluster in grey.
With 2 centers:
kmeans.result.2 = clustering(cachexia.ds, method = "kmeans", num.clusters = 2)
kmeans.plot(cachexia.ds, kmeans.result.2)
kmeans.df.2 = kmeans.result.df(kmeans.result.2, 2)
kmeans.df.2
## cluster
## 1 1
## 2 2
## samples
## 1 PIF_178 PIF_087 PIF_090 NETL_005_V1 PIF_110 NETL_008_V1 PIF_146 PIF_143 NETCR_007_V2 PIF_137 PIF_132 PIF_163 NETL_028_V1 NETCR_013_V1 NETCR_012_V2 PIF_089 PIF_114 NETCR_006_V1 NETCR_025_V1 NETCR_025_V2 NETCR_016_V1 PIF_164 NETCR_005_V1 NETL_002_V2 NETCR_009_V1
## 2 PIF_115 NETL_019_V1 NETCR_014_V1 NETCR_014_V2 PIF_154 NETL_022_V1 NETL_022_V2 PIF_119 PIF_099 PIF_162 PIF_160 PIF_113 NETCR_007_V1 PIF_100 NETL_004_V1 PIF_094 NETCR_003_V1 NETL_028_V2 NETL_020_V1 NETL_020_V2 PIF_192 NETCR_012_V1 NETCR_002_V1 PIF_179 PIF_141 PIF_116 PIF_191 NETL_013_V1 PIF_188 PIF_195 NETCR_015_V1 PIF_102 NETL_010_V1 NETL_010_V2 NETL_001_V1 NETCR_015_V2 PIF_111 PIF_171 NETCR_008_V1 NETCR_008_V2 NETL_017_V1 NETL_017_V2 NETL_002_V1 PIF_190 NETCR_009_V2 NETL_007_V1 PIF_112 NETCR_019_V2 NETL_012_V1 NETL_012_V2 NETL_003_V1 NETL_003_V2
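The 2-cluster assignment can be compared with the muscle loss classes; a sketch assuming that the object returned by clustering() keeps the standard kmeans component cluster and that the sample classes are stored in the metadata element:
# cross-tabulate cluster membership against Muscle.loss (assumed structure)
table(cluster = kmeans.result.2$cluster, class = cachexia.ds$metadata$Muscle.loss)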
With 4 centers:
kmeans.result.4 = clustering(cachexia.ds, method = "kmeans", num.clusters = 4)
kmeans.plot(cachexia.ds, kmeans.result.4)
kmeans.df.4 = kmeans.result.df(kmeans.result.4, 4)
kmeans.df.4
## cluster
## 1 1
## 2 2
## 3 3
## 4 4
## samples
## 1 PIF_115 NETL_019_V1 NETCR_014_V2 PIF_154 NETL_022_V1 NETL_022_V2 PIF_146 PIF_162 PIF_160 PIF_113 PIF_143 NETCR_007_V1 PIF_094 PIF_163 NETCR_013_V1 NETL_020_V2 NETCR_002_V1 PIF_141 NETCR_015_V1 PIF_102 NETL_001_V1 NETCR_015_V2 NETCR_005_V1 PIF_171 NETCR_008_V2 NETL_002_V1 NETCR_019_V2
## 2 NETCR_014_V1 PIF_119 PIF_099 PIF_100 NETL_004_V1 NETCR_003_V1 NETL_028_V2 NETL_020_V1 PIF_192 NETCR_012_V1 PIF_179 PIF_116 PIF_191 NETL_013_V1 PIF_188 PIF_195 NETL_010_V1 NETL_010_V2 PIF_111 NETCR_008_V1 NETL_017_V1 NETL_017_V2 PIF_190 NETCR_009_V2 NETL_007_V1 PIF_112 NETL_012_V1 NETL_012_V2 NETL_003_V1 NETL_003_V2
## 3 PIF_178 PIF_087 PIF_110 NETL_008_V1 NETCR_007_V2 PIF_137 NETCR_012_V2 PIF_089 PIF_114 NETCR_006_V1 NETCR_025_V1 NETCR_016_V1 PIF_164 NETL_002_V2 NETCR_009_V1
## 4 PIF_090 NETL_005_V1 PIF_132 NETL_028_V1 NETCR_025_V2
PCA
Principal component analysis (PCA) was performed on the data and some plots are shown below:
pca.analysis.result = pca.analysis.dataset(cachexia.ds)
pca.pairs.plot(cachexia.ds, pca.analysis.result, "Muscle.loss")
pca.screeplot(pca.analysis.result)
pca.scoresplot2D(cachexia.ds, pca.analysis.result, "Muscle.loss", ellipses = TRUE)
pca.kmeans.plot2D(cachexia.ds, pca.analysis.result, kmeans.result = kmeans.result.2, ellipses = TRUE, labels = TRUE)
pca.kmeans.plot2D(cachexia.ds, pca.analysis.result, kmeans.result = kmeans.result.4, ellipses = TRUE)
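The numeric proportion of variance explained by each component (shown graphically in the scree plot) can also be computed; a sketch assuming pca.analysis.result is, or behaves like, a prcomp object with an sdev component:
# proportion of variance explained per principal component (prcomp-like object assumed)
round(pca.analysis.result$sdev^2 / sum(pca.analysis.result$sdev^2), 3)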
MACHINE LEARNING
For classification models and prediction, the following parameters were used:
- models: PLS, J48, JRip, SVM and Random Forests
- validation method: repeated cross-validation
- number of folds: 10
- number of repeats: 10
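For reference, the equivalent resampling setup written directly with caret would look like the sketch below (train.models.performance appears to build on caret; this is an illustration, not the package's internal code):
# repeated 10-fold cross-validation with 10 repeats, using class probabilities for ROC
library(caret)
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
                    classProbs = TRUE, summaryFunction = twoClassSummary)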
Below are some results with the best tune for each model:
performances = train.models.performance(cachexia.ds, c("pls","J48", "JRip","svmLinear","rf"), "Muscle.loss", "repeatedcv", metric = "ROC")
performances$performance
## Accuracy Kappa ROC AccuracySD KappaSD ROCSD
## pls 0.6810714 0.3371415 0.7650000 0.1635654 0.3301522 0.1670282
## J48 0.6078571 0.1590593 0.6087500 0.1622585 0.3292922 0.1852152
## JRip 0.6796429 0.3264800 0.6602500 0.1587851 0.3356477 0.1706011
## svmLinear 0.6157143 0.1222838 0.7031667 0.1496509 0.3317783 0.2175738
## rf 0.7237500 0.4030278 0.7983333 0.1476351 0.3236571 0.1596415
And the variable importance for the muscle loss classes, for all models (top 10 shown):
summary.var.importance(performances, 10)
## $pls
## Overall Mean
## Creatinine 100.000000 100.000000
## Glucose 57.703402 57.703402
## Hippurate 54.619548 54.619548
## Citrate 28.895949 28.895949
## Trimethylamine_N_oxide 19.721237 19.721237
## Glycine 14.267857 14.267857
## Lactate 10.976540 10.976540
## Glutamine 10.550499 10.550499
## myo_Inositol 9.989980 9.989980
## Creatine 9.271334 9.271334
##
## $J48
## cachexic control Mean
## Quinolinate 100.00000 100.00000 100.00000
## Glucose 99.85935 99.85935 99.85935
## Adipate 97.18706 97.18706 97.18706
## N.N_Dimethylglycine 94.51477 94.51477 94.51477
## Valine 93.53024 93.53024 93.53024
## Leucine 91.84248 91.84248 91.84248
## 3_Hydroxyisovalerate 89.87342 89.87342 89.87342
## Betaine 89.59212 89.59212 89.59212
## Creatine 88.04501 88.04501 88.04501
## myo_Inositol 87.76371 87.76371 87.76371
##
## $JRip
## Overall Mean
## Creatine 100 100
## 1.6_Anhydro_beta_D_glucose 0 0
## 1_Methylnicotinamide 0 0
## 2_Aminobutyrate 0 0
## 2_Hydroxyisobutyrate 0 0
## 2_Oxoglutarate 0 0
## 3_Aminoisobutyrate 0 0
## 3_Hydroxybutyrate 0 0
## 3_Hydroxyisovalerate 0 0
## 3_Indoxylsulfate 0 0
##
## $svmLinear
## cachexic control Mean
## Quinolinate 100.00000 100.00000 100.00000
## Glucose 99.85935 99.85935 99.85935
## Adipate 97.18706 97.18706 97.18706
## N.N_Dimethylglycine 94.51477 94.51477 94.51477
## Valine 93.53024 93.53024 93.53024
## Leucine 91.84248 91.84248 91.84248
## 3_Hydroxyisovalerate 89.87342 89.87342 89.87342
## Betaine 89.59212 89.59212 89.59212
## Creatine 88.04501 88.04501 88.04501
## myo_Inositol 87.76371 87.76371 87.76371
##
## $rf
## Overall Mean
## 3_Hydroxyisovalerate 100.00000 100.00000
## Creatine 84.18837 84.18837
## Adipate 77.73108 77.73108
## N.N_Dimethylglycine 72.33048 72.33048
## myo_Inositol 71.46391 71.46391
## Betaine 66.05395 66.05395
## Sucrose 62.77924 62.77924
## Methylamine 61.88298 61.88298
## Acetate 60.47007 60.47007
## Glucose 58.29863 58.29863
FEATURE SELECTION
Using recursive feature selection, various subset sizes were evaluated with a random forest classifier. The results are shown below:
feature.selection.result = feature.selection(cachexia.ds, "Muscle.loss", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 2 0.6400 0.2345 0.1376 0.2949
## 4 0.6546 0.2802 0.1722 0.3585
## 8 0.6954 0.3572 0.1532 0.3297
## 16 0.6832 0.3253 0.1525 0.3222
## 32 0.7036 0.3692 0.1501 0.3168
## 63 0.7143 0.3888 0.1503 0.3200 *
##
## The top 5 variables (out of 63):
## Creatine, Glucose, X3.Hydroxyisovalerate, Quinolinate, Sucrose
plot(feature.selection.result, type=c("g","o"))
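If the returned object is a standard caret rfe result (as the printed output suggests), the variables in the winning subset can be listed with predictors(); this is an assumption about the return type:
# list the selected variables (assumes a caret rfe object)
predictors(feature.selection.result)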
Selection by filter was also used, with the results shown below:
feature.selection.result2 = feature.selection(cachexia.ds, "Muscle.loss", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance:
##
## Accuracy Kappa AccuracySD KappaSD
## 0.7186 0.396 0.1625 0.3452
##
## Using the training set, 40 variables were selected:
## X2.Aminobutyrate, X2.Hydroxyisobutyrate, X3.Hydroxybutyrate, X3.Hydroxyisovalerate, X3.Indoxylsulfate...
##
## During resampling, the top 5 selected variables (out of a possible 52):
## Acetate (100%), Alanine (100%), Asparagine (100%), Betaine (100%), cis.Aconitate (100%)
##
## On average, 38.7 variables were selected (min = 31, max = 45)