The propolis peak list data were read and stored in a list with two elements: the dataset, a list of samples with their ppm intensities, and the metadata, which contains each sample's agroregion and region.
setwd("~/Dropbox")
library(metabolomicsUM)
source("Datasets/Propolis/NMR/scripts/propolis_metadata.R")
prop.nmr.metadata.file = "Datasets/Propolis/NMR/metadata/metadata_propolis_agro.csv"
prop.nmr.data.folder = "Datasets/Propolis/NMR/data"
get.metadata.agro(prop.nmr.data.folder, write.file = TRUE, file.name = prop.nmr.metadata.file)
prop.nmr.metadata = read.metadata(prop.nmr.metadata.file)
peaks.lists = read.csvs.folder(prop.nmr.data.folder)
The agroregions metadata was used. Preprocessing consisted of: grouping peaks with our own peak grouping algorithm, removing peak groups with values in less than 25% of the samples, imputing missing values with a low value, and autoscaling.
PREPROCESSING
Our own peak grouping algorithm was used with step = 0.03:
# removing resonances in selected regions
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 0, 0.19)
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 3.29, 3.31)
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 4.85, 5)
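The interval removal above can be sketched on a plain peak table; the `ppm`/`intensity` data frame below is a made-up stand-in, not the package's internal representation:

```r
# Hypothetical peak list: one row per detected peak
peaks <- data.frame(ppm = c(0.10, 1.25, 3.30, 4.90, 7.20),
                    intensity = c(0.02, 0.15, 0.40, 0.08, 0.33))

# Drop peaks whose ppm falls inside a given interval (solvent/water regions)
remove_interval <- function(df, lower, upper) {
  df[df$ppm < lower | df$ppm > upper, , drop = FALSE]
}

peaks <- remove_interval(peaks, 0, 0.19)     # baseline/TMS region
peaks <- remove_interval(peaks, 3.29, 3.31)  # solvent resonance
peaks <- remove_interval(peaks, 4.85, 5)     # water region
peaks$ppm  # only the peaks at 1.25 and 7.20 ppm remain
```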
#group peaks
prop.nmr.ds = group.peaks(peaks.lists, type = "nmr-peaks", metadata = prop.nmr.metadata, description = "NMR propolis", label.x = "ppm", label.values = "intensity")
sum.dataset(prop.nmr.ds)
## Dataset summary:
## Valid dataset
## Description: NMR propolis
## Type of data: nmr-peaks
## Number of samples: 59
## Number of data points 293
## Number of metadata variables: 2
## Label of x-axis values: ppm
## Label of data points: intensity
## Number of missing values in data: 5376
## Mean of data values: 0.09016594
## Median of data values: 0.0287
## Standard deviation: 0.1904829
## Range of values: 0 10
## Quantiles:
## 0% 25% 50% 75% 100%
## 0.0000 0.0081 0.0287 0.0929 10.0000
Peak groups with values in less than 25% of the samples were removed:
nsamps = num.samples(prop.nmr.ds)
prop.nmr.ds = remove.variables.by.nas(prop.nmr.ds, 0.75*nsamps)
The 2659 missing values remaining in the dataset were replaced with a low value (0.00005):
prop.nmr.na = missingvalues.imputation(prop.nmr.ds, method="value", value = 0.00005)
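The two steps above (dropping variables with too many missing values, then filling the remaining NAs with a small constant) can be sketched on a plain samples × variables matrix; the threshold and fill value follow the text, but the matrix itself is synthetic:

```r
set.seed(1)
# Hypothetical samples x variables intensity matrix with missing values
m <- matrix(runif(5 * 4), nrow = 5,
            dimnames = list(paste0("s", 1:5), paste0("v", 1:4)))
m[1:4, 1] <- NA   # v1: only 20% of values present
m[2, 3] <- NA     # v3: 80% of values present

# Keep variables with at least 25% of their values present
present <- colMeans(!is.na(m)) >= 0.25
m <- m[, present, drop = FALSE]

# Replace the remaining NAs with a low value
m[is.na(m)] <- 0.00005
```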
The data were then autoscaled (each variable mean-centered and divided by its standard deviation):
prop.nmr.scal = scaling(prop.nmr.na, "auto")
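Autoscaling can be reproduced with base R's `scale()`; the matrix here is illustrative:

```r
set.seed(2)
m <- matrix(rnorm(30, mean = 5, sd = 3), nrow = 10)  # 10 samples x 3 variables
m_auto <- scale(m, center = TRUE, scale = TRUE)

round(colMeans(m_auto), 10)  # ~0 for every variable
apply(m_auto, 2, sd)         # exactly 1 for every variable
```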
UNIVARIATE TESTS
An analysis of variance (ANOVA) was conducted on the data, followed by Tukey's HSD post-hoc test; the top 20 results ordered by p-value are shown below:
anova.prop.nmr.scal = aov.all.vars(prop.nmr.scal, "agroregions")
anova.prop.nmr.scal[1:20,]
## pvalues logs fdr tukey
## 6.13 2.607511e-06 5.583774 0.0006310178 Plain-Highlands; Plateau-Highlands
## 6.17 1.275155e-05 4.894437 0.0015429370 Plain-Highlands; Plateau-Highlands
## 2.5 2.540262e-05 4.595122 0.0018893344 Plain-Highlands; Plateau-Highlands
## 6.09 3.309534e-05 4.480233 0.0018893344 Plain-Highlands; Plateau-Highlands
## 1.9 3.903583e-05 4.408537 0.0018893344 Plain-Highlands; Plateau-Highlands
## 2.95 5.957016e-05 4.224971 0.0024026632 Plain-Highlands; Plateau-Highlands
## 9.51 8.640438e-05 4.063464 0.0029871227 Plain-Highlands; Plateau-Highlands
## 6.75 1.711229e-04 3.766692 0.0047148604 Plain-Highlands; Plateau-Highlands
## 5.21 1.753460e-04 3.756104 0.0047148604 Plain-Highlands; Plateau-Highlands
## 7.78 3.539747e-04 3.451028 0.0085661869 Plain-Highlands; Plateau-Highlands
## 2.02 4.253408e-04 3.371263 0.0090007500 Plain-Highlands; Plateau-Highlands
## 6.96 4.463182e-04 3.350355 0.0090007500 Plain-Highlands; Plateau-Highlands
## 5.28 9.225031e-04 3.035032 0.0171236034 Plain-Highlands; Plateau-Plain
## 2.47 1.023137e-03 2.990066 0.0171236034 Plain-Highlands; Plateau-Highlands
## 1.99 1.134243e-03 2.945294 0.0171236034 Plain-Highlands; Plateau-Plain
## 2.92 1.135982e-03 2.944628 0.0171236034 Plain-Highlands; Plateau-Highlands
## 2.11 1.202898e-03 2.919771 0.0171236034 Plain-Highlands; Plateau-Plain
## 1.93 1.563358e-03 2.805942 0.0210184791 Plain-Highlands; Plateau-Highlands
## 2.38 1.961535e-03 2.707404 0.0249837635 Plateau-Plain
## 6.78 2.119088e-03 2.673851 0.0252589164 Plain-Highlands; Plateau-Highlands
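Per-variable ANOVA with a Tukey post-hoc can be sketched with base R's `aov`, `TukeyHSD`, and `p.adjust` (the source of the fdr column); the single variable and group labels below are synthetic:

```r
set.seed(3)
group <- factor(rep(c("Highlands", "Plain", "Plateau"), each = 10))
# One hypothetical variable (intensity at one ppm), shifted per group
x <- rnorm(30) + c(0, 1, 2)[as.integer(group)]

fit <- aov(x ~ group)
pval <- summary(fit)[[1]][["Pr(>F)"]][1]  # ANOVA p-value for this variable
tk <- TukeyHSD(fit)$group                 # pairwise group comparisons
sig.pairs <- rownames(tk)[tk[, "p adj"] < 0.05]

# Over many variables, p-values would be corrected with p.adjust(pvals, "fdr")
```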
A heatmap with the correlations between all the variables is shown below:
correl.prop.nmr.scal = correlations.dataset(prop.nmr.scal, method = "pearson")
heatmap(correl.prop.nmr.scal, col = topo.colors(256))
CLUSTERING
Hierarchical clustering with Euclidean distance and complete linkage was performed on the data; the resulting dendrogram is shown below:
hc.prop.nmr.scal = clustering(prop.nmr.scal, method = "hc", distance = "euclidean")
dendrogram.plot.col(prop.nmr.scal, hc.prop.nmr.scal, "agroregions")
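The same kind of clustering can be done with base R on a plain data matrix (`dist` defaults to Euclidean distance, `hclust` to complete linkage); the matrix here is synthetic:

```r
set.seed(4)
m <- matrix(rnorm(20 * 5), nrow = 20)       # 20 samples x 5 variables
hc <- hclust(dist(m), method = "complete")  # Euclidean distance by default
plot(hc, labels = paste0("s", 1:20))        # dendrogram
```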
K-means clustering was also performed on the data with 4 centers. The results are shown below, along with a plot showing, for each cluster, the median of its samples in blue and the values of all samples in that cluster in grey:
kmeans.prop.nmr.scal = clustering(prop.nmr.scal, method = "kmeans", num.clusters = 4)
kmeans.plot(prop.nmr.scal, kmeans.prop.nmr.scal)
kmeans.df = kmeans.result.df(kmeans.prop.nmr.scal, 4)
kmeans.df
## cluster
## 1 1
## 2 2
## 3 3
## 4 4
## samples
## 1 SJ_au UR_au SJ_sp UR_sp SJ_wi UR_wi
## 2 AC_au DC_au PU_au SA_au VR_au DC_sm SA_sm FP_sp SA_sp VR_sp AC_wi AN_wi
## 3 AN_au JB_au AC_sm AN_sm BR_sm CE_sm CN_sm FP_sm JB_sm SJC_sm SJ_sm UR_sm VR_sm AC_sp AN_sp DC_sp JB_sp DC_wi FP_wi JB_wi
## 4 BR_au CE_au CN_au IT_au SJC_au XX_au IT_sm PU_sm XX_sm BR_sp CE_sp CN_sp IT_sp PU_sp SJC_sp BR_wi CE_wi CN_wi PU_wi SA_wi XX_wi
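Base R's `kmeans` gives the same kind of result; grouping sample names by cluster, as in the table above, can be done with `split`. The matrix and sample names are made up:

```r
set.seed(5)
m <- matrix(rnorm(20 * 5), nrow = 20,
            dimnames = list(paste0("s", 1:20), NULL))
km <- kmeans(m, centers = 4, nstart = 25)  # 4 clusters, multiple restarts
split(rownames(m), km$cluster)             # sample names per cluster
```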
PCA
Principal component analysis (PCA) was performed on the data and some plots are shown below:
pca.analysis.result = pca.analysis.dataset(prop.nmr.scal)
pca.pairs.plot(prop.nmr.scal, pca.analysis.result, "agroregions")
pca.screeplot(pca.analysis.result)
pca.scoresplot2D(prop.nmr.scal, pca.analysis.result, "agroregions", ellipses = T)
pca.kmeans.plot2D(prop.nmr.scal, pca.analysis.result, kmeans.result = kmeans.prop.nmr.scal, ellipses = T)
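In base R, PCA can be sketched with `prcomp`; the scores and the per-component explained variance correspond to the scores plot and scree plot above. The data are synthetic:

```r
set.seed(6)
m <- matrix(rnorm(30 * 6), nrow = 30)          # 30 samples x 6 variables
pca <- prcomp(m, center = TRUE, scale. = TRUE)

scores <- pca$x[, 1:2]                         # PC1/PC2 scores (2D scores plot)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)  # scree plot values
plot(scores, xlab = "PC1", ylab = "PC2")
```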
MACHINE LEARNING
For classification models and prediction the following parameters were used:
- models: PLS, J48, JRip, SVM and Random Forests
- validation method: repeated cross-validation
- number of folds: 10
- number of repeats: 10
Below are some results with the best tune for each model:
ml.prop.nmr = train.models.performance(prop.nmr.scal, c("pls", "J48", "JRip", "svmLinear", "rf"), "agroregions", "repeatedcv", num.folds = 10, num.repeats = 10, tunelength = 20, metric = "ROC")
ml.prop.nmr$performance
## Accuracy Kappa Sensitivity Specificity ROC AccuracySD
## pls 0.7338095 0.3839675 0.5722222 0.7834444 0.7777361 0.1387798
## J48 0.6031786 0.2585891 0.5172222 0.7535000 0.6493958 0.1810701
## JRip 0.5603810 0.1553230 0.4433333 0.7147778 0.5973333 0.1729767
## svmLinear 0.6337500 0.1777189 0.4411111 0.7184444 0.7240787 0.1325125
## rf 0.7487381 0.4400561 0.6200000 0.8015000 0.8162870 0.1379266
## KappaSD SensitivitySD SpecificitySD ROCSD
## pls 0.3462588 0.2247333 0.11080480 0.1988755
## J48 0.3333265 0.2284747 0.11631381 0.1811918
## JRip 0.2926550 0.1874408 0.10413636 0.1585233
## svmLinear 0.2817522 0.1653182 0.08340207 0.1786686
## rf 0.3381011 0.2225685 0.10971018 0.1540564
The confusion matrices are shown below, together with a plot using the first 3 PCs that shows the separation of the classes (agroregions):
ml.prop.nmr$confusion.matrices
## $pls
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction Highlands Plain Plateau
## Highlands 8.3 0.0 0.0
## Plain 0.0 6.5 2.6
## Plateau 11.8 12.2 58.6
##
##
## $J48
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction Highlands Plain Plateau
## Highlands 10.2 2.4 10.1
## Plain 1.3 6.6 7.5
## Plateau 8.7 9.7 43.5
##
##
## $JRip
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction Highlands Plain Plateau
## Highlands 4.9 1.7 6.8
## Plain 2.1 6.9 10.1
## Plateau 13.3 10.0 44.3
##
##
## $svmLinear
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction Highlands Plain Plateau
## Highlands 2.2 0.8 1.3
## Plain 0.5 5.5 4.2
## Plateau 17.6 12.2 55.6
##
##
## $rf
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction Highlands Plain Plateau
## Highlands 10.0 0.0 1.0
## Plain 0.0 7.8 3.1
## Plateau 10.2 10.9 57.1
pls.model = ml.prop.nmr$final.models$pls
pca.plot.3d(prop.nmr.scal, pls.model, "agroregions")
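The matrices above report entries as percentages of table totals; from raw predictions this can be computed with base R's `table` and `prop.table` (the labels below are made up):

```r
ref  <- factor(c("Highlands", "Plain", "Plateau", "Plateau", "Plain",
                 "Highlands", "Plateau", "Plateau"))
pred <- factor(c("Highlands", "Plain", "Plateau", "Plain", "Plain",
                 "Plateau", "Plateau", "Plateau"))

cm <- table(Prediction = pred, Reference = ref)
round(100 * prop.table(cm), 1)  # percentages of the table total
```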
And the variable importance for the three agroregion classes, for all models:
summary.var.importance(ml.prop.nmr, 10)
## $pls
## Highlands Plain Plateau Mean
## 2.05 60.34299 100.00000 75.29261 78.54520
## 0.31 54.47714 88.59049 66.30142 69.78968
## 5.31 27.98478 96.09231 66.03913 63.37207
## 2.38 37.49548 83.69323 65.05561 62.08144
## 0.48 48.15096 71.14557 55.98292 58.42648
## 6.13 98.98332 22.45461 50.86057 57.43283
## 6.28 53.85130 63.22312 54.52595 57.20012
## 0.56 38.71769 72.34711 58.71644 56.59375
## 1.49 29.54004 78.54854 58.66925 55.58594
## 3.13 17.09143 92.99659 53.87954 54.65585
##
## $J48
## Highlands Plain Plateau Mean
## 2.11 100.00000 100.00000 88.92950 96.30983
## 5.17 96.24021 96.24021 85.16971 92.55004
## 1.87 100.00000 100.00000 77.44125 92.48042
## 1.9 97.49347 97.49347 72.56745 89.18480
## 5.28 94.98695 94.98695 77.44125 89.13838
## 5.42 96.24021 96.24021 74.72585 89.06876
## 5.62 96.24021 96.24021 73.05483 88.51175
## 2.2 93.73368 93.73368 78.06789 88.51175
## 2.86 75.56136 91.64491 91.64491 86.28372
## 0.66 86.21410 86.21410 86.21410 86.21410
##
## $JRip
## Overall Mean
## 2.11 100 100
## 0.27 0 0
## 0.31 0 0
## 0.34 0 0
## 0.41 0 0
## 0.44 0 0
## 0.48 0 0
## 0.5 0 0
## 0.54 0 0
## 0.56 0 0
##
## $svmLinear
## Highlands Plain Plateau Mean
## 2.11 100.00000 100.00000 88.92950 96.30983
## 5.17 96.24021 96.24021 85.16971 92.55004
## 1.87 100.00000 100.00000 77.44125 92.48042
## 1.9 97.49347 97.49347 72.56745 89.18480
## 5.28 94.98695 94.98695 77.44125 89.13838
## 5.42 96.24021 96.24021 74.72585 89.06876
## 5.62 96.24021 96.24021 73.05483 88.51175
## 2.2 93.73368 93.73368 78.06789 88.51175
## 2.86 75.56136 91.64491 91.64491 86.28372
## 0.66 86.21410 86.21410 86.21410 86.21410
##
## $rf
## Overall Mean
## 2.86 100.00000 100.00000
## 6.17 88.86906 88.86906
## 5.17 87.50601 87.50601
## 5.31 87.41402 87.41402
## 2.8 82.42068 82.42068
## 2.11 81.30559 81.30559
## 2.65 72.49243 72.49243
## 3.25 71.63813 71.63813
## 6.75 70.82594 70.82594
## 5.21 65.99408 65.99408
FEATURE SELECTION
Recursive feature elimination with a random forest classifier was run over various subset sizes. The results are shown below:
feature.selection.result = feature.selection(prop.nmr.scal, "agroregions", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 2 0.5370 0.1279 0.2042 0.3614
## 4 0.5902 0.2177 0.1947 0.3439
## 8 0.6241 0.2474 0.1696 0.3112
## 16 0.6945 0.3715 0.1386 0.2961
## 32 0.7127 0.3956 0.1505 0.3273
## 64 0.7390 0.4432 0.1334 0.3095
## 242 0.7475 0.4494 0.1373 0.3248 *
##
## The top 5 variables (out of 242):
## X6.17, X6.13, X2.11, X5.31, X2.65
plot(feature.selection.result, type=c("g","o"))
Selection by filter was also applied, with the results shown below:
feature.selection.result2 = feature.selection(prop.nmr.scal, "agroregions", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance:
##
## Accuracy Kappa AccuracySD KappaSD
## 0.7435 0.4352 0.1708 0.3906
##
## Using the training set, 73 variables were selected:
## X0.5, X0.56, X0.61, X0.66, X0.99...
##
## During resampling, the top 5 selected variables (out of a possible 118):
## X0.5 (100%), X0.56 (100%), X0.61 (100%), X0.66 (100%), X0.99 (100%)
##
## On average, 65.5 variables were selected (min = 48, max = 82)
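Selection by filter scores each variable independently and keeps the ones passing a threshold; a minimal sketch using a per-variable ANOVA p-value as the filter (an illustration, not necessarily the exact score used by `rfSBF`) on synthetic data:

```r
set.seed(7)
group <- factor(rep(c("Highlands", "Plain", "Plateau"), each = 10))
m <- matrix(rnorm(30 * 8), nrow = 30,
            dimnames = list(NULL, paste0("X", 1:8)))
m[, 1] <- m[, 1] + c(0, 2, 4)[as.integer(group)]  # make X1 informative

# Score every variable with a one-way ANOVA p-value
pvals <- apply(m, 2, function(x)
  summary(aov(x ~ group))[[1]][["Pr(>F)"]][1])

selected <- names(pvals)[pvals < 0.05]  # variables passing the filter
```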