The propolis peak list data was read and stored in a list with two elements: the dataset, a list of samples with their ppm intensities, and the propolis metadata, consisting of seasons and regions. Seasons were obtained from the last part of the file name; for example, the file name "AC_au" means the season of the sample is autumn (au). Stations were obtained from the first part of the file name and then assigned to the corresponding region; in the same example "AC_au", the station is "AC", which was assigned to the West region.
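This filename convention can be illustrated with a small base-R sketch (the station-to-region lookup table below is a hypothetical example; only the "AC" entry is confirmed by the text):

```r
# Parse station and season from a propolis file name such as "AC_au":
# the part before "_" is the station, the part after it is the season.
parse.sample.name = function(name) {
  parts = strsplit(name, "_")[[1]]
  list(station = parts[1], season = parts[2])
}

# Hypothetical station-to-region lookup (only "AC" -> West is given)
station.regions = c(AC = "West")

s = parse.sample.name("AC_au")
s$season                    # "au" (autumn)
station.regions[s$station]  # "West"
```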
setwd("~/Dropbox")
library(metabolomicsUM)
source("Datasets/Propolis/NMR/scripts/propolis_metadata.R")
prop.nmr.metadata.file = "Datasets/Propolis/NMR/metadata/metadata_propolis.csv"
prop.nmr.data.folder = "Datasets/Propolis/NMR/data"
get.metadata(prop.nmr.data.folder, write.file = TRUE, file.name = prop.nmr.metadata.file)
prop.nmr.metadata = read.metadata(prop.nmr.metadata.file)
peaks.lists = read.csvs.folder(prop.nmr.data.folder)
The seasons metadata was used. Preprocessing consisted of: our own peak grouping algorithm, removal of peak groups with less than 25% of values, missing value imputation with a low value, log transformation, and autoscaling.
PREPROCESSING
Our own peak grouping algorithm was used with step = 0.03:
# removing resonances in selected regions
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 0, 0.19)
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 3.29, 3.31)
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 4.85, 5)
#group peaks
prop.nmr.ds = group.peaks(peaks.lists, type = "nmr-peaks", metadata = prop.nmr.metadata, description = "NMR propolis", label.x = "ppm", label.values = "intensity")
sum.dataset(prop.nmr.ds)
## Dataset summary:
## Valid dataset
## Description: NMR propolis
## Type of data: nmr-peaks
## Number of samples: 59
## Number of data points 293
## Number of metadata variables: 2
## Label of x-axis values: ppm
## Label of data points: intensity
## Number of missing values in data: 5376
## Mean of data values: 0.09016594
## Median of data values: 0.0287
## Standard deviation: 0.1904829
## Range of values: 0 10
## Quantiles:
## 0% 25% 50% 75% 100%
## 0.0000 0.0081 0.0287 0.0929 10.0000
Peak groups with less than 25% of values were removed:
nsamps = num.samples(prop.nmr.ds)
prop.nmr.ds = remove.variables.by.nas(prop.nmr.ds, 0.75*nsamps)
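Conceptually, this keeps only the peak groups observed in at least 25% of the samples. A base-R sketch of the idea (not the package implementation, which takes a count threshold rather than a fraction):

```r
# Drop variables (columns) with more than 75% missing values,
# i.e. keep peak groups present in at least 25% of samples.
filter.by.nas = function(mat, max.na.frac = 0.75) {
  na.frac = colSums(is.na(mat)) / nrow(mat)
  mat[, na.frac <= max.na.frac, drop = FALSE]
}
```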
There are 2659 missing values in the dataset, which will be replaced with a low value (0.00005).
prop.nmr.na = missingvalues.imputation(prop.nmr.ds, method="value", value = 0.00005)
The data was log-transformed and autoscaled:
prop.nmr.log = transform.data(prop.nmr.na, method="log")
prop.nmr.log.scal = scaling(prop.nmr.log, "auto")
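For reference, these two steps amount to a log transformation followed by mean-centering and unit-variance scaling of each variable. A base-R sketch (assuming base-10 logs; the package may use a different base or an offset for zeros):

```r
# Log-transform, then autoscale: center each column to mean 0
# and scale it to standard deviation 1.
log.autoscale = function(mat) {
  scale(log10(mat), center = TRUE, scale = TRUE)
}
```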
UNIVARIATE TESTS
An analysis of variance (ANOVA) with Tukey's post-hoc test was conducted on the data; the top 20 results, ordered by p-value, are shown below:
anova.prop.nmr.log.scal = aov.all.vars(prop.nmr.log.scal, "seasons")
anova.prop.nmr.log.scal[1:20,]
## pvalues logs fdr tukey
## 4.66 9.584707e-26 25.018421 2.319499e-23 sm-au; sp-sm; wi-sm
## 4.58 3.384728e-17 16.470476 4.095521e-15 sm-au; sp-sm; wi-sm
## 4.55 6.092443e-14 13.215209 4.914571e-12 sm-au; sp-au; wi-au; sp-sm; wi-sm
## 4.63 1.043963e-13 12.981315 6.315979e-12 sm-au; sp-sm; wi-sm
## 4.71 2.082569e-13 12.681401 1.007963e-11 sm-au; sp-sm; wi-sm
## 4.5 2.643148e-13 12.577878 1.066070e-11 sp-au; wi-au; sp-sm; wi-sm
## 4.08 1.215892e-12 11.915105 4.203512e-11 sp-au; wi-au; sp-sm; wi-sm
## 4.45 2.446829e-12 11.611396 7.026362e-11 sp-au; wi-au; sp-sm; wi-sm
## 4.17 2.613110e-12 11.582842 7.026362e-11 sp-au; wi-au; sp-sm; wi-sm
## 4.31 4.226505e-12 11.374019 1.022814e-10 sp-au; wi-au; sp-sm; wi-sm
## 4.53 1.044161e-11 10.981232 2.297155e-10 sm-au; sp-au; wi-au; sp-sm; wi-sm
## 4.02 6.610158e-11 10.179788 1.333049e-09 sp-au; wi-au; sp-sm; wi-sm
## 4.38 8.033410e-11 10.095100 1.495450e-09 sp-au; wi-au; sp-sm; wi-sm
## 4.05 1.504671e-10 9.822558 2.600932e-09 sp-au; wi-au; sp-sm; wi-sm
## 4.28 2.622619e-10 9.581265 4.231159e-09 sp-au; wi-au; sp-sm; wi-sm
## 4.25 3.868532e-10 9.412454 5.636878e-09 sp-au; wi-au; sp-sm; wi-sm
## 4.34 3.959790e-10 9.402328 5.636878e-09 sp-au; wi-au; sp-sm; wi-sm
## 4.2 1.338480e-09 8.873388 1.799512e-08 sp-au; wi-au; sp-sm; wi-sm
## 4.13 4.538686e-09 8.343070 5.780852e-08 sp-au; wi-au; sp-sm; wi-sm
## 4.74 2.695351e-08 7.569385 3.261375e-07 sm-au; sp-sm; wi-sm
A heatmap with the correlations between all the variables is shown below:
correl.prop.nmr.log.scal = correlations.dataset(prop.nmr.log.scal, method = "pearson")
heatmap(correl.prop.nmr.log.scal, col = topo.colors(256))
CLUSTERING
Hierarchical clustering with Euclidean distance and complete linkage was performed on the data; the resulting dendrogram is shown below:
hc.prop.nmr.log.scal = clustering(prop.nmr.log.scal, method = "hc", distance = "euclidean")
dendrogram.plot.col(prop.nmr.log.scal, hc.prop.nmr.log.scal, "seasons")
K-means was also performed on the data, with 4 centers. The results are shown below, together with a plot giving, for each cluster, the median of its samples in blue and the values of all samples in that cluster in grey:
kmeans.prop.nmr.log.scal = clustering(prop.nmr.log.scal, method = "kmeans", num.clusters = 4)
kmeans.plot(prop.nmr.log.scal, kmeans.prop.nmr.log.scal)
kmeans.df = kmeans.result.df(kmeans.prop.nmr.log.scal, 4)
kmeans.df
## cluster
## 1 1
## 2 2
## 3 3
## 4 4
## samples
## 1 AC_sp AC_wi DC_wi FP_wi JB_wi
## 2 AC_au DC_au JB_au SA_au SJ_au UR_au VR_au DC_sm SA_sm DC_sp FP_sp JB_sp SA_sp SJ_sp UR_sp VR_sp AN_wi SJ_wi UR_wi
## 3 BR_au CE_au CN_au IT_au SJC_au XX_au PU_sm XX_sm BR_sp CE_sp CN_sp IT_sp PU_sp SJC_sp BR_wi CE_wi CN_wi PU_wi SA_wi XX_wi
## 4 AN_au PU_au AC_sm AN_sm BR_sm CE_sm CN_sm FP_sm IT_sm JB_sm SJC_sm SJ_sm UR_sm VR_sm AN_sp
PCA
Principal component analysis (PCA) was performed on the data and some plots are shown below:
pca.analysis.result = pca.analysis.dataset(prop.nmr.log.scal)
pca.pairs.plot(prop.nmr.log.scal, pca.analysis.result, "seasons" )
pca.screeplot(pca.analysis.result)
pca.scoresplot2D(prop.nmr.log.scal, pca.analysis.result, "seasons", ellipses = T)
pca.kmeans.plot2D(prop.nmr.log.scal, pca.analysis.result, kmeans.result = kmeans.prop.nmr.log.scal, ellipses = T)
MACHINE LEARNING
For classification models and prediction, the following parameters were used: - models: PLS, J48, JRip, SVM (linear kernel) and Random Forests - validation method: repeated cross-validation - number of folds: 10 - number of repeats: 10
Below are some results with the best tune for each model:
ml.prop.nmr = train.models.performance(prop.nmr.log.scal, c("pls", "J48", "JRip", "svmLinear", "rf"), "seasons", "repeatedcv", num.folds = 10, num.repeats = 10, tunelength = 20, metric = "ROC")
ml.prop.nmr$performance
## Accuracy Kappa Sensitivity Specificity ROC AccuracySD
## pls 0.8572024 0.8056019 0.85375 0.9525833 0.9694792 0.1357642
## J48 0.7415833 0.6498089 0.72625 0.9146667 0.8635729 0.1489946
## JRip 0.5441667 0.3828271 0.53125 0.8472083 0.7179375 0.1996879
## svmLinear 0.8563810 0.8047246 0.85625 0.9518333 0.9481875 0.1575904
## rf 0.8146071 0.7489153 0.81125 0.9394167 0.9444271 0.1415351
## KappaSD SensitivitySD SpecificitySD ROCSD
## pls 0.1826373 0.1388251 0.04539228 0.05386586
## J48 0.1967172 0.1514815 0.04892030 0.11275812
## JRip 0.2652724 0.2044139 0.06644812 0.13939771
## svmLinear 0.2129548 0.1582885 0.05329794 0.08677381
## rf 0.1909017 0.1491717 0.04698552 0.06944766
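The training call and its outputs follow the conventions of the caret package (e.g. the method names "svmLinear" and "rf", and the repeated cross-validation summaries). A roughly equivalent direct caret call for the PLS model is sketched below (assuming `data.mat` holds the sample-by-variable data matrix and `seasons` the class labels; both names are illustrative):

```r
library(caret)

# Repeated 10-fold cross-validation with 10 repeats, as above
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
                    classProbs = TRUE)

# Tune a PLS model over 20 values of ncomp; with four classes,
# ROC-based selection needs a multiclass summary function, so the
# default Accuracy metric is used in this sketch.
pls.fit = train(x = data.mat, y = seasons, method = "pls",
                tuneLength = 20, trControl = ctrl)
```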
The full result of the performance of pls:
ml.prop.nmr$full.results$pls[,c("ncomp","Accuracy","Kappa","Sensitivity","Specificity","ROC","AccuracySD","KappaSD","SensitivitySD","SpecificitySD","ROCSD")]
## ncomp Accuracy Kappa Sensitivity Specificity ROC AccuracySD
## 1 1 0.4387500 0.2418491 0.42375 0.8113333 0.7445833 0.1267891
## 2 2 0.6244405 0.5004722 0.62250 0.8780000 0.8238333 0.2082399
## 3 3 0.7410357 0.6473117 0.72750 0.9135417 0.9183125 0.1685280
## 4 4 0.7903929 0.7150935 0.78500 0.9297083 0.9428542 0.1719610
## 5 5 0.8342976 0.7739395 0.82500 0.9445417 0.9545417 0.1419489
## 6 6 0.8209643 0.7548706 0.80875 0.9398333 0.9481458 0.1434605
## 7 7 0.8249167 0.7598123 0.81500 0.9409583 0.9519167 0.1399328
## 8 8 0.8295952 0.7662852 0.82000 0.9426667 0.9571458 0.1500886
## 9 9 0.8404405 0.7823278 0.83375 0.9467917 0.9614375 0.1435827
## 10 10 0.8471786 0.7911976 0.83875 0.9490000 0.9610625 0.1400432
## 11 11 0.8498571 0.7953201 0.84375 0.9501250 0.9626042 0.1389159
## 12 12 0.8566190 0.8044764 0.85125 0.9523750 0.9646250 0.1354606
## 13 13 0.8600357 0.8096542 0.85625 0.9537500 0.9648125 0.1333602
## 14 14 0.8555357 0.8033797 0.85125 0.9520833 0.9670000 0.1370769
## 15 15 0.8572024 0.8056019 0.85375 0.9525833 0.9694792 0.1357642
## 16 16 0.8652143 0.8163498 0.86250 0.9551250 0.9685417 0.1350068
## 17 17 0.8672143 0.8192909 0.86500 0.9559583 0.9676042 0.1355114
## 18 18 0.8712024 0.8246542 0.87000 0.9572083 0.9669792 0.1302891
## 19 19 0.8726310 0.8265986 0.87125 0.9577083 0.9675208 0.1309150
## 20 20 0.8701310 0.8232653 0.86875 0.9568750 0.9676250 0.1355843
## KappaSD SensitivitySD SpecificitySD ROCSD
## 1 0.1532159 0.1149975 0.03792585 0.13482502
## 2 0.2692999 0.2086967 0.06747220 0.12224601
## 3 0.2265762 0.1744218 0.05644917 0.09087135
## 4 0.2305916 0.1759534 0.05739889 0.08301140
## 5 0.1904355 0.1517990 0.04717279 0.07939084
## 6 0.1942065 0.1553497 0.04802979 0.08487572
## 7 0.1922017 0.1502103 0.04790085 0.07843321
## 8 0.2066522 0.1611982 0.05143876 0.07868633
## 9 0.1941391 0.1528919 0.04818848 0.06969494
## 10 0.1896204 0.1510642 0.04698067 0.07025206
## 11 0.1879684 0.1490447 0.04638842 0.06873015
## 12 0.1833818 0.1450955 0.04529356 0.06331607
## 13 0.1794070 0.1369018 0.04433976 0.06294169
## 14 0.1843234 0.1406772 0.04569538 0.05575070
## 15 0.1826373 0.1388251 0.04539228 0.05386586
## 16 0.1818378 0.1381927 0.04528814 0.05836376
## 17 0.1824107 0.1383981 0.04533999 0.05922196
## 18 0.1756069 0.1328590 0.04361543 0.05920641
## 19 0.1764677 0.1334930 0.04381809 0.05993260
## 20 0.1825622 0.1380499 0.04537139 0.05914866
The confusion matrices, and a plot using the first 3 PCs showing the separation of the four classes (seasons), are shown below:
ml.prop.nmr$confusion.matrices
## $pls
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction au sm sp wi
## au 17.6 0.0 1.4 1.7
## sm 0.9 25.5 0.0 0.6
## sp 3.3 1.6 23.9 1.1
## wi 3.7 0.0 0.0 18.6
##
##
## $J48
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction au sm sp wi
## au 18.4 0.0 1.4 8.6
## sm 0.1 25.4 0.0 0.0
## sp 1.0 1.7 21.1 4.3
## wi 5.8 0.0 2.8 9.2
##
##
## $JRip
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction au sm sp wi
## au 9.6 4.6 2.1 3.2
## sm 5.4 18.8 2.3 2.8
## sp 5.7 2.3 18.7 8.7
## wi 4.8 1.5 2.3 7.4
##
##
## $svmLinear
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction au sm sp wi
## au 19.7 0.0 1.7 1.7
## sm 1.8 25.3 0.5 1.0
## sp 0.6 1.7 21.3 0.2
## wi 3.2 0.0 1.9 19.3
##
##
## $rf
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction au sm sp wi
## au 18.9 0.0 1.5 1.8
## sm 1.0 25.3 0.0 0.0
## sp 1.4 1.6 22.1 5.1
## wi 4.1 0.0 2.0 15.2
pls.model = ml.prop.nmr$final.models$pls
pca.plot.3d(prop.nmr.log.scal, pls.model, "seasons")
The variable importance in the four season classes, for all models:
summary.var.importance(ml.prop.nmr, 10)
## $pls
## au sm sp wi Mean
## 4.66 48.12489 100.00000 72.54891 46.79194 66.86643
## 4.58 37.29387 91.23171 51.82092 33.59906 53.48639
## 4.71 33.49937 85.06308 47.08909 33.06510 49.67916
## 4.84 44.32077 56.07410 68.88678 25.93734 48.80475
## 4.63 32.74774 83.79981 46.32908 32.25388 48.78263
## 3.93 52.68751 30.80042 61.34304 45.68981 47.63020
## 5.62 51.44097 21.87526 63.90300 48.64462 46.46596
## 3.9 56.98534 18.46927 54.99129 44.46450 43.72760
## 4.28 39.69453 57.48872 37.54457 35.71307 42.61022
## 4.17 40.01824 58.45372 39.47756 32.28298 42.55812
##
## $J48
## au sm sp wi Mean
## 4.66 100.00000 100 100 100.00000 100.00000
## 4.58 96.00000 100 100 100.00000 99.00000
## 4.55 83.00000 100 100 100.00000 95.75000
## 4.5 89.00000 100 100 92.69231 95.42308
## 4.38 89.00000 100 100 88.84615 94.46154
## 4.08 89.00000 100 100 88.07692 94.26923
## 4.17 89.00000 100 100 88.07692 94.26923
## 4.31 83.66667 100 100 91.92308 93.89744
## 4.45 83.66667 100 100 91.15385 93.70513
## 4.63 88.33333 95 95 95.00000 93.33333
##
## $JRip
## Overall Mean
## 3.22 100 100
## 4.08 100 100
## 4.2 100 100
## 5.6 100 100
## 5.62 100 100
## 0.27 0 0
## 0.31 0 0
## 0.34 0 0
## 0.41 0 0
## 0.44 0 0
##
## $svmLinear
## au sm sp wi Mean
## 4.66 100.00000 100 100 100.00000 100.00000
## 4.58 96.00000 100 100 100.00000 99.00000
## 4.55 83.00000 100 100 100.00000 95.75000
## 4.5 89.00000 100 100 92.69231 95.42308
## 4.38 89.00000 100 100 88.84615 94.46154
## 4.08 89.00000 100 100 88.07692 94.26923
## 4.17 89.00000 100 100 88.07692 94.26923
## 4.31 83.66667 100 100 91.92308 93.89744
## 4.45 83.66667 100 100 91.15385 93.70513
## 4.63 88.33333 95 95 95.00000 93.33333
##
## $rf
## Overall Mean
## 4.66 100.000000 100.000000
## 5.99 46.403569 46.403569
## 4.58 21.637899 21.637899
## 6.28 14.768304 14.768304
## 2.32 13.484496 13.484496
## 6.03 10.477688 10.477688
## 4.71 10.422266 10.422266
## 4.2 9.350766 9.350766
## 5.56 9.292416 9.292416
## 4.17 9.144397 9.144397
FEATURE SELECTION
Recursive feature selection was applied with a random forest classifier over various subset sizes. The results are shown below:
feature.selection.result = feature.selection(prop.nmr.log.scal, "seasons", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 2 0.6500 0.5253 0.1394 0.1754
## 4 0.7057 0.6003 0.1217 0.1551
## 8 0.7190 0.6202 0.1331 0.1701
## 16 0.7773 0.6982 0.1502 0.1997
## 32 0.8260 0.7624 0.1379 0.1866
## 64 0.8379 0.7785 0.1464 0.1993 *
## 242 0.8118 0.7437 0.1556 0.2089
##
## The top 5 variables (out of 64):
## X4.66, X4.08, X4.17, X4.5, X4.38
plot(feature.selection.result, type=c("g","o"))
Selection by filter was also applied, with the results shown below:
feature.selection.result2 = feature.selection(prop.nmr.log.scal, "seasons", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance:
##
## Accuracy Kappa AccuracySD KappaSD
## 0.841 0.7852 0.1537 0.2059
##
## Using the training set, 96 variables were selected:
## X0.41, X0.44, X0.73, X0.76, X0.93...
##
## During resampling, the top 5 selected variables (out of a possible 161):
## X0.41 (100%), X0.44 (100%), X0.76 (100%), X1.55 (100%), X2.32 (100%)
##
## On average, 91 variables were selected (min = 79, max = 106)