NMR Propolis pipeline 2

Propolis peak-list data were read and stored in a list with 2 elements: the dataset, itself a list of samples with their ppm values and intensities, and the metadata. The propolis metadata consists of the seasons and regions. Seasons were obtained from the last part of the file name; for example, the file name "AC_au" means that the season of the sample is autumn (au). Stations were obtained from the first part of the file name and then assigned to a specific region; in the same example "AC_au", the station is "AC", which is assigned to the West region.
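The file-name parsing described above can be sketched in base R as follows (the actual logic lives in propolis_metadata.R; the station-to-region mapping here is a hypothetical fragment for illustration):

```r
# Sketch only: split "station_season" file names into their two parts.
file.names = c("AC_au", "BR_sp")
parts = strsplit(file.names, "_")
stations = sapply(parts, `[`, 1)            # e.g. "AC" "BR"
seasons  = sapply(parts, `[`, 2)            # e.g. "au" "sp"
region.map = c(AC = "West", BR = "West")    # hypothetical mapping
regions = region.map[stations]
```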

setwd("~/Dropbox")
library(metabolomicsUM)
source("Datasets/Propolis/NMR/scripts/propolis_metadata.R")

prop.nmr.metadata.file = "Datasets/Propolis/NMR/metadata/metadata_propolis.csv"
prop.nmr.data.folder = "Datasets/Propolis/NMR/data"

get.metadata(prop.nmr.data.folder, write.file = TRUE, file.name = prop.nmr.metadata.file)
prop.nmr.metadata = read.metadata(prop.nmr.metadata.file)

peaks.lists = read.csvs.folder(prop.nmr.data.folder)

The analysis below uses the seasons metadata. Preprocessing consisted of: our own peak-grouping algorithm, removal of peak groups with less than 25% of non-missing values, missing-value imputation with a low value, and log transformation.

PREPROCESSING

Our own peak-grouping algorithm was applied with step = 0.03, after removing resonances in selected regions:

# removing resonances in selected regions
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 0, 0.19)
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 3.29, 3.31)
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 4.85, 5)

#group peaks
prop.nmr.ds = group.peaks(peaks.lists, type = "nmr-peaks", metadata = prop.nmr.metadata, description = "NMR propolis", label.x = "ppm", label.values = "intensity")
sum.dataset(prop.nmr.ds)
## Dataset summary:
## Valid dataset
## Description:  NMR propolis 
## Type of data:  nmr-peaks 
## Number of samples:  59 
## Number of data points 293 
## Number of metadata variables:  2 
## Label of x-axis values:  ppm 
## Label of data points:  intensity 
## Number of missing values in data:  5376 
## Mean of data values:  0.09016594 
## Median of data values:  0.0287 
## Standard deviation:  0.1904829 
## Range of values:  0 10 
## Quantiles: 
##      0%     25%     50%     75%    100% 
##  0.0000  0.0081  0.0287  0.0929 10.0000
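The core idea of step-based peak grouping can be illustrated in base R (this is a sketch of the idea, not the package's implementation): peaks whose ppm values fall into the same 0.03-wide bin are merged into one variable.

```r
# Illustrative sketch: assign each peak to the centre of its 0.03 ppm bin.
ppms = c(4.655, 4.662, 4.58, 4.585, 3.90)
step = 0.03
groups = round(ppms / step) * step   # bin centre for each peak
split(ppms, groups)                  # peaks merged per bin
```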

Peak groups with less than 25% of non-missing values were removed:

nsamps = num.samples(prop.nmr.ds)
prop.nmr.ds = remove.variables.by.nas(prop.nmr.ds,  0.75*nsamps)
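The filter above can be sketched in base R, assuming variables in rows and samples in columns (remove.variables.by.nas drops any variable whose NA count exceeds the given threshold, here 75% of the samples):

```r
# Sketch: keep only variables (rows) with at most 75% missing values.
mat = matrix(c(NA, NA, NA, NA,
                5,  6, NA,  8), nrow = 2, byrow = TRUE)
max.nas = 0.75 * ncol(mat)
filtered = mat[rowSums(is.na(mat)) <= max.nas, , drop = FALSE]
# the all-NA first row is dropped; the second row is kept
```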

The 2659 missing values found in the dataset were replaced with a low value (0.00005):

prop.nmr.na = missingvalues.imputation(prop.nmr.ds, method="value", value = 0.00005)

The data were log-transformed:

prop.nmr.log = transform.data(prop.nmr.na, method="log")
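These two steps have a direct base-R equivalent, shown here as a sketch (natural log assumed for illustration; the base used by transform.data is not specified above):

```r
# Replace remaining NAs with the low value, then log-transform.
vals = c(0.12, NA, 0.0287, NA)
vals[is.na(vals)] = 0.00005
log.vals = log(vals)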

UNIVARIATE TESTS

An analysis of variance (ANOVA) was conducted on the data, followed by Tukey's HSD post-hoc test; the top 20 results, ordered by p-value, are shown below:

anova.prop.nmr.log = aov.all.vars(prop.nmr.log, "seasons")
anova.prop.nmr.log[1:20,]
##           pvalues      logs          fdr                             tukey
## 4.66 9.584707e-26 25.018421 2.319499e-23               sm-au; sp-sm; wi-sm
## 4.58 3.384728e-17 16.470476 4.095521e-15               sm-au; sp-sm; wi-sm
## 4.55 6.092443e-14 13.215209 4.914571e-12 sm-au; sp-au; wi-au; sp-sm; wi-sm
## 4.63 1.043963e-13 12.981315 6.315979e-12               sm-au; sp-sm; wi-sm
## 4.71 2.082569e-13 12.681401 1.007963e-11               sm-au; sp-sm; wi-sm
## 4.5  2.643148e-13 12.577878 1.066070e-11        sp-au; wi-au; sp-sm; wi-sm
## 4.08 1.215892e-12 11.915105 4.203512e-11        sp-au; wi-au; sp-sm; wi-sm
## 4.45 2.446829e-12 11.611396 7.026362e-11        sp-au; wi-au; sp-sm; wi-sm
## 4.17 2.613110e-12 11.582842 7.026362e-11        sp-au; wi-au; sp-sm; wi-sm
## 4.31 4.226505e-12 11.374019 1.022814e-10        sp-au; wi-au; sp-sm; wi-sm
## 4.53 1.044161e-11 10.981232 2.297155e-10 sm-au; sp-au; wi-au; sp-sm; wi-sm
## 4.02 6.610158e-11 10.179788 1.333049e-09        sp-au; wi-au; sp-sm; wi-sm
## 4.38 8.033410e-11 10.095100 1.495450e-09        sp-au; wi-au; sp-sm; wi-sm
## 4.05 1.504671e-10  9.822558 2.600932e-09        sp-au; wi-au; sp-sm; wi-sm
## 4.28 2.622619e-10  9.581265 4.231159e-09        sp-au; wi-au; sp-sm; wi-sm
## 4.25 3.868532e-10  9.412454 5.636878e-09        sp-au; wi-au; sp-sm; wi-sm
## 4.34 3.959790e-10  9.402328 5.636878e-09        sp-au; wi-au; sp-sm; wi-sm
## 4.2  1.338480e-09  8.873388 1.799512e-08        sp-au; wi-au; sp-sm; wi-sm
## 4.13 4.538686e-09  8.343070 5.780852e-08        sp-au; wi-au; sp-sm; wi-sm
## 4.74 2.695351e-08  7.569385 3.261375e-07               sm-au; sp-sm; wi-sm
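For a single variable, the test performed per column by aov.all.vars can be sketched in base R as follows (synthetic data; aov.all.vars additionally loops over all variables and adjusts p-values by FDR):

```r
# One-variable ANOVA with Tukey HSD across four synthetic season groups.
set.seed(1)
g = factor(rep(c("au", "sm", "sp", "wi"), each = 5))
x = rnorm(20) + as.numeric(g)   # synthetic data with a group effect
fit = aov(x ~ g)
summary(fit)                    # overall p-value
TukeyHSD(fit)                   # pairwise season comparisons
```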

A heatmap with the Pearson correlations between all variables is shown below:

correl.prop.nmr.log = correlations.dataset(prop.nmr.log, method = "pearson")
heatmap(correl.prop.nmr.log, col =  topo.colors(256))

CLUSTERING

Hierarchical clustering with Euclidean distance and complete linkage was performed on the data; the resulting dendrogram is shown below:

hc.prop.nmr.log = clustering(prop.nmr.log, method = "hc", distance = "euclidean")
dendrogram.plot.col(prop.nmr.log, hc.prop.nmr.log, "seasons")
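The same clustering can be reproduced with base R alone, sketched here on synthetic data:

```r
# Base-R equivalent: Euclidean distances between samples, then
# complete-linkage hierarchical clustering and a dendrogram.
set.seed(1)
m = matrix(rnorm(40), nrow = 8)   # 8 synthetic samples
hc = hclust(dist(m, method = "euclidean"), method = "complete")
plot(hc)
```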

K-means was also performed on the data, with 4 centers. The plot below shows, for each cluster, the median of its samples in blue and the values of all samples in that cluster in grey:

kmeans.prop.nmr.log = clustering(prop.nmr.log, method = "kmeans", num.clusters = 4)
kmeans.plot(prop.nmr.log, kmeans.prop.nmr.log)

kmeans.df = kmeans.result.df(kmeans.prop.nmr.log, 4)
kmeans.df
##   cluster
## 1       1
## 2       2
## 3       3
## 4       4
##                                                                                                                                            samples
## 1 AC_au BR_au XX_sm BR_sp CE_sp CN_sp DC_sp FP_sp IT_sp JB_sp PU_sp SA_sp SJC_sp SJ_sp UR_sp VR_sp AN_wi BR_wi CE_wi CN_wi PU_wi SA_wi UR_wi XX_wi
## 2                                                                         DC_au JB_au PU_au SA_au SJ_au SJC_au UR_au VR_au XX_au DC_sm SA_sm SJ_wi
## 3                                                                                                        AN_au AC_sp AN_sp AC_wi DC_wi FP_wi JB_wi
## 4                                                 CE_au CN_au IT_au AC_sm AN_sm BR_sm CE_sm CN_sm FP_sm IT_sm JB_sm PU_sm SJC_sm SJ_sm UR_sm VR_sm
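Cluster membership tables like the one above come directly from k-means assignments; a base-R sketch on synthetic data:

```r
# Base-R k-means with 4 centres; km$cluster holds the assignment
# that kmeans.result.df formats into a cluster/samples table.
set.seed(42)
m = matrix(rnorm(200), nrow = 20)    # 20 synthetic samples
km = kmeans(m, centers = 4)
split(seq_len(nrow(m)), km$cluster)  # sample indices per cluster
```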

PCA

Principal component analysis (PCA) was performed on the data and some plots are shown below:

pca.analysis.result = pca.analysis.dataset(prop.nmr.log)

pca.pairs.plot(prop.nmr.log, pca.analysis.result, "seasons")

pca.screeplot(pca.analysis.result)

pca.scoresplot2D(prop.nmr.log, pca.analysis.result, "seasons", ellipses = T)

pca.kmeans.plot2D(prop.nmr.log, pca.analysis.result, kmeans.result = kmeans.prop.nmr.log, ellipses = T)
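For reference, the same analysis can be sketched with base R's prcomp (synthetic data; the package's plots are built on the equivalent scores and variances):

```r
# PCA in base R: scree information via summary(), scores in p$x.
set.seed(1)
m = matrix(rnorm(200), nrow = 20)
p = prcomp(m, scale. = TRUE)
summary(p)          # variance explained per component (scree plot data)
head(p$x[, 1:2])    # PC1/PC2 scores used in a 2D scores plot
```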

MACHINE LEARNING

For classification models and prediction the following parameters were used:
- models: PLS, J48, JRip, SVM (linear kernel) and Random Forests
- validation method: repeated cross-validation
- number of folds: 10
- number of repeats: 10
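train.models.performance wraps caret; a minimal sketch of the underlying caret call for one of the models, using iris as stand-in data (method = "pls" requires the pls package; tuneLength reduced for speed):

```r
# Hypothetical minimal caret equivalent of one model in the comparison.
library(caret)
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10)
model = train(Species ~ ., data = iris, method = "pls",
              trControl = ctrl, tuneLength = 5)
model$results   # resampled performance per tuning value
```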

Below are some results with the best tune for each model:

ml.prop.nmr = train.models.performance(prop.nmr.log, c("pls", "J48", "JRip", "svmLinear", "rf"), "seasons", "repeatedcv", num.folds = 10, num.repeats = 10, tunelength = 20, metric = "ROC")
ml.prop.nmr$performance
##            Accuracy     Kappa Sensitivity Specificity       ROC AccuracySD
## pls       0.8242143 0.7617456     0.81500   0.9427917 0.9736042  0.1380070
## J48       0.7320714 0.6383464     0.71375   0.9121667 0.8470937  0.1611039
## JRip      0.5808452 0.4368772     0.58000   0.8608333 0.7537187  0.2056991
## svmLinear 0.8651190 0.8162027     0.85625   0.9553333 0.9455417  0.1449692
## rf        0.8130952 0.7476420     0.80625   0.9391250 0.9419792  0.1581879
##             KappaSD SensitivitySD SpecificitySD      ROCSD
## pls       0.1844654     0.1480940    0.04457783 0.04383465
## J48       0.2128784     0.1669458    0.05228599 0.13400425
## JRip      0.2728461     0.2121023    0.06820870 0.14949104
## svmLinear 0.1964400     0.1572882    0.04839628 0.08631047
## rf        0.2102166     0.1631978    0.05157078 0.09150792

The confusion matrices, and a plot of the first 3 PCs showing the separation of the four classes (seasons), are shown below:

ml.prop.nmr$confusion.matrices
## $pls
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##           Reference
## Prediction   au   sm   sp   wi
##         au 17.9  0.0  1.2  1.8
##         sm  0.9 25.3  0.5  0.0
##         sp  3.0  1.8 22.9  4.0
##         wi  3.6  0.0  0.8 16.2
## 
## 
## $J48
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##           Reference
## Prediction   au   sm   sp   wi
##         au 18.3  0.0  2.7  9.8
##         sm  0.0 25.4  0.2  0.0
##         sp  1.6  1.7 21.0  4.0
##         wi  5.4  0.0  1.5  8.5
## 
## 
## $JRip
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##           Reference
## Prediction   au   sm   sp   wi
##         au 12.2  4.0  2.7  3.6
##         sm  3.5 18.8  1.7  2.3
##         sp  5.7  1.4 18.1  7.1
##         wi  4.1  3.0  2.9  9.0
## 
## 
## $svmLinear
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##           Reference
## Prediction   au   sm   sp   wi
##         au 20.3  0.0  0.9  1.8
##         sm  1.5 25.4  0.8  1.4
##         sp  0.1  1.6 22.2  0.2
##         wi  3.4  0.0  1.7 18.6
## 
## 
## $rf
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##           Reference
## Prediction   au   sm   sp   wi
##         au 19.6  0.0  1.4  2.1
##         sm  1.0 25.2  0.0  0.0
##         sp  1.4  1.7 22.1  5.6
##         wi  3.6  0.0  2.0 14.4

pls.model = ml.prop.nmr$final.models$pls
pca.plot.3d(prop.nmr.log, pls.model, "seasons")

The variable importance for the four season classes, across all models, is shown below:

summary.var.importance(ml.prop.nmr, 10)
## $pls
##             au       sm       sp       wi     Mean
## 4.66 100.00000 87.99574 78.26532 76.23799 85.62476
## 4.84  93.15999 75.51208 86.47502 67.58277 80.68246
## 4.58  64.71789 68.76926 57.41307 54.59438 61.37365
## 3.93  65.77182 42.33916 64.37470 70.38216 60.71696
## 4.81  77.22518 64.64432 53.81721 41.29494 59.24541
## 4.71  64.87917 67.30350 52.75528 48.28403 58.30550
## 3.9   77.83870 49.05336 49.98558 54.33163 57.80232
## 4.63  57.80789 62.52451 52.53770 47.49601 55.09153
## 1.55  73.31049 34.48552 49.33235 54.49136 52.90493
## 2.14  37.53567 33.28398 74.45243 66.32951 52.90040
## 
## $J48
##             au  sm  sp        wi      Mean
## 4.66 100.00000 100 100 100.00000 100.00000
## 4.58  96.00000 100 100 100.00000  99.00000
## 4.55  83.00000 100 100 100.00000  95.75000
## 4.5   89.00000 100 100  92.69231  95.42308
## 4.38  89.00000 100 100  88.84615  94.46154
## 4.08  89.00000 100 100  88.07692  94.26923
## 4.17  89.00000 100 100  88.07692  94.26923
## 4.31  83.66667 100 100  91.92308  93.89744
## 4.45  83.66667 100 100  91.15385  93.70513
## 4.63  88.33333  95  95  95.00000  93.33333
## 
## $JRip
##      Overall Mean
## 0.41     100  100
## 0.44     100  100
## 2.8      100  100
## 2.83     100  100
## 3.9      100  100
## 4.2      100  100
## 6.03     100  100
## 0.27       0    0
## 0.31       0    0
## 0.34       0    0
## 
## $svmLinear
##             au  sm  sp        wi      Mean
## 4.66 100.00000 100 100 100.00000 100.00000
## 4.58  96.00000 100 100 100.00000  99.00000
## 4.55  83.00000 100 100 100.00000  95.75000
## 4.5   89.00000 100 100  92.69231  95.42308
## 4.38  89.00000 100 100  88.84615  94.46154
## 4.08  89.00000 100 100  88.07692  94.26923
## 4.17  89.00000 100 100  88.07692  94.26923
## 4.31  83.66667 100 100  91.92308  93.89744
## 4.45  83.66667 100 100  91.15385  93.70513
## 4.63  88.33333  95  95  95.00000  93.33333
## 
## $rf
##        Overall      Mean
## 4.66 100.00000 100.00000
## 5.99  47.44296  47.44296
## 4.58  26.16193  26.16193
## 6.03  18.31512  18.31512
## 4.71  16.81943  16.81943
## 2.32  15.29759  15.29759
## 4.08  15.01731  15.01731
## 4.17  13.42245  13.42245
## 6.28  10.92276  10.92276
## 5.56  10.45265  10.45265

FEATURE SELECTION

Recursive feature elimination was used with a random forest classifier over various subset sizes. The results are shown below:

feature.selection.result = feature.selection(prop.nmr.log, "seasons", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          2   0.6487 0.5256     0.1370  0.1763         
##          4   0.6986 0.5929     0.1093  0.1408         
##          8   0.7076 0.6043     0.1259  0.1648         
##         16   0.7737 0.6953     0.1454  0.1951         
##         32   0.8111 0.7443     0.1406  0.1905         
##         64   0.8299 0.7698     0.1407  0.1894        *
##        242   0.7938 0.7209     0.1512  0.2025         
## 
## The top 5 variables (out of 64):
##    X4.66, X4.38, X4.08, X4.5, X4.17
plot(feature.selection.result, type=c("g","o"))
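The feature.selection call wraps caret's RFE machinery; a sketch of the equivalent caret setup (the predictor matrix 'x' and season factor 'y' are hypothetical names, so the final call is left commented):

```r
# Sketch of the underlying caret RFE configuration: random-forest
# selection functions, repeated 10-fold CV, the same subset sizes.
library(caret)
ctrl = rfeControl(functions = rfFuncs, method = "repeatedcv",
                  number = 10, repeats = 5)
# rfe.res = rfe(x, y, sizes = 2^(1:6), rfeControl = ctrl)
```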

Selection by filter was also used, with the results shown below:

feature.selection.result2 = feature.selection(prop.nmr.log, "seasons", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2
## 
## Selection By Filter
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance:
## 
##  Accuracy  Kappa AccuracySD KappaSD
##     0.816 0.7516     0.1378  0.1826
## 
## Using the training set, 96 variables were selected:
##    X0.41, X0.44, X0.73, X0.76, X0.93...
## 
## During resampling, the top 5 selected variables (out of a possible 161):
##    X0.41 (100%), X0.44 (100%), X0.76 (100%), X1.55 (100%), X2.14 (100%)
## 
## On average, 92.3 variables were selected (min = 78, max = 108)