PROSAC as a selection tool for SO-PLS regression: a strategy for multi-block data fusion

28 February 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Spectral data from multiple sources can be integrated into multi-block fusion chemometric models, such as sequentially orthogonalized partial least squares (SO-PLS), to improve the prediction of sample quality features. Pre-processing techniques are often applied to mitigate extraneous variability that is unrelated to the response variables. However, selecting suitable pre-processing methods and identifying informative data blocks becomes increasingly complex and time-consuming as the number of blocks grows. The problem addressed in this work is the efficient pre-processing, selection and ordering of data blocks for targeted applications in SO-PLS. We introduce the PROSAC-SO-PLS methodology, which employs pre-processing ensembles with response-oriented sequential alternation calibration (PROSAC). This approach identifies the best pre-processed data blocks and their sequential order for specific SO-PLS applications. The method uses a stepwise forward selection strategy, facilitated by the rapid Gram-Schmidt process, to prioritize the blocks that yield the lowest prediction residuals. To validate the approach, we present results on three empirical near-infrared (NIR) datasets. Comparative analyses were performed against partial least squares (PLS) regressions on single-block pre-processed datasets and against a methodology relying solely on PROSAC. The PROSAC-SO-PLS approach consistently outperformed these methods, yielding significantly lower prediction errors: the root-mean-squared error of prediction (RMSEP) was reduced by 5% to 25% for seven of the eight response variables analyzed. The PROSAC-SO-PLS methodology offers a versatile and efficient technique for ensemble pre-processing in NIR data modeling.
It enables the use of SO-PLS with reduced concern about pre-processing sequence or block order, and it handles a large number of data blocks effectively. This streamlines data pre-processing and model building, and improves the accuracy and efficiency of chemometric models.
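
The stepwise forward selection with Gram-Schmidt orthogonalization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `forward_block_selection`, the block names, the synthetic data, and the use of one score vector per step scored by training residuals (rather than a validated prediction error) are all assumptions made for illustration.

```python
import numpy as np

def forward_block_selection(blocks, y, n_steps):
    """Greedy forward selection of blocks, adding one score vector per step.

    blocks: dict mapping a (hypothetical) block name to an (n, p) array.
    Returns the block chosen at each step and the final response residual.
    """
    y = np.asarray(y, dtype=float)
    r = y - y.mean()                       # current response residual
    centered = {k: X - X.mean(axis=0) for k, X in blocks.items()}
    basis = []                             # orthonormal scores selected so far
    order = []                             # block selected at each step
    for _ in range(n_steps):
        best = None
        for name, X in centered.items():
            t = X @ (X.T @ r)              # score along the max-covariance direction
            for q in basis:                # Gram-Schmidt against earlier scores
                t = t - (q @ t) * q
            norm = np.linalg.norm(t)
            if norm < 1e-12:               # block adds nothing new
                continue
            q_new = t / norm
            r_new = r - (q_new @ r) * q_new
            sse = float(r_new @ r_new)     # residual sum of squares (training)
            if best is None or sse < best[0]:
                best = (sse, name, q_new, r_new)
        if best is None:
            break
        _, name, q_new, r = best
        basis.append(q_new)
        order.append(name)
    return order, r

# Synthetic example: only block "nir_b" carries information about y.
rng = np.random.default_rng(0)
n = 60
blocks = {"nir_a": rng.normal(size=(n, 5)), "nir_b": rng.normal(size=(n, 5))}
y = 3.0 * blocks["nir_b"][:, 0] + 0.1 * rng.normal(size=n)
order, residual = forward_block_selection(blocks, y, n_steps=3)
print(order)
```

In this sketch a block may be chosen at more than one step, and each new score is made orthogonal to those already selected, so every step can only explain variation in the response that earlier steps left behind. A practical version would rank candidates by cross-validated or test-set error instead of the training residual used here.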

Keywords

Multi-block
Chemometrics
Pre-processing
Spectroscopy
Data fusion
NIR
