April 11, 2023
New preprint from the lab:
“DeepDelta: Predicting Pharmacokinetic Improvements of Molecular Derivatives with Deep Learning”
https://chemrxiv.org/engage/chemrxiv/article-details/642d823f0784a63aee949898
The technical novelty of DeepDelta is the transformation of a classically single-molecule task (ADMET property prediction of molecules) into a two-molecule task by pairing molecules. This transformation creates a novel regression task whose training set grows quadratically with the number of molecules in the original dataset. The regression task can be solved with any established molecular machine learning pipeline. We evaluated several established models, including LightGBM and the two-molecule version of Chemprop, and found strong performance and notable improvements over using these and other models (e.g., Random Forest) in the classic single-molecule mode to predict ADMET properties. The final and best-performing version of DeepDelta creates the molecular pairing and then simply uses the previously published and extensively validated two-molecule version of Chemprop, which was developed by Professor William H. Green and the Chemprop team for various two-molecule tasks such as predicting the solubility of solute-solvent pairs. We are grateful to the Chemprop team for making their code open source and for providing feedback on the work; without them, this work could not have been conducted. Although the figures and tables in the manuscript label the models as "Chemprop" and "DeepDelta", both use the same underlying Chemprop D-MPNN architecture. The difference lies in the training data: "Chemprop" refers to single-molecule processing, as established and published by others for this type of task, whereas "DeepDelta" refers to training on data created with the "DeepDelta molecular pairing" approach. Given the improved performance of two-molecule Chemprop on paired data compared with single-molecule Chemprop, we hope this result will inform the future development of molecular machine learning approaches for small datasets.
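To illustrate why pairing increases the amount of training data quadratically, here is a minimal Python sketch of one way such a pairing could be built. It is not the code from the preprint: the function name `make_pairs`, the toy SMILES strings and their property values are hypothetical, and it assumes that the regression target for a pair is the difference between the property values of the two molecules and that every ordered pair (including a molecule paired with itself) is kept.

```python
from itertools import product


def make_pairs(molecules):
    """Pair every molecule with every molecule (hypothetical helper).

    molecules: list of (smiles, property_value) tuples.
    Returns a list of (smiles_a, smiles_b, delta) tuples, where
    delta = value_b - value_a (assumed target definition).
    """
    return [
        (smi_a, smi_b, val_b - val_a)
        for (smi_a, val_a), (smi_b, val_b) in product(molecules, repeat=2)
    ]


# Toy dataset: made-up property values, for illustration only.
toy = [("CCO", 1.0), ("CCN", 2.5), ("CCC", 0.5)]

pairs = make_pairs(toy)

# n molecules yield n * n ordered pairs.
print(len(toy), "molecules ->", len(pairs), "pairs")
for smi_a, smi_b, delta in pairs[:3]:
    print(smi_a, smi_b, delta)
```

Each resulting row holds two SMILES strings and one numeric target, which is the general shape of input a two-molecule regression model expects. In practice, pairs would need to be formed within the training and test splits separately so that no molecule appears on both sides of the split.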