Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set

Sushko, Iurii; Novotarskyi, Sergii; Körner, Robert; Pandey, Anil Kumar; Cherkasov, Artem; Li, Jiazhong; Gramatica, Paola; Hansen, Katja; Schroeter, Timon; Müller, Klaus-Robert; Xi, Lili; Liu, Huanxiang; Yao, Xiaojun; Öberg, Tomas; Hormozdiari, Farhad; Dao, Phuong; Sahinalp, Cenk; Todeschini, Roberto; Polishchuk, Pavel; Artemenko, Anatoliy; Kuz'min, Victor; Martin, Todd M.; Young, Douglas M.; Fourches, Denis; Muratov, Eugene; Tropsha, Alexander; Baskin, Igor; Horvath, Dragos; Marcou, Gilles; Varnek, Alexander; Prokopenko, Volodymyr V.; Tetko, Igor V.*

doi:10.1021/ci100253r

Estimating the accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties is a critical problem. The “distance to model” (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics based on the standard deviation within an ensemble of QSAR models. The current study applies this analysis to QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds whose prediction accuracy is similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, the in silico predictions can be used to halve the cost of experimental measurements while providing a similar prediction accuracy. The developed model has been made publicly available at http://ochem.eu/models/1.
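
The idea of an ensemble-based DM can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0.5 decision threshold, and the toy predictions are assumptions made purely for illustration. The sketch takes the standard deviation of the ensemble members' outputs as the DM for each compound, ranks compounds by increasing DM, and reports the accuracy of the consensus prediction on the most reliable fraction.

```python
from statistics import mean, pstdev


def ensemble_dm(member_predictions):
    """Return (consensus, dm) for each compound.

    member_predictions[i] holds the outputs of all ensemble members for
    compound i. The DM here is the population standard deviation of
    those outputs (an illustrative choice).
    """
    return [(mean(p), pstdev(p)) for p in member_predictions]


def accuracy_at_coverage(member_predictions, labels, coverage, threshold=0.5):
    """Accuracy of the consensus on the `coverage` fraction with lowest DM."""
    scored = ensemble_dm(member_predictions)
    # Rank compound indices from most reliable (small DM) to least reliable.
    order = sorted(range(len(scored)), key=lambda i: scored[i][1])
    n_keep = max(1, round(coverage * len(order)))
    kept = order[:n_keep]
    # A consensus at or above the (assumed) threshold counts as class 1.
    correct = sum((scored[i][0] >= threshold) == bool(labels[i]) for i in kept)
    return correct / n_keep


# Invented toy data: four compounds, three ensemble members each.
toy_predictions = [
    [0.90, 0.95, 0.92],  # members agree, consensus class 1
    [0.10, 0.05, 0.08],  # members agree, consensus class 0
    [0.20, 0.90, 0.60],  # members disagree strongly
    [0.70, 0.30, 0.35],  # members disagree
]
toy_labels = [1, 0, 0, 1]

if __name__ == "__main__":
    for cov in (0.5, 1.0):
        acc = accuracy_at_coverage(toy_predictions, toy_labels, cov)
        print(f"coverage={cov:.1f} accuracy={acc:.2f}")
```

In this toy case the two compounds on which the ensemble members agree are the ones predicted correctly, so restricting predictions to the low-DM half raises the accuracy. This mirrors, in a highly simplified form, how a DM threshold can single out a subset of compounds with higher prediction accuracy.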