Bioinfo Chem
System biology and Infochemistry | Online ISSN 3071-4826
REVIEWS (Open Access)
Limitations of QSAR Modeling: Data Bias, Curation, and Predictive Reliability in Computational Drug Discovery
Amena Khatun Manica
Bioinfo Chem 5 (1) 1-13 https://doi.org/10.25163/bioinformatics.5110719
Submitted: 14 June 2023 Revised: 07 August 2023 Accepted: 16 August 2023 Published: 18 August 2023
Abstract
Quantitative Structure–Activity Relationship (QSAR) modeling remains a widely used approach in computational toxicology and drug discovery, enabling prediction of biological activity from molecular structure. However, despite decades of methodological development, concerns persist regarding the reliability and generalizability of QSAR models. This narrative review revisits QSAR modeling with a focus on data curation, dataset bias, and their impact on predictive reliability. Rather than viewing QSAR as a purely algorithmic process, this review emphasizes how data quality, experimental variability, and endpoint definition influence model performance. Issues such as class imbalance, sampling bias, and publication bias are shown to significantly affect predictive outcomes, often leading to overestimated model accuracy. In particular, commonly used validation metrics, including R², may fail to reflect true predictive performance when external validation and applicability domain considerations are not adequately addressed. Emerging approaches, including advanced validation strategies, consensus modeling, and improved descriptor frameworks, offer partial solutions to these challenges. However, the findings suggest that QSAR reliability is fundamentally dependent on data integrity, transparency, and appropriate validation rather than computational complexity alone. Overall, this review highlights the need for more robust data curation practices and context-aware validation frameworks to improve predictive modeling in QSAR and enhance its application in drug discovery and computational toxicology.
Keywords: QSAR modeling; dataset curation; bias; applicability domain; model validation
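The abstract's point that R² can overstate predictive performance without external validation and an applicability domain check can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the review: the one-descriptor synthetic dataset, the memorising 1-nearest-neighbour model, the range-based domain check, and the helper names `r_squared`, `predict_1nn`, and `in_domain`. External R² is computed here against the external-set mean, which is only one of several conventions in use.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def predict_1nn(train_x, train_y, x):
    """Memorising model: return the activity of the nearest training compound."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def in_domain(train_x, x):
    """Range-based applicability domain: x lies inside the training descriptor range."""
    return min(train_x) <= x <= max(train_x)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical training set: activity = 2 * descriptor + experimental noise
train_x = [rng.uniform(0.0, 1.0) for _ in range(30)]
train_y = [2.0 * x + rng.gauss(0.0, 0.5) for x in train_x]

# Hypothetical external set drawn from a wider descriptor range (sampling bias)
ext_x = [rng.uniform(0.0, 2.0) for _ in range(30)]
ext_y = [2.0 * x + rng.gauss(0.0, 0.5) for x in ext_x]

# Fit quality on the training data versus performance on unseen compounds
r2_train = r_squared(train_y, [predict_1nn(train_x, train_y, x) for x in train_x])
r2_ext = r_squared(ext_y, [predict_1nn(train_x, train_y, x) for x in ext_x])

# Count external compounds that fall outside the training descriptor range
n_outside = sum(not in_domain(train_x, x) for x in ext_x)

print(f"training R2: {r2_train:.3f}")
print(f"external R2: {r2_ext:.3f}")
print(f"external compounds outside the applicability domain: {n_outside}")
```

Because the model memorises its training compounds, the training R² is perfect by construction and says nothing about new chemistry. The external R² and the count of out-of-domain compounds are the figures that speak to generalizability.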