1. Introduction
Quantitative Structure–Activity Relationship (QSAR) modeling offers, at least in principle, an elegant solution to a persistent problem in chemistry and toxicology: how to infer biological activity or physicochemical behavior from molecular structure alone. Since the foundational work of Corwin Hansch and Toshio Fujita in the 1960s, which formalized the relationship between chemical substituents and biological activity, QSAR has evolved from relatively simple regression frameworks into a central pillar of modern computational toxicology and drug discovery (Hansch & Fujita, 1964; Dearden, 2016). Yet despite this apparent maturity, and perhaps because of it, the field continues to wrestle with deep methodological uncertainties, particularly those tied to how datasets are defined, curated, and interpreted.
It is tempting to think of QSAR as a purely computational exercise, driven by descriptors, algorithms, and validation metrics. But this view, while convenient, is incomplete. At its core, QSAR is inseparable from the data it consumes. Each data point represents not just a number, but the outcome of a specific experimental context—conditions, protocols, measurement variability, and sometimes even unreported assumptions. As Alexander Tropsha and colleagues have repeatedly emphasized, the predictive power of QSAR models depends fundamentally on the integrity and meaning of the underlying data (Tropsha, 2010; Fourches et al., 2010). In this sense, QSAR modeling is less about discovering patterns in abstract space and more about reconciling heterogeneous fragments of experimental reality.
This becomes particularly evident when considering dataset curation, which—somewhat paradoxically—remains both the most critical and the least standardized phase of QSAR development. Public chemical and toxicological databases have grown exponentially over the past two decades, fueled by regulatory initiatives and high-throughput screening programs such as Tox21 (Tice et al., 2013). However, the aggregation of such data often introduces inconsistencies: duplicated entries, conflicting measurements, ambiguous chemical representations, and incomplete metadata. The now-familiar principle of “garbage in, garbage out” is not merely rhetorical here—it is a structural limitation. Even subtle inconsistencies, such as how salt forms or tautomers are represented, can shift descriptor calculations and distort similarity relationships, ultimately affecting model predictions (Fourches et al., 2010).
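As a concrete illustration of this curation step, consider the minimal sketch below. It assumes RDKit is available; it strips salt counterions, canonicalizes tautomers, and reduces each record to an InChIKey so that duplicate or conflicting entries become visible. Real curation workflows, such as the one described by Fourches et al. (2010), involve many additional checks (charge neutralization, stereochemistry handling, metadata review).

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize_key(smiles):
    """Reduce a raw SMILES record to a duplicate-detection key:
    strip salts/solvents, canonicalize the tautomer, emit an InChIKey."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable structure: flag for manual review
    mol = rdMolStandardize.FragmentParent(mol)  # keep parent fragment (salt stripping)
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToInchiKey(mol)

# Two forms of the same parent compound collapse onto one key,
# exposing a conflict between their (hypothetical) reported activities:
records = {"CCN.Cl": 0.5, "CCN": 0.7}  # ethylamine hydrochloride vs. free base
keys = {smi: standardize_key(smi) for smi in records}
```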
And yet, the challenge is not only technical—it is also conceptual. What, precisely, constitutes a “dataset” in QSAR modeling? When multiple assays, endpoints, or biological systems are combined under a single label, the resulting dataset may lose semantic clarity. For instance, pooling binding affinity measurements with functional assay outcomes—despite their fundamentally different biological meanings—can lead to models that appear statistically robust but are biologically incoherent. This issue of endpoint ambiguity has been noted in environmental and toxicological modeling contexts, where heterogeneous data sources are frequently merged without sufficient harmonization (Raimondo et al., 2010; Saouter et al., 2017).
Closely related to this is the problem of data sparsity. Although millions of chemical substances are registered globally, only a relatively small subset has been thoroughly characterized in terms of toxicity or biological activity (Commission of the European Communities, 2001). This imbalance creates a narrow and unevenly populated chemical space, within which QSAR models must operate. The concept of the applicability domain (AD)—the region of chemical space where predictions are considered reliable—emerges as a necessary constraint, but also as a limitation. Models trained on sparse or clustered datasets may perform well internally yet fail when extrapolated to novel compounds, particularly those with distinct scaffolds or mechanisms of action (Cherkasov et al., 2014).
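One common way to make the AD operational is the leverage criterion familiar from Williams plots, in which a query compound is flagged as outside the domain when its leverage exceeds the conventional threshold h* = 3(p + 1)/n, where p is the number of descriptors and n the number of training compounds. The NumPy sketch below, using hypothetical descriptor matrices, is one simple formalization among many; distance-, density-, and range-based definitions of the AD are equally common (Cherkasov et al., 2014).

```python
import numpy as np

def leverage_ad(X_train, X_query):
    """Leverage-based applicability domain check (Williams-plot style).
    A query is flagged as outside the AD when its leverage h exceeds
    the conventional threshold h* = 3 * (p + 1) / n."""
    n, p = X_train.shape
    core = np.linalg.pinv(X_train.T @ X_train)  # hat-matrix core from training data
    h = np.einsum("ij,jk,ik->i", X_query, core, X_query)  # one leverage per query row
    h_star = 3.0 * (p + 1) / n
    return h, h <= h_star  # leverages and in-domain flags

# Hypothetical descriptors: 50 training compounds, 5 descriptors each
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 5))
X_query = 4.0 * rng.normal(size=(3, 5))  # deliberately far from the training cloud
leverages, in_domain = leverage_ad(X_train, X_query)
```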
If dataset definition poses one layer of difficulty, bias introduces another, arguably more insidious, layer. Bias in QSAR datasets is not always immediately visible; it often manifests through subtle distortions in class distribution, sampling strategies, or publication practices. Class imbalance, for example, is a pervasive issue in toxicology datasets, where “active” or “toxic” compounds are often overrepresented relative to inactive ones. This imbalance can inflate aggregate performance metrics such as accuracy or AUC, giving a misleading impression of model reliability while masking poor sensitivity toward the minority class (Christley, 2010).
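The arithmetic behind this inflation is easy to reproduce. In the deliberately artificial scikit-learn example below (hypothetical labels, a degenerate model that always predicts the majority class), plain accuracy looks respectable while balanced accuracy and minority-class recall expose the failure.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced dataset: 90 actives (1), 10 inactives (0)
y_true = np.array([1] * 90 + [0] * 10)
y_pred = np.ones_like(y_true)  # degenerate model: call everything "active"

print(accuracy_score(y_true, y_pred))             # 0.90 -- looks strong
print(recall_score(y_true, y_pred, pos_label=0))  # 0.00 -- inactives never recovered
print(balanced_accuracy_score(y_true, y_pred))    # 0.50 -- no better than chance
```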
Sampling bias further complicates the picture. Many QSAR models are developed using datasets enriched with specific chemical classes—pesticides, pharmaceuticals, or industrial chemicals—depending on the research focus. While this may improve performance within that domain, it limits generalizability. A model trained predominantly on triazole fungicides, for instance, may struggle to predict the activity of structurally unrelated compounds. In effect, the model learns not the underlying biology, but the idiosyncrasies of its training set.
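One practical probe for this kind of sampling bias is a scaffold-based split, which withholds entire chemotype families from training so that external performance is measured on genuinely unfamiliar structures. The sketch below, assuming RDKit's Bemis-Murcko scaffolds, is one simple variant of the idea; `scaffold_split` is an illustrative helper, not a standard API.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Hold out whole Bemis-Murcko scaffold families so that the test
    set carries chemotypes the model has never seen during training."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)
    train, test = [], []
    # Largest scaffold families fill the training set; the remaining,
    # rarer scaffolds form a deliberately dissimilar external test set.
    for members in sorted(groups.values(), key=len, reverse=True):
        target = (1 - test_fraction) * len(smiles_list)
        (train if len(train) < target else test).extend(members)
    return train, test
```

A model whose random-split and scaffold-split performance diverge sharply is likely memorizing chemotypes rather than learning transferable structure–activity relationships.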
Publication bias adds yet another dimension. Historically, studies reporting significant or adverse effects have been more likely to be published, leading to an overrepresentation of “positive” findings in the literature (Wandall et al., 2007). In QSAR datasets derived from published sources, this can result in an underreporting of negative or null results, skewing model training and evaluation. Moreover, smaller studies—often with limited statistical power—are more prone to false positives, which can propagate through datasets and into predictive models (Christley, 2010).
Recognizing these challenges, regulatory bodies and international organizations have attempted to formalize best practices for QSAR development. The Organisation for Economic Co-operation and Development (OECD) established a set of validation principles in 2004, emphasizing the need for defined endpoints, transparent algorithms, clear applicability domains, and robust validation procedures (OECD, 2004). These guidelines have undoubtedly improved the transparency and reproducibility of QSAR models, particularly in regulatory contexts such as chemical risk assessment in the European Union (Kluxen et al., 2021).
However, adherence to these principles does not fully resolve the underlying issues. In practice, many QSAR models still exhibit discrepancies between internal validation metrics and external predictive performance. This gap—sometimes subtle, sometimes striking—often reflects unresolved biases or overfitting to curated datasets. As newer approaches, including machine learning and big data analytics, are integrated into QSAR workflows, the risk of amplifying these biases may increase rather than diminish (Kerner et al., 2021; Cote et al., 2016).
Emerging strategies attempt to address these limitations, though not without their own uncertainties. Read-across methods, for example, aim to infer properties of untested chemicals from structurally similar compounds, thereby reducing reliance on new experimental data (He et al., 2017). Similarly, efforts to develop nano-QSAR models extend traditional frameworks to nanomaterials, introducing new descriptors and challenges (Puzyn et al., 2009). While promising, these approaches hinge on the same foundational requirement: the quality and representativeness of the underlying data.
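To make the read-across idea concrete, the sketch below implements a deliberately naive similarity-weighted variant using RDKit Morgan fingerprints; it illustrates the principle only and should not be read as the procedure of He et al. (2017).

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def read_across(query_smiles, source_data, k=3):
    """Naive read-across: predict the query's property as the Tanimoto-
    weighted mean of its k most similar source analogues.
    source_data is a list of (smiles, measured_value) pairs."""
    fp = lambda s: AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(s), 2, nBits=2048)
    query_fp = fp(query_smiles)
    neighbours = sorted(
        ((DataStructs.TanimotoSimilarity(query_fp, fp(s)), v)
         for s, v in source_data),
        reverse=True)[:k]
    weight = sum(sim for sim, _ in neighbours)
    return sum(sim * v for sim, v in neighbours) / weight if weight else None
```

As the sketch makes plain, every prediction is only as trustworthy as the measured values and structural coverage of the source compounds, which returns the argument to data quality.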
In this context, it becomes increasingly clear that QSAR modeling is not merely a technical discipline but an interpretive one. It requires not only computational rigor but also a critical awareness of data provenance, experimental variability, and inherent bias. Ultimately, the question is not whether QSAR can predict chemical behavior (it clearly can) but under what conditions and with what degree of confidence, and that depends less on the sophistication of the algorithms than on the often-overlooked details of the data itself.
This narrative review therefore examines the methodological challenges associated with QSAR dataset definition, critically evaluates the sources of bias that affect model reliability, and highlights emerging strategies to improve predictive robustness, with particular attention to what these challenges mean for QSAR's future role in drug discovery and regulatory science.