Guidelines for QSAR modeling

QSAR is an acronym for quantitative structure-activity relationship, a widely used ligand-based virtual screening approach that quantitatively correlates the structural features of a set of compounds with their respective biological activity (or simply “bioactivity”). QSAR modeling can be a cumbersome task because each component of the QSAR workflow can be carried out in several different ways.

QSAR workflow

A typical QSAR workflow comprises the following steps:

  1. Compile a dataset for QSAR modeling
  2. Calculate the molecular descriptors that describe the structural features of the compounds in the dataset
  3. Select a subset of descriptors to use via rational selection or feature selection
  4. Perform data splitting (e.g. via the Kennard-Stone algorithm) to separate the dataset into internal and external sets (e.g. 80% and 20% of the original dataset, respectively)
  5. Construct the QSAR model using the internal set as training data
  6. Apply the above QSAR model against the external set
  7. Compare the statistical performance of the QSAR model on the internal and external sets
  8. Assess the robustness of the QSAR model (e.g. to rule out chance correlation) via R^2-Q^2, Y-scrambling (if regression model), applicability domain, etc.
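
The data splitting in step 4 can be sketched in a few lines. The following is a minimal, pure-Python illustration of the Kennard-Stone idea, not a reference implementation: it assumes Euclidean distance on descriptor vectors that have already been scaled, and the function name kennard_stone_split and the toy descriptor values are hypothetical.

```python
import math


def kennard_stone_split(X, n_internal):
    """Return (internal_idx, external_idx) selected by Kennard-Stone.

    X is a list of descriptor vectors (one per compound) and n_internal is
    the number of compounds to place in the internal (training) set.
    """
    n = len(X)
    if not 2 <= n_internal <= n:
        raise ValueError("n_internal must be between 2 and len(X)")

    # Pairwise Euclidean distances in descriptor space (assumed metric).
    dist = [[math.dist(a, b) for b in X] for a in X]

    # Seed the internal set with the two most distant compounds.
    i, j = max(
        ((a, b) for a in range(n) for b in range(a + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [i, j]
    remaining = [k for k in range(n) if k not in (i, j)]

    # Repeatedly add the compound whose nearest selected neighbour is farthest.
    while len(selected) < n_internal:
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)

    return selected, remaining


# Toy example: five compounds described by two (made-up) descriptors,
# split 80/20 into internal and external sets.
descriptors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [5.0, 5.0]]
internal_idx, external_idx = kennard_stone_split(descriptors, n_internal=4)
print("internal:", internal_idx)
print("external:", external_idx)
```

Because the selection starts from the extremes of descriptor space and fills inwards, the internal set tends to span the full range of the data, while the external set falls inside that range. The sketch precomputes all pairwise distances, which is fine for small datasets but scales quadratically with the number of compounds.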

General overview

Data splitting

Statistical measures
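
The measures named in the workflow (R^2, Q^2 and Y-scrambling) can be illustrated with a small sketch. It rests on several assumptions: the model is a single-descriptor least-squares line standing in for a real QSAR model, Q^2 is computed by leave-one-out cross-validation against the mean of all observed activities (other conventions exist), and all function names and data values are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single descriptor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - my) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def fitted_r2(x, y):
    """R^2 of the model on the same data it was fitted to."""
    a, b = fit_line(x, y)
    return r_squared(y, [a + b * xi for xi in x])


def loo_q2(x, y):
    """Leave-one-out Q^2: each compound is predicted by a model fitted without it."""
    preds = []
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(a + b * x[i])
    return r_squared(y, preds)


def y_scramble(x, y, n_rounds=100, seed=0):
    """Refit on randomly permuted activities and collect the resulting R^2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scores.append(fitted_r2(x, shuffled))
    return scores


# Made-up descriptor values and activities for eight compounds.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]

r2 = fitted_r2(x, y)
q2 = loo_q2(x, y)
scrambled = y_scramble(x, y)
print(f"R^2 = {r2:.3f}, Q^2 = {q2:.3f}, R^2 - Q^2 = {r2 - q2:.3f}")
print(f"mean R^2 after Y-scrambling = {sum(scrambled) / len(scrambled):.3f}")
```

A small R^2-Q^2 gap suggests the model is not merely memorizing the internal set, and scrambled models that score far below the real one suggest the original correlation is unlikely to be due to chance. What counts as an acceptable gap or scrambled score is a matter of convention and is not fixed by this sketch.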