QSAR Analysis and HTS Interpretation
High-throughput screening (HTS) has become a core component of
drug discovery. HTS produces a large amount of data that must be
analyzed to guide follow-on screens and lead optimization
chemistry. QSAR (quantitative structure-activity relationship)
models attempt to describe and to correlate structural or
property descriptors of compounds with compound activities. In
both cases compounds are represented by physicochemical
descriptors. The goal of an interpretation package is to predict
the activity of new compounds based on observed actives.
Many learning and data mining methods now emerging from the
artificial intelligence community are ideally suited to the
task. These include Bayesian classifiers and probabilistic
models, kernel-based methods, decision tree methods, and scaling
methods that map results into lower-dimensional spaces through
selection of appropriate molecule descriptors.
We have collected these methods into an extensive internal
software package for application to HTS and for other
chemoinformatics tasks associated with maintenance of a
screening library. Here we briefly describe some problems and
opportunities associated with HTS interpretation and with the
related QSAR.
Typical predictions based on classifications of HTS results
ignore the prior distribution of compounds that were screened.
When one analyzes a typical screening library, even one selected
for diversity, compounds fall into ‘islands’ that are far apart
and that have, to an adequate approximation, a continuous
distribution of compounds within them. This allows us to create
a Bayesian classification scheme that measures the size of these
cluster islands and then uses them as “prior” distributions in
the classification of hits.
This classification is based on sampling the immediate
environment of each positive compound and, by taking the ratio
of positives to negatives and formulating statistical priors,
infers a probability that the compound is a true positive. This
approach allows the identification of inferred true positives,
inferred false positives, and inferred false negatives. For
example, most HTS efforts discard isolated hits (clusters of one
compound). The use of priors can differentiate between hits that
are isolated because their chemotype was underrepresented in the
original library and those that are isolated because they are
likely false positives. Isolated hits from underrepresented
chemotypes are worth pursuing.
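
The neighborhood-based inference described above can be sketched
as follows. This is a minimal illustration and not the scheme
implemented in the package: it assumes a Beta prior expressed as
pseudo-counts, a fixed Euclidean radius standing in for the
"immediate environment", and hypothetical names
(neighborhood_posterior, prior_pos, prior_neg, and the toy
library).

```python
import math

def neighborhood_posterior(hit, library, radius, prior_pos=1.0, prior_neg=1.0):
    """Posterior mean probability that `hit` is a true positive.

    library: list of (descriptor_vector, is_active) pairs, excluding the hit.
    prior_pos, prior_neg: Beta prior pseudo-counts (an assumption of this sketch).
    Returns (posterior, number_of_neighbors_sampled).
    """
    pos = neg = 0
    for vector, is_active in library:
        if math.dist(hit, vector) <= radius:  # sample the local environment
            if is_active:
                pos += 1
            else:
                neg += 1
    posterior = (prior_pos + pos) / (prior_pos + prior_neg + pos + neg)
    return posterior, pos + neg

# Toy two-descriptor library (made-up values, for illustration only).
library = [
    ((0.0, 0.1), False), ((0.1, 0.0), False),
    ((0.1, 0.1), False), ((0.0, -0.1), False),
    ((5.0, 5.0), True),
]
crowded = neighborhood_posterior((0.0, 0.0), library, radius=0.5)
isolated = neighborhood_posterior((9.0, 9.0), library, radius=0.5)
supported = neighborhood_posterior((5.1, 5.0), library, radius=0.5)
```

In this toy data the hit surrounded by screened inactives
receives a low posterior (a candidate false positive), whereas
the hit with no screened neighbors stays at the prior, which
flags an underrepresented region rather than evidence against
the hit.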
In the same vein, the distances between the compound vectors can
be defined by their local environments. The straightforward
Cartesian distances between compound descriptor vectors do not
reflect well the similarity between compounds as perceived by
chemists. We define non-Euclidean distances between objects by
confining the analysis to local neighborhoods and then defining
the distance through the conditional probability that two
objects belong to the same chemotype.
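
The text does not give the functional form of this distance, so
the sketch below swaps in a simple stand-in: a shared-nearest-
neighbor distance, where the overlap of two compounds' local
neighborhoods serves as a crude proxy for the probability that
they share a chemotype. The function names, the choice of k, and
the toy vectors are all assumptions made for illustration.

```python
import math

def k_nearest(index, vectors, k):
    """Indices of the k nearest neighbors of vectors[index] (self excluded)."""
    others = [j for j in range(len(vectors)) if j != index]
    others.sort(key=lambda j: math.dist(vectors[index], vectors[j]))
    return set(others[:k])

def shared_neighbor_distance(i, j, vectors, k):
    """1 minus the Jaccard overlap of the two local neighborhoods.

    Each neighborhood includes its own center, so two compounds that sit
    in the same tight cluster get a distance near 0.
    """
    ni = k_nearest(i, vectors, k) | {i}
    nj = k_nearest(j, vectors, k) | {j}
    return 1.0 - len(ni & nj) / len(ni | nj)

# Two well-separated toy "islands" of three compounds each.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
same_island = shared_neighbor_distance(0, 1, vectors, k=2)
different_islands = shared_neighbor_distance(0, 3, vectors, k=2)
```

Because the measure depends only on neighborhood membership, it
is insensitive to the absolute scale of the descriptor space
within an island, which is the property the local definition is
meant to capture.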
A key problem in HTS data mining and in QSAR
is the selection of chemical descriptors used to
express each compound. For any given dataset a large number of
descriptors are noisy or simply irrelevant to the problem at
hand. These descriptors can impede subsequent clustering and
classification procedures. Before applying any of the
classification or clustering schemes we examine how relevant a
descriptor is to the data set. A well-chosen set of descriptors
makes subsequent classification more sensitive and therefore
more precise. Relevance is assessed by applying scaling methods
such as PCA (principal component analysis) and ICA (independent
component analysis), by computing descriptor characteristics
such as entropy over the neighbor data set, and by applying the
non-Euclidean metrics mentioned above.
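
As one concrete example of a descriptor characteristic, the
sketch below computes the Shannon entropy of a single descriptor
after equal-width binning. The binning scheme, the number of
bins, and the function name are assumptions of this sketch; the
text does not specify how entropy is computed in the package.

```python
import math
from collections import Counter

def descriptor_entropy(values, n_bins=4):
    """Shannon entropy (in bits) of one descriptor after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant descriptor carries no information
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    bins = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in bins.values())

constant = [1.0] * 8                      # identical for every compound
spread = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
constant_entropy = descriptor_entropy(constant)
spread_entropy = descriptor_entropy(spread)
```

A descriptor with near-zero entropy over the compounds of
interest cannot help separate them and is a candidate for
removal. High entropy alone does not establish relevance, so a
filter of this kind would be only one of several checks.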
Classification and regression can be derived directly from the
training set by kernel-based methods. Unfortunately, training
sets are often erroneous and noisy. Kernel-based methods
comprise a novel and sensitive class of "instance" methods. In
these methods a hypersurface separating positives and negatives
is constructed from the data points closest to it, called
support vectors. By separating these support vectors the methods
apply a “worst case” analysis, and as such they generalize well
(i.e. they can be trained on small data sets). The methods are
typically robust, in the sense that the choice and normalization
of compound descriptors are usually not critical.
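
The margin idea can be illustrated with the simplest case, a
linear kernel. The sketch below fits a soft-margin linear
classifier by batch subgradient descent on the hinge loss; it is
a teaching example under assumed hyperparameters (lam, lr,
epochs) and toy data, not the package's implementation, and
practical work would normally rely on an established support
vector machine library with nonlinear kernels.

```python
def decision(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Soft-margin linear SVM fitted by batch subgradient descent."""
    n, dim = len(points), len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]    # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(points, labels):
            if y * decision(w, b, x) < 1:  # point violates the margin
                for d in range(dim):
                    grad_w[d] -= y * x[d] / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy descriptor vectors (hypothetical): +1 = active, -1 = inactive.
points = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0),
          (0.0, 0.0), (-1.0, 0.0), (0.0, -1.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
predictions = [1 if decision(w, b, x) > 0 else -1 for x in points]

# Rank points by closeness to the separating line; the first few
# play the role of support vectors.
ranked = sorted(points, key=lambda x: abs(decision(w, b, x)))
```

Only the points nearest the boundary influence the final
solution, which is why the fit is driven by the hardest cases
rather than by the bulk of the data.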
There is a need to rapidly analyze and
classify large compound data sets. Decision trees work
by analyzing molecular descriptors, which may be binary or
numerical, and categorizing compounds using a top-down recursive
procedure applied to descriptor values. Decision tree methods
are used to classify and to rapidly cluster compounds coming
from a high throughput screen to determine recurring compound
classes that hold the most promise for lead optimization. The
method can also be used to create descriptor-based profiles of
ligands for target classes. These profiles can then be used to
rapidly search existing compound databases for other compounds
that fit the profile (“druglike”, “kinase-like”, active, etc.).
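
The top-down recursive procedure can be sketched as below. This
is a minimal illustration that assumes Gini impurity as the
split criterion and a fixed depth limit; the descriptors (a
binary substructure flag and a molecular weight) and all
function names are hypothetical and the values are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (descriptor index, threshold) with the lowest weighted Gini."""
    best, best_score = None, gini(labels)
    for d in range(len(rows[0])):
        for t in sorted({row[d] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[d] <= t]
            right = [y for row, y in zip(rows, labels) if row[d] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:  # keep only strict improvements
                best, best_score = (d, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively partition compounds; leaves hold the majority class."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    d, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[d] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[d] > t]
    return (d, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def classify(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        d, t, left, right = tree
        tree = left if row[d] <= t else right
    return tree

# Columns: (has_substructure_flag, molecular_weight); both hypothetical.
rows = [(1, 320.0), (1, 410.0), (1, 650.0), (0, 300.0), (0, 450.0), (1, 380.0)]
labels = ["active", "active", "inactive", "inactive", "inactive", "active"]
tree = build_tree(rows, labels)
```

Each path from the root to an "active" leaf reads as a
descriptor-based profile (here: flag present and weight at or
below a threshold), and classify applies that profile to new
compounds.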