user manual

Chapter 4

The PCA/Factor node provides powerful data-reduction techniques to reduce
the complexity of your data. Principal components analysis (PCA) finds linear
combinations of the input fields that do the best job of capturing the variance in the
entire set of fields, where the components are orthogonal (perpendicular) to each
other. Factor analysis attempts to identify underlying factors that explain the pattern
of correlations within a set of observed fields. For both approaches, the goal is to
find a small number of derived fields that effectively summarize the information in
the original set of fields.
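As a rough illustration of the PCA idea described above, the following Python sketch finds the first principal component of a small two-field data set and reports the share of total variance it captures. It is not the node's implementation: the function name, the sample data, and the use of power iteration on the covariance matrix are all assumptions made for this example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, share of total variance) for the first component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered fields.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Hypothetical data: the second field is roughly twice the first.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, share = first_principal_component(rows)
print(direction, share)
```

Because the two sample fields are almost perfectly correlated, a single derived field along the returned direction summarizes nearly all of the variance in both.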

The Feature Selection node screens input fields for removal based on a set of criteria
(such as the percentage of missing values); it then ranks the importance of remaining
inputs relative to a specified target. For example, given a data set with hundreds of
potential inputs, which are most likely to be useful in modeling patient outcomes?
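The screen-then-rank workflow can be sketched in a few lines of Python. This is an illustration only: the 50% missing-value threshold, the use of absolute Pearson correlation as the importance measure, and the field names are assumptions for the example, not the criteria the node itself applies.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def screen_and_rank(fields, target, max_missing=0.5):
    """Drop fields with too many missing values, then rank the rest."""
    ranked = []
    for name, values in fields.items():
        missing = sum(v is None for v in values) / len(values)
        if missing > max_missing:
            continue  # screened out before ranking
        xs = [v for v in values if v is not None]
        ys = [t for v, t in zip(values, target) if v is not None]
        if not xs:
            continue  # nothing left to score
        ranked.append((name, abs(pearson(xs, ys))))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical patient data; None marks a missing value.
fields = {
    "age": [34, 51, 47, 62, 29, 58],
    "dose": [1.0, None, 2.0, 2.5, 1.0, 3.0],
    "ward": [None, None, None, 3, None, 1],
}
outcome = [0, 1, 1, 1, 0, 1]
ranking = screen_and_rank(fields, outcome)
print(ranking)
```

Here the hypothetical "ward" field is removed at the screening step because two thirds of its values are missing, and only the remaining fields are ranked against the target.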

Discriminant analysis makes more stringent assumptions than logistic regression but
can be a valuable alternative or supplement to a logistic regression analysis when
those assumptions are met.

Logistic regression is a statistical technique for classifying records based on values
of input fields. It is analogous to linear regression but takes a categorical target field
rather than a numeric one.
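To make the analogy with linear regression concrete, the sketch below fits a one-input logistic model by plain gradient descent and uses it to classify records. The data, the learning rate, and the function names are hypothetical, and the fitting method is chosen for brevity; it is not a description of how the node estimates its models.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, rate=0.1, epochs=2000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # gradient of the log loss
            g0 += error
            g1 += error * x
        b0 -= rate * g0 / n
        b1 -= rate * g1 / n
    return b0, b1

# Hypothetical data: hours of preparation and a pass (1) / fail (0) target.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(hours, passed)

def predict(x):
    """Assign the category whose predicted probability is at least 0.5."""
    return "pass" if sigmoid(b0 + b1 * x) >= 0.5 else "fail"

print(predict(0.5), predict(4.0))
```

The linear part, b0 + b1 * x, is the same as in linear regression; the sigmoid turns it into a class probability, which is then thresholded to produce a category.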

The Generalized Linear model expands the general linear model so that the
dependent variable is linearly related to the factors and covariates through a specified
link function. Moreover, the model allows for the dependent variable to have a
non-normal distribution. It covers the functionality of a large number of statistical
models, including linear regression, logistic regression, loglinear models for count
data, and interval-censored survival models.
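The role of the link function can be illustrated with a short Python sketch. The same linear predictor is converted into a predicted mean through three common links: identity (as in linear regression), log (as in loglinear models for counts), and logit (as in logistic regression). The coefficient values and the helper names are hypothetical.

```python
import math

# Each entry pairs a link g with its inverse, so that g(mean) = linear predictor.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

def predicted_mean(coefficients, intercept, row, link):
    """Compute the linear predictor, then map it back to the mean scale."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, row))
    _, inverse_link = LINKS[link]
    return inverse_link(eta)

# Hypothetical coefficients and one input record.
coefficients = [0.3, -0.5]
row = [2.0, 1.0]
for name in LINKS:
    print(name, predicted_mean(coefficients, 0.1, row, name))
```

Only the link changes between the three calls; the linear combination of inputs is identical, which is why one modeling framework can cover several of the models listed above.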

A generalized linear mixed model (GLMM) extends the linear model so that the target
can have a non-normal distribution, is linearly related to the factors and covariates via
a specified link function, and so that the observations can be correlated. Generalized
linear mixed models cover a wide variety of models, from simple linear regression to
complex multilevel models for non-normal longitudinal data.

The Cox regression node enables you to build a survival model for time-to-event data
in the presence of censored records. The model produces a survival function that
predicts the probability that the event of interest has not yet occurred at a given time (t)
for given values of the input variables.
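Under the proportional hazards form of the Cox model, the survival function for a record is the baseline survival function raised to the power exp(b . x), where S(t | x) is the probability of remaining event-free past time t and 1 - S(t | x) is the probability that the event has occurred by then. The sketch below applies that formula; the baseline survival values, the coefficients, and the input records are hypothetical numbers chosen for illustration, not output from the node.

```python
import math

def survival_probability(baseline_survival, coefficients, row):
    """S(t | x) = S0(t) ** exp(b . x) under proportional hazards."""
    relative_risk = math.exp(sum(b * x for b, x in zip(coefficients, row)))
    return baseline_survival ** relative_risk

# Hypothetical baseline survival S0(t) at 6, 12, and 24 months.
baseline = {6: 0.95, 12: 0.85, 24: 0.70}
coefficients = [0.04, 0.6]
low_risk = [0.0, 0.0]    # all inputs at their reference values
high_risk = [10.0, 1.0]

for t, s0 in baseline.items():
    print(t,
          survival_probability(s0, coefficients, low_risk),
          survival_probability(s0, coefficients, high_risk))
```

A record with all inputs at their reference values follows the baseline curve, while positive coefficients combined with larger input values pull the survival curve downward at every time point.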

The Support Vector Machine (SVM) node enables you to classify data into one of
two groups without overfitting. SVM works well with wide data sets, such as those
with a very large number of input fields.
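The two-group classification that an SVM performs can be sketched with a linear model trained on the regularized hinge loss. This is a minimal illustration under stated assumptions: the subgradient-descent training loop, the parameter values, and the sample data are hypothetical, and the node itself may use a different solver and kernel functions.

```python
def train_linear_svm(rows, labels, reg=0.01, rate=0.1, epochs=200):
    """Minimize reg/2 * |w|^2 + mean hinge loss; labels must be +1 or -1."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [reg * wi for wi in w], 0.0
        for x, y in zip(rows, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - rate * gi for wi, gi in zip(w, grad_w)]
        b -= rate * grad_b
    return w, b

def classify(w, b, x):
    """Return the group (+1 or -1) on whose side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable data centered on the origin.
rows = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.5],
        [-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.5]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(rows, labels)
print([classify(w, b, x) for x in rows])
```

The regularization term (reg) penalizes large weights, which is the mechanism that trades a wider margin against fitting every training record and so helps limit overfitting.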

The Bayesian Network node enables you to build a probability model by combining
observed and recorded evidence with real-world knowledge to establish the likelihood
of occurrences. The node focuses on Tree Augmented Naïve Bayes (TAN) and
Markov Blanket networks that are primarily used for classification.