User's Guide

2 Importing, Viewing, and Preprocessing Data
by a particular distribution, which is often assumed to be Gaussian. The
statistical nature of the data implies that it contains random variations along
with a deterministic component.
data = deterministic component + random component
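The decomposition above can be illustrated with a minimal sketch. The linear
trend, the noise level, and all variable names are assumptions chosen for
illustration; they do not come from this guide.

```python
import numpy as np

rng = np.random.default_rng(0)                # fixed seed so the run is repeatable

x = np.linspace(0.0, 10.0, 50)
deterministic = 2.0 * x + 1.0                 # hypothetical underlying trend
random_part = rng.normal(0.0, 0.5, x.size)    # Gaussian scatter, sigma = 0.5
data = deterministic + random_part            # data = deterministic + random

# A least-squares line estimates the deterministic component;
# the residuals that remain estimate the random component.
slope, intercept = np.polyfit(x, data, 1)
residuals = data - (slope * x + intercept)
print(slope, intercept, residuals.std())
```

If the Gaussian assumption holds, the fitted slope and intercept should land
close to the assumed trend, and the residual spread should be close to the
assumed noise level.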
However, your data set might contain one or more data points that are
nonstatistical in nature, or are described by a different statistical distribution.
These data points might be easy to identify, or they might be buried in the data
and difficult to identify.
A nonstatistical process can involve the measurement of a physical variable
such as temperature or voltage in which the random variation is negligible
compared to the systematic errors. For example, if your sensor calibration is
inaccurate, the data measured with that sensor will be systematically
inaccurate. In some cases, you might be able to quantify this nonstatistical
data component and correct the data accordingly. However, if the scatter plot
reveals that a handful of response values are far removed from neighboring
response values, these data points are considered outliers and should be
excluded from the fit. Outliers are usually difficult to explain away. For
example, it might be that your sensor experienced a power surge or someone
wrote down the wrong number in a log book.
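One way to screen for response values that are far removed from their
neighbors is a sliding-window median rule (often called a Hampel-style rule).
This is only a sketch of one common technique, not the method this guide
prescribes. The function name, window size, threshold, and sample readings are
all hypothetical.

```python
import numpy as np

def flag_outliers(y, half_window=3, n_sigmas=3.0):
    """Flag points that lie far from the median of their neighbors."""
    y = np.asarray(y, dtype=float)
    flags = np.zeros(y.size, dtype=bool)
    for i in range(y.size):
        lo = max(0, i - half_window)
        hi = min(y.size, i + half_window + 1)
        window = y[lo:hi]
        med = np.median(window)
        # 1.4826 * MAD approximates the standard deviation for Gaussian scatter
        scale = 1.4826 * np.median(np.abs(window - med))
        flags[i] = abs(y[i] - med) > n_sigmas * scale
    return flags

# Hypothetical sensor readings with one spike at index 4
y = [10.1, 9.9, 10.2, 10.0, 25.0, 9.8, 10.1, 10.0, 9.9, 10.2]
flags = flag_outliers(y)
print(np.flatnonzero(flags))   # [4]
```

A flag from a rule like this is only a candidate. Whether the point should
actually be excluded still depends on whether you can justify it.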
If you decide there is justification, you should mark outliers to be excluded
from subsequent fits, particularly parametric fits. Removing these data points
can have a dramatic effect on the fit results because the fitting process
minimizes the sum of the squared residuals, so a single large residual carries
disproportionate weight. If you do not exclude outliers, the resulting fit will
be poor for a large portion of your data. Conversely, if you do exclude the
outliers and choose the appropriate model, the fit results should be reasonable.
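The effect of a single outlier on a least-squares fit can be sketched as
follows. The data, the injected outlier, and the exclusion mask are assumptions
made for illustration, and NumPy's polyfit stands in for whatever fitting
routine you actually use.

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, x.size)    # hypothetical linear data
y[9] += 40.0                                         # inject one gross outlier

exclude = np.zeros(x.size, dtype=bool)
exclude[9] = True                                    # mark the outlier for exclusion

fit_all = np.polyfit(x, y, 1)                        # uses every point
fit_clean = np.polyfit(x[~exclude], y[~exclude], 1)  # skips the marked point

def rms_on_good_points(coeffs):
    """RMS residual measured only on the points that were not excluded."""
    resid = y[~exclude] - np.polyval(coeffs, x[~exclude])
    return np.sqrt(np.mean(resid ** 2))

print("all points:      ", fit_all, rms_on_good_points(fit_all))
print("outlier excluded:", fit_clean, rms_on_good_points(fit_clean))
```

In this sketch the fit that includes the outlier is pulled toward it, so it
describes the remaining nine points much worse than the fit that excludes it.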
Because outliers can have a significant effect on a fit, they are considered
influential data. However, not all influential data points are outliers. For
example, your data set can contain valid data points that are far removed from
the rest of the data. The data is valid because it is well described by the model
used in the fit. The data is influential because its exclusion will dramatically
affect the fit results.
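The difference between an influential point and an outlier can be sketched
with leverage, the diagonal of the hat matrix. Leverage is one common
diagnostic and is an assumption here, since this guide does not name it. The
data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])       # last point is far from the rest
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.2, x.size)    # every point follows the same model

# Leverage: diagonal of the hat matrix H = X (X^T X)^-1 X^T for a straight-line fit
X = np.column_stack([x, np.ones_like(x)])
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

coeffs = np.polyfit(x, y, 1)
residuals = y - np.polyval(coeffs, x)

print("leverage: ", np.round(leverage, 3))
print("residuals:", np.round(residuals, 3))
```

The isolated point has leverage close to 1, so it largely determines the
fitted line, yet its residual is small because the model describes it well.
Under these assumptions it is influential but not an outlier.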