pca Module
Implements principal component analysis (PCA) and related operations.
@author: drusk
-
class pml.unsupervised.pca.ReducedDataSet(data, sample_ids, labels, eigenvalues)
A DataSet which has had dimensionality reduction performed on it.
Columns are interpreted as features in the data set, and rows are
observations.
This dimensionally reduced data set has all of the observations of the
original, but its features have been adjusted to be linear combinations
of the originals.
Those features with little variance may have been dropped during the
dimensionality reduction process. Use the percent_variance() method to
find out how much of the original variance has been retained in the
reduced features.
-
__init__(data, sample_ids, labels, eigenvalues)
Creates a new ReducedDataSet.
- Args:
- data: numpy.array
- The raw array with the new data.
- sample_ids: list
- The ids for the samples (rows, observations) in the data set.
- labels: pandas.Series
- The labels, if any, provided for the observations.
- eigenvalues: numpy.array (1D)
- The list of eigenvalues produced to determine which components in
the new feature space were most important. This includes all of
the eigenvalues, not just the ones for the components selected.
-
percent_variance()
Calculates the share of the original DataSet’s variance which is
still present in this dimensionally reduced DataSet.
- Returns:
- A floating point number between 0.0 and 1.0 giving the retained
variance as a fraction (1.0 corresponds to 100%).
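The source does not show how percent_variance() is computed. A minimal sketch of one common approach, assuming the retained share is the sum of the largest eigenvalues kept divided by the sum of all eigenvalues, is given here. The function name retained_variance_fraction and the num_kept parameter are hypothetical and are not part of pml.

```python
def retained_variance_fraction(eigenvalues, num_kept):
    """Fraction of total variance carried by the num_kept largest eigenvalues.

    Hypothetical helper for illustration; not part of the pml API.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("eigenvalues sum to zero; the fraction is undefined")
    return sum(ordered[:num_kept]) / total


# Keeping the two largest of four eigenvalues: (4 + 3) / (4 + 3 + 2 + 1)
print(retained_variance_fraction([4.0, 3.0, 2.0, 1.0], num_kept=2))  # 0.7
```

This is why the constructor takes all of the eigenvalues rather than only those of the selected components: the denominator needs the total.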
-
pml.unsupervised.pca.get_pct_variance_per_principal_component(dataset)
Determines the percentage of variance captured by each principal component
in the data set.
- Args:
- dataset: model.DataSet
- The data set whose principal components will be examined. Should not
already be reduced.
- Returns:
- variances: pandas.Series
- The percentage of variance (as a float between 0.0 and 1.0) for each
principal component.
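As an illustration of what such per-component fractions look like, the following sketch computes them with NumPy from a raw array. It assumes the fractions are the eigenvalues of the sample covariance matrix divided by their sum; it returns a numpy array rather than a pandas.Series, and the function name is hypothetical, not the pml implementation.

```python
import numpy as np


def variance_fraction_per_component(data):
    """Variance fraction of each principal component, largest first.

    Hypothetical helper for illustration; not part of the pml API.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows are observations, columns are features.
    covariance = np.atleast_2d(np.cov(centered, rowvar=False))
    # eigvalsh returns ascending eigenvalues of a symmetric matrix.
    eigenvalues = np.linalg.eigvalsh(covariance)[::-1]
    # Clip tiny negative values caused by floating point round-off.
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    total = eigenvalues.sum()
    if total == 0:
        raise ValueError("data has no variance")
    return eigenvalues / total


# The second column is exactly twice the first, so one component
# carries essentially all of the variance.
fractions = variance_fraction_per_component([[1, 2], [2, 4], [3, 6]])
print(fractions.round(6).tolist())
```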
-
pml.unsupervised.pca.pca(dataset, num_components)
Performs Principal Component Analysis (PCA) on a dataset.
- Args:
- dataset: model.DataSet
- The dataset to be analysed.
- num_components: int
- The number of principal components to select.
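The source does not describe the algorithm or the return value of pca(). For orientation, a textbook covariance-based PCA can be sketched with NumPy as follows. This is an assumption about the general technique, not the pml implementation, and pca_sketch is a hypothetical name. It returns the projected data together with all eigenvalues, mirroring the arguments that ReducedDataSet accepts.

```python
import numpy as np


def pca_sketch(data, num_components):
    """Project data onto its top num_components principal components.

    Hypothetical helper for illustration; not part of the pml API.
    Returns (projected_data, all_eigenvalues_in_descending_order).
    """
    data = np.asarray(data, dtype=float)
    # Centre each column (feature) on zero.
    centered = data - data.mean(axis=0)
    covariance = np.atleast_2d(np.cov(centered, rowvar=False))
    # eigh handles symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Each new feature is a linear combination of the original features.
    projected = centered @ eigenvectors[:, :num_components]
    return projected, eigenvalues


sample = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, -1.0]])
projected, eigenvalues = pca_sketch(sample, num_components=1)
print(projected.shape)  # (3, 1)
```

The number of rows is unchanged, which matches the description of ReducedDataSet: every observation is kept, only the features change.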
-
pml.unsupervised.pca.plot_pct_variance_per_principal_component(dataset, plot_type='bar')
Generates a plot to visualize the percentage of variance captured
by each principal component in the data set.
- Args:
- dataset: model.DataSet
- The data set whose principal components will be examined. Should not
already be reduced.
- plot_type: string
- The plot type to generate. Supported plot types are:
- ‘bar’: vertical bar chart
- ‘barh’: horizontal bar chart
- ‘line’: line chart
Default is ‘bar’.
- Returns:
- None; the function produces a matplotlib plot as a side effect.
- Raises:
- UnsupportedPlotTypeError if plot_type is not recognized.
-
pml.unsupervised.pca.recommend_num_components(dataset, min_pct_variance=0.9)
Recommends the number of principal components that should be selected in
order to keep a minimum specified percentage of the original data’s
variance while also minimizing dimensionality.
- Args:
- dataset: model.DataSet
- The dataset in question.
- min_pct_variance: float
- The minimum percent of variance which should be maintained when
selecting the recommended number of principal components. Should be
between 0.0 and 1.0.
Defaults to 0.9 (i.e. 90%).
- Returns:
- The integer number of principal components which should be selected for
Principal Component Analysis.
- Raises:
- ValueError if min_pct_variance is less than 0.0 or greater than 1.0.
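The selection rule described here can be sketched as a cumulative sum over per-component variance fractions: take components in order of decreasing variance until the running total reaches the threshold. The sketch below assumes that rule and works on a plain list of fractions instead of a DataSet; recommend_count is a hypothetical name, not the pml function.

```python
def recommend_count(variance_fractions, min_pct_variance=0.9):
    """Smallest number of components whose variance reaches the threshold.

    Hypothetical helper for illustration; not part of the pml API.
    """
    if not 0.0 <= min_pct_variance <= 1.0:
        raise ValueError("min_pct_variance must be between 0.0 and 1.0")
    ordered = sorted(variance_fractions, reverse=True)
    cumulative = 0.0
    for count, fraction in enumerate(ordered, start=1):
        cumulative += fraction
        if cumulative >= min_pct_variance:
            return count
    # Round-off can leave the total slightly under the threshold;
    # in that case every component is needed.
    return len(ordered)


# 0.5 + 0.3 is below 0.9, so a third component is required.
print(recommend_count([0.5, 0.3, 0.15, 0.05]))  # 3
```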
-
pml.unsupervised.pca.remove_means(dataset)
Removes the column mean from each value in the dataset.
For example, if a certain column has values [1, 2, 3], the column mean is
2. When the column means are removed, that column will then have the
values [-1, 0, 1].
NOTE: the dataset is modified in place.
- Args:
- dataset: model.DataSet
- The dataset to remove the column means from.
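The worked example above can be reproduced on a plain NumPy array. This sketch assumes a two-dimensional float array in place of a model.DataSet, and remove_column_means is a hypothetical name; it only illustrates the in-place column centring, not the pml implementation.

```python
import numpy as np


def remove_column_means(values):
    """Subtract each column's mean from its entries, modifying values in place.

    Hypothetical helper for illustration; not part of the pml API.
    Expects a float array, since in-place subtraction keeps the dtype.
    """
    values -= values.mean(axis=0)
    return values


data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
remove_column_means(data)
print(data[:, 0].tolist())  # [-1.0, 0.0, 1.0]
```

Because the change happens in place, copy the array first if the original values are still needed.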