pca Module

Implements principal component analysis (PCA) and related operations.

@author: drusk

class pml.unsupervised.pca.ReducedDataSet(data, sample_ids, labels, eigenvalues)

A DataSet which has had dimensionality reduction performed on it.

Columns are interpreted as features in the data set, and rows are observations.

This dimensionally reduced data set has all of the observations of the original, but its features have been adjusted to be linear combinations of the originals.

Those features with little variance may have been dropped during the dimensionality reduction process. Use the percent_variance() method to find out how much of the original variance has been retained in the reduced features.

__init__(data, sample_ids, labels, eigenvalues)

Creates a new ReducedDataSet.

Args:
data: numpy.array
The raw array with the new data.
sample_ids: list
The ids for the samples (rows, observations) in the data set.
labels: pandas.Series
The labels, if any, provided for the observations.
eigenvalues: numpy.array (1D)
The list of eigenvalues produced to determine which components in the new feature space were most important. This includes all of the eigenvalues, not just the ones for the components selected.
percent_variance()

Calculates the percentage of the original DataSet’s variance which is still present in this dimensionally reduced DataSet.

Returns:
A floating point number between 0.0 and 1.0 giving the retained variance as a fraction (e.g. 0.9 means 90%).
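
As a rough illustration of the idea, and not necessarily how pml computes it, the retained variance can be sketched as the sum of the eigenvalues for the kept components divided by the sum of all eigenvalues. The function name and the num_kept parameter below are hypothetical and do not come from pml.

```python
def retained_variance_fraction(eigenvalues, num_kept):
    """Fraction of total variance carried by the num_kept largest eigenvalues.

    Hypothetical helper: assumes each eigenvalue measures the variance
    along one principal component.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("eigenvalues sum to zero; the fraction is undefined")
    return sum(ordered[:num_kept]) / total


# Illustrative eigenvalues: keep the two largest of three components.
fraction = retained_variance_fraction([6.0, 3.0, 1.0], num_kept=2)
print(fraction)  # 0.9
```

This is consistent with the constructor receiving all of the eigenvalues rather than only those of the selected components, since the denominator needs the full list.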
pml.unsupervised.pca.get_pct_variance_per_principal_component(dataset)

Determines the percentage of variance captured by each principal component in the data set.

Args:
dataset: model.DataSet
The data set whose principal components will be examined. Should not already be reduced.
Returns:
variances: pandas.Series
The percentage of variance (as a float between 0.0 and 1.0) for each principal component.
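
A minimal NumPy sketch of the underlying calculation follows, assuming the per-component variance is each eigenvalue of the covariance matrix divided by the sum of all eigenvalues. It works on a plain array rather than a model.DataSet, and the function name is hypothetical; pml's own implementation may differ.

```python
import numpy as np


def pct_variance_per_component(data):
    """Return the fraction of total variance along each principal component.

    Hypothetical helper operating on a 2D array (rows are observations,
    columns are features); it does not modify its input.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # atleast_2d keeps the single-feature case working.
    covariance = np.atleast_2d(np.cov(centered, rowvar=False))
    # eigvalsh returns ascending eigenvalues; reverse for largest first.
    eigenvalues = np.linalg.eigvalsh(covariance)[::-1]
    # Clip tiny negative values caused by floating point rounding.
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    return eigenvalues / eigenvalues.sum()


sample = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]])
fractions = pct_variance_per_component(sample)
print(fractions)  # approximately [0.8, 0.2]
```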
pml.unsupervised.pca.pca(dataset, num_components)

Performs Principal Component Analysis (PCA) on a dataset.

Args:
dataset: model.DataSet
The dataset to be analysed.
num_components: int
The number of principal components to select.
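
The general technique can be sketched with NumPy as below: centre the columns, eigendecompose the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. This is a generic covariance-based PCA on a plain array, not pml's implementation; the function name and return values are assumptions made for illustration.

```python
import numpy as np


def pca_sketch(data, num_components):
    """Project data onto its top num_components principal components.

    Hypothetical helper. Returns the reduced array and all eigenvalues
    (largest first), mirroring the idea that every eigenvalue is kept
    so the retained variance can be computed later.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    covariance = np.atleast_2d(np.cov(centered, rowvar=False))
    # eigh suits symmetric matrices; it returns ascending eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order[:num_components]]
    reduced = centered @ components
    return reduced, eigenvalues


# Points on the line y = x: one component captures all of the variance.
points = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
reduced, eigenvalues = pca_sketch(points, num_components=1)
print(reduced.shape)  # (4, 1)
```

The sign of each projected column depends on the eigenvector returned by the solver, so only magnitudes and variances should be compared across runs or libraries.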
pml.unsupervised.pca.plot_pct_variance_per_principal_component(dataset, plot_type='bar')

Generates a plot to visualize the percentage of variance captured by each principal component in the data set.

Args:
dataset: model.DataSet
The data set whose principal components will be examined. Should not already be reduced.
plot_type: string
The plot type to generate. Supported plot types are:
‘bar’: vertical bar chart
‘barh’: horizontal bar chart
‘line’: line chart
Default is ‘bar’.
Returns:
None, but produces a matplotlib plot.
Raises:
UnsupportedPlotTypeError if plot_type is not recognized.
pml.unsupervised.pca.recommend_num_components(dataset, min_pct_variance=0.9)

Recommends the number of principal components that should be selected in order to keep a minimum specified percentage of the original data’s variance while also minimizing dimensionality.

Args:
dataset: model.DataSet
The dataset in question.
min_pct_variance: float
The minimum percent of variance which should be maintained when selecting the recommended number of principal components. Should be between 0.0 and 1.0. Defaults to 0.9 (i.e. 90%).
Returns:
The integer number of principal components which should be selected for Principal Component Analysis.
Raises:
ValueError if min_pct_variance is < 0 or > 1.
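
One plausible way to make such a recommendation, shown here only as a sketch, is to accumulate the per-component variance fractions from largest to smallest and stop once the threshold is reached. The function below is hypothetical: it takes the fractions directly instead of a model.DataSet, and the small rounding tolerance is an assumption of this sketch rather than documented pml behaviour.

```python
def recommend_from_fractions(variance_fractions, min_pct_variance=0.9):
    """Smallest number of components whose variance fractions reach the threshold.

    Hypothetical helper; variance_fractions should sum to roughly 1.0.
    """
    if min_pct_variance < 0 or min_pct_variance > 1:
        raise ValueError("min_pct_variance must be between 0.0 and 1.0")
    ordered = sorted(variance_fractions, reverse=True)
    cumulative = 0.0
    for count, fraction in enumerate(ordered, start=1):
        cumulative += fraction
        # Small tolerance so rounding error does not force an extra component.
        if cumulative >= min_pct_variance - 1e-12:
            return count
    return len(ordered)


recommended = recommend_from_fractions([0.6, 0.25, 0.1, 0.05], min_pct_variance=0.8)
print(recommended)  # 2
```

With the default threshold of 0.9, the same fractions would need three components, since the first two cover only 85% of the variance.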
pml.unsupervised.pca.remove_means(dataset)

Remove the column mean from each value in the dataset.

For example, if a certain column has values [1, 2, 3], the column mean is 2. When the column means are removed, that column will then have the values [-1, 0, 1].

NOTE: the modifications are made in place in dataset.

Args:
dataset: model.DataSet
The dataset to remove the column means from.
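
The in-place behaviour can be illustrated with a small pure-Python sketch that operates on a list of rows instead of a model.DataSet. The function name is hypothetical and this is not pml's code.

```python
def remove_column_means(rows):
    """Subtract each column's mean from its values, modifying rows in place.

    Hypothetical helper: rows is a list of equal-length lists of numbers.
    """
    if not rows:
        return
    num_rows = len(rows)
    num_cols = len(rows[0])
    for col in range(num_cols):
        mean = sum(row[col] for row in rows) / num_rows
        for row in rows:
            row[col] -= mean


table = [[1, 10.0], [2, 20.0], [3, 60.0]]
remove_column_means(table)
print(table)  # [[-1.0, -20.0], [0.0, -10.0], [1.0, 30.0]]
```

Because the rows are changed in place, a caller that still needs the original values should copy the data before calling the function.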
