pca Module

Implements principal component analysis (PCA) and related operations.

@author: drusk

class pml.unsupervised.pca.ReducedDataSet(data, sample_ids, labels, eigenvalues)

A DataSet which has had dimensionality reduction performed on it.

Columns are interpreted as features in the data set, and rows are observations.

This dimensionally reduced data set has all of the observations of the original, but its features have been adjusted to be linear combinations of the originals.

Those features with little variance may have been dropped during the dimensionality reduction process. Use the percent_variance() method to find out how much of the original variance has been retained in the reduced features.

__init__(data, sample_ids, labels, eigenvalues)

Creates a new ReducedDataSet.

Args:

data: numpy.array: The raw array with the new data.
sample_ids: list: The ids for the samples (rows, observations) in the data set.
labels: pandas.Series: The labels, if any, provided for the observations.
eigenvalues: numpy.array (1D): The list of eigenvalues produced to determine which components in the new feature space were most important. This includes all of the eigenvalues, not just the ones for the components selected.

percent_variance()

Calculates the percentage of the original DataSet’s variance which is still present in this dimensionally reduced DataSet.

Returns: A floating point number between 0.0 and 1.0 representing the percentage as a fraction (e.g. 0.9 means 90%).
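
As a rough illustration of the calculation described above, the retained share can be derived from the full list of eigenvalues. This is a sketch, not the library's implementation: it assumes the retained components are those with the largest eigenvalues, and `retained_variance_fraction` is a hypothetical helper name that does not appear in pml.

```python
def retained_variance_fraction(eigenvalues, num_kept):
    """Share of total variance carried by the num_kept largest eigenvalues.

    Hypothetical helper for illustration; not part of pml.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("eigenvalues sum to zero; the share is undefined")
    # Assumption: the kept components are the ones with the largest eigenvalues.
    return sum(ordered[:num_kept]) / total


# Four eigenvalues in total, two components kept.
fraction = retained_variance_fraction([4.0, 3.0, 2.0, 1.0], num_kept=2)
print(fraction)
```

This is presumably why the constructor takes all of the eigenvalues rather than only the selected ones: the denominator needs the total variance of the original data.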

pml.unsupervised.pca.get_pct_variance_per_principal_component(dataset)

Determines the percentage of variance captured by each principal component in the data set.

Args:

dataset: model.DataSet: The data set whose principal components will be examined. Should not already be reduced.

Returns:

variances: pandas.Series: The percentage of variance (as a float between 0.0 and 1.0) for each principal component.
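
One common way to obtain such per-component shares is to take the eigenvalues of the covariance matrix and divide each by their sum. The sketch below assumes that approach and works on a plain NumPy array rather than a model.DataSet; `pct_variance_per_component` is a hypothetical name, and it returns a NumPy array instead of the pandas.Series documented above.

```python
import numpy as np


def pct_variance_per_component(data):
    """Return each principal component's share of total variance, largest first.

    Hypothetical helper for illustration; not part of pml.
    """
    data = np.asarray(data, dtype=float)
    # rowvar=False: columns are features, rows are observations.
    covariance = np.cov(data, rowvar=False)
    # eigvalsh handles symmetric matrices and returns eigenvalues in
    # ascending order, so reverse them to put the largest first.
    eigenvalues = np.linalg.eigvalsh(covariance)[::-1]
    return eigenvalues / eigenvalues.sum()


# Two uncorrelated features; the first has four times the variance of the second.
sample = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]])
shares = pct_variance_per_component(sample)
print(shares)
```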

pml.unsupervised.pca.pca(dataset, num_components)

Performs Principal Component Analysis (PCA) on a dataset.

Args:

dataset: model.DataSet: The dataset to be analysed.
num_components: int: The number of principal components to select.
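
For readers unfamiliar with the technique, a minimal PCA can be sketched with NumPy as below. This is the textbook covariance-eigendecomposition approach, which may differ from what pml does internally. The source does not state what pca() returns, so the tuple returned here is purely an assumption, and `pca_sketch` is a hypothetical name.

```python
import numpy as np


def pca_sketch(data, num_components):
    """Project data onto its num_components leading principal components.

    Hypothetical helper for illustration; not part of pml.
    Returns (reduced_data, all_eigenvalues_sorted_descending).
    """
    data = np.asarray(data, dtype=float)
    # Centre each column (feature) on zero.
    centered = data - data.mean(axis=0)
    # Covariance between features; rows are observations.
    covariance = np.cov(centered, rowvar=False)
    # eigh is intended for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Keep only the leading components and project the centred data onto them.
    reduced = centered @ eigenvectors[:, :num_components]
    return reduced, eigenvalues


points = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
])
reduced, eigenvalues = pca_sketch(points, num_components=2)
print(reduced.shape)
```

The reduced array keeps every observation (row) but only num_components columns, each a linear combination of the original features.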

pml.unsupervised.pca.plot_pct_variance_per_principal_component(dataset, plot_type='bar')

Generates a plot to visualize the percentage of variance captured by each principal component in the data set.

Args:

dataset: model.DataSet

The data set whose principal components will be examined. Should not already be reduced.

plot_type: string

The plot type to generate. Supported plot types are:

'bar': vertical bar chart
'barh': horizontal bar chart
'line': line chart

Default is 'bar'.

Returns:

None, but produces a matplotlib plot as a side effect.

Raises:

UnsupportedPlotTypeError if plot_type is not recognized.

pml.unsupervised.pca.recommend_num_components(dataset, min_pct_variance=0.9)

Recommends the number of principal components that should be selected in order to keep a minimum specified percentage of the original data’s variance while also minimizing dimensionality.

Args:

dataset: model.DataSet: The dataset in question.
min_pct_variance: float: The minimum percent of variance which should be maintained when selecting the recommended number of principal components. Should be between 0.0 and 1.0. Defaults to 0.9 (i.e. 90%).

Returns:

The integer number of principal components which should be selected for Principal Component Analysis.

Raises:

ValueError if min_pct_variance is < 0 or > 1.
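
The selection rule described above can be sketched as a cumulative sum over the per-component variance shares. The version below is an assumption about how such a recommendation might be computed, not pml's code. It takes the shares directly instead of a model.DataSet, and `recommend_num_components_sketch` is a hypothetical name.

```python
def recommend_num_components_sketch(variance_fractions, min_pct_variance=0.9):
    """Smallest number of components whose cumulative share meets the threshold.

    Hypothetical helper for illustration; not part of pml.
    """
    if min_pct_variance < 0 or min_pct_variance > 1:
        raise ValueError("min_pct_variance must be between 0.0 and 1.0")
    # Take components in order of decreasing variance share.
    ordered = sorted(variance_fractions, reverse=True)
    cumulative = 0.0
    for count, fraction in enumerate(ordered, start=1):
        cumulative += fraction
        if cumulative >= min_pct_variance:
            return count
    # Rounding error can leave the total marginally short of the threshold;
    # in that case fall back to keeping every component.
    return len(ordered)


# Shares of 60%, 25%, 10% and 5%: two components reach 85%, three reach 95%.
recommended = recommend_num_components_sketch([0.6, 0.25, 0.1, 0.05])
print(recommended)
```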

pml.unsupervised.pca.remove_means(dataset)

Remove the column mean from each value in the dataset.

For example, if a certain column has values [1, 2, 3], the column mean is 2. When the column means are removed, that column will then have the values [-1, 0, 1].

NOTE: the modifications are made in place in dataset.

Args:

dataset: model.DataSet: The dataset to remove the column means from.
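
The in-place behaviour noted above can be illustrated on a plain NumPy array. This sketch does not use model.DataSet, and `remove_means_in_place` is a hypothetical name. It assumes a floating point array, since subtracting fractional means in place from an integer array would fail.

```python
import numpy as np


def remove_means_in_place(values):
    """Subtract each column's mean from that column, modifying values directly.

    Hypothetical helper for illustration; not part of pml.
    """
    # The augmented assignment writes into the existing array instead of
    # creating a new one, so the caller's array changes.
    values -= values.mean(axis=0)


table = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
remove_means_in_place(table)
print(table)
```

Because the function returns nothing and mutates its argument, callers who still need the original values should copy the data first.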
