model Module

Models for the data being analysed and manipulated.

@author: drusk

class pml.data.model.DataSet(data, labels=None)[source]

A collection of data that may be analysed and manipulated.

Columns are interpreted as features in the data set, and rows are samples or observations.

__init__(data, labels=None)[source]

Creates a new DataSet from data of an unknown type. If data is itself a DataSet object, then its contents are copied and a new DataSet is created from the copies.

Args:
data:
Data of unknown type. The supported types are:
  1. pandas DataFrame
  2. Python lists
  3. an existing DataSet object
labels: pandas Series, Python list or Python dictionary
The classification labels for the samples in data. If they are not known (i.e. it is an unlabelled data set) the value None should be used. Default value is None (unlabelled).
Raises:

ValueError if the data or labels are not of a supported type.

InconsistentSampleIdError if labels were provided whose sample ids do not match those of the data.

combine_labels(to_combine, new_label)[source]

Combines classification labels to have some new value.

For example, consider a dataset with labels “cat”, “crow” and “pigeon”. If you only care whether something is a cat or a bird, you can combine the “crow” and “pigeon” labels into a new one called “bird”.

Args:
to_combine: list
The list of labels which will be combined to form one new classification label.
new_label: string
The new classification label for those which were combined.
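
The relabelling described above can be sketched in plain Python. This is not pml's implementation: the function below is a hypothetical stand-in that works on a dictionary mapping sample ids to labels, and it returns a new mapping because the documentation does not say whether DataSet.combine_labels modifies the data set in place.

```python
def combine_labels(labels, to_combine, new_label):
    """Return a copy of `labels` with every label in `to_combine` replaced.

    `labels` maps sample id -> label. This is an illustrative stand-in,
    not the pml method of the same name.
    """
    targets = set(to_combine)
    return {
        sample_id: (new_label if label in targets else label)
        for sample_id, label in labels.items()
    }


labels = {"s1": "cat", "s2": "crow", "s3": "pigeon", "s4": "cat"}
combined = combine_labels(labels, ["crow", "pigeon"], "bird")
print(combined)
```

Labels that are not listed in to_combine ("cat" here) are left unchanged.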
copy()[source]

Creates a copy of this dataset. Changes made to one dataset will not affect the other.

Returns:
A new DataSet with the current data and labels.
drop_column(index)[source]

Creates a copy of the data set with a specified column removed.

Args:
index:
The index (0 based) of the column to drop.
Returns:
A new DataSet with the specified column removed. The original DataSet remains unaltered.
feature_list()[source]

Returns: The list of features in the dataset.

fill_missing(fill_value)[source]

Fill in missing data with a constant value. Changes are made in-place.

Args:
fill_value:
The value to insert wherever data is missing.
Returns:
Void. The changes to the DataSet are made in-place.
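
To illustrate what "in-place" means here, the following plain-Python sketch fills NaN entries in a list of rows. It assumes missing values are represented as float NaN (as has_missing_values() suggests) and uses a hypothetical helper rather than the DataSet method itself.

```python
import math


def fill_missing(rows, fill_value):
    """Replace NaN entries in `rows` with `fill_value`, modifying `rows` itself.

    Illustrative stand-in only; returns None, like an in-place operation.
    """
    for row in rows:
        for i, value in enumerate(row):
            if isinstance(value, float) and math.isnan(value):
                row[i] = fill_value


rows = [[1.0, float("nan")], [float("nan"), 4.0]]
result = fill_missing(rows, 0.0)
print(rows)    # the original list has been changed
print(result)  # None
```

Because nothing is returned, the caller keeps using the original object after the call.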
get_column(index)[source]

Selects a column from the data set.

Args:
index:
The column index. If the columns are named, this is the column name. Otherwise it is the 0-based index.
Returns:
The column at the specified index as a pandas Series object. This Series is a view on the original data set, not a copy, so any changes to it will also be applied to the original data set.
get_data_frame()[source]

Retrieve the DataSet’s underlying data as a pandas DataFrame object.

See also get_labelled_data_frame().

Returns:
A pandas DataFrame with the DataSet’s main data, but no labels.
get_feature_value_counts(feature)[source]

Count the number of occurrences of each value of a given feature in the data set.

Args:
feature: string
The feature whose values will be counted.
Returns:
value_counts: pandas.Series
A Series containing the counts of each value of the feature. It is indexable by value. The index is ordered from highest to lowest count.
get_feature_values(feature)[source]

Retrieves the set of values for a given feature.

Args:
feature: string
The feature whose unique values will be retrieved.
Returns:
value_set: set
The set of unique values for a feature.
get_label_value_counts()[source]

Count the number of occurrences of each label.

NOTE: If the data set is unlabelled an empty set of results will be returned.

Returns:
value_counts: pandas.Series
A Series containing the counts of each label. It is indexable by label. The index is ordered from highest to lowest count.
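
The counting and ordering described above can be sketched with the standard library. This is only an analogy for the documented behaviour: the real method returns a pandas Series, whereas this hypothetical helper returns a list of (label, count) pairs, and treating an unlabelled data set as None is an assumption.

```python
from collections import Counter


def label_value_counts(labels):
    """Count label occurrences, ordered from highest to lowest count.

    Returns an empty list for an unlabelled data set (labels is None).
    Illustrative stand-in, not the pml method.
    """
    if labels is None:
        return []
    return Counter(labels).most_common()


counts = label_value_counts(["cat", "bird", "cat", "dog", "cat", "bird"])
print(counts)
```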
get_labelled_data_frame()[source]

Retrieve the DataSet’s underlying data as a pandas DataFrame object, including any labels.

See also get_data_frame().

Returns:
A pandas DataFrame with the DataSet’s main data and, if labels are present, the labels attached as the rightmost column.
get_labels(indices=None)[source]

Selects classification labels for the specified samples (rows) in the DataSet.

Args:
indices: list
The list of row indices (0 based) which should be selected. Defaults to None, in which case all labels are selected.
Returns:
A pandas Series with the classification labels.
get_row(identifier)[source]

Selects a single row from the dataset.

Args:
identifier:
The id of the row to select. If the DataSet has special indices set up (e.g. through a call to load with has_ids=True), these can be used. The integer index (0 based) can also be used.
Returns:
A pandas Series object representing the desired row. NOTE: this is a view on the original dataset. Changes made to this Series will also be made to the DataSet.
get_rows(indices)[source]

Selects specified rows from the dataset.

Args:
indices: list
The list of row indices (0 based) which should be selected.
Returns:
A new DataSet with the specified rows from the original.
get_sample_ids()[source]

Returns: A Python list of the ids of the samples in the dataset.

has_missing_values()[source]

Returns: True if the dataset is missing values. These will be represented as np.NaN.

is_labelled()[source]

Returns: True if the dataset has classification labels for each sample, False otherwise.

label_filter(label)[source]

Filters the data set based on its labels.

Args:
label:
Samples with this label value will remain in the filtered data set. All others will be removed.
Returns:
filtered: model.DataSet
The filtered data set.
Raises:
UnlabelledDataSetError if the data set is not labelled.
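
A rough sketch of the filtering and error behaviour described above, in plain Python. The exception class is a stand-in that only shares its name with pml's, and the dictionary-based representation of samples and labels is an assumption made for the example.

```python
class UnlabelledDataSetError(Exception):
    """Stand-in for the pml exception of the same name."""


def label_filter(samples, labels, label):
    """Keep only the samples whose label equals `label`.

    `samples` maps sample id -> row, `labels` maps sample id -> label
    (or is None for an unlabelled data set). Illustrative only.
    """
    if labels is None:
        raise UnlabelledDataSetError("cannot filter an unlabelled data set by label")
    kept_samples = {sid: row for sid, row in samples.items() if labels[sid] == label}
    kept_labels = {sid: labels[sid] for sid in kept_samples}
    return kept_samples, kept_labels


samples = {"s1": [1.0, 2.0], "s2": [3.0, 4.0], "s3": [5.0, 6.0]}
labels = {"s1": "cat", "s2": "bird", "s3": "cat"}
cats, cat_labels = label_filter(samples, labels, "cat")
print(cats)
```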
num_features()[source]

Returns: The number of features (columns) in the data set.

num_samples()[source]

Returns: The number of samples (rows) in the data set.

plot_radviz()[source]

Generates a RadViz plot of the data set. RadViz is useful for visualizing data with more than two dimensions.

Returns:
void, but a plot is generated.
reduce_features(function)[source]

Performs a feature-wise (i.e. column-wise) reduction of the data set.

Args:
function:
The function which will be applied to each feature in the data set.
Returns:
A pandas Series object which is the one dimensional result of the reduction (one value corresponding to each feature).
reduce_rows(function)[source]

Performs a row-wise reduction of the data set.

Args:
function:
The function which will be applied to each row in the data set.
Returns:
A pandas Series object which is the one dimensional result of the reduction (one value corresponding to each row).
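
The difference between a row-wise and a feature-wise reduction can be shown with a small plain-Python sketch. The two helpers below are hypothetical stand-ins that operate on a list of rows and return lists; the DataSet methods return pandas Series instead.

```python
def reduce_rows(rows, function):
    """Apply `function` to each row, giving one value per row."""
    return [function(row) for row in rows]


def reduce_features(rows, function):
    """Apply `function` to each column, giving one value per feature."""
    return [function(list(column)) for column in zip(*rows)]


rows = [[1, 2, 3],
        [4, 5, 6]]
row_sums = reduce_rows(rows, sum)            # one value for each of the 2 rows
feature_maxima = reduce_features(rows, max)  # one value for each of the 3 features
print(row_sums, feature_maxima)
```

The length of the result tells you which axis was reduced: two rows give two row-wise values, three features give three feature-wise values.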
sample_filter(samples_to_keep)[source]

Filters the data set based on its sample ids.

Args:
samples_to_keep:
The sample ids of the samples which should be kept. All others will be removed.
Returns:
filtered: model.DataSet
The filtered data set.
set_column(index, new_column)[source]

Set the new values for a column. Can be used to create a new column.

Args:
index:
The column index. If the columns are named, this is the column name. Otherwise it is the 0-based index.
new_column: pandas.Series or compatible object
The new column data to be placed at the specified index.
split(percent, random=False)[source]

Splits the dataset in two.

Args:
percent: float
The fraction of the original dataset's samples which should be placed in the first dataset returned. The remainder are placed in the second dataset. It must be specified as a value between 0 and 1 inclusive (for example, 0.7 for 70%).
random: boolean
Set to True if the samples selected for each new dataset should be picked randomly. Defaults to False, meaning the samples are taken in their existing order.
Returns:
dataset1: DataSet object
A subset of the original dataset with <percent> samples.
dataset2: DataSet object
A subset of the original dataset with 1-<percent> samples.
Raises:
ValueError if percent < 0 or percent > 1.
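
A minimal sketch of the splitting rule, operating on a list of sample ids rather than a DataSet. The rounding of the cut point and the seed parameter are assumptions made for this example; the documentation does not say how pml handles fractional sample counts or random seeding.

```python
import random as random_module


def split_ids(sample_ids, percent, random=False, seed=None):
    """Split `sample_ids` into two lists, the first holding `percent` of them.

    `seed` is a hypothetical parameter added so the random case is repeatable.
    """
    if percent < 0 or percent > 1:
        raise ValueError("percent must be between 0 and 1 inclusive")
    ids = list(sample_ids)
    if random:
        random_module.Random(seed).shuffle(ids)
    cut = int(round(percent * len(ids)))  # assumed rounding rule
    return ids[:cut], ids[cut:]


first, second = split_ids(range(8), 0.75)
print(first)
print(second)
```

With random=False the existing order is preserved, so the first list is simply the leading portion of the samples.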
value_filter(feature, value)[source]

Filters the data set based on its values for a given feature.

Args:
feature: string
The name of the feature whose value will be examined for each sample.
value:
The value which all samples passing through the filter should have for the specified feature.
Returns:
filtered: model.DataSet
The filtered data set.
pml.data.model.as_dataset(data)[source]

Creates a DataSet from the provided data. If data is already a DataSet, return it directly. Use this instead of the DataSet constructor if you don’t know whether your data is a DataSet already, but you don’t want to create a new one if it already is.

Args:
data:
Data of unknown type. It may be a Python list or pandas DataFrame or DataSet object.
Returns:
A DataSet object. If the data was already a DataSet then the input object will be directly returned.
Raises:
ValueError if the data is not of a supported type.
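
The contrast between as_dataset and the DataSet constructor can be sketched with a minimal stand-in class. The class below is not pml's DataSet: it only stores rows as lists, and it exists to show that the constructor copies an existing DataSet while as_dataset returns the same object.

```python
class DataSet:
    """Minimal stand-in for pml.data.model.DataSet, holding rows as lists."""

    def __init__(self, data):
        if isinstance(data, DataSet):
            data = data.rows  # copy the contents of an existing DataSet
        elif not isinstance(data, list):
            raise ValueError("unsupported data type: %s" % type(data).__name__)
        self.rows = [list(row) for row in data]


def as_dataset(data):
    """Return `data` unchanged if it is already a DataSet, else wrap it."""
    if isinstance(data, DataSet):
        return data
    return DataSet(data)


existing = DataSet([[1, 2], [3, 4]])
same = as_dataset(existing)    # the very same object
copied = DataSet(existing)     # a new object with copied contents
wrapped = as_dataset([[5, 6]]) # a new DataSet built from a list
print(same is existing, copied is existing)
```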
