model Module
Models for the data being analysed and manipulated.
@author: drusk
-
class pml.data.model.DataSet(data, labels=None)[source]
A collection of data that may be analysed and manipulated.
Columns are interpreted as features in the data set, and rows are samples
or observations.
-
__init__(data, labels=None)[source]
Creates a new DataSet from data of an unknown type. If data is itself
a DataSet object, then its contents are copied and a new DataSet is
created from the copies.
- Args:
- data:
- Data of unknown type. The supported types are:
- pandas DataFrame
- Python lists
- an existing DataSet object
- labels: pandas Series, Python list or Python dictionary
- The classification labels for the samples in data. If they are
not known (i.e. it is an unlabelled data set) the value None
should be used. Default value is None (unlabelled).
- Raises:
ValueError if the data or labels are not of a supported type.
InconsistentSampleIdError if labels were provided whose sample ids
do not match those of the data.
-
combine_labels(to_combine, new_label)[source]
Combines classification labels to have some new value.
For example, consider a dataset with labels “cat”, “crow” and
“pigeon”. If you only care whether something is a cat or a bird,
you can combine the “crow” and “pigeon” labels into a new one
called “bird”.
- Args:
- to_combine: list
- The list of labels which will be combined to form one new
classification label.
- new_label: string
- The new classification label for those which were combined.
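The relabelling described for combine_labels can be sketched in plain Python. This is only an illustration of the documented behaviour, not the pml implementation: the free function below and its `labels` argument are hypothetical, and returning a new list (rather than modifying labels in place) is an assumption, since the documentation does not say which the real method does.

```python
def combine_labels(labels, to_combine, new_label):
    """Return a copy of labels with every label in to_combine replaced by new_label.

    Hypothetical stand-in for DataSet.combine_labels; works on a plain list.
    """
    targets = set(to_combine)
    return [new_label if label in targets else label for label in labels]


labels = ["cat", "crow", "pigeon", "cat"]
combined = combine_labels(labels, ["crow", "pigeon"], "bird")
print(combined)  # ['cat', 'bird', 'bird', 'cat']
```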
-
copy()[source]
Creates a copy of this dataset. Changes made to one dataset will not
affect the other.
- Returns:
- A new DataSet with the current data and labels.
-
drop_column(index)[source]
Creates a copy of the data set with a specified column removed.
- Args:
- index:
- The index (0 based) of the column to drop.
- Returns:
- A new DataSet with the specified column removed. The original
DataSet remains unaltered.
-
feature_list()[source]
Returns:
The list of features in the dataset.
-
fill_missing(fill_value)[source]
Fill in missing data with a constant value. Changes are made in-place.
- Args:
- fill_value:
- The value to insert wherever data is missing.
- Returns:
- None. The changes to the DataSet are made in-place.
-
get_column(index)[source]
Selects a column from the data set.
- Args:
- index:
- The column index. If the columns are named, this is the column
name. Otherwise it is the 0-based index.
- Returns:
- The column at the specified index as a pandas Series object. This
series is a view on the original data set, not a copy, so any
changes to it will also be applied to the original data set.
-
get_data_frame()[source]
Retrieve the DataSet’s underlying data as a pandas DataFrame object.
See also get_labelled_data_frame().
- Returns:
- A pandas DataFrame with the DataSet’s main data, but no labels.
-
get_feature_value_counts(feature)[source]
Count the number of occurrences of each value of a given feature in
the data set.
- Args:
- feature: string
- The feature whose values will be counted.
- Returns:
- value_counts: pandas.Series
- A Series containing the counts of each value of the feature. It is
indexable by value. The index is ordered from highest to lowest count.
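The "highest to lowest count" ordering can be illustrated with the standard library. This sketch uses `collections.Counter` on a plain list rather than a pandas Series, so it shows the shape of the result only; the `value_counts` helper and the sample data are hypothetical.

```python
from collections import Counter


def value_counts(values):
    """Return (value, count) pairs ordered from highest to lowest count."""
    return Counter(values).most_common()


colours = ["red", "blue", "red", "green", "red", "blue"]
counts = value_counts(colours)
print(counts)  # [('red', 3), ('blue', 2), ('green', 1)]
```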
-
get_feature_values(feature)[source]
Retrieves the set of values for a given feature.
- Args:
- feature: string
- The feature whose unique values will be retrieved.
- Returns:
- value_set: set
- The set of unique values for a feature.
-
get_label_value_counts()[source]
Count the number of occurrences of each label.
NOTE: If the data set is unlabelled an empty set of results will be
returned.
- Returns:
- value_counts: pandas.Series
- A Series containing the counts of each label. It is indexable by
label. The index is ordered from highest to lowest count.
-
get_labelled_data_frame()[source]
Retrieve the DataSet’s underlying data as a pandas DataFrame object,
including any labels.
See also get_data_frame().
- Returns:
- A pandas DataFrame with the DataSet’s main data and the labels if
they are present attached as the rightmost column.
-
get_labels(indices=None)[source]
Selects classification labels for the specified samples (rows) in the
DataSet.
- Args:
- indices: list
- The list of row indices (0 based) which should be selected.
Defaults to None, in which case all labels are selected.
- Returns:
- A pandas Series with the classification labels.
-
get_row(identifier)[source]
Selects a single row from the dataset.
- Args:
- identifier:
- The id of the row to select. If the DataSet has special indices
set up (e.g. through a call to load with has_ids=True) these can
be used. The integer index (0 based) can also be used.
- Returns:
- A pandas Series object representing the desired row. NOTE: this is
a view on the original dataset. Changes made to this Series will
also be made to the DataSet.
-
get_rows(indices)[source]
Selects specified rows from the dataset.
- Args:
- indices: list
- The list of row indices (0 based) which should be selected.
- Returns:
- A new DataSet with the specified rows from the original.
-
get_sample_ids()[source]
Returns:
A Python list of the ids of the samples in the dataset.
-
has_missing_values()[source]
Returns:
True if the dataset is missing values. These will be represented
as np.NaN.
-
is_labelled()[source]
Returns:
True if the dataset has classification labels for each sample,
False otherwise.
-
label_filter(label)[source]
Filters the data set based on its labels.
- Args:
- label:
- Samples with this label value will remain in the filtered data
set. All others will be removed.
- Returns:
- filtered: model.DataSet
- The filtered data set.
- Raises:
- UnlabelledDataSetError if the data set is not labelled.
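A minimal sketch of the label_filter contract, assuming samples and labels are held in parallel Python lists. The exception class here is a hypothetical stand-in defined only so the example runs on its own, and the function is not the pml implementation.

```python
class UnlabelledDataSetError(Exception):
    """Hypothetical stand-in for the pml exception of the same name."""


def label_filter(samples, labels, label):
    """Keep only the samples whose label equals `label`."""
    if labels is None:
        raise UnlabelledDataSetError("cannot filter an unlabelled data set by label")
    kept = [(s, l) for s, l in zip(samples, labels) if l == label]
    return [s for s, _ in kept], [l for _, l in kept]


samples = [[5.1, 3.5], [7.0, 3.2], [4.9, 3.0]]
labels = ["setosa", "versicolor", "setosa"]
kept_samples, kept_labels = label_filter(samples, labels, "setosa")
print(kept_samples)  # [[5.1, 3.5], [4.9, 3.0]]
```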
-
num_features()[source]
Returns:
The number of features (columns) in the data set.
-
num_samples()[source]
Returns:
The number of samples (rows) in the data set.
-
plot_radviz()[source]
Generates a RadViz plot of the data set. Radviz is useful for
visualizing data with more than two dimensions.
- Returns:
- None, but a plot is generated.
-
reduce_features(function)[source]
Performs a feature-wise (i.e. column-wise) reduction of the data set.
- Args:
- function:
- The function which will be applied to each feature in the data set.
- Returns:
- A pandas Series object which is the one dimensional result of the
reduction (one value corresponding to each feature).
-
reduce_rows(function)[source]
Performs a row-wise reduction of the data set.
- Args:
- function:
- The function which will be applied to each row in the data set.
- Returns:
- A pandas Series object which is the one dimensional result of the
reduction (one value corresponding to each row).
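The difference between reduce_features and reduce_rows is only the axis along which the function is applied. The sketch below shows this on a list of rows using plain Python; the two free functions are hypothetical stand-ins and return lists rather than pandas Series.

```python
rows = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]


def reduce_rows(rows, function):
    """Apply function to each row: one result per sample."""
    return [function(row) for row in rows]


def reduce_features(rows, function):
    """Apply function to each column: one result per feature."""
    return [function(list(column)) for column in zip(*rows)]


row_sums = reduce_rows(rows, sum)          # [6.0, 15.0]
feature_sums = reduce_features(rows, sum)  # [5.0, 7.0, 9.0]
print(row_sums, feature_sums)
```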
-
sample_filter(samples_to_keep)[source]
Filters the data set based on its sample ids.
- Args:
- samples_to_keep:
- The sample ids of the samples which should be kept. All others
will be removed.
- Returns:
- filtered: model.DataSet
- The filtered data set.
-
set_column(index, new_column)[source]
Set the new values for a column. Can be used to create a new column.
- Args:
- index:
- The column index. If the columns are named, this is the column
name. Otherwise it is the 0-based index.
- new_column: pandas.Series or compatible object
- The new column data to be placed at the specified index.
-
split(percent, random=False)[source]
Splits the dataset in two.
- Args:
- percent: float
- The fraction of the original dataset samples which should be
placed in the first dataset returned. The remainder are placed
in the second dataset. It must be specified as a value between
0 and 1 inclusive (for example, 0.7 for 70%).
- random: boolean
- Set to True if the samples selected for each new dataset should
be picked randomly. Defaults to False, meaning the samples are
taken in their existing order.
- Returns:
- dataset1: DataSet object
- A subset of the original dataset containing the fraction
<percent> of the samples.
- dataset2: DataSet object
- A subset of the original dataset containing the remaining
fraction (1 - <percent>) of the samples.
- Raises:
- ValueError if percent < 0 or percent > 1.
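The splitting rule can be sketched on a plain list as follows. This is an illustration under stated assumptions, not the pml code: the `seed` parameter is hypothetical (added so the random case is repeatable), and rounding the cut point to the nearest whole sample is an assumption, since the documentation does not say how fractional sample counts are handled.

```python
import random as rnd


def split(samples, percent, random=False, seed=None):
    """Split samples in two; the first part holds the fraction `percent`."""
    if percent < 0 or percent > 1:
        raise ValueError("percent must be between 0 and 1 inclusive")
    indices = list(range(len(samples)))
    if random:
        # Shuffle the order in which samples are assigned to the two parts.
        rnd.Random(seed).shuffle(indices)
    # Assumed rounding rule: nearest whole number of samples.
    cut = int(round(percent * len(samples)))
    first = [samples[i] for i in indices[:cut]]
    second = [samples[i] for i in indices[cut:]]
    return first, second


first_part, second_part = split(list(range(10)), 0.7)
print(first_part)   # [0, 1, 2, 3, 4, 5, 6]
print(second_part)  # [7, 8, 9]
```

With random left at its default of False, the samples keep their existing order, which matches the documented default.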
-
value_filter(feature, value)[source]
Filters the data set based on its values for a given feature.
- Args:
- feature: string
- The name of the feature whose value will be examined for each
sample.
- value:
- The value which all samples passing through the filter should
have for the specified feature.
- Returns:
- filtered: model.DataSet
- The filtered data set.
-
pml.data.model.as_dataset(data)[source]
Creates a DataSet from the provided data. If data is already a DataSet,
it is returned directly. Use this instead of the DataSet constructor when
the input may already be a DataSet and you want to avoid creating a new
copy in that case.
- Args:
- data:
- Data of unknown type. It may be a Python list or pandas DataFrame or
DataSet object.
- Returns:
- A DataSet object. If the data was already a DataSet then the input
object will be directly returned.
- Raises:
- ValueError if the data is not of a supported type.
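The contrast between as_dataset (pass-through) and the DataSet constructor (copy) can be sketched with a small stand-in class. Everything here is hypothetical and simplified: the stand-in accepts only Python lists, stores them in a `rows` attribute invented for this example, and is not the real pml.data.model.DataSet.

```python
class DataSet:
    """Hypothetical stand-in for pml.data.model.DataSet (lists only)."""

    def __init__(self, data):
        if isinstance(data, DataSet):
            data = data.rows  # the constructor copies an existing DataSet
        elif not isinstance(data, list):
            raise ValueError("unsupported data type: %s" % type(data).__name__)
        self.rows = [list(row) for row in data]


def as_dataset(data):
    """Return data unchanged if it is already a DataSet, else wrap it."""
    if isinstance(data, DataSet):
        return data
    return DataSet(data)


original = DataSet([[1, 2], [3, 4]])
same = as_dataset(original)   # same object, no copy made
copied = DataSet(original)    # new object holding copied rows
print(same is original, copied is original)  # True False
```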