Splitter

Splitter class for the oblique decision tree classifier based on SVM nodes

class Splitter.Splitter(clf: Optional[SVC] = None, criterion: Optional[str] = None, feature_select: Optional[str] = None, criteria: Optional[str] = None, min_samples_split: Optional[int] = None, random_state=None, normalize=False)[source]

Bases: object

Splits a dataset in two based on different criteria

Parameters

clf : SVC, optional

classifier, by default None

criterion : str, optional

The function to measure the quality of a split (only used if max_features != num_features). Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain, by default “entropy”

feature_select : str, optional

The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: “best”: the sklearn SelectKBest algorithm is used in every node to choose the max_features best features; “random”: the algorithm generates 5 candidates and chooses the best of them (max. information gain); “trandom”: the algorithm generates only one random combination; “mutual”: chooses the best features w.r.t. their mutual information with the label; “cfs”: applies Correlation-based Feature Selection; “fcbf”: applies the Fast Correlation-Based Filter, by default None

criteria : str, optional

Decides (just in case of a multiclass classification) which column (class) to use to split the dataset in a node. max_samples is incompatible with the ‘ovo’ multiclass_strategy, by default None

min_samples_split : int, optional

The minimum number of samples required to split an internal node. 0 (default) for any, by default None

random_state : optional

Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls, by default None

normalize : bool, optional

If standardization of features should be applied on each node with the samples that reach it, by default False

Raises

ValueError

clf has to be a sklearn estimator

ValueError

criterion must be gini or entropy

ValueError

criteria has to be max_samples or impurity

ValueError

splitter must be in {random, best, mutual, cfs, fcbf}
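The ValueError conditions above can be sketched as a small validation helper (an illustrative sketch, not the library's actual code; the function name is hypothetical):

```python
def validate_splitter_params(criterion, feature_select, criteria):
    # Hypothetical helper mirroring the documented ValueError conditions
    if criterion not in ("gini", "entropy"):
        raise ValueError("criterion must be gini or entropy")
    if criteria not in ("max_samples", "impurity"):
        raise ValueError("criteria has to be max_samples or impurity")
    if feature_select not in ("random", "trandom", "best", "mutual", "cfs", "fcbf"):
        raise ValueError("splitter must be in {random, best, mutual, cfs, fcbf}")
```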

_distances(node: Snode, data: ndarray) → array[source]

Compute distances of the samples to the hyperplane of the node

Parameters

node : Snode

node containing the svm classifier

data : np.ndarray

samples to compute distance to hyperplane

Returns

np.array

array of shape (m, nc) with the distances of every sample to the hyperplane of every class. nc = # of classes
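Scikit-learn's SVC exposes signed distances to its separating hyperplane through decision_function; a minimal sketch on a linearly separable toy set (illustrative, not the method's code):

```python
import numpy as np
from sklearn.svm import SVC

# Toy binary problem: signed distance of each sample to the hyperplane
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = SVC(kernel="linear").fit(X, y)
distances = clf.decision_function(X)  # shape (n_samples,) in the binary case
```

Samples on the class-1 side get positive distances, class-0 side negative.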

static _entropy(y: array) → float[source]

Compute the entropy of a label set

Parameters

y : np.array

set of labels

Returns

float

entropy
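A minimal sketch of Shannon entropy over a label vector (illustrative, not the class's private implementation):

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    # H(y) = -sum_c p_c * log2(p_c) over the class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A balanced binary label set has entropy 1.0; a pure set has entropy 0.0.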

static _fs_best(dataset: array, labels: array, max_features: int) → tuple[source]

Return the variables with the highest F-score

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected
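The selection can be approximated with sklearn's SelectKBest; a sketch under that assumption (the score function f_classif is an assumption, not confirmed by this documentation):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def fs_best(dataset: np.ndarray, labels: np.ndarray, max_features: int) -> tuple:
    # Keep the max_features columns with the highest ANOVA F-score
    selector = SelectKBest(score_func=f_classif, k=max_features)
    selector.fit(dataset, labels)
    return tuple(sorted(selector.get_support(indices=True)))
```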

static _fs_cfs(dataset: array, labels: array, max_features: int) → tuple[source]

Correlation-based feature selection with max_features limit

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

static _fs_fcbf(dataset: array, labels: array, max_features: int) → tuple[source]

Fast Correlation-based Filter algorithm with max_features limit

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

static _fs_iwss(dataset: array, labels: array, max_features: int) → tuple[source]

Correlation-based feature selection based on IWSS with max_features limit

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

_fs_mutual(dataset: array, labels: array, max_features: int) → tuple[source]

Return the features with the highest mutual information with the labels

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

_fs_random(dataset: array, labels: array, max_features: int) → tuple[source]

Return the best of five random feature set combinations

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

static _fs_trandom(dataset: array, labels: array, max_features: int) → tuple[source]

Return a random feature set combination

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

static _generate_spaces(features: int, max_features: int) → list[source]

Generate at most 5 random feature combinations

Parameters

features : int

number of features in the dataset

max_features : int

number of features in each combination

Returns

list

list with up to 5 combinations of features randomly selected
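A sketch of drawing up to 5 distinct random feature subsets (illustrative; the real method also has to handle the case where fewer distinct combinations exist):

```python
import random

def generate_spaces(features: int, max_features: int) -> list:
    # Draw up to 5 distinct subsets of size max_features from range(features)
    rng = random.Random(0)  # seeded here only to keep the sketch deterministic
    subsets = set()
    for _ in range(50):  # bounded number of attempts
        subsets.add(tuple(sorted(rng.sample(range(features), max_features))))
        if len(subsets) == 5:
            break
    return [list(s) for s in subsets]
```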

_get_subspaces_set(dataset: array, labels: array, max_features: int) → tuple[source]

Compute the indices of the features selected by the splitter depending on the self._feature_select hyperparameter

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (<= number of features in dataset)

Returns

tuple

indices of the features selected

static _gini(y: array) → float[source]

Compute the Gini impurity of a label set

_impurity(data: array, y: array) → array[source]

Return the column of the dataset to be taken into account to split the dataset

Parameters

data : np.array

distances to the hyperplane of every class

y : np.array

vector of labels (classes)

Returns

np.array

column of the dataset to be taken into account to split the dataset

static _max_samples(data: array, y: array) → array[source]

Return the column of the dataset to be taken into account to split the dataset

Parameters

data : np.array

distances to the hyperplane of every class

y : np.array

vector of labels (classes)

Returns

np.array

column of the dataset to be taken into account to split the dataset
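One plausible reading of the “max_samples” criterion, sketched below: pick the class column whose hyperplane leaves the most samples on its positive side. This is an assumption for illustration, not necessarily the library's exact rule:

```python
import numpy as np

def max_samples(data: np.ndarray, y: np.ndarray) -> np.ndarray:
    # data: (n_samples, n_classes) distances to each class hyperplane.
    # Choose the column with the most positive-distance samples (assumed rule).
    selected = int(np.argmax((data > 0).sum(axis=0)))
    return data[:, selected]
```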

_select_best_set(dataset: array, labels: array, features_sets: list) → list[source]

Return the best set of features among features_sets; the criterion is the information gain

Parameters

dataset : np.array

array of samples (# samples, # features)

labels : np.array

array of labels

features_sets : list

list of feature sets to check

Returns

list

best feature set

get_subspace(dataset: array, labels: array, max_features: int) → tuple[source]

Return a subspace of the dataset of max_features length, depending on the feature_select hyperparameter

Parameters

dataset : np.array

array of samples (# samples, # features)

labels : np.array

labels of the dataset

max_features : int

number of features to form the subspace

Returns

tuple

tuple with the dataset restricted to the selected features and the indices of the features selected
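The return value pairs the reduced dataset with the chosen column indices; the slicing itself is plain NumPy. A sketch of the semantics (the function name is hypothetical, not the method's code):

```python
import numpy as np

def take_subspace(dataset: np.ndarray, indices: tuple) -> tuple:
    # Keep only the selected feature columns; return them with their indices
    return dataset[:, list(indices)], indices

X = np.arange(12.0).reshape(3, 4)
X_sub, idx = take_subspace(X, (0, 2))
```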

information_gain(labels: array, labels_up: array, labels_dn: array) → float[source]

Compute information gain of a split candidate

Parameters

labels : np.array

labels of the dataset

labels_up : np.array

labels of one side

labels_dn : np.array

labels on the other side

Returns

float

information gain
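Information gain is the entropy of the parent labels minus the size-weighted entropies of the two partitions; a self-contained illustrative sketch:

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, labels_up, labels_dn) -> float:
    # IG = H(labels) - |up|/n * H(up) - |dn|/n * H(dn)
    n = len(labels)
    return (
        entropy(labels)
        - len(labels_up) / n * entropy(labels_up)
        - len(labels_dn) / n * entropy(labels_dn)
    )
```

A perfect split of a balanced binary set yields a gain of 1.0.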

part(origin: array) → list[source]

Split an array in two based on the indices in self._up and their complement. partition has to be called first to establish the up indices

Parameters

origin : np.array

dataset to split

Returns

list

list with the two splits of the array
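The split itself is boolean-mask indexing on the "up" indices established beforehand; a sketch of the semantics (in the class the mask comes from a prior call to partition):

```python
import numpy as np

def part(origin: np.ndarray, up: np.ndarray) -> list:
    # up: boolean mask marking the samples that go to the 'up' side
    return [origin[up], origin[~up]]

data = np.array([10, 20, 30, 40])
up_mask = np.array([True, False, True, False])
side_up, side_dn = part(data, up_mask)
```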

partition(samples: array, node: Snode, train: bool)[source]

Set the criteria to split arrays. Compute the indices of the samples that should go to one side of the tree (up)

Parameters

samples : np.array

array of samples (# samples, # features)

node : Snode

Node of the tree where partition is going to be made

train : bool

Train time - True / Test time - False

partition_impurity(y: array) → array[source]