Splitter

Splitter class for the oblique decision tree classifier based on SVM nodes

class Splitter.Splitter(clf: Optional[SVC] = None, criterion: Optional[str] = None, feature_select: Optional[str] = None, criteria: Optional[str] = None, min_samples_split: Optional[int] = None, random_state=None, normalize=False)[source]

Bases: object

Splits a dataset in two based on different criteria

Parameters

clf : SVC, optional

classifier, by default None

criterion : str, optional

The function to measure the quality of a split (only used if max_features != num_features). Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain, by default “entropy”

feature_select : str, optional

The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: “best”: the sklearn SelectKBest algorithm is used in every node to choose the max_features best features; “random”: the algorithm generates 5 candidates and chooses the best of them (max. information gain); “trandom”: the algorithm generates only one random combination; “mutual”: chooses the best features w.r.t. their mutual information with the label; “cfs”: applies Correlation-based Feature Selection; “fcbf”: applies the Fast Correlation-Based Filter, by default None

criteria : str, optional

Decides (just in case of a multiclass classification) which column (class) to use to split the dataset in a node. max_samples is incompatible with the ‘ovo’ multiclass_strategy, by default None

min_samples_split : int, optional

The minimum number of samples required to split an internal node. 0 (default) for any, by default None

random_state : optional

Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls, by default None

normalize : bool, optional

If standardization of features should be applied on each node with the samples that reach it, by default False

Raises

ValueError

clf has to be a sklearn estimator

ValueError

criterion must be gini or entropy

ValueError

criteria has to be max_samples or impurity

ValueError

splitter must be in {random, best, mutual, cfs, fcbf}
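The ValueError conditions above can be sketched as a small validation helper (an illustrative sketch, not the library's actual code; the function name is hypothetical):

```python
def validate_splitter_params(criterion, feature_select, criteria):
    # Hypothetical helper mirroring the documented ValueError conditions
    if criterion not in ("gini", "entropy"):
        raise ValueError("criterion must be gini or entropy")
    if criteria not in ("max_samples", "impurity"):
        raise ValueError("criteria has to be max_samples or impurity")
    if feature_select not in ("random", "trandom", "best", "mutual", "cfs", "fcbf"):
        raise ValueError("splitter must be in {random, best, mutual, cfs, fcbf}")
```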

_distances(node: Snode, data: ndarray) → array[source]

Compute distances of the samples to the hyperplane of the node

Parameters

node : Snode

node containing the svm classifier

data : np.ndarray

samples to compute distance to hyperplane

Returns

np.array

array of shape (m, nc) with the distances of every sample to the hyperplane of every class. nc = # of classes
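Scikit-learn's SVC exposes signed distances to its separating hyperplane through decision_function; a minimal sketch on a linearly separable toy set (illustrative, not the method's code):

```python
import numpy as np
from sklearn.svm import SVC

# Toy binary problem: signed distance of each sample to the hyperplane
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = SVC(kernel="linear").fit(X, y)
distances = clf.decision_function(X)  # shape (n_samples,) in the binary case
```

Samples on the class-1 side get positive distances, class-0 side negative.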

static _entropy(y: array) → float[source]

Compute the entropy of a label set

Parameters

y : np.array

set of labels

Returns

float

entropy
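A minimal sketch of Shannon entropy over a label vector (illustrative, not the class's private implementation):

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    # H(y) = -sum_c p_c * log2(p_c) over the class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A balanced binary label set has entropy 1.0; a pure set has entropy 0.0.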

static _fs_best(dataset: array, labels: array, max_features: int) → tuple[source]

Return the variables with the highest F-score

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected
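The selection can be approximated with sklearn's SelectKBest; a sketch under that assumption (the score function f_classif is an assumption, not confirmed by this documentation):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def fs_best(dataset: np.ndarray, labels: np.ndarray, max_features: int) -> tuple:
    # Keep the max_features columns with the highest ANOVA F-score
    selector = SelectKBest(score_func=f_classif, k=max_features)
    selector.fit(dataset, labels)
    return tuple(sorted(selector.get_support(indices=True)))
```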

static _fs_cfs(dataset: array, labels: array, max_features: int) → tuple[source]

Correlation-based feature selection with max_features limit

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

static _fs_fcbf(dataset: array, labels: array, max_features: int) → tuple[source]

Fast Correlation-based Filter algorithm with max_features limit

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

static _fs_iwss(dataset: array, labels: array, max_features: int) → tuple[source]

Correlation-based feature selection based on IWSS with max_features limit

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

_fs_mutual(dataset: array, labels: array, max_features: int) → tuple[source]

Return the features with the highest mutual information with the labels

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

_fs_random(dataset: array, labels: array, max_features: int) → tuple[source]

Return the best of five random feature set combinations

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

static _fs_trandom(dataset: array, labels: array, max_features: int) → tuple[source]

Return a random feature set combination

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (< number of features in dataset)

Returns

tuple

indices of the features selected

static _generate_spaces(features: int, max_features: int) → list[source]

Generate at most 5 random feature combinations

Parameters

features : int

number of features in the dataset

max_features : int

number of features in each combination

Returns

list

list with up to 5 combinations of features randomly selected
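A sketch of drawing up to 5 distinct random feature subsets (illustrative; the real method also has to handle the case where fewer distinct combinations exist):

```python
import random

def generate_spaces(features: int, max_features: int) -> list:
    # Draw up to 5 distinct subsets of size max_features from range(features)
    rng = random.Random(0)  # seeded here only to keep the sketch deterministic
    subsets = set()
    for _ in range(50):  # bounded number of attempts
        subsets.add(tuple(sorted(rng.sample(range(features), max_features))))
        if len(subsets) == 5:
            break
    return [list(s) for s in subsets]
```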

_get_subspaces_set(dataset: array, labels: array, max_features: int) → tuple[source]

Compute the indices of the features selected by the splitter depending on the self._feature_select hyperparameter

Parameters

dataset : np.array

array of samples

labels : np.array

labels of the dataset

max_features : int

number of features of the subspace (<= number of features in dataset)

Returns

tuple

indices of the features selected

static _gini(y: array) → float[source]

Compute the Gini impurity of a label set

_impurity(data: array, y: array) → array[source]

Return the column of the dataset to be taken into account to split the dataset

Parameters

data : np.array

distances to the hyperplane of every class

y : np.array

vector of labels (classes)

Returns

np.array

column of the dataset to be taken into account to split the dataset

static _max_samples(data: array, y: array) → array[source]

Return the column of the dataset to be taken into account to split the dataset

Parameters

data : np.array

distances to the hyperplane of every class

y : np.array

vector of labels (classes)

Returns

np.array

column of the dataset to be taken into account to split the dataset
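One plausible reading of the “max_samples” criterion, sketched below: pick the class column whose hyperplane leaves the most samples on its positive side. This is an assumption for illustration, not necessarily the library's exact rule:

```python
import numpy as np

def max_samples(data: np.ndarray, y: np.ndarray) -> np.ndarray:
    # data: (n_samples, n_classes) distances to each class hyperplane.
    # Choose the column with the most positive-distance samples (assumed rule).
    selected = int(np.argmax((data > 0).sum(axis=0)))
    return data[:, selected]
```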

_select_best_set(dataset: array, labels: array, features_sets: list) → list[source]

Return the best set of features among features_sets; the criterion is the information gain

Parameters

dataset : np.array

array of samples (# samples, # features)

labels : np.array

array of labels

features_sets : list

list of feature sets to check

Returns

list

best feature set

get_subspace(dataset: array, labels: array, max_features: int) → tuple[source]

Return a subspace of the dataset of max_features length, depending on the feature_select hyperparameter

Parameters

dataset : np.array

array of samples (# samples, # features)

labels : np.array

labels of the dataset

max_features : int

number of features to form the subspace

Returns

tuple

tuple with the dataset restricted to the selected features and the indices of the features selected
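The return value pairs the reduced dataset with the chosen column indices; the slicing itself is plain NumPy. A sketch of the semantics (the function name is hypothetical, not the method's code):

```python
import numpy as np

def take_subspace(dataset: np.ndarray, indices: tuple) -> tuple:
    # Keep only the selected feature columns; return them with their indices
    return dataset[:, list(indices)], indices

X = np.arange(12.0).reshape(3, 4)
X_sub, idx = take_subspace(X, (0, 2))
```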

information_gain(labels: array, labels_up: array, labels_dn: array) → float[source]

Compute information gain of a split candidate

Parameters

labels : np.array

labels of the dataset

labels_up : np.array

labels of one side

labels_dn : np.array

labels on the other side

Returns

float

information gain
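Information gain is the entropy of the parent labels minus the size-weighted entropies of the two partitions; a self-contained illustrative sketch:

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, labels_up, labels_dn) -> float:
    # IG = H(labels) - |up|/n * H(up) - |dn|/n * H(dn)
    n = len(labels)
    return (
        entropy(labels)
        - len(labels_up) / n * entropy(labels_up)
        - len(labels_dn) / n * entropy(labels_dn)
    )
```

A perfect split of a balanced binary set yields a gain of 1.0.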

part(origin: array) → list[source]

Split an array in two based on the indices in self._up and their complement. partition has to be called first to establish the up indices

Parameters

origin : np.array

dataset to split

Returns

list

list with the two splits of the array
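The split itself is boolean-mask indexing on the "up" indices established beforehand; a sketch of the semantics (in the class the mask comes from a prior call to partition):

```python
import numpy as np

def part(origin: np.ndarray, up: np.ndarray) -> list:
    # up: boolean mask marking the samples that go to the 'up' side
    return [origin[up], origin[~up]]

data = np.array([10, 20, 30, 40])
up_mask = np.array([True, False, True, False])
side_up, side_dn = part(data, up_mask)
```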

partition(samples: array, node: Snode, train: bool)[source]

Set the criteria to split arrays. Compute the indices of the samples that should go to one side of the tree (up)

Parameters

samples : np.array

array of samples (# samples, # features)

node : Snode

Node of the tree where partition is going to be made

train : bool

Train time - True / Test time - False

partition_impurity(y: array) → array[source]