Splitter¶
Splitter class for an oblique decision tree classifier based on SVM nodes
- class Splitter.Splitter(clf: Optional[sklearn.svm._classes.SVC] = None, criterion: Optional[str] = None, feature_select: Optional[str] = None, criteria: Optional[str] = None, min_samples_split: Optional[int] = None, random_state=None, normalize=False)[source]¶
Bases:
object
- _distances(node: Splitter.Snode, data: numpy.ndarray) numpy.array [source]¶
Compute distances of the samples to the hyperplane of the node
- node : Snode
node containing the svm classifier
- data : np.ndarray
samples to compute distance to hyperplane
- np.array
array of shape (m, nc) with the distances of every sample to the hyperplane of every class. nc = # of classes
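The class signature above shows that clf is a scikit-learn SVC, whose decision_function with the one-vs-rest shape produces exactly such an (m, nc) array. A minimal sketch on synthetic data (not the library's own code; the node wrapper is omitted):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the samples at a node.
X, y = make_classification(
    n_samples=60, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
clf = SVC(kernel="linear", decision_function_shape="ovr", random_state=0)
clf.fit(X, y)

# Signed distances of every sample to the hyperplane of every class:
# shape (m, nc), where nc = number of classes.
distances = clf.decision_function(X)
print(distances.shape)  # (60, 3)
```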
- static _entropy(y: numpy.array) float [source]¶
Compute entropy of a labels set
- y : np.array
set of labels
- float
entropy
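The entropy of a label set is the Shannon entropy of the class proportions. A self-contained sketch of the computation (the function name and exact implementation are illustrative):

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    """Shannon entropy of a label set (a sketch of what _entropy computes)."""
    _, counts = np.unique(y, return_counts=True)
    proportions = counts / counts.sum()
    return float(-np.sum(proportions * np.log2(proportions)))

print(entropy(np.array([0, 0, 1, 1])))  # 1.0: a perfectly balanced binary set
```

A pure set (all labels equal) yields entropy 0, the minimum; a uniform split over the classes yields the maximum.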
- static _fs_best(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Return the variables with the highest f-scores
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
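Ranking features by ANOVA f-score can be sketched with scikit-learn's f_classif; this is an assumption about how the selection works, shown here on the Iris dataset rather than the library's own code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)
max_features = 2

# f_classif returns the ANOVA F-value of each feature against the labels;
# keep the indices of the max_features highest-scoring ones.
f_scores, _ = f_classif(X, y)
selected = tuple(np.argsort(f_scores)[::-1][:max_features])
print(selected)
```

On Iris this picks the two petal measurements, which separate the classes far better than the sepal ones.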
- static _fs_cfs(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Correlation-based feature selection with max_features limit
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _fs_fcbf(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Fast Correlation-based Filter algorithm with max_features limit
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _fs_mutual(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Return the features with the highest mutual information with the labels
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
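A plausible sketch of mutual-information selection using scikit-learn's mutual_info_classif (the dataset and exact call are illustrative, not taken from the library):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
max_features = 2

# Estimate the mutual information of each feature with the labels and
# keep the indices of the max_features most informative ones.
mi = mutual_info_classif(X, y, random_state=0)
selected = tuple(np.argsort(mi)[::-1][:max_features])
print(selected)
```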
- _fs_random(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Return the best of five random feature set combinations
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _generate_spaces(features: int, max_features: int) list [source]¶
Generate at most 5 random feature combinations
- features : int
number of features in the dataset
- max_features : int
number of features in each combination
- list
list with up to 5 combinations of features randomly selected
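A minimal sketch of generating up to 5 random combinations of max_features indices out of features. Note the sketch enumerates all combinations before sampling, which is only viable for small feature counts; the library's own sampling strategy may differ:

```python
from itertools import combinations
import random

def generate_spaces(features: int, max_features: int, seed: int = 0) -> list:
    """Up to 5 random combinations of max_features indices from
    range(features). Illustrative only: enumerating all combinations
    is exponential, so real implementations sample instead."""
    rng = random.Random(seed)
    comb = list(combinations(range(features), max_features))
    rng.shuffle(comb)
    return comb[:5]

spaces = generate_spaces(features=6, max_features=3)
print(spaces)
```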
- _get_subspaces_set(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Compute the indices of the features selected by the splitter, depending on the self._feature_select hyperparameter
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (<= number of features in dataset)
- tuple
indices of the features selected
- _impurity(data: numpy.array, y: numpy.array) numpy.array [source]¶
Return the column of the dataset to be used to split the dataset
- data : np.array
distances to the hyperplane of every class
- y : np.array
vector of labels (classes)
- np.array
column of the dataset to be used to split the dataset
- static _max_samples(data: numpy.array, y: numpy.array) numpy.array [source]¶
Return the column of the dataset to be used to split the dataset
- data : np.array
distances to the hyperplane of every class
- y : np.array
vector of labels (classes)
- np.array
column of the dataset to be used to split the dataset
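Reading the name as "take the column of the most populous class", the criterion can be sketched as follows. The assumption that labels index the distance columns directly (i.e. classes are 0..nc-1) and the tie-breaking are illustrative:

```python
import numpy as np

def max_samples(data: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Sketch of a max_samples-style criterion: pick the distance column
    of the class with the most samples (assumes labels 0..nc-1 index
    the columns of data; exact semantics in Splitter may differ)."""
    classes, counts = np.unique(y, return_counts=True)
    selected = classes[np.argmax(counts)]
    return data[:, selected]

rng = np.random.default_rng(0)
distances = rng.normal(size=(6, 3))      # (m, nc) distances to each hyperplane
labels = np.array([0, 1, 1, 1, 2, 2])    # class 1 is the most frequent
column = max_samples(distances, labels)
print(column.shape)  # (6,)
```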
- _select_best_set(dataset: numpy.array, labels: numpy.array, features_sets: list) list [source]¶
Return the best set of features among features_sets; the criterion is information gain
- dataset : np.array
array of samples (# samples, # features)
- labels : np.array
array of labels
- features_sets : list
list of feature sets to check
- list
best feature set
- get_subspace(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Return a subspace of the dataset of max_features length, selected depending on the hyperparameter
- dataset : np.array
array of samples (# samples, # features)
- labels : np.array
labels of the dataset
- max_features : int
number of features to form the subspace
- tuple
tuple with the dataset with only the features selected and the indices of the features selected
- information_gain(labels: numpy.array, labels_up: numpy.array, labels_dn: numpy.array) float [source]¶
Compute information gain of a split candidate
- labels : np.array
labels of the dataset
- labels_up : np.array
labels of one side
- labels_dn : np.array
labels on the other side
- float
information gain
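Information gain is the parent's entropy minus the size-weighted entropy of the two sides. A self-contained sketch (function names are illustrative):

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels: np.ndarray, labels_up: np.ndarray,
                     labels_dn: np.ndarray) -> float:
    """Parent entropy minus the size-weighted entropy of the two sides."""
    n = labels.shape[0]
    weighted = (labels_up.shape[0] / n) * entropy(labels_up) \
             + (labels_dn.shape[0] / n) * entropy(labels_dn)
    return entropy(labels) - weighted

labels = np.array([0, 0, 1, 1])
# A perfect split of a balanced binary set gains the full bit of entropy.
print(information_gain(labels, np.array([0, 0]), np.array([1, 1])))  # 1.0
```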
- part(origin: numpy.array) list [source]¶
Split an array in two based on the indices in self._up and their complement. partition has to be called first to establish the up indices
- origin : np.array
dataset to split
- list
list with two splits of the array
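The split reduces to indexing with a boolean mask and its complement. A sketch that passes the mask explicitly instead of reading self._up:

```python
import numpy as np

def part(origin: np.ndarray, up: np.ndarray) -> list:
    """Sketch of part(): the rows flagged by the boolean mask `up`
    (set beforehand by partition) and their complement."""
    return [origin[up], origin[~up]]

X = np.arange(10).reshape(5, 2)
up = np.array([True, False, True, False, True])
side_up, side_dn = part(X, up)
print(side_up.shape, side_dn.shape)  # (3, 2) (2, 2)
```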
- partition(samples: numpy.array, node: Splitter.Snode, train: bool)[source]¶
Set the criterion to split arrays. Compute the indices of the samples that should go to one side of the tree (up)
- samples : np.array
array of samples (# samples, # features)
- node : Snode
Node of the tree where the partition is going to be made
- train : bool
Train time - True / Test time - False
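Putting the pieces together, a partition-like step fits an SVC at the node, computes the per-class distances, lets the criterion pick one column, and sends the samples with positive distance to the up side. All names and the column choice are illustrative; the real Splitter applies its configured criterion and feature subspace:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
clf = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)

distances = clf.decision_function(X)      # (m, nc) distances to each hyperplane
classes, counts = np.unique(y, return_counts=True)
col = classes[np.argmax(counts)]          # max_samples-style column choice
up = distances[:, col] > 0                # samples sent to the "up" side

X_up, X_dn = X[up], X[~up]                # the two children of this node
print(X_up.shape[0] + X_dn.shape[0])      # 80: every sample lands on one side
```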