Splitter

Oblique decision tree classifier based on SVM nodes

Splitter class

class Splitter.Splitter(clf: Optional[sklearn.svm._classes.SVC] = None, criterion: Optional[str] = None, feature_select: Optional[str] = None, criteria: Optional[str] = None, min_samples_split: Optional[int] = None, random_state=None, normalize=False)[source]

Bases: object
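
The splitter is used internally by the tree builder, but it can also be constructed directly. Below is a minimal, hedged instantiation sketch; the string values chosen for criterion, feature_select and criteria are assumptions inferred from the method names documented on this page, not a definitive list of accepted values.

    from sklearn.svm import SVC
    from Splitter import Splitter  # assumes the Splitter module layout shown above

    # Hypothetical configuration: every hyperparameter value here is an assumption.
    splitter = Splitter(
        clf=SVC(kernel="linear", random_state=0),  # base classifier fitted at each node
        criterion="entropy",          # impurity measure (see _entropy / _gini)
        feature_select="random",      # subspace strategy (see the _fs_* methods)
        criteria="max_samples",       # split criterion (see _max_samples / _impurity)
        min_samples_split=2,
        random_state=0,
        normalize=False,
    )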

_distances(node: Splitter.Snode, data: numpy.ndarray) numpy.array[source]

Compute distances of the samples to the hyperplane of the node

Parameters

node : Snode
    node containing the svm classifier
data : np.ndarray
    samples to compute the distance to the hyperplane

Returns

np.array
    array of shape (m, nc) with the distances of every sample to the hyperplane of every class, where nc is the number of classes
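
A minimal sketch of where such per-class distances can come from with scikit-learn, assuming a fitted linear SVC and its decision_function; this is an illustration of the contract, not the method's actual code.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 2.0],
                  [1.0, 2.0], [3.0, 0.0], [3.0, 1.0]])
    y = np.array([0, 0, 1, 1, 2, 2])

    clf = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
    distances = clf.decision_function(X)  # shape (m, nc): one value per sample and class
    print(distances.shape)                # (6, 3)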

static _entropy(y: numpy.array) float[source]

Compute entropy of a labels set

Parameters

y : np.array
    set of labels

Returns

float
    entropy
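
A worked example of entropy over a label set, assuming the usual base-2 Shannon formula H(y) = -Σ p_c log2 p_c (the exact logarithm base used by the library is not stated here).

    import numpy as np

    def entropy(y: np.ndarray) -> float:
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    print(entropy(np.array([0, 0, 1, 1])))  # 1.0 (perfectly mixed labels)
    print(entropy(np.array([0, 0, 0, 0])))  # 0.0 (a pure label set has no uncertainty)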

static _fs_best(dataset: numpy.array, labels: numpy.array, max_features: int) tuple[source]

Return the variables with the highest f-score

Parameters

dataset : np.array
    array of samples
labels : np.array
    labels of the dataset
max_features : int
    number of features of the subspace (< number of features in dataset)

Returns

tuple
    indices of the features selected
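
A minimal sketch of f-score based selection using scikit-learn's ANOVA F-test; this is an assumed equivalent for illustration, not necessarily how _fs_best is implemented.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)
    max_features = 2

    selector = SelectKBest(score_func=f_classif, k=max_features).fit(X, y)
    selected = tuple(selector.get_support(indices=True))
    print(selected)  # indices of the two features with the highest f-score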

static _fs_cfs(dataset: numpy.array, labels: numpy.array, max_features: int) tuple[source]

Correlation-based feature selection with max_features limit

Parameters

dataset : np.array
    array of samples
labels : np.array
    labels of the dataset
max_features : int
    number of features of the subspace (< number of features in dataset)

Returns

tuple
    indices of the features selected

static _fs_fcbf(dataset: numpy.array, labels: numpy.array, max_features: int) tuple[source]

Fast Correlation-based Filter algorithm with max_features limit

Parameters

dataset : np.array
    array of samples
labels : np.array
    labels of the dataset
max_features : int
    number of features of the subspace (< number of features in dataset)

Returns

tuple
    indices of the features selected

static _fs_mutual(dataset: numpy.array, labels: numpy.array, max_features: int) tuple[source]

Return the best features with mutual information with labels

Parameters

dataset : np.array
    array of samples
labels : np.array
    labels of the dataset
max_features : int
    number of features of the subspace (< number of features in dataset)

Returns

tuple
    indices of the features selected
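
A minimal sketch of mutual-information based selection with scikit-learn's mutual_info_classif; an assumed equivalent, not necessarily the library's implementation.

    import numpy as np
    from sklearn.datasets import load_wine
    from sklearn.feature_selection import mutual_info_classif

    X, y = load_wine(return_X_y=True)
    max_features = 3

    mi = mutual_info_classif(X, y, random_state=0)
    selected = tuple(sorted(np.argsort(mi)[-max_features:]))  # highest-MI features
    print(selected)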

_fs_random(dataset: numpy.array, labels: numpy.array, max_features: int) tuple[source]

Return the best of five random feature set combinations

Parameters

dataset : np.array
    array of samples
labels : np.array
    labels of the dataset
max_features : int
    number of features of the subspace (< number of features in dataset)

Returns

tuple
    indices of the features selected

static _generate_spaces(features: int, max_features: int) list[source]

Generate at most 5 feature random combinations

Parameters

features : int
    number of features in the dataset
max_features : int
    number of features in each combination

Returns

list
    list with up to 5 combinations of features randomly selected
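
A hedged sketch of what such a helper could look like; the generate_spaces function below is hypothetical and only mirrors the documented contract: draw at most 5 distinct random index combinations of size max_features from a pool of features columns.

    import math
    import random

    def generate_spaces(features: int, max_features: int) -> list:
        # If no subspace is requested, the only "combination" is all features.
        if max_features == features:
            return [tuple(range(features))]
        rng = random.Random(0)
        limit = min(5, math.comb(features, max_features))
        spaces = set()
        while len(spaces) < limit:
            spaces.add(tuple(sorted(rng.sample(range(features), max_features))))
        return list(spaces)

    print(generate_spaces(features=4, max_features=2))  # up to 5 index pairs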

_get_subspaces_set(dataset: numpy.array, labels: numpy.array, max_features: int) tuple[source]

Compute the indices of the features selected by the splitter depending on the self._feature_select hyperparameter

Parameters

dataset : np.array
    array of samples
labels : np.array
    labels of the dataset
max_features : int
    number of features of the subspace (<= number of features in dataset)

Returns

tuple
    indices of the features selected

static _gini(y: numpy.array) float[source]

Compute the Gini index of a labels set

Parameters

y : np.array
    set of labels

Returns

float
    Gini index

_impurity(data: numpy.array, y: numpy.array) numpy.array[source]

Return the column of the dataset to be taken into account to split the dataset

Parameters

data : np.array
    distances to the hyperplane of every class
y : np.array
    vector of labels (classes)

Returns

np.array
    column of the dataset to be taken into account to split the dataset
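
A conceptual sketch of the idea behind this criterion (an assumption, not the method's code): score the zero-threshold split induced by each distance column with information gain and return the best column.

    import numpy as np

    def entropy(y: np.ndarray) -> float:
        if len(y) == 0:
            return 0.0
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def information_gain(y, y_up, y_dn):
        n = len(y)
        return entropy(y) - len(y_up) / n * entropy(y_up) - len(y_dn) / n * entropy(y_dn)

    def best_split_column(data: np.ndarray, y: np.ndarray) -> np.ndarray:
        # Evaluate the split "distance > 0" for every class column.
        gains = [information_gain(y, y[data[:, c] > 0], y[data[:, c] <= 0])
                 for c in range(data.shape[1])]
        return data[:, int(np.argmax(gains))]

    distances = np.array([[0.9, -0.1], [0.8, 0.2], [-0.7, 0.5], [-0.6, -0.4]])
    labels = np.array([0, 0, 1, 1])
    print(best_split_column(distances, labels))  # column 0 separates the classes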

static _max_samples(data: numpy.array, y: numpy.array) numpy.array[source]

Return the column of the dataset to be taken into account to split the dataset

Parameters

data : np.array
    distances to the hyperplane of every class
y : np.array
    vector of labels (classes)

Returns

np.array
    column of the dataset to be taken into account to split the dataset
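
A conceptual sketch of a "most populated class" reading of this criterion; it is an assumption that column k of data holds the distances for class k, and this is not the library's code.

    import numpy as np

    def max_samples_column(data: np.ndarray, y: np.ndarray) -> np.ndarray:
        _, counts = np.unique(y, return_counts=True)
        return data[:, int(np.argmax(counts))]  # distance column of the majority class

    distances = np.array([[0.5, -0.1], [-0.2, 0.3], [0.7, -0.6]])
    labels = np.array([0, 1, 0])
    print(max_samples_column(distances, labels))  # class 0 has most samples -> column 0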

_select_best_set(dataset: numpy.array, labels: numpy.array, features_sets: list) list[source]

Return the best set of features among features_sets; the criterion is the information gain

Parameters

dataset : np.array
    array of samples (# samples, # features)
labels : np.array
    array of labels
features_sets : list
    list of feature sets to check

Returns

list
    best feature set

get_subspace(dataset: numpy.array, labels: numpy.array, max_features: int) tuple[source]

Return a subspace of the selected dataset of max_features length, depending on the feature_select hyperparameter

Parameters

dataset : np.array
    array of samples (# samples, # features)
labels : np.array
    labels of the dataset
max_features : int
    number of features to form the subspace

Returns

tuple
    tuple with the dataset with only the selected features and the indices of the features selected
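
A hedged usage sketch of the documented contract: the method returns the reduced dataset together with the chosen feature indices. The constructor arguments shown are assumptions carried over from the example at the top of this page.

    from sklearn.datasets import load_iris
    from sklearn.svm import SVC
    from Splitter import Splitter

    X, y = load_iris(return_X_y=True)
    splitter = Splitter(clf=SVC(kernel="linear"), criterion="gini",
                        feature_select="random", criteria="max_samples",
                        min_samples_split=2, random_state=0)

    X_sub, idx = splitter.get_subspace(X, y, max_features=2)
    print(X_sub.shape, idx)  # (150, 2) and the indices of the two selected features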

information_gain(labels: numpy.array, labels_up: numpy.array, labels_dn: numpy.array) float[source]

Compute information gain of a split candidate

Parameters

labels : np.array
    labels of the dataset
labels_up : np.array
    labels of one side
labels_dn : np.array
    labels on the other side

Returns

float
    information gain
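
A worked example, assuming the standard weighted-entropy definition IG = H(labels) - |up|/n * H(labels_up) - |dn|/n * H(labels_dn); the impurity measure actually used depends on the criterion configured in the constructor.

    import numpy as np

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    labels = np.array([0, 0, 1, 1])   # H = 1.0
    labels_up = np.array([0, 0])      # H = 0.0
    labels_dn = np.array([1, 1])      # H = 0.0

    ig = entropy(labels) - 0.5 * entropy(labels_up) - 0.5 * entropy(labels_dn)
    print(ig)  # 1.0: a perfect split removes all label uncertainty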

part(origin: numpy.array) list[source]

Split an array in two based on the indices in self._up and its complement. partition has to be called first to establish the up indices

Parameters

origin : np.array
    dataset to split

Returns

list
    list with two splits of the array

partition(samples: numpy.array, node: Splitter.Snode, train: bool)[source]

Set the criteria to split arrays. Compute the indices of the samples that should go to one side of the tree (up)

Parameters

samples : np.array
    array of samples (# samples, # features)
node : Snode
    node of the tree where the partition is going to be made
train : bool
    Train time - True / Test time - False
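
A conceptual sketch of how partition and part work together; the boolean mask below stands in for the internal self._up attribute and is an assumption, not the library's code. partition computes the mask once from the node, then part splits any aligned array with it.

    import numpy as np

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])

    up = np.array([True, True, False, False])  # stand-in for self._up set by partition()

    def part(origin: np.ndarray) -> list:
        return [origin[up], origin[~up]]

    X_up, X_dn = part(X)
    y_up, y_dn = part(y)
    print(len(X_up), len(X_dn))  # 2 2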

partition_impurity(y: numpy.array) numpy.array[source]

Compute the impurity of a labels set with the criterion set in the constructor (entropy or gini)

Parameters

y : np.array
    set of labels

Returns

np.array
    impurity of the labels set