Splitter¶
Splitter class for an oblique decision tree classifier based on SVM nodes
- class Splitter.Splitter(clf: Optional[sklearn.svm._classes.SVC] = None, criterion: Optional[str] = None, feature_select: Optional[str] = None, criteria: Optional[str] = None, min_samples_split: Optional[int] = None, random_state=None, normalize=False)[source]¶
Bases:
object
- _distances(node: Splitter.Snode, data: numpy.ndarray) numpy.array [source]¶
Compute distances of the samples to the hyperplane of the node
- node : Snode
node containing the svm classifier
- data : np.ndarray
samples to compute distance to hyperplane
- np.array
array of shape (m, nc) with the distances of every sample to the hyperplane of every class. nc = # of classes
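The class signature above shows that clf is a scikit-learn SVC, whose decision_function with the one-vs-rest shape produces exactly such an (m, nc) array. A minimal sketch on synthetic data (not the library's own code; the node wrapper is omitted):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the samples at a node.
X, y = make_classification(
    n_samples=60, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
clf = SVC(kernel="linear", decision_function_shape="ovr", random_state=0)
clf.fit(X, y)

# Signed distances of every sample to the hyperplane of every class:
# shape (m, nc), where nc = number of classes.
distances = clf.decision_function(X)
print(distances.shape)  # (60, 3)
```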
- static _entropy(y: numpy.array) float [source]¶
Compute entropy of a labels set
- y : np.array
set of labels
- float
entropy
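The entropy of a label set is the Shannon entropy of the class proportions. A self-contained sketch of the computation (the function name and exact implementation are illustrative):

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    """Shannon entropy of a label set (a sketch of what _entropy computes)."""
    _, counts = np.unique(y, return_counts=True)
    proportions = counts / counts.sum()
    return float(-np.sum(proportions * np.log2(proportions)))

print(entropy(np.array([0, 0, 1, 1])))  # 1.0: a perfectly balanced binary set
```

A pure set (all labels equal) yields entropy 0, the minimum; a uniform split over the classes yields the maximum.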
- static _fs_best(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Return the variables with the highest f-scores
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
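Ranking features by ANOVA f-score can be sketched with scikit-learn's f_classif; this is an assumption about how the selection works, shown here on the Iris dataset rather than the library's own code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)
max_features = 2

# f_classif returns the ANOVA F-value of each feature against the labels;
# keep the indices of the max_features highest-scoring ones.
f_scores, _ = f_classif(X, y)
selected = tuple(np.argsort(f_scores)[::-1][:max_features])
print(selected)
```

On Iris this picks the two petal measurements, which separate the classes far better than the sepal ones.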
- static _fs_cfs(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Correlation-based feature selection with max_features limit
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _fs_fcbf(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Fast Correlation-based Filter algorithm with max_features limit
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _fs_mutual(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Return the features with the highest mutual information with the labels
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
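A plausible sketch of mutual-information selection using scikit-learn's mutual_info_classif (the dataset and exact call are illustrative, not taken from the library):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
max_features = 2

# Estimate the mutual information of each feature with the labels and
# keep the indices of the max_features most informative ones.
mi = mutual_info_classif(X, y, random_state=0)
selected = tuple(np.argsort(mi)[::-1][:max_features])
print(selected)
```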
- _fs_random(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Return the best of five random feature set combinations
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _generate_spaces(features: int, max_features: int) list [source]¶
Generate at most 5 random feature combinations
- features : int
number of features in the dataset
- max_features : int
number of features in each combination
- list
list with up to 5 combinations of features randomly selected
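A minimal sketch of generating up to 5 random combinations of max_features indices out of features. Note the sketch enumerates all combinations before sampling, which is only viable for small feature counts; the library's own sampling strategy may differ:

```python
from itertools import combinations
import random

def generate_spaces(features: int, max_features: int, seed: int = 0) -> list:
    """Up to 5 random combinations of max_features indices from
    range(features). Illustrative only: enumerating all combinations
    is exponential, so real implementations sample instead."""
    rng = random.Random(seed)
    comb = list(combinations(range(features), max_features))
    rng.shuffle(comb)
    return comb[:5]

spaces = generate_spaces(features=6, max_features=3)
print(spaces)
```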
- _get_subspaces_set(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Compute the indices of the features selected by the splitter, depending on the self._feature_select hyperparameter
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (<= number of features in dataset)
- tuple
indices of the features selected
- _impurity(data: numpy.array, y: numpy.array) numpy.array [source]¶
Return the column of the dataset to be used to split the dataset
- data : np.array
distances to the hyperplane of every class
- y : np.array
vector of labels (classes)
- np.array
column of the dataset to be used to split the dataset
- static _max_samples(data: numpy.array, y: numpy.array) numpy.array [source]¶
Return the column of the dataset to be used to split the dataset
- data : np.array
distances to the hyperplane of every class
- y : np.array
vector of labels (classes)
- np.array
column of the dataset to be used to split the dataset
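Reading the name as "take the column of the most populous class", the criterion can be sketched as follows. The assumption that labels index the distance columns directly (i.e. classes are 0..nc-1) and the tie-breaking are illustrative:

```python
import numpy as np

def max_samples(data: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Sketch of a max_samples-style criterion: pick the distance column
    of the class with the most samples (assumes labels 0..nc-1 index
    the columns of data; exact semantics in Splitter may differ)."""
    classes, counts = np.unique(y, return_counts=True)
    selected = classes[np.argmax(counts)]
    return data[:, selected]

rng = np.random.default_rng(0)
distances = rng.normal(size=(6, 3))      # (m, nc) distances to each hyperplane
labels = np.array([0, 1, 1, 1, 2, 2])    # class 1 is the most frequent
column = max_samples(distances, labels)
print(column.shape)  # (6,)
```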
- _select_best_set(dataset: numpy.array, labels: numpy.array, features_sets: list) list [source]¶
Return the best set of features among features_sets; the criterion is information gain
- dataset : np.array
array of samples (# samples, # features)
- labels : np.array
array of labels
- features_sets : list
list of feature sets to check
- list
best feature set
- get_subspace(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]¶
Return a subspace of the dataset of max_features length, selected depending on the hyperparameter
- dataset : np.array
array of samples (# samples, # features)
- labels : np.array
labels of the dataset
- max_features : int
number of features to form the subspace
- tuple
tuple with the dataset with only the features selected and the indices of the features selected
- information_gain(labels: numpy.array, labels_up: numpy.array, labels_dn: numpy.array) float [source]¶
Compute information gain of a split candidate
- labels : np.array
labels of the dataset
- labels_up : np.array
labels of one side
- labels_dn : np.array
labels on the other side
- float
information gain
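Information gain is the parent's entropy minus the size-weighted entropy of the two sides. A self-contained sketch (function names are illustrative):

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels: np.ndarray, labels_up: np.ndarray,
                     labels_dn: np.ndarray) -> float:
    """Parent entropy minus the size-weighted entropy of the two sides."""
    n = labels.shape[0]
    weighted = (labels_up.shape[0] / n) * entropy(labels_up) \
             + (labels_dn.shape[0] / n) * entropy(labels_dn)
    return entropy(labels) - weighted

labels = np.array([0, 0, 1, 1])
# A perfect split of a balanced binary set gains the full bit of entropy.
print(information_gain(labels, np.array([0, 0]), np.array([1, 1])))  # 1.0
```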
- part(origin: numpy.array) list [source]¶
Split an array in two based on the indices in self._up and their complement. partition has to be called first to establish the up indices
- origin : np.array
dataset to split
- list
list with two splits of the array
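The split reduces to indexing with a boolean mask and its complement. A sketch that passes the mask explicitly instead of reading self._up:

```python
import numpy as np

def part(origin: np.ndarray, up: np.ndarray) -> list:
    """Sketch of part(): the rows flagged by the boolean mask `up`
    (set beforehand by partition) and their complement."""
    return [origin[up], origin[~up]]

X = np.arange(10).reshape(5, 2)
up = np.array([True, False, True, False, True])
side_up, side_dn = part(X, up)
print(side_up.shape, side_dn.shape)  # (3, 2) (2, 2)
```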
- partition(samples: numpy.array, node: Splitter.Snode, train: bool)[source]¶
Set the criterion to split arrays. Compute the indices of the samples that should go to one side of the tree (up)
- samples : np.array
array of samples (# samples, # features)
- node : Snode
Node of the tree where the partition is going to be made
- train : bool
Train time - True / Test time - False
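Putting the pieces together, a partition-like step fits an SVC at the node, computes the per-class distances, lets the criterion pick one column, and sends the samples with positive distance to the up side. All names and the column choice are illustrative; the real Splitter applies its configured criterion and feature subspace:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
clf = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)

distances = clf.decision_function(X)      # (m, nc) distances to each hyperplane
classes, counts = np.unique(y, return_counts=True)
col = classes[np.argmax(counts)]          # max_samples-style column choice
up = distances[:, col] > 0                # samples sent to the "up" side

X_up, X_dn = X[up], X[~up]                # the two children of this node
print(X_up.shape[0] + X_dn.shape[0])      # 80: every sample lands on one side
```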