Splitter
Splitter class for an oblique decision tree classifier based on SVM nodes
- class Splitter.Splitter(clf: Optional[SVC] = None, criterion: Optional[str] = None, feature_select: Optional[str] = None, criteria: Optional[str] = None, min_samples_split: Optional[int] = None, random_state=None, normalize=False)[source]
Bases:
object
Splits a dataset in two based on different criteria
Parameters
- clf : SVC, optional
classifier, by default None
- criterion : str, optional
The function to measure the quality of a split (only used if max_features != num_features). Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain, by default None
- feature_select : str, optional
The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: “best”: the sklearn SelectKBest algorithm is used in every node to choose the max_features best features. “random”: the algorithm generates 5 candidates and chooses the best (max. info. gain) of them. “trandom”: the algorithm generates only one random combination. “mutual”: chooses the best features w.r.t. their mutual info with the label. “cfs”: apply Correlation-based Feature Selection. “fcbf”: apply Fast Correlation-Based Filter, by default None
- criteria : str, optional
Decides (only in case of multiclass classification) which column (class) to use to split the dataset in a node. max_samples is incompatible with the ‘ovo’ multiclass_strategy, by default None
- min_samples_split : int, optional
The minimum number of samples required to split an internal node. 0 (default) for any, by default None
- random_state : optional
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls, by default None
- normalize : bool, optional
Whether standardization of features should be applied on each node with the samples that reach it, by default False
Raises
- ValueError
clf has to be a sklearn estimator
- ValueError
criterion must be gini or entropy
- ValueError
criteria has to be max_samples or impurity
- ValueError
splitter must be in {random, best, mutual, cfs, fcbf}
- _distances(node: Snode, data: ndarray) array [source]
Compute distances of the samples to the hyperplane of the node
Parameters
- node : Snode
node containing the svm classifier
- data : np.ndarray
samples to compute distance to hyperplane
Returns
- np.array
array of shape (m, nc) with the distances of every sample to the hyperplane of every class. nc = # of classes
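The (m, nc) shape matches what sklearn's SVC returns from decision_function with the one-vs-rest shape in a multiclass problem; a small illustration (the iris dataset here is only an example, not part of the library):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
distances = clf.decision_function(X)
print(distances.shape)  # (150, 3): one distance column per class
```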
- static _entropy(y: array) float [source]
Compute entropy of a labels set
Parameters
- y : np.array
set of labels
Returns
- float
entropy
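As a sketch of what _entropy computes, assuming the usual Shannon entropy over label frequencies (the library's exact implementation may differ):

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    # Shannon entropy of a label vector, in bits
    _, counts = np.unique(y, return_counts=True)
    proba = counts / counts.sum()
    return float(-np.sum(proba * np.log2(proba)))

print(entropy(np.array([0, 0, 1, 1])))  # balanced binary labels -> 1.0
```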
- static _fs_best(dataset: array, labels: array, max_features: int) tuple [source]
Return the variables with the highest f-score
Parameters
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
Returns
- tuple
indices of the features selected
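A minimal sketch of this “best” strategy using sklearn's SelectKBest with its default ANOVA F-score ranking (the helper name and the synthetic data are illustrative, not the library's code):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest

def fs_best(dataset, labels, max_features):
    # keep the max_features features with the highest F-score
    selector = SelectKBest(k=max_features).fit(dataset, labels)
    return tuple(selector.get_support(indices=True))

rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 3))
X[:, 1] += y * 5.0  # make feature 1 strongly class-informative
print(fs_best(X, y, 1))  # -> (1,)
```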
- static _fs_cfs(dataset: array, labels: array, max_features: int) tuple [source]
Correlation-based feature selection with max_features limit
Parameters
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
Returns
- tuple
indices of the features selected
- static _fs_fcbf(dataset: array, labels: array, max_features: int) tuple [source]
Fast Correlation-based Filter algorithm with max_features limit
Parameters
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
Returns
- tuple
indices of the features selected
- static _fs_iwss(dataset: array, labels: array, max_features: int) tuple [source]
Correlation-based feature selection based on IWSS with max_features limit
Parameters
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
Returns
- tuple
indices of the features selected
- _fs_mutual(dataset: array, labels: array, max_features: int) tuple [source]
Return the best features with mutual information with labels
Parameters
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
Returns
- tuple
indices of the features selected
- _fs_random(dataset: array, labels: array, max_features: int) tuple [source]
Return the best of five random feature set combinations
Parameters
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
Returns
- tuple
indices of the features selected
- static _fs_trandom(dataset: array, labels: array, max_features: int) tuple [source]
Return a random feature set combination
Parameters
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
Returns
- tuple
indices of the features selected
- static _generate_spaces(features: int, max_features: int) list [source]
Generate at most 5 random feature combinations
Parameters
- features : int
number of features in the dataset
- max_features : int
number of features in each combination
Returns
- list
list with up to 5 combination of features randomly selected
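The idea can be sketched as follows (a hypothetical reimplementation, not the library's code):

```python
import math
import random

def generate_spaces(features: int, max_features: int) -> list:
    # draw at most 5 distinct sorted index combinations of size
    # max_features out of range(features)
    total = math.comb(features, max_features)
    combos = set()
    while len(combos) < min(5, total):
        combos.add(tuple(sorted(random.sample(range(features), max_features))))
    return list(combos)

print(len(generate_spaces(6, 3)))  # -> 5
```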
- _get_subspaces_set(dataset: array, labels: array, max_features: int) tuple [source]
Compute the indices of the features selected by the splitter depending on the self._feature_select hyperparameter
Parameters
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (<= number of features in dataset)
Returns
- tuple
indices of the features selected
- _impurity(data: array, y: array) array [source]
Return the column of the dataset to be taken into account to split the dataset
Parameters
- data : np.array
distances to the hyperplane of every class
- y : np.array
vector of labels (classes)
Returns
- np.array
column of the dataset to be taken into account to split the dataset
- static _max_samples(data: array, y: array) array [source]
Return the column of the dataset to be taken into account to split the dataset
Parameters
- data : np.array
distances to the hyperplane of every class
- y : np.array
vector of labels (classes)
Returns
- np.array
column of the dataset to be taken into account to split the dataset
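A sketch of the max_samples idea, assuming it picks the distances column of the most frequent class (a hypothetical reimplementation, not the library's code):

```python
import numpy as np

def max_samples(data, y):
    # pick the distances column belonging to the most frequent class
    selected = int(np.argmax(np.bincount(y)))
    return data[:, selected]

distances = np.arange(12.0).reshape(4, 3)  # (4 samples, 3 classes)
y = np.array([0, 1, 1, 2])                 # class 1 is the most frequent
print(max_samples(distances, y))           # column 1 -> [ 1.  4.  7. 10.]
```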
- _select_best_set(dataset: array, labels: array, features_sets: list) list [source]
Return the best set of features among features_sets; the criterion is the information gain
Parameters
- dataset : np.array
array of samples (# samples, # features)
- labels : np.array
array of labels
- features_sets : list
list of feature sets to check
Returns
- list
best feature set
- get_subspace(dataset: array, labels: array, max_features: int) tuple [source]
Return a subspace of the selected dataset of max_features length, depending on the hyperparameter
Parameters
- dataset : np.array
array of samples (# samples, # features)
- labels : np.array
labels of the dataset
- max_features : int
number of features to form the subspace
Returns
- tuple
tuple with the dataset with only the features selected and the indices of the features selected
- information_gain(labels: array, labels_up: array, labels_dn: array) float [source]
Compute information gain of a split candidate
Parameters
- labels : np.array
labels of the dataset
- labels_up : np.array
labels of one side of the split
- labels_dn : np.array
labels of the other side of the split
Returns
- float
information gain
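The quantity computed here is the standard entropy-based information gain; as a sketch:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, labels_up, labels_dn):
    # parent entropy minus the size-weighted entropies of the two sides
    n = len(labels)
    return (
        entropy(labels)
        - len(labels_up) / n * entropy(labels_up)
        - len(labels_dn) / n * entropy(labels_dn)
    )

labels = np.array([0, 0, 1, 1])
print(information_gain(labels, labels[:2], labels[2:]))  # perfect split -> 1.0
```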
- part(origin: array) list [source]
Split an array in two based on indices (self._up) and its complement. partition has to be called first to establish the up indices
Parameters
- origin : np.array
dataset to split
Returns
- list
list with two splits of the array
- partition(samples: array, node: Snode, train: bool)[source]
Set the criteria to split arrays. Compute the indices of the samples that should go to one side of the tree (up)
Parameters
- samples : np.array
array of samples (# samples, # features)
- node : Snode
Node of the tree where the partition is going to be made
- train : bool
Train time - True / Test time - False
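In the binary case the mechanics reduce to the sign of the distance to the node's hyperplane; a sketch of that idea (not the library's code, which also handles the multiclass criteria described above):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [1.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 1])
clf = SVC(kernel="linear").fit(X, y)

distances = clf.decision_function(X)  # shape (m,) in the binary case
up = distances > 0                    # samples going to the "up" side
print(up)  # -> [False False  True  True]
```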