Welcome to STree’s documentation!
STree
Oblique Tree classifier based on SVM nodes. The nodes are built and split with sklearn SVC models. Stree is a scikit-learn estimator and can be integrated into pipelines, grid searches, etc.
License
STree is MIT licensed
Install
The main stable release
pip install stree
or the latest development branch
pip install git+https://github.com/doctorado-ml/stree
Tests
python -m unittest -v stree.tests
Hyperparameters
| | Hyperparameter | Type/Values | Default | Meaning |
|---|---|---|---|---|
| * | C | \<float\> | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. |
| * | kernel | {"liblinear", "linear", "poly", "rbf", "sigmoid"} | linear | Specifies the kernel type to be used in the algorithm. It must be one of 'liblinear', 'linear', 'poly', 'rbf' or 'sigmoid'. liblinear uses the liblinear library and the rest use the libsvm library through scikit-learn. |
| * | max_iter | \<int\> | 1e5 | Hard limit on iterations within solver, or -1 for no limit. |
| * | random_state | \<int\> | None | Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. |
| | max_depth | \<int\> | None | Specifies the maximum depth of the tree. |
| * | tol | \<float\> | 1e-4 | Tolerance for stopping criterion. |
| * | degree | \<int\> | 3 | Degree of the polynomial kernel function ('poly'). Ignored by all other kernels. |
| * | gamma | {"scale", "auto"} or \<float\> | scale | Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. |
| | split_criteria | {"impurity", "max_samples"} | impurity | Decides (only in multiclass classification) which column (class) to use to split the dataset at a node**. max_samples is incompatible with the 'ovo' multiclass_strategy. |
| | criterion | {"gini", "entropy"} | entropy | The function to measure the quality of a split (only used if max_features != num_features). |
| | min_samples_split | \<int\> | 0 | The minimum number of samples required to split an internal node. 0 (default) for any. |
| | max_features | \<int\>, \<float\>, {"auto", "sqrt", "log2"} or None | None | The number of features to consider when looking for the split: if int, consider max_features features at each split; if float, max_features is a fraction and int(max_features * n_features) features are considered; if "auto" or "sqrt", max_features=sqrt(n_features); if "log2", max_features=log2(n_features); if None, max_features=n_features. |
| | splitter | {"best", "random", "trandom", "mutual", "cfs", "fcbf", "iwss"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: "best": the sklearn SelectKBest algorithm is used in every node to choose the max_features best features. "random": the algorithm generates 5 candidates and chooses the best of them (max. info. gain). "trandom": the algorithm generates only one random combination. "mutual": chooses the best features w.r.t. their mutual information with the label. "cfs": applies Correlation-based Feature Selection. "fcbf": applies the Fast Correlation-Based Filter. "iwss": IWSS-based algorithm. |
| | normalize | \<bool\> | False | Whether standardization of features should be applied on each node with the samples that reach it. |
| * | multiclass_strategy | {"ovo", "ovr"} | "ovo" | Strategy to use with multiclass datasets, "ovo": one versus one. "ovr": one versus rest. |
* Hyperparameter used by the support vector classifier of every node
** Splitting in a STree node
The decision function is applied to the dataset and the distances from the samples to the hyperplanes are computed into a matrix. This matrix has as many columns as there are classes in the dataset (if more than two, i.e. multiclass classification), or one column if it is a binary dataset. In binary classification only one hyperplane is computed, so only one column is needed to store the distances of the samples to it. If three or more classes are present, as many hyperplanes as classes are needed, and therefore one column per hyperplane.
In multiclass classification we have to decide which column to take into account to make the split, which depends on the split_criteria hyperparameter: if "impurity" is chosen, STree computes the information gain of every split candidate using each column and chooses the one that maximizes the information gain; otherwise STree chooses the column with the most samples assigned a predicted class (the column with the most positive numbers in it).
Once the column to use for the split is chosen, the algorithm separates the samples with a positive distance to the hyperplane from the rest.
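As an illustration of the distance matrix described above, here is a small sketch using a plain sklearn LinearSVC rather than STree internals; the column-selection logic is only a simplified stand-in for what split_criteria controls.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# With three classes, the decision function yields one column of signed
# distances per hyperplane (shape: n_samples x 3).
clf = LinearSVC(max_iter=10000).fit(X, y)
distances = clf.decision_function(X)
print(distances.shape)  # (150, 3)

# "max_samples"-style choice (simplified): pick the column with the most
# positive entries, i.e. the hyperplane that captures the most samples.
column = np.argmax((distances > 0).sum(axis=0))

# The chosen column splits the node: the positive side goes one way,
# the rest go the other way.
up = distances[:, column] > 0
print(up.sum(), (~up).sum())
```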
Examples
Notebooks
Sample Code
import time
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from stree import Stree
random_state = 1
X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(
X, y, test_size=0.2, random_state=random_state
)
now = time.time()
print("Predicting with max_features=sqrt(n_features)")
clf = Stree(random_state=random_state, max_features="auto")
clf.fit(Xtrain, ytrain)
print(f"Took {time.time() - now:.2f} seconds to train")
print(clf)
print(f"Classifier's accuracy (train): {clf.score(Xtrain, ytrain):.4f}")
print(f"Classifier's accuracy (test) : {clf.score(Xtest, ytest):.4f}")
print("=" * 40)
print("Predicting with max_features=n_features")
clf = Stree(random_state=random_state)
now = time.time()
clf.fit(Xtrain, ytrain)
print(f"Took {time.time() - now:.2f} seconds to train")
print(clf)
print(f"Classifier's accuracy (train): {clf.score(Xtrain, ytrain):.4f}")
print(f"Classifier's accuracy (test) : {clf.score(Xtest, ytest):.4f}")
API index
Stree
- class stree.Stree(C: float = 1.0, kernel: str = 'linear', max_iter: int = 100000.0, random_state: Optional[int] = None, max_depth: Optional[int] = None, tol: float = 0.0001, degree: int = 3, gamma='scale', split_criteria: str = 'impurity', criterion: str = 'entropy', min_samples_split: int = 0, max_features=None, splitter: str = 'random', multiclass_strategy: str = 'ovo', normalize: bool = False)[source]
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
Estimator based on a binary tree of SVM nodes. It can deal with sample_weights, which are used in sklearn boosting methods. Inheriting from BaseEstimator provides the get_params and set_params methods; inheriting from ClassifierMixin provides the _estimator_type attribute with "classifier" as its value.
- C : float, optional
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive, by default 1.0
- kernel : str, optional
Specifies the kernel type to be used in the algorithm. It must be one of 'liblinear', 'linear', 'poly', 'rbf' or 'sigmoid'. liblinear uses the [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and the rest use the [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) library through scikit-learn, by default "linear"
- max_iter : int, optional
Hard limit on iterations within solver, or -1 for no limit, by default 1e5
- random_state : int, optional
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls, by default None
- max_depth : int, optional
Specifies the maximum depth of the tree, by default None
- tol : float, optional
Tolerance for stopping, by default 1e-4
- degree : int, optional
Degree of the polynomial kernel function ('poly'). Ignored by all other kernels, by default 3
- gamma : str or float, optional
Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. If gamma='scale' (default) is passed then it uses 1 / (n_features * X.var()) as the value of gamma; if 'auto', it uses 1 / n_features, by default "scale"
- split_criteria : str, optional
Decides (only in the case of multiclass classification) which column (class) to use to split the dataset at a node. max_samples is incompatible with the 'ovo' multiclass_strategy, by default "impurity"
- criterion : str, optional
The function to measure the quality of a split (only used if max_features != num_features). Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain, by default "entropy"
- min_samples_split : int, optional
The minimum number of samples required to split an internal node. 0 (default) for any, by default 0
- max_features : optional
The number of features to consider when looking for the split: If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If "auto", then max_features=sqrt(n_features). If "sqrt", then max_features=sqrt(n_features). If "log2", then max_features=log2(n_features). If None, then max_features=n_features. By default None
- splitter : str, optional
The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: "best": the sklearn SelectKBest algorithm is used in every node to choose the max_features best features. "random": the algorithm generates 5 candidates and chooses the best of them (max. info. gain). "trandom": the algorithm generates only one random combination. "mutual": chooses the best features w.r.t. their mutual information with the label. "cfs": applies Correlation-based Feature Selection. "fcbf": applies the Fast Correlation-Based Filter. "iwss": IWSS-based algorithm. By default "random"
- multiclass_strategy : str, optional
Strategy to use with multiclass datasets, "ovo": one versus one. "ovr": one versus rest, by default "ovo"
- normalize : bool, optional
Whether standardization of features should be applied on each node with the samples that reach it, by default False
- classes_ : ndarray of shape (n_classes,)
The classes labels.
- n_classes_ : int
The number of classes
- n_iter_ : int
Max number of iterations in classifier
- depth_ : int
Max depth of the tree
- n_features_ : int
The number of features when fit is performed.
- n_features_in_ : int
Number of features seen during fit.
- max_features_ : int
Number of features to use in hyperplane computation
- tree_ : Node
root of the tree
- X_ : ndarray
points to the input dataset
- y_ : ndarray
points to the input labels
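A short sketch of how the fitted attributes listed above can be inspected after training; the dataset choice is only illustrative.

```python
from sklearn.datasets import load_wine
from stree import Stree

X, y = load_wine(return_X_y=True)
clf = Stree(random_state=0).fit(X, y)

# Attributes documented above become available once fit has run
print(clf.classes_)        # array of class labels
print(clf.n_classes_)      # number of classes
print(clf.depth_)          # maximum depth reached by the tree
print(clf.n_features_in_)  # number of features seen during fit
```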
R. Montañana, J. A. Gámez, J. M. Puerta, “STree: a single multi-class oblique decision tree based on support vector machines.”, 2021 LNAI 12882
- _more_tags() dict [source]
Required by sklearn to supply features of the classifier; makes the labels array mandatory
- Returns
the tag required
- Return type
dict
- static _reorder_results(y: numpy.array, indices: numpy.array) numpy.array [source]
Reorder an array based on the array of indices passed
- y : np.array
untidy data
- indices : np.array
indices used to set the order
- np.array
array y ordered
- _train(X: numpy.ndarray, y: numpy.ndarray, sample_weight: numpy.ndarray, depth: int, title: str) Optional[stree.Splitter.Snode] [source]
Recursive function to split the original dataset into predictor nodes (leaves)
- X : np.ndarray
samples dataset
- y : np.ndarray
samples labels
- sample_weight : np.ndarray
weight of samples. Rescale C per sample.
- depth : int
actual depth in the tree
- title : str
description of the node
- Optional[Snode]
binary tree
- fit(X: numpy.ndarray, y: numpy.ndarray, sample_weight: Optional[numpy.array] = None) stree.Strees.Stree [source]
Build the tree based on the dataset of samples and its labels
- Returns
- Stree
itself, to be able to chain actions: fit().predict() …
- Raises
- ValueError
if C < 0
- ValueError
if max_depth < 1
- ValueError
if all samples have 0 or negative weights
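A small sketch of the validation behaviour documented above: invalid hyperparameters are rejected when fit is called, in line with the sklearn convention of validating at fit time.

```python
from sklearn.datasets import load_iris
from stree import Stree

X, y = load_iris(return_X_y=True)

try:
    # C must be strictly positive; fit raises ValueError otherwise
    Stree(C=-1.0).fit(X, y)
except ValueError as err:
    print("rejected:", err)
```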
Siterator
Oblique decision tree classifier based on SVM nodes Splitter class
- class Splitter.Siterator(tree: Splitter.Snode)[source]
Bases:
object
Stree preorder iterator
- _push(node: Splitter.Snode)[source]
Snode
Oblique decision tree classifier based on SVM nodes Splitter class
- class Splitter.Snode(clf: sklearn.svm._classes.SVC, X: numpy.ndarray, y: numpy.ndarray, features: numpy.array, impurity: float, title: str, weight: Optional[numpy.ndarray] = None, scaler: Optional[sklearn.preprocessing._data.StandardScaler] = None)[source]
Bases:
object
Node of the tree that keeps the SVM classifier and, when testing, the dataset assigned to it
- clf : SVC
Classifier used
- X : np.ndarray
input dataset in train time (only in testing)
- y : np.ndarray
input labels in train time
- features : np.array
features used to compute the hyperplane
- impurity : float
impurity of the node
- title : str
label describing the route to the node
- weight : np.ndarray, optional
weights applied to the input dataset in train time, by default None
- scaler : StandardScaler, optional
scaler used, if any, by default None
- classmethod copy(node: Splitter.Snode) Splitter.Snode [source]
- get_down() Splitter.Snode [source]
- get_up() Splitter.Snode [source]
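A preorder walk over the tree can also be written directly with these accessors; a minimal sketch, assuming get_down() and get_up() return None on leaf nodes (the documented Siterator class offers the same traversal ready-made).

```python
def preorder(node):
    """Yield every Snode of the subtree rooted at node, parent first."""
    if node is None:
        return
    yield node
    yield from preorder(node.get_up())    # assumption: None when there is no child
    yield from preorder(node.get_down())

# Usage on a fitted classifier (tree_ is the documented root attribute):
# for snode in preorder(clf.tree_):
#     print(snode)
```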
Splitter
Oblique decision tree classifier based on SVM nodes Splitter class
- class Splitter.Splitter(clf: Optional[sklearn.svm._classes.SVC] = None, criterion: Optional[str] = None, feature_select: Optional[str] = None, criteria: Optional[str] = None, min_samples_split: Optional[int] = None, random_state=None, normalize=False)[source]
Bases:
object
Splits a dataset in two based on different criteria
- clf : SVC, optional
classifier, by default None
- criterion : str, optional
The function to measure the quality of a split (only used if max_features != num_features). Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain, by default None
- feature_select : str, optional
The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: "best": the sklearn SelectKBest algorithm is used in every node to choose the max_features best features. "random": the algorithm generates 5 candidates and chooses the best of them (max. info. gain). "trandom": the algorithm generates only one random combination. "mutual": chooses the best features w.r.t. their mutual information with the label. "cfs": applies Correlation-based Feature Selection. "fcbf": applies the Fast Correlation-Based Filter, by default None
- criteria : str, optional
Decides (only in the case of multiclass classification) which column (class) to use to split the dataset at a node. max_samples is incompatible with the 'ovo' multiclass_strategy, by default None
- min_samples_split : int, optional
The minimum number of samples required to split an internal node. 0 (default) for any, by default None
- random_state : optional
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls, by default None
- normalize : bool, optional
Whether standardization of features should be applied on each node with the samples that reach it, by default False
- Raises
- ValueError
clf has to be a sklearn estimator
- ValueError
criterion must be gini or entropy
- ValueError
criteria has to be max_samples or impurity
- ValueError
splitter must be in {random, best, mutual, cfs, fcbf}
- _distances(node: Splitter.Snode, data: numpy.ndarray) numpy.array [source]
Compute distances of the samples to the hyperplane of the node
- node : Snode
node containing the svm classifier
- data : np.ndarray
samples to compute distance to hyperplane
- np.array
array of shape (m, nc) with the distances of every sample to the hyperplane of every class. nc = # of classes
- static _entropy(y: numpy.array) float [source]
Compute entropy of a labels set
- y : np.array
set of labels
- float
entropy
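Conceptually, the entropy of a label set is the Shannon entropy of its class proportions; a small, self-contained sketch, not the library's exact implementation (the log base, for instance, may differ).

```python
import numpy as np

def entropy(y: np.ndarray) -> float:
    """Shannon entropy (base 2) of the class proportions in y."""
    _, counts = np.unique(y, return_counts=True)
    proportions = counts / counts.sum()
    return float(-(proportions * np.log2(proportions)).sum())

print(entropy(np.array([0, 0, 1, 1])))  # 1.0 (perfectly mixed)
print(entropy(np.array([0, 0, 0, 1])))  # ~0.81 (mostly one class)
```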
- static _fs_best(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]
Return the variables with the highest f-score
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _fs_cfs(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]
Correlation-based feature selection with max_features limit
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _fs_fcbf(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]
Fast Correlation-based Filter algorithm with max_features limit
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _fs_iwss(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]
Correlation-based feature selection based on IWSS with max_features limit
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- _fs_mutual(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]
Return the best features with mutual information with labels
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- _fs_random(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]
Return the best of five random feature set combinations
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _fs_trandom(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]
Return a random feature set combination
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (< number of features in dataset)
- tuple
indices of the features selected
- static _generate_spaces(features: int, max_features: int) list [source]
Generate at most 5 feature random combinations
- features : int
number of features in the dataset
- max_features : int
number of features in each combination
- list
list with up to 5 combinations of features randomly selected
- _get_subspaces_set(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]
Compute the indices of the features selected by the splitter depending on the self._feature_select hyperparameter
- dataset : np.array
array of samples
- labels : np.array
labels of the dataset
- max_features : int
number of features of the subspace (<= number of features in dataset)
- tuple
indices of the features selected
- _impurity(data: numpy.array, y: numpy.array) numpy.array [source]
Return the column of the dataset to be taken into account to split the dataset
- data : np.array
distances to the hyperplane of every class
- y : np.array
vector of labels (classes)
- np.array
column of the dataset to be taken into account to split the dataset
- static _max_samples(data: numpy.array, y: numpy.array) numpy.array [source]
Return the column of the dataset to be taken into account to split the dataset
- data : np.array
distances to the hyperplane of every class
- y : np.array
vector of labels (classes)
- np.array
column of the dataset to be taken into account to split the dataset
- _select_best_set(dataset: numpy.array, labels: numpy.array, features_sets: list) list [source]
Return the best set of features among feature_sets, the criterion is the information gain
- dataset : np.array
array of samples (# samples, # features)
- labels : np.array
array of labels
- features_sets : list
list of feature sets to check
- list
best feature set
- get_subspace(dataset: numpy.array, labels: numpy.array, max_features: int) tuple [source]
Return a subspace of the dataset of max_features length, selected depending on the hyperparameter
- dataset : np.array
array of samples (# samples, # features)
- labels : np.array
labels of the dataset
- max_features : int
number of features to form the subspace
- tuple
tuple with the dataset restricted to the selected features and the indices of the features selected
- information_gain(labels: numpy.array, labels_up: numpy.array, labels_dn: numpy.array) float [source]
Compute information gain of a split candidate
- labels : np.array
labels of the dataset
- labels_up : np.array
labels of one side
- labels_dn : np.array
labels on the other side
- float
information gain
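For reference, the textbook form of this quantity is the entropy of the parent set minus the size-weighted entropy of the two children; a self-contained sketch under that assumption (the library's own implementation may differ in details such as the log base).

```python
import numpy as np

def _entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, labels_up, labels_dn):
    # Parent entropy minus the size-weighted entropy of the two children
    n, n_up, n_dn = len(labels), len(labels_up), len(labels_dn)
    return _entropy(labels) - (n_up / n) * _entropy(labels_up) - (n_dn / n) * _entropy(labels_dn)

parent = np.array([0, 0, 1, 1])
print(information_gain(parent, parent[:2], parent[2:]))  # 1.0 for a perfect split
```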
- part(origin: numpy.array) list [source]
Split an array in two based on the indices (self._up) and its complement. partition has to be called first to establish the up indices
- origin : np.array
dataset to split
- list
list with the two splits of the array
- partition(samples: numpy.array, node: Splitter.Snode, train: bool)[source]
Set the criteria to split arrays. Compute the indices of the samples that should go to one side of the tree (up)
- samples : np.array
array of samples (# samples, # features)
- node : Snode
Node of the tree where the partition is going to be made
- train : bool
Train time - True / Test time - False
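Conceptually, the partition/part pair amounts to computing a boolean mask once and then applying it to several aligned arrays; a plain numpy sketch of that idea, not the class's actual code.

```python
import numpy as np

# partition(): decide once which samples go "up" (here, by the sign of the
# distance to the chosen hyperplane column)
distances = np.array([1.2, -0.3, 0.7, -2.1])
up = distances > 0

# part(): reuse that mask to split any aligned array into the two children
def part(origin: np.ndarray) -> list:
    return [origin[up], origin[~up]]

X = np.arange(8).reshape(4, 2)
y = np.array([0, 1, 0, 1])
X_up, X_dn = part(X)
y_up, y_dn = part(y)
print(X_up.shape, X_dn.shape)  # (2, 2) (2, 2)
```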