Splitter
========

class stree.Splitter(clf: Optional[sklearn.svm._classes.SVC] = None, criterion: Optional[str] = None, feature_select: Optional[str] = None, criteria: Optional[str] = None, min_samples_split: Optional[int] = None, random_state=None, normalize=False)

    Bases: object
    _distances(node: stree.Strees.Snode, data: numpy.ndarray) → numpy.array

        Compute the distances of the samples to the hyperplane of the node.

        Parameters:
            node (Snode) – node containing the svm classifier
            data (np.ndarray) – samples to compute the distance to the hyperplane of

        Returns:
            np.array – array of shape (m, nc) with the distances of every sample to the hyperplane of every class, where nc is the number of classes
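    The shape of the result can be sketched with plain numpy. For a linear classifier in one-vs-rest form, the signed distance of each sample to each class hyperplane is (X @ W + b) / ||w||; the weights, intercepts and samples below are made-up toy values, not anything from the library:

    ```python
    import numpy as np

    # Toy one-vs-rest setup: n = 2 features, nc = 3 classes, m = 2 samples.
    X = np.array([[1.0, 2.0], [3.0, 4.0]])              # (m, n) samples
    W = np.array([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]])   # (n, nc) one weight column per class
    b = np.array([0.5, -0.5, 0.0])                      # (nc,) intercepts

    norms = np.linalg.norm(W, axis=0)                   # ||w|| of each class hyperplane
    distances = (X @ W + b) / norms                     # (m, nc) signed distances
    ```

    The result has one column per class, matching the (m, nc) shape documented above.
    
    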
    static _entropy(y: numpy.array) → float

        Compute the entropy of a set of labels.

        Parameters:
            y (np.array) – set of labels

        Returns:
            float – entropy
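    A minimal sketch of what such an entropy helper computes — the Shannon entropy of a label vector (base 2 here; the library may use a different log base):

    ```python
    import numpy as np

    def entropy(y: np.ndarray) -> float:
        # Relative frequency of each class, then H = -sum(p * log2(p))
        _, counts = np.unique(y, return_counts=True)
        proba = counts / counts.sum()
        return float(-(proba * np.log2(proba)).sum())

    balanced = entropy(np.array([0, 0, 1, 1]))  # two equally likely classes -> 1.0
    pure = entropy(np.array([1, 1, 1, 1]))      # a single class -> 0.0
    ```
    
    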
    static _generate_spaces(features: int, max_features: int) → list

        Generate at most five random combinations of features.

        Parameters:
            features (int) – number of features in the dataset
            max_features (int) – number of features in each combination

        Returns:
            list – list with up to five combinations of randomly selected features
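    A hedged sketch of this behaviour — build index combinations of size max_features out of the dataset's features and keep at most five, chosen at random (the function name and the seeding are illustrative, not the library's code):

    ```python
    import itertools
    import random

    def generate_spaces(features: int, max_features: int, seed: int = 0) -> list:
        # All size-max_features index combinations, capped at 5 random picks
        combos = list(itertools.combinations(range(features), max_features))
        rng = random.Random(seed)
        if len(combos) > 5:
            combos = rng.sample(combos, 5)
        return combos

    spaces = generate_spaces(features=5, max_features=3)  # C(5, 3) = 10, capped at 5
    ```
    
    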
    _get_subspaces_set(dataset: numpy.array, labels: numpy.array, max_features: int) → tuple

        Compute the indices of the features selected by the splitter, depending on the self._feature_select hyperparameter.

        Parameters:
            dataset (np.array) – array of samples
            labels (np.array) – labels of the dataset
            max_features (int) – number of features of the subspace (<= number of features in the dataset)

        Returns:
            tuple – indices of the selected features
    _impurity(data: numpy.array, y: numpy.array) → numpy.array

        Return the column of the dataset to be used to split the dataset.

        Parameters:
            data (np.array) – distances to the hyperplane of every class
            y (np.array) – vector of labels (classes)

        Returns:
            np.array – column of the dataset used to split the dataset
    static _max_samples(data: numpy.array, y: numpy.array) → numpy.array

        Return the column of the dataset to be used to split the dataset.

        Parameters:
            data (np.array) – distances to the hyperplane of every class
            y (np.array) – vector of labels (classes)

        Returns:
            np.array – column of the dataset used to split the dataset
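    One plausible reading of this selector, as a sketch: among the per-class distance columns, keep the column belonging to the class with the most samples. The function name and the assumption that class labels index the columns directly are mine, not the library's:

    ```python
    import numpy as np

    def max_samples_column(distances: np.ndarray, y: np.ndarray) -> np.ndarray:
        # Pick the distance column of the majority class (assumes labels 0..nc-1)
        classes, counts = np.unique(y, return_counts=True)
        selected = classes[np.argmax(counts)]
        return distances[:, selected]

    dist = np.array([[0.2, -1.0], [0.7, 0.3], [-0.1, 0.9]])  # (m, nc) distances
    y = np.array([0, 0, 1])                                  # class 0 has the most samples
    column = max_samples_column(dist, y)
    ```
    
    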
    _select_best_set(dataset: numpy.array, labels: numpy.array, features_sets: list) → list

        Return the best set of features among features_sets; the criterion is the information gain.

        Parameters:
            dataset (np.array) – array of samples (# samples, # features)
            labels (np.array) – array of labels
            features_sets (list) – list of feature sets to check

        Returns:
            list – best feature set
    get_subspace(dataset: numpy.array, labels: numpy.array, max_features: int) → tuple

        Return a subspace of the selected dataset of max_features length, depending on the hyperparameter.

        Parameters:
            dataset (np.array) – array of samples (# samples, # features)
            labels (np.array) – labels of the dataset
            max_features (int) – number of features to form the subspace

        Returns:
            tuple – the dataset restricted to the selected features, and the indices of the selected features
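    The shape of the returned pair can be sketched with numpy; the random choice below merely stands in for the criterion-driven selection described above:

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    dataset = rng.normal(size=(6, 4))        # 6 samples, 4 features
    max_features = 2

    # Pick max_features distinct feature indices, then slice those columns
    indices = sorted(rng.choice(4, size=max_features, replace=False).tolist())
    subspace = dataset[:, indices]           # (6, max_features)
    ```
    
    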
    information_gain(labels: numpy.array, labels_up: numpy.array, labels_dn: numpy.array) → float

        Compute the information gain of a split candidate.

        Parameters:
            labels (np.array) – labels of the dataset
            labels_up (np.array) – labels of one side of the split
            labels_dn (np.array) – labels of the other side of the split

        Returns:
            float – information gain
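    A sketch of the standard information-gain formula this kind of method computes: ig = H(labels) - (|up|/n · H(up) + |dn|/n · H(dn)), with H the label entropy:

    ```python
    import numpy as np

    def entropy(y: np.ndarray) -> float:
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def information_gain(labels, labels_up, labels_dn) -> float:
        # Parent entropy minus the size-weighted entropy of the two children
        n = labels.shape[0]
        return entropy(labels) - (
            labels_up.shape[0] / n * entropy(labels_up)
            + labels_dn.shape[0] / n * entropy(labels_dn)
        )

    # A perfect split of a balanced two-class set gains one full bit
    ig = information_gain(np.array([0, 0, 1, 1]), np.array([0, 0]), np.array([1, 1]))
    ```
    
    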
    part(origin: numpy.array) → list

        Split an array in two based on the indices (self._up) and its complement. partition has to be called first to establish the up indices.

        Parameters:
            origin (np.array) – dataset to split

        Returns:
            list – list with the two splits of the array
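    A minimal sketch of this index-and-complement split, with made-up data standing in for the up indices a prior partition call would have set:

    ```python
    import numpy as np

    origin = np.array([10, 20, 30, 40, 50])
    up = np.array([0, 2, 4])                 # indices going to the "up" side

    # Boolean mask for the up side; its negation is the complement
    mask = np.zeros(origin.shape[0], dtype=bool)
    mask[up] = True
    splits = [origin[mask], origin[~mask]]   # [up side, down side]
    ```
    
    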
    partition(samples: numpy.array, node: stree.Strees.Snode, train: bool)

        Set the criteria to split arrays. Compute the indices of the samples that should go to one side of the tree (up).

        Parameters:
            samples (np.array) – array of samples (# samples, # features)
            node (Snode) – node of the tree where the partition is going to be made
            train (bool) – True at train time, False at test time
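    The partition idea can be sketched as follows: once the criterion step has reduced the per-class distances to one value per sample, the sign of that value decides the side of the tree each sample goes to. The distances below are made-up values:

    ```python
    import numpy as np

    distances = np.array([0.8, -0.3, 1.2, -0.9])  # one criterion value per sample

    # Non-negative distance -> "up" side of the tree, negative -> "down" side
    up = distances >= 0
    up_indices = np.nonzero(up)[0]
    down_indices = np.nonzero(~up)[0]
    ```
    
    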