Getting Started
Installation
From PyPI
pip install xrf
From conda-forge
conda install conda-forge::xrf
Quickstart
Classification forests
Let us start by importing the tic-tac-toe dataset from openml.org.
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import OneHotEncoder
dataset = fetch_openml(name="tic-tac-toe", parser="auto")
y = dataset.target.values
X = OneHotEncoder().fit_transform(dataset.data.values).toarray()
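For intuition about what the encoding step produces, the sketch below one-hot encodes two made-up boards with NumPy alone. It assumes that each of the nine squares takes one of the values "x", "o", or "b" (blank); the boards and the names boards, categories, and X_demo are hypothetical, and OneHotEncoder remains the right tool in practice.

```python
import numpy as np

# Hypothetical board rows; each of the 9 squares holds "x", "o", or "b" (blank).
boards = np.array([
    list("xxxoobbob"),
    list("oxoxoxxbb"),
])
categories = np.array(["b", "o", "x"])  # assumed category values, sorted

# Compare every square with every category: shape (n_boards, 9, 3).
one_hot = (boards[:, :, None] == categories[None, None, :]).astype(float)

# Flatten to one row per board: 9 squares x 3 categories = 27 columns.
X_demo = one_hot.reshape(len(boards), -1)
print(X_demo.shape)
```

Each row contains exactly nine ones, one per square, which is why the encoded matrix is much wider than the original nine columns.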
Let us split the dataset into a training and a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75)
Let us now fit an explainable random forest classifier, which accepts the same parameters as the standard random forest classifier in scikit-learn.
from xrf import XRandomForestClassifier
rfx = XRandomForestClassifier(n_jobs=-1)
rfx.fit(X_train, y_train)
Predictions are obtained in the usual way, using either predict or predict_proba. Without further arguments, the output is exactly the same as that of the standard random forest classifier in scikit-learn.
rfx.predict_proba(X_test)
array([[0.05, 0.95],
       [0.56, 0.44],
       [0.4 , 0.6 ],
       ...,
       [0.21, 0.79],
       [0.17, 0.83],
       [0.59, 0.41]])
We may now limit the number of training examples involved in each prediction, e.g., to at most 5, by passing k.
rfx.predict_proba(X_test, k=5)
array([[0.        , 1.        ],
       [0.85416634, 0.14583366],
       [0.34500622, 0.65499378],
       ...,
       [0.27464175, 0.72535825],
       [0.12693503, 0.87306497],
       [1.        , 0.        ]])
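To illustrate what limiting a prediction to k examples could mean, here is a minimal sketch of one plausible scheme: keep the k largest example weights and rescale them to sum to 1. This is an assumption made for illustration, not a description of how xrf selects examples internally; the weight vector and all names are hypothetical.

```python
import numpy as np

# Hypothetical weights of one test example, one entry per training example.
full_weights = np.array([0.05, 0.30, 0.10, 0.25, 0.02, 0.08, 0.20])
k = 3

# Indices of the k largest weights, in descending order of weight.
top = np.argsort(full_weights)[::-1][:k]

# Rescale the retained weights so that they sum to 1 again.
top_weights = full_weights[top] / full_weights[top].sum()

print(top)          # indices of the retained training examples
print(top_weights)  # their renormalized weights
```

Under such a scheme, discarding low-weight examples shifts the prediction toward the labels of the few retained examples, which would be consistent with the more extreme probabilities seen with k=5.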
The example attributions are obtained by setting return_examples and return_weights to True.
predictions, examples, weights = rfx.predict_proba(X_test, k=5,
                                                   return_examples=True,
                                                   return_weights=True)
Here, examples contains the indices of the training objects involved in each prediction, and weights contains the corresponding weights.
examples
array([[ 26, 131,  40, 193, 169],
       [ 48, 121,  52, 164,   6],
       [203, 176, 213, 110,  99],
       ...,
       [ 52, 167, 194, 175,  53],
       [104,  71,  20,  35, 122],
       [ 33,  47, 188, 228, 120]])
weights
array([[0.23050922, 0.21026052, 0.19812573, 0.18882078, 0.17228375],
       [0.24554293, 0.20930998, 0.20651394, 0.19279949, 0.14583366],
       [0.2935989 , 0.25979051, 0.21957101, 0.12543522, 0.10160437],
       ...,
       [0.27464175, 0.23320384, 0.19853987, 0.15467345, 0.13894108],
       [0.32220957, 0.21056097, 0.20287181, 0.13742261, 0.12693503],
       [0.26857466, 0.20863132, 0.20008477, 0.18560888, 0.13710037]])
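The weights in each row sum to 1, and the k=5 probabilities shown earlier appear to be per-class sums of these weights. The sketch below illustrates this for the second test example. The class labels of the five training examples are an assumption chosen for illustration, since the page does not list them.

```python
import numpy as np

# Weights of the second test example, copied from the weights output above.
w = np.array([0.24554293, 0.20930998, 0.20651394, 0.19279949, 0.14583366])

# ASSUMPTION: class indices of the five attributed training examples.
# These labels are hypothetical; they are not shown on this page.
labels = np.array([0, 0, 0, 0, 1])

# Sum the weights per class to obtain class probabilities.
proba = np.bincount(labels, weights=w, minlength=2)
print(proba)
```

With these assumed labels, the result agrees with the second row of the k=5 output of predict_proba, [0.85416634, 0.14583366].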
Regression forests
Let us import the Miami housing dataset from openml.org.
from sklearn.datasets import fetch_openml
dataset = fetch_openml(name="miami_housing", parser="auto")
y = dataset.target.values
X = dataset.data.values
Let us split the dataset into a training and a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75)
Let us fit and apply an explainable random forest regressor without constraining the number of training examples involved in the predictions.
from xrf import XRandomForestRegressor
rfx = XRandomForestRegressor(n_jobs=-1)
rfx.fit(X_train, y_train)
rfx.predict(X_test)
array([492859., 193170., 260507., ..., 330824., 416856., 241969.])
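As background on why a forest prediction can be attributed to training examples at all, the following toy sketch shows that an average of leaf means can be rewritten as a weighted sum of training targets. It assumes trees built without bootstrap sampling, so that each leaf predicts the plain mean of the training targets it holds. The leaf assignments and all names are hypothetical, and this is not claimed to be how xrf computes its weights.

```python
import numpy as np

# Toy setup (all values hypothetical): 6 training targets and 3 "trees".
y_demo = np.array([10.0, 12.0, 20.0, 22.0, 30.0, 32.0])

# train_leaves[t, i] is the leaf that training example i falls into in tree t.
train_leaves = np.array([
    [0, 0, 1, 1, 2, 2],
    [0, 1, 1, 1, 2, 2],
    [0, 0, 0, 1, 1, 2],
])
# test_leaves[t] is the leaf reached by a single test example in tree t.
test_leaves = np.array([1, 1, 0])

# Per tree, spread a weight of 1 evenly over the training examples in that leaf.
same_leaf = train_leaves == test_leaves[:, None]            # (n_trees, n_train)
per_tree = same_leaf / same_leaf.sum(axis=1, keepdims=True)

# Average over trees: one weight per training example, summing to 1.
weights_demo = per_tree.mean(axis=0)

# The weighted sum of targets equals the average of the per-tree leaf means.
leaf_means = np.array([y_demo[mask].mean() for mask in same_leaf])
assert np.isclose(weights_demo @ y_demo, leaf_means.mean())
```

Training examples that never share a leaf with the test example receive a weight of zero, so only a subset of the training set contributes to any given prediction.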
We may now limit the number of training examples involved in each prediction, e.g., to at most 5, by passing k.
rfx.predict(X_test, k=5)
array([541411.11111111, 196994.81865285, 210900.81300813, ...,
       340516.66666667, 389410.25641026, 241550.27422303])
The example attributions are obtained by setting return_examples and return_weights to True.
predictions, examples, weights = rfx.predict(X_test, k=5,
                                             return_examples=True,
                                             return_weights=True)
We may check that each prediction equals the weighted sum of the targets of the training examples involved in it.
import numpy as np
weighted_predictions = np.sum([weights[i]*y_train[examples[i]]
                               for i in range(len(weights))], axis=1)
np.allclose(predictions, weighted_predictions)
True
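The same check can be written without a Python loop by using NumPy integer-array indexing. This is a sketch on toy stand-in arrays with hypothetical values; it assumes that examples and weights are rectangular arrays of shape (n_test, k), as in the outputs shown on this page.

```python
import numpy as np

# Toy stand-ins (hypothetical values) with the same layout as the arrays above.
y_train_demo = np.array([100.0, 200.0, 300.0, 400.0])
examples_demo = np.array([[0, 1], [2, 3], [1, 3]])               # (n_test, k) indices
weights_demo = np.array([[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]])  # (n_test, k) weights

# Integer-array indexing gathers the targets of the attributed examples: (n_test, k).
gathered = y_train_demo[examples_demo]

# Row-wise weighted sum, without a Python loop.
vectorized = np.sum(weights_demo * gathered, axis=1)

# Same result as the list-comprehension form used above.
looped = np.sum([weights_demo[i] * y_train_demo[examples_demo[i]]
                 for i in range(len(weights_demo))], axis=1)
assert np.allclose(vectorized, looped)
```

If the number of examples per prediction could differ between rows, the arrays would no longer be rectangular and the loop form would be the safer choice.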
You are welcome to download and try out xrf; you may find the following notebook helpful: