Using sample weights for training xgboost (0.7) classifier
The problem is that the sklearn API does not propagate sample weights to the evaluation datasets, so the validation AUC is computed unweighted.
Until that is fixed, you need to use the native API. Replace the lines starting with your model
definition with the following code:
from xgboost import train, DMatrix

# Weights are attached to the DMatrix, so they are used for both training and evaluation
trainDmatrix = DMatrix(X_traintest, label=y_traintest, weight=traintest_sample_weight)
validDmatrix = DMatrix(X_valid, label=y_valid, weight=valid_sample_weight)

booster = train({'eval_metric': 'auc'}, trainDmatrix, num_boost_round=100,
                evals=[(trainDmatrix, 'train'), (validDmatrix, 'valid')],
                early_stopping_rounds=50, verbose_eval=10)
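To see why the evaluation weights matter, here is a minimal, dependency-free sketch of a weighted AUC. The function name weighted_auc and the toy data are made up for illustration; this is not xgboost's internal implementation, only the pairwise definition of AUC with each pair weighted by the product of its two sample weights.

```python
def weighted_auc(y_true, scores, weights=None):
    """Weighted ROC AUC: weighted share of (positive, negative) pairs
    that are ranked correctly, counting ties as half."""
    if weights is None:
        weights = [1.0] * len(y_true)
    pos = [(s, w) for y, s, w in zip(y_true, scores, weights) if y == 1]
    neg = [(s, w) for y, s, w in zip(y_true, scores, weights) if y == 0]
    total = sum(w for _, w in pos) * sum(w for _, w in neg)
    if total == 0:
        raise ValueError("both classes need a positive total weight")
    correct = 0.0
    for sp, wp in pos:
        for sn, wn in neg:
            if sp > sn:
                correct += wp * wn
            elif sp == sn:
                correct += 0.5 * wp * wn
    return correct / total


# Hypothetical toy data: the badly ranked positive (score 0.2) has weight 0.
y = [1, 1, 0, 0]
scores = [0.9, 0.2, 0.8, 0.1]
weights = [1.0, 0.0, 1.0, 1.0]

auc_unweighted = weighted_auc(y, scores)
auc_weighted = weighted_auc(y, scores, weights)
print(auc_unweighted, auc_weighted)
```

With zero or very small weights on some rows, as in the question's data, the unweighted and weighted metrics can disagree substantially, so early stopping driven by the unweighted metric can pick the wrong number of rounds.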
UPD: The xgboost community is aware of this, and there is a discussion and even a PR for it: https://github.com/dmlc/xgboost/issues/1804. However, for some reason the fix never made it into v0.71.
UPD2: After I pinged that issue, the relevant code update was revived and the PR was merged into master in time for the upcoming xgboost 0.72 release on 1 June 2018.
Ernie Halberg
Updated on June 26, 2022
Ernie Halberg almost 2 years
I am trying to use sample_weight in XGBClassifier to improve the performance of one of our models.

However, it seems like the sample_weight parameter is not working as expected. sample_weight is very important for this problem. Please see my code below.

Basically, the fitting of the model does not seem to take the sample_weight parameter into account: it starts at an AUC of 0.5 and drops from there, recommending 0 or 1 n_estimators. There is nothing wrong with the underlying data: we have constructed a very good model using sample weights in another tool, getting a good Gini.

The sample data provided does not properly exhibit this behavior, but given a consistent random seed throughout, we can see that the model objects are identical whether a weight/sample_weight is provided or not.

I have tried different components from the xgboost library that similarly have parameters where one can define weights, but no luck:
XGBClassifier.fit() XGBClassifier.train() Xgboost() XGB.fit() XGB.train() Dmatrix() XGBGridSearchCV()
I have also tried fit_params=fit_params as a parameter, as well as the weight=weight and sample_weight=sample_weight variations.

Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.DataFrame(columns = ['GB_FLAG','sample_weight','f1','f2','f3','f4','f5'])
df.loc[0] = [0,1,2046,10,625,8000,2072]
df.loc[1] = [0,0.86836,8000,10,705,8800,28]
df.loc[2] = [1,1,2303.62,19,674,3000,848]
df.loc[3] = [0,0,2754.8,2,570,16300,46]
df.loc[4] = [1,0.103474,11119.81,6,0,9500,3885]
df.loc[5] = [1,0,1050.83,19,715,3000,-5]
df.loc[6] = [1,0.011098,7063.35,11,713,19700,486]
df.loc[7] = [0,0.972176,6447.16,18,681,11300,1104]
df.loc[8] = [1,0.054237,7461.27,18,0,0,4]
df.loc[9] = [0,0.917026,4600.83,8,0,10400,242]
df.loc[10] = [0,0.670026,2041.8,21,716,11000,3]
df.loc[11] = [1,0.112416,2413.77,22,750,4600,271]
df.loc[12] = [0,0,251.81,17,806,3800,0]
df.loc[13] = [1,0.026263,20919.2,17,684,8100,1335]
df.loc[14] = [0,1,1504.58,15,621,6800,461]
df.loc[15] = [0,0.654429,9227.69,4,0,22500,294]
df.loc[16] = [0,0.897051,6960.31,22,674,5400,188]
df.loc[17] = [1,0.209862,4481.42,18,745,11600,0]
df.loc[18] = [0,1,2692.96,22,651,12800,2035]

y = np.asarray(df['GB_FLAG'])
X = np.asarray(df.drop(['GB_FLAG'], axis=1))

X_traintest, X_valid, y_traintest, y_valid = train_test_split(X, y, train_size=0.7, stratify=y, random_state=1337)

# Column 0 of X is sample_weight: split it off and keep only the features
traintest_sample_weight = X_traintest[:,0]
valid_sample_weight = X_valid[:,0]
X_traintest = X_traintest[:,1:]
X_valid = X_valid[:,1:]

model = XGBClassifier()
eval_set = [(X_valid, y_valid)]
model.fit(X_traintest, y_traintest, eval_set=eval_set, eval_metric="auc",
          early_stopping_rounds=50, verbose = True, sample_weight = traintest_sample_weight)
How do I use sample weights when using xgboost for modeling?