Using sample weights for training xgboost (0.7) classifier

The problem is that the sklearn API does not propagate weights to the evaluation datasets.

So you seem to be doomed to use the native API. Just replace the lines starting with your model definition with the following code:

from xgboost import train, DMatrix

trainDmatrix = DMatrix(X_traintest, label=y_traintest, weight=traintest_sample_weight)
validDmatrix = DMatrix(X_valid, label=y_valid, weight=valid_sample_weight)

# 'binary:logistic' matches the default objective of the XGBClassifier this replaces
booster = train({'objective': 'binary:logistic', 'eval_metric': 'auc'},
                trainDmatrix, num_boost_round=100,
                evals=[(trainDmatrix, 'train'), (validDmatrix, 'valid')],
                early_stopping_rounds=50, verbose_eval=10)
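
If you then want to get the early-stopping results back out of the returned booster, a minimal sketch (the best_* attributes below are as in the 0.7x native API, and are only set when early stopping actually triggers):

# Best round found by early stopping on the last entry of evals ('valid')
print(booster.best_iteration, booster.best_score)

# Predicted probabilities on the validation set, using only trees up to the best round
valid_pred = booster.predict(validDmatrix, ntree_limit=booster.best_ntree_limit)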

UPD: The xgboost community is aware of this, and there is a discussion and even a PR for it: https://github.com/dmlc/xgboost/issues/1804. However, the fix never made it into v0.71 for some reason.

UPD2: After pinging that issue, the relevant code update was revived and the PR was merged into master in time for the upcoming xgboost 0.72 release on 1 June 2018.
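
For reference, once that fix shipped you can stay with the sklearn wrapper and pass weights for the evaluation sets too. A minimal sketch, assuming xgboost >= 0.72 and its sample_weight_eval_set fit parameter (one weight array per eval_set tuple):

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_traintest, y_traintest,
          sample_weight=traintest_sample_weight,
          eval_set=[(X_valid, y_valid)],
          sample_weight_eval_set=[valid_sample_weight],  # weights for each eval_set entry
          eval_metric="auc", early_stopping_rounds=50, verbose=True)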

Comments

  • Ernie Halberg almost 2 years

    I am trying to use sample_weight in XGBClassifier to improve the performance of one of our models.

    However, it seems like the sample_weight parameter is not working as expected. sample_weight is very important for this problem. Please see my code below.

    Basically, the fitting of the model does not seem to take the sample_weight parameter into account – it starts at an AUC of 0.5 and drops from there, recommending 0 or 1 n_estimators. There is nothing wrong with the underlying data – we have constructed a very good model with sample weights using another tool, getting a good Gini.

    The sample data provided does not properly exhibit this behavior, but given a consistent random seed throughout, we can see that the model objects are identical whether a weight/sample_weight is provided or not.

    I have tried different components from the xgboost library that similarly have parameters where one can define weights, but no luck:

    XGBClassifier.fit()
    XGBClassifier.train()
    Xgboost()
    XGB.fit()
    XGB.train()
    Dmatrix()
    XGBGridSearchCV()
    

    I have also tried passing fit_params=fit_params as a parameter, as well as the weight=weight and sample_weight=sample_weight variations.

    Code:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier
    
    df = pd.DataFrame(columns=['GB_FLAG','sample_weight','f1','f2','f3','f4','f5'])
    df.loc[0] = [0,1,2046,10,625,8000,2072]
    df.loc[1] = [0,0.86836,8000,10,705,8800,28]
    df.loc[2] = [1,1,2303.62,19,674,3000,848]
    df.loc[3] = [0,0,2754.8,2,570,16300,46]
    df.loc[4] = [1,0.103474,11119.81,6,0,9500,3885]
    df.loc[5] = [1,0,1050.83,19,715,3000,-5]
    df.loc[6] = [1,0.011098,7063.35,11,713,19700,486]
    df.loc[7] = [0,0.972176,6447.16,18,681,11300,1104]
    df.loc[8] = [1,0.054237,7461.27,18,0,0,4]
    df.loc[9] = [0,0.917026,4600.83,8,0,10400,242]
    df.loc[10] = [0,0.670026,2041.8,21,716,11000,3]
    df.loc[11] = [1,0.112416,2413.77,22,750,4600,271]
    df.loc[12] = [0,0,251.81,17,806,3800,0]
    df.loc[13] = [1,0.026263,20919.2,17,684,8100,1335]
    df.loc[14] = [0,1,1504.58,15,621,6800,461]
    df.loc[15] = [0,0.654429,9227.69,4,0,22500,294]
    df.loc[16] = [0,0.897051,6960.31,22,674,5400,188]
    df.loc[17] = [1,0.209862,4481.42,18,745,11600,0]
    df.loc[18] = [0,1,2692.96,22,651,12800,2035]
    
    y = np.asarray(df['GB_FLAG'])
    X = np.asarray(df.drop(['GB_FLAG'], axis=1))
    
    X_traintest, X_valid, y_traintest, y_valid = train_test_split(
        X, y, train_size=0.7, stratify=y, random_state=1337)
    traintest_sample_weight = X_traintest[:,0]
    valid_sample_weight = X_valid[:,0]
    
    X_traintest = X_traintest[:,1:]
    X_valid = X_valid[:,1:]
    
    model = XGBClassifier()
    eval_set = [(X_valid, y_valid)]
    model.fit(X_traintest, y_traintest, eval_set=eval_set, eval_metric="auc",
              early_stopping_rounds=50, verbose=True,
              sample_weight=traintest_sample_weight)
    

    How do I use sample weights when using xgboost for modeling?