Using sample weights for training xgboost (0.7) classifier

The problem is that the sklearn API does not propagate weights to the evaluation datasets.

So you seem to be doomed to use the native API. Just replace the lines starting with your model definition with the following code:

from xgboost import train, DMatrix

trainDmatrix = DMatrix(X_traintest, label=y_traintest, weight=traintest_sample_weight)
validDmatrix = DMatrix(X_valid, label=y_valid, weight=valid_sample_weight)

# 'binary:logistic' matches the default objective of the XGBClassifier this replaces
booster = train({'objective': 'binary:logistic', 'eval_metric': 'auc'},
                trainDmatrix, num_boost_round=100,
                evals=[(trainDmatrix, 'train'), (validDmatrix, 'valid')],
                early_stopping_rounds=50, verbose_eval=10)
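
If you then want to get the early-stopping results back out of the returned booster, a minimal sketch (the best_* attributes below are as in the 0.7x native API, and are only set when early stopping actually triggers):

# Best round found by early stopping on the last entry of evals ('valid')
print(booster.best_iteration, booster.best_score)

# Predicted probabilities on the validation set, using only trees up to the best round
valid_pred = booster.predict(validDmatrix, ntree_limit=booster.best_ntree_limit)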

UPD: The xgboost community is aware of this, and there is a discussion and even a PR for it: https://github.com/dmlc/xgboost/issues/1804. However, the fix never made it into v0.71 for some reason.

UPD2: After pinging that issue, the relevant code update was revived and the PR was merged into master in time for the upcoming xgboost 0.72 release on 1 June 2018.
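
For reference, once that fix shipped you can stay with the sklearn wrapper and pass weights for the evaluation sets too. A minimal sketch, assuming xgboost >= 0.72 and its sample_weight_eval_set fit parameter (one weight array per eval_set tuple):

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_traintest, y_traintest,
          sample_weight=traintest_sample_weight,
          eval_set=[(X_valid, y_valid)],
          sample_weight_eval_set=[valid_sample_weight],  # weights for each eval_set entry
          eval_metric="auc", early_stopping_rounds=50, verbose=True)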

Comments

  • Ernie Halberg almost 2 years

    I am trying to use sample_weight in XGBClassifier to improve the performance of one of our models.

    However, it seems like the sample_weight parameter is not working as expected. sample_weight is very important for this problem. Please see my code below.

    Basically, the fitting of the model does not seem to take the sample_weight parameter into account – it starts at an AUC of 0.5 and drops from there, recommending 0 or 1 n_estimators. There is nothing wrong with the underlying data – we have constructed a very good model with sample weights using another tool, getting a good Gini.

    The sample data provided does not properly exhibit this behavior, but given a consistent random seed throughout, we can see that the model objects are identical whether a weight/sample_weight is provided or not.

    I have tried different components from the xgboost library that similarly have parameters where one can define weights, but no luck:

    XGBClassifier.fit()
    XGBClassifier.train()
    Xgboost()
    XGB.fit()
    XGB.train()
    Dmatrix()
    XGBGridSearchCV()
    

    I have also tried passing fit_params=fit_params as a parameter, as well as the weight=weight and sample_weight=sample_weight variations.

    Code:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier
    
    df = pd.DataFrame(columns=['GB_FLAG','sample_weight','f1','f2','f3','f4','f5'])
    df.loc[0] = [0,1,2046,10,625,8000,2072]
    df.loc[1] = [0,0.86836,8000,10,705,8800,28]
    df.loc[2] = [1,1,2303.62,19,674,3000,848]
    df.loc[3] = [0,0,2754.8,2,570,16300,46]
    df.loc[4] = [1,0.103474,11119.81,6,0,9500,3885]
    df.loc[5] = [1,0,1050.83,19,715,3000,-5]
    df.loc[6] = [1,0.011098,7063.35,11,713,19700,486]
    df.loc[7] = [0,0.972176,6447.16,18,681,11300,1104]
    df.loc[8] = [1,0.054237,7461.27,18,0,0,4]
    df.loc[9] = [0,0.917026,4600.83,8,0,10400,242]
    df.loc[10] = [0,0.670026,2041.8,21,716,11000,3]
    df.loc[11] = [1,0.112416,2413.77,22,750,4600,271]
    df.loc[12] = [0,0,251.81,17,806,3800,0]
    df.loc[13] = [1,0.026263,20919.2,17,684,8100,1335]
    df.loc[14] = [0,1,1504.58,15,621,6800,461]
    df.loc[15] = [0,0.654429,9227.69,4,0,22500,294]
    df.loc[16] = [0,0.897051,6960.31,22,674,5400,188]
    df.loc[17] = [1,0.209862,4481.42,18,745,11600,0]
    df.loc[18] = [0,1,2692.96,22,651,12800,2035]
    
    y = np.asarray(df['GB_FLAG'])
    X = np.asarray(df.drop(['GB_FLAG'], axis=1))
    
    X_traintest, X_valid, y_traintest, y_valid = train_test_split(
        X, y, train_size=0.7, stratify=y, random_state=1337)
    traintest_sample_weight = X_traintest[:,0]
    valid_sample_weight = X_valid[:,0]
    
    X_traintest = X_traintest[:,1:]
    X_valid = X_valid[:,1:]
    
    model = XGBClassifier()
    eval_set = [(X_valid, y_valid)]
    model.fit(X_traintest, y_traintest, eval_set=eval_set, eval_metric="auc",
              early_stopping_rounds=50, verbose=True,
              sample_weight=traintest_sample_weight)
    

    How do I use sample weights when using xgboost for modeling?