Random Forest Regression Accuracy different for Training set and Test set


Solution 1

Before answering your points, a comment: I see you are using a regressor with accuracy as the metric. But accuracy is a metric for classification problems; in regression models you usually use other metrics, such as Mean Squared Error (MSE). See here.

If you just switch to a more suitable metric, you may find that your model is not so bad (see the sketch below).
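
For illustration, a minimal sketch of computing regression metrics with scikit-learn (assuming the regressor and the train/test split from your code):

    from sklearn.metrics import mean_squared_error, r2_score

    # Predict on the held-out test set
    y_pred = regressor.predict(X_test)

    # Mean Squared Error: average squared difference between predictions and targets
    print("Test MSE =", mean_squared_error(y_test, y_pred))

    # R^2 (coefficient of determination): this is what regressor.score() actually returns
    print("Test R^2 =", r2_score(y_test, y_pred))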

I am going to reply to your questions anyway.

Why are the Training Accuracy and Test Accuracy so far away? This means that you overfitted your training samples: your model is very good at predicting the data in the training dataset, but unable to generalise. It is like having a model trained on a set of cat pictures which believes only those pictures are cats, and that all the other pictures of all the other cats are not. In fact, your score on the test set is ~0.5, which is little better than a random guess.

How do I know if this model is being over/under fitted? Exactly from the difference in score between the two sets. The closer they are to each other, the better the model is able to generalise. You already know what an overfit looks like. An underfit is generally recognisable by a low score on both sets.

Is Random Forest Regression the right model to use? If not, how do I determine the right model for this use-case? There is no single right model to use. Random Forest, and in general all tree-based models (LightGBM, XGBoost), are the Swiss army knife of machine learning when you are dealing with structured data, because of their simplicity and reliability. Models based on Deep Learning can perform better in theory, but are much more complex to set up.

How can I build a confusion matrix using the variables I have created? Confusion matrices can be created when you build a classification model, and are built on the output of your model. Since you are using a regressor, a confusion matrix does not make much sense.

How do I validate the performance of the model? In general, for a reliable validation of performance you split the data in three: you train on one part (a.k.a. the training set), tune the model on a second (a.k.a. the validation set; this is what you call the test set), and finally, when you are happy with the model and its hyper-parameters, you test it on the third (a.k.a. the test set, not to be confused with the one you call the test set). This last one tells you whether your model generalises well or not. This is because, when you choose and tune the model, you can also overfit the validation set (the one you call the test set), perhaps selecting a set of hyper-parameters which performs well only on that set. Also, you have to choose a reliable metric, and this depends both on the data and on the model. With regressions, MSE is pretty good. A sketch of such a three-way split is shown below.
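
For illustration, a minimal sketch of the three-way split with scikit-learn (the 60/20/20 proportions are my assumption, not part of the answer):

    from sklearn.model_selection import train_test_split

    # First split off the final test set (20% of the data)
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Then split the remainder into training and validation sets
    # (0.25 of the remaining 80% = 20% of the full data)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, random_state=0)

    # Tune hyper-parameters against (X_val, y_val);
    # touch (X_test, y_test) only once, at the very end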

Solution 2

With trees and ensembles, you have to be careful with some settings. In your case, the difference comes from "overfitting". That means your model has learned your training data "too well" and is not able to generalise to other data.

One important thing to do is to limit the depth of the trees. Every tree has a branching factor of 2, which means that at depth d you will have 2^d branches.

Let's imagine you have 1000 training samples. If you don't limit the depth (and/or min_samples_leaf), a tree can memorise your complete dataset with a depth of 10 (because 2^10 = 1024 > N_training).

What you can do is compare training accuracy and test accuracy over a range of depths (from, let's say, 3 to log2(n)). If the depth is too low, both scores will be low, because you need more branches to learn the data properly; the test score will rise to a peak, after which the training score continues to rise while the test score goes down. It should look something like the following picture, where Model Complexity is your depth (a code sketch follows the figure).

[Figure: training and test score versus model complexity (tree depth); the test score peaks, then falls as the model overfits]
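
For illustration, a minimal sketch of such a depth sweep (assuming X_train, X_test, y_train, y_test from the question's code; note that score() returns R^2 for a regressor, not classification accuracy):

    import math
    from sklearn.ensemble import RandomForestRegressor

    # Depth at which a single tree could memorise the whole training set: log2(n)
    depth_limit = math.ceil(math.log2(len(X_train)))

    for depth in range(3, depth_limit + 1):
        regressor = RandomForestRegressor(n_estimators=100, max_depth=depth, random_state=50)
        regressor.fit(X_train, y_train.ravel())
        print("depth =", depth,
              " train R^2 =", round(regressor.score(X_train, y_train), 3),
              " test R^2 =", round(regressor.score(X_test, y_test), 3))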

You can also play with min_samples_split and/or min_samples_leaf, which let you split a node only if there are enough samples in that branch. This also limits the depth, and it allows trees with a different depth per branch. As explained above, you can tune these values to look for the best ones (with a grid search); a sketch follows.
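
For illustration, a minimal sketch of such a grid search with scikit-learn's GridSearchCV (the parameter ranges are my assumption, chosen only for the example):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "max_depth": [4, 6, 8, 10],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 5],
    }

    # Cross-validated search over the grid; for a regressor,
    # the default scoring is the estimator's R^2
    search = GridSearchCV(RandomForestRegressor(n_estimators=100, random_state=50),
                          param_grid, cv=5)
    search.fit(X_train, y_train.ravel())

    print("Best parameters:", search.best_params_)
    print("Best CV R^2:", search.best_score_)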

I hope this helps,


Comments

  • Code_Sipra almost 2 years

    I am new to Machine Learning and to Python. I am trying to build a Random Forest Regression model on one of the datasets from the UCI repository. This is my first ML model. I may be entirely wrong in my approach.

    The dataset is available here - https://archive.ics.uci.edu/ml/datasets/abalone

    Below is the entire working code that I have written. I am using Python 3.6.4 with Windows 7 x64 OS (forgive me for the lengthy code).

    import tkinter as tk # Required for enabling GUI options
    from tkinter import messagebox # Required for pop-up window
    from tkinter import filedialog # Required for getting full path of file
    import pandas as pd # Required for data handling
    from sklearn.model_selection import train_test_split # Required for splitting data into training and test set
    from sklearn.ensemble import RandomForestRegressor # Required to build random forest
    
    #------------------------------------------------------------------------------------------------------------------------#
    # Create an instance of tkinter and hide the window
    
    root = tk.Tk() # Create an instance of tkinter
    root.withdraw() # Hides root window
    #root.lift() # Required for pop-up window management
    root.attributes("-topmost", True) # To make pop-up window stay on top of all other windows
    
    #------------------------------------------------------------------------------------------------------------------------#
    # This block of code reads input file using tkinter GUI options
    
    print("Reading input file...")
    
    # Pop up window to ask user the input file
    File_Checker = messagebox.askokcancel("Random Forest Regression Prompt",
                                          "At The Prompt, Enter 'Abalone_Data.csv' File.")
    
    # Kill the execution if user selects "Cancel" in the above pop-up window
    if (File_Checker == False):
        quit()
    else:
        del(File_Checker)
    
    file_loop = 0
    
    while (file_loop == 0):
        # Get path of base file
        file_path =  filedialog.askopenfilename(initialdir = "/",
                                                title = "File Selection Prompt",
                                                filetypes = (("CSV Files","*.*"), ))
    
        # Condition to check if user selected a file or not
        if (len(file_path) < 1):
        # Pop-up window to warn user that no file was selected
            result = messagebox.askretrycancel("File Selection Prompt Error",
                                               "No file has been selected. \nWhat do you want to do?")
    
            # Condition to repeat the loop or quit program execution
            if (result == True):
                continue
            else:
                quit()
    
        # Get file name
        file_name = file_path.split("/") # Splits the file with "/" as the delimiter and returns a list
        file_name = file_name[-1] # extracts the last element of the list
    
        # Condition to check if correct file was selected or not
        if (file_name != "Abalone_Data.csv"):
            result = messagebox.askretrycancel("File Selection Prompt Error",
                                               "Incorrect file selected. \nWhat do you want to do?")
    
            # Condition to repeat the loop or quit program execution
            if (result == True):
                continue
            else:
                quit()
    
        # Read the base file
        input_file = pd.read_csv(file_path,
                                 sep = ',',
                                 encoding = 'utf-8',
                                 low_memory = True)
    
        break
    
    # Delete variables that are no longer needed
    del(file_loop, file_name)
    
    #------------------------------------------------------------------------------------------------------------------------#
    print("Preparing dependent and independent variables...")
    
    # Create Separate dataframe consisting of only dependent variable
    y = pd.DataFrame(input_file['Rings'])
    
    # Create Separate dataframe consisting of only independent variable
    X = input_file.drop(columns = ['Rings'], inplace = False, axis = 1)
    
    #------------------------------------------------------------------------------------------------------------------------#
    print("Handling Dummy Variable Trap...")
    
    # Create a new dataframe to handle categorical data
    # This method splits the categorical data column into separate columns
    # This is to ensure we get rid of the dummy variable trap
    dummy_Sex = pd.get_dummies(X['Sex'], prefix = 'Sex', prefix_sep = '_', drop_first = True)
    
    # Remove the specific columns from the dataframe
    # These are the categorical data columns which were split into separate columns in the previous step
    X.drop(columns = ['Sex'], inplace = True, axis = 1)
    
    # Merge the new columns to the original dataframe
    X = pd.concat([X, dummy_Sex], axis = 1)
    
    #------------------------------------------------------------------------------------------------------------------------#
    y = y.values 
    X = X.values
    
    #------------------------------------------------------------------------------------------------------------------------#
    print("Splitting datasets to training and test sets...")
    
    # Splitting the data into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
    
    #------------------------------------------------------------------------------------------------------------------------#
    print("Fitting Random Forest regression on training set")
    
    # Fitting the regression model to the dataset
    regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
    regressor.fit(X_train, y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message
    
    #------------------------------------------------------------------------------------------------------------------------#
    print("Predicting Values")
    
    # Predicting a new result with regression
    y_pred = regressor.predict(X_test)
    
    # Enter values for new prediction as a Dictionary
    test_values = {'Sex_I' : 0,
                   'Sex_M' : 0,
                   'Length' : 0.5,
                   'Diameter' : 0.35,
                   'Height' : 0.8,
                   'Whole_Weight' : 0.223,
                   'Shucked_Weight' : 0.09,
                   'Viscera_Weight' : 0.05,
                   'Shell_Weight' : 0.07}
    
    # Convert dictionary into dataframe
    test_values = pd.DataFrame(test_values, index = [0])
    
    # Rearranging columns as required
    test_values = test_values[['Length','Diameter','Height','Whole_Weight','Shucked_Weight','Viscera_Weight',
                               'Shell_Weight', 'Sex_I', 'Sex_M']]
    
    # Applying feature scaling
    #test_values = sc_X.transform(test_values)
    
    # Predicting values of new data
    new_pred = regressor.predict(test_values)
    
    #------------------------------------------------------------------------------------------------------------------------#
    """
    print("Building Confusion Matrix...")
    
    # Making the confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    """
    #------------------------------------------------------------------------------------------------------------------------#
    print("\n")
    print("Getting Model Accuracy...")
    
    # Get regression details
    #print("Estimated Coefficient = ", regressor.coef_)
    #print("Estimated Intercept = ", regressor.intercept_)
    print("Training Accuracy = ", regressor.score(X_train, y_train))
    print("Test Accuracy = ", regressor.score(X_test, y_test))
    
    print("\n")
    print("Printing predicted result...")
    print("Result_of_Treatment = ", new_pred)
    

    When I look at the model accuracy, below is what I get.

    Getting Model Accuracy...
    Training Accuracy =  0.9359702279804791
    Test Accuracy =  0.5695080680053354
    

    Below are my questions. 1) Why are the Training Accuracy and Test Accuracy so far away?

    2) How do I know if this model is being over/under fitted?

    3) Is Random Forest Regression the right model to use? If not, how do I determine the right model for this use-case?

    4) How can I build a confusion matrix using the variables I have created?

    5) How do I validate the performance of the model?

    I am looking for your guidance so that I too can learn from my mistakes and improve on my modelling skills.

    • Tim Biegeleisen about 6 years
      AFAIK we generally don't speak of training vs. test with random forests, because we do not cross validate random forests (and if you're doing this, then your approach is probably wrong). Rather, a random forest just has a single accuracy metric, maybe a few of them, such as the GINI index, which do not depend on training vs. test.
    • Tim Biegeleisen about 6 years
      If the forest you have built does not predict your data well, then yes you may consider other methods.
    • Code_Sipra about 6 years
      Thanks for the inputs @Tim Biegeleisen. I believe you have given me some valuable inputs on the Random Forest model. Let me go back and work on my model. The other question I have is, do we need to handle the dummy variable trap? I have done that in my code, but I am not sure if it's even required.
    • Kristada673 over 5 years
      The documentation of sklearn states the following about the score: score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction. So, I don't think it's accurate to say Training Accuracy = 0.9359702279804791 and Test Accuracy = 0.5695080680053354, as that would imply Training Accuracy = 93.6% and Test Accuracy = 56.9%. Am I right?
  • Code_Sipra about 6 years
    Thank you @Vincenzo Lavorini for taking the time to answer all my questions! Your inputs are really helpful. I recreated a new model using multiple linear regression and I do see the adjusted R-squared coming to ~0.54 for both the training and test sets. Would this mean that model is better suited to this specific dataset than Random Forest? Also, do we need to handle the dummy variable trap if I ever use Random Forest? Which model would you use, and why? Looking forward to your inputs.
  • Vincenzo Lavorini about 6 years
    I am not able to reply to this; you should compare the two models with the same score. A fast and easy check: just use the sklearn R^2 metric function at the end of the random forest model, something like print(r2_score(y_test, regressor.predict(X_test))). For the dummy variables: yes, if you have categorical variables it is always a good idea to transform them into dummies.
  • Nicolas M. about 6 years
    To add some information about the question how do I determine the right model for this use-case?: you can always create a baseline with a linear model, as in the sketch below. If you already reach a good result, that means your problem is linear and you don't "need" to try more complex models (SVM, NN). If the result is bad, you can try non-linear models: SVM (small datasets) or NN. Trees work more by region-cutting, so they are also good for data that is not linearly separable. You have a good explanation here jakevdp.github.io/PythonDataScienceHandbook/… with visual overfitting.
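
    For illustration, a minimal sketch of such a linear baseline (assuming the X_train/X_test split from the question's code; the name baseline is mine):

        from sklearn.linear_model import LinearRegression

        # Fit a plain linear model as a baseline for comparison
        baseline = LinearRegression()
        baseline.fit(X_train, y_train)

        print("Baseline train R^2 =", baseline.score(X_train, y_train))
        print("Baseline test R^2  =", baseline.score(X_test, y_test))
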
  • Code_Sipra about 6 years
    Thanks @Nicolas M. I tried using grid search to get the best values. My only doubt is, won't this over-fit the model?
  • Code_Sipra about 6 years
    Thanks everyone. I built 3 different models, one using Random Forest regression, one using multiple linear regression and one using support vector regression. All of them give me an accuracy score between 0.48 and 0.55 when I use K-fold cross-validation. I took the value of K as 10. Does this mean that the dataset itself doesn't have variables with a strong relationship with the dependent variable? Or is this accuracy good enough for a model?
  • Vincenzo Lavorini about 6 years
    You mean R^2, not accuracy, right? Anyway, I checked around. There is someone who wrote three detailed posts on abalone ring prediction here. An R^2 of ~0.55 is good with this dataset!
  • Code_Sipra about 6 years
    Yes, forgive me for the incorrect terminology; I'm still getting the hang of things. And thank you for the guidance.
  • Nicolas M. about 6 years
    What you should do first is a classic train/test split (80%-20% or 90%-10%, depending on your dataset size). Then you do the CV on your train set and find the optimal parameters that give the highest validation score. Then you can confirm those parameters by fitting your model again on your complete train set and evaluating it on the test set (never used during the CV). All results should be close if your learning is OK. A sketch of this workflow follows.
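
    For illustration, a minimal sketch of this workflow (the 10-fold CV and the max_depth grid are assumptions chosen for the example, not values from the thread):

        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split, GridSearchCV

        # Classic train/test split; the test set is never touched during CV
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

        # Cross-validate on the train set only, to pick hyper-parameters
        search = GridSearchCV(RandomForestRegressor(n_estimators=100, random_state=50),
                              {"max_depth": [4, 6, 8, 10]}, cv=10)
        search.fit(X_train, y_train.ravel())

        # GridSearchCV refits on the full train set by default (refit=True);
        # now confirm the chosen parameters on the held-out test set
        print("Best params:", search.best_params_)
        print("CV R^2:", search.best_score_)
        print("Test R^2:", search.score(X_test, y_test))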