Bayesian Optimization with LightGBM

Bayesian Optimization is a hyperparameter tuning method that builds a probabilistic model of the objective to find good hyperparameters efficiently. Combined with LightGBM, a fast gradient boosting framework, it automates the search for settings such as the learning rate, tree depth and number of leaves, and it typically needs fewer model evaluations than grid or random search to reach comparable performance.

Bayesian Optimization

Bayesian Optimization is a technique for optimizing complex functions that are costly to evaluate, which makes it especially useful for tuning hyperparameters in machine learning.
Unlike grid search, which tests every combination on a predefined grid, or random search, which samples configurations without using earlier results, Bayesian Optimization builds a probabilistic model of the objective function.
This surrogate model predicts how different hyperparameter values are likely to perform, and an acquisition function decides which point to sample next, balancing exploration of uncertain regions against exploitation of promising ones.
By learning from each trial, Bayesian Optimization can often find near-optimal hyperparameters with far fewer evaluations, which saves significant computation time.
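The surrogate-plus-acquisition loop can be sketched in a few lines. This is a minimal illustration and not how scikit-optimize is implemented: it assumes a one-dimensional search space, a Gaussian-process surrogate with a fixed RBF kernel and an upper-confidence-bound acquisition function. The names toy_score, bayes_opt and kappa are hypothetical, and toy_score stands in for an expensive validation score.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, jitter=1e-6):
    """Posterior mean and std of a zero-mean, unit-variance GP at x_new."""
    k_oo = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_oo, k_on)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_on * solved, axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def toy_score(x):
    """Hypothetical stand-in for an expensive validation score; peak at x = 0.3."""
    return -(x - 0.3) ** 2

def bayes_opt(objective, n_init=3, n_iter=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)            # candidate values to choose from
    x_obs = rng.uniform(0.0, 1.0, size=n_init)   # a few random starting trials
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(n_iter):
        # Standardize scores so they match the GP's zero-mean, unit-variance prior.
        y_scaled = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-12)
        mean, std = gp_posterior(x_obs, y_scaled, grid)
        # Acquisition: high predicted score (exploit) or high uncertainty (explore).
        ucb = mean + kappa * std
        x_next = grid[np.argmax(ucb)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    best = int(np.argmax(y_obs))
    return x_obs[best], y_obs[best]

best_x, best_y = bayes_opt(toy_score)
print(f"best x = {best_x:.3f}, score = {best_y:.5f}")
```

Raising kappa in this sketch favors exploration of uncertain regions, while lowering it concentrates trials around the current best estimate.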

Implementation

Step 1: Import Necessary Libraries

This installs the scikit-optimize library for Bayesian optimization and imports the essential libraries: pandas and numpy for data handling, lightgbm for the model, BayesSearchCV for hyperparameter tuning and scikit-learn modules for splitting data, encoding labels and evaluating accuracy.

Python

%pip install scikit-optimize
import pandas as pd
import numpy as np
import lightgbm as lgb
from skopt import BayesSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

Step 2: Load Dataset

This loads the Titanic dataset from a zipped CSV file (archive.zip, expected in the working directory) into a DataFrame and prints the first few rows so you can quickly inspect the structure and content of the data.

Python

df = pd.read_csv('archive.zip')

# Print the first few rows to inspect the data
print("Original Data")
print(df.head())

Step 3: Preprocess Data

This code fills missing values in the Age column with the median age and fills missing Embarked entries with the most frequent value. Then it converts the categorical columns Sex and Embarked into numerical labels using label encoding to prepare the data for modeling.

Python

from sklearn.preprocessing import LabelEncoder

df['Age'] = df['Age'].fillna(df['Age'].median())

df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

for col in ['Sex', 'Embarked']:
    df[col] = LabelEncoder().fit_transform(df[col])

print("\nAfter Preprocessing")
print(df.head())
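To make the three preprocessing operations concrete, here is a small pure-Python sketch on made-up values (the ages and ports lists are hypothetical, not rows from the dataset). It mirrors median imputation, mode imputation and label encoding; the encoding assigns integers to categories in sorted order, which is the ordering LabelEncoder uses for its classes.

```python
from statistics import median

ages = [22.0, None, 35.0, 58.0, None]        # None marks a missing value
ports = ["S", "C", None, "S", "Q"]

# Median imputation: replace each missing age with the median of the known ages.
age_median = median(a for a in ages if a is not None)
ages_filled = [age_median if a is None else a for a in ages]

# Mode imputation: replace missing ports with the most frequent known port.
known_ports = [p for p in ports if p is not None]
port_mode = max(set(known_ports), key=known_ports.count)
ports_filled = [port_mode if p is None else p for p in ports]

# Label encoding: map each distinct category to an integer in sorted order.
classes = sorted(set(ports_filled))
mapping = {label: code for code, label in enumerate(classes)}
ports_encoded = [mapping[p] for p in ports_filled]

print(ages_filled)
print(mapping, ports_encoded)
```

The integer codes carry no real ordering between categories. Tree-based models such as LightGBM can usually split on them anyway, but the codes should not be read as magnitudes.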

Step 4: Select Features and Target

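This selects the columns used as model inputs and assigns them to X. The target variable y is set to the '2urvived' column, which appears to hold the survival label in this copy of the Titanic dataset. Column names such as '2urvived' and 'sibsp' must match the CSV header exactly, including the unusual spelling and capitalization.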

Python

features = ['Pclass', 'Sex', 'Age', 'sibsp', 'Parch', 'Fare', 'Embarked']
X = df[features]
y = df['2urvived']

Step 5: Train Test Split

This splits the dataset into training and testing sets, with 80% of the data used for training and 20% for testing. A fixed random seed makes the split reproducible.

Python

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
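The idea behind a seeded split can be shown without scikit-learn. The function below is a simplified, hypothetical stand-in (simple_train_test_split is not a library function, and its rounding of the test size may differ from train_test_split): it shuffles row indices with a seeded generator and cuts off the test share, so the same seed always yields the same split.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a seeded generator, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))
train_a, test_a = simple_train_test_split(rows)
train_b, test_b = simple_train_test_split(rows)

print(len(train_a), len(test_a))           # sizes of the two parts
print("same split twice:", test_a == test_b)
```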

Step 6: Define LightGBM and Hyperparameter Search Space

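This initializes a LightGBM classifier for binary classification with 200 trees and a fixed random seed. The search_spaces dictionary defines the ranges of the hyperparameters that Bayesian search will optimize: tree complexity (num_leaves, max_depth, min_child_samples), learning rate and sampling parameters (subsample, colsample_bytree). Note that LightGBM applies subsample only when subsample_freq is greater than 0, so with the default subsample_freq=0 used here the sampled subsample values do not change training.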

Python

lgbm = lgb.LGBMClassifier(objective='binary', n_estimators=200, random_state=42)

search_spaces = {
    'num_leaves': (20, 100),
    'max_depth': (3, 12),
    'learning_rate': (0.01, 0.3, 'log-uniform'),
    'min_child_samples': (5, 100),
    'subsample': (0.5, 1.0, 'uniform'),
    'colsample_bytree': (0.5, 1.0, 'uniform')
}
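The 'log-uniform' prior on learning_rate means values are drawn uniformly on a logarithmic scale, so small learning rates are explored as thoroughly as large ones. The sketch below illustrates that idea with the standard library only; sample_log_uniform and share_below are hypothetical helpers, not part of scikit-optimize.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def share_below(values, threshold):
    """Fraction of values that fall below the threshold."""
    return sum(v < threshold for v in values) / len(values)

rng = random.Random(0)
log_draws = [sample_log_uniform(0.01, 0.3, rng) for _ in range(10_000)]
uni_draws = [rng.uniform(0.01, 0.3) for _ in range(10_000)]

print("log-uniform share below 0.05:", share_below(log_draws, 0.05))
print("uniform share below 0.05:    ", share_below(uni_draws, 0.05))
```

In expectation, about log(5)/log(30) ≈ 47% of log-uniform draws fall below 0.05, compared with 0.04/0.29 ≈ 14% of uniform draws over the same range.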

Step 7: Setup Bayesian Optimization

This sets up Bayesian optimization with cross-validation to tune the LightGBM model's hyperparameters. It runs 30 iterations, scores each candidate by accuracy using 3-fold cross-validation and uses all CPU cores for parallel processing.

Python

opt = BayesSearchCV(
    estimator=lgbm,
    search_spaces=search_spaces,
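    n_iter=30,           # number of hyperparameter settings to evaluate
    cv=3,                # 3-fold cross-validation
    scoring='accuracy',
    random_state=42,     # reproducible search
    n_jobs=-1            # use all available CPU cores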
)
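To see what cv=3 and n_iter=30 imply for cost, the sketch below builds three validation folds by hand and counts model fits. It is a simplification: k_fold_indices is a hypothetical helper that uses contiguous, unshuffled folds and ignores the class stratification that scikit-learn applies by default to classifiers. The fit count assumes the default refit=True, which trains the best configuration once more on the full training set.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=3))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)

n_iter, cv = 30, 3
total_fits = n_iter * cv + 1   # +1 for the final refit on all training data
print("model fits:", total_fits)
```

Each candidate configuration is therefore trained three times, and its score is the mean accuracy over the three held-out folds.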

Step 8: Train the Model

This starts the Bayesian optimization process by fitting the model on the training data, searching the defined hyperparameter space for the settings that give the best cross-validated accuracy.

Python

# Run the Bayesian Optimization process to find the best hyperparameters
opt.fit(X_train, y_train)

Step 9: Check Best Results

This prints the highest cross-validated accuracy achieved during optimization along with the hyperparameter values that produced it.

Python

print("Best CV Accuracy:", round(opt.best_score_, 4))
print("Best Hyperparameters:", opt.best_params_)

Output:

Best CV Accuracy: 0.7975
Best Hyperparameters: OrderedDict([('colsample_bytree', 1.0), ('learning_rate', 0.01), ('max_depth', 3), ('min_child_samples', 83), ('num_leaves', 20), ('subsample', 0.5835858766104881)])

Step 10: Evaluate on Test Data

This extracts the best model found by the optimizer, uses it to predict labels on the test set and calculates the accuracy of these predictions to evaluate how well the model generalizes to unseen data.

Python

best_model = opt.best_estimator_
y_pred = best_model.predict(X_test)

test_acc = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", round(test_acc, 4))

Output:

Test Set Accuracy: 0.7786
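Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels (y_true_demo and y_pred_demo are hypothetical and unrelated to the results above) to show what the reported number measures.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true_demo = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 1, 1, 0, 0, 0, 0]

# 6 of the 8 predictions match, so the accuracy is 6 / 8.
print("Accuracy:", accuracy(y_true_demo, y_pred_demo))
```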
