Bayesian Optimization is a hyperparameter tuning method that builds a probabilistic model of the objective to find good hyperparameters efficiently. Combined with LightGBM, a fast gradient boosting framework, it automates the search for settings such as learning rate, tree depth and number of leaves, and it typically needs far fewer model fits than grid or random search to reach comparable performance.

Bayesian Optimization
- Bayesian Optimization is a technique for optimizing complex functions that are costly to evaluate, which makes it especially useful for tuning hyperparameters in machine learning.
- Unlike grid search, which tests every combination on a fixed grid, or random search, which samples combinations without using earlier results, Bayesian Optimization builds a probabilistic model of the objective function.
- This surrogate model predicts how different hyperparameter values are likely to perform, and an acquisition function determines which point to sample next, balancing exploration of uncertain regions against exploitation of promising ones.
- By learning from each trial, Bayesian Optimization can often find near-optimal hyperparameters with far fewer evaluations, which saves significant computation time.
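The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: the surrogate is a small zero-mean Gaussian process with an RBF kernel, the acquisition function is an upper confidence bound (one common choice, not necessarily what any particular library uses), and `objective`, `bayes_opt` and the other names are hypothetical. The toy objective stands in for an expensive validation score.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, jitter=1e-4):
    """Posterior mean and standard deviation of the surrogate at x_new.

    Recomputed from scratch for every candidate: clear, not fast.
    """
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)
    v = solve(K, k)
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, v)), 0.0)
    return mean, math.sqrt(var)

def objective(x):
    """Toy stand-in for an expensive score; its maximum is at x = 0.3."""
    return -(x - 0.3) ** 2

def bayes_opt(n_iter=8, kappa=2.0):
    """Run a few surrogate-guided trials on the interval [0, 1]."""
    xs = [0.0, 0.5, 1.0]                      # initial trials
    ys = [objective(x) for x in xs]
    candidates = [i / 100 for i in range(101)]
    for _ in range(n_iter):
        untried = [c for c in candidates if c not in xs]

        def ucb(c):
            # High predicted score (exploitation) or high uncertainty (exploration)
            mean, std = gp_posterior(xs, ys, c)
            return mean + kappa * std

        x_next = max(untried, key=ucb)        # acquisition picks the next trial
        xs.append(x_next)
        ys.append(objective(x_next))          # the "expensive" evaluation
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs

best_x, best_y, history = bayes_opt()
print("Best x:", best_x, "score:", best_y)
print("Trials in order:", history)
```

Each iteration refits the surrogate on all trials so far and evaluates the objective only once, which is why the method suits objectives where a single evaluation (here, a cross-validated model fit) is the expensive part.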
Implementation
Step 1: Import Necessary Libraries
This installs the scikit-optimize library for Bayesian optimization and imports the essential libraries: pandas and numpy for data handling, lightgbm for the model, BayesSearchCV for hyperparameter tuning and scikit-learn modules for splitting data, encoding labels and evaluating accuracy. The %pip line is a notebook magic; in a plain terminal, run pip install scikit-optimize instead.
%pip install scikit-optimize
import pandas as pd
import numpy as np
import lightgbm as lgb
from skopt import BayesSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
Step 2: Load Dataset
This loads the Titanic dataset from a local zipped CSV file (archive.zip) into a DataFrame and prints the first few rows so you can quickly inspect the structure and content of the data. pandas can read a zip archive directly as long as it contains a single CSV file.
df = pd.read_csv('archive.zip')
# Print the first few rows to inspect the data
print("Original Data")
print(df.head())
Step 3: Preprocess Data
This code fills missing values in the Age column with the median age and fills missing Embarked entries with the most frequent value. Then it converts the categorical columns Sex and Embarked into numerical labels using label encoding to prepare the data for modeling.
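# Fill missing ages with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill missing ports of embarkation with the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Convert categorical columns to integer labels (LabelEncoder was imported in Step 1)
for col in ['Sex', 'Embarked']:
    df[col] = LabelEncoder().fit_transform(df[col])
print("\nAfter Preprocessing")
print(df.head())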
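To make the three operations concrete, here is a plain-Python sketch on a tiny hypothetical sample (the values are made up and are not taken from the dataset). It mirrors, conceptually, what median filling, mode filling and label encoding do; the real code above relies on pandas and scikit-learn.

```python
from statistics import median, mode

# Tiny hypothetical sample; None marks a missing value
ages = [22.0, None, 38.0, 26.0, None]
embarked = ['S', 'C', None, 'S', 'Q']

# Median imputation: the median is computed from the known values only
age_median = median(a for a in ages if a is not None)
ages_filled = [age_median if a is None else a for a in ages]

# Mode imputation: replace missing entries with the most frequent value
embarked_mode = mode(e for e in embarked if e is not None)
embarked_filled = [embarked_mode if e is None else e for e in embarked]

# Label encoding: sorted unique classes are mapped to 0, 1, 2, ...
classes = sorted(set(embarked_filled))
mapping = {label: code for code, label in enumerate(classes)}
embarked_encoded = [mapping[e] for e in embarked_filled]

print(ages_filled)       # [22.0, 26.0, 38.0, 26.0, 26.0]
print(mapping)           # {'C': 0, 'Q': 1, 'S': 2}
print(embarked_encoded)  # [2, 0, 2, 2, 1]
```

The integer codes carry no real ordering, which tree-based models such as LightGBM generally tolerate better than linear models do.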
Step 4: Select Features and Target
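This selects the columns used as model inputs and assigns them to X, while the target y is set to the '2urvived' column, which presumably indicates survival status. The names 'sibsp' and '2urvived' are written exactly as the code expects them in this copy of the dataset, so adjust them if your file uses different column names.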
features = ['Pclass', 'Sex', 'Age', 'sibsp', 'Parch', 'Fare', 'Embarked']
X = df[features]
y = df['2urvived']
Step 5: Train Test Split
This splits the dataset into training and testing sets, with 80% of the data used for training and 20% for testing. A fixed random seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Step 6: Define LightGBM and Hyperparameter Search Space
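This initializes a LightGBM classifier configured for binary classification with 200 trees and a fixed random seed. The search_spaces dictionary defines the ranges of the hyperparameters to be optimized with Bayesian search: tree complexity (num_leaves, max_depth, min_child_samples), learning rate and sampling parameters (subsample, colsample_bytree). Integer bounds define integer ranges, and 'log-uniform' samples the learning rate evenly across orders of magnitude. Note that LightGBM applies subsample only when subsample_freq is greater than 0; with the default of 0 used here, the subsample range does not affect the model.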
lgbm = lgb.LGBMClassifier(objective='binary', n_estimators=200, random_state=42)
search_spaces = {
    'num_leaves': (20, 100),
    'max_depth': (3, 12),
    'learning_rate': (0.01, 0.3, 'log-uniform'),
    'min_child_samples': (5, 100),
    'subsample': (0.5, 1.0, 'uniform'),
    'colsample_bytree': (0.5, 1.0, 'uniform')
}
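The difference between the 'uniform' and 'log-uniform' priors is easiest to see by sampling. The following is a minimal sketch using only the standard library; `sample_log_uniform` and `share_below` are hypothetical helpers, and the sketch assumes that log-uniform means the logarithm of the value is uniformly distributed over the range.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def share_below(draws, threshold):
    """Fraction of draws that fall below the threshold."""
    return sum(d < threshold for d in draws) / len(draws)

rng = random.Random(42)
low, high = 0.01, 0.3   # same bounds as the learning_rate range above
n = 10_000

uniform_draws = [rng.uniform(low, high) for _ in range(n)]
log_draws = [sample_log_uniform(low, high, rng) for _ in range(n)]

# How often does each prior propose a small learning rate (< 0.05)?
share_uniform = share_below(uniform_draws, 0.05)
share_log = share_below(log_draws, 0.05)
print("uniform:    ", round(share_uniform, 3))
print("log-uniform:", round(share_log, 3))
```

In expectation, a uniform prior proposes a learning rate below 0.05 about 14% of the time, while a log-uniform prior does so about 47% of the time, so small learning rates get a fair share of the trials.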
Step 7: Setup Bayesian Optimization
This sets up Bayesian optimization with cross-validation to tune the LightGBM model's hyperparameters. It runs 30 iterations, scores each candidate by accuracy under 3-fold cross-validation, and uses all CPU cores for parallel processing.
opt = BayesSearchCV(
    estimator=lgbm,
    search_spaces=search_spaces,
    n_iter=30,
    cv=3,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
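To get a feel for the cost of this configuration, the sketch below shows how 3-fold cross-validation partitions the rows and how many model fits the search implies. It is a simplified illustration: `kfold_indices` is a hypothetical helper that uses plain contiguous folds, whereas scikit-learn uses stratified folds when an integer cv is given for a classifier, and the count assumes the default behaviour of refitting the best configuration once on the full training set.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k (train, validation) index pairs."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, val))
        start = stop
    return folds

# Ten hypothetical rows, three folds
folds = kfold_indices(10, 3)
for train, val in folds:
    print("train:", train, "validate:", val)

# Cost of the search configured above
n_iter, cv = 30, 3
search_fits = n_iter * cv       # one fit per candidate per fold
total_fits = search_fits + 1    # plus one final refit of the best candidate
print("Model fits:", total_fits)  # 91
```

Every row is used for validation exactly once per candidate, so each of the 30 candidates is scored on all of the training data without ever being scored on rows it was fitted on.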
Step 8: Train the Model
This starts the Bayesian optimization process by fitting the model on the training data, searching the defined hyperparameter space for the settings that give the best cross-validated accuracy.
# Run the Bayesian Optimization process to find the best hyperparameters
opt.fit(X_train, y_train)
Step 9: Check Best Results
This prints the highest cross-validated accuracy achieved during optimization, along with the hyperparameter values that produced it.
print("Best CV Accuracy:", round(opt.best_score_, 4))
print("Best Hyperparameters:", opt.best_params_)
Output:
Best CV Accuracy: 0.7975
Best Hyperparameters: OrderedDict([('colsample_bytree', 1.0), ('learning_rate', 0.01), ('max_depth', 3), ('min_child_samples', 83), ('num_leaves', 20), ('subsample', 0.5835858766104881)])
Step 10: Evaluate on Test Data
This extracts the best model found by the optimizer, uses it to predict labels on the test set and calculates the accuracy of these predictions to evaluate how well the model generalizes to unseen data.
best_model = opt.best_estimator_
y_pred = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", round(test_acc, 4))
Output:
Test Set Accuracy: 0.7786