Exploring Basic Ensemble Learning With Python and scikit-learn in 4 Easy Steps

December 8, 2023

Ensemble learning is based on a simple yet powerful premise: by combining the predictions from multiple models, we can often produce a more accurate and reliable output than any single model could. This approach is particularly effective in reducing overfitting, where a model performs well on training data but poorly on unseen data.

Types of Ensemble Methods

  • Bagging (Bootstrap Aggregating): This method involves creating multiple models, each trained on a random subset of the training data. The final prediction is typically an average (for regression problems) or a majority vote (for classification problems). A classic example is the Random Forest algorithm.
  • Boosting: Boosting algorithms train models sequentially, with each model learning from the errors of its predecessors. This approach aims to convert weak learners into a strong learner. Examples include AdaBoost and Gradient Boosting.
  • Stacking: Stacking involves training multiple models and then training a new model to aggregate their predictions. The second-level model learns to make a final prediction based on the predictions of the first-level models; a short sketch follows this list.
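
For the curious, scikit-learn wires up both levels of stacking for you via StackingClassifier. Here is a minimal, self-contained sketch; the choice of base models and final estimator below is illustrative, not prescriptive:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# First-level models produce the predictions...
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),
]

# ...and a second-level logistic regression learns how to combine them.
# StackingClassifier fits the final estimator on cross-validated
# predictions from the base models.
stacking_clf = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())
stacking_clf.fit(X, y)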

Implementing Ensemble Learning with scikit-learn

Let’s explore ensemble learning through a simple, practical example using the scikit-learn library in Python. We’ll use a dataset for a classification problem, applying both bagging and boosting techniques.

Setting Up the Environment

pip install scikit-learn

Example: Ensemble Learning for Classification

We’ll use the famous Iris dataset, which ships with scikit-learn. The task is to classify iris plants into three species based on four measurements (sepal and petal length and width).

Step 1: Import Libraries and Load Dataset

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
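
If you want to confirm what was loaded, a quick optional check prints the shape and class names:

# Optional: inspect the data we just loaded
print(X.shape)            # (150, 4) -- 150 samples, 4 features
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']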

Step 2: Split the Dataset

Split the dataset into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 3: Apply Bagging (Random Forest)

We’ll use a Random Forest classifier as our bagging model:

# Create and train the Random Forest model
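# Each of the 100 trees is trained on a bootstrap sample of the training data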
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions and evaluate the model
rf_predictions = rf_classifier.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_predictions)}")

Step 4: Apply Boosting (AdaBoost)

Next, we’ll use AdaBoost:

# Create and train the AdaBoost model
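# By default, the weak learners are decision stumps (depth-1 decision trees)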
ab_classifier = AdaBoostClassifier(n_estimators=100, random_state=42)
ab_classifier.fit(X_train, y_train)

# Make predictions and evaluate the model
ab_predictions = ab_classifier.predict(X_test)
print(f"AdaBoost Accuracy: {accuracy_score(y_test, ab_predictions)}")

Comparing the Results

After running the above code, you’ll get accuracy scores for both the Random Forest and AdaBoost models. On a small, well-separated dataset like Iris, both typically score close to perfect; on harder problems, ensemble methods often (though not always) outperform a single model, depending on the complexity of the problem and the nature of the data.
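
To make that comparison concrete, you can train a single decision tree on the same split as a baseline. This sketch is an addition to the example above and reuses X_train, X_test, y_train, and y_test from Step 2:

from sklearn.tree import DecisionTreeClassifier

# Single-model baseline on the same train/test split
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

dt_predictions = dt_classifier.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt_predictions)}")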

Conclusion

Of course, this is a very gentle first step, suitable for new developers or those curious about machine learning. There are many excellent primers that explore the topic in greater detail.

Ensemble learning offers a robust approach to improving machine learning model performance. By leveraging the strengths of multiple models, it’s possible to achieve greater predictive accuracy and reliability. The example provided using scikit-learn illustrates how straightforward it is to implement these powerful techniques. As with any machine learning approach, it’s crucial to understand the underlying principles and to carefully tune and validate your models to avoid overfitting and ensure that they generalize well to new data.
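
One simple way to validate more carefully is k-fold cross-validation, which averages scores over several train/test splits instead of relying on a single one. A minimal sketch using cross_val_score (the 5-fold setting is just an example), reusing X and y from Step 1:

from sklearn.model_selection import cross_val_score

# Evaluate the Random Forest with 5-fold cross-validation on the full dataset
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)
print(f"Mean CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")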

Ensemble learning continues to be a dynamic and evolving field, with ongoing research and development offering new insights and methodologies. By staying informed and experimenting with different approaches, practitioners can harness the full potential of ensemble methods in their machine learning projects. Beginners curious about other machine learning techniques should read our article about retrieval-augmented generation (RAG).
