Regression, Decision Trees and Random Forests
Supervised Learning Concepts
- Linear Regression
- Logistic Regression
- Decision Trees and Random Forests
- Naive Bayes
- k-Nearest Neighbors (k-NN)
- Support Vector Machines (SVM)
- Gradient Boosting and AdaBoost
What is Supervised Learning?
Supervised learning is a type of machine learning in which the model is trained on a labelled dataset, that is, one where the target variable is known, so that it can make accurate predictions on new data. The aim of supervised learning is to learn a function that maps input variables to output variables. Regression and classification are the two primary subtypes of supervised learning.
Linear Regression:
Linear regression is a supervised learning approach for predicting a continuous target variable. The goal is to find a linear relationship between the input variables (also known as independent variables) and the output variable (also known as the dependent variable). Linear regression is used when the relationship between the input and output variables is approximately linear.
Algorithmic steps for Linear Regression:
- Initialize the weights and bias (for example, to zeros or small random values).
- Calculate the predicted output by multiplying the weights with the input variables and adding the bias.
- Calculate the error between the predicted output and the actual output.
- Update the weights and bias using gradient descent to minimize the error.
- Until the error is minimized, repeat steps 2-4.
Python code for Linear Regression:
python code
import numpy as np

class LinearRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Step 1: initialize the weights and bias
        self.weights = np.zeros(n_features)
        self.bias = 0
        for i in range(self.num_iterations):
            # Step 2: predicted output
            y_predicted = np.dot(X, self.weights) + self.bias
            # Steps 3-4: gradients of the mean squared error, then gradient-descent update
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        y_predicted = np.dot(X, self.weights) + self.bias
        return y_predicted
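As a quick sanity check, the class above can be fitted on a small synthetic dataset; the data and parameter values here are illustrative only.
python code
import numpy as np

# Synthetic data: y = 3x + 4 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 4 + rng.normal(0, 0.5, size=100)

model = LinearRegression(learning_rate=0.01, num_iterations=5000)
model.fit(X, y)
print(model.weights, model.bias)  # should be close to [3.] and 4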
Logistic Regression:
Logistic regression is a supervised learning approach for classification problems, where the output variable is categorical. The goal of logistic regression is to find a relationship between the input variables and the probability that the output variable belongs to a particular category. The logistic (sigmoid) function maps the model's output to a probability value between 0 and 1.
Algorithmic steps for Logistic Regression:
- Initialize the weights and bias (for example, to zeros or small random values).
- Calculate the predicted output by multiplying the weights with the input variables and adding the bias.
- Apply the logistic function to the predicted output to get the probability of the output variable belonging to a particular category.
- Calculate the error between the predicted probability and the actual label.
- Update the weights and bias using gradient descent to minimize the error.
- Repeat steps 2-5 until the error is minimized.
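These steps can also be sketched from scratch, mirroring the LinearRegression class shown earlier; only the logistic (sigmoid) function and the thresholded prediction are new. This is a minimal illustration under the same gradient-descent assumptions, not the scikit-learn implementation used in the next example.
python code
import numpy as np

class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)  # step 1
        self.bias = 0
        for _ in range(self.num_iterations):
            # Steps 2-3: linear combination, then the logistic function
            p = self._sigmoid(np.dot(X, self.weights) + self.bias)
            # Steps 4-5: gradient of the log loss, then gradient-descent update
            dw = (1 / n_samples) * np.dot(X.T, (p - y))
            db = (1 / n_samples) * np.sum(p - y)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        p = self._sigmoid(np.dot(X, self.weights) + self.bias)
        return (p >= 0.5).astype(int)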
Example of logistic regression using Python and the scikit-learn library:
python code
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
# Load data
data = pd.read_csv('data.csv')
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('label', axis=1), data['label'], test_size=0.2)
# Create and fit a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
This code loads a dataset from a CSV file, splits the data into training and testing sets, creates and fits a logistic regression model using scikit-learn, makes predictions on the testing set, and calculates the accuracy of the model.
Decision Trees and Random Forests
Decision Trees and Random Forests are popular algorithms in
Supervised Learning, particularly for classification tasks.
Decision Trees Algorithm:
The basic Decision Tree algorithm works by recursively splitting the data into subsets based on the values of attributes until a certain stopping criterion is met. This creates a tree-like model of decisions that can be used to make predictions on new data.
General algorithmic steps for Decision Trees are:
- Calculate the entropy (or Gini index) of the original dataset based on the target variable.
- For each attribute, calculate the information gain (or decrease in impurity) by splitting the dataset based on the values of that attribute.
- Choose the attribute with the highest information gain as the tree's root.
- Split the data into branches, one for each possible value of the selected attribute.
- Recursively apply steps 1-4 to each subset until a stopping criterion is met, such as reaching a certain depth or having a minimum number of examples in each leaf.
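As a concrete illustration of steps 1-2, the entropy and information-gain calculations can be written in a few lines of NumPy. This is only a sketch of the splitting criterion, not a full tree builder; the function names are illustrative.
python code
import numpy as np

def entropy(y):
    # Step 1: entropy of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, attribute_values):
    # Step 2: reduction in entropy from splitting y by one attribute's values
    total = entropy(y)
    weighted = sum(
        (np.sum(attribute_values == v) / len(y)) * entropy(y[attribute_values == v])
        for v in np.unique(attribute_values)
    )
    return total - weighted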
Example of the Decision Tree algorithm in Python using the scikit-learn library:
python code
from sklearn.tree import DecisionTreeClassifier
# Load data (load_data() and X_new below are placeholders for your own dataset)
X, y = load_data()
# Create a Decision Tree classifier object
dt = DecisionTreeClassifier()
# Train the model on the data
dt.fit(X, y)
# Make predictions on new data
y_pred = dt.predict(X_new)
Random Forests Algorithm:
Random Forests is an extension of the Decision Tree algorithm that builds multiple trees and combines their predictions to improve accuracy and reduce overfitting.
The general algorithmic steps for Random Forests are:
- Randomly select a subset of the original data (with replacement) to create a new dataset for each tree.
- For each tree, randomly select a subset of attributes to use when making splits.
- Build a decision tree for each new dataset using the selected attributes.
- Combine the predictions of all the trees to make a final prediction.
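Steps 1-2 (bootstrap sampling with replacement and random feature selection) can be sketched directly with NumPy; in practice, scikit-learn's RandomForestClassifier shown below handles both internally through its n_estimators and max_features parameters. The helper below is illustrative only.
python code
import numpy as np

def bootstrap_sample(X, y, n_features_subset, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: sample rows with replacement to build one tree's dataset
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: pick a random subset of feature columns for that tree
    cols = rng.choice(X.shape[1], size=n_features_subset, replace=False)
    return X[idx][:, cols], y[idx], cols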
Example of the Random Forests algorithm in Python using the scikit-learn library:
python code
from sklearn.ensemble import RandomForestClassifier
# Load data
X, y = load_data()
# Create Random Forest classifier object
rf = RandomForestClassifier()
# Train the model on the data
rf.fit(X, y)
# Make predictions on new data
y_pred = rf.predict(X_new)
Overall, Decision Trees and Random Forests are powerful algorithms for classification tasks that can handle both categorical and continuous data. They are relatively easy to interpret and can provide insights into the important features for making predictions.
Naive Bayes:
Naive Bayes is a probabilistic algorithm used for classification problems. It is based on Bayes' theorem, which states that the probability of a hypothesis H given evidence E is proportional to the probability of the evidence E given hypothesis H, multiplied by the prior probability of hypothesis H: P(H | E) = P(E | H) * P(H) / P(E). In other words, the algorithm calculates the probability of each class given the input features and selects the class with the highest probability; it is called "naive" because it assumes the features are conditionally independent given the class.
Algorithmic Steps:
- Prepare the data by converting it into a suitable format and dividing it into training and testing sets.
- Calculate the prior probabilities for each class by counting the number of instances of each class in the training set.
- Calculate the likelihood probabilities for each feature and each class by counting the number of instances of each feature for each class in the training set.
- For each instance in the testing set, calculate the probability of each class using the Bayes theorem.
- Select the class with the highest probability as the predicted class.
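As an illustration of the counting in steps 2-3, the class priors and the per-class feature statistics used by a Gaussian Naive Bayes model can be computed in a few lines; this sketch assumes X and y are NumPy arrays and is not the scikit-learn implementation shown below.
python code
import numpy as np

def gaussian_nb_parameters(X, y):
    classes = np.unique(y)
    # Step 2: prior probability of each class from its frequency in the training set
    priors = {c: np.mean(y == c) for c in classes}
    # Step 3: per-class feature means and variances, the parameters of the Gaussian likelihood
    stats = {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0)) for c in classes}
    return priors, stats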
Python Code:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Train a Gaussian Naive Bayes model
clf = GaussianNB()
clf.fit(X_train, y_train)
# Predict the classes of the testing set
y_pred = clf.predict(X_test)
# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
k-Nearest Neighbors (k-NN):
k-Nearest Neighbors is a non-parametric algorithm used for classification and regression tasks. It classifies an instance by finding the k nearest neighbours to that instance in the training set and selecting the class that is most common among the neighbours.
Algorithmic Steps:
- Prepare the data by converting it into a suitable format and dividing it into training and testing sets.
- Choose the value of k.
- For each instance in the testing set, find the k nearest neighbours in the training set based on a distance metric.
- Select the class that is most common among the neighbours as the predicted class.
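The neighbour search in steps 3-4 reduces to a distance computation and a majority vote. A minimal NumPy sketch, assuming Euclidean distance and NumPy-array inputs, is shown here; the scikit-learn example that follows does the same work via KNeighborsClassifier.
python code
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 3: Euclidean distance from the new point to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]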
Python Code:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Train a k-NN model with k=3
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
# Predict the classes of the testing set
y_pred = clf.predict(X_test)
# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
Support Vector Machines (SVM):
Support Vector Machines (SVM) are popular supervised learning algorithms for classification, regression, and outlier detection. SVM aims to find the optimal hyperplane in a high-dimensional space that maximally separates data points from different classes.
Algorithmic Steps:
- Input training data
- Select a kernel function and kernel parameters
- Build the kernel matrix based on the training data
- Define the optimization problem for finding the optimal hyperplane
- Use a suitable optimization algorithm to solve the optimization problem.
- Compute the decision boundary and predict the class of new data points based on their position relative to the boundary
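Step 3, building the kernel matrix, can be made concrete with scikit-learn's pairwise kernels. The sketch below assumes an RBF kernel on a tiny toy dataset and simply shows the matrix the optimizer would work with; the data and gamma value are illustrative.
python code
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# K[i, j] = exp(-gamma * ||x_i - x_j||^2), one entry per pair of training points
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
K = rbf_kernel(X_toy, X_toy, gamma=0.5)
print(K.shape)  # (3, 3)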
Here is an example of using the SVM algorithm in Python with the scikit-learn library:
Python Code:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')
# Train the classifier using the training data
clf.fit(X_train, y_train)
# Predict the classes of the testing data
y_pred = clf.predict(X_test)
# Print the accuracy of the classifier
print("Accuracy:", clf.score(X_test, y_test))
Gradient Boosting and AdaBoost:
Gradient Boosting and AdaBoost are ensemble learning methods used for classification and regression problems. These algorithms combine the predictions of several weak learners to form a strong learner.
Algorithmic Steps:
- Input training data
- Initialize the ensemble with a weak learner
- Train the weak learner on the training data
- Compute the error of the weak learner on the training data
- Update the weights of the training examples based on the error of the weak learner
- Repeat steps 2 through 5 for a predetermined number of iterations or until a stopping criterion is satisfied.
- Compute the final predictions of the ensemble by combining the predictions of all weak learners
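Step 5, the example reweighting that distinguishes AdaBoost, can be sketched with a decision stump as the weak learner. The sketch below shows a single boosting round only, assumes labels encoded as -1/+1, and uses illustrative names; it is not the scikit-learn implementation used in the examples that follow.
python code
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_round(X, y, sample_weights):
    # Train a decision stump (weak learner) on the weighted data
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=sample_weights)
    pred = stump.predict(X)
    # Weighted error and the learner's weight (alpha); y and pred are -1/+1
    err = np.sum(sample_weights * (pred != y)) / np.sum(sample_weights)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    # Increase the weights of misclassified examples, decrease the rest
    new_weights = sample_weights * np.exp(-alpha * y * pred)
    return stump, alpha, new_weights / new_weights.sum()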
Here is an example of using the Gradient Boosting algorithm in Python with the scikit-learn library:
python code
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Create a Gradient Boosting classifier with 100 estimators
clf = GradientBoostingClassifier(n_estimators=100)
# Train the classifier using the training data
clf.fit(X_train, y_train)
# Predict the classes of the testing data
y_pred = clf.predict(X_test)
# Print the accuracy of the classifier
print("Accuracy:", clf.score(X_test, y_test))
Example of the AdaBoost algorithm in Python with the scikit-learn library:
python code
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Create an AdaBoost classifier with 100 estimators
clf = AdaBoostClassifier(n_estimators=100)
# Train the classifier on the training data
clf.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = clf.predict(X_test)
# Evaluate the performance of the classifier
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Feature importances
print("Feature importances:", clf.feature_importances_)
# Visualize decision boundaries (only for two-dimensional datasets)
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
if X.shape[1] == 2:
    plot_decision_regions(X, y, clf, legend=2)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.title("Decision boundaries with AdaBoost")
    plt.show()