
What is Unsupervised Learning

 Clustering and Principal Component Analysis

Unsupervised Learning Concepts:

  • Clustering Algorithms (K-Means, Hierarchical Clustering)
  • Principal Component Analysis (PCA)
  • Anomaly Detection
  • Model Evaluation and Selection
  • Model Performance Metrics
  • Cross-Validation Techniques
  • Hyperparameter Tuning
  • Model Selection Techniques


What is Unsupervised Learning?

Unsupervised learning is a type of machine learning in which the algorithm is trained on unlabeled data and must discover hidden patterns, structures, or relationships on its own, without any labelled examples to guide it. Clustering, which groups similar data points together, is one of the most common unsupervised learning techniques.

K-Means Clustering Algorithm:

The K-Means algorithm is a popular unsupervised learning technique used to partition data into K clusters. It assigns each data point to the cluster whose centroid is closest, measured by Euclidean distance.

The general algorithmic steps for K-Means clustering are:

  • Choose the number of clusters K.
  • Randomly initialize K centroids.
  • Assign each data point to the closest centroid.
  • Recalculate the centroid of each cluster.
  • Repeat steps 3 and 4 until the centroids no longer move noticeably.
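
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name kmeans, its parameters (n_iters, tol, seed), and the synthetic two-blob data are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain K-Means following the listed steps (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids have essentially stopped moving.
        moved = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if moved < tol:
            break
    return labels, centroids

# Two well-separated synthetic blobs, generated only for this illustration.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels_demo, centroids_demo = kmeans(X_demo, k=2)
print(centroids_demo)
```

In practice a library implementation is usually preferable, since it adds smarter initialization and several restarts.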

Example of K-Means clustering in Python using the scikit-learn library:

python code

from sklearn import datasets

from sklearn.cluster import KMeans

# Load the iris dataset

iris = datasets.load_iris()

X = iris.data

# Create a K-Means clustering model with 3 clusters (fixed seed for reproducible results)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Fit the model to the data

kmeans.fit(X)

# Predict the cluster labels for the data

labels = kmeans.predict(X)
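
Step 1, choosing K, is often the hardest part in practice. One common heuristic, shown here as a hedged sketch rather than a rule, is to fit K-Means for several values of K and compare silhouette scores, where higher values indicate better-separated clusters. The range of K values tried and the variable name silhouettes are assumptions for this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data

# Fit K-Means for several candidate values of K and record the silhouette score.
silhouettes = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    silhouettes[k] = silhouette_score(X, model.labels_)

for k, score in silhouettes.items():
    print(f"K={k}: silhouette={score:.3f}")
```

The score is only a guide: domain knowledge about how many groups are meaningful should also inform the choice.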



Hierarchical Clustering Algorithm:

Hierarchical clustering is another popular unsupervised learning method for clustering data. It works by creating a tree-like structure of nested clusters. Agglomerative and divisive hierarchical clustering are the two main forms. Agglomerative clustering is more commonly used, and it works by starting with each data point in its own cluster and iteratively merging the closest clusters until all data points belong to a single cluster.

The general algorithmic steps for agglomerative hierarchical clustering are:

  • Start with each data point in its own cluster.
  • Calculate the matrix of distances between each pair of clusters.
  • Create a new cluster by merging the two closest clusters.
  • Recompute the distance matrix between the new cluster and all other clusters.
  • Repeat steps 2-4 until all data points belong to a single cluster.

Example of hierarchical clustering in Python using the SciPy library:

python code

import matplotlib.pyplot as plt

from sklearn import datasets

from scipy.cluster.hierarchy import linkage, dendrogram

# Load the iris dataset

iris = datasets.load_iris()

X = iris.data

# Compute the linkage matrix using the Ward method

Z = linkage(X, method='ward')

# Plot the dendrogram

dendrogram(Z)

plt.show()
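
A dendrogram shows the full merge tree, but an analysis usually needs one flat cluster label per data point. A minimal sketch of one way to obtain them, assuming three clusters are wanted, is to cut the tree with SciPy's fcluster function. The variable name flat_labels is an assumption for this example.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn import datasets

X = datasets.load_iris().data

# Build the same Ward linkage matrix as in the dendrogram example.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain.
# fcluster numbers the clusters starting from 1.
flat_labels = fcluster(Z, t=3, criterion="maxclust")

print(sorted(set(flat_labels)))
```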

Principal Component Analysis (PCA):

Principal component analysis is an unsupervised machine learning method for reducing dimensionality. It is a statistical method that projects a dataset with many features onto a smaller number of new features (the principal components) while retaining as much of the original variance as possible. PCA is often used for exploratory data analysis, pattern recognition, and feature extraction.

Algorithmic Steps:

  • Standardize the data.
  • Compute the covariance matrix.
  • Calculate the covariance matrix's eigenvalues and eigenvectors.
  • Sort the eigenvectors by decreasing eigenvalues.
  • Choose the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimension of the new feature subspace.
  • Create the projection matrix W from the k chosen eigenvectors, one eigenvector per column.
  • Transform the standardized dataset X with W (Y = XW) to obtain the k-dimensional feature subspace Y.
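
The seven steps can be followed almost literally with NumPy. The sketch below is illustrative only and assumes the iris dataset and k = 2. The variable names (X_std, cov, W, Y) are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# Step 1: standardize each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the features (columns).
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh is meant for symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort the eigenvectors by decreasing eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5-6: keep the top k eigenvectors as the columns of the projection matrix W.
k = 2
W = eigvecs[:, :k]

# Step 7: project the standardized data onto the k-dimensional subspace.
Y = X_std @ W

print(Y.shape)  # (150, 2)
print("Share of variance kept:", eigvals[:k].sum() / eigvals.sum())
```

Library implementations may return components with flipped signs, and scikit-learn's PCA centers the data but does not scale it, so results match only if the input is standardized first.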

Example of PCA using the scikit-learn library in Python:

python code

from sklearn.datasets import load_iris

from sklearn.decomposition import PCA

# Load the iris dataset

iris = load_iris()

X = iris.data

# Create a PCA object and fit the data

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X)

# Print the first 5 rows of the transformed data

print(X_pca[:5])
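
To check how much information the reduced representation retains, one option is to inspect the fitted model's explained_variance_ratio_ attribute, which reports the share of total variance captured by each component. A short sketch using the same iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit a 2-component PCA and inspect how much variance each component explains.
pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_

print("Per-component share of variance:", ratios)
print("Total share retained:", ratios.sum())
```

If the total is too low for the task at hand, increasing n_components is the usual remedy.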

Anomaly Detection:

Anomaly detection is a type of unsupervised learning that identifies unusual patterns or observations in a dataset that do not conform to expected behaviour. Applications for it include fraud detection, intrusion detection, and preventative maintenance.

Algorithmic Steps:

  • Choose a suitable anomaly detection algorithm (e.g., Gaussian mixture models, isolation forests, or one-class SVM).
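  • Train the algorithm on data that represents normal behaviour, ideally only non-anomalous data. Some algorithms, such as isolation forests, also tolerate a small fraction of anomalies in the training set.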
  • Test the algorithm on a separate dataset that contains both normal and anomalous behaviour.
  • Identify any observations that the algorithm labels as anomalous.
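
The workflow above can be sketched with an isolation forest trained on normal data only and then applied to a mixed test set. All data here is synthetic and generated purely for illustration. The cluster locations and sample sizes are assumptions, not values from a real application.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Step 2: training data that contains only normal behaviour (synthetic).
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Step 3: a separate test set mixing normal points with obvious outliers.
X_test_normal = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X_test_outliers = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X_test = np.vstack([X_test_normal, X_test_outliers])

# Step 1: the chosen algorithm is an isolation forest.
model = IsolationForest(n_estimators=100, random_state=0)
model.fit(X_train)

# Step 4: predict() returns +1 for normal points and -1 for anomalies.
pred = model.predict(X_test)
anomalous_idx = np.where(pred == -1)[0]

print("Indices flagged as anomalous:", anomalous_idx)
```

Because the last ten test rows are the injected outliers, their indices (50 to 59) are expected to dominate the flagged set, possibly alongside a few borderline normal points.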

Example of anomaly detection using the isolation forest algorithm in the scikit-learn library in Python:

python code

from sklearn.datasets import make_blobs

from sklearn.ensemble import IsolationForest

# Generate a dataset with 1000 samples and 2 features

X, y = make_blobs(n_samples=1000, n_features=2, centers=3, random_state=42)

# Create an isolation forest object and fit the data

clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)

clf.fit(X)

# Predict the labels of the data

y_pred = clf.predict(X)

# Print the number of normal and anomalous samples

print(f"Normal samples: {sum(y_pred == 1)}, Anomalous samples: {sum(y_pred == -1)}")

Model Evaluation and Selection

Model evaluation and selection are crucial steps in machine learning. The goal is to evaluate the performance of different models and select the best one that fits the data and provides the best predictions. Model performance metrics are used to measure the performance of different models and to compare them.

Model Performance Metrics:

Model performance metrics quantify how good a model's predictions are. The choice of metric depends on the specific problem and the type of model being used. For classification models, commonly used performance metrics include:

Accuracy

Measures the proportion of all predictions that are correct.

Precision

Measures the proportion of positive predictions that are actually positive.

Recall

Measures the proportion of actual positive cases that are predicted as positive.

F1-score

The harmonic mean of precision and recall, which balances the two.

ROC curve and AUC

Used to evaluate the trade-off between true positive rate and false positive rate across classification thresholds.
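
To make these definitions concrete, the sketch below computes accuracy, precision, recall, and F1-score by hand from the confusion matrix of a small made-up set of binary labels, and compares them with scikit-learn's metric functions. The label vectors are invented for this illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up ground truth and predictions for a binary problem (1 = positive).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 7 / 10
precision = tp / (tp + fp)                          # 4 / 6
recall = tp / (tp + fn)                             # 4 / 5
f1 = 2 * precision * recall / (precision + recall)  # 8 / 11

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")

# The library functions give the same values.
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```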


Model Evaluation:

Model evaluation is the process of estimating the performance of a machine learning model on an independent dataset. The goal is to determine how well the model generalizes to new, unseen data. The most common technique for model evaluation is cross-validation, where the dataset is split into training and testing sets, and the model is trained on the training set and evaluated on the testing set. This process is repeated several times, with different splits of the data, to obtain a more robust estimate of the model's performance.

Example of how to use the scikit-learn library to evaluate the performance of a classification model using cross-validation and various performance metrics:

python code

from sklearn.datasets import load_iris

from sklearn.model_selection import cross_validate

from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset

iris = load_iris()

X = iris.data

y = iris.target

# Create a decision tree classifier

clf = DecisionTreeClassifier(random_state=42)

# Iris has three classes, so use macro-averaged metrics and one-vs-rest AUC

scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro', 'roc_auc_ovr']

# Evaluate the model using cross-validation (cross_validate accepts several metrics; cross_val_score accepts only one)

scores = cross_validate(clf, X, y, cv=5, scoring=scoring)

# Print the results

print("Accuracy: %0.2f (+/- %0.2f)" % (scores['test_accuracy'].mean(), scores['test_accuracy'].std() * 2))

print("Precision: %0.2f (+/- %0.2f)" % (scores['test_precision_macro'].mean(), scores['test_precision_macro'].std() * 2))

print("Recall: %0.2f (+/- %0.2f)" % (scores['test_recall_macro'].mean(), scores['test_recall_macro'].std() * 2))

print("F1-score: %0.2f (+/- %0.2f)" % (scores['test_f1_macro'].mean(), scores['test_f1_macro'].std() * 2))

print("AUC-ROC: %0.2f (+/- %0.2f)" % (scores['test_roc_auc_ovr'].mean(), scores['test_roc_auc_ovr'].std() * 2))

In this example, we load the iris dataset and create a decision tree classifier. We then use cross-validation to evaluate the model with several performance metrics. The cross_validate function returns a dictionary holding an array of per-fold scores for each metric (under keys such as test_accuracy), and we print the mean and spread of each.

Cross-Validation Techniques:

Cross-validation is a method for assessing a machine learning model's performance. It involves dividing the dataset into multiple subsets, where one subset is used for testing and the remaining subsets are used for training the model. This process is repeated so that each subset is used for testing exactly once, and the performance of the model is averaged across all the iterations.

Example: Let's say we have a dataset of housing prices, and we want to build a model to predict the price of a house based on its features like the number of bedrooms, bathrooms, square footage, etc. We can use k-fold cross-validation to evaluate the performance of our model. Here, k is the number of folds. For example, if we use 5-fold cross-validation, we divide the dataset into 5 equal subsets. We then train our model on 4 subsets and test it on the remaining subset. We repeat this process 5 times, with each subset being used for testing once. Finally, we average the performance of the model across all the iterations.

Algorithmic Steps:

  • Split the dataset into k equal subsets.
  • For each iteration, select one subset for testing and the remaining subsets for training.
  • Train the model on the training subsets and evaluate it on the test subset.
  • Repeat steps 2-3 k times, with each subset being used for testing once.
  • Calculate the average performance of the model across all the iterations.

Python code for k-fold cross-validation:

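from sklearn.model_selection import KFold

from sklearn.linear_model import LinearRegression

from sklearn.datasets import load_diabetes

# Example regression data bundled with scikit-learn, used here as a stand-in;

# replace with your own feature matrix X and target variable y (NumPy arrays)

X, y = load_diabetes(return_X_y=True)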

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

model = LinearRegression()

scores = []

for train_idx, test_idx in kfold.split(X):

    X_train, X_test = X[train_idx], X[test_idx]

    y_train, y_test = y[train_idx], y[test_idx]

    model.fit(X_train, y_train)

    score = model.score(X_test, y_test)

    scores.append(score)

avg_score = sum(scores) / len(scores)

print("Average R-squared score:", avg_score)
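
The manual loop makes every step visible, but scikit-learn's cross_val_score helper can run the same procedure in a single call. The sketch below assumes the diabetes dataset bundled with scikit-learn as a stand-in for the housing data, since any numeric feature matrix and target vector would work the same way.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in regression data (an assumption for this example).
X, y = load_diabetes(return_X_y=True)

# Same 5-fold split configuration as in the manual loop.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# One R-squared score per fold.
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")

print("Per-fold R-squared:", scores)
print("Average R-squared score:", scores.mean())
```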


Hyperparameter Tuning:

Hyperparameters are the parameters of a machine learning algorithm that are not learned from the data but are set before training the model. Examples include the learning rate of a neural network, the regularization strength of a regularized linear model such as ridge regression, and the maximum depth of a decision tree. Tuning these hyperparameters to good values can noticeably improve the model's performance.

Example: Let's say we are training a support vector machine (SVM) model on a dataset of images to classify them as cats or dogs. The SVM has hyperparameters such as the kernel type, regularization parameter, and the degree of the polynomial kernel. We can use grid search or random search to find the best values for these hyperparameters.

Algorithmic Steps:

  • For each hyperparameter, specify a range of values.
  • Create a grid of all possible combinations of hyperparameter values.
  • For each combination of hyperparameter values, train the model and evaluate its performance using cross-validation.
  • Select the combination of hyperparameter values that gives the best performance.

Python code for grid search hyperparameter tuning:

from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV

from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

param_grid = {

    'C': [0.1, 1, 10],

    'kernel': ['linear', 'rbf', 'poly'],

    'degree': [2, 3, 4]

}

model = SVC()

grid_search = GridSearchCV(model, param_grid=param_grid, cv=5)

grid_search.fit(X, y)

print("Best hyperparameters:", grid_search.best_params_)

print("Best cross-validation score:", grid_search.best_score_)
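
Random search, the alternative mentioned earlier, samples a fixed number of hyperparameter combinations instead of trying every one. A hedged sketch using scikit-learn's RandomizedSearchCV follows. The log-uniform range for C, the number of sampled combinations (n_iter=8), and the use of 3 folds are assumptions chosen to keep the example quick.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Distributions (or lists) to sample from; 'degree' only affects the poly kernel.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "kernel": ["linear", "rbf", "poly"],
    "degree": [2, 3, 4],
}

# Sample 8 combinations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    SVC(),
    param_distributions=param_distributions,
    n_iter=8,
    cv=3,
    random_state=42,
)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validation score:", search.best_score_)
```

Random search tends to be the more economical choice when the grid would be large or when some hyperparameters are continuous.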


Model Selection Techniques:

Model selection is the process of selecting the best machine-learning model for a given problem. It involves comparing the performance of different models on the same dataset and selecting the one that gives the best performance. Model selection techniques include comparing the performance of models using a holdout set, cross-validation, and information criteria.

Example: Let's say we have a dataset of images, and we want to classify them into different categories. We can compare the performance of different models such as logistic regression, decision tree, and support vector machine (SVM) using cross-validation and select the best model based on its performance.

Algorithmic Steps:

  • Choose a set of machine learning models to compare.
  • Create training and testing sets from the dataset.
  • Train each model on the training set and evaluate its performance on the testing set.
  • Select the model that gives the best performance on the testing set.

Python code for comparing different models:

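from sklearn.linear_model import LogisticRegression

from sklearn.tree import DecisionTreeClassifier

from sklearn.svm import SVC

from sklearn.model_selection import train_test_split

from sklearn.datasets import load_digits

# Example image data bundled with scikit-learn (8x8 handwritten digits), used here as a stand-in;

# replace with your own feature matrix X and target variable y

X, y = load_digits(return_X_y=True)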

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {

    'Logistic Regression': LogisticRegression(),

    'Decision Tree': DecisionTreeClassifier(),

    'Support Vector Machine': SVC()

}

best_model = None

best_score = -1

for name, model in models.items():

    model.fit(X_train, y_train)

    score = model.score(X_test, y_test)

    print(f"{name} score: {score}")  

    if score > best_score:

        best_score = score

        best_model = name     

print(f"\n Best model: {best_model}") 
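
A single train/test split can favour one model by chance. As the example description suggests, the comparison can also be done with cross-validation. The sketch below is one possible way to do that, assuming the digits dataset as the image data and adding feature scaling for the two models that usually benefit from it. The pipeline setup and the variable names cv_means and best_name are choices made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
cv_means = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, mean_score in cv_means.items():
    print(f"{name}: {mean_score:.3f}")

best_name = max(cv_means, key=cv_means.get)
print(f"Best model: {best_name}")
```

If the winning model is then tuned further, a separate held-out test set is still advisable for the final performance estimate.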
