
Tuesday, February 20, 2024

PRINCIPAL COMPONENT ANALYSIS IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

Principal Component Analysis

  • Curse of Dimensionality and Dimensionality Reduction
  • Common Terms and Mathematical Concepts in PCA
  • Steps of PCA Algorithm
  • Applications and Advantages of PCA
  • Disadvantages and Limitations of PCA

The computational and time costs of a machine learning task grow directly with the number of features, or dimensions, in the dataset. Dealing with high-dimensional data often leads to overfitting and decreased model accuracy, a phenomenon known as the "curse of dimensionality."

This issue arises because the number of possible feature combinations grows exponentially as the dimensions increase, leading to more computationally intensive operations. To mitigate the effects of high dimensionality, we can use feature engineering approaches such as feature selection and feature extraction.

In feature extraction, dimensionality reduction is the process of reducing the number of input properties while preserving the integrity of the original data. One widely used method in this field is Principal Component Analysis (PCA) in machine learning.

Through an orthogonal transformation, PCA, an unsupervised learning algorithm first presented by Karl Pearson in 1901, statistically converts correlated variables into a set of linearly uncorrelated features. PCA finds strong patterns in a dataset by identifying the directions of greatest variance and projecting the high-dimensional data onto lower-dimensional surfaces aligned with those directions.

This method is used in machine learning for both predictive modeling and exploratory data analysis. Many consider it to be a more generalized kind of factor analysis, with some parallels to regression's line of best fit. Principal component analysis (PCA) is a useful tool for reducing the dimensionality of data while preserving important patterns or relationships between variables, and as an unsupervised method it does not require prior knowledge of the target variables.

PCA evaluates the variance of each attribute to pinpoint those with significant variations, suggesting effective class distinctions, which in turn aids in reducing dimensionality. Its practical uses span diverse fields such as image analysis, movie recommendations, and optimizing resource allocation in communication networks.

The PCA algorithm is based on mathematical concepts such as:

  • Variance and covariance
  • Eigenvalues and eigenvectors
PCA operates by reducing the dimensionality of a dataset through the discovery of a smaller set of variables that preserves the bulk of information present in the samples. These transformed variables prove beneficial for both regression and classification tasks.


Image source original

In fact, principal component analysis (PCA) uses a collection of orthogonal axes called principal components to extract the data's maximum variance. The initial component encapsulates the most substantial variance, with successive components capturing additional orthogonal variations. PCA's versatility extends across different fields, including data visualization, feature extraction, and data compression, operating on the premise that vital information lies within the variance of the features.

Understanding the basics of principal component analysis (PCA) can greatly enhance the performance and interpretability of convolutional neural networks (CNNs) by reducing the dimensionality of the input data while preserving its essential characteristics.

Real-World Example for Principal Component Analysis (PCA)

Let’s suppose there is a megacity where Alex, a data scientist, lives. He found himself facing a daunting challenge: the city’s transportation system was in immediate need of optimization to reduce congestion and improve efficiency. Alex has a vast amount of traffic data, and he applies Principal Component Analysis (PCA) to it to unravel the complexities of the city’s traffic patterns.

Alex started by collecting data on various factors that influence traffic, including vehicle counts, road conditions, weather conditions, and time of day. With this multidimensional dataset, he applied PCA to extract the most significant components driving traffic behavior.

As the PCA algorithm analyzed the data, Alex discovered hidden patterns and relationships among the different variables. He identified principal components representing key factors such as rush hour congestion, weather-related delays, and road closures due to accidents or construction.

Using the insights gained from PCA, Alex developed a comprehensive traffic model that can accurately predict congestion hotspots and potential bottlenecks across the city. By leveraging the principal components identified through PCA, he created a simplified yet accurate representation of the megacity's intricate traffic dynamics.

Some Common terms used in the PCA algorithm:

Dimensionality in a dataset signifies the count of features or variables, denoting the number of columns present within the data.

Correlation denotes the relationship between two values. It measures how one value changes when the other changes. A correlation value lies between -1 and +1, where -1 implies an inverse relationship, +1 denotes a direct relationship, and 0 indicates no correlation.

Orthogonal signifies that variables are unrelated or uncorrelated, resulting in a correlation of zero between the variable pairs.

Eigenvectors are vectors that, when multiplied by a square matrix A, produce a new vector Av that is a scalar multiple of the original vector v.

The Covariance Matrix encompasses covariances between pairs of variables.

Principal Components in PCA

The resulting new features after applying PCA are termed Principal Components (PCs). These components can either match the number of original features or be fewer. Key properties of principal components include:
  • Principal components are formed as linear combinations of the original features.
  • These components are orthogonal, indicating a correlation of zero between variable pairs.
  • The significance of each component diminishes progressively from 1 to n. PC 1 holds the highest importance, whereas the nth PC is the least significant.
Steps of PCA algorithm
  • To begin, split the dataset into a training set and a validation set.
  • Logically arrange the data as a matrix X, with data items in the rows and features in the columns; the number of columns tells us the dimensionality of the dataset.
  • Standardize the data to ensure consistency throughout the dataset. A feature with a higher variance would otherwise be treated as more significant, so subtract each column's mean and divide by the column standard deviation to obtain the Z matrix. This guarantees that scale alone does not determine a feature's importance.
  • Compute the covariance matrix of Z by transposing the Z matrix and multiplying the result by Z.
  • Determine the eigenvalues and eigenvectors of this covariance matrix. The eigenvectors represent the directions of the high-information axes, and the corresponding eigenvalues indicate how much variance lies along each of those directions.
  • Sort the eigenvalues in decreasing order and arrange the corresponding eigenvectors in the same order to form the sorted matrix P*.
  • Compute Z* by multiplying Z by the P* matrix. Each observation is converted into a linear combination of the original features, and the columns of Z* are independent of one another.
  • Finally, perform feature selection: keep the components that carry most of the variance and eliminate the rest (see the NumPy sketch after this list).
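The following is a minimal from-scratch sketch of these steps using NumPy. The random 100×5 data matrix and the choice of keeping two components are assumptions made purely for illustration.

# A minimal from-scratch sketch of the PCA steps above using NumPy.
# The random 100x5 data matrix and the 2-component choice are assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # data matrix: rows = samples, columns = features

# Standardize each column (zero mean, unit variance) to get the Z matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of Z (Z^T Z scaled by n - 1)
cov = (Z.T @ Z) / (Z.shape[0] - 1)

# Eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort the eigenvectors by decreasing eigenvalue to form the sorted matrix P*
order = np.argsort(eigvals)[::-1]
P_star = eigvecs[:, order]

# Project the standardized data onto the principal components: Z* = Z P*
Z_star = Z @ P_star

# Keep the components that explain most of the variance (here the first two)
Z_reduced = Z_star[:, :2]
print(Z_reduced.shape)                  # (100, 2)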
Applications of Principal Component Analysis
  • PCA is useful for reducing dimensionality in many artificial intelligence applications, including computer vision and image compression.
  • It can unveil new or concealed patterns within high-dimensional data, with applications spanning fields like finance, data mining, and psychology.
Step-by-step mathematical explanation of PCA (Principal Component Analysis)
Step 1 – Standardization

The dataset must first be standardized so that every variable has zero mean and unit variance:

Z = (X - μ) / σ

In the above equation:

μ is the mean of the independent features
σ is the standard deviation of the independent features

Step 2 – Covariance Matrix Computation

The covariance of two variables indicates how much they vary together and therefore serves as a measure of their joint variability. We can use the following formula to find the covariance:

cov(x1, x2) = Σ (x1_i - x̄1)(x2_i - x̄2) / (n - 1), summed over i = 1 … n

The value of the covariance can be positive, negative, or zero.
  • Positive: as x1 increases, x2 also increases.
  • Negative: as x1 increases, x2 decreases.
  • Zero: no direct relation between x1 and x2.

Step 3 – Determine the principal components by calculating the covariance matrix's eigenvalues and eigenvectors

Let A be a square n×n matrix and X be a non-zero vector for which

AX = λX

for some scalar value λ. Here λ is an eigenvalue of matrix A, and X is the eigenvector of A associated with that eigenvalue.

We can also write the equation as

(A - λI)X = 0

where I is the identity matrix with the same dimensions as A. This equation has a non-zero solution only if (A - λI) is not invertible, that is, if det(A - λI) = 0. Solving this determinant equation gives the eigenvalues λ, and substituting each eigenvalue back into AX = λX gives the corresponding eigenvector, as the short numerical check below illustrates.
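As a quick sanity check, the relationships AX = λX and det(A - λI) = 0 can be verified numerically with NumPy; the 2×2 matrix A below is an assumed example, not one taken from the post.

# Numerical check of AX = lambda X and det(A - lambda I) = 0 for an assumed 2x2 matrix
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)

for lam, v in zip(eigvals, eigvecs.T):
    # A X = lambda X holds for each eigenpair
    print(np.allclose(A @ v, lam * v))                           # True
    # (A - lambda I) is not invertible, i.e. its determinant is zero
    print(np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0))   # True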
Advantages of Principal Component Analysis
  • Dimensionality Reduction: PCA is renowned for reducing the number of variables in a dataset, simplifying data analysis, enhancing performance, and facilitating data visualization.
  • Feature Selection: PCA can be employed for selecting essential variables from a dataset, especially beneficial in scenarios with numerous variables that are challenging to prioritize.
  • Data Visualization: Utilizing PCA, high-dimensional data can be represented in two or three dimensions, aiding easier interpretation and visualization.
  • Multicollinearity Management: Addressing multicollinearity issues in regression analysis is another forte of PCA, which identifies correlated variables and generates uncorrelated ones for regression modeling.
  • Noise Reduction: PCA contributes to noise reduction by eliminating principal components with low variance, effectively improving the signal-to-noise ratio and unveiling underlying data structures.
  • Data Compression: PCA aids data compression by representing most of the data's variability with a smaller set of components. This dimensionality reduction not only lowers storage requirements but also enhances processing performance.
  • Outlier Detection: PCA can be used to identify outliers as data points that depart considerably from the norm within the principal component space.
Disadvantages of Principal Component Analysis
  • Interpretability: PCA produces principal components that are linear combinations of the original variables, and explaining these combinations to others can be challenging.
  • Data Scaling: PCA's performance can be affected by data scaling. Inappropriate data scaling might undermine the effectiveness of PCA, warranting careful data scaling before its application.
  • Information Loss: Reducing the number of variables via PCA may result in information loss, and the amount lost depends on how many principal components are retained. Careful selection of the number of components is therefore essential to limit this loss.
  • Non-linear Relationships: PCA operates under the assumption of linear relationships between variables, so its effectiveness may decrease when non-linear relationships are present, which limits its applicability.
  • Computational Complexity: Computationally, PCA can be resource-intensive for extensive datasets, particularly when the dataset contains numerous variables.
  • Overfitting: Overfitting can happen when a model is trained on a small dataset or with an excessive number of principal components, which might impair the model's ability to generalize to new data.
Summary
Data dimensionality reduction is made possible by Principal Component Analysis (PCA). The goal is to convert high-dimensional datasets into spaces with fewer dimensions while maintaining important information. By locating new orthogonal axes, known as principal components, that capture the largest variation in the dataset, it finds the most significant patterns in the data. PCA facilitates noise reduction, streamlines data representation, and speeds up machine learning methods. However, it assumes that variables have linear relationships, so it may not perform as well on some non-linear datasets.

Python Code
Below is the PCA code in Python:
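The original listing is not reproduced here, so the following is a minimal sketch of PCA in Python using scikit-learn; the Iris dataset and the choice of two components are assumptions for illustration.

# Minimal PCA sketch with scikit-learn (Iris dataset assumed for illustration)
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize the features before applying PCA
X_scaled = StandardScaler().fit_transform(X)

# Reduce the four original features to two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)

# Visualize the data in the space of the first two principal components
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Iris data projected onto the first two principal components")
plt.show()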


HIERARCHICAL CLUSTERING IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

Hierarchical Clustering

  • Why Hierarchical Clustering?
  • Types of Hierarchical Clustering
  • Agglomerative Hierarchical Clustering Algorithm
  • Working of Hierarchical Clustering
  • Advantages of Hierarchical Clustering
  • Disadvantages of Hierarchical Clustering

A kind of unsupervised machine learning known as hierarchical clustering arranges data into a tree-like hierarchical structure that is often shown as a dendrogram. At first, every data point is treated as its own cluster. Subsequently, the following steps are executed:
  • Identifying the two clusters that are closest to each other.
  • Merging the two most similar clusters. This merging process continues until all clusters are amalgamated.

While the outcomes of K-means clustering and hierarchical clustering might seem alike at times, their methodologies differ significantly. A major difference between the two is that hierarchical clustering does not need a starting set of clusters, or a predefined number of clusters, to be specified. Hierarchical clustering, also known as hierarchical cluster analysis, yields a structured diagram of the clusters within a dataset. The procedure starts by considering each data point as a separate cluster, finds each cluster's closest neighbor, and keeps combining clusters until a stopping threshold is reached.

Why do we need Hierarchical Clustering?

Hierarchical clustering is markedly different from K-means in that it doesn't need the number of clusters to be specified up front. Also called hierarchical cluster analysis, the method gives a ranked diagram, or dendrogram, of the dataset. The algorithm starts with every data point as its own cluster, repeatedly finds the closest pair of clusters and merges them, and continues until all points belong to a single cluster or a chosen stopping criterion is reached.

Real-World Example for Hierarchical Clustering

Let’s suppose there is a village where many farmers live who grow different grains and vegetables; nature lovers live in the village too. Even though it is an ideal place to live, the villagers face challenges in organizing their annual Harvest Festival.

To overcome this challenge, the villagers hire Emma, an event planner. She knew that she needed a more strategic approach, so Emma turned to hierarchical clustering to organize the festival activities and attractions.

Emma began by gathering data on the festival’s past attendance, popular attractions, and demographic preferences. After gaining all this information, she applied hierarchical clustering to group similar festival activities and identify clusters of interest.

As the clustering algorithm worked through the data, Emma discovered several distinct clusters of festival attractions. One cluster included traditional harvest-themed activities such as pumpkin carving and apple picking, while another comprised live music performances and artisanal craft stalls. She also found a cluster focused on children’s entertainment, featuring interactive games and storytelling sessions.

Using these insights, Emma devised a tiered approach to organizing the Harvest Festival. She created a hierarchical structure with main clusters representing broad categories of attractions and subclusters highlighting specific activities within each category.

This method allowed participants to quickly browse through the many attractions and select the ones that most closely matched their preferences. Young families may explore the children's amusement group, and foodies might taste delicious meals in the food and drink area.

Types of Hierarchical Clustering

There are two types of hierarchical clustering:
  1. Agglomerative clustering analysis
  2. Divisive clustering

Agglomerative Hierarchical Clustering, a popular method in hierarchical cluster analysis (HCA), follows a bottom-up approach. It starts by treating each data point as a separate cluster. Then, in each iteration, it merges the closest pair of clusters until all clusters are merged into a single cluster encompassing the entire dataset.

Agglomerative clustering algorithm

The algorithm for Agglomerative Hierarchical Clustering is:
  • Initially, treat each data point as its own cluster.
  • Compute the similarity between each cluster and all other clusters, resulting in a proximity matrix.
  • Merge the pair of clusters that exhibit the highest similarity or proximity.
  • Recompute the proximity matrix for the newly formed cluster.
  • Repeat the merge and update steps (steps 3 and 4) until only one cluster remains (see the SciPy sketch after this list).
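The SciPy sketch below shows what these steps look like in practice; the random 2-D data, the Ward linkage method, and the cut into three flat clusters are assumptions for illustration.

# A sketch of the agglomerative steps above using SciPy (random 2-D data assumed)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 2))            # ten data points, each its own cluster initially

# 'ward' merges the pair of clusters giving the smallest increase in variance;
# each row of Z records: cluster i, cluster j, merge distance, new cluster size
Z = linkage(X, method="ward")
print(Z)

# Cut the hierarchy to obtain, for example, three flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)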

Image source original

Let’s look at how the algorithm works in a little more detail and understand it with images.

Agglomerative clustering example

In the first step, we create each point as a single cluster. If there are N data points then there will be N clusters as shown in the below image. 


Every dot has a cluster (image source original)
In the next step, we take the two closest points or clusters and combine them into a single cluster. This leads to N-1 clusters.

Image source original
This process is then repeated: the two closest clusters are again combined to form one cluster, leaving N-2 clusters.
Image source original

This merging process continues, repeating steps 3 and 4, until only one cluster is left, as shown in the images below.
Image source original

After all the clusters have been combined into one big cluster, we can draw a dendrogram (as shown in the image above) and cut it at the level that divides the data into the number of clusters appropriate to the problem.

Divisive Hierarchical clustering

Divisive clustering works in the opposite, top-down direction: all data points are first placed in a single big cluster, which is then split into smaller clusters based on how dissimilar the points are. The splitting is repeated iteratively until every data point forms its own cluster, or a stopping criterion is reached.

Image source original

Working of dendrogram in Hierarchical clustering

A dendrogram is used by the Hierarchical Clustering (HC) technique to show graphically each clustering stage. With the Euclidean distances between data points shown on the y-axis and all of the dataset's data points shown on the x-axis, the dendrogram resembles a tree. This dendrogram provides a comprehensive overview of the clustering process, showcasing the merging of clusters and the distances between them.
Image source original


In the diagram on the left side, the clusters formed by the agglomerative clustering in the machine learning process are depicted, while the corresponding dendrogram is illustrated on the right side.
  • The initial step shows the combination of data points P2 and P3, forming a cluster, which is reflected in the dendrogram by the connection between P2 and P3. The height in the dendrogram signifies the Euclidean distance between these data points.
  • Subsequently, another cluster is formed by P5 and P6, and its corresponding dendrogram link emerges. The height of this linkage is greater than the previous one, indicating a slightly bigger distance between P5 and P6 than between P2 and P3.
  • Two additional dendrograms are created, combining P1, P2, and P3 into one dendrogram, and P4, P5, and P6 into another.
  • Finally, a last dendrogram link is constructed, amalgamating all the data points together (see the SciPy sketch below).
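The SciPy sketch below reproduces a walkthrough of this kind; the coordinates assigned to P1 to P6 are hypothetical and were chosen only so that the merge order roughly mirrors the description above.

# Dendrogram sketch for six points P1..P6 (hypothetical coordinates chosen so that
# P2/P3 merge first, then P5/P6, roughly mirroring the walkthrough above)
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[0.0, 0.0],    # P1
                   [1.0, 0.0],    # P2
                   [1.2, 0.0],    # P3
                   [5.0, 0.0],    # P4
                   [6.0, 0.0],    # P5
                   [6.3, 0.0]])   # P6

Z = linkage(points, method="single")     # single linkage: nearest-neighbour distance

dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.ylabel("Euclidean distance")
plt.show()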
Advantages of Hierarchical Clustering
Hierarchical clustering holds several strengths:
  • Its capability to accommodate non-convex clusters, also clusters of various sizes and densities, makes it versatile for diverse datasets.
  • Effective handling of missing and noisy data, contributing to robustness in the clustering process.
  • The hierarchical structure revealed by the dendrogram provides valuable insights into the relationships among clusters, helping to comprehend intricate inter-cluster connections within the dataset.
Disadvantages of Hierarchical Clustering
Hierarchical clustering also faces several challenges:
  • Determining a stopping criterion to ascertain the final number of clusters, which can be subjective and challenging.
  • Higher computational demands, and memory requirements, especially with larger datasets.
  • Sensitivity to initial conditions, impacting the final clusters identified.
  • Despite its ability to handle diverse data and unveil relationships among clusters, its high computational cost and sensitivity to specific conditions remain notable drawbacks.
Summary

An unsupervised method for building a hierarchy of clusters is called hierarchical clustering. It arranges data into a tree-like structure, with each node standing for a cluster. It can be agglomerative (bottom-up) or divisive (top-down), merging or splitting clusters based on how similar they are. The number of clusters can be chosen afterwards by cutting the hierarchy at any level, and the dendrogram also provides information about the relationships between the data points. However, hierarchical clustering can be computationally demanding for huge datasets, and once a merge or split has been made it cannot be undone.


Python Code
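The original listing is not included here, so the following is a minimal sketch of agglomerative hierarchical clustering in Python with scikit-learn; the synthetic blobs dataset and the choice of three clusters are assumptions for illustration.

# Minimal hierarchical (agglomerative) clustering sketch with scikit-learn
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data assumed for illustration
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up clustering with Ward linkage into three clusters
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("Agglomerative clustering (Ward linkage)")
plt.show()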

