
Tuesday, February 20, 2024

T-DISTRIBUTED STOCHASTIC NEIGHBOUR EMBEDDING IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

T-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Dimensionality Reduction in t-SNE
  • t-SNE Algorithm
  • Applications of t-SNE
  • Advantages of t-SNE
  • Disadvantages of t-SNE
  • Summary

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a widely used machine learning and data visualization method. Its primary objective is to preserve local similarities between data points while condensing high-dimensional data to two or three dimensions. Since Laurens van der Maaten and Geoffrey Hinton introduced it in 2008, t-SNE has proven to be an essential tool for understanding and evaluating complex datasets that are difficult to interpret in their original high-dimensional form.

Specifically, t-SNE captures the intricate local structure of the data in its lower-dimensional embedding. Unlike linear techniques such as principal component analysis (PCA), it focuses on preserving pairwise relationships between nearby data points. Because of this emphasis on local connections, t-SNE can highlight clusters and groups of data points and reveal subtle patterns and correlations within the data, giving researchers and analysts valuable insight into the underlying structure of the dataset and supporting informed decision-making.

Furthermore, t-SNE is well suited to condensing high-dimensional data into an understandable form. By projecting the data onto lower-dimensional spaces such as two-dimensional scatterplots or three-dimensional embeddings, t-SNE makes complex datasets easy to view. This visual component helps users analyze the data, spot clusters or patterns, and communicate their findings. t-SNE has therefore become an essential tool for exploratory data analysis, helping users understand the structure and relationships within high-dimensional datasets.

Real-World Example for t-SNE

Consider a town where a biologist named Clara aims to unravel the intricate relationships hidden within the town's diverse ecosystem. Clara gathered a wealth of biodiversity data collected from various locations and applied t-SNE (t-Distributed Stochastic Neighbor Embedding) to uncover patterns in species distributions and ecological relationships.

Clara started her research with a collection of data points representing the different species found across the town's forests, meadows, and wetlands. Each data point carries a range of ecological attributes, including species abundance, habitat preferences, and interactions with other organisms.

With the help of t-SNE, Clara set out to map the multidimensional landscape of the town's biodiversity onto a two-dimensional plot. As the t-SNE algorithm ran, it preserved the local structure of species distributions while unraveling the complex web of ecological relationships.

With each iteration of the t-SNE algorithm, clusters of data points emerged, revealing distinct ecological communities and species assemblages within the town's habitats. Clara was surprised at how t-SNE seamlessly captured the nuanced interactions between plants, animals, and their environment, painting a vivid picture of the town's ecological tapestry.

Dimensionality reduction in t-SNE

Dimensionality reduction techniques are essential when dealing with datasets containing numerous features. They aim to streamline intricate data by reducing the number of variables while upholding essential attributes, typically depicted in two or three dimensions.

Real-life datasets can comprise thousands or millions of features, leading to computational challenges like overfitting and result interpretation complexities.

These methods can be differentiated roughly into two categories: non-linear and linear. Principal Component Analysis (PCA) and other linear techniques presume linearity in the dataset structure and use linear algebra for analysis. On the other hand, complex non-linear data relationships can be easily handled using non-linear methods such as t-SNE.

Due to its non-linear nature, t-SNE is adept at capturing intricate data patterns, rendering it highly effective for machine learning professionals contending with datasets characterized by high dimensionality.

What is the t-SNE algorithm?

t-SNE is a non-linear dimensionality reduction technique that employs randomized methods to reduce dataset dimensionality. Its primary focus is on preserving the local information or structure of the dataset in the reduced dimensions.

t-SNE allows the exploration and mapping of high-dimensional data into lower-dimensional forms, typically displayed as 2D or 3D graphs; because the local structure is maintained, these views make the data easier to understand and analyze.

Working of t-SNE 

t-SNE measures how similar each pair of data points is and uses those similarities to place the points in a space with fewer dimensions. Similarity is expressed as a conditional probability: roughly, the probability that one point would pick another as its neighbor under a Gaussian centered on the first point.

The aim of t-SNE is to minimize the difference between these conditional probabilities (similarities) in the higher and lower dimensions, so that the lower-dimensional representation is accurate. However, because of the extensive computations involved, it is time- and space-consuming, with quadratic complexity in the number of data points.

The primary steps of t-SNE are:

  1. First, the algorithm measures pairwise similarities between data points in the high-dimensional space using a distance-based Gaussian kernel.
  2. Low-dimensional embeddings are then initialized randomly; these embeddings are optimized throughout the process to better represent the data.
  3. Pairwise similarities are then computed between the low-dimensional embeddings. (In the low-dimensional space, t-SNE uses a heavy-tailed Student's t-distribution rather than a Gaussian, which helps keep dissimilar points well separated.)
  4. Next, using the two pairwise similarity distributions, the method calculates the Kullback-Leibler (KL) divergence between the high-dimensional similarities and the low-dimensional ones. The divergence quantifies how much the two distributions differ.
  5. Using gradient descent, the algorithm minimizes the KL divergence. This optimization adjusts the low-dimensional embeddings to align them more closely with the similarities observed in the high-dimensional data.
  6. Steps 3 to 5 are repeated iteratively until convergence, ensuring that the low-dimensional embeddings accurately capture the structure of the high-dimensional data.

After these steps, you are left with a low-dimensional representation that preserves local similarities between your high-dimensional data points. In other words, points that are close together in the original high-dimensional space remain close together in the two- or three-dimensional embedding.
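
To make the quantities above concrete, here is a small NumPy sketch that computes Gaussian affinities in the high-dimensional space, Student-t affinities for a random low-dimensional initialization, and the KL divergence between them. It is only an illustration of the objective t-SNE optimizes, not a full implementation: in practice the Gaussian bandwidths are calibrated per point from the perplexity, and the embedding is then refined by gradient descent (the fixed `sigma` value below is a simplifying assumption).

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    sq = np.sum(X ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * X @ X.T

def high_dim_affinities(X, sigma=1.0):
    """Gaussian similarities P (symmetrized and normalized), as in step 1."""
    D = pairwise_sq_dists(X)
    P = np.exp(-D / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)          # a point is not its own neighbor
    P = (P + P.T) / 2.0               # symmetrize
    return P / P.sum()

def low_dim_affinities(Y):
    """Student-t similarities Q for the embedding, as in step 3."""
    D = pairwise_sq_dists(Y)
    Q = 1.0 / (1.0 + D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the quantity t-SNE minimizes (steps 4 and 5)."""
    mask = P > 0
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))          # 20 points in 10 dimensions
Y = rng.normal(size=(20, 2)) * 1e-2    # random 2-D initialization (step 2)

P = high_dim_affinities(X)
Q = low_dim_affinities(Y)
print("KL divergence before optimization:", kl_divergence(P, Q))
```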

Applications of t-SNE

One of the many applications of t-SNE in machine learning is data visualization. Let us look at some of the common applications of t-SNE:

  1. t-SNE is well suited to processing images and videos because of its strong data analysis and visualization capabilities. By reducing the number of image and video features, t-SNE helps discover patterns in huge datasets and cluster comparable images or frames. It finds applications in tasks like categorization, segmentation, and pattern recognition in visual data.
  2. Word-to-word semantic links in textual data are identified in Natural Language Processing (NLP) using t-SNE. Through the reduction of word embedding dimensionality, t-SNE aids in the clustering of words having comparable meanings, therefore aiding the discovery of semantic patterns and relationships in text corpora.
  3. In the domain of genomics in particular, the capacity of t-SNE to reduce dimensionality is useful for biological data analysis. By lowering dimensionality, it makes it easier to analyze high-dimensional gene expression data and, in turn, to find patterns and clusters among genes with comparable expression profiles, hence advancing our knowledge of biological processes.
  4. t-SNE helps identify anomalies as well. t-SNE's low-dimensional data visualization facilitates anomaly discovery in large datasets by identifying anomalous data point clusters.
  5. Recommender systems can identify products that are similar to one another based on qualities by using t-SNE. Using t-SNE, which lowers the dimensionality of item data and enables the grouping of related items, makes it simpler to build recommendation systems that suggest products to people with comparable features.
  6. t-SNE is helpful for social network analysis (SNA) since it can display large groups of people in social networks. By reducing the dimensionality of social network features, t-SNE aids in understanding the structure of social networks and identifies influential individuals or groups as well as clusters of related individuals.

Advantages of t-SNE

  • Preserves local structure: t-SNE is effective at revealing the local structure of the data. It maintains the relative distances between neighboring data points, emphasizing local relationships in the lower-dimensional space.
  • Visualization of high-dimensional data: t-SNE proves invaluable for rendering high-dimensional datasets into more manageable two- or three-dimensional representations. This enables humans to grasp the overall patterns and structures present in complex data, facilitating easier interpretation and analysis.
  • Nonlinearity preservation: unlike PCA, t-SNE can capture nonlinear relationships between variables, making it adept at revealing complex structures in the data.
  • Cluster separation: t-SNE tends to create more separable clusters by pulling together similar data points and pushing dissimilar ones apart, which aids in cluster identification.
  • Robustness to outliers: it’s relatively robust to outliers as it primarily focuses on local structures, reducing the impact of outliers on the visualization.
  • Tunable perplexity: the perplexity parameter controls the effective number of neighbors considered for each data point. It offers flexibility in capturing the various local structures present within the dataset, enabling t-SNE to adapt to the intricacies of the data distribution.
  • High interpretability: t-SNE produces graphics that are frequently simple to grasp and help to reveal the underlying patterns in complicated datasets.
  • Broad applicability: many fields have adopted t-SNE because of its remarkable ability to display high-dimensional data, including biology, natural language processing, image analysis, and others. Its adaptability makes it a useful instrument for exploring intricate data and making informed decisions.

Disadvantages of t-SNE

  • Computational cost: the computational demands of t-SNE can pose challenges, particularly for sizable datasets, leading to long execution times and heavy resource use.
  • Sensitivity to hyperparameters: performance can be sensitive to the choice of hyperparameters, particularly the perplexity parameter. Different perplexity values can lead to significantly different results, requiring careful tuning (see the sketch after this list).
  • Not distance-preserving: t-SNE preserves neighborhood structure but does not accurately preserve distances from the high-dimensional space. It is primarily useful for visualization and might not be suitable for other tasks like distance calculations or clustering in the transformed space.
  • Interpretation pitfalls: t-SNE visualizations can be difficult to interpret because relationships in the lower-dimensional space may not exactly reflect those in the original high-dimensional space. This discrepancy can complicate the interpretation of distances between points and their significance in the dataset's structure.
  • Lack of scalability: scaling t-SNE to handle very high-dimensional data or datasets with a large number of samples can pose challenges due to its computational requirements.
  • Random initialization: different runs of t-SNE on the same data might produce different embeddings due to its random initialization, making it less deterministic.
  • Overcrowding in visualizations: high-density regions might suffer from overcrowding in t-SNE visualizations, potentially obscuring certain patterns or clusters.
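
Because the embedding depends heavily on the perplexity value and the random seed, a common sanity check is to run t-SNE several times with different settings and compare the plots. Below is a minimal sketch using scikit-learn and matplotlib (assuming both are installed); the dataset and the perplexity values tried are illustrative choices, not recommendations.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)

# Try a few perplexity values; suitable values depend on the dataset size and
# on how local or global a view of the structure is wanted.
perplexities = [5, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(12, 4))

for ax, perp in zip(axes, perplexities):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=10)
    ax.set_title(f"perplexity = {perp}")

plt.tight_layout()
plt.show()
```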

Summary

t-SNE is a potent method for representing high-dimensional data in a lower-dimensional space, often 2D or 3D. It excels at revealing complex structures by preserving local relationships among data points. While it is great for visualization and cluster identification, t-SNE is computationally intensive, sensitive to hyperparameters such as perplexity, and does not accurately maintain distances. Its primary use lies in gaining insights from data rather than in precise distance calculations.

Python Code
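
A minimal sketch of running t-SNE in Python with scikit-learn (assuming scikit-learn and matplotlib are installed); the digits dataset, the perplexity value, and the plotting choices are illustrative assumptions rather than the only valid settings.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load a small built-in high-dimensional dataset (64 pixel features per sample).
X, y = load_digits(return_X_y=True)

# Standardizing the features is a common, optional preprocessing step.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 dimensions; perplexity and random_state are illustrative choices.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_scaled)

# Visualize the embedding, colored by the digit label.
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=8, cmap="tab10")
plt.legend(*scatter.legend_elements(), title="digit", loc="best", fontsize=8)
plt.title("t-SNE embedding of the digits dataset")
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.show()
```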


PRINCIPAL COMPONENT ANALYSIS IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

Principal Component Analysis

  • Curse of Dimensionality and Dimensionality Reduction
  • Common Terms and Mathematical Concepts in PCA
  • Steps of PCA Algorithm
  • Applications and Advantages of PCA
  • Disadvantages and Limitations of PCA

The number of features or dimensions in the dataset directly relates to the growth of computational and time costs. Dealing with high-dimensional data often leads to overfitting and decreased model accuracy, a phenomenon known as the "curse of dimensionality."

This issue arises due to the exponential growth in possible feature combinations as the number of dimensions increases, which makes operations more computationally intensive. To counteract the curse of dimensionality, we can use feature engineering approaches such as feature selection and feature extraction.

In feature extraction, dimensionality reduction is the process of reducing the number of input properties while preserving the integrity of the original data. One widely used method in this field is Principal Component Analysis (PCA) in machine learning.

PCA, an unsupervised learning algorithm first presented by Karl Pearson in 1901, uses an orthogonal transformation to convert correlated variables into a set of linearly uncorrelated features. PCA finds strong patterns in a dataset by looking for lower-dimensional surfaces onto which the high-dimensional data can be projected while retaining as much of the variance as possible.

This method is used in machine learning for both predictive modeling and exploratory data analysis. Many consider it to be a more generalized kind of factor analysis, with some parallels to regression's line of best fit. Principal Component Analysis is a useful tool for reducing the dimensionality of data while preserving important patterns and relationships between variables. As an unsupervised technique, PCA does not require prior knowledge of the target variables.

PCA evaluates the variance of each attribute to pinpoint those with significant variations, suggesting effective class distinctions, which in turn aids in reducing dimensionality. Its practical uses span diverse fields such as image analysis, movie recommendations, and optimizing resource allocation in communication networks.

The PCA algorithm is based on mathematical concepts such as:

  • Variance and covariance
  • Eigenvalues and eigenvectors
PCA operates by reducing the dimensionality of a dataset through the discovery of a smaller set of variables that preserves the bulk of information present in the samples. These transformed variables prove beneficial for both regression and classification tasks.



In fact, principal component analysis (PCA) uses a collection of orthogonal axes called principal components to capture the data's maximum variance. The first component encapsulates the most substantial variance, with successive components capturing additional, orthogonal variation. PCA's versatility extends across different fields, including data visualization, feature extraction, and data compression, operating on the premise that vital information lies within the variance of the features.

Understanding the basics of principal component analysis (PCA) can greatly enhance the performance and interpretability of convolutional neural networks (CNNs) by reducing the dimensionality of the input data while preserving its essential characteristics.

Real-World Example for Principal Component Analysis (PCA)

Suppose a data scientist named Alex lives in a megacity and faces a daunting challenge: the city's transportation system urgently needs optimization to relieve congestion and improve efficiency. Alex has a vast amount of traffic data and applies Principal Component Analysis (PCA) to it to unravel the complexities of the city's traffic patterns.

Alex started by collecting data on various factors that influence traffic, including vehicle counts, road conditions, weather conditions, and time of day. To this multidimensional dataset he applied PCA to extract the most significant components driving traffic behavior.

As the PCA algorithm analyzed the data, Alex discovered hidden patterns and relationships among the different variables. He identified principal components representing key factors such as rush-hour congestion, weather-related delays, and road closures due to accidents or construction.

Using the insights gained from PCA, Alex developed a comprehensive traffic model that can accurately predict congestion hotspots and potential bottlenecks across the city. By leveraging the principal components identified through PCA, he created a simplified yet accurate representation of the megacity's intricate traffic dynamics.

Some Common terms used in the PCA algorithm:

Dimensionality in a dataset signifies the count of features or variables, denoting the number of columns present within the data.

Correlation denotes the relationship between two values. It measures how one value changes when the other changes. A correlation value lies between -1 and +1, where -1 implies an inverse relationship, +1 denotes a direct relationship, and 0 indicates no correlation.

Orthogonal signifies that variables are unrelated or uncorrelated, resulting in a correlation of zero between the variable pairs.

Eigenvectors are non-zero vectors v that, when multiplied by a square matrix A, produce a new vector Av that is a scalar multiple of the original vector v.

The Covariance Matrix encompasses covariances between pairs of variables.

Main Components in PCA

The resulting new features after applying PCA are termed Principal Components (PCs). These components can either match the number of original features or be fewer. Key properties of principal components include:
  • Principal components are formed as linear combinations of the original features.
  • These components are orthogonal, indicating a correlation of zero between variable pairs.
  • The significance of each component diminishes progressively from 1 to n. PC 1 holds the highest importance, whereas the nth PC is the least significant.
Steps of PCA algorithm
  • To begin, split the dataset into a training set and a validation set.
  • Arrange the data logically: build a matrix X with data items in the rows and features in the columns to represent the independent variables. The number of columns gives the dimensionality of the dataset.
  • Standardize the data so that all features are on a comparable scale. A feature with higher variance would otherwise appear more significant simply because of its scale, so subtract each column's mean and divide by the column's standard deviation to obtain the Z-matrix.
  • Compute the covariance matrix of Z by multiplying the transpose of Z by Z (ZᵀZ).
  • Determine the eigenvalues and eigenvectors of this covariance matrix. The eigenvectors give the directions of the high-information axes, and the corresponding eigenvalues measure how much variance lies along each of those directions.
  • Sort the eigenvalues in decreasing order and arrange the corresponding eigenvectors in the same order as the columns of a matrix P*.
  • Compute Z* = ZP* to obtain the new features. Each observation becomes a linear combination of the original features, and the columns of Z* are uncorrelated with one another.
  • Finally, perform feature selection: keep the components that carry most of the variance and discard the rest. These steps are sketched in the NumPy code below.
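
Here is a minimal NumPy sketch of these steps (standardize, compute the covariance matrix, take its eigendecomposition, sort, and project). It is illustrative rather than production code, and the choice of two components at the end is an arbitrary assumption.

```python
import numpy as np

def pca_from_scratch(X, n_components=2):
    """Project X onto its leading principal components, following the steps above."""
    # Standardize: subtract column means and divide by column standard deviations.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Covariance matrix of the standardized data (proportional to Z^T Z).
    cov = np.cov(Z, rowvar=False)

    # Eigenvalues and eigenvectors of the symmetric covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort both in decreasing order of eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Keep the leading components (columns of P*) and project the data onto them.
    P_star = eigenvectors[:, :n_components]
    Z_star = Z @ P_star

    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return Z_star, explained

# Small synthetic example: 100 samples, 5 features, one deliberately correlated pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)

Z_star, explained = pca_from_scratch(X, n_components=2)
print("Projected shape:", Z_star.shape)
print("Variance explained by the first two components:", explained)
```
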
Applications of Principal Component Analysis
  • PCA is useful for decreasing dimensionality in many artificial intelligence applications, including computer vision and picture compression.
  • It can unveil new or concealed patterns within high-dimensional data, with applications spanning fields like finance, data mining, and psychology.
Step-by-step mathematical explanation of PCA (Principal Component Analysis)
Step 1 – Standardization
The dataset must first be standardized, which means rescaling every variable so that it has a mean of 0 and a standard deviation of 1:

Z = (X − μ) / σ

In the above equation:

μ is the mean of independent features 
σ is the standard deviation of independent features 

Step 2 – Covariance Matrix Computation

Covariance measures how two variables vary together and therefore indicates their joint variability. For two features x1 and x2 with means x̄1 and x̄2, it can be computed over n samples as:

cov(x1, x2) = Σ (x1i − x̄1)(x2i − x̄2) / (n − 1)

The value of the covariance can be positive, negative, or zero.
  • Positive: as x1 increases, x2 also increases.
  • Negative: as x1 increases, x2 decreases.
  • Zero: no direct linear relation.
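
As a quick, illustrative check of this formula (not part of the original post), the manual computation can be compared against NumPy's built-in covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=200)   # constructed to co-vary with x1

# Sample covariance by the formula above (dividing by n - 1).
manual = np.sum((x1 - x1.mean()) * (x2 - x2.mean())) / (len(x1) - 1)

# NumPy's covariance matrix; the off-diagonal entry is cov(x1, x2).
builtin = np.cov(x1, x2)[0, 1]

print(manual, builtin)   # the two values should agree
```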

Step 3 – Determine the principal components by calculating the covariance matrix's eigenvalues and eigenvectors

Let A be a square n×n matrix and X be a non-zero vector for which

AX = λX

for some scalar value λ. Then λ is an eigenvalue of the matrix A, and X is the eigenvector of A corresponding to that eigenvalue.

We can also write the equation as

(A − λI)X = 0

where I is the identity matrix with the same dimensions as A. This equation has a non-zero solution X only if (A − λI) is not invertible, that is, if its determinant is zero:

det(A − λI) = 0

Solving this equation gives the eigenvalues λ; substituting each eigenvalue back into AX = λX yields the corresponding eigenvector.
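
A small NumPy illustration of these relations on a 2×2 matrix (the matrix values are arbitrary examples):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each eigenpair should satisfy A X = lambda X (up to floating-point error).
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(lam, np.allclose(A @ v, lam * v))

# det(A - lambda * I) is (numerically) zero at each eigenvalue.
for lam in eigenvalues:
    print(np.linalg.det(A - lam * np.eye(2)))
```
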
Advantages of Principal Component Analysis
  • Dimensionality Reduction: PCA is renowned for reducing the number of variables in a dataset, simplifying data analysis, enhancing performance, and facilitating data visualization.
  • Feature Selection: PCA can be employed for selecting essential variables from a dataset, especially beneficial in scenarios with numerous variables that are challenging to prioritize.
  • Data Visualization: Utilizing PCA, high-dimensional data can be represented in two or three dimensions, aiding easier interpretation and visualization.
  • Multicollinearity Management: Addressing multicollinearity issues in regression analysis is another forte of PCA, which identifies correlated variables and generates uncorrelated ones for regression modeling.
  • Noise Reduction: PCA contributes to noise reduction by discarding principal components with low variance, effectively improving the signal-to-noise ratio and unveiling underlying data structures (see the sketch after this list).
  • Data Compression: by using a smaller set of components that captures the majority of the data's variability, PCA lowers storage requirements and enhances processing performance.
  • Outlier Detection: PCA can be used to identify data points that depart considerably from the rest of the data within the principal component space.
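
To illustrate the noise-reduction point, the following hedged sketch adds Gaussian noise to the scikit-learn digits images, keeps only the leading principal components, and reconstructs the images; the noise level and the 16-component choice are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# Corrupt the 8x8 digit images with Gaussian noise.
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(scale=4.0, size=X.shape)

# Keep only the leading components, then map back to the original pixel space.
pca = PCA(n_components=16)
X_denoised = pca.inverse_transform(pca.fit_transform(X_noisy))

# The reconstruction should sit closer to the clean images than the noisy ones.
print("noisy    MSE vs. clean:", np.mean((X_noisy - X) ** 2))
print("denoised MSE vs. clean:", np.mean((X_denoised - X) ** 2))
```
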
Disadvantages of Principal Component Analysis
  • Interpretability: PCA produces principal components that are linear combinations of the original variables, which can make the outcomes difficult to explain to others.
  • Data Scaling: PCA's performance can be affected by data scaling. Inappropriate scaling might undermine the effectiveness of PCA, warranting careful data scaling before its application.
  • Information Loss: reducing variables via PCA may result in information loss, which depends on the number of principal components retained. Careful selection of principal components is therefore essential to limit excessive information loss.
  • Linearity Assumption: PCA operates under the assumption of linear relationships between variables. Its effectiveness may decrease when non-linear relationships are present, which limits its applicability.
  • Computational Complexity: PCA can be computationally resource-intensive for extensive datasets, particularly when the dataset contains numerous variables.
  • Overfitting: overfitting can happen when a model is trained on a small dataset or with an excessive number of principal components, which might impair the model's ability to generalize to new data.
Summary
Data dimensionality reduction is made possible by Principal Component Analysis (PCA). The goal is to convert high-dimensional datasets into spaces with fewer dimensions while maintaining important information. By locating new orthogonal axes, known as principal components, that capture the largest variation in the dataset, it finds the most significant patterns in the data. PCA facilitates noise reduction, streamlines data representation, and speeds up machine learning methods. However, it assumes that variables have linear relationships, so it may underperform on some nonlinear datasets.

Python Code
Below is an example of PCA in Python:
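
The following is a minimal sketch using scikit-learn's PCA on the Iris dataset (assuming scikit-learn and matplotlib are available); the two-component choice and the plotting details are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Standardize the features so that no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)

# Visualize the projection, colored by class.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("PCA projection of the Iris dataset")
plt.show()
```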

