K-NEAREST NEIGHBOUR IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

K-Nearest Neighbors

  • The Intuition Behind K-NN Algorithm
  • How KNN works
  • Steps Involved in the K-NN Algorithm
  • Distance Metrics Used in KNN
  • Choosing the K value
  • Applications of the KNN Algorithm
  • Advantages and Disadvantages

K-Nearest Neighbours (K-NN) is a simple supervised machine learning algorithm that can handle both regression and classification problems. By comparing a new data point with the stored data points, the approach places the new example in the category of the most similar examples already seen. The main idea is to store the training data and classify incoming data according to how similar it is to the previously saved dataset.

The K-Nearest Neighbours model remains popular because of its stability and simplicity. Its non-parametric nature means it does not depend on assumptions about the data distribution. Although applicable to both regression and classification, it is used predominantly for classification problems. K-NN is termed a "lazy learner" algorithm because it does not learn from the training set up front; instead it stores the data and performs classification on the spot when needed.

For instance, consider a scenario with a variety of dog breed images used to train a K-NN model. When presented with a new dog image, the K-NN algorithm identifies similarities between its features and those of various dog breeds within the dataset. Based on these shared features, the algorithm classifies the image into the most similar breed category.

By emphasizing similarity measures, the K-Nearest Neighbour algorithm proves valuable in scenarios where a new data point's classification or regression depends heavily on its resemblance to existing data. In this chapter we study the k-nearest neighbour learning algorithm.

Let's take a real-world example of KNN.

In this k-nearest neighbor example, suppose there is a mechanic who has a talent for fixing cars, but new electric vehicles stumped them. The error codes were unlike any they had seen before, and diagnosing these electric vehicles felt like navigating a foreign language. Deciding to get clever, they looked to KNN, a technique like having a team of expert mechanics by their side.

Imagine KNN as a toolbox filled with past cases. Each case is a car with its symptoms (error codes, battery readings, etc.) and the mechanic’s diagnosis (the problem identified). When a new electric car with a puzzling error rolls in, KNN swings into action.

KNN doesn’t jump to conclusions; instead, it considers the past cases in the toolbox. It analyzes the new car’s symptoms and compares them to all the past cases. KNN then picks a small group, the K nearest neighbors: the most similar cars based on their symptoms.

These K closest neighbors become like a consultation team for the mechanic. By looking at the problems those similar cars had (their diagnoses), KNN helps the mechanic predict the most likely issue with the new car. It’s like having a group of experts who’ve tackled similar problems before, whispering their insights to the mechanic.

This collaborative approach proved valuable. Faced with an unfamiliar error code, the mechanic could consult the KNN toolbox, identify similar past cases, and make a well-informed diagnosis. With KNN as their secret weapon, the mechanic became the go-to electric car mechanic, their skills boosted by the help of KNN.

The intuition behind the KNN algorithm


[Figure: a scatter plot with two clusters of data points and a new, unclassified point lying near the green cluster. Image source: original]

In the above graph, we can see two clusters or groups. Now if we have another point (also plotted on the graph) that is unclassified, we can easily assign it to a group. One approach is to examine the group to which its nearest neighbors belong. By this reasoning, the point in the above diagram will be ‘green’ because it is closest to the green cluster.

Why do we need the KNN algorithm?

The widespread use of the K-Nearest Neighbors (K-NN) algorithm stems from its versatility and wide applicability. Its simplicity and straightforward implementation are key factors driving its adoption. One of its standout features is that it makes no assumptions about the underlying data distribution or how the data is scattered, making it suitable for various datasets, whether dealing with numerical or categorical data, in both classification and regression tasks.

A significant advantage lies in its non-parametric nature, leading to predictions grounded in the similarity among data points within the dataset. Furthermore, K-NN exhibits a higher tolerance for outliers compared to alternative algorithms.

To discover the K nearest neighbors, the K-NN approach typically uses the Euclidean distance as its distance metric of choice. A data point's class or value is then determined by a majority vote over, or by averaging, the outcomes of its K nearest neighbors. Because of this adaptive strategy, the algorithm can identify varied patterns in the data and forecast outcomes based on the local structure of the data.

How does KNN work?

K-Nearest Neighbors (K-NN) stands as a versatile and widely embraced algorithm due to its adaptability across various applications. Its allure resides in its simplicity and ease of implementation. Notably, it distinguishes itself by eschewing assumptions about the inherent data distribution, rendering it suitable for datasets of diverse natures, encompassing numerical and categorical data in both classification and regression domains.

Its non-parametric nature forms a cornerstone for predictions, relying on the inherent similarities among data points housed within the dataset. Another distinctive trait is its robustness against outliers, surpassing the resilience of alternative algorithms in this aspect.

By utilizing distance metrics, particularly the prevalent Euclidean distance, the K-NN algorithm identifies the K closest neighbors. Subsequently, the determination of a data point's classification or value involves a process of averaging or considering the majority vote among these K neighbors. This adaptive methodology empowers the algorithm to discern varied data patterns and make predictions rooted in the localized structure of the dataset.

Choosing the K data points from the dataset X that have the shortest distance to the target data point x is the first step in the K-Nearest Neighbors (K-NN) technique. In classification problems, the method finds the most common label y among these K nearest neighbors of x. In a regression job, the method instead calculates the mean (or a distance-weighted average) of the y values associated with the K nearest neighbors to obtain the predicted value for x. Whether determining the most common label in classification or estimating a continuous value in regression, this technique allows the algorithm to generate well-informed predictions based on the features of the nearest neighbors.
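
To make this concrete, here is a minimal from-scratch sketch of the procedure just described; the toy arrays and the choice of k = 3 are illustrative assumptions, not values from the text.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    # Euclidean distance from the query point to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_labels = y_train[nearest]
    if task == "classification":
        # Majority vote among the k nearest neighbors
        return Counter(neighbor_labels).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return neighbor_labels.mean()

# Toy example: two features, two classes
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [8.0, 7.0], [9.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 2.5]), k=3))  # -> 0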

Steps involved in the K-NN algorithm

We can explain the working of KNN in the following steps; a short code sketch follows the list:

Step 1 – Select an appropriate number of neighbors, K.

Step 2 – Calculate the Euclidean distance (or any other chosen distance) between the new data point and every point in the training data.

Step 3 – Take the K nearest neighbors according to the calculated distances.

Step 4 – Among these K neighbors, count the number of data points in each category.

Step 5 – Assign the new data point to the category with the maximum number of neighbors.

Step 6 – Our model is ready to use.
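
The same steps can be carried out with scikit-learn's KNeighborsClassifier. The sketch below is illustrative: the synthetic dataset and the choice of K = 5 are assumptions, not values from the text.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 1: choose K (here K = 5)
# Steps 2-3: the classifier computes distances and finds the K nearest neighbors
# Steps 4-5: it counts neighbors per class and assigns the majority class
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)  # "lazy" fit: it mostly just stores the training data
print("Test accuracy:", knn.score(X_test, y_test))  # Step 6: the model is ready to use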

Distance metrics used in the KNN algorithm

The K-NN (K-Nearest Neighbors) algorithm works by identifying the closest data points or clusters to a given query point. This is achieved by using various distance metrics to measure the closeness of data points. Let's look at some of these metrics:

Euclidean Distance – the simplest distance measure between two points in a plane. It quantifies the Cartesian distance between these points, essentially the length of the straight line connecting them, and measures the total displacement between two states of an object in a given space. Visualizing this distance helps in understanding the direct spatial relationship and proximity between data points within a plane or hyperplane.
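
In standard notation, the Euclidean distance between points x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}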

Manhattan Distance – the Manhattan distance, also known as the taxicab or city block distance, is widely used when the total distance traveled by an object matters rather than just its straight-line displacement. This metric is obtained by summing the absolute differences between the coordinates of two points in n-dimensional space.
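
In the same notation, the Manhattan distance is:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|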

Minkowski Distance – the Minkowski distance is a generalization of both the Euclidean and Manhattan distances, and it is a metric in a normed vector space. This distance measure encapsulates various distance metrics within a single framework, providing a more general way to measure distances between points in a vector space.
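
In the same notation, the Minkowski distance of order p is:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}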

From the above formula, we can see that when p = 2 it is the same as the Euclidean distance formula, and when p = 1 we get the formula for the Manhattan distance.

The metrics described above are the most common ones used for machine learning problems, but other distance metrics can also be used if we prefer or the problem requires it. Hamming distance is an example of such a metric; it is quite useful for problems that require element-wise comparisons between two vectors whose contents can be Boolean or string values.
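
As a small illustration (the example values are hypothetical), the Hamming distance simply counts the positions at which two equal-length sequences differ:

def hamming_distance(a, b):
    # Count positions where the two equal-length sequences differ
    assert len(a) == len(b), "Hamming distance requires equal-length inputs"
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("karolin", "kathrin"))        # -> 3
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # -> 2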

Choosing the K value for the KNN algorithm

In the K-Nearest Neighbors (K-NN) algorithm, the parameter "k" plays an important role because it determines the number of neighbors considered when running the algorithm. Choosing the right value of 'k' is key and should be guided by the particularities of the input data. In situations where the dataset is noisy or contains a significant number of outliers, choosing a larger value of k often produces more accurate results.

For improved classification accuracy, it's advantageous to choose an odd 'k' value, as this helps prevent ties during the classification process. This choice can enhance the decisiveness of the algorithm when assigning a class to the query point.

Employing cross-validation techniques serves as a valuable method for determining the most suitable 'k' value for a given dataset. By systematically assessing different 'k' values and their impact on the model's performance, cross-validation aids in identifying the optimal 'k' that maximizes the algorithm's accuracy and generalizability.
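
A sketch of this idea using scikit-learn's cross_val_score; the candidate k values and the Iris dataset are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate several candidate odd values of k with 5-fold cross-validation
scores_by_k = {}
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    scores_by_k[k] = scores.mean()

best_k = max(scores_by_k, key=scores_by_k.get)
print("Mean CV accuracy per k:", scores_by_k)
print("Best k:", best_k)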

Applications of KNN algorithm

Data Preprocessing – When diving into a machine learning problem, the initial step often involves Exploratory Data Analysis (EDA). In this phase, identifying missing values in the data prompts the need for imputation methods. Among these methods, the KNN Imputer stands out as a sophisticated and effective approach for handling missing data.
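
For example, scikit-learn ships a KNNImputer that fills each missing value from the nearest rows; the tiny matrix below is an illustrative assumption.

import numpy as np
from sklearn.impute import KNNImputer

# Small feature matrix with missing values marked as np.nan
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing value is replaced by the mean of that feature
# over the 2 nearest rows (measured on the observed features)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))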

Pattern Recognition – The effectiveness of the K-Nearest Neighbors (KNN) algorithm shines through in scenarios such as training it using the MNIST dataset. Upon evaluation, this model demonstrates notably high accuracy, showcasing the algorithm's prowess in pattern recognition tasks.
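
A rough sketch of this kind of experiment, using scikit-learn's smaller built-in digits dataset as a stand-in for MNIST (an assumption made for brevity):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)   # 8x8 handwritten digit images
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("Digit recognition accuracy:", knn.score(X_test, y_test))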

Recommendation Engines – KNN finds prominent applications in recommendation engines. This algorithm's primary task involves assigning a new query point to an existing group, which has been formed using extensive datasets. In the realm of recommender systems, this capability proves crucial, enabling the allocation of users to specific groups based on their preferences. Subsequently, it facilitates personalized recommendations tailored to these groups' inclinations.
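
A minimal sketch of the neighbor-lookup idea behind such recommenders, using scikit-learn's NearestNeighbors on a hypothetical user-item ratings matrix:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical user-item ratings matrix (rows = users, columns = items)
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 0, 2],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]])

# Find the 2 nearest users to user 0 (the query user itself is included)
nn = NearestNeighbors(n_neighbors=2, metric="cosine")
nn.fit(ratings)
distances, indices = nn.kneighbors(ratings[0:1])
print("Users most similar to user 0:", indices[0])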

Advantages of the KNN algorithm

Easy to implement – the algorithm is easy to implement because its complexity is low compared with many other algorithms.

Adapts Easily – Due to the KNN algorithm's ability to retain all data in memory, whenever a new example or data point is added, it can automatically adapt to the new information and contribute to future predictions.

Few Hyperparameters – the KNN method requires only two hyperparameters: the choice of 'k' and the choice of the distance metric. Due to the small number of hyperparameters, we can easily customize the algorithm.

Disadvantages of the KNN algorithm

Does not scale – KNN is also known as the lazy algorithm. It got this moniker because it stores the entire dataset and defers computation until prediction time, which uses a lot of processing power and data storage and makes it resource- and time-intensive on large datasets.

Curse of Dimensionality – KNN suffers from the peaking phenomenon, which is caused by the curse of dimensionality: when the dataset has high dimensionality, the method has difficulty identifying data points correctly because distances become less informative.

Prone to overfitting – because the algorithm suffers from the curse of dimensionality, it is also prone to overfitting. Feature selection and dimensionality reduction techniques can help to overcome this problem.

Summary

An easy-to-use yet powerful technique for classification and regression problems is K-nearest neighbors, or KNN. Its task is to classify or assign a value to each new data point by taking a majority vote, or by averaging, among its k nearest neighbors in the training set. KNN can be slow when working with large datasets and is sensitive to features that are irrelevant or noisy, but it is simple to understand and performs well with small to medium-sized datasets. Despite being an extremely helpful and user-friendly algorithm, it can also encounter issues like overfitting and the curse of dimensionality.

Python Code

Here is a k-nearest neighbor Python example:
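
A minimal sketch, assuming scikit-learn and the Iris dataset (both illustrative choices), with feature scaling before fitting the classifier:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load data and split into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Scale features so that no single feature dominates the distance metric
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit (store) the KNN classifier and evaluate it on the held-out test set
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))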



