K-Nearest Neighbors
- The Intuition Behind K-NN Algorithm
- How KNN works
- Steps Involved in the K-NN Algorithm
- Distance Metrics Used in KNN
- Choosing the K value
- Applications of the KNN Algorithm
- Advantages and Disadvantages
K-Nearest Neighbours (K-NN) is a simple supervised machine learning algorithm that can handle both regression and classification problems. By comparing a new data point with the existing ones, the approach places the new example in the category of the most similar points already present. The main idea is to store the training data and classify incoming data according to how similar it is to the stored dataset.
The popularity of the K-Nearest Neighbours model rests on its stability and simplicity. Its non-parametric nature means it makes no assumptions about the underlying data distribution. Although applicable to both regression and classification, it predominantly finds use in classification problems. Termed a "lazy learner" algorithm, K-NN delays learning from the training set, instead storing data for on-the-spot classification when needed.
For instance, consider a
scenario with a variety of dog breed images used to train a K-NN model. When
presented with a new dog image, the K-NN algorithm identifies similarities
between its features and those of various dog breeds within the dataset. Based
on these shared features, the algorithm classifies the image into the most
similar breed category.
By emphasizing similarity measures, the K-Nearest Neighbour algorithm proves valuable in scenarios where a new data point's classification or regression depends heavily on its resemblance to existing data. In this chapter, we learn the k-nearest neighbour learning algorithm.
Let's take a real-world example for KNN
In this k-nearest neighbour example, suppose there is a mechanic who has a talent for fixing cars, but new electric vehicles stump them. The error codes are unlike any they have seen before, and diagnosing these electric vehicles feels like navigating a foreign language. Deciding to get clever, the mechanic turns to KNN, a technique that works like having a team of expert mechanics by their side.
Imagine KNN as a toolbox filled with past cases. Each case is a car with its symptoms (error codes, battery readings, etc.) and the mechanic's diagnosis (the problem identified). When a new electric car with a puzzling error rolls in, KNN swings into action.
KNN doesn't jump to conclusions; instead, it considers the past cases in the toolbox. It analyzes the new car's symptoms and compares them to all the past cases. KNN then picks a small group, the K nearest neighbours, the most similar cars based on their symptoms.
These K closest neighbours become like a consultation team for the mechanic. By looking at the problems those similar cars had (their diagnoses), KNN helps the mechanic predict the most likely issue with the new car. It's like having a group of experts who've tackled similar problems before, whispering their insights to the mechanic.
This collaborative approach proves valuable. Faced with an unfamiliar error code, the mechanic can consult the KNN toolbox, identify similar past cases, and make a well-informed diagnosis. With KNN as their secret weapon, the mechanic becomes the go-to electric car specialist, their skills boosted by the help of KNN.
The intuition behind the KNN algorithm
In the graph above, we can see two clusters or groups. Now if we have another point (also plotted on the graph) that is unclassified, we can easily assign it to a group. One approach is to examine the group to which its nearest neighbours belong. On that basis, the point in the diagram will be classified as 'green' because it is close to the green cluster.
Why do we need the KNN algorithm?
The widespread use of the K-Nearest Neighbors (K-NN) algorithm stems from its versatility and wide applicability. Its simplicity and straightforward implementation are key factors driving its use. One of its standout features is that it makes no assumptions about the underlying data distribution or how the data is scattered, making it suitable for various datasets, whether dealing with numerical or categorical data in classification and regression tasks.
A significant advantage
lies in its non-parametric nature, leading to predictions grounded in the
similarity among data points within the dataset. Furthermore, K-NN exhibits a
higher tolerance for outliers compared to alternative algorithms.
To discover the K nearest neighbors, the K-NN approach typically uses the Euclidean distance as its distance metric of choice. A data point's class or value is then determined by taking the majority vote (for classification) or the average (for regression) of the outcomes of its K nearest neighbors. Because of this adaptive strategy, the algorithm can identify varied patterns in the data and forecast outcomes based on the local structure of the data.
How does KNN work?
K-Nearest Neighbors (K-NN) stands as a versatile and widely embraced algorithm due to its adaptability across various applications. Its allure resides in its simplicity and ease of implementation. Notably, it distinguishes itself by eschewing assumptions about the inherent data distribution, rendering it suitable for datasets of diverse natures, encompassing numerical and categorical data in both classification and regression domains.
Its non-parametric
nature forms a cornerstone for predictions, relying on the inherent
similarities among data points housed within the dataset. Another distinctive
trait is its robustness against outliers, surpassing the resilience of
alternative algorithms in this aspect.
By utilizing distance metrics, particularly the prevalent Euclidean distance, the K-NN algorithm identifies the K closest neighbors. Subsequently, the determination of a data point's classification or value involves a process of averaging or considering the majority vote among these K neighbors. This adaptive methodology empowers the algorithm to discern varied data patterns and make predictions rooted in the localized structure of the dataset.
Choosing the K data points from the dataset (X) that have the shortest distance to the target data point (x) is the first step in the K-Nearest Neighbors (K-NN) technique. In classification problems, the method finds the most common label (y) among these K nearest neighbors of x. To determine the predicted value for x in a regression task, the method instead calculates the mean, or a weighted average, of the y values associated with the K nearest neighbors. Whether determining the most common label in classification or estimating a continuous value in regression, this technique allows the algorithm to generate well-informed predictions based on the features of the nearest neighbors.
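As a concrete illustration of this procedure, here is a minimal from-scratch sketch in Python with NumPy. The function name `knn_predict` and the arrays `X_train`, `y_train`, and `query` are illustrative names rather than anything prescribed by the text.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, task="classification"):
    # Distance from the query point to every training point (Euclidean by default).
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k training points closest to the query.
    nearest = np.argsort(distances)[:k]
    neighbor_labels = y_train[nearest]
    if task == "classification":
        # Most common label among the k nearest neighbours.
        return Counter(neighbor_labels).most_common(1)[0][0]
    # Regression: mean of the neighbours' target values.
    return neighbor_labels.mean()
```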
Steps involved in the working of the K-NN algorithm
We can explain the
working of KNN in the following steps:
Step 1 – Select an appropriate number of neighbors, K.
Step 2 – Calculate the Euclidean distance (or any other distance) between the new data point and every point in the training data.
Step 3 – Take the K nearest neighbors according to the calculated distances.
Step 4 – Among these K neighbors, count the number of data points in each category.
Step 5 – Assign the new data point to the category with the maximum number of neighbors.
Step 6 – Our model is ready to use.
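The following toy walk-through maps these six steps onto a few lines of Python; the 2-D points, labels, and query value are made up purely for illustration.

```python
import numpy as np
from collections import Counter

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [6.0, 5.0], [7.0, 6.0], [6.0, 7.0]])
y = np.array(["green", "green", "green", "red", "red", "red"])
query = np.array([2.0, 2.0])

k = 3                                       # Step 1: choose the number of neighbours K
dists = np.linalg.norm(X - query, axis=1)   # Step 2: Euclidean distance to every point
nearest = np.argsort(dists)[:k]             # Step 3: take the K nearest neighbours
votes = Counter(y[nearest])                 # Step 4: count the data points per category
prediction = votes.most_common(1)[0][0]     # Step 5: assign the majority category
print(prediction)                           # Step 6: the "model" is ready to use ("green")
```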
Distance metrics used in the KNN algorithm
The K-NN (K-Nearest Neighbors) algorithm works by identifying the closest data points or clusters to a given query point. This is achieved by using various distance metrics to measure the closeness of data points. Let's look at some of these metrics:
Euclidean Distance – This is the simplest distance measure between two points in a plane. It quantifies the Cartesian distance between the points, essentially the length of the straight line connecting them. This metric captures the total displacement between two states of an object in a given space. Visualizing this distance helps in understanding the direct spatial relationship and proximity between data points within a plane or hyperplane.
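In symbols, for two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the Euclidean distance is

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$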
Manhattan Distance – The Manhattan distance, also known as the taxicab distance or city block distance, is used when the total distance traveled by an object needs to be measured rather than its straight-line displacement. This metric is obtained by adding the absolute differences between the coordinates of two points in an n-dimensional space.
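Using the same notation, the Manhattan distance is

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$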
Minkowski Distance – The Minkowski distance is a generalization of both the Euclidean and Manhattan distances and is a metric in a normed vector space. This distance measure encapsulates various distance metrics within a single framework, providing a more general approach to measuring distances between points in a vector space.
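With a parameter $p \ge 1$, the Minkowski distance is

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$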
From the above formula, we can see that when p = 2 it is the same as the Euclidean distance formula, and when p = 1 we get the Manhattan distance formula.
The metrics described above are the most common ones used for machine learning problems, but other distance metrics can be used as well if the problem calls for it. Hamming distance is one such metric; it is quite useful for problems that require position-by-position comparison of two vectors whose contents can be Boolean or string values.
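In practice these metrics do not have to be implemented by hand; for example, SciPy provides ready-made versions of them. The sample vectors below are purely illustrative.

```python
from scipy.spatial.distance import cityblock, euclidean, hamming, minkowski

u, v = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]

print(euclidean(u, v))                 # straight-line (Euclidean) distance
print(cityblock(u, v))                 # Manhattan / city block distance
print(minkowski(u, v, p=3))            # Minkowski distance with p = 3
print(hamming([1, 0, 1], [1, 1, 1]))   # fraction of positions that differ
```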
Choosing the K value for the KNN algorithm
In the K-Nearest
Neighbors (K-NN) algorithm, the parameter "k" plays an important role
because it determines the number of neighbors considered when running the
algorithm. Choosing the right value of 'k' is key and should be guided by the
particularities of the input data. In situations where the data set is noisy or
contains a significant amount of outliers, choosing a larger value of k often
produces more accurate results.
For improved
classification accuracy, it's advantageous to choose an odd 'k' value, as this
helps prevent ties during the classification process. This choice can enhance
the decisiveness of the algorithm when assigning a class to the query point.
Employing cross-validation techniques serves as a valuable method for determining the most suitable 'k' value for a given dataset. By systematically assessing different 'k' values and their impact on the model's performance, cross-validation aids in identifying the optimal 'k' that maximizes the algorithm's accuracy and generalizability.
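As a sketch of this idea, the snippet below uses scikit-learn's cross_val_score to compare odd values of 'k' on a toy dataset; the Iris data and the range of k values are illustrative choices, not requirements.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores_per_k = {}
for k in range(1, 16, 2):                       # odd values of k help avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)   # 5-fold cross-validation accuracy
    scores_per_k[k] = scores.mean()

best_k = max(scores_per_k, key=scores_per_k.get)
print("Best k:", best_k, "accuracy:", scores_per_k[best_k])
```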
Applications of the KNN algorithm
- Data Preprocessing – When diving into a machine learning problem, the initial step often involves Exploratory Data Analysis (EDA). In this phase, identifying missing values in the data prompts the need for imputation methods. Among these methods, the KNN Imputer stands out as a sophisticated and effective approach for handling missing data (a small sketch appears after this list).
- Pattern Recognition – The effectiveness of the K-Nearest Neighbors (KNN) algorithm shines through in scenarios such as training it using the MNIST dataset. Upon evaluation, this model demonstrates notably high accuracy, showcasing the algorithm's prowess in pattern recognition tasks.
- Recommendation Engines – KNN finds prominent applications in recommendation engines. This algorithm's primary task involves assigning a new query point to an existing group, which has been formed using extensive datasets. In the realm of recommender systems, this capability proves crucial, enabling the allocation of users to specific groups based on their preferences. Subsequently, it facilitates personalized recommendations tailored to these groups' inclinations.
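As a small sketch of the data preprocessing use case mentioned in the first item, scikit-learn's KNNImputer fills in missing values from each row's nearest neighbours; the tiny array below is made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing value is replaced using the corresponding feature values
# of the 2 nearest complete rows.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```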
Advantages of the KNN algorithm
- Easy to implement – It is easy to implement because its logic is simple and its implementation complexity is lower than that of many other algorithms.
- Adapts Easily – Due to the KNN algorithm's ability to retain all data in memory, whenever a new example or data point is added, it can automatically adapt to the new information and contribute to future predictions.
- Few Hyperparameters: The KNN method requires only two parameters: the choice of 'k' and the choice of the distance metric for our evaluation measure. Due to the small number of hyperparameters, we can easily customize the algorithm.
Disadvantages of the KNN algorithm
- Does not scale – KNN is also called a lazy algorithm because it stores the entire training set and defers computation until prediction time. This makes the method resource- and time-intensive in terms of both processing power and data storage.
- Curse of Dimensionality – KNN is affected by the peaking phenomenon caused by the curse of dimensionality: the method has difficulty correctly classifying data points when the dataset has high dimensionality.
- Prone to overfitting – Because the algorithm suffers from the curse of dimensionality, it is also prone to overfitting. Feature selection and dimensionality reduction techniques can help to overcome this problem (see the sketch after this list).
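One possible way to act on the last point is to pair KNN with a dimensionality reduction step. The sketch below chains PCA and KNN in a scikit-learn pipeline; the digits dataset and the choice of 16 components are illustrative only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# Reduce the 64 pixel features to 16 principal components before running KNN.
model = make_pipeline(PCA(n_components=16), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(model, X, y, cv=5).mean())
```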
Summary
K-nearest neighbors, or KNN, is an easy-to-use yet powerful technique for classification and regression problems. Its task is to classify or assign a value to each new data point by taking the majority vote or the average of its k nearest neighbors in the training set. KNN can be slow when working with large datasets and is sensitive to features that are irrelevant or noisy, but it is simple to understand and performs well with small to medium-sized datasets. Despite being an extremely helpful and user-friendly algorithm, it can also encounter issues like overfitting and the curse of dimensionality.
Python Code
Here is a k-nearest neighbour code example in Python.
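What follows is a minimal end-to-end sketch using scikit-learn; the Iris dataset, the 70/30 split, and k = 5 are illustrative choices rather than anything prescribed above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Load a small benchmark dataset.
X, y = load_iris(return_X_y=True)

# 2. Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 3. Scale features so no single feature dominates the distance metric.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Fit the classifier with k = 5 and the default Euclidean (Minkowski, p = 2) metric.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# 5. Evaluate on the held-out test set.
y_pred = knn.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```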