
Monday, February 19, 2024

DECISION TREE IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

Decision Trees

  • Decision Trees
  • Advantages of Decision Trees
  • Disadvantages of Decision Trees
  • Appropriate Problems for Decision Tree Learning
  • Practical Issues in Learning Decision Trees
  • Classification and Regression Tree Algorithm 

Random forest is a well-known supervised machine learning technique that can handle both classification and regression problems. Based on ensemble learning concepts, it uses the combined knowledge of several classifiers to solve complex problems, improving the learning system's overall performance by drawing on the strengths of many models.

Random forest improves accuracy and reduces overfitting by combining predictions from several decision trees. Individual decision trees naturally have substantial variance. That variance is reduced when the trees are combined into a random forest, because each tree is trained on a different sample of the data and the final result depends on many trees rather than just one.

Raising the number of trees in a random forest improves accuracy and strengthens the model by reducing the likelihood of overfitting.
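
To make this concrete, here is a minimal sketch (not from the original post) that compares a single decision tree with a random forest using scikit-learn; the Iris dataset and all parameter choices are illustrative assumptions.

# Minimal sketch: comparing a single decision tree with a random forest.
# Assumes scikit-learn is installed; the Iris dataset is used only for illustration.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# A single, fully grown tree tends to have higher variance.
tree = DecisionTreeClassifier(random_state=0)

# A forest averages many trees trained on bootstrap samples, reducing variance.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("Single tree CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())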

Decision trees are used in machine learning to structure decision-making, reducing complicated situations to a sequence of choices derived from the incoming data. Using decision nodes, we can segment the data and forecast a target variable. Every node stands for a feature, and the branches indicate possible outcomes based on the feature's value. Because they handle both classification and regression problems and are easy to understand and analyze, decision trees find widespread application in many domains, including healthcare, marketing, and finance. With their straightforward structure and capacity to capture nonlinear relationships and interactions among features, decision trees are an invaluable tool for modeling intricate data patterns. Overfitting can occur if regularization or pruning is not done correctly, leading to models that do not generalize well. Decision trees also serve as the basis for more advanced ensemble approaches such as random forests and gradient boosting, and in this way they significantly advance predictive modeling and data analysis. In this blog, we learn about decision trees in machine learning with Python.

Why do we use Decision Trees?

The two main reasons why we should use decision trees are:

  1. Decision trees aim to replicate human decision-making processes, making them straightforward to comprehend.
  2. We can easily understand it because it looks like a tree structure.

Decision tree example

Let’s suppose Noah, a park ranger with a passion for wildlife, is tasked with a new challenge: identifying the increasing number of mysterious animal tracks appearing on the trails. The methods he previously used are not enough, and Noah needs a smarter way to decipher these cryptic clues. Here enters the decision tree method, Noah's secret weapon for unraveling the tracks’ secrets.

Imagine a decision tree in machine learning as a series of questions, like a branching path in the forest. Each question focuses on a specific characteristic of the tracks, such as size, number of toes, claw marks, depth, and stride length. Noah fed these data and more related data into the decision tree, paired with the corresponding animal identified by experienced trackers.

The decision tree doesn’t just memorize facts; it learns by analyzing the data. It creates a series of yes-or-no questions based on the most relevant track features. For example, the first question might be: “Are the tracks larger than 6 inches?” Depending on the answer, the decision tree branches out, leading to further questions about the number of toes or claw marks.

This branching structure very much looks like a detective’s flowchart, narrowing down the possibilities with each answered question. The decision tree continues branching until it concludes – the most likely animal that left the tracks.

The decision tree algorithm's real test comes when Noah encounters a set of fresh tracks unlike any he has seen before. He carefully measures and observes them, and provides this information to the decision tree algorithm to predict the outcome, or in this case the animal.

Noah started with the initial question: “Are the tracks larger than 6 inches?” Yes. The tree branched, leading to the next question: “Do the tracks have three toes?” No. This eliminated possibilities like deer or rabbits. The tree continued branching, asking about claw marks, stride length, and other details.

Finally, after a series of questions, the decision tree reached its conclusion: the tracks were most likely left by a bobcat.

Decision tree terminologies

Key Terminologies in Building a Decision Tree:

Root Node: Located at the top of the tree structure, the root node encompasses all the data and acts as the starting point for decision-making within the decision tree.

Internal (Decision) Node: These nodes represent decisions based on the features of the data. Internal nodes can give rise to further internal nodes or to leaf nodes as their offshoots.

Leaf Node: A terminal node, also known as a leaf node, has no child nodes and indicates a class label or a numerical value.

Splitting: Divides a node into several smaller nodes according to a split criterion and a chosen attribute.

Branch/Sub-tree: The segment of the decision tree that begins at an internal node and ends at a leaf node.

Parent Node: A node that splits to produce one or more child nodes.

Child Node: These nodes are created when a parent node is divided through a splitting process.

Impurity: It evaluates the consistency of the target variable in a subset of the data and indicates the randomness or uncertainty of the dataset. Decision trees typically use metrics such as "entropy" and the "Gini index" to measure impurity for classification tasks.

Variance: Variance in decision trees for regression problems illustrates the differences between the actual and predicted target variables in various dataset samples. This variance is typically measured using several metrics, such as mean square error, mean absolute error, Friedman's MSE, and half-Poisson Deviance.

Information Gain: This measure evaluates the uncertainty reduction achieved by partitioning the data according to a given feature. For each node, the tree evaluates the feature with the largest information gain to determine the most informative feature to split on, enabling the production of purer subsets.

Pruning: This entails removing branches of the decision tree that do not provide extra information and could otherwise cause overfitting.

Attribute Selection Measures

Building a Decision Tree requires a learning phase where the initial dataset is partitioned into subsets using Attribute Selection Measures (ASM). ASM plays a crucial role in decision tree algorithms by evaluating the effectiveness of various attributes in dividing datasets. Its main goal is to identify attributes that result in the most homogeneous subsets after splitting, so that the information gain is maximized. This iterative process of recursive partitioning occurs for each subset or subtree, driving the gradual construction of the decision tree.

One notable aspect is that the construction of a decision tree classifier doesn't necessitate prior domain knowledge or specific parameter settings, making it valuable for exploratory knowledge discovery. Additionally, decision trees are adept at handling high-dimensional data.

Entropy serves as a metric to gauge the level of randomness or uncertainty within a dataset. In datasets designed for classification tasks, entropy serves as a measure of randomness, calculated according to the distribution of class labels present in the dataset.

For a given subset of the original dataset containing K classes at the ith node, entropy serves as a metric of the level of disorder or uncertainty in that subset. This evaluation helps to assess the impurity of a node and influences the selection of attributes during the construction of the decision tree:

H(S) = - Σ_{k=1..K} p(k) log2 p(k)

In the above equation:

  • S is the dataset sample.
  • k is the particular class from K classes
  • p(k) is the proportion of the data that belong to class k to the total number of data points in dataset sample S.
  • In the equation, p(k) must not be equal to zero.

There are some important points we should remember if we are using entropy, and they are:

  1. If the dataset is fully homogeneous then the entropy is 0, which means that every instance or data point in the dataset belongs to a single class. This is the lowest entropy, indicating that there is no uncertainty in the dataset sample.
  2. Entropy is at its highest value when the dataset is equally divided among its classes. A uniform distribution of class labels indicates maximal uncertainty in the dataset sample.
  3. Entropy is also utilized to evaluate the effectiveness of a split. It aims to create more homogeneous subsets within the dataset concerning class labels, thereby minimizing the entropy of the resulting subsets.
  4. The decision tree algorithm selects the attribute with the highest information gain as the splitting criterion, and this process iterates to construct the decision tree further.
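
As a small illustration of the entropy measure described above, here is a minimal Python sketch; the helper name entropy and the toy label lists are our own, not part of the original post.

# Minimal sketch of the entropy calculation described above.
# `labels` is a hypothetical list of class labels for one dataset sample S.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    # p(k) is the proportion of points in S that belong to class k.
    return -sum(p * math.log2(p) for p in probs)

print(entropy(["cat", "cat", "cat", "cat"]))  # entropy 0: every point belongs to one class
print(entropy(["cat", "dog", "cat", "dog"]))  # 1.0: maximum uncertainty for two classes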

Gini Impurity or Index – The Gini index serves as a metric to assess the accuracy of a split among classified groups, providing scores between 0 and 1. A score of 0 indicates that all observations belong to a single class, while a score of 1 signifies a random distribution of elements across classes. Hence, aiming for a lower Gini index score is ideal, indicating more homogeneous subsets after the split. This metric serves as an evaluation tool for decision tree models, allowing the assessment of the model's effectiveness in creating informative splits.

Gini(S) = 1 - Σ_i (p_i)^2

In the above equation, p_i (p subscript i) is the proportion of elements in the set that belong to the ith category.
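
A matching sketch of the Gini impurity computation (the helper name gini and the toy labels are our own):

# Minimal sketch of the Gini impurity described above.
from collections import Counter

def gini(labels):
    n = len(labels)
    # 1 minus the sum of squared class proportions p_i.
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "yes"]))       # 0.0: pure node
print(gini(["yes", "no", "yes", "no"]))  # 0.5: evenly mixed two classes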

Information Gain – Information gain measures the decrease in entropy or variance achieved by dividing a dataset according to a specific attribute. Within decision tree methods or algorithms, it evaluates the importance of a feature by creating subsets that are more uniform or homogeneous concerning the class or target variable. A higher information gain suggests that the feature is more useful for predicting the target variable, highlighting its importance in improving the uniformity of subsets after splitting.

The information gain of an attribute A, concerning a dataset S, is calculated as follows:

IG(S, A) = H(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) × H(S_v)

In the above equation

  • A is the specific attribute (feature) used for the split, and Values(A) is the set of values it can take.
  • H(S) is the entropy of dataset sample S.
  • S_v is the subset of S whose instances have the value v for attribute A; |S_v| is the number of instances in that subset, and H(S_v) is its entropy.

When building a decision tree, the primary criterion for partitioning is the most informative attribute. Information gain indicates the amount of entropy or variance that is reduced when a dataset is divided based on attribute A.

Information gain is central to the operation of both regression and classification decision trees. Classification trees consider entropy (or the Gini index) when evaluating splits, while regression trees consider variance. The information gain computation is the same in both cases; only the impurity measure changes.
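
Putting entropy and information gain together, the following sketch scores a candidate split; it repeats the small entropy helper so it runs on its own, and the labels are made up purely for illustration.

# Minimal sketch of information gain for a candidate split.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # Entropy of the parent minus the size-weighted entropy of the child subsets.
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]   # a perfect split on some attribute
print(information_gain(parent, [left, right]))            # 1.0: the split removes all uncertainty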

Decision tree algorithm and its working

To generate a prediction for a record, the decision tree starts at the root node of the tree. The algorithm compares the record's attribute value with the root attribute and, based on this comparison, follows the corresponding branch, checking the attribute conditions at every stage to decide which node to move to next.

This comparison is repeated at each subsequent node of the decision tree, and the cycle keeps going until it reaches a leaf node of the tree. You can understand the full operation by following these steps.

First Step: The dataset is contained in the root node, S, at the very top of the tree.

Second Step: The algorithm's second phase involves determining which attribute in the dataset is the best one using the Attribute Selection Measure (ASM).

Third Step: The algorithm splits dataset S into subsets containing possible values for the best attribute in the third stage.

Fourth Step: Making a decision tree node with the selected top attribute is the fourth step.

Fifth Step: The process iteratively creates new decision trees using the dataset subsets created in Step 3. The recursive construction process ends and the final node in the categorization or regression tree process is designated as a leaf node when more classification is no longer possible.

Let’s suppose a man wants to buy a new mobile phone and needs to decide which type of phone to buy. To solve this problem we can use a decision tree. It starts with the operating system the user wants; that decision node splits into further decision nodes. Once the user selects an operating system, the tree moves to the next node, which relates to camera quality, and then splits again into further decision nodes. After that selection, the tree splits on processor and brand, and then into further categories. This process continues until we reach a leaf node. Using this mechanism we can easily decide which mobile phone the user wants.


The above decision tree chart shows the working of the decision tree.
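
The "series of yes-or-no questions" view of a decision tree can be seen directly in code. The sketch below is an illustration only, assuming scikit-learn and using the Iris dataset as a stand-in; it fits a small tree and prints its decision nodes and leaves as nested questions.

# Minimal sketch: fitting a small tree and printing it as if/else questions,
# mirroring the root-node -> decision-node -> leaf-node flow described above.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each printed line is one decision node ("is feature <= threshold?");
# the indented leaves give the predicted class.
print(export_text(clf, feature_names=list(iris.feature_names)))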

Advantages of the Decision Tree
  • It is easy to understand because it tries to follow the same process that humans follow to make a decision.
  • It is very useful in cases where we need to make a decision.
  • It helps us think through the possible outcomes of a problem and their consequences.
  • It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

  • As the decision tree grows, it develops several layers, and when the data is very heterogeneous, the tree expands with additional layers, increasing complexity.
  • Another issue related to decision trees is potential overfitting, which occurs when a model picks up noise in the training data instead of general patterns. One way to address this problem is to use the Random Forest algorithm.
  • As the number of class identifiers increases, the computational complexity of the decision tree can also increase.

Appropriate problems for Decision tree learning

Decision tree learning is particularly well-suited for problems characterized by the following traits:

Instances represented by attribute-value pairs: Commonly, instances are portrayed using attribute-value pairs, such as temperature, and corresponding values like hot, cold, or mild. Ideally, attributes possess a finite set of distinct values, simplifying the construction of decision trees. Advanced versions of decision tree algorithms can handle attributes with continuous numerical values, allowing the representation of variables like temperature on a numerical scale.

Discrete output values for the target function: Decision trees are commonly developed for Boolean classification, with binary outcomes such as yes or no. While primarily used for two-valued outcomes, decision tree methods can be extended to handle target functions with multiple discrete output values, although applications with numeric outputs are less frequent.

Need for disjunctive descriptions: Decision trees naturally accommodate disjunctive expressions, enabling effective representation of complex relationships within data.

Resilience towards errors in training data: Decision tree learning techniques demonstrate resilience towards errors present in training data, including inconsistencies in categorization or discrepancies in feature details characterizing cases.

Handling missing attribute values in training data: In some cases, training data might have missing or absent characteristics. Despite encountering unknown features in certain training samples, decision tree approaches can still be applied effectively. For example, when considering humidity levels throughout the day, this data might be available for only a specific subset of training samples.

Practical issues in learning decision trees include
  • Determining how deeply to grow the decision tree
  • Handling continuous-valued attributes
  • Choosing the most effective attribute selection measure
  • Handling training data with missing attribute values
  • Handling attributes with differing costs
  • Improving computational efficiency

The CART (Classification and Regression Trees) algorithm is used to create decision trees for both classification and regression purposes. It selects the optimal split at each node according to a designated metric, such as Gini impurity or information gain. The basic procedure of the CART algorithm can be outlined as follows:
  1. Root Node Initialization: Begin with the root node of the tree, representing the complete training dataset.
  2. Evaluate Feature Impurity: Examine all features in the dataset to measure the level of impurity in the data. Classification tasks use metrics such as the Gini index and entropy to quantify impurity, while regression tasks use metrics such as mean squared error, mean absolute error, Friedman's MSE, or half-Poisson deviance.
  3. Select the Feature That Will Provide the Most Information: Determine whether the data-splitting feature minimizes impurity or produces the most relevant information.
  4. Partition the Dataset: Divide the dataset into two subsets based on the selected attribute's values, for instance "yes" and "no" for a binary attribute. The goal is to create subsets that are as homogeneous as possible with respect to the target variable.
  5. Assess Subset Impurity: Evaluate the impurity of each resulting subset based on the target variable.
  6. Iterative Process: Repeat steps 2-5 for each subset until a termination condition is met. The stopping criteria can be reaching the maximum tree depth, falling below the minimum number of samples needed to split a node, or reaching a minimum impurity threshold (see the sketch after this list).
  7. Assign Values to Terminal Nodes: For each terminal node, also known as a leaf node, assign the most frequent class among its samples for classification tasks or the average target value for regression tasks. This ensures that the model can make predictions for new data instances using what it learned during training.

These steps collectively facilitate the creation of an effective decision tree through iterative evaluation and feature splitting, catering to both classification and regression tasks.
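
As a hedged illustration of the stopping criteria in step 6, they map onto common hyperparameters of scikit-learn's CART implementation; the particular values below are arbitrary examples, not recommendations.

# Minimal sketch: stopping criteria expressed as scikit-learn hyperparameters
# (assumes a recent scikit-learn version).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier(
    criterion="gini",           # or "entropy" for classification impurity
    max_depth=4,                # maximum tree depth
    min_samples_split=10,       # minimum samples required to split a node
    min_impurity_decrease=0.0,  # minimum impurity reduction required for a split
)

reg = DecisionTreeRegressor(
    criterion="squared_error",  # MSE; "absolute_error", "friedman_mse", "poisson" also exist
    max_depth=4,
)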

Classification and Regression Tree algorithm for Classification

Let the data at node m be Q_m (Q subscript m) with n_m (n subscript m) samples, and let t_m (t subscript m) be the candidate threshold for node m. A candidate split divides Q_m into a left subset Q_m^left and a right subset Q_m^right, and the classification criterion of the classification and regression tree algorithm can be written as:

G(Q_m, t_m) = (n_m^left / n_m) H(Q_m^left) + (n_m^right / n_m) H(Q_m^right)

In the above equation
  • The impurity measure for the left and right subsets at node m is denoted as H. H's value can be determined using entropy or Gini impurity.
  • n_m (n subscript m) is the total number of instances at node m, while n_m^left and n_m^right are the numbers of instances in the left and right subsets.

For selecting the parameter, we can write the equation as:

t_m* = argmin_{t_m} G(Q_m, t_m)

Classification and Regression Tree algorithm for Regression

For regression problems, let the data available at node m be Q_m with n_m samples and t_m as the threshold for node m. The classification and regression tree algorithm for regression can then be written with the mean squared error as the impurity measure:

MSE(Q_m) = (1/n_m) Σ_{y ∈ Q_m} (y - ȳ_m)^2,   where ȳ_m = (1/n_m) Σ_{y ∈ Q_m} y

In the above equation, MSE is the mean squared error, ȳ_m is the mean target value at node m, and n_m (n subscript m) is the number of instances at node m (and likewise for the left and right subsets).

For selecting the parameter, we can write the equation as:

t_m* = argmin_{t_m} G(Q_m, t_m), with the MSE used as the impurity measure H

Strengths of the Decision Tree Approach
  • Decision trees offer interpretability due to their rule-based nature, allowing easy comprehension of generated rules.
  • They perform classification tasks with minimal computational demand.
  • Capable of handling both continuous and categorical variables, making them versatile.
  • Highlighting the importance of fields for prediction or classification purposes.
  • Simplicity in usage and implementation, requiring no specialized expertise, rendering them accessible to a broad user base.
  • Scalability: decision trees can be grown to considerable depth and scaled to large datasets.
  • The ability to flexibly handle missing data or values makes them well-suited for incomplete or missing data sets.
  • Ability to handle non-linear relationships between variables, enhancing suitability for complex datasets.
  • Adequacy in handling imbalanced datasets by adjusting node importance based on class distribution, ensuring functionality even with heavily skewed class representation.

Weaknesses of the Decision Tree Approach

  • When it comes to predicting continuous attribute values, decision trees tend to be less efficient for estimation tasks.
  • In classification scenarios with numerous classes and a limited training dataset, decision trees often encounter challenges, resulting in higher error rates.
  • The computational expense in training decision trees can be notable, particularly when growing trees to fit larger datasets. Sorting candidate splits and searching for optimal combinations of fields during tree construction can be resource-intensive. Similarly, pruning algorithms can be costly as they involve forming and comparing numerous sub-trees.
  • There's a high risk of overfitting, particularly with complex and deep trees, impacting performance on new, unseen data.
  • Little variations in the training data might yield entirely different decision trees, complicating result comparison or reproduction.
  • Many decision tree models encounter difficulties when dealing with missing data, requiring strategies such as imputation or deletion of records containing missing values to address this issue.
  • Biased trees may result from improper initial data splitting, particularly in cases of unbalanced datasets or rare classes, which can affect the accuracy of the model.
  • The scaling of input features can significantly impact decision trees, particularly when employing distance-based metrics or comparison-intensive decision rules.
  • Decision trees have limitations in representing intricate relationships between variables, especially concerning nonlinear or interactive effects. This can lead to less accurate modeling of complex relationships within the dataset.

Summary

Among the most practical supervised learning models are decision trees, for both regression and classification. They separate data based on features by using a tree structure, where nodes represent features, branches represent decisions, and leaf nodes represent outcomes or forecasts. Known for their interpretability and visualization capabilities, decision trees can handle various types of data, including numerical and categorical data. However, they run the risk of overfitting without proper pruning and may not capture complex data relationships as effectively as alternative algorithms. Using techniques such as ensemble methods can improve their efficiency and robustness.

Python Code
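
Below is a minimal, self-contained sketch of a decision tree classifier in Python. It assumes scikit-learn is available and uses the Iris dataset purely as a stand-in, since the post does not specify a dataset.

# Minimal sketch of a decision tree classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data and split into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a CART-style tree using entropy as the attribute selection measure.
model = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate on unseen data.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))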








K-NEAREST NEIGHBOUR IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

K-Nearest Neighbors

  • The Intuition Behind K-NN Algorithm
  • How KNN works
  • Steps Involved in the K-NN Algorithm
  • Distance Metrics Used in KNN
  • Choosing the K value
  • Applications of the KNN Algorithm
  • Advantages and Disadvantages

Regression as well as classification problems may be handled using the simple supervised machine learning algorithm K-Nearest Neighbours (K-NN). By comparing new data points with existing ones, this approach places a new example in the category of the most similar examples already in existence. The main idea is to store the training data and classify incoming data according to how similar it is to the previously stored dataset.

The K-Nearest Neighbours model remains popular because of its stability and simplicity. Its non-parametric nature means it makes no assumptions about the underlying data distribution. Although applicable to both regression and classification, it predominantly finds use in classification problems. Termed a "lazy learner" algorithm, K-NN delays learning from the training set, instead storing the data for on-the-spot classification when needed.

For instance, consider a scenario with a variety of dog breed images used to train a K-NN model. When presented with a new dog image, the K-NN algorithm identifies similarities between its features and those of various dog breeds within the dataset. Based on these shared features, the algorithm classifies the image into the most similar breed category.

By emphasizing similarity measures, the K-Nearest Neighbour algorithm proves valuable in scenarios where a new data point's classification or regression depends heavily on its resemblance to existing data. In this post, we learn about the k-nearest neighbour learning algorithm.

Let's take a real-world example for KNN

In this k-nearest neighbor example, let's suppose there is a mechanic who has a talent for fixing cars, but new electric vehicles stump them. The error codes are unlike any they have seen before, and diagnosing these electric vehicles feels like navigating a foreign language. Deciding to get clever, the mechanic looks to KNN, a technique like having a team of expert mechanics at their side.

Imagine KNN as a toolbox filled with past cases. Each case is a car with its symptoms (error codes, battery readings, etc.) and the mechanic’s diagnosis (the problem identified). When a new electric car with a puzzling error rolls in, KNN swings into action.

KNN doesn’t jump to conclusions; instead, it considers the past cases in the toolbox. It analyzes the new car’s symptoms and compares them to all the past cases. KNN then picks a small group, the K nearest neighbors: the most similar cars based on their symptoms.

These K closest neighbors become like a consultation team for the mechanic. By looking at the problems those similar cars had (their diagnoses), KNN helps the mechanic predict the most likely issue with the new car. It’s like having a group of experts who have tackled similar problems before, whispering their insights to the mechanic.

This collaborative approach proved valuable. Faced with an unfamiliar error code, mechanics could consult their KNN toolbox, identify similar past cases, and make a well-informed diagnosis. With KNN as their secret weapon, the mechanic became the go-to electric car mechanic, their skills boosted by the help of KNN. 

The intuition behind the KNN algorithm


[Image: a scatter plot with two clusters of points and one unclassified point to be assigned to a group; image source: original]

In the above graph, we can see two clusters or groups. Now if we have another point (also plotted on the graph) that is unclassified, we can easily assign it to a group. One approach to accomplish this is by examining the group to which its nearest neighbors belong. This means that the point in the above diagram will be ‘green’ because it is close to the green cluster.

Why do we need the KNN algorithm?

The widespread utilization of the K-Nearest Neighbors (K-NN) algorithm stems from its versatility and wide applicability. Its simplicity and straightforward implementation are key factors driving its use. One of its standout features is its lack of assumptions about the underlying data distribution or how the data is scattered, making it suitable for various datasets, whether dealing with numerical or categorical data in classification and regression tasks.

A significant advantage lies in its non-parametric nature, leading to predictions grounded in the similarity among data points within the dataset. Furthermore, K-NN exhibits a higher tolerance for outliers compared to alternative algorithms.

To discover the K nearest neighbors, the K-NN approach typically uses the Euclidean distance as its distance metric of choice. A data point's class or value is determined by averaging or by taking into account the outcomes of its K nearest neighbors. Because of this adaptive strategy, the algorithm can identify various patterns in the data and forecast outcomes based on the local structure of the data. 

How does KNN work?

K-Nearest Neighbors (K-NN) stands as a versatile and widely embraced algorithm due to its adaptability across various applications. Its allure resides in its simplicity and ease of implementation. Notably, it distinguishes itself by eschewing assumptions about the inherent data distribution, rendering it suitable for datasets of diverse natures, encompassing numerical and categorical data in both classification and regression domains.

Its non-parametric nature forms a cornerstone for predictions, relying on the inherent similarities among data points housed within the dataset. Another distinctive trait is its robustness against outliers, surpassing the resilience of alternative algorithms in this aspect.

By utilizing distance metrics, particularly the prevalent Euclidean distance, the K-NN algorithm identifies the K closest neighbors. Subsequently, the determination of a data point's classification or value involves a process of averaging or considering the majority vote among these K neighbors. This adaptive methodology empowers the algorithm to discern varied data patterns and make predictions rooted in the localized structure of the dataset.

Choosing the K data points from the dataset (X) that have the shortest distance to the target data point (x) is the first step in the K-Nearest Neighbors (K-NN) technique. In classification problems, the method finds the most common label (y) among these K nearest neighbors of x. To determine the predicted value for x in a regression job, the method calculates the mean (or a weighted average) of the y values connected to the K nearest neighbors. Whether determining the most common label in classification or estimating a continuous value in regression scenarios, this technique allows the algorithm to generate well-informed predictions based on the features of the nearest neighbors.

Steps involved in the K-NN algorithm

We can explain the working of KNN in the following steps:

Step 1- The number of neighbors, K, needs to be selected appropriately.

Step 2 – Calculate the Euclidean distance (or any other chosen distance metric) from the new data point to the existing data points.

Step 3 – Take the K nearest neighbors according to the calculated distances.

Step 4 – Among these K neighbors, count the number of data points in each category.

Step 5 – Assign the new data point to the category with the maximum number of neighbors.

Step 6 – Our model is ready to use.
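
A minimal from-scratch sketch of these steps is shown below; the function name knn_predict and the toy data are our own and are only meant to mirror the steps above.

# Minimal from-scratch sketch of the K-NN steps (not a library API).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Steps 2-3: compute Euclidean distances and take the k nearest neighbours.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count labels among the neighbours and pick the majority class.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array(["green", "green", "red", "red"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> "green"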

Distance metrics used in the KNN algorithm

The K-NN (K-Nearest Neighbors) algorithm works by identifying the closest data points or clusters to a given query point. This is achieved by using various distance metrics to measure the closeness of data points. Let's look at one of these metrics:

Euclidean Distance represents the simplest distance measure between two points in a plane. It quantifies the Cartesian distance between these points, essentially illustrating the length of the straight line connecting the considered points. This metric serves to calculate the total displacement undergone between two states of an object or entities in a given space. Visualizing this distance helps in understanding the direct spatial relationship and proximity between data points within a plane or hyperplane.

Manhattan Distance – The Manhattan distance, also known as the taxicab distance or city block distance, is widely used when the entire distance traveled by an object needs to be calculated rather than just moved. This metric is obtained by adding the absolute differences in the coordinates of two points in a space with dimensions n.

Minkowski Distance – One illustration of both the geometric and Manhattan distances is the Minkowski distance. The Minkowski distance is an extension of the Euclidean and Manhattan distances, and it is a metric in a normed vector space. This distance measure encapsulates various distance metrics within a single framework, providing a more generalized approach to measuring distances between points in a vector space.

D(x, y) = (Σ_i |x_i - y_i|^p)^(1/p)

From the above formula, we can see that when p = 2 the metric is the same as the Euclidean distance formula, and when p = 1 we get the Manhattan distance formula.

The above-described metrics are the most common metrics that we can use for dealing with machine learning problems, but we can also use other distance metrics as well if we like or the problem needs it. Hamming distance is an example of such a metric which is quite useful when we have problems that require overlapping comparisons between two vectors whose contents can be Boolean as well as string values.
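
The three metrics above can all be expressed through a single Minkowski helper, as in this small sketch (the function name and example vectors are illustrative):

# Minimal sketch of the distance metrics discussed above.
import numpy as np

def minkowski(x, y, p):
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 8.0])
print(minkowski(a, b, p=2))  # Euclidean distance: sqrt(9 + 16 + 25)
print(minkowski(a, b, p=1))  # Manhattan distance: 3 + 4 + 5 = 12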

Choosing the K value for the KNN algorithm

In the K-Nearest Neighbors (K-NN) algorithm, the parameter "k" plays an important role because it determines the number of neighbors considered when running the algorithm. Choosing the right value of 'k' is key and should be guided by the particularities of the input data. In situations where the data set is noisy or contains a significant amount of outliers, choosing a larger value of k often produces more accurate results.

For improved classification accuracy, it's advantageous to choose an odd 'k' value, as this helps prevent ties during the classification process. This choice can enhance the decisiveness of the algorithm when assigning a class to the query point.

Employing cross-validation techniques serves as a valuable method for determining the most suitable 'k' value for a given dataset. By systematically assessing different 'k' values and their impact on the model's performance, cross-validation aids in identifying the optimal 'k' that maximizes the algorithm's accuracy and generalizability.
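
A minimal sketch of this cross-validation approach, assuming scikit-learn and using the Iris dataset only as a stand-in:

# Minimal sketch: picking 'k' by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k to avoid ties, as suggested above.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)
print("Cross-validated accuracy per k:", scores)
print("Best k:", best_k)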

Applications of KNN algorithm

  • Data Preprocessing – When diving into a machine learning problem, the initial step often involves Exploratory Data Analysis (EDA). In this phase, identifying missing values in the data prompts the need for imputation methods. Among these methods, the KNN Imputer stands out as a sophisticated and effective approach for handling missing data (see the sketch after this list).
  • Pattern Recognition – The effectiveness of the K-Nearest Neighbors (KNN) algorithm shines through in scenarios such as training it using the MNIST dataset. Upon evaluation, this model demonstrates notably high accuracy, showcasing the algorithm's prowess in pattern recognition tasks.
  • Recommendation Engines – KNN finds prominent applications in recommendation engines. This algorithm's primary task involves assigning a new query point to an existing group, which has been formed using extensive datasets. In the realm of recommender systems, this capability proves crucial, enabling the allocation of users to specific groups based on their preferences. Subsequently, it facilitates personalized recommendations tailored to these groups' inclinations.
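
For the data preprocessing use case, here is a minimal sketch of KNN-based imputation with scikit-learn's KNNImputer; the toy matrix is made up for illustration.

# Minimal sketch of KNN-based imputation with scikit-learn's KNNImputer.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])

# Each missing value is filled using the n_neighbors rows that are closest
# on the observed features.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))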

Advantages of the KNN algorithm

  • Easy to implement – it is easy to implement because its complexity is not very high compared to other algorithms.
  • Adapts Easily – Due to the KNN algorithm's ability to retain all data in memory, whenever a new example or data point is added, it can automatically adapt to the new information and contribute to future predictions.
  • Few Hyperparameters: The KNN method requires only two parameters: the choice of 'k' and the choice of the distance metric for our evaluation measure. Due to the small number of hyperparameters, we can easily customize the algorithm.

Disadvantages of the KNN algorithm

  • Does not scale – KNN is also known as the lazy algorithm. It got this moniker because it stores all the data and does the heavy computation at prediction time, which makes it resource- and time-intensive.
  • Curse of Dimensionality – KNN is affected by the peaking phenomenon, which is caused by the curse of dimensionality: the method has difficulty correctly classifying data points when the dataset has high dimensionality.
  • Prone to overfitting – because the algorithm suffers from the curse of dimensionality, it is prone to overfitting as well. Feature selection and dimensionality reduction techniques can help to overcome this problem.

Summary

An easy-to-use yet powerful technique for classification and regression problems is k-nearest neighbors (KNN). It classifies or assigns a value to each new data point by taking the majority vote or the average of its k nearest neighbors in the training set. KNN can be slow when working with large datasets and is sensitive to features that are irrelevant or noisy, but it is simple to understand and performs well with small to medium-sized datasets. Despite being an extremely helpful and user-friendly algorithm, it can also encounter issues like overfitting and the curse of dimensionality.

Python Code

Here is the k-nearest neighbor Python code:
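
What follows is a minimal sketch, assuming scikit-learn and using the Iris dataset as a stand-in (the post does not name a dataset); feature scaling is included because KNN is distance based.

# Minimal sketch of a k-nearest neighbours classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling matters for distance-based methods such as KNN.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))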



