Monday, February 19, 2024

DECISION TREE IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

Decision Trees

  • Decision Trees
  • Advantages of Decision Trees
  • Disadvantages of Decision Trees
  • Appropriate Problems for Decision Tree Learning
  • Practical Issues in Learning Decision Trees
  • Classification and Regression Tree Algorithm 

Random forest is a well-known supervised machine learning technique that can handle both classification and regression problems. Built on ensemble learning concepts, it combines the knowledge of several classifiers to solve complex problems, improving the learning system's overall performance by drawing on the strengths of many models.

Random forest improves accuracy and reduces overfitting by combining predictions from several decision trees. Individual decision trees have substantial intrinsic variance, but this variance shrinks when the trees are aggregated into a random forest: because each tree is trained on a different sample of the data, the final prediction depends on many trees rather than one, which lowers the overall variance.

Raising the tree count in a random forest model improves accuracy and strengthens the model by reducing the likelihood of overfitting.
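
As a quick illustration of this ensemble effect, here is a minimal sketch (assuming scikit-learn is available) that compares a single decision tree with a random forest whose size is controlled by n_estimators; the synthetic dataset and parameter values are illustrative choices, not part of the original post.

```python
# Minimal sketch: one decision tree vs. a random forest (illustrative, not the post's code).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic tabular data stands in for any real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("Single tree accuracy:", single_tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))
```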

In machine learning, decision trees organize decision-making by reducing complicated situations to a sequence of choices derived from the input data. Decision nodes segment the data and help forecast a target variable: every node stands for a feature, and the branches indicate possible outcomes based on that feature's value. Because they handle both classification and regression problems and are easy to understand and analyze, decision trees are widely used in many domains, including healthcare, marketing, and finance. With their straightforward structure and capacity to capture nonlinear relationships and interactions among features, decision trees are an invaluable tool for modeling intricate data patterns. Overfitting can occur if regularization or pruning is not done correctly, leading to models that do not generalize well. Nevertheless, by serving as the basis for more advanced ensemble approaches such as random forests and gradient boosting, decision trees significantly advance predictive modeling and data analysis. In this blog, we learn about decision trees for machine learning in Python.

Why do we use Decision Trees?

The two main reasons why we should use decision trees are:

  1. Decision trees aim to replicate human decision-making processes, making them straightforward to comprehend.
  2. Their tree-like structure makes them easy to visualize and understand.

Decision tree example

Let’s suppose Noah, a park ranger with a passion for wildlife, is tasked with a new challenge: identifying the increasing number of mysterious animal tracks appearing on the trails. The methods he used previously are no longer enough, and Noah needs a smarter way to decipher these cryptic clues. Enter the decision tree method, Noah's secret weapon for unraveling the tracks’ secrets.

Imagine a machine learning decision tree as a series of questions, like a branching path in the forest. Each question focuses on a specific characteristic of the tracks, such as size, number of toes, claw marks, depth, and stride length. Noah fed these and other related data into the decision tree, paired with the corresponding animal identified by experienced trackers.

The decision tree doesn’t just memorize facts; it learns by analyzing the data. It creates a series of yes-or-no questions based on the most relevant track features. For example, the first question might be: “Are the tracks larger than 6 inches?” Depending on the answer, the decision tree branches out, leading to further questions about the number of toes or claw marks.

This branching structure looks very much like a detective’s flowchart, narrowing down the possibilities with each answered question. The decision tree continues branching until it reaches a conclusion: the most likely animal that left the tracks.

The decision tree algorithm’s real test comes when Noah encounters a set of fresh tracks unlike any he has seen before. He carefully measures and observes them, then provides this information to the decision tree algorithm to predict the outcome, or in this case the animal.

Noah started with the initial question: “Are the tracks larger than 6 inches?” Yes. The tree branched, leading to the next question: “Do the tracks have three toes?” No. This eliminated possibilities like deer and rabbits. The tree continued branching, asking about claw marks, stride length, and other details.

Finally, after a series of questions, the decision tree reached its conclusion: the tracks were most likely left by a bobcat.

Decision tree terminologies

Key Terminologies in Building a Decision Tree:

Root Node: Located at the top of the tree structure, the root node encompasses all the data, acting as the starting point for decision-making within the decision tree.

Internal Node: These nodes represent decisions based on the features in the data. An internal node can give rise to further internal nodes or to leaf nodes as its offshoots.

Leaf Node: A terminal node, also known as a leaf node, has no child nodes and indicates a class label or a numeric value.

Splitting: The process of dividing a node into several smaller nodes according to a split criterion and a chosen attribute.

Branch/Sub-tree: A segment of the decision tree that begins at an internal node and ends at a leaf node.

Parent Node: A node that splits to produce one or more child nodes.

Child Node: These nodes are created when a parent node is divided through a splitting process.

Impurity: It evaluates the homogeneity of the target variable in a subset of the data and indicates the randomness or uncertainty of the dataset. For classification tasks, decision trees typically measure impurity with metrics such as entropy and the Gini index.

Variance: For regression problems, variance illustrates the differences between the actual and predicted target values across dataset samples. It is typically measured using metrics such as mean squared error, mean absolute error, Friedman's MSE, or half-Poisson deviance.

Information Gain: This measure evaluates the reduction in uncertainty achieved by partitioning the data according to a given feature. At each node, the feature with the largest information gain is chosen as the most informative feature to split on, producing purer subsets.

Pruning: Removing branches of the decision tree that do not provide extra information, which helps prevent overfitting.

Attribute Selection Measures

Building a decision tree requires a learning phase in which the initial dataset is partitioned into subsets using Attribute Selection Measures (ASM). ASM plays a crucial role in decision tree algorithms by evaluating how effectively different attributes divide the dataset. Its main goal is to identify attributes that produce the most homogeneous subsets after splitting, so that the information gained is maximized. This recursive partitioning is repeated for each subset or subtree, gradually constructing the decision tree.

One notable aspect is that the construction of a decision tree classifier doesn't necessitate prior domain knowledge or specific parameter settings, making it valuable for exploratory knowledge discovery. Additionally, decision trees are adept at handling high-dimensional data.

Entropy is a metric that gauges the level of randomness or uncertainty within a dataset. For classification tasks, it is calculated from the distribution of class labels present in the dataset.

For a given subset of the original dataset with K classes at the ith node, entropy measures the level of disorder or uncertainty in that subset. This helps to evaluate the impurity of a node and influences the selection of attributes during the construction of the decision tree:

H(S) = − Σ p(k) log2 p(k), where the sum runs over the K classes k

In the above equation:

  • S is the dataset sample.
  • k is the particular class from K classes
  • p(k) is the proportion of data points in dataset sample S that belong to class k.
  • In the equation, p(k) must not be equal to zero; only classes that actually appear in S contribute to the sum.

There are some important points to remember when using entropy:

  1. If the dataset is fully homogeneous, the entropy is 0, meaning that every instance or data point in the dataset belongs to a single class. This is the lowest possible entropy and indicates that there is no uncertainty in the dataset sample.
  2. Entropy reaches its highest value when the dataset is divided equally among the classes: a uniform distribution of class labels indicates maximal uncertainty in the dataset sample (both cases are demonstrated in the sketch after this list).
  3. Entropy is also utilized to evaluate the effectiveness of a split. It aims to create more homogeneous subsets within the dataset concerning class labels, thereby minimizing the entropy of the resulting subsets.
  4. The decision tree algorithm selects the attribute with the highest information gain as the splitting criterion, and this process iterates to construct the decision tree further.
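
To make the entropy definition concrete, here is a small sketch (assuming numpy; the helper name entropy is ours, not a standard API) that computes the measure for a list of class labels; the two calls illustrate points 1 and 2 above.

```python
# Small sketch of the entropy calculation described above (numpy assumed).
import numpy as np

def entropy(labels):
    """Entropy of a label array: -sum(p(k) * log2(p(k))) over the classes k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # proportion p(k) of each class present in the sample
    return -np.sum(p * np.log2(p))   # p(k) is never zero here, since only observed classes are counted

print(entropy(["cat", "cat", "cat", "cat"]))  # 0.0 -> fully homogeneous sample
print(entropy(["cat", "dog", "cat", "dog"]))  # 1.0 -> evenly split, maximum uncertainty
```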

Gini Impurity or Index – The Gini index serves as a metric to assess the quality of a split among classified groups, providing scores between 0 and 1. A score of 0 indicates that all observations belong to a single class, while scores close to 1 signify that elements are spread randomly across many classes. Hence, a lower Gini index is preferable, indicating more homogeneous subsets after the split. This metric serves as an evaluation tool for decision tree models, allowing us to assess how effective the model's splits are.

The index is computed as Gini = 1 − Σ p_i², where p_i (p subscript i) is the proportion of elements in the set that belong to the ith category.
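
The same idea can be sketched in a few lines of Python (numpy assumed; the helper name gini_index is ours):

```python
# Small sketch of the Gini index calculation (numpy assumed).
import numpy as np

def gini_index(labels):
    """Gini impurity: 1 - sum(p_i**2) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index(["yes", "yes", "yes"]))       # 0.0 -> a pure node
print(gini_index(["yes", "no", "yes", "no"]))  # 0.5 -> maximally mixed for two classes
```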

Information Gain – Information gain measures the decrease in entropy or variance achieved by dividing a dataset according to a specific attribute. Within decision tree algorithms, it evaluates the importance of a feature by checking how uniform or homogeneous the resulting subsets are with respect to the class or target variable. A higher information gain suggests that the feature is more useful for predicting the target variable, highlighting its importance in improving the uniformity of subsets after splitting.

The information gain of an attribute A, with respect to a dataset S, is calculated as follows:

Gain(S, A) = H(S) − Σ_v ( |S_v| / |S| ) × H(S_v)

In the above equation:

  • A is the specific attribute and v ranges over its possible values
  • H(S) is the entropy of dataset sample S
  • |S_v| is the number of instances in the subset S_v that have the value v for attribute A, and H(S_v) is the entropy of that subset

When building a decision tree, the most informative attribute is the primary criterion for partitioning. Information gain indicates the amount of entropy or variance that is reduced when the dataset is divided on attribute A.

Information gain is central to the operation of both regression and classification decision trees. Classification trees use entropy (or the Gini index) to evaluate impurity, while regression trees use variance; apart from the impurity measure chosen, the information gain computation is the same for both types.
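
Putting the formula above into code, here is a brief sketch (numpy assumed; the helper names are ours) that computes the information gain of a categorical attribute over a toy sample:

```python
# Sketch of the information gain computation Gain(S, A) = H(S) - sum(|S_v|/|S| * H(S_v)).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Entropy of the whole sample minus the weighted entropy of the subsets S_v."""
    feature_values, labels = np.asarray(feature_values), np.asarray(labels)
    weighted = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - weighted

# Toy example: track size as the attribute, animal as the target.
sizes   = ["large", "large", "small", "small", "large"]
animals = ["bobcat", "bobcat", "rabbit", "rabbit", "deer"]
print(information_gain(sizes, animals))
```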

Decision tree algorithm and its working

To classify a record, the decision tree starts at the root node of the tree. The algorithm compares the record's attribute value with the root attribute and, based on this comparison, follows the corresponding branch, checking the attribute condition at every stage to decide which node to move to next.

This comparison is repeated at each subsequent node of the decision tree, and the cycle continues until a leaf node is reached. The following steps describe the full operation (a small traversal sketch follows after them):

First Step: The dataset is contained in the root node, S, at the very top of the tree.

Second Step: The algorithm's second phase involves determining which attribute in the dataset is the best one using the Attribute Selection Measure (ASM).

Third Step: The algorithm splits dataset S into subsets corresponding to the possible values of the best attribute.

Fourth Step: A decision tree node is created using the selected best attribute.

Fifth Step: The process recursively builds new subtrees using the dataset subsets created in Step 3. The recursion ends when no further splitting is possible, and the final node in the classification or regression tree is designated as a leaf node.
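
To connect these steps back to the prediction process described at the start of this section, here is a toy sketch of how a fitted tree is traversed; the nested-dictionary structure and the predict helper are illustrative assumptions, not a real library's API.

```python
# Toy sketch: traversing a hand-built decision tree at prediction time.
# Real libraries use their own internal structures; this is only for illustration.

tree = {
    "feature": "track_size_inches",
    "threshold": 6,
    "left":  {"leaf": "rabbit"},        # track_size_inches <= 6
    "right": {                          # track_size_inches > 6
        "feature": "toe_count",
        "threshold": 3,
        "left":  {"leaf": "deer"},      # toe_count <= 3
        "right": {"leaf": "bobcat"},    # toe_count > 3
    },
}

def predict(node, sample):
    """Walk from the root down to a leaf, comparing one feature per node."""
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict(tree, {"track_size_inches": 7, "toe_count": 4}))  # -> "bobcat"
```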

Let’s suppose a man wants to buy a new mobile phone and needs to decide which type of phone to choose. To solve this problem we can use a decision tree. It starts with the operating system the user wants; once the user selects an operating system, the tree moves to the next decision node, which concerns camera quality, and splits again into further decision nodes. After that choice, the tree splits on processor and brand, and then into further categories. This process continues until we reach a leaf node. Using this mechanism, we can easily decide which mobile phone the user wants.


The decision tree chart above shows the working of this decision tree.

Advantages of the Decision Tree
  • It is easy to understand because it follows the same process that humans follow when making a decision.
  • It is very useful for solving decision-related problems.
  • It helps us think through all the possible outcomes and alternative results of a problem.
  • It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

  • As the decision tree grows, it develops several layers, and when the data is very heterogeneous, the tree expands with additional layers, increasing complexity.
  • Another issue with decision trees is potential overfitting, which occurs when a model picks up noise in the training data instead of general patterns. One way to mitigate this problem is to use the Random Forest algorithm.
  • As the number of class labels increases, the computational complexity of the decision tree can also increase.

Appropriate problems for Decision tree learning

Decision tree learning is particularly well-suited for problems characterized by the following traits:

Instances represented by attribute-value pairs: Commonly, instances are described by attribute-value pairs, such as an attribute like temperature with corresponding values like hot, cold, or mild. Ideally, attributes possess a finite set of distinct values, simplifying the construction of decision trees. Advanced versions of decision tree algorithms can handle attributes with continuous numerical values, allowing the representation of variables like temperature on a numerical scale.

Discrete output values for the target function: Decision trees are commonly developed for Boolean targets with binary outcomes such as yes or no. While primarily used for such two-valued outcomes, decision tree methods can be extended to handle target functions with multiple discrete output values, although applications with numeric outputs are less frequent.

Need for disjunctive descriptions: Decision trees naturally accommodate disjunctive expressions, enabling effective representation of complex relationships within data.

Resilience towards errors in training data: Decision tree learning techniques demonstrate resilience towards errors present in training data, including inconsistencies in categorization or discrepancies in feature details characterizing cases.

Handling missing attribute values in training data: In some cases, training data might have missing or absent characteristics. Despite encountering unknown features in certain training samples, decision tree approaches can still be applied effectively. For example, when considering humidity levels throughout the day, this data might be available for only a specific subset of training samples.

Practical issues in learning decision trees include
  • Deciding how deeply to grow the decision tree
  • Handling continuous attributes
  • Choosing the most effective attribute selection measure
  • Handling training data with missing attribute values
  • Handling attributes with differing costs
  • Improving computational efficiency

Classification and Regression Tree Algorithm

The CART (Classification and Regression Trees) algorithm is used to create decision trees for both classification and regression purposes. It relies on measures such as Gini impurity or information gain, selecting the optimal split at each node according to the chosen metric. The basic procedure of the CART algorithm can be outlined as follows:
  1. Root Node Initialization: Begin with the root node of the tree, representing the complete training dataset.
  2. Evaluate Feature Impurity: Examine all features in the dataset to measure the impurity of the data. Classification tasks quantify impurity with metrics such as the Gini index and entropy, while regression tasks use metrics such as mean squared error, mean absolute error, Friedman's MSE, or half-Poisson deviance.
  3. Select the Most Informative Feature: Choose the feature whose split minimizes impurity or provides the largest information gain.
  4. Partition the Dataset: Divide the dataset into two subsets based on the selected attribute's values, for instance "yes" and "no". The goal is to create subsets that are as homogeneous as possible with respect to the target variable.
  5. Assess Subset Impurity: Evaluate the impurity of each resulting subset based on the target variable.
  6. Iterative Process: Repeat steps 2-5 for each subset until a termination condition is met. Stopping criteria include reaching a maximum tree depth, falling below a minimum number of samples required to split, or reaching a minimum impurity threshold.
  7. Assign Values to Terminal Nodes: For each terminal node, also known as a leaf node, assign the most frequent class of its samples for classification tasks or their average value for regression tasks. This ensures that the model can make predictions for new data instances using the patterns learned during training.

These steps collectively facilitate the creation of an effective decision tree through iterative evaluation and feature splitting, catering to both classification and regression tasks.
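
The split search in steps 2-4 can be sketched compactly in Python (numpy assumed; all helper names here are illustrative, and real implementations are far more optimized):

```python
# Compact sketch of the CART split search using Gini impurity (illustrative only).
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold, weighted impurity) of the best binary split."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # skip splits that leave one side empty
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if weighted < best[2]:
                best = (j, t, weighted)
    return best

# Toy data: two features, two classes; feature 0 separates the classes cleanly.
X = np.array([[2.0, 1.0], [3.0, 1.5], [6.5, 0.5], [7.0, 0.8]])
y = np.array(["A", "A", "B", "B"])
print(best_split(X, y))  # expect a split on feature 0 at threshold 3.0 with impurity 0.0
```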

Classification and Regression Tree algorithm for Classification

Let the data at node m be Q_m (Q subscript m) with n_m (n subscript m) samples, and let t_m (t subscript m) be a candidate threshold for node m. The candidate split divides Q_m into a left subset Q_m^left and a right subset Q_m^right, and the classification criterion of the classification and regression tree algorithm can be written as:

G(Q_m, t_m) = (n_m^left / n_m) × H(Q_m^left) + (n_m^right / n_m) × H(Q_m^right)

In the above equation:

  • H is the impurity measure for the left and right subsets at node m; its value can be computed using entropy or Gini impurity.
  • n_m^left and n_m^right are the numbers of instances in the left and right subsets at node m, and n_m (n subscript m) is their total.

For selecting the split parameter, we choose the threshold that minimizes this quantity:

t_m* = argmin over t_m of G(Q_m, t_m)

Classification and Regression Tree algorithm for Regression

For regression problems, let the data available at node m be Q_m with n_m samples and t_m as a candidate threshold. Then the classification and regression tree criterion for regression can be written as:

G(Q_m, t_m) = (n_m^left / n_m) × MSE(Q_m^left) + (n_m^right / n_m) × MSE(Q_m^right)

In the above equation, MSE is the mean squared error of the target values in a subset around their mean, and n_m^left and n_m^right are the numbers of instances in the left and right subsets at node m (n subscript m).

For selecting the split parameter, we again choose the threshold that minimizes this quantity:

t_m* = argmin over t_m of G(Q_m, t_m)
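
A short sketch of this regression criterion (numpy assumed; helper names are illustrative) shows why a threshold that separates the two value clusters scores better than one that mixes them:

```python
# Sketch of the regression split criterion: sample-weighted MSE of the two subsets.
import numpy as np

def mse(y):
    """Mean squared error of values around their own mean (0 for an empty subset)."""
    return np.mean((y - np.mean(y)) ** 2) if len(y) else 0.0

def split_quality(x, y, t):
    """Weighted MSE of the two subsets produced by splitting feature values x at threshold t."""
    left, right = y[x <= t], y[x > t]
    return (len(left) * mse(left) + len(right) * mse(right)) / len(y)

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(split_quality(x, y, t=3.0))  # low weighted MSE: the split separates the two value clusters
print(split_quality(x, y, t=8.0))  # higher: the left subset mixes both clusters
```
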
Strengths of the Decision Tree Approach
  • Decision trees offer interpretability due to their rule-based nature, allowing easy comprehension of generated rules.
  • They perform classification tasks with minimal computational demand.
  • Capable of handling both continuous and categorical variables, making them versatile.
  • They highlight which fields are most important for prediction or classification.
  • Simplicity in usage and implementation, requiring no specialized expertise, rendering them accessible to a broad user base.
  • Scalability: decision trees can be grown to handle large datasets.
  • The ability to flexibly handle missing data or values makes them well-suited for incomplete or missing data sets.
  • Ability to handle non-linear relationships between variables, enhancing suitability for complex datasets.
  • Adequacy in handling imbalanced datasets by adjusting node importance based on class distribution, ensuring functionality even with heavily skewed class representation.

Weaknesses of the Decision Tree Approach

  • When it comes to predicting continuous attribute values, decision trees tend to be less efficient for estimation tasks.
  • In classification scenarios with numerous classes and a limited training dataset, decision trees often encounter challenges, resulting in higher error rates.
  • The computational expense in training decision trees can be notable, particularly when growing trees to fit larger datasets. Sorting candidate splits and searching for optimal combinations of fields during tree construction can be resource-intensive. Similarly, pruning algorithms can be costly as they involve forming and comparing numerous sub-trees.
  • There's a high risk of overfitting, particularly with complex and deep trees, which hurts performance on new, unseen data.
  • Small variations in the training data might yield entirely different decision trees, complicating result comparison or reproduction.
  • Many decision tree models encounter difficulties when dealing with missing data, requiring strategies such as imputation or deletion of records containing missing values to address this issue.
  • Biased trees may result from improper initial data splitting, particularly in cases of unbalanced datasets or rare classes, which can affect the accuracy of the model.
  • The scaling of input features can significantly impact decision trees, particularly when employing distance-based metrics or comparison-intensive decision rules.
  • Decision trees have limitations in representing intricate relationships between variables, especially concerning nonlinear or interactive effects. This can lead to less accurate modeling of complex relationships within the dataset.

Summary

Among the most practical supervised learning models are decision trees for both regression and classification. They separate data based on features by using a tree structure, where nodes represent features, branches represent decisions, and leaf nodes represent outcomes or forecasts. Known for their interpretability and ease of visualization, decision trees can handle various types of data, including numerical and categorical data. However, they run the risk of overfitting without proper pruning and may not capture complex data relationships as effectively as alternative algorithms. Using techniques such as ensemble methods can improve their efficiency and robustness.

Python Code
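
The code block from the original post did not survive; in its place, here is a minimal, illustrative scikit-learn workflow (dataset choice, parameters, and variable names are our own assumptions) that trains, evaluates, and prints a decision tree classifier:

```python
# Minimal illustrative decision tree workflow with scikit-learn (not the original post's code).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Load a small, well-known dataset.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Train a CART-style classifier; max_depth limits growth to reduce overfitting.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on held-out data and print the learned rules as text.
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(export_text(clf, feature_names=list(iris.feature_names)))
```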







