Decision Trees
- Decision Trees
- Advantages of Decision Trees
- Disadvantages of Decision Trees
- Appropriate Problems for Decision Tree Learning
- Practical Issues in Learning Decision Trees
- Classification and Regression Tree Algorithm
Random forest is a well-known supervised learning technique that handles both classification and regression problems. It is based on ensemble learning: it combines the predictions of several classifiers to solve complex problems, so the overall system benefits from the strengths of many models.
Random forest improves accuracy and reduces overfitting by combining predictions from several decision trees. Individual decision trees have high variance: small changes in the training data can produce quite different trees. That variance is reduced when the trees are combined in a random forest, because each tree is trained on a different sample of the data and the final result depends on many trees rather than just one.
Increasing the number of trees in a random forest generally makes its predictions more stable and reduces the likelihood of overfitting, although the benefit levels off once the forest is large enough.
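The two ingredients described above, training each tree on a different sample and combining the trees' answers, can be sketched in a few lines. This is only an illustration: the helper names `bootstrap_sample` and `majority_vote` are made up for this example, and a real random forest also randomizes the features considered at each split.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so each tree sees different data."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine per-tree class predictions by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
rows = list(range(10))           # stand-in for ten training records
sample = bootstrap_sample(rows, rng)

# Three hypothetical trees disagree; the ensemble returns the majority label.
ensemble_prediction = majority_vote(["bobcat", "coyote", "bobcat"])
print(sample, ensemble_prediction)
```

Because the final answer is a vote, one tree that happened to fit noise in its own sample is outvoted by the others.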
The two main reasons to use decision trees are:
- Decision trees aim to replicate human decision-making processes, making them straightforward to comprehend.
- The logic is easy to follow because it is laid out as a tree structure.
Decision tree example
Suppose Noah, a park ranger with a passion for wildlife, is given a new challenge: identifying the growing number of mysterious animal tracks appearing on the trails. The methods he used previously are not enough, and Noah needs a smarter way to decipher these cryptic clues. Enter the decision tree, Noah's secret weapon for unraveling the tracks’ secrets.
Think of a decision tree as a series of questions, like a branching path in the forest. Each question focuses on a specific characteristic of the tracks, such as size, number of toes, claw marks, depth, or stride length. Noah fed these measurements and other related data into the decision tree, paired with the corresponding animal identified by experienced trackers.
The decision tree doesn’t just memorize facts; it learns by analyzing the data. It creates a series of yes-or-no questions based on the most relevant track features. For example, the first question might be: “Are the tracks larger than 6 inches?” Depending on the answer, the decision tree branches out, leading to further questions about the number of toes or claw marks.
This branching structure looks very much like a detective’s flowchart, narrowing down the possibilities with each answered question. The decision tree continues branching until it reaches a conclusion: the most likely animal that left the tracks.
The real test of the decision tree algorithm comes when Noah encounters a set of fresh tracks unlike any he has seen before. He carefully measures and observes them, then provides this information to the decision tree to predict the outcome, in this case the animal.
Noah started with the initial question: “Are the tracks larger than 6 inches?” Yes. The tree branched, leading to the next question: “Do the tracks have three toes?” No. This eliminated possibilities like deer or rabbits. The tree continued branching, asking about claw marks, stride length, and other details.
Finally, after a series of questions, the decision tree reached its conclusion: the tracks were most likely left by a bobcat.
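Noah's chain of questions can be written down as a small hand-built tree. The sketch below is illustrative only: the questions follow the story, but the branch outcomes other than "bobcat" and the field names (`length_in`, `toes`, `claw_marks`) are invented for the example, not learned from data.

```python
# Internal nodes are dicts holding a yes/no question; leaves are plain
# strings naming the predicted animal. All outcomes here are hypothetical.
TRACK_TREE = {
    "question": lambda t: t["length_in"] > 6,        # larger than 6 inches?
    "yes": {
        "question": lambda t: t["toes"] == 3,        # three toes?
        "yes": "large bird",
        "no": {
            "question": lambda t: t["claw_marks"],   # visible claw marks?
            "yes": "coyote",
            "no": "bobcat",
        },
    },
    "no": "rabbit",
}

def identify(tree, track):
    """Follow the answers from the root until a leaf (a string) is reached."""
    while isinstance(tree, dict):
        branch = "yes" if tree["question"](track) else "no"
        tree = tree[branch]
    return tree

fresh_track = {"length_in": 7, "toes": 4, "claw_marks": False}
print(identify(TRACK_TREE, fresh_track))
```

A learned decision tree has the same shape; the difference is that the algorithm chooses the questions and their order from the training data.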
Decision tree terminologies
Key Terminologies in Building a Decision Tree:
Root Node: Located at the top of the tree structure, the root node encompasses all the data and acts as the starting point for decision-making within the decision tree.
Internal (Decision) Node: These nodes represent decisions based on feature values. An internal node can branch into further internal nodes or into leaf nodes.
Leaf (Terminal) Node: A node that has no child nodes and holds a class label or a numeric value.
Splitting: The process of dividing a node into two or more child nodes according to a split criterion and a chosen attribute.
Branch/Sub-tree: A section of the decision tree that begins at an internal node and ends at the leaf nodes below it.
Parent Node: A node that is split to produce one or more child nodes.
Child Node: A node created when a parent node is divided through a splitting process.
Impurity: A measure of how mixed the target variable is within a subset of the data; it indicates the randomness or uncertainty of that subset. For classification tasks, decision trees typically measure impurity with metrics such as entropy and the Gini index.
Variance: In regression trees, variance describes how much the actual and predicted target values differ across the samples in a node. It is typically measured with criteria such as mean squared error, mean absolute error, Friedman's MSE, and half Poisson deviance.
Information Gain: A measure of the reduction in uncertainty achieved by partitioning the data on a given feature. At each node, the feature with the largest information gain is considered the most informative one to split on, because it produces the purest subsets.
Pruning: Removing branches from the decision tree that add no extra information and could cause overfitting.
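The structural terms above map naturally onto a small data structure. The following is a minimal sketch, assuming a binary tree with yes/no branches; the `Node` class, its field names, and the example labels are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    question: Optional[str] = None     # test applied at an internal node
    yes: Optional["Node"] = None       # child followed when the answer is yes
    no: Optional["Node"] = None        # child followed when the answer is no
    prediction: Optional[str] = None   # value stored at a leaf node

    def is_leaf(self) -> bool:
        return self.yes is None and self.no is None

def count_leaves(node: Node) -> int:
    if node.is_leaf():
        return 1
    return count_leaves(node.yes) + count_leaves(node.no)

def depth(node: Node) -> int:
    """Number of splits on the longest path from this node to a leaf."""
    if node.is_leaf():
        return 0
    return 1 + max(depth(node.yes), depth(node.no))

# The sub-tree is a parent node with two leaf children.
subtree = Node(question="claw marks?",
               yes=Node(prediction="coyote"),
               no=Node(prediction="bobcat"))
# The root node is the parent of the sub-tree and of one leaf.
root = Node(question="longer than 6 inches?",
            yes=subtree,
            no=Node(prediction="rabbit"))

# Pruning: replace the whole sub-tree with a single leaf.
pruned = Node(question=root.question,
              yes=Node(prediction="bobcat"),
              no=root.no)

print(count_leaves(root), depth(root), count_leaves(pruned), depth(pruned))
```

Which label a pruned sub-tree collapses to, and when pruning is worthwhile, depends on the pruning method; here the replacement label is simply chosen by hand.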
Attribute Selection Measures
Building a decision tree requires a learning phase in which the initial dataset is partitioned into subsets using Attribute Selection Measures (ASM). An ASM evaluates how effectively each attribute divides the dataset. Its goal is to identify the attribute that yields the most homogeneous subsets after splitting, so that the information gain is maximized. This recursive partitioning is repeated for each subset or subtree, gradually constructing the decision tree.
One notable aspect is
that the construction of a decision tree classifier doesn't necessitate prior
domain knowledge or specific parameter settings, making it valuable for
exploratory knowledge discovery. Additionally, decision trees are adept at
handling high-dimensional data.
Entropy – Entropy is a metric that gauges the level of randomness or uncertainty within a dataset. In classification tasks, it is calculated from the distribution of class labels present in the dataset.
For a subset S of the original dataset at the ith node, containing instances from K classes, entropy measures the level of disorder or uncertainty in that subset. It quantifies the impurity of the node and influences the selection of attributes during the construction of the decision tree.
$$H(S) = -\sum_{k=1}^{K} p(k)\,\log_2 p(k)$$
- S is the dataset sample.
- k is a particular class from the K classes.
- p(k) is the proportion of data points in dataset sample S that belong to class k.
- In the equation, p(k) must not equal zero; classes with no instances in S are left out of the sum.
There are some important points to remember when using entropy:
- If the dataset is fully homogeneous, meaning every instance belongs to a single class, the entropy is 0. This is the lowest possible entropy and indicates that there is no uncertainty in the dataset sample.
- Entropy reaches its highest value when the dataset is divided equally among the classes. A uniform distribution of class labels therefore indicates maximal uncertainty in the dataset sample.
- Entropy is also used to evaluate the effectiveness of a split. A good split creates subsets that are more homogeneous with respect to the class labels, thereby minimizing the entropy of the resulting subsets.
- The decision tree algorithm selects the attribute with the highest information gain as the splitting criterion, and this process repeats to construct the rest of the decision tree.
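The first two points can be checked numerically. The sketch below computes base-2 entropy from a list of class labels; the function name `entropy` and the labels are chosen for this example only.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of the class distribution in a list of labels."""
    n = len(labels)
    # Counter only holds classes that occur, so p is never zero here.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["bobcat"] * 4                            # one class only
balanced = ["bobcat", "bobcat", "deer", "deer"]  # two classes, evenly split
four_way = ["bobcat", "deer", "rabbit", "coyote"]

print(entropy(pure))      # 0.0, no uncertainty
print(entropy(balanced))  # 1.0, the maximum for two classes
print(entropy(four_way))  # 2.0, the maximum for four classes
```

The maximum grows with the number of classes (it is the base-2 logarithm of the class count), so entropy values are only comparable between nodes of the same problem.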
Gini Impurity or Index – The Gini index is a metric that assesses the purity of a split among the classified groups. A score of 0 indicates that all observations belong to a single class. The score grows as the classes become more evenly mixed, up to a maximum of 1 - 1/K for K equally represented classes (0.5 for two classes), so it always stays below 1. Hence, a lower Gini index is better, indicating more homogeneous subsets after the split. This metric serves as an evaluation tool for a decision tree model, allowing an assessment of how effective it is at creating informative splits.
$$Gini = 1 - \sum_{i=1}^{K} p_i^2$$
In the above equation, p_i (p subscript i) is the proportion of elements in the set that belong to the ith category, and K is the number of categories.
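As a quick check of the behaviour described for the Gini index, here is a minimal sketch that computes it from a list of labels. The function name `gini` and the labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["bobcat"] * 4))                            # 0.0, a pure set
print(gini(["bobcat", "bobcat", "deer", "deer"]))      # 0.5, two even classes
print(gini(["bobcat", "deer", "rabbit", "coyote"]))    # 0.75, four even classes
```

The three results follow the pattern 1 - 1/K for one, two, and four equally represented classes.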
Information Gain – Information gain measures the decrease in entropy or variance achieved by dividing a dataset according to a specific attribute. In decision tree algorithms, it evaluates the importance of a feature by how well it creates subsets that are more homogeneous with respect to the class or target variable. A higher information gain suggests that the feature is more useful for predicting the target variable.
The information gain of an attribute A with respect to a dataset S is calculated as follows:
$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
- A is the specific attribute being evaluated.
- H(S) is the entropy of dataset sample S.
- |S_v| is the number of instances in S that have the value v for attribute A, and |S| is the total number of instances in S.
- H(S_v) is the entropy of the subset of S whose instances have the value v for attribute A.
When building a decision tree, the attribute with the highest information gain is chosen as the primary criterion for partitioning. Information gain indicates how much entropy or variance is reduced when a dataset is divided based on attribute A.
Information gain is essential to both regression and classification decision trees. Regression trees use variance, while classification trees use entropy, as the impurity being reduced. The information gain computation has the same form in both cases, regardless of which impurity measure is used.
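A small worked example makes the calculation concrete. The sketch below uses entropy as the impurity measure on four hypothetical track records; the attribute names (`size`, `claws`) and the data are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the whole set minus the weighted entropy of its subsets."""
    base = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

rows = [
    {"size": "large", "claws": "yes", "animal": "bobcat"},
    {"size": "large", "claws": "no",  "animal": "bobcat"},
    {"size": "small", "claws": "yes", "animal": "rabbit"},
    {"size": "small", "claws": "no",  "animal": "rabbit"},
]

# Splitting on size yields two pure subsets, so all uncertainty is removed.
gain_size = information_gain(rows, "size", "animal")
# Splitting on claws leaves both subsets evenly mixed, so nothing is gained.
gain_claws = information_gain(rows, "claws", "animal")
print(gain_size, gain_claws)
```

With these records the tree would split on `size` first, since it has the larger gain.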
Decision tree algorithm and its working
To predict the class of a record, the decision tree starts at the root node. The algorithm compares the record's value for the root attribute with the condition stored at that node and, based on the comparison, follows the matching branch to the next node.
The same comparison is repeated at each subsequent node, and the cycle continues until a leaf node is reached. The steps below describe how the tree itself is built.
First Step: Begin the tree with the root node, S, which contains the complete dataset.
Second Step: Find the best attribute in the dataset using the Attribute Selection Measure (ASM).
Third Step: Split dataset S into subsets, one for each possible value of the best attribute.
Fourth Step: Create a decision tree node that tests the best attribute.
Fifth Step: Recursively build new subtrees from the dataset subsets created in Step 3. The recursion ends when no further splitting is possible, and the final node is designated as a leaf node.
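The five steps can be sketched as a short recursive function. This is a simplified, ID3-style illustration for categorical attributes, using information gain as the attribute selection measure; the function names, the dictionary layout of the tree, and the track data are assumptions made for the example rather than a reference implementation.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_by(rows, attribute):
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return groups

def information_gain(rows, attribute, target):
    base = entropy([r[target] for r in rows])
    parts = split_by(rows, attribute).values()
    return base - sum(len(p) / len(rows) * entropy([r[target] for r in p])
                      for p in parts)

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Step 5 stopping rule: a pure node, or no attributes left, becomes a leaf.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Step 2: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    # Step 4: create a node that tests the best attribute.
    node = {"attribute": best, "default": majority, "children": {}}
    remaining = [a for a in attributes if a != best]
    # Steps 3 and 5: split on each value and recurse on every subset.
    for value, subset in split_by(rows, best).items():
        node["children"][value] = build_tree(subset, remaining, target)
    return node

def predict(tree, row):
    # Walk down until a leaf (a plain label) is reached; unseen values
    # fall back to the majority label stored at the node.
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attribute"]], tree["default"])
    return tree

rows = [  # Step 1: the root starts with the complete dataset.
    {"size": "large", "claws": "yes", "animal": "coyote"},
    {"size": "large", "claws": "no",  "animal": "bobcat"},
    {"size": "small", "claws": "yes", "animal": "rabbit"},
    {"size": "small", "claws": "no",  "animal": "rabbit"},
]
tree = build_tree(rows, ["size", "claws"], "animal")
print(tree["attribute"], predict(tree, {"size": "large", "claws": "no"}))
```

On this data the root tests `size`; the "small" branch is already pure and becomes a leaf, while the "large" branch needs one more split on `claws`.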
Suppose a man wants to buy a new mobile phone and needs to decide which type to get. A decision tree can model this choice. The root node asks which operating system the user wants, and each answer leads to a further decision node. After the operating system is chosen, the next node asks about camera quality and again splits into several branches. Subsequent nodes ask about the processor and the brand, and the process continues until a leaf node is reached. By following this path of decisions, we can easily determine which mobile phone the user wants.
Advantages of the Decision Tree
- It is easy to understand because it follows the same process a human follows when making a decision.
- It is very useful for decision-related problems.
- It helps to think through all the possible outcomes of a problem.
- It requires less data cleaning than many other algorithms.
Disadvantages of the Decision Tree
- As a decision tree grows it develops many layers; when the data is very heterogeneous, the tree expands with additional layers, increasing complexity.
- Another issue with decision trees is potential overfitting, which occurs when a model picks up noise in the training data instead of general patterns. One way to address this problem is to use the Random Forest algorithm.
- As the number of class labels increases, the computational complexity of the decision tree can also increase.
Appropriate problems for Decision tree learning
Decision tree learning is
particularly well-suited for problems characterized by the following traits:
Instances represented by attribute-value pairs: Instances are described by a fixed set of attributes, such as temperature, and their values, such as hot, cold, or mild. Ideally, each attribute takes a small number of distinct values, which simplifies the construction of decision trees. Extended versions of decision tree algorithms can also handle attributes with continuous numerical values, allowing variables like temperature to be represented on a numerical scale.
Discrete output values for the target function: Decision trees are most commonly used for Boolean classification, with binary outcomes such as yes or no. The methods extend readily to functions with more than two possible output values; applications with numeric outputs are possible but less frequent.
Need for disjunctive
descriptions: Decision trees naturally accommodate disjunctive expressions,
enabling effective representation of complex relationships within data.
Resilience towards errors in training data: Decision tree learning techniques are robust to errors in the training data, including errors in the class labels of training examples and errors in the attribute values that describe them.
Handling missing attribute values in training data: Some training examples may have missing attribute values, and decision tree approaches can still be applied effectively in that case. For example, the humidity for the day might be known for only a subset of the training examples.
Practical Issues in Learning Decision Trees
- Determining how deeply to grow the decision tree
- Handling continuous attributes
- Choosing an appropriate attribute selection measure
- Handling training data with missing attribute values
- Handling attributes with differing costs
- Improving computational efficiency
Classification and Regression Tree Algorithm
1. Root Node Initialization: Begin with the root node of the tree, representing the complete training dataset.
2. Evaluate Feature Impurity: Examine all features in the dataset to measure the impurity of the data. Classification tasks quantify impurity with metrics such as the Gini index and entropy, while regression tasks use criteria such as mean squared error, mean absolute error, friedman_mse, or half Poisson deviance.
3. Select the Feature That Provides the Most Information: Choose the feature and split that minimize impurity or, equivalently, yield the highest information gain.
4. Partition the Dataset: Divide the dataset into two groups based on the selected feature, for instance "yes" and "no" for a given test on its value. The goal is to create subsets that are as homogeneous as possible with respect to the target variable.
5. Assess Subset Impurity: Evaluate the impurity of each resulting subset based on the target variable.
6. Iterative Process: Repeat steps 2-5 for each subset until a termination condition is met. Typical stopping criteria are reaching the maximum tree depth, falling below the minimum number of samples required to split a node, or dropping below a minimum impurity threshold.
7. Assign Values to Terminal Nodes: For each terminal (leaf) node, assign the most frequent class among its samples for classification tasks, or the average target value for regression tasks. This ensures that the model can make predictions for new data instances using the rules learned during training.
These steps collectively
facilitate the creation of an effective decision tree through iterative
evaluation and feature splitting, catering to both classification and
regression tasks.
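The procedure can be sketched compactly for classification with numeric features. The code below is a simplified illustration, not a full CART implementation: it uses Gini impurity, tries midpoints between sorted feature values as thresholds, and stops on `max_depth`, `min_samples_split`, or a pure node. All names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature_index, threshold, cost) with the lowest weighted Gini."""
    best, n = None, len(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2                      # midpoint threshold
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            cost = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or cost < best[2]:
                best = (j, t, cost)
    return best

def build_cart(X, y, depth=0, max_depth=3, min_samples_split=2):
    leaf = Counter(y).most_common(1)[0][0]         # most frequent class
    # Stopping criteria: depth limit, too few samples, or a pure node.
    if depth >= max_depth or len(y) < min_samples_split or gini(y) == 0.0:
        return leaf
    split = best_split(X, y)
    if split is None:                              # no usable threshold
        return leaf
    j, t, _ = split
    left = [(row, lab) for row, lab in zip(X, y) if row[j] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[j] > t]
    grow = lambda part: build_cart([r for r, _ in part], [l for _, l in part],
                                   depth + 1, max_depth, min_samples_split)
    return {"feature": j, "threshold": t, "left": grow(left), "right": grow(right)}

def predict(tree, row):
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree

# Hypothetical features: [track length in inches, claw marks (1 or 0)]
X = [[2.0, 1], [3.0, 0], [7.0, 1], [8.0, 0]]
y = ["rabbit", "rabbit", "bobcat", "bobcat"]
tree = build_cart(X, y)
print(tree, predict(tree, [6.5, 0]))
```

On this toy data a single split on track length at 5.0 separates the classes, so both children are leaves.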
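Classification and Regression Tree algorithm for Classification
Let Q_m be the data available at node m, containing n_m samples, and let t_m be a candidate threshold at that node. The cost of the split is the weighted impurity of the two resulting subsets:
$$G(Q_m, t_m) = \frac{n_m^{Left}}{n_m}\, H\big(Q_m^{Left}(t_m)\big) + \frac{n_m^{Right}}{n_m}\, H\big(Q_m^{Right}(t_m)\big)$$
- H is the impurity measure for the left and right subsets at node m. Its value can be computed with entropy or Gini impurity.
- n_m (n subscript m) is the total number of instances at node m, that is, the instances in the left and right subsets combined; n_m^{Left} and n_m^{Right} are the numbers of instances in each subset.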
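Classification and Regression Tree algorithm for Regression
For regression, the cost of splitting the data Q_m at node m with threshold t_m is the weighted mean squared error of the two resulting subsets:
$$G(Q_m, t_m) = \frac{n_m^{Left}}{n_m}\, MSE\big(Q_m^{Left}(t_m)\big) + \frac{n_m^{Right}}{n_m}\, MSE\big(Q_m^{Right}(t_m)\big)$$
- MSE is the mean squared error of the target values within a subset, measured around that subset's mean.
- n_m (n subscript m) is the total number of instances at node m; n_m^{Left} and n_m^{Right} are the numbers of instances in the left and right subsets.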
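For a regression tree, a candidate split can be scored by the weighted mean squared error of the two sides, and each leaf then predicts the mean of its target values. The sketch below illustrates this for a single numeric feature; the function names and the numbers are made up for the example.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_cost(xs, ys, threshold):
    """Weighted MSE of the targets on each side of the threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * mse(left) + len(right) / n * mse(right)

def best_threshold(xs, ys):
    """Try midpoints between consecutive distinct feature values."""
    values = sorted(set(xs))
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_cost(xs, ys, t))

xs = [1, 2, 3, 10, 11, 12]      # hypothetical feature values
ys = [5, 6, 7, 20, 21, 22]      # hypothetical numeric targets

t = best_threshold(xs, ys)
left_mean = sum(y for x, y in zip(xs, ys) if x <= t) / 3
right_mean = sum(y for x, y in zip(xs, ys) if x > t) / 3
print(t, split_cost(xs, ys, t), left_mean, right_mean)
```

Here the best threshold falls in the gap between the two clusters (6.5), and the two leaves predict 6.0 and 21.0 respectively.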
Strengths of the Decision Tree Approach
- Decision trees are interpretable: their rule-based nature makes the generated rules easy to understand.
- They perform classification tasks with minimal computational demand.
- They handle both continuous and categorical variables, making them versatile.
- They highlight which fields are most important for prediction or classification.
- They are simple to use and implement, requiring no specialized expertise, which makes them accessible to a broad user base.
- Scalability: a decision tree can be grown very large as the data requires.
- Some decision tree algorithms handle missing values flexibly, which makes them well-suited to incomplete datasets.
- They can model non-linear relationships between variables, enhancing their suitability for complex datasets.
- They can cope with imbalanced datasets by adjusting node importance based on class distribution, remaining functional even with heavily skewed class representation.
Weaknesses of the Decision Tree Approach
- Decision trees tend to be less suitable for estimation tasks, where the goal is to predict the value of a continuous attribute.
- In classification scenarios with numerous classes and a limited training dataset, decision trees often encounter challenges, resulting in higher error rates.
- The computational expense of training decision trees can be notable, particularly when growing trees to fit larger datasets. Sorting candidate splits and searching for optimal combinations of fields during tree construction can be resource-intensive. Similarly, pruning algorithms can be costly as they involve forming and comparing numerous sub-trees.
- There is a high risk of overfitting, particularly with complex and deep trees, which hurts performance on new, unseen data.
- Small variations in the training data can yield entirely different decision trees, complicating the comparison or reproduction of results.
- Many decision tree implementations cannot handle missing data directly, requiring strategies such as imputation or deletion of records containing missing values.
- Biased trees may result from improper initial data splitting, particularly in cases of unbalanced datasets or rare classes, which can affect the accuracy of the model.
- Feature scaling can affect tree variants that employ distance-based metrics or other scale-dependent decision rules, although standard threshold-based splits are unaffected by monotonic rescaling of a feature.
- Decision trees can have difficulty representing some relationships between variables compactly: smooth or linear trends and certain interaction effects must be approximated with many axis-aligned splits, which can lead to less accurate modeling of complex relationships within the dataset.
Summary
Decision trees are among the most practical supervised learning models for both regression and classification. They separate data based on features using a tree structure, where nodes represent features, branches represent decisions, and leaf nodes represent outcomes or predictions. Known for their interpretability and ease of visualization, decision trees handle various types of data, including numerical and categorical data. However, they run the risk of overfitting without proper pruning and may not capture complex data relationships as effectively as alternative algorithms. Techniques such as ensemble methods can improve their accuracy and robustness.