
Tuesday, February 20, 2024

ECLAT ALGORITHM IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

 ECLAT Algorithm

  • Introduction and Objective
  • Itemset Lattice in ECLAT algorithm
  • ECLAT Algorithm Working
  • Advantages of the ECLAT algorithm
  • Disadvantages of the ECLAT algorithm
ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) is a widely used association rule mining technique similar to the Apriori algorithm. However, it is an optimized and more scalable alternative, with several key differences:
  • ECLAT operates on a vertical data format, unlike Apriori and FP-growth, which work on horizontal transaction data.
  • It employs a depth-first search to traverse the itemset search space, whereas Apriori adopts a breadth-first strategy.

These adaptations make ECLAT more effective and faster, particularly on dense datasets with few unique items and a high transaction volume. However, it may not match the performance of the FP-growth algorithm on sparse datasets with a large number of distinct items.

What is the Itemset Lattice in the ECLAT Algorithm?

In the ECLAT algorithm, the construction of an itemset lattice mirrors the approach used in the Apriori algorithm. The lattice represents the search space of frequent itemsets and stores those itemsets along with their corresponding support counts.

Generating the itemset lattice is a recursive process that produces frequent itemsets of increasing size. At each level, candidate itemsets are formed by combining the frequent itemsets identified in the previous step; any candidate that meets the minimum support threshold earns a place in the lattice. The resulting itemset lattice is stored in memory, typically as a tree data structure.

Real-World Example for ECLAT

To understand the ECLAT algorithm better, let's look at a real-world example. Emily owns a grocery store and faces the challenge of optimizing her inventory and boosting sales. With a diverse array of products on her shelves, she needs a way to identify which items her customers frequently purchase together.

To understand her customers' buying behavior, she uses the ECLAT algorithm, a powerful tool for frequent itemset mining. Emily applies the algorithm to her transaction data; the frequent itemsets it returns represent combinations of products that are often bought together.

Emily discovered that customers who bought milk were likely to also buy bread and eggs, and that customers who bought pasta were likely to buy pasta sauce. Armed with these patterns, Emily rearranged her store layout, placing complementary items closer together to encourage additional purchases.

Emily also used the knowledge gained from the ECLAT algorithm to design targeted promotions, offering discounts on items frequently bought together, such as chips and soda or cookies and milk, to encourage customers to purchase more items during each visit.

ECLAT Algorithm Working

Before applying the ECLAT method, we first set the minimum support, confidence, and lift thresholds. Next, if the transactional dataset is not already in vertical format, we convert it. The algorithm then goes through phases comparable to those of the Apriori algorithm: candidate generation, pruning, database scanning, and rule generation.

Step 1: Converting Transaction Data into Vertical Format

Transactional datasets typically store data in a horizontal format, where each row contains a transaction ID along with the items in that transaction, as illustrated below:

Transaction ID | Items
T1 | l1, l3, l4
T2 | l2, l3, l5, l6
T3 | l1, l2, l3, l5
T4 | l2, l5
T5 | l1, l3, l5

In a vertical format representation, each row of the transaction data consists of an item and the transactions where that item appears. This format organizes the data vertically, listing items along with the transactions they belong to:

Items | Transaction IDs
l1 | T1, T3, T5
l2 | T2, T3, T4
l3 | T1, T2, T3, T5
l4 | T1
l5 | T2, T3, T4, T5
l6 | T2
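
As a quick illustration of this conversion, here is a minimal Python sketch (the transaction data above is hard-coded and the variable names are chosen just for this example) that turns the horizontal table into vertical TID-lists:

```python
from collections import defaultdict

# Horizontal format: transaction ID -> items (the first table above)
horizontal = {
    "T1": ["l1", "l3", "l4"],
    "T2": ["l2", "l3", "l5", "l6"],
    "T3": ["l1", "l2", "l3", "l5"],
    "T4": ["l2", "l5"],
    "T5": ["l1", "l3", "l5"],
}

# Vertical format: item -> set of transaction IDs that contain it
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

for item in sorted(vertical):
    print(item, sorted(vertical[item]))
# l1 ['T1', 'T3', 'T5'], l2 ['T2', 'T3', 'T4'], ... matching the second table
```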

Step 2: Candidate Generation from the Dataset

Once the dataset has been transformed into vertical format, the next stage is candidate generation, which forms the itemsets that may turn out to be frequent, i.e. combinations of items that may occur together often in transactions. The process begins with sets containing single items: for a dataset with N items, N candidate sets are created initially.

The candidate sets are assessed against the minimum support count to identify the frequent 1-itemsets. A progressive iteration then combines the frequent itemsets found so far to generate larger sets of 2, 3, 4, 5, or more items.

During candidate generation, frequent (k-1)-itemsets that share (k-2) items are merged to form candidate itemsets containing k items. This iterative process continues until no further candidate itemsets can be generated, signaling the completion of the procedure.
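
To make the merge rule concrete, here is a small sketch (itemsets are assumed to be stored as sorted tuples, and the helper name is made up for this example) of how candidate k-itemsets can be built from frequent (k-1)-itemsets:

```python
def generate_candidates(frequent_prev, k):
    """Merge frequent (k-1)-itemsets that share their first (k-2) items
    to form candidate k-itemsets. Itemsets are sorted tuples."""
    prev = sorted(frequent_prev)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:          # common prefix of length k-2
                candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates

# Frequent 1-itemsets -> candidate 2-itemsets
print(generate_candidates([("l1",), ("l2",), ("l3",)], k=2))
# {('l1', 'l2'), ('l1', 'l3'), ('l2', 'l3')}
```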

Step 3: Pruning the candidate itemsets

The ECLAT algorithm's pruning stage is founded on the Apriori principle: every subset of a frequent itemset must itself be frequent. Equivalently, an itemset cannot be frequent if it contains an infrequent subset.

Pruning speeds up the algorithm's execution because it removes candidate sets before support counts are computed. Whenever itemsets of k items are generated, a series of checks is used to reduce the candidate set.

For every candidate set containing k items, the algorithm examines each subset of k-1 items to determine whether it is a frequent itemset. If all of these subsets are frequent, the candidate is kept for further generation of frequent itemsets. Conversely, if any subset is infrequent, the entire candidate itemset is discarded, or pruned, from consideration.
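
A minimal sketch of this subset check, assuming candidates and previously found frequent itemsets are stored as sorted tuples (as in the candidate-generation sketch above):

```python
from itertools import combinations

def prune(candidates, frequent_prev):
    """Keep a candidate k-itemset only if every (k-1)-subset is frequent."""
    frequent_prev = set(frequent_prev)
    kept = []
    for cand in candidates:
        if all(sub in frequent_prev for sub in combinations(cand, len(cand) - 1)):
            kept.append(cand)
    return kept

# {l3, l4} is not frequent, so the candidate (l2, l3, l4) is pruned
print(prune([("l2", "l3", "l4")], [("l2", "l3"), ("l2", "l4")]))  # []
```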

Step 4: Frequent Itemset Generation

The next step is to find the support count of the candidate itemsets that remain after pruning. In the vertical format, the support count of a candidate is the number of transactions in which all of its items appear together, which can be obtained by intersecting the TID-lists of the individual items instead of rescanning the entire dataset.

Candidates that do not meet the minimum support criterion are removed from the list. The remaining itemsets, whose support counts meet or exceed the threshold, are recognized as frequent itemsets.

Once frequent itemsets containing k items are obtained, the process continues by creating candidate itemsets with k+1 items. This involves pruning, scanning the database, and generating frequent itemsets with k+1 items.

This sequence of generating candidate itemsets, pruning, database scanning, and identifying frequent itemsets persists iteratively until no further frequent itemsets can be generated.
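
In the vertical layout, this support count is simply the size of the intersection of the items' TID-sets. A minimal sketch, reusing the `vertical` dictionary built in the earlier conversion sketch (an assumption of this example):

```python
def support_count(itemset, vertical):
    """Support of an itemset = size of the intersection of its TID-sets."""
    tid_sets = [vertical[item] for item in itemset]
    return len(set.intersection(*tid_sets))

# From the vertical table above: l1 and l3 appear together in T1, T3 and T5
print(support_count(("l1", "l3"), vertical))  # 3
```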

Step 5: Association rule generation

Once generated, frequent itemsets serve as the foundation for association rules. These rules are expressed in the form {S} → {I-S}, where {I} is a frequent itemset and {S} is a nonempty proper subset of the items in {I}.
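
As a sketch of this step (reusing the `support_count` function and `vertical` dictionary assumed above), rules of the form {S} → {I-S} can be enumerated from a single frequent itemset and filtered by a confidence threshold:

```python
from itertools import combinations

def rules_from_itemset(itemset, vertical, min_confidence=0.6):
    """Generate rules {S} -> {I - S} from a frequent itemset I."""
    itemset = tuple(itemset)
    support_i = support_count(itemset, vertical)
    rules = []
    for r in range(1, len(itemset)):                 # nonempty proper subsets
        for antecedent in combinations(itemset, r):
            consequent = tuple(x for x in itemset if x not in antecedent)
            confidence = support_i / support_count(antecedent, vertical)
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules

# Rules from the frequent itemset {l1, l3} in the toy data above
print(rules_from_itemset(("l1", "l3"), vertical))
# [(('l1',), ('l3',), 1.0), (('l3',), ('l1',), 0.75)]
```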

Implementing the ECLAT algorithm in Python is one of the easiest ways to understand it; a complete sketch is given in the Python code section below.

Advantages of the ECLAT algorithm

  • Vertical Data Structure: ECLAT uses a vertical data layout (each item is stored with the list of transaction IDs in which it appears) rather than a horizontal one (each transaction stored as a row of items), making support counting memory-efficient and suitable for large datasets.
  • Efficiency in Memory Usage: ECLAT optimizes memory usage by representing transaction data concisely, making it more efficient than other algorithms, particularly when handling datasets with sparse structures.
  • Fast Algorithm: It is generally faster than Apriori, particularly on datasets with high dimensionality or when searching for high-support itemsets, thanks to its depth-first search strategy.
  • Scalability: ECLAT scales well to large datasets because it does not require multiple scans of the database, making it suitable for mining frequent itemsets in big data scenarios.
  • Prefix-based Intersection: The algorithm leverages prefix-based intersection strategies to efficiently generate frequent itemsets, minimizing the number of candidate itemsets generated during the search.
  • Ease of Implementation: ECLAT's straightforward design makes it relatively easy to implement and understand, aiding adoption and adaptation across different use cases.
  • Mining Diverse Itemsets: It is effective at mining diverse itemsets, efficiently discovering frequent itemsets with varying lengths and support thresholds.

Disadvantages of the ECLAT algorithm

  • Memory Requirements: Despite being more memory-efficient than some algorithms, ECLAT can still demand significant memory, especially for datasets containing numerous transactions and items.
  • Limited Handling of Large Datasets: Although it is more memory-efficient than Apriori, ECLAT may still face challenges with extremely large or dense datasets due to memory constraints.
  • Need for Transaction Identifiers: ECLAT requires transaction identifiers or bit vectors to represent transaction sets, which can add overhead and complexity to the data representation, particularly for high-dimensional or sparse datasets.
  • Lack of Pruning Techniques: Unlike some other algorithms, ECLAT lacks certain pruning strategies for efficiently reducing the search space, potentially leading to increased computational overhead.
  • Sensitivity to the Support Threshold: ECLAT may not perform optimally with lower support thresholds, as they result in a more extensive search space and increased computational requirements.
  • Complexity in Parallelization: Parallelizing ECLAT can be more challenging because of the vertical data structure and the many intersections required during frequent itemset mining.
  • Inefficiency with Low-Support Itemsets: When the dataset contains numerous low-support itemsets, ECLAT may generate a large number of infrequent itemsets, impacting performance.
  • Dependency on Vertical Format: Although the vertical format helps in certain scenarios, transforming data into this format can be a preprocessing challenge, especially when the data is initially presented in a horizontal format.

Summary

ECLAT is a frequent itemset mining algorithm known for its efficient vertical data layout, which organizes transactions so that sets of items frequently occurring together can be identified efficiently. Using a depth-first search strategy and prefix-based intersection techniques, ECLAT generates frequent itemsets quickly, making it faster than Apriori in certain scenarios, particularly on high-dimensional datasets. Its memory-efficient approach and scalability to large datasets make it suitable for mining diverse itemsets, but it can face memory constraints and lacks some pruning strategies, which affects performance for certain dataset characteristics and support thresholds. Despite this, its simplicity and effectiveness in discovering frequent itemsets with varying lengths and support thresholds make it a valuable tool in association rule mining. Below is a Python description of the ECLAT algorithm.

Python code

Let's look at the ECLAT algorithm in Python:
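
The code from the original post is not reproduced here, so the following is a minimal, self-contained sketch of the approach described above: build vertical TID-sets, then extend prefixes depth-first by intersecting TID-sets. The function and variable names are chosen for this example, and the toy transactions reuse the tables shown earlier.

```python
from collections import defaultdict

def eclat(transactions, min_support=2):
    """Return all frequent itemsets with their support counts using a
    depth-first search over vertical TID-sets (a minimal ECLAT sketch)."""
    # Step 1: build the vertical format (item -> set of transaction indices)
    vertical = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            vertical[item].add(tid)

    frequent = {}

    def extend(prefix, items_with_tids):
        # items_with_tids: (item, tidset) pairs that can extend the prefix
        for i, (item, tids) in enumerate(items_with_tids):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Intersect TID-sets with the remaining items (depth-first)
            suffix = []
            for other_item, other_tids in items_with_tids[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    suffix.append((other_item, common))
            if suffix:
                extend(itemset, suffix)

    start = [(item, tids) for item, tids in sorted(vertical.items())
             if len(tids) >= min_support]
    extend((), start)
    return frequent

# Toy transactions taken from the tables above (T1 ... T5)
transactions = [
    ["l1", "l3", "l4"],
    ["l2", "l3", "l5", "l6"],
    ["l1", "l2", "l3", "l5"],
    ["l2", "l5"],
    ["l1", "l3", "l5"],
]
for itemset, count in sorted(eclat(transactions).items()):
    print(itemset, count)
# e.g. ('l1', 'l3') 3, ('l1', 'l3', 'l5') 2, ('l3', 'l5') 3 ...
```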



APRIORI ALGORITHM IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

 Apriori Algorithm

  • Introduction and Objective
  • Components of Apriori Algorithm
  • Steps of the Apriori Algorithm
  • Advantages of the Apriori Algorithm
  • Efficient for Large Datasets
  • Identifies Frequent itemset
  • Simple and Understandable
  • Disadvantages of the Apriori Algorithm
  • Computational Complexity
  • Generation of Numerous Candidate Itemsets
  • Apriori Property Dependency
  • Sensitive to Support Threshold 
  • Inefficient Handling of Large Dataset.

The Apriori algorithm stands out in association rule mining by identifying relationships between objects. Introduced by R. Agrawal and R. Srikant in 1994, it aims to discover frequent itemsets within datasets. Named "Apriori" because it uses prior knowledge of frequent items, the algorithm employs an iterative, level-wise approach that finds (k+1)-itemsets from frequent k-itemsets.

Its primary goal is to establish associations among different objects. Often referred to as frequent pattern mining, an illustrative example occurs in supermarkets where common items like butter, bread, and milk are placed together. This strategic arrangement is based on the likelihood that if a customer buys bread, they might also purchase milk and butter, enhancing both convenience for customers and the supermarket's sales performance.

Operating on large transactional databases such as supermarket purchase records, the Apriori algorithm efficiently generates frequent itemsets by exploiting the Apriori property, which reduces the search space. It aids customers in their purchases and enhances sales performance wherever it is applied.

By the Apriori property, every nonempty subset of a frequent itemset must also be frequent. Crucial to this strategy is the anti-monotonicity of the support measure, which asserts that all supersets of an infrequent itemset must likewise be infrequent. The Apriori algorithm is widely used in machine learning and data science.

Real-World Example for Apriori Algorithm

Let's look at an Apriori algorithm example. Emily, a grocery store owner, faced the challenge of optimizing her inventory and boosting sales. She wanted to know which of the many items on her shelves her customers most frequently purchased together.

To overcome this challenge, she used the Apriori algorithm, a powerful tool for association rule mining. Emily applied the algorithm to her transaction data, uncovering patterns of co-occurrence among different products.

To her delight, Emily discovered that customers who bought milk were also likely to buy bread and eggs, and that those who bought pasta often bought pasta sauce too. Acting on these insights, Emily rearranged her store layout, placing complementary items closer together to encourage additional purchases.

Emily also used the knowledge gained from the Apriori association rules to design targeted promotions, offering discounts on items frequently bought together, such as chips and soda or cookies and milk, to entice customers to purchase more items during their visit.

Components of the Apriori algorithm

There are three components in the Apriori algorithm:
  1. Support
  2. Confidence
  3. Lift

Support - Support measures how frequently a specific group of items appears in the dataset; it reflects the overall popularity of an itemset across all transactions.

Mathematically, the support of an itemset A is calculated as:
Support(A) = (Number of transactions containing A) / (Total number of transactions)

For example, to find the support of the itemset {milk, bread} in purchase data, we divide the number of transactions in which both milk and bread were purchased by the total number of transactions in the database.

An essential statistic in the Apriori method, support aids in the identification of frequent item groups. Apriori concentrates on itemsets meeting a certain minimal support criterion to reduce the search space. The number of combinations to be investigated is reduced and the computational efficiency in identifying frequent itemsets and association rules is improved when itemsets that do not match this threshold are deemed infrequent and are removed from further investigation.
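
As a small illustration (the transactions below are hypothetical, not drawn from any real dataset), support can be computed directly from a list of transactions:

```python
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

itemset = {"milk", "bread"}
# Fraction of transactions that contain every item in the itemset
support = sum(itemset <= t for t in transactions) / len(transactions)
print(support)  # 2 of 4 transactions contain both milk and bread -> 0.5
```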

Confidence - Confidence measures how likely items are to be bought together with other items; more formally, it evaluates the strength of an association rule as the conditional probability that the rule's consequent appears in a transaction given that its antecedent (the items on the left-hand side) is present.

Mathematically, the confidence of a rule {A} → {B} is calculated as:
Confidence(A → B) = Support(A ∪ B) / Support(A)

In the above equation Support(A∪B) represents the frequency of occurrence of both items A and B together in the dataset, on the other hand, Support(A) shows the frequency of item A appearing alone.
High confidence shows a strong relationship between items A and B, it means that if someone buys item A then, there’s a high likelihood that they also buy item B. However, it’s necessary to set a threshold for confidence, as high confidence doesn’t necessarily guarantee the usefulness or significance of a rule. Adjusting the confidence threshold helps in filtering out less meaningful rules and focusing on those with stronger associations.
Lift - Lift is a measure used to assess the strength of association between two items in a rule beyond what would be expected from the individual item frequencies alone.
The lift of a rule {A} → {B} is calculated as:

Lift(A → B) = Support(A ∪ B) / (Support(A) × Support(B))

Lift measures how much more often items A and B occur together in transactions compared to what would be expected if their occurrences were statistically independent.

Interpretation of lift values:

Lift = 1: indicates independence. It means that the occurrence of A does not affect the occurrence of B, and vice versa.

Lift > 1: indicates a positive correlation between items A and B; if someone buys item A, there is a higher-than-expected chance that they also buy item B, and vice versa.

Lift < 1: suggests an avoidance or negative connection between A and B by showing that the presence of A adversely affects the presence of B (and vice versa).

High lift values generally suggest a strong association between the items in the rule. In the context of association rule mining, lift is used alongside support and confidence to evaluate and select meaningful rules, for analysis and decision-making.
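
Building on the support calculation above, here is a minimal sketch (again with hypothetical transactions) that computes confidence and lift for the rule {milk} → {bread}:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

a, b = {"milk"}, {"bread"}
confidence = support(a | b, transactions) / support(a, transactions)
lift = confidence / support(b, transactions)  # = Support(A∪B) / (Support(A) × Support(B))
print(round(confidence, 2))  # 0.67: two of the three milk transactions also contain bread
print(round(lift, 2))        # 0.89: slightly below 1, so milk and bread are nearly independent here
```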

Steps of the Apriori Algorithm
    1. Initialization
      1. Identify all unique items present in the dataset.
      2. Determine the minimal level of support (frequency) required for an item to be considered frequent.
    2. Generating candidate itemsets
      1. Create candidate itemsets of length 1 (single items) containing all unique items from the dataset.
      2. Filter these candidates based on the minimum support threshold to obtain frequent items of length 1.
    3. Generating longer itemsets
      1. Using the frequent itemsets of length 1, create candidate itemsets of length 2 (item pairs).
      2. Prune item pairs that contain subsets that are not frequent.
      3. Again, filter these candidates based on the minimum support threshold to obtain frequent itemsets of length 2.
    4. Iterative process for longer itemsets
      1. Repeat the process of generating candidate itemsets and filtering them to obtain frequent itemsets of higher lengths (length>2).
      2. Each iteration involves generating candidate itemsets, pruning based on the Apriori property (eliminating subsets that are not frequent), and filtering based on the support threshold.
    5. Association rule generation
      1. From the frequent itemsets obtained, generate association rules.
      2. Create rules that have both high confidence and support, reflecting strong associations between items.
      3. Association rules are in the form of “if X then Y”, where X and Y are itemsets.
    6. Evaluation and Pruning
      1. Evaluate generated rules based on predefined metrics like support, confidence, and lift.
      2. Prune rules that do not meet the minimum threshold criteria to derive meaningful and actionable rules.
    7. Repeat or Terminate
      1. The algorithm terminates when no new frequent itemsets can be generated or when the desired itemset length is reached.
      2. Otherwise, iterate through the process until no more frequent itemsets can be found.

Let’s look at these steps with an example and in detail.

Consider the dataset shown in the image below; we need to find the frequent itemsets and the association rules generated from them.


From the above image, we can see that the minimum support count is 2 and the minimum confidence is 60%.

Step 1 (K = 1): (I) In this step, we create a table containing the support count of each item in the dataset above. This table is called C1 (the candidate set).

Itemset | Support count
l1 | 6
l2 | 7
l3 | 6
l4 | 2
l5 | 2

(II) Next, we compare the support count of each candidate itemset with the minimum support count (here, the minimum support is 2). Any candidate whose support count is less than the minimum support is removed. The itemsets that remain form the frequent 1-itemset table L1.

Itemset | Support count
l1 | 6
l2 | 7
l3 | 6
l4 | 2
l5 | 2


Step 2 (K = 2):
  • (I) At this point we generate the candidate set C2 from L1; this is called the join step. The condition for joining L_(k-1) with L_(k-1) is that the two itemsets have (k-2) elements in common.
  • Next, we check whether every subset of each candidate itemset is frequent; if any subset is not frequent, that candidate is removed.
  • Then we find the support count of the remaining candidate itemsets by scanning the dataset:

Itemset | Support count
l1, l2 | 4
l1, l3 | 4
l1, l4 | 1
l1, l5 | 2
l2, l3 | 4
l2, l4 | 2
l2, l5 | 2
l3, l4 | 0
l3, l5 | 1
l4, l5 | 0


(II) We now compare the support count of each candidate in C2 with min_support and delete the candidates whose support count is less than min_support. This procedure yields the frequent 2-itemset table L2.

Itemset | Support count
l1, l2 | 4
l1, l3 | 4
l1, l5 | 2
l2, l3 | 4
l2, l4 | 2
l2, l5 | 2


Step 3:
  • (I) We generate the candidate set C3 from L2 (the join step). The condition for joining L_(k-1) with L_(k-1) is that the itemsets have (k-2) elements in common, so here the first element of the two 2-itemsets must match.
  • The itemsets created by merging L2 are {l1, l2, l3}, {l1, l2, l5}, {l1, l3, l5}, {l2, l3, l4}, {l2, l4, l5} and {l2, l3, l5}.
  • Now we check whether all subsets of each candidate are frequent and remove the candidates for which this is not the case. Using the tables above, the subsets of {l1, l2, l3} are {l1, l2}, {l1, l3} and {l2, l3}, all of which are frequent, so it is kept; {l2, l3, l4} contains the subset {l3, l4}, which is not frequent, so it is removed. Checking the remaining candidates in the same way leaves only {l1, l2, l3} and {l1, l2, l5}.
  • Finally, we find the support count of these remaining candidates by scanning the dataset:
Itemset | Support count
l1, l2, l3 | 2
l1, l2, l5 | 2

(II) Here again min_support = 2, so we compare each candidate's support count in C3 against it and remove any candidate whose support count is less than min_support. The itemsets that remain form the frequent 3-itemset table L3.

Itemset | Support count
l1, l2, l3 | 2
l1, l2, l5 | 2

Step 4:
  • The next step is to generate the candidate set C4 from L3. Joining L_(k-1) with L_(k-1) (here k = 4) requires the itemsets to have (k-2) elements in common, so the first two items of the 3-itemsets must match.
  • Merging L3 produces the single candidate {l1, l2, l3, l5}. We now check whether every subset of this candidate is frequent. Its subset {l1, l3, l5} is not frequent, so the candidate is pruned and C4 is empty.
  • Since no further frequent itemsets can be generated, the algorithm stops here.
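
To finish the worked example, association rules can now be generated from the frequent itemsets. The sketch below uses the support counts from the tables above and the 60% minimum confidence stated earlier, and enumerates the rules for the frequent itemset {l1, l2, l5}:

```python
from itertools import combinations

# Support counts taken from the C1, L2 and L3 tables above
support = {
    ("l1",): 6, ("l2",): 7, ("l5",): 2,
    ("l1", "l2"): 4, ("l1", "l5"): 2, ("l2", "l5"): 2,
    ("l1", "l2", "l5"): 2,
}

itemset = ("l1", "l2", "l5")
min_confidence = 0.6
for r in range(1, len(itemset)):
    for antecedent in combinations(itemset, r):
        consequent = tuple(x for x in itemset if x not in antecedent)
        confidence = support[itemset] / support[antecedent]
        verdict = "accepted" if confidence >= min_confidence else "rejected"
        print(f"{antecedent} -> {consequent}: {confidence:.0%} ({verdict})")

# {l1, l5} -> {l2}, {l2, l5} -> {l1} and {l5} -> {l1, l2} reach 100% confidence
# and are accepted; the remaining rules fall below the 60% threshold.
```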

Advantages of Apriori Algorithm

  • Efficient for large datasets: It efficiently handles large datasets by reducing the search space for frequent itemsets using candidate generation and pruning techniques.
  • Identifies frequent itemsets: It accurately identifies frequent itemsets based on user-defined support thresholds, helping uncover commonly occurring item combinations in the data.
  • Simple and understandable: The algorithm's concept and implementation are relatively straightforward, making it easy to understand and apply.
  • Scalable and parallelizable: Apriori can be parallelized, and various optimizations can be applied, making it scalable to distributed computing environments, which is useful for big data scenarios.
  • Basis for rule generation: It serves as a foundation for generating association rules that reveal interesting and actionable patterns in the data.
  • Flexible thresholds: Users can set different support thresholds to discover itemsets of varying frequencies, allowing flexibility in exploring different patterns within the dataset.
  • Applications in various fields: The algorithm finds real-world applications in market basket analysis, recommendation systems, bioinformatics, big data settings, and more, providing an overview of the relationships between products or events.
  • Association rule generation: It generates association rules that express correlations between items, assisting in decision-making, targeted marketing, and business strategy formulation.

Disadvantages of Apriori Algorithm

  • Computational Complexity: For datasets with a large number of transactions or items, the apriori algorithm can become computationally expensive due to its need to generate and test a vast number of candidate itemsets.
  • Storage and Memory Requirements: Maintaining a large number of candidate itemsets and support counts can require significant memory storage, particularly for datasets with numerous unique items.
  • Generation of Numerous Candidate Itemsets: The algorithm may generate a considerable number of candidate itemsets, leading to a large search space and increased processing time, especially when minimum support thresholds are low.
  • Apriori Property Dependency: the Apriori algorithm relies on the Apriori property (i.e., if an item is infrequent, its supersets will also be infrequent), resulting in multiple scans of the dataset to generate candidate items, which might be inefficient.
  • Sensitive to Support Threshold: Performance may vary significantly based on the chosen minimum support threshold. Setting a lower threshold might lead to the generation of numerous frequent item sets, while a higher threshold might result in missing some relevant patterns.
  • Inefficient Handling of Large Dataset: For datasets with a high number of transactions but low-density itemsets, the algorithm might not perform optimally due to the large number of potentially infrequent itemsets.
  • Static Threshold Settings: the need for predefined support thresholds requires prior knowledge or trial-and-error to set appropriate values, which can impact the quality and relevance of discovered itemsets.
  • Lack of Handling Continuous Variables: Apriori primarily works with categorical or binary data and may require pre-processing for continuous variables.

Summary

The Apriori algorithm in data mining is a classic association rule mining method used to find common itemsets in datasets. It iteratively identifies frequent itemsets, eliminates candidates that fall short of the minimum support requirement, and repeats the process until no new frequent itemsets remain. It effectively discovers relationships between objects or events, producing meaningful association rules. Despite the challenges of scaling to large datasets and its reliance on predefined support thresholds, Apriori remains a core technique for discovering interesting patterns and relationships in data.

Python Code
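
The code from the original post is not reproduced here; the following is a minimal, self-contained Python sketch of the level-wise procedure described above. The transaction list is hypothetical, chosen so that its support counts match the C1-L3 tables in the worked example.

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Return frequent itemsets (sorted tuples) with their support counts."""
    transactions = [set(t) for t in transactions]

    def count(candidates):
        # Scan the database and keep candidates that meet min_support
        counts = {}
        for cand in candidates:
            s = sum(set(cand) <= t for t in transactions)
            if s >= min_support:
                counts[cand] = s
        return counts

    # Level 1: frequent single items
    items = sorted({item for t in transactions for item in t})
    frequent_k = count([(i,) for i in items])
    all_frequent = dict(frequent_k)

    k = 2
    while frequent_k:
        prev = sorted(frequent_k)
        # Join step: merge (k-1)-itemsets sharing their first (k-2) items
        candidates = {tuple(sorted(set(a) | set(b)))
                      for i, a in enumerate(prev) for b in prev[i + 1:]
                      if a[:k - 2] == b[:k - 2]}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = [c for c in candidates
                      if all(sub in frequent_k for sub in combinations(c, k - 1))]
        frequent_k = count(candidates)
        all_frequent.update(frequent_k)
        k += 1
    return all_frequent

# Hypothetical transactions consistent with the support counts above
transactions = [
    ["l1", "l2", "l5"], ["l2", "l4"], ["l2", "l3"],
    ["l1", "l2", "l4"], ["l1", "l3"], ["l2", "l3"],
    ["l1", "l3"], ["l1", "l2", "l3", "l5"], ["l1", "l2", "l3"],
]
for itemset, support in sorted(apriori(transactions).items()):
    print(itemset, support)
# e.g. ('l1', 'l2', 'l3') 2 and ('l1', 'l2', 'l5') 2 appear among the results
```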


