Frequent Item Set Mining with Python

Recently, I’ve been quite interested in learning about Big Data. Hearing this buzzword everywhere from hackathons to tech talks intrigued me so much that I decided to enroll in a course on the topic this semester. So what exactly does “Big Data” mean? It refers to analyzing large amounts of data in order to extract important information that can be put to good use.

One concept under the Big Data umbrella is data mining. Basically, it means examining large data sets to discover patterns.

For example, suppose you own a retail store and have data showing what each customer purchased over the last year or so. With this data, you would like to figure out which items are frequently purchased together, so that you can rearrange the items in your store based on customer purchase behaviour and maximize profits.

Frequent Item Set Mining

The example above introduces the idea of Frequent Item Set Mining. Let’s call the collection of items a customer purchased a ‘basket’ filled with the ‘items’ they’ve bought. Frequent item set mining refers to finding sets of items that appear together frequently across baskets.

Let’s assume the worst-case scenario for your business. Your country has been hit with a recession. It has been a very bad year for business and only 10 customers have purchased from your store all year. In this scenario, finding the frequent item sets can be done by hand. In the real world, however, you’re dealing with data sets so large that doing it by hand is completely impractical, so we need an algorithm to do it for us.
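At that scale you really could count everything by brute force. Here’s a minimal sketch of what that looks like, with a handful of made-up baskets and an invented threshold of 4, counting every pair directly:

from itertools import combinations
from collections import Counter

# ten made-up baskets, one per customer
baskets = [
  ['milk', 'bread', 'diapers'], ['milk', 'bread'], ['bread', 'beer'],
  ['milk', 'diapers', 'beer'], ['milk', 'bread', 'diapers', 'beer'],
  ['bread', 'diapers'], ['milk', 'diapers'], ['milk', 'bread', 'diapers'],
  ['beer'], ['milk', 'bread'],
]

# count every pair of items that appears together in a basket
pair_counts = Counter()
for basket in baskets:
  pair_counts.update(combinations(sorted(basket), 2))

# keep the pairs that show up in at least 4 baskets
print([pair for pair, c in pair_counts.items() if c >= 4])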

A-Priori Algorithm

Our goal is to find sets of items that are purchased together frequently. The threshold is the minimum number of times a set must appear in the data set for us to consider it frequent. For example, if our threshold is 10, then we only consider sets of items that appear 10 or more times in our data set. A naive approach is to form every possible combination of items and scan through the data set, incrementing a count each time a combination is found in a basket. The problem is that the number of possible item sets gets very large: a data set containing N distinct items can generate 2^N − 1 possible item sets. Stores selling 10,000 or more items are common, and enumerating all of those sets would take far too long to compute.
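To get a feel for how quickly that blows up, here is a quick illustration (the catalogue sizes are arbitrary; the numbers are just 2^N − 1 computed directly):

for n in (10, 20, 30):
  print(f"{n} items -> {2**n - 1:,} possible item sets")

# with 10,000 items the count doesn't even fit on a screen
print(f"10,000 items -> a number with {len(str(2**10_000))} digits")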

The A-Priori Algorithm takes a data set and a threshold as input. We want to find all pairs, triplets, and larger sets of items purchased together that meet the minimum threshold. The Apriori Principle states that if an item set is frequent, then all of its subsets are frequent; for example, if {milk, diapers} is frequent, then {milk} and {diapers} are frequent. Equivalently, if an item set is infrequent, none of its supersets can be frequent, which is what lets us prune candidates. First, the algorithm builds the candidate item sets of size one: every unique item in the data set together with a count of how many baskets it appears in. Sets that do not meet the threshold are discarded, and the remaining items are combined to form candidate pairs. The data set is scanned again, and pairs that do not meet the threshold are discarded. This is repeated with larger and larger sets until no new frequent item sets are found.
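That pruning step is the whole trick. As a small illustration (the items, counts and threshold below are made up), candidate pairs are only ever built from items that were frequent on their own:

from itertools import combinations

item_counts = {'milk': 6, 'diapers': 5, 'bread': 4, 'caviar': 1}
threshold = 3

# only items that are frequent by themselves can be part of a frequent pair
frequent_items = [item for item, c in item_counts.items() if c >= threshold]
candidate_pairs = list(combinations(frequent_items, 2))
print(candidate_pairs)  # no pair involving 'caviar' is ever counted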

Here is my implementation of the Apriori algorithm in Python:

from collections import defaultdict
import itertools as it


def create_candidate_item_set(dataset_file):
  ''' Create a dictionary of all candidate items from the data set with their corresponding count '''

  candidate_item_list = defaultdict(int)
  baskets = []

  with open(dataset_file) as file:
    for line in file:
      # each line is one basket: item ids separated by whitespace
      num_list = list(map(int, line.split()))
      # create a list of all baskets
      baskets.append(num_list)
      # count each individual item across all baskets
      for item in num_list:
        candidate_item_list[item] += 1

  return candidate_item_list, baskets


def create_frequent_item_set(item_list, min_threshold):
  ''' Return the frequent items from the candidate item list that meet min_threshold '''

  # delete items that don't meet the minimum threshold
  # (iterate over a copy so we can safely delete from the dictionary)
  for key, value in list(item_list.items()):
    if value < min_threshold:
      del item_list[key]

  return list(item_list.keys()), list(item_list.values())


def count(item_list, baskets):
  ''' Count the number of baskets that contain each candidate item set '''
  counts = {key: 0 for key in item_list}

  for basket in baskets:
    basket_set = set(basket)
    for key in counts:
      if set(key) <= basket_set:
        counts[key] += 1

  return counts


def join(freq_item_sets, k):
  ''' Generate the candidate item sets of size k from the previous frequent item sets '''

  # k is the size of each candidate item set
  if k <= 2:
    # freq_item_sets holds single items: pair them up directly
    return list(it.combinations(freq_item_sets, k))
  else:
    # freq_item_sets holds tuples: recombine the items that appear in them
    return list(it.combinations(set(a for b in freq_item_sets for a in b), k))


def apriori(dataset_file, threshold):

  C1, baskets = create_candidate_item_set(dataset_file)
  F1_items, F1_values = create_frequent_item_set(C1, threshold)

  if not F1_values:
    return None
  else:
    # Remember the frequent item sets of each size
    L = [[F1_items, F1_values]]
    k = 2
    while True:
      candidates = join(L[k-2][0], k)
      counts = count(candidates, baskets)

      Fk_items, Fk_values = create_frequent_item_set(counts, threshold)
      if len(Fk_items) > 0:
        L.append([Fk_items, Fk_values])
        k += 1
      else:
        break

    # return the largest frequent item sets found
    return L[k-2][0]

I’ve tested this on a data set of 88,000+ baskets with a threshold of 10,000. The algorithm completed the task in 2.4 seconds.
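If you’d like to try it yourself, the call looks something like this (the file name and threshold are placeholders; the input is expected to have one basket per line, with item ids separated by whitespace):

if __name__ == '__main__':
  # 'retail.dat' is a placeholder: one basket per line, e.g. "38 39 47 48"
  frequent_sets = apriori('retail.dat', 100)
  print(frequent_sets)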

I hope you’ve enjoyed this post and happy learning!

Matt