
Apriori Hash-Based Improvement

In my last post I discussed frequent item set mining with the Apriori algorithm, which finds frequent item sets by extending smaller item sets to larger ones, as long as the extended sets still appear sufficiently often in the data set.

One can observe that during the first pass of Apriori, most of the system's memory sits idle: only the individual item counts are being maintained. One improvement is to add a hash-based check in Pass 1 that reduces the number of candidate pairs, and therefore the memory required, in Pass 2.

Introducing PCY, named after its authors Park, Chen and Yu. PCY improves Apriori by focusing on efficient memory usage. The idea is to maintain a hash table with as many 'buckets' as will fit into main memory, and for each bucket keep a count of the pairs that hash into it.

FOR (each basket) :
  FOR (each item in the basket) :
    add 1 to item’s count;
  
  // THIS IS NEW IN PCY
  FOR (each pair of items in the basket) :
    hash the pair to a bucket;
    add 1 to the count for that bucket;

A few things to consider when hashing pairs:

  • Pairs of items need to be generated from the baskets as they are read; they are not stored explicitly in the input file (see the sketch after this list).
  • A bucket count only tells us that the pairs hashing to that bucket are collectively frequent; Pass 2 must still verify that each pair itself occurs at least s (the support threshold) times.
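
For the first point, pairs can be generated on the fly from each basket with itertools; a minimal sketch (the example basket is made up):

import itertools as it

basket = [1, 2, 3]  # one basket read from the input file
pairs = list(it.combinations(basket, 2))
print(pairs)        # [(1, 2), (1, 3), (2, 3)]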

After Pass 1 is complete, there is an intermediate step that must be completed before Pass 2. To conserve memory, the hash table from Pass 1 is converted into a bit vector: a 1 means a bucket's count reached the support threshold s, and a 0 means it did not. Each 4-byte integer count is replaced by a single bit, so the bit vector requires 1/32 of the memory. With the 2,000 buckets used below, for example, 8,000 bytes of counts shrink to 250 bytes of bits.

Now, for a pair {i, j} to be a candidate pair in Pass 2, it must pass two checks (sketched as a helper function after this list):

  1. Both i and j are frequent items.
  2. The hash function hash(i, j) maps the pair to a bucket whose bit in the bit vector is 1.
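
Put as code, a minimal sketch of that test; the name is_candidate_pair is my own, and the bitmap and hash function are passed in so the sketch stands on its own (the implementation below supplies both):

def is_candidate_pair(i, j, frequent_items, bitmap, hash_fn):
  ''' A pair is a Pass 2 candidate only if both items are frequent
      and the pair hashes to a frequent bucket '''
  return (i in frequent_items and
          j in frequent_items and
          bitmap.test(hash_fn(i, j)))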

Here’s my implementation in Python:

import itertools as it
from collections import defaultdict
# BitMap comes from the 'bitmap' package (pip install bitmap)
from bitmap import BitMap

NUM_BUCKETS = 2000

def hash_pair(num1, num2):
  ''' Hash function for the bucket hash table, using NUM_BUCKETS buckets.
      Named hash_pair to avoid shadowing Python's built-in hash() '''
  return (num1 ^ num2) % NUM_BUCKETS


def create_bitmap(hash_table, threshold):
  ''' Creates a bitmap for the bucket hash table: a set bit marks a frequent bucket '''
  # size the bitmap by the number of buckets, not the number of
  # non-empty entries, so every hash value has a valid bit position
  bit_map = BitMap(NUM_BUCKETS)

  for key, value in hash_table.items():
    if value >= threshold:
      bit_map.set(key)

  return bit_map


def create_candidate_item_set(dataset_file):
  ''' Create a dictionary of all candidate items from the data set with their
      corresponding counts, along with the baskets and the pair-bucket counts '''

  candidate_item_list = defaultdict(int)
  baskets = []
  buckets = defaultdict(int)

  with open(dataset_file) as file:
    for line in file:
      ##
      # Create the candidate item set
      ##
      # materialize the map() so the basket can be iterated more than once
      num_list = list(map(int, line.split()))
      # create a list of all baskets
      baskets.append(num_list)
      # create a dictionary with a count of each individual item
      for item in num_list:
        candidate_item_list[item] += 1

      ##
      # Hash each pair of items in the basket to a bucket and count it
      ##
      for pair in it.combinations(num_list, 2):
        buckets[hash_pair(pair[0], pair[1])] += 1

  return candidate_item_list, baskets, buckets


def create_frequent_item_set(item_list, min_threshold):
  ''' Return the frequent items from the candidate item list that meet min_threshold '''

  # delete items that don't meet the minimum threshold; iterate over a
  # copy of the keys so the dictionary can be mutated while looping
  for key in list(item_list):
    if item_list[key] < min_threshold:
      del item_list[key]

  return list(item_list.keys())


def count(item_list, baskets):
  ''' Count the occurrences of each candidate item set in the baskets '''
  # counts start at zero, not one
  counts = dict.fromkeys(item_list, 0)

  for basket in baskets:
    basket_set = set(basket)
    for key in counts:
      # <= tests subset-or-equal, so a set matching the whole basket still counts
      if set(key) <= basket_set:
        counts[key] += 1

  return counts


def join(freq_item_sets, k):
  ''' Generate the candidate item sets of size k from the frequent (k-1)-item sets '''

  # k is the size of each generated item set
  if k <= 2:
    # pairs are built directly from the frequent single items
    return list(it.combinations(freq_item_sets, k))
  else:
    # otherwise combine the distinct items appearing in the frequent (k-1)-sets
    return list(it.combinations(set(a for b in freq_item_sets for a in b), k))


def pcy(dataset_file, threshold):

  C1, baskets, buckets = create_candidate_item_set(dataset_file)
  bitmap = create_bitmap(buckets, threshold)
  F1_items = create_frequent_item_set(C1, threshold)

  # hash each candidate pair into the bitmap and keep only the pairs that
  # land in a frequent bucket (filtering with a comprehension instead of
  # removing items from the list while iterating over it)
  candidate_pairs = [pair for pair in join(F1_items, 2)
                     if bitmap.test(hash_pair(pair[0], pair[1]))]

  if not candidate_pairs:
    return None

  # Initialize with the surviving candidate pairs
  L = [candidate_pairs]
  items = count(L[0], baskets)
  # check which candidate pairs meet the minimum threshold value
  L[0] = create_frequent_item_set(items, threshold)

  k = 3
  while True:
    new_list = join(L[k-3], k)
    items = count(new_list, baskets)

    Fk_items = create_frequent_item_set(items, threshold)
    if len(Fk_items) > 0:
      L.append(Fk_items)
      k += 1
    else:
      break

  # L[k-3] is the last non-empty level: the largest frequent item sets
  return L[k-3]
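
A quick usage sketch (the file name and threshold are placeholders; the input is assumed to have one whitespace-separated basket of integer item IDs per line):

if __name__ == '__main__':
  # e.g. each line of retail.txt looks like "1 5 9 13"
  largest_frequent_sets = pcy('retail.txt', 100)
  print(largest_frequent_sets)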