1. Consider the following dataset:
[‘bab3’, ‘bc01’, ‘cc2’, ‘cd5’, ‘cd3’, ‘cdx2’, ‘cdx1’, ‘e01’, ‘g02’, ‘ha1’, ‘hb1’, ‘hc8’, ‘hz5’, ‘z00’, ‘z01’, ‘bc01’]
a. Assume a hashing function that makes an assignment based on the 1st symbol of the string. So ‘bab3’ goes into Bucket1 since it starts with ‘b’ and ‘cc2’ goes into Bucket2 since it starts with ‘c’. (Yes, it is a very crude hash function.)
[a-b] -> Bucket1
[c-d] -> Bucket2
[e-f] -> Bucket3
[g-z] -> Bucket4
Why (or why not?) would you consider it a good hashing function? Please note that an answer of yes or no (without an explanation) will not be credited.
b. Design your own (good) hash function based on the given data and using exactly 5 buckets. In this case, the goodness of the function is measured based on load-balancing of the data.
c. Suppose that the input dataset is:
[‘a1’, ‘a1’, ‘b1’, ‘d1’, ‘a1’, ‘a1’, ‘b1’, ‘c1’, ‘a2’, ‘c1’, ‘c1’, ‘a1’, ‘d2’, ‘d1’].
How would you design a hash function to partition this data into 3 buckets? Once again, the goodness of the hash function is measured by how evenly it distributes the data (as evenly as possible).
a. Using the hashing function described above, the number of elements in each bucket is as follows:
Bucket1 -> [ ‘bab3’, ‘bc01’, ‘bc01’ ]: 3 elements
Bucket2 -> [ ‘cc2’, ‘cd5’, ‘cd3’, ‘cdx2’, ‘cdx1’] : 5 elements
Bucket3 -> [ ‘e01’]: 1 element
Bucket4 -> [ ‘g02’, ‘ha1’, ‘hb1’, ‘hc8’, ‘hz5’, ‘z00’, ‘z01’]: 7 elements
The hashing function is not good for the given data because the elements are not evenly distributed across the buckets: Bucket3 contains only a single element, whereas Bucket4 contains 7 elements.
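The bucket counts above can be checked with a short script. This is only a minimal sketch of the first-character scheme from the question; the function name first_char_bucket and the use of a dictionary of lists are illustrative choices, not part of the original problem.

```python
from collections import defaultdict

data = ['bab3', 'bc01', 'cc2', 'cd5', 'cd3', 'cdx2', 'cdx1', 'e01',
        'g02', 'ha1', 'hb1', 'hc8', 'hz5', 'z00', 'z01', 'bc01']

def first_char_bucket(key):
    """Return the bucket number (1-4) chosen from the first character of key."""
    c = key[0]
    if 'a' <= c <= 'b':
        return 1
    if 'c' <= c <= 'd':
        return 2
    if 'e' <= c <= 'f':
        return 3
    if 'g' <= c <= 'z':
        return 4
    raise ValueError("key must start with a lowercase letter: " + repr(key))

# Group the keys by bucket number.
buckets = defaultdict(list)
for key in data:
    buckets[first_char_bucket(key)].append(key)

# Number of elements per bucket, in bucket order.
sizes = {b: len(buckets[b]) for b in sorted(buckets)}
print(sizes)  # {1: 3, 2: 5, 3: 1, 4: 7}
```

The spread between the smallest and largest bucket (1 versus 7) is what makes the function a poor fit for this data.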
b. For the given data, a better hash function (using exactly 5 buckets, as the question requires) is the following:
Map each string to a number as follows:
map each letter to its position in the alphabet: a to 1, b to 2, and so on, up to z to 26;
map each digit to its own numeric value;
add up the values of all characters in the string.
E.g. for bab3: b is 2, a is 1, b is 2 and 3 is 3, hence the sum is 2 + 1 + 2 + 3 = 8.
hence the data:
[‘bab3’, ‘bc01’, ‘cc2’, ‘cd5’, ‘cd3’, ‘cdx2’, ‘cdx1’, ‘e01’, ‘g02’, ‘ha1’, ‘hb1’, ‘hc8’, ‘hz5’, ‘z00’, ‘z01’, ‘bc01’]
will become:
[8, 6, 8, 12, 10, 33, 32, 6, 9, 10, 11, 19, 39, 26, 27, 6]
Each sum is then assigned to a bucket by range:
[0-6] : bucket 1
[7-9] : bucket 2
[10-11]: bucket 3
[12-26]: bucket 4
[greater than 26]: bucket 5
The contents of the buckets are then:
bucket 1: ['bc01', 'e01', 'bc01']
bucket 2: ['bab3' , 'cc2', 'g02']
bucket 3: ['cd3', 'ha1','hb1']
bucket 4: ['cd5', 'hc8' , 'z00']
bucket 5: ['z01','cdx1','cdx2','hz5']
The elements are now evenly distributed: four buckets hold 3 elements each and one holds 4.
Hence it is a good hash function for the given data.
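The character-sum scheme from part b can be sketched in a few lines. The helper names char_sum and sum_bucket5 are made up for this illustration, and the sketch assumes every key consists only of lowercase letters and digits, as in the given data.

```python
from collections import defaultdict

data = ['bab3', 'bc01', 'cc2', 'cd5', 'cd3', 'cdx2', 'cdx1', 'e01',
        'g02', 'ha1', 'hb1', 'hc8', 'hz5', 'z00', 'z01', 'bc01']

def char_sum(key):
    """Sum letter positions (a=1 ... z=26) and digit values of key."""
    total = 0
    for ch in key:
        if ch.isdigit():
            total += int(ch)
        else:
            total += ord(ch) - ord('a') + 1
    return total

def sum_bucket5(key):
    """Map the character sum of key to one of the 5 range buckets."""
    s = char_sum(key)
    if s <= 6:
        return 1
    if s <= 9:
        return 2
    if s <= 11:
        return 3
    if s <= 26:
        return 4
    return 5

# Group the keys by bucket number.
buckets = defaultdict(list)
for key in data:
    buckets[sum_bucket5(key)].append(key)

sizes = {b: len(buckets[b]) for b in sorted(buckets)}
print(char_sum('bab3'))  # 8
print(sizes)             # {1: 3, 2: 3, 3: 3, 4: 3, 5: 4}
```

Note that the range boundaries were picked by looking at this particular dataset, so the balance is not guaranteed for other inputs.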
c. For the data:
[‘a1’, ‘a1’, ‘b1’, ‘d1’, ‘a1’, ‘a1’, ‘b1’, ‘c1’, ‘a2’, ‘c1’, ‘c1’, ‘a1’, ‘d2’, ‘d1’]
The same method (sum of letter positions and digit values) can be applied to this data.
First, the mapped data will be:
[2,2,3,5,2,2,3,4,3,4,4,2,6,5]
Using these sums, a good hash function for the given dataset is:
[0-2] : bucket 1
[3-4] : bucket 2
[5 and above]: bucket 3
The contents of the buckets are then:
Bucket1: ['a1' ,'a1' ,'a1', 'a1' , 'a1']
Bucket2: ['a2', 'b1', 'b1', 'c1', 'c1', 'c1']
Bucket3: ['d1','d1','d2']
The buckets (5, 6 and 3 elements) are not as evenly distributed as in part b, but this is still a good distribution of the elements.
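The same idea for part c can be sketched with the range boundaries passed in as a list, so the cut points are easy to change. The names char_sum, range_bucket and upper_bounds are hypothetical, and the use of bisect to find the range is just one convenient way to do the lookup.

```python
from bisect import bisect_left
from collections import defaultdict

data = ['a1', 'a1', 'b1', 'd1', 'a1', 'a1', 'b1',
        'c1', 'a2', 'c1', 'c1', 'a1', 'd2', 'd1']

def char_sum(key):
    """Sum letter positions (a=1 ... z=26) and digit values of key."""
    return sum(int(ch) if ch.isdigit() else ord(ch) - ord('a') + 1
               for ch in key)

def range_bucket(key, upper_bounds):
    """Return a 1-based bucket number.

    upper_bounds holds the inclusive upper limit of every bucket except
    the last one, which takes all larger sums.
    """
    return bisect_left(upper_bounds, char_sum(key)) + 1

# Sums 0-2 go to bucket 1, 3-4 to bucket 2, 5 and above to bucket 3.
upper_bounds = [2, 4]

buckets = defaultdict(list)
for key in data:
    buckets[range_bucket(key, upper_bounds)].append(key)

sizes = {b: len(buckets[b]) for b in sorted(buckets)}
print(sizes)  # {1: 5, 2: 6, 3: 3}
```

Because identical strings always get the same sum, all five copies of 'a1' necessarily land in the same bucket, whatever boundaries are chosen.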