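1.) Here the distinct instances (attribute combinations) are
(M,Family,Small), (M,Sports,Medium), and so on. Every combination besides (F,Luxury,Large) is pure, i.e. all of its examples belong to a single class.
Gini value (sum of squared class proportions) for (M,Family,Small) = ((1/1)²+(0/1)²) = 1
Its weight is 1/20, since it covers 1 of the 20 examples.
Calculating similarly for the other combinations, we get the overall gini value as: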
((1/20)*((1/1)²+(0/1)²)+(2/20)*((2/2)²+(0/2)²)+(1/20)*((1/1)²+(0/1)²)+(2/20)*((2/2)²+(0/2)²)+(2/20)*((2/2)²+(0/2)²)+(1/20)*((1/1)²+(0/1)²)+(2/20)*((1/2)²+(1/2)²)+(1/20)*((1/1)²+(0/1)²)+(1/20)*((1/1)²+(0/1)²)+(1/20)*((1/1)²+(0/1)²)+(1/20)*((1/1)²+(0/1)²)+(2/20)*((2/2)²+(0/2)²)+(3/20)*((3/3)²+(0/3)²))=19/20
Gini Index=1-(19/20)=1/20=0.05
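The arithmetic above can be checked with a short script. This is only a minimal sketch: the helper name `gini_index_of_split` is hypothetical, and the per-combination counts are read off the terms of the sum above, where the only impure group, (F,Luxury,Large), is split 1/1. Which class a pure group belongs to does not change the result, so pure groups are written as (n, 0).

```python
from fractions import Fraction

def gini_index_of_split(groups):
    """Return 1 - sum_g (n_g / n) * sum_c p(c | g)^2 for a list of class-count tuples."""
    total = sum(sum(counts) for counts in groups)
    weighted = Fraction(0)
    for counts in groups:
        size = sum(counts)
        # "gini value" in the answer's wording: sum of squared class proportions
        gini_value = sum(Fraction(c, size) ** 2 for c in counts)
        weighted += Fraction(size, total) * gini_value
    return 1 - weighted

# One tuple per distinct (Gender, Car Type, Shirt Size) combination,
# in the same order as the terms of the sum above.
combination_counts = [
    (1, 0), (2, 0), (1, 0), (2, 0), (2, 0), (1, 0), (1, 1),
    (1, 0), (1, 0), (1, 0), (1, 0), (2, 0), (3, 0),
]

result = gini_index_of_split(combination_counts)
print(result, float(result))  # 1/20 0.05
```

Exact fractions are used so that the output can be compared directly with the hand calculation.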
2.) Here each Customer ID covers exactly one example, so every ID is pure: its gini value is 1 and its weight is 1/20, so each ID contributes 1/20 to the sum.
=> overall gini value = 20*(1/20)*1 = 1
Gini index = 1-1 = 0
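Written out with the convention used in this answer (gini value = sum of squared class proportions within a group, weighted by the group's share of the 20 examples), the Customer ID case looks as follows. The symbols $n_v$, $n$ and $p(c \mid v)$ are notation introduced here for illustration only.

```latex
\[
\text{Gini index}(A) \;=\; 1 - \sum_{v \in \text{values}(A)} \frac{n_v}{n} \sum_{c \in \{C0,\,C1\}} p(c \mid v)^2
\]
\[
\text{Gini index}(\text{Customer ID}) \;=\; 1 - \sum_{i=1}^{20} \frac{1}{20}\left(1^2 + 0^2\right) \;=\; 1 - 1 \;=\; 0
\]
```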
3.) For Gender (10 M, 10 F)
a)For M : 6 belong to C0 and 4 to C1
gini value=((6/10)²+(4/10)²)=0.52
b)For F : 4 belong to C0 and 6 to C1
gini value=((4/10)²+(6/10)²)=0.52
gini index for gender = 1-((10/20)*0.52+(10/20)*0.52)=0.48
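Since the "gini value" here is the sum of squared class proportions, the Gini index is one minus the weighted sum of those values. Because the weights sum to 1, this equals the weighted average of the per-group impurities (1 - gini value), which is the form many textbooks use. The sketch below checks both forms for Gender; the names `sum_of_squares`, `form_a` and `form_b` are made up for the example.

```python
import math

gender_counts = {"M": (6, 4), "F": (4, 6)}  # (C0, C1) per value, from the counts above
n = sum(sum(c) for c in gender_counts.values())

def sum_of_squares(counts):
    """Sum of squared class proportions within one group."""
    size = sum(counts)
    return sum((c / size) ** 2 for c in counts)

# Form used in this answer: 1 - weighted sum of gini values.
form_a = 1 - sum(sum(c) / n * sum_of_squares(c) for c in gender_counts.values())

# Equivalent form: weighted sum of per-group impurities (1 - gini value).
form_b = sum(sum(c) / n * (1 - sum_of_squares(c)) for c in gender_counts.values())

print(round(form_a, 4), round(form_b, 4))  # 0.48 0.48
```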
4.) For Car Type (4 Family, 8 Sports, 8 Luxury)
a)Family(C0:1, C1:3)
gini value= ((1/4)²+(3/4)²)=10/16
b)Sports(C0:8, C1:0)
gini value= ((8/8)²+(0/8)²)=1
c)Luxury(C0:1, C1:7)
gini value= ((1/8)²+(7/8)²)=50/64
Gini index for car type= 1-((4/20)*(10/16)+(8/20)*(1)+(8/20)*(50/64))=1-((1/8)+(2/5)+(5/16))=13/80=0.1625
5.) For Shirt Size (5 Small, 7 Medium, 4 Large, 4 Extra Large)
a)Small(C0:3 ,C1:2)
gini value= ((3/5)²+(2/5)²)=13/25
b)Medium(C0:3 ,C1:4)
gini value= ((3/7)²+(4/7)²)=25/49
c)Large(C0:2 ,C1:2)
gini value= ((2/4)²+(2/4)²)=1/2
d)Extra Large(C0:2 ,C1:2)
gini value= ((2/4)²+(2/4)²)=1/2
Gini Index for shirt size=1-((5/20)*(13/25)+(7/20)*(25/49)+(4/20)*(1/2)+(4/20)*(1/2))=1-(0.13+0.17857+0.1+0.1)=1-0.50857=0.49143
6.) Among Gender (0.48), Car Type (0.1625) and Shirt Size (0.49143), the Gini index for Car Type is the lowest, so Car Type is the best attribute to split on.
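The comparison can be reproduced in a few lines. This is a sketch under the same convention as the hand calculation; the helper name `gini_index_of_split` is hypothetical, and the (C0, C1) counts are the ones listed in parts 3 to 5. The baseline for the unsplit collection uses class totals obtained by adding up the Gender counts (6+4 C0, 4+6 C1).

```python
from fractions import Fraction

def gini_index_of_split(groups):
    """Return 1 - sum_g (n_g / n) * sum_c p(c | g)^2 for a list of class-count tuples."""
    total = sum(sum(counts) for counts in groups)
    weighted = Fraction(0)
    for counts in groups:
        size = sum(counts)
        weighted += Fraction(size, total) * sum(Fraction(c, size) ** 2 for c in counts)
    return 1 - weighted

# (C0, C1) counts per attribute value, copied from parts 3, 4 and 5.
attribute_counts = {
    "Gender": [(6, 4), (4, 6)],
    "Car Type": [(1, 3), (8, 0), (1, 7)],
    "Shirt Size": [(3, 2), (3, 4), (2, 2), (2, 2)],
}

# Unsplit collection: 10 examples of C0 and 10 of C1.
baseline = gini_index_of_split([(10, 10)])

scores = {name: gini_index_of_split(groups) for name, groups in attribute_counts.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score} = {float(score):.5f}")

best = min(scores, key=scores.get)
print("baseline:", baseline, "best:", best)
```

The exact values are 13/80 for Car Type, 12/25 for Gender and 86/175 for Shirt Size, all below the unsplit baseline of 1/2.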
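7.) Although Customer ID has the lowest Gini index of all (0), it carries no predictive information. Every ID is unique, so its perfect split is a coincidental irregularity of the training data: a new customer arrives with a new ID that matches no existing branch. Splitting on it also invites overfitting, since each value of Customer ID would map to its own node in the decision tree. Thus the selection of attributes with many distinct, uniformly distributed values should be discouraged.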
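A small illustration of the point about generalization, assuming (hypothetically) that the 20 training IDs are the integers 1 to 20 and that a new customer shows up with ID 21. The majority class per Car Type value follows from the counts in part 4; the function name `predict` and the sample customer are made up for the example.

```python
# Majority class per Car Type value, from the counts in part 4:
# Family (C0:1, C1:3), Sports (C0:8, C1:0), Luxury (C0:1, C1:7).
car_type_rule = {"Family": "C1", "Sports": "C0", "Luxury": "C1"}

# A Customer ID split has one branch per training ID (assumed here to be 1..20).
# The class stored at each branch does not matter for this illustration.
customer_id_branches = set(range(1, 21))

def predict(customer):
    """Return (prediction from Car Type, whether a Customer ID branch exists)."""
    by_car = car_type_rule.get(customer["car_type"])
    id_branch_exists = customer["customer_id"] in customer_id_branches
    return by_car, id_branch_exists

new_customer = {"customer_id": 21, "car_type": "Sports"}  # hypothetical unseen customer
print(predict(new_customer))  # ('C0', False)
```

The Car Type rule still yields a prediction for the unseen customer, while the Customer ID split has no branch to send them down.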
Consider the training examples shown in Table 3.5 for a binary classification problem.
(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute.
Table 3.5. Data set for Exercise 2 (columns: Customer ID, Gender, Car Type, Shirt Size, Class).