In this assignment, you are to write a C++ program (using Visual C++ 2015) that reads...

Question

In this assignment, you are to write a C++ program (using Visual C++ 2015) that reads training data in WEKA ARFF format and generates an ID3 decision tree in a format similar to that of the tree generated by WEKA's ID3. Please note the following:

Your algorithm will use the entire data set to generate the tree. You may assume that the attributes (a) are of nominal type (i.e., no numeric data), and (b) have no missing values.
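Reading the training data could be sketched as follows. This is a minimal illustration, not a full ARFF parser: it relies on the two assumptions stated above (nominal attributes, no missing values) and additionally assumes unquoted attribute names without spaces and comma-separated data rows. The names `ArffData` and `parseArff` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container: attribute names (class attribute last) and data rows.
struct ArffData {
    std::vector<std::string> attributeNames;
    std::vector<std::vector<std::string>> rows;
};

// Remove leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Lower-case copy, used because ARFF keywords are case-insensitive.
static std::string lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Parse the header (@attribute lines) and the rows that follow @data.
ArffData parseArff(std::istream& in) {
    ArffData data;
    bool inData = false;
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '%') continue;  // skip blanks and comments
        if (!inData) {
            std::string head = lower(line);
            if (head.rfind("@attribute", 0) == 0) {
                std::istringstream ss(line);
                std::string keyword, name;
                ss >> keyword >> name;  // assumption: name is unquoted, no spaces
                data.attributeNames.push_back(name);
            } else if (head.rfind("@data", 0) == 0) {
                inData = true;
            }
            continue;  // @relation and anything else in the header is ignored
        }
        std::vector<std::string> row;
        std::istringstream ss(line);
        std::string cell;
        while (std::getline(ss, cell, ',')) row.push_back(trim(cell));
        assert(row.size() == data.attributeNames.size());  // no missing values
        data.rows.push_back(row);
    }
    return data;
}
```

To read from a file, pass a `std::ifstream` opened on the training file to `parseArff`; the last entry of each row is then the class label.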

In general, the basic ID3 algorithm uses an entropy measure to select the best attribute on which to divide the data set. It continues to select attributes for further branching (based on the subset of data belonging to that branch) until either (a) all attributes have been used, or (b) all instances under a node belong to the same class. Provided no two instances share identical attribute values but different classes, this gives a 0% error rate on the training set, although the tree may not work as well with future data due to over-fitting.
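The attribute-selection step described above can be sketched as follows. This is one possible reading, assuming the entropy measure is the usual Shannon entropy of the class labels and that "best" means highest information gain; the `Instance` type and the function names are hypothetical, not part of the assignment.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical instance type: nominal attribute values plus a class label.
struct Instance {
    std::vector<std::string> values;
    std::string label;
};

// Shannon entropy (in bits) of the class labels in `data`.
double entropy(const std::vector<Instance>& data) {
    assert(!data.empty());
    std::map<std::string, int> counts;
    for (const Instance& inst : data) ++counts[inst.label];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / data.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain from splitting `data` on the attribute at index `attr`.
double informationGain(const std::vector<Instance>& data, std::size_t attr) {
    std::map<std::string, std::vector<Instance>> parts;
    for (const Instance& inst : data) parts[inst.values[attr]].push_back(inst);
    double remainder = 0.0;  // weighted entropy of the subsets after the split
    for (const auto& kv : parts) {
        double weight = static_cast<double>(kv.second.size()) / data.size();
        remainder += weight * entropy(kv.second);
    }
    return entropy(data) - remainder;
}

// Index of the highest-gain attribute among the still-unused `candidates`.
std::size_t bestAttribute(const std::vector<Instance>& data,
                          const std::vector<std::size_t>& candidates) {
    assert(!candidates.empty());
    std::size_t best = candidates.front();
    double bestGain = -1.0;
    for (std::size_t a : candidates) {
        double g = informationGain(data, a);
        if (g > bestGain) {
            bestGain = g;
            best = a;
        }
    }
    return best;
}
```

A recursive tree builder would call `bestAttribute` on the subset of instances at each node, remove the chosen attribute from `candidates`, and stop when `candidates` is empty or `entropy` of the subset is zero.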


Answer 1

In this assignment, you will use the WEKA system to analyze two artificial data sets and one real data set. You will apply five learning algorithms to each data set and compare their performance. I have included a section at the end that describes how to get WEKA and how to run it from the GUI or from the command line.

Learning Algorithms. We will compare Perceptron, Logistic Regression, Decision Trees (J48), and k-nearest neighbors (IBk) (two variations: 1-NN and 5-NN).

Data Sets. We will apply these five algorithms to the data sets hw_gmm, hw_step, and statlog. The last of these is from the UCI (University of California, Irvine) Machine Learning Repository and has been cleaned so that there are no missing values. The artificial data sets have one or more training data files and one test data file; the statlog data is one large file:

statlog files in the data folder directory: Index, australian.dat, australian.doc

hw_gmm data files 
      hw_gmm_25.arff       25 training examples
      hw_gmm_50.arff       50 training examples
      hw_gmm_100.arff      100 training examples
      hw_gmm_250.arff      250 training examples
      hw_gmm_500.arff      500 training examples
      hw_gmm_test.arff     test data file

hw_step data files
      hw_step-25.arff      25 training examples
      hw_step-50.arff      50 training examples
      hw_step-100.arff     100 training examples
      hw_step-250.arff     250 training examples
      hw_step-500.arff     500 training examples
      hw_step_test.arff    test data file

You will run the five learning algorithms on each training data file and evaluate the results on the corresponding test data files.

Exercises / What to turn in.
1. [Data Handling, 10 points]: Your first task is to download the statlog data and convert it to WEKA format. This task will make you familiar with how to download and handle data.
  - Go to the webpage and follow the link to 'Data Folder' at the top (right under the main name).
  - Download the australian.dat file and split it into two files: 490 instances for the training set (name it statlog.arff) and 200 instances for the test set (name it statlog_test.arff).
  - Add a weka arff header to these two files, using the details of the attribute information on the web-page. Look at the artificial data sets to help you get the syntax correct. Make sure you keep the attributes and instances in the same order as they are in the original files.
  You will be predicting the last attribute (class).
  TURN IN:
  You should turn in the top 50 lines of your statlog.arff and statlog_test.arff files.
2. [Training Set Sensitivity, 30 points]: How sensitive are the various learners to the training set size? We will have each learner learn on each of the training files (sizes 10 to 500) and record its accuracy. This exercise gives us insight into the behavior of each learner and how sensitive it is to training set size. This knowledge is useful when deciding which learner to use for a specific problem.
  For each classifier and each problem domain, you should learn using each of the training files (e.g., hw_step_10.arff) and test the learned model on the given test file (e.g., hw_step_test.arff). Record the accuracy of the learned model and report it in a table and a graph as specified in parts 1 and 2 below. See the end of the homework for how to do these runs and get the accuracies. I suggest you use the command line to run these as a batch.
  1. [Tabular comparison, 20 points]
    TURN IN:
    A table in the following format:
```
-------------------------------------------------------
hw_gmm:
N       Perceptron   LogReg   J48     kNN-1     kNN-5
25      xxx          yyy      zzz     kkk1      kkk5
50      xxx          yyy      zzz     kkk1      kkk5
100     xxx          yyy      zzz     kkk1      kkk5
250     xxx          yyy      zzz     kkk1      kkk5
500     xxx          yyy      zzz     kkk1      kkk5

hw_step:
N       Perceptron   LogReg   J48     kNN-1     kNN-5
25      xxx          yyy      zzz     kkk1      kkk5
50      xxx          yyy      zzz     kkk1      kkk5
100     xxx          yyy      zzz     kkk1      kkk5
250     xxx          yyy      zzz     kkk1      kkk5
500     xxx          yyy      zzz     kkk1      kkk5

statlog:
N       Perceptron   LogReg   J48     kNN-1     kNN-5
490     xxx          yyy      zzz     kkk1      kkk5
-------------------------------------------------------
```
    Where xxx gives the error rate of the Perceptron, yyy gives the error rate of Logistic Regression, etc.
  2. [Graph Comparison, 10 points]
    TURN IN:
    Graphs of the results for hw_gmm and hw_step plotting the performance of the five algorithms as a function of the size of the training data set (known as a "learning curve"). I recommend using gnuplot, Excel, or MATLAB for constructing the graphs, as WEKA does not provide an easy way to do this.
    For gnuplot, you need to create a separate file for each learner. Each file should consist of x,y pairs, where x is the training set size and y is the accuracy. You can then plot these files using the plot command.
    
    For Excel, you can build the graphs from the table above using the chart wizard.
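The per-learner data files for gnuplot could be produced with a small helper like the one below. This is only a sketch: the function name `writeLearningCurve` is hypothetical, and it writes whitespace-separated columns, which is gnuplot's default data-file layout.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <vector>

// Write one "size accuracy" pair per line for a single learner.
void writeLearningCurve(std::ostream& out,
                        const std::vector<int>& sizes,
                        const std::vector<double>& accuracies) {
    assert(sizes.size() == accuracies.size());
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        out << sizes[i] << ' ' << accuracies[i] << '\n';
    }
}
```

Passing a `std::ofstream` opened on, say, a file per learner gives one file for each curve, and each file can then be drawn with gnuplot's `plot` command.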
3. [Decision Boundaries, 60 points]
  Each learner creates decision boundaries and we often would like to know what these boundaries are. In some cases, such as logistic regression and J48, computing these boundaries is straightforward. In other cases, such as VotedPerceptron and Nearest Neighbor, this is not so easy and we need to use other means. These exercises are meant to help you understand how to get the decision boundaries from the learned models.
  1. [Logistic Regression, 20 points] Let us consider the hw_gmm_25 and hw_step_50 training sets and the kind of decision boundaries that logistic regression finds for them. To compute the decision boundary for Logistic Regression, recall that the logistic regression model has the form
```
log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2
```
    WEKA produces a table that looks like
```
 Variable      Coeff.
        1      w1
        2      w2
Intercept      w0
```
    TURN IN:
    
    (i, 10 points) Plot of the data points for hw_gmm_25 with lines showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm.
    
    (ii, 10 points) Plot of the data points for hw_step_50 with a line showing the learned decision boundary for Logistic Regression.
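The decision boundary is the set of points where the log-odds in the model above equal zero, i.e. where w0 + w1*x1 + w2*x2 = 0. Solving for x2 gives a line that can be plotted over the data. The sketch below assumes w2 is nonzero; the function names are hypothetical, and the coefficients must be read from your own WEKA output.

```cpp
#include <cassert>

// Log-odds of the model: w0 + w1*x1 + w2*x2.
double logOdds(double w0, double w1, double w2, double x1, double x2) {
    return w0 + w1 * x1 + w2 * x2;
}

// x2 coordinate of the boundary at a given x1, from w0 + w1*x1 + w2*x2 = 0.
double boundaryX2(double w0, double w1, double w2, double x1) {
    assert(w2 != 0.0);  // if w2 is zero the boundary is the vertical line x1 = -w0/w1
    return -(w0 + w1 * x1) / w2;
}
```

Evaluating `boundaryX2` at the smallest and largest x1 values in the data gives two points, which is enough to draw the line.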
  2. [J48, 20 points]:
    Now, let us consider the hw_gmm_250 and hw_step_250 training sets and the kind of decision boundaries found by J48. This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format:
```
x1 <= 1.0: positive (75.0/17.0)
x1 > 1.0
|   x2 <= 5.0: negative (42.0/12.0)
|   x2 > 5.0: positive (33.0/10.0)
```
    The first line indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this leaf contains 75 data points of which 17 were misclassified. Indentation indicates child nodes. The vertical bars are intended to make it easier to see the indentations.
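Read as code, the example tree above is a pair of nested threshold tests, so its decision boundaries are the axis-parallel lines x1 = 1.0 and (for x1 > 1.0) x2 = 5.0. The following sketch hard-codes that example tree only; the function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Classify a point using the example tree shown above.
std::string classifyExample(double x1, double x2) {
    if (x1 <= 1.0) return "positive";  // first branch is a leaf
    if (x2 <= 5.0) return "negative";  // x1 > 1.0, x2 <= 5.0
    return "positive";                 // x1 > 1.0, x2 > 5.0
}

// Fraction of training points at a leaf that were misclassified,
// e.g. (75.0/17.0) means 17 of 75 points.
double leafErrorRate(double total, double misclassified) {
    assert(total > 0.0);
    return misclassified / total;
}
```

For the first leaf, `leafErrorRate(75.0, 17.0)` is 17/75, roughly 0.227.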
