In this assignment, you are to write a C++ program (using Visual C++ 2015) that reads training data in WEKA ARFF format and generates an ID3 decision tree in a format similar to that of the tree generated by WEKA's ID3. Please note the following:
Your algorithm will use the entire data set to generate the tree. You may assume that the attributes (a) are of nominal type (i.e., no numeric data), and (b) have no missing values.
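Under those two assumptions (nominal attributes only, no missing values), reading the ARFF file can be kept quite simple. The following is only a minimal sketch: the names ArffData, readArff and trim are hypothetical, and it assumes attribute names contain no spaces or quotes and that the data section is in the plain comma-separated (non-sparse) form.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for a parsed ARFF file.
struct ArffData {
    std::vector<std::string> attributeNames;      // in declaration order
    std::vector<std::vector<std::string>> rows;   // one vector of values per instance
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Read the header (@attribute lines) and the rows that follow @data.
ArffData readArff(std::istream& in) {
    ArffData data;
    std::string line;
    bool inData = false;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '%') continue;   // blank line or comment
        if (!inData) {
            std::string lower = line;                    // keywords are case-insensitive
            std::transform(lower.begin(), lower.end(), lower.begin(),
                           [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
            if (lower.compare(0, 10, "@attribute") == 0) {
                std::istringstream ss(line.substr(10));
                std::string name;
                ss >> name;                              // assumes a name without spaces
                data.attributeNames.push_back(name);
            } else if (lower.compare(0, 5, "@data") == 0) {
                inData = true;
            }
            continue;                                    // @relation etc. are ignored
        }
        std::vector<std::string> row;
        std::istringstream ss(line);
        std::string cell;
        while (std::getline(ss, cell, ',')) row.push_back(trim(cell));
        // No missing values are assumed, so every row must be complete.
        assert(row.size() == data.attributeNames.size());
        data.rows.push_back(row);
    }
    return data;
}
```

To read from disk, pass an std::ifstream opened on the training file to readArff. The last attribute is conventionally the class, but that is a convention of the data set rather than something this sketch enforces.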
In general, the basic ID3 algorithm uses an entropy measure to select the best attribute on which to divide the data set. It continues to select attributes for further branching (based on the subset of data belonging to that branch) until either (a) all attributes have been used, or (b) all instances under a node belong to the same class. This gives a 0% error rate on the training set, provided no two instances with identical attribute values carry different class labels, although the tree may not work as well on future data due to over-fitting.
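The attribute-selection step described above can be sketched as follows. This is an illustration under assumptions, not a required design: the Instance struct and the function names entropy, infoGain and bestAttribute are hypothetical, and information gain (entropy before the split minus the weighted entropy after it) is used as the entropy-based measure.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical layout: nominal attribute values plus a class label.
struct Instance {
    std::vector<std::string> attrs;
    std::string label;
};

// Shannon entropy (in bits) of the class labels in `data`.
double entropy(const std::vector<Instance>& data) {
    if (data.empty()) return 0.0;
    std::map<std::string, int> counts;
    for (const auto& inst : data) ++counts[inst.label];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / data.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain obtained by splitting `data` on attribute index `a`.
double infoGain(const std::vector<Instance>& data, std::size_t a) {
    if (data.empty()) return 0.0;
    std::map<std::string, std::vector<Instance>> parts;   // one subset per value
    for (const auto& inst : data) {
        assert(a < inst.attrs.size());
        parts[inst.attrs[a]].push_back(inst);
    }
    double remainder = 0.0;                               // weighted entropy after the split
    for (const auto& kv : parts) {
        double w = static_cast<double>(kv.second.size()) / data.size();
        remainder += w * entropy(kv.second);
    }
    return entropy(data) - remainder;
}

// Attribute with the highest gain among the still-unused `candidates`.
std::size_t bestAttribute(const std::vector<Instance>& data,
                          const std::vector<std::size_t>& candidates) {
    assert(!candidates.empty());
    std::size_t best = candidates.front();
    double bestGain = -1.0;
    for (std::size_t a : candidates) {
        double g = infoGain(data, a);
        if (g > bestGain) { bestGain = g; best = a; }
    }
    return best;
}
```

A recursive tree builder would call bestAttribute on the subset at each node, remove the chosen index from the candidate list, and stop when the subset's entropy is 0 or no candidates remain.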
In this assignment, you will use the WEKA system to analyze two artificial data sets and one real data set. You will apply five learning algorithms to each data set and compare their performance. I have included a section at the end that describes how to get WEKA and how to run it from the GUI or from the command line.
statlog files in the data folder directory: Index, australian.dat, australian.doc
hw_gmm data files
hw_gmm_25.arff 25 training examples
hw_gmm_50.arff 50 training examples
hw_gmm_100.arff 100 training examples
hw_gmm_250.arff 250 training examples
hw_gmm_500.arff 500 training examples
hw_gmm_test.arff test data file
hw_step data files
hw_step-25.arff 25 training examples
hw_step-50.arff 50 training examples
hw_step-100.arff 100 training examples
hw_step-250.arff 250 training examples
hw_step-500.arff 500 training examples
hw_step_test.arff test data file
You will run the five learning algorithms on each training data file and evaluate the results on the corresponding test data files.
TURN IN:
You should turn in the top 50 lines of your statlog.arff and statlog_test.arff files.
For each classifier and each problem domain, you should learn using each of the training files (e.g., hw_step_10.arff) and test the learned model on the given test file (e.g., hw_step_test.arff). Record the accuracy of the learned model and report it in a table and a graph as specified in (a) and (b). See the end of the homework for how to do these runs and get the accuracies. I suggest you use the command line to do these runs in a batch setting.
TURN IN:
A table in the following format:
-------------------------------------------------------
hw_gmm:
N     Perceptron  LogReg  J48  kNN-1  kNN-5
25    xxx         yyy     zzz  kkk1   kkk5
50    xxx         yyy     zzz  kkk1   kkk5
100   xxx         yyy     zzz  kkk1   kkk5
250   xxx         yyy     zzz  kkk1   kkk5
500   xxx         yyy     zzz  kkk1   kkk5

hw_step:
N     Perceptron  LogReg  J48  kNN-1  kNN-5
25    xxx         yyy     zzz  kkk1   kkk5
50    xxx         yyy     zzz  kkk1   kkk5
100   xxx         yyy     zzz  kkk1   kkk5
250   xxx         yyy     zzz  kkk1   kkk5
500   xxx         yyy     zzz  kkk1   kkk5

adult:
N     Perceptron  LogReg  J48  kNN-1  kNN-5
490   xxx         yyy     zzz  kkk1   kkk5
-------------------------------------------------------
Where xxx gives the error rate of the perceptron, yyy gives the error rate of LogisticRegression, etc.
For gnuplot, you need to create a separate file for each learner. Each file should consist of x,y pairs, where x is the training set size and y is the accuracy. You can then plot these files using the plot command.
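One way to produce such per-learner files is sketched below. The names Curve, curveToText and saveCurve are hypothetical, and the sketch writes whitespace-separated pairs because that is what gnuplot reads by default (comma-separated files need "set datafile separator" first).

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// (training set size, accuracy) pairs for one learner.
typedef std::vector<std::pair<int, double>> Curve;

// Format one "x y" pair per line.
std::string curveToText(const Curve& curve) {
    std::ostringstream out;
    for (const auto& p : curve) out << p.first << ' ' << p.second << '\n';
    return out.str();
}

// Write the curve to `path`; returns false if the file could not be written.
bool saveCurve(const std::string& path, const Curve& curve) {
    assert(!path.empty());
    std::ofstream file(path.c_str());
    if (!file) return false;
    file << curveToText(curve);
    return static_cast<bool>(file);
}
```

After saving one file per learner (for instance a hypothetical perceptron.dat), a gnuplot command such as plot "perceptron.dat" with linespoints draws that learner's curve.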
For excel, you can plot the graphs using the table above and use the chart wizard to draw your graphs.
log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2
WEKA produces a table that looks like
Variable Coeff.
1 w1
2 w2
Intercept w0
TURN IN:
(i, 10 points) Plot of the data points for hw_gmm_25 with lines showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm.
(ii, 10 points) Plot of the data points for hw_step_50 with a line showing the learned decision boundary for Logistic Regression.
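To draw the Logistic Regression boundary in plots (i) and (ii), note that the boundary is where the two classes are equally likely, so the log-odds w0 + w1*x1 + w2*x2 equal 0. Solving for x2 gives x2 = -(w0 + w1*x1) / w2. The sketch below evaluates that line; the function names logOdds and boundaryX2 are hypothetical, and it assumes the coefficients are read from the WEKA table with the same sign convention as the equation above (worth checking against which class WEKA reports the odds for).

```cpp
#include <cassert>

// Log-odds of class y=1 at the point (x1, x2).
double logOdds(double w0, double w1, double w2, double x1, double x2) {
    return w0 + w1 * x1 + w2 * x2;
}

// x2 coordinate of the decision boundary at a given x1,
// obtained by solving w0 + w1*x1 + w2*x2 = 0 for x2.
double boundaryX2(double w0, double w1, double w2, double x1) {
    assert(w2 != 0.0);   // if w2 == 0 the boundary is the vertical line x1 = -w0/w1
    return -(w0 + w1 * x1) / w2;
}
```

Evaluating boundaryX2 at the smallest and largest x1 values in the data gives two points, and the straight line through them is the boundary to overlay on the scatter plot.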
Now, let us consider the hw_gmm_250 and hw_step_250 training sets and the kind of decision boundaries found by J48. This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format:
x1 <= 1.0: positive (75.0/17.0)
x1 > 1.0
|   x2 <= 5.0: negative (42.0/12.0)
|   x2 > 5.0: positive (33.0/10.0)
The first line indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this leaf contains 75 data points of which 17 were misclassified. Indentation indicates child nodes. The vertical bars are intended to make it easier to see the indentations.
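Reading such a tree as nested tests makes the decision boundary easier to see. The sketch below hard-codes the example tree above; the function names classify and leafErrorRate are hypothetical and the thresholds are the ones from the example, not from any real run.

```cpp
#include <cassert>
#include <string>

// The example tree written as nested tests, one return per leaf.
std::string classify(double x1, double x2) {
    if (x1 <= 1.0) return "positive";   // leaf (75.0/17.0)
    if (x2 <= 5.0) return "negative";   // x1 > 1.0, leaf (42.0/12.0)
    return "positive";                  // x1 > 1.0, leaf (33.0/10.0)
}

// Fraction of training points a leaf misclassifies,
// taken from its "(covered/misclassified)" pair.
double leafErrorRate(double covered, double misclassified) {
    assert(covered > 0.0);
    return misclassified / covered;
}
```

Each test compares a single feature with a threshold, so the boundary consists of axis-parallel pieces: here the vertical line x1 = 1.0, plus the horizontal line x2 = 5.0 restricted to the region x1 > 1.0.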