Project # 2
(Due Date: 31st December, 2013)

You are required to analyze a real-world data mining problem using the techniques studied in this course. The problem can relate to banks, hospitals, cinemas, restaurants, superstores, academic institutions, etc. If you cannot find data on your own, you can pick any competition of interest from the following website:

http://www.kaggle.com/

The website lists several projects posted by real companies, which also pay a cash reward for the best solution to their problem. The deadlines posted on Kaggle may differ, but you must submit your Project # 2 to me by the 31st of December.

In addition to submitting a comprehensive report describing your analysis, you are required to give a detailed presentation (10-15 mins.) on the problem you chose, the nature of the data and how you cleaned/prepared it, and your findings. The presentations will be held on the 31st of December, 2013 and the 2nd of January, 2014, but you are required to submit your report by noon on the 31st of December.

As with Project 1, this is a group project (max 3 persons), but you may also do it alone. The report will be submitted via turnitin, and IBA's zero-tolerance policy towards plagiarism applies. Any two reports found to be similar will result in a straight F for both groups, and further action will be decided by the Examination department.


Project # 1
(Due Date: 12 PM on November 21, 2013)

This is a group project (maximum group size is 3). You need to develop a predictive model for the given data.

The data set concerns credit card fraud prediction and was originally posted for the PAKDD 2013 data mining competition. The data is in CSV format and comes in two files: the training file has 500,000 records and the testing file has 262,966 records.

Download Files

Before applying the classification techniques you have learned in this course, you need to prepare the data. This includes excluding (or transforming) features with an extremely high percentage of missing values, handling the remaining missing values, discretizing certain attributes, reducing the feature set, etc.
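The preparation steps above can be sketched outside KNIME as well. The following is only a minimal illustration in plain Python, not the required workflow: the column names ("amount", "age", "notes"), the 60% missing-value cutoff, and the choice of median imputation and equal-width binning are all assumptions made for the example, and the real PAKDD 2013 columns may differ.

```python
import statistics

MISSING = None  # assumption: missing cells have already been parsed to None


def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(1 for r in rows if r[column] is MISSING) / len(rows)


def drop_sparse_columns(rows, max_missing=0.6):
    """Remove columns whose missing share exceeds `max_missing` (hypothetical cutoff)."""
    keep = [c for c in rows[0] if missing_fraction(rows, c) <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]


def impute_median(rows, column):
    """Fill missing values of a numeric column with the median of the observed values."""
    median = statistics.median(r[column] for r in rows if r[column] is not MISSING)
    return [{**r, column: median if r[column] is MISSING else r[column]} for r in rows]


def discretize_equal_width(rows, column, bins=3):
    """Replace a numeric column with an equal-width bin index in 0..bins-1."""
    values = [r[column] for r in rows]
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # fall back to 1.0 for constant columns
    return [{**r, column: min(int((r[column] - low) / width), bins - 1)} for r in rows]


# Toy records with hypothetical column names.
records = [
    {"amount": 10.0, "age": 25, "notes": None, "Target_Label": 0},
    {"amount": None, "age": 40, "notes": None, "Target_Label": 1},
    {"amount": 30.0, "age": 55, "notes": None, "Target_Label": 0},
    {"amount": 90.0, "age": 70, "notes": "x", "Target_Label": 1},
]

prepared = drop_sparse_columns(records)        # "notes" is 75% missing, so it is dropped
prepared = impute_median(prepared, "amount")   # the missing amount becomes the median
prepared = discretize_equal_width(prepared, "amount")
```

Whether a sparse feature should be dropped or transformed (for example into a "was missing" flag) depends on the feature, so the cutoff here should be treated as a starting point rather than a rule.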

The benchmark should be the F-Measure and ROC values for "1" in the "Target_Label" column.
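For reference, the two benchmark measures can be computed by hand as in the sketch below. It assumes that the "ROC value" means the area under the ROC curve, that the classifier outputs a score for class "1", and that a 0.5 threshold turns scores into predicted labels; the labels and scores are made-up toy values. The pairwise AUC computation is quadratic in the number of records, so it is meant for checking small samples, not the full 500,000-record file.

```python
def f_measure(actual, predicted, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(actual, scores, positive=1):
    """Area under the ROC curve: the share of (positive, negative) pairs in which
    the positive record gets the higher score; ties count as half."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values: true Target_Label and a hypothetical model score for class "1".
actual = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

f1 = f_measure(actual, predicted)
auc = roc_auc(actual, scores)
```

Note that the F-Measure depends on the chosen threshold, while the area under the ROC curve depends only on how the scores rank the records.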

You should perform your analysis mainly in KNIME, but feel free to take advantage of other tools (Weka, Excel, etc.). Note that KNIME supports connecting to a database (such as MySQL, SQL Server, etc.), which you may want to use for certain operations (such as removing or updating columns or rows).
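The kind of database operations mentioned above might look like the following sketch. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL or SQL Server, and the table name, column names, and the cap of 20.0 are hypothetical; the SQL statements themselves are the part that would carry over to a database connected to KNIME.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL / SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training (id INTEGER, amount REAL, notes TEXT, Target_Label INTEGER)"
)
conn.executemany(
    "INSERT INTO training VALUES (?, ?, ?, ?)",
    [(1, 10.0, None, 0), (2, None, None, 1), (3, 30.0, "x", 0)],
)

# Column removal: copy only the wanted columns into a new table.
conn.execute(
    "CREATE TABLE training_clean AS SELECT id, amount, Target_Label FROM training"
)

# Row removal: delete records whose amount is missing.
conn.execute("DELETE FROM training_clean WHERE amount IS NULL")

# Update: cap large amounts at a hypothetical limit of 20.0.
conn.execute("UPDATE training_clean SET amount = 20.0 WHERE amount > 20.0")

columns = [row[1] for row in conn.execute("PRAGMA table_info(training_clean)")]
rows = conn.execute(
    "SELECT id, amount, Target_Label FROM training_clean ORDER BY id"
).fetchall()
conn.close()
```

Copying the wanted columns into a new table leaves the original data untouched, which makes it easy to redo a preparation step if it turns out to be a mistake.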

You need to present your detailed findings in class. While you are working on this project, I encourage you to discuss issues on this wiki's discussion page so that others who have already solved them can respond. Keep in mind, however, that discussion does not mean sharing your work with other groups. The submitted presentations will be checked on turnitin, and IBA's zero-tolerance policy towards plagiarism applies. Any two analyses found to be similar will result in a straight F for both groups, and further action will be decided by the Examination department.



Assignment # 1
(Due Date: 12 PM on October 31, 2013)

This is a group assignment (maximum group size is 3). You need to present a case study, white paper, or research paper describing an application of data mining. You will have 10 minutes to present your findings in class. Since we have not covered unsupervised learning so far, it is suggested that you pick only classification-related applications involving Classification Trees, Naive Bayes, or Neural Networks.