Sparse vs. Dense data: Analysis of several supervised learning algorithms

k-NN, SVM, Decision Trees, Boosted Trees, and Neural Networks

This was a project for a graduate CS course. Source code can be shared privately with potential employers upon request.

How do k-NN, SVM, Neural Networks, Decision Trees, and Boosted Decision Trees perform on a small, dense dataset compared to a large, sparse dataset?

I completed this analysis as part of an assignment for Georgia Tech's CS 7641 Machine Learning course. No templates were provided: I found the datasets, wrote the code (leveraging libraries like scikit-learn and seaborn), generated the visualizations, and completed the analysis.

Datasets

  • The first dataset was a time series of muscle activation measurements taken on the forearm during the completion of a gesture. The multi-class supervised learning problem was to predict the gesture from the muscle activations.
  • The second dataset was a subset of the 20-newsgroups dataset. The task was to predict which newsgroup a post belongs to based on its content. I chose three similar newsgroups and then applied a TF-IDF transform to create the training data (a preprocessing sketch follows this list).
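
The following is a minimal sketch of that preprocessing step, not the original project code; the three category names are assumptions chosen for illustration.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical choice of three similar newsgroups (the actual
# categories used in the project are not recorded here)
categories = ["comp.graphics", "comp.windows.x", "comp.os.ms-windows.misc"]
train = fetch_20newsgroups(subset="train", categories=categories,
                           remove=("headers", "footers", "quotes"))

# TF-IDF produces a sparse (CSR) document-term matrix
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train.data)  # scipy.sparse CSR matrix
y_train = train.target
print(X_train.shape, X_train.nnz)  # (documents, vocabulary), nonzero count
```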

Concepts applied

  • Preprocessing, cross-validation, hyperparameter tuning, and data visualization (a tuning sketch follows this list)
  • k-NN, SVM, Neural Networks, Decision Trees, and Boosted Decision Trees (AdaBoost)
  • Time-series analysis, Natural Language Processing (NLP), incremental learning
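
As a rough illustration of the tuning workflow (not the original code), the sketch below grid-searches SVM hyperparameters with 5-fold cross-validation; the parameter grid is an assumption.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical search space; the project's actual grids differed per model
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],  # ignored by the linear kernel
}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
# search.fit(X_train, y_train)  # e.g., the TF-IDF features from above
# print(search.best_params_, search.best_score_)
```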

Summary of results

  • Neural Networks and the radial-kernel (RBF) SVM performed best on the gesture dataset, suggesting a non-linear decision boundary.
  • The newsgroup data's sparse representation (CSR) kept memory usage and training times low (see the sketch after this list).
  • Additional feature engineering, in particular the use of word embeddings, would likely improve performance on the newsgroup data.
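
For intuition on the CSR point, here is a small, self-contained illustration with hypothetical dimensions (the actual matrix size in the project differed):

```python
from scipy.sparse import random as sparse_random

# 2,000 documents x 30,000 terms at 0.1% density, stored as CSR
X = sparse_random(2000, 30000, density=0.001, format="csr")

dense_bytes = X.shape[0] * X.shape[1] * 8  # equivalent float64 dense array
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"dense: {dense_bytes / 1e6:.0f} MB, CSR: {sparse_bytes / 1e6:.2f} MB")
```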

Other Work

Reinforcement Learning in the Stock Market

Back-testing learned strategies vs. manual strategies in the stock market.