
ECE 592 – Topics in Data Science

Instructor: Dr. Dror Baron, e-mail: barondror AT ncsu DOT edu
Teaching assistant: Vaibhav Choudhary, e-mail: vchoudh2 AT ncsu DOT edu, office hours Thursday 12:00-1:00 PM
Classrooms: Modules will be recorded electronically and posted to a shared directory. Office hours will be held on Mondays and Wednesdays, 11:45 AM - 1:00 PM.

About this Course

Prerequisites

The main prerequisite is eagerness to learn about data science. Technical prerequisites include undergraduate signal processing (ECE 421), probability (ST 371), comfort with mathematics (linear algebra, calculus, multi-dimensional spaces), and comfort with computers (we will be using Matlab and/or Python; see below).

Purpose

ECE 592 (Topics in Data Science) will acquaint students with core topics in data science. Specific topics covered will include scientific programming, basic data structures, computational complexity, optimization, machine learning, wavelets, sparse signal processing, dimensionality reduction, and principal component analysis.

Course Outline

The course will proceed as follows:

  • Introduction.
  • Scientific programming (including data structures and computational complexity).
  • Optimization.
  • Machine learning basics (classification, clustering, and regression).
  • Sparse signal processing (including wavelets).
  • Dimensionality reduction (including principal component analysis).

Course Materials

Textbook

The instructor will borrow from, and be inspired by, several textbooks (listed below); you need not purchase any of them. The slides and assignments will also provide some references to academic papers; these are meant for your enrichment if you find a particular topic of special interest.

  • C. M. Bishop, Pattern Recognition and Machine Learning, 2006.
  • D. MacKay, Information Theory, Inference, and Learning Algorithms, 2003.
  • M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, 2012.
  • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2001.
  • T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, 1990.
  • S. Mallat, A Wavelet Tour of Signal Processing, 1999.

Matlab/Python

We will be using the Matlab and/or Python languages during the course; a possible preference between them will be determined early in the semester. We expect to have some computer homework questions and projects. (Note that many other programming platforms can be, and often are, used for this material.) Here are some resources for these languages:

Slides and Modules

Our course materials are organized as several slide decks, each covering a major topic. In our online course format, the slides are presented in pre-recorded modules, and under each deck below we list and describe the corresponding modules. The actual recordings are organized in the course’s shared directory; each module has a sub-directory, which includes a pdf file with the corresponding part of the slide deck. We also have some supplements, which provide details about some of the more subtle course topics.

  • Projects – this set of slides summarizes material about projects that will appear during the course.
  • Introduction.
    • Module1: Motivation for data science and applications. (Intro slides, pages 17-27.)
    • Module2: Polynomial curve fitting example. (Intro slides, pages 28-38.)
    • Matlab curve fitting example (a Python sketch appears under Software below).
  • Probability and models.
    • Linear_algebra module: a recording from ECE 421 (signal processing) that reviews basic linear algebra concepts.
    • Module3: Probability spaces and Bayes’ rule (Models slides, pages 1-9.)
    • Module4: Random variables, expectation, and variance. (Models slides, pages 10-17.)
    • Module5: Machine learning terminology. (Models slides, pages 18-23.)
    • Supplement about test and training data.
    • Module6: Models and minimum description length. (Models slides, pages 24-33.)
    • Supplement about models.
    • Module7: Model complexity. (Models slides, pages 34-42.)
    • Supplement on two part codes.
    • Supplement further motivating model complexity.
    • Supplement with MDL example.
    • Supplement on norms.
    • Module8: Kolmogorov complexity. (Models slides, pages 43-45.)
  • Scientific programming.
    • Module9: Resource consumption of algorithms. (Scientific programming slides, pages 1-10.)
    • Supplement on two example sorting algorithms.
    • Module10: Orders of growth of resource consumption. (Scientific programming slides, pages 11-16.)
    • Module11: Computational complexity. (Scientific programming slides, pages 17-22.)
    • Supplement on example of computational complexity.
    • Module12: Algorithm selection. (Scientific programming slides, pages 23-29.)
    • Module13: Divide and conquer. (Scientific programming slides, pages 30-33.)
    • mergesort and merge routines developed in class (a Python sketch appears under Software below).
    • Module14: Computational architectures. (Scientific programming slides, pages 34-40.)
    • Module15: Parallel processing. (Scientific programming slides, pages 41-44.)
    • Module16: Stacks, queues, and linked lists. (Scientific programming slides, pages 45-55.)
    • Module17: Graphs. (Scientific programming slides, pages 56-61.)
    • Module18: Trees. (Scientific programming slides, pages 62-69.)
    • Module19: Profiling. (Scientific programming slides, pages 70-76.)
  • Optimization.
    • Module20: Motivation for optimization. (Optimization slides, pages 1-8.)
    • Supplement providing a dynamic programming example (a Python sketch also appears under Software below).
    • Module21: Dynamic programming. (Optimization slides, pages 9-22.)
    • Module22: Linear programming. (Optimization slides, pages 23-27.)
    • Line search example code (a Python sketch appears under Software below).
    • Module23: Convex programming. (Optimization slides, pages 28-36.)
    • Module24: Integer programming. (Optimization slides, pages 37-41.)
    • Module25: Non-convex programming. (Optimization slides, pages 42-52.)
    • Annealing example code (this resembles the MCMC approach discussed for non-convex programming; a Python sketch appears under Software below).
  • Machine learning.
    • Module26: Two classifiers. (Machine learning slides, pages 1-13.)
    • Matlab classification example (a Python sketch appears under Software below).
    • Supplement about the curse of dimensionality.
    • Module27: Decision theory. (Machine learning slides, pages 14-16.)
    • Supplement on least squares.
    • Module28: Decision theory (continued). (Machine learning slides, pages 17-20.)
    • Supplement on K means algorithm.
    • Supplement about loss functions.
    • Module29: Linear regression. (Machine learning slides, pages 21-29.)
    • Module30: Subset selection. (Machine learning slides, pages 30-34.)
    • Supplement about subset selection.
    • Module31: Shrinkage. (Machine learning slides, pages 35-45.)
    • Supplement about shrinkage.
    • Module32: Decision trees. (Machine learning slides, pages 46-49.)
    • Module33: Linear classification. (Machine learning slides, pages 50-55.)
    • Module34: LDA and QDA. (Machine learning slides, pages 56-63.)
    • Supplement containing an example on Bayesian classification.
    • Supplement on Bayesian distributions.
    • Module35: Logistic regression. (Machine learning slides, pages 64-66.)
    • Module36: Basis expansions. (Machine learning slides, pages 67-73.)
    • Module37: Kernel methods. (Machine learning slides, pages 74-77.)
    • Module38: Support vector machines. (Machine learning slides, pages 78-82.)
  • Sparse signal processing.
    • Module39: Sparsity. (Sparse signal processing slides, pages 1-8.)
    • Module40: Bases. (Sparse signal processing slides, pages 9-19.)
    • Supplement on inner product spaces.
    • Supplement on bases and LTI systems.
    • Module41: Frames. (Sparse signal processing slides, pages 20-22.)
    • Module42: Wavelets. (Sparse signal processing slides, pages 23-33.)
    • Module43: Multiresolution approximation. (Sparse signal processing slides, pages 34-43.)
    • Supplement on direct sums.
    • Module44: Compressed sensing. (Sparse signal processing slides, pages 44-51.)
    • Module45: Compressive signal acquisition. (Sparse signal processing slides, pages 52-64.)
    • Module46: Sparse recovery. (Sparse signal processing slides, pages 65-77.)
    • Supplement on LASSO (a Python sketch of iterative soft thresholding appears under Software below).
    • Supplement on machine learning vs. CS.
    • Module47: Optimal sparse recovery. (Sparse signal processing slides, pages 78-82.)
    • Module48: Information theoretic performance limits. (Sparse signal processing slides, pages 83-90.)
    • Supplement on single letter bound for CS.
    • Module49: Precise performance limits. (Sparse signal processing slides, pages 91-96.)
    • Supplement deriving precise performance limit for CS.
    • Module50: Approximate message passing. (Sparse signal processing slides, pages 97-106.)
    • Supplement on AMP implementation; and the AMP and denoise routines developed in class.
    • Supplement on solving Tanaka’s fixed point equation numerically; and Matlab for Tanaka’s equation.
  • Dimensionality reduction.
    • Module51: Dimensionality reduction. (Dimensionality reduction slides, pages 1-12.)
    • Supplement on deriving PCA (a Python sketch appears under Software below).

Software

Below are Matlab and Python implementations for various examples provided during the course, along with brief illustrative Python sketches of several of these examples. Many thanks to Dhananjai Ravindra, Jordan Miller, and Deveshwar Hariharan for translating Matlab scripts to Python!
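
The sketches that follow are minimal illustrations written for this page, not the course’s posted scripts; all function names, variable names, and parameter values are illustrative assumptions. The first sketch mirrors the polynomial curve fitting example from Module2: fitting polynomials of increasing degree to noisy samples of a sinusoid and comparing training and test errors.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    x = np.linspace(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)     # noisy training data

    x_test = np.linspace(0.0, 1.0, 100)
    y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(100)

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x, y, degree)                        # least-squares polynomial fit
        train_rms = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
        test_rms = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
        print(f"degree {degree}: train RMS {train_rms:.3f}, test RMS {test_rms:.3f}")

With 10 training points, the degree-9 fit drives the training error to nearly zero while the test error grows, previewing the model complexity discussion in Modules 6-7.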
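
Next, a sketch of the mergesort and merge routines in the spirit of Module13 (divide and conquer); the function names follow the in-class routines, but this is a generic implementation, not the posted one.

    def merge(left, right):
        """Merge two sorted lists into one sorted list."""
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i])
                i += 1
            else:
                out.append(right[j])
                j += 1
        out.extend(left[i:])    # at most one of these still has items
        out.extend(right[j:])
        return out

    def mergesort(a):
        """Sort a list by recursively sorting both halves and merging them."""
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        return merge(mergesort(a[:mid]), mergesort(a[mid:]))

    print(mergesort([5, 2, 9, 1, 5, 6]))    # [1, 2, 5, 5, 6, 9]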
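
A small dynamic programming sketch to accompany Module21; the 0/1 knapsack problem here is a standard illustration chosen for this page, and is not necessarily the example in the supplement.

    def knapsack(values, weights, capacity):
        """Maximum total value achievable with total weight <= capacity (0/1 knapsack)."""
        # best[w] = best value achievable with weight budget w using items seen so far
        best = [0] * (capacity + 1)
        for value, weight in zip(values, weights):
            # sweep budgets downward so each item is used at most once
            for w in range(capacity, weight - 1, -1):
                best[w] = max(best[w], best[w - weight] + value)
        return best[capacity]

    print(knapsack(values=[60, 100, 120], weights=[1, 2, 3], capacity=5))    # 220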
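
A sketch of a derivative-free line search for the optimization material; golden-section search is one standard choice, and may differ from the method in the posted line search example code.

    import math

    def golden_section(f, a, b, tol=1e-8):
        """Approximately minimize a unimodal function f on the interval [a, b]."""
        inv_phi = (math.sqrt(5.0) - 1.0) / 2.0    # 1/golden ratio, about 0.618
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        while b - a > tol:
            if f(c) < f(d):      # minimum lies in [a, d]
                b, d = d, c
                c = b - inv_phi * (b - a)
            else:                # minimum lies in [c, b]
                a, c = c, d
                d = a + inv_phi * (b - a)
        return (a + b) / 2.0

    print(golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0))    # about 2.0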
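
A minimal simulated annealing sketch for non-convex minimization (Module25); the objective function and cooling schedule below are illustrative assumptions, not those of the posted annealing example.

    import math
    import random

    def anneal(f, x0, steps=20000, step_size=0.5, t0=1.0):
        """Minimize f by a random walk that occasionally accepts uphill moves."""
        random.seed(0)
        x, fx = x0, f(x0)
        best_x, best_f = x, fx
        for k in range(1, steps + 1):
            t = t0 / k                                      # cooling schedule
            x_new = x + random.uniform(-step_size, step_size)
            f_new = f(x_new)
            delta = f_new - fx
            # always accept downhill moves; accept uphill with probability exp(-delta/t)
            if delta < 0 or random.random() < math.exp(-delta / t):
                x, fx = x_new, f_new
                if fx < best_f:
                    best_x, best_f = x, fx
        return best_x, best_f

    # a non-convex objective with many local minima; global minimum at x = 0
    f = lambda x: x * x + 3.0 * math.sin(5.0 * x) ** 2
    print(anneal(f, x0=4.0))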
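
A sketch of a simple classifier on synthetic two-class Gaussian data, in the spirit of the two classifiers of Module26; the k-nearest-neighbor rule here is one of many reasonable choices and need not match the posted Matlab classification example.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    # class 0 centered at (0, 0); class 1 centered at (2, 2)
    X_train = np.vstack([rng.standard_normal((n, 2)),
                         rng.standard_normal((n, 2)) + 2.0])
    y_train = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

    def knn_predict(X, y, X_query, k=5):
        """Label each query point by majority vote among its k nearest training points."""
        preds = []
        for q in X_query:
            dists = np.linalg.norm(X - q, axis=1)
            nearest_labels = y[np.argsort(dists)[:k]]
            preds.append(np.bincount(nearest_labels).argmax())
        return np.array(preds)

    X_test = np.vstack([rng.standard_normal((50, 2)),
                        rng.standard_normal((50, 2)) + 2.0])
    y_test = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
    print("test accuracy:", np.mean(knn_predict(X_train, y_train, X_test) == y_test))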
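
A sketch of sparse recovery via iterative soft thresholding (ISTA) for the LASSO, related to Module46 and the LASSO supplement; the problem sizes and regularization parameter are illustrative.

    import numpy as np

    def soft_threshold(v, t):
        """Elementwise shrinkage: the proximal operator of the L1 norm."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(A, y, lam, iters=500):
        """Minimize 0.5*||y - A x||^2 + lam*||x||_1 by iterative soft thresholding."""
        L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
        return x

    # recover a k-sparse vector from m < n random linear measurements
    rng = np.random.default_rng(2)
    n, m, k = 200, 80, 5
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    x_true = np.zeros(n)
    x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
    y = A @ x_true
    x_hat = ista(A, y, lam=0.01)
    print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))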
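
Finally, a sketch of principal component analysis via the eigendecomposition of the sample covariance matrix, matching the derivation outlined in the dimensionality reduction supplement; the synthetic data below are an illustrative assumption.

    import numpy as np

    def pca(X, num_components):
        """Project the rows of X onto the top principal components."""
        X_centered = X - X.mean(axis=0)
        cov = X_centered.T @ X_centered / (X.shape[0] - 1)    # sample covariance
        eigvals, eigvecs = np.linalg.eigh(cov)                # ascending eigenvalues
        components = eigvecs[:, ::-1][:, :num_components]     # top eigenvectors
        return X_centered @ components, components

    # synthetic data that varies mostly along the direction (1, 2, 3)
    rng = np.random.default_rng(3)
    t = rng.standard_normal(500)
    X = np.outer(t, [1.0, 2.0, 3.0]) + 0.1 * rng.standard_normal((500, 3))
    scores, components = pca(X, 1)
    print("first principal direction:", components[:, 0])    # ~ (1,2,3)/||(1,2,3)||, up to sign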

Assignments and Grading

The following is our grade structure.

Component        % of Grade    Due Date
Homework         15%           Throughout the course
Projects         25%           Throughout the course
Final Project    20%           Due in the last week of the course
Tests            40%           Schedule TBD

Up to 2-3% extra credit will be provided. We encourage students to be proactive about their studies, including class participation, office hours, emails to the instructor, spotting errors, and making suggestions.

Homework

We expect homework assignments roughly every 2-3 weeks; they will be posted below.

  • The first set of practice homeworks is not due, and is only for review purposes. These practice questions use the WebWorK software (accessible from Moodle); please read through our instructions for getting started with your WebWorK account, and it may help to keep in mind some of these points about the software platform. (Note that the software compares your numerical answers to its answers, so you should make sure that your answers are sufficiently accurate.)
  • Homework 1 is due on August 24.
  • Homework 2 is now due on September 4.
  • Homework 3 is now due on September 21.
  • Homework 4 is due on October 7.
  • Homework 5 is due on October 26.
  • Homework 6 is due on November 9.

Projects

Homework style projects: We expect 3-4 “homework style” projects during the semester, plus a final project. Each “homework style” project will be a major assignment that combines deriving some math, producing some software, and possibly examining data. Each such project will involve some application, and we hope that you will come to better appreciate how data science is used in many real-world settings. The final project topic will be up to you, and is an opportunity to gain deeper knowledge in a topic of special interest to you.

Final project: The final project will be on a topic that 2-3 students choose to work on. This could involve reading a paper and presenting it carefully to the class, applying an algorithm that wasn’t covered in depth in class to a data set, or even (hopefully) presenting new results you worked on. A list of possible topics for projects appears here. The presentation will take the form of a brief lecture to the class; its duration will depend on the number of projects, and we envision dedicating two classes to presentations. All students will provide feedback (including a grade) to each other. Overall, the objective of the final project is to give students a personalized learning experience while providing an opportunity to present their findings to the entire class and receive ample feedback. Regulations for individual projects can be found here.

The final project will be graded as follows: (a) 50% for the report, where you can see a rubric used for grading it; (b) 40% for the presentation, of which half is our assessment and half is peer grading; and (c) 10% for the project proposal, where you receive half (5%) automatically if you submit by October 12, and lose an additional 1% per day after that.

Please keep in mind that 25% of the grade will be for “homework style” projects and another 20% for the final project.

Tests

In past years, there was often a midterm exam. The online nature of this year’s course will lead to a greater number of tests, each of which will be shorter. Below are past tests (and their solutions) from throughout the history of this course.

Feedback

Students are encouraged to send feedback to the instructional staff.