Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Software Engineers

Win Vector LLC is now offering a 4-day on-site intensive data science course. The course targets software engineers familiar with Python and introduces them to the basics of current data science practice. It is designed as an interactive, in-person course (not remote or video).

The course includes lectures, hands-on labs, and optional homework exercises. Students are expected to attend all 4 full days, and will come out with a basic understanding of some of the most important tools for supervised learning in data science. We share all course materials, which students can use as starting points for their own projects. Students are expected to bring a laptop with a Python 3 version of Anaconda (https://www.anaconda.com/download/) installed so they can work along with the lectures and complete the labs and exercises. The topics are designed for an engineering audience: people comfortable with Python who are interested in learning the basics of data manipulation and getting started with supervised machine learning.

Topics covered:

  • Day 1:
    Starting with data and statistics

    • Starting with Python and JupyterLab.
    • Working with data using Pandas.
    • Basic concepts of probability and statistics.
    • Predicting quantities (regression metrics and methods).
    • Overfit and test/train split.
  • Day 2:
    Starting with machine learning

    • Advanced regression methods: polynomial regression and generalized additive models.
    • Predicting classes (supervised classification) using logistic regression.
    • Fixing modeling problems using regularization.
    • Advanced classification metrics.
    • Basic feature engineering (missing values, categorical variables, sessionization).
  • Day 3:
    Advanced machine learning

    • Tree-based methods
      • Decision trees
      • Random forest methods
      • Gradient boosted trees
    • Dimension reduction
      • Variable screening
      • Principal components reduction
  • Day 4:
    Unsupervised methods and advanced topics

    • Clustering
    • Principal components lab
    • Introduction to Neural Nets and Deep Learning
    • Embeddings
    • Keras
    • Model explainability (LIME)

The course is designed in terms of “daily takeaways” or “daily victories”. The goals by day are:

  • Day 1: An understanding of sampling error, modeling numeric values (regression), and evaluating models (metrics).
  • Day 2: An understanding of classification and ranking problems and how to solve them. Students are encouraged to use logistic regression as their go-to tool for these problems.
  • Day 3: Current machine learning methods for complex relations: tree ensemble methods (Random Forest and Gradient Boosted Trees). These are powerful methods that routinely win machine learning and data science contests.
  • Day 4: Deep learning. The students will work with the Keras package and use Neural Net methods to encode complex data such as images and text (basic natural language processing). We will end on model explainability, which attempts to link the powerful “black box” methods of days 3 and 4 with the more inspectable methods of days 1 and 2.

The software used in the course is all open-source. Primarily the course will use:

  • Python 3: the primary programming language for the course.
  • The Anaconda distribution: the main software distribution and package manager for the course.
  • JupyterLab: the current interactive notebook system for Python and data science.
  • Pandas: a state-of-the-art data manipulation system for Python.
  • scikit-learn: a state-of-the-art machine learning system for Python.
  • seaborn: a best-of-breed plotting and visualization package.
  • Keras: a powerful deep learning package.
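
Attendees who want to confirm their laptop is ready before the first day could run a check along the following lines. This is only a minimal sketch, not part of the official course materials: the helper names `check_environment` and `python_is_supported` are hypothetical, and the import names are assumptions about a typical Anaconda setup (`sklearn` is the import name of scikit-learn, and Keras may be installed only as part of TensorFlow, in which case the `keras` check can report it as missing).

```python
import importlib.util
import sys

# Assumed import names for the packages listed above. "sklearn" is the
# import name of scikit-learn; Keras may instead be available only as
# tensorflow.keras, depending on how it was installed.
COURSE_PACKAGES = ["pandas", "sklearn", "seaborn", "keras", "jupyterlab"]


def check_environment(packages=COURSE_PACKAGES):
    """Return a dict mapping each package name to True if it can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }


def python_is_supported(version_info=sys.version_info):
    """The course uses Python 3, so only the major version is checked."""
    return version_info[0] == 3


if __name__ == "__main__":
    print("Python 3:", python_is_supported())
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

The check only looks packages up without importing them, so it runs quickly even when a large package such as a deep learning library is installed.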

All software engineers will eventually work with data, or work with people who work with data. This course will help prepare your software engineers to quickly become effective in this reality.

Note: this course is currently taught on demand to groups of about 15 to 30 people at a time. We do not currently have a public version that can take individual participants.

Win Vector LLC is a data science consulting and training organization. The senior partners are authors of numerous data science packages, R and Python training courses, and a best-selling data science book. Please enquire for course rates and scheduling. The above Python course has been presented to great success at large companies; it can be adapted to R on request.