
Abstract: This workshop delivers the essential code and core theory needed to build machine learning models in scikit-learn (sklearn). Navigate the machine learning landscape through crucial concepts like overfitting and cross-validation while employing powerful algorithms like XGBoost and LightGBM. Supervised machine learning algorithms (regressors and classifiers) are applied to datasets that are primarily numerical. Jupyter Notebooks are used throughout (Anaconda download recommended), with all code provided on GitHub.
Session Outline:
Module 1: Prepare Data for Machine Learning (pandas)
Preparing data for machine learning requires some data cleaning and transformation before the algorithms can work. We use pandas, Python's data analysis library, to transform CSV files into DataFrames, to handle null values, and to convert categorical columns into numerical columns. Envisioning data in terms of machine learning problems and solutions is also introduced.
Time ~ 35 minutes
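The cleaning steps above can be sketched as follows. This is a minimal illustration using a small hand-built DataFrame standing in for a loaded CSV; the column names and fill strategies (median for numeric, mode for categorical) are example choices, not the only options covered.

```python
import numpy as np
import pandas as pd

# A small example DataFrame standing in for a CSV loaded via pd.read_csv()
df = pd.DataFrame({
    'age': [25, 32, np.nan, 41],
    'city': ['Berkeley', 'Oakland', 'Berkeley', None],
    'income': [50000, 64000, 58000, 72000],
})

# Fill null values: median for numeric columns, mode for categorical columns
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Convert the categorical column into numerical (one-hot) columns
df = pd.get_dummies(df, columns=['city'])
print(df.head())
```

After these steps the DataFrame contains no nulls and only numeric columns, ready to pass to a scikit-learn estimator.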
Module 2: Build Your First Machine Learning Model (sklearn)
Build your first machine learning model using scikit-learn’s user-friendly API. Steps include splitting your data, initializing a model, fitting your model to the data, scoring your model, and making predictions from the model. An emphasis is placed on understanding machine learning code.
Time ~ 35 minutes
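The five steps above follow scikit-learn's standard API. As a sketch, here they are with a built-in dataset and Linear Regression; the workshop's actual dataset and model may differ.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load a built-in numerical dataset
X, y = load_diabetes(return_X_y=True)

# 1. Split your data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Initialize a model
model = LinearRegression()

# 3. Fit the model to the training data
model.fit(X_train, y_train)

# 4. Score the model on the held-out test set (R^2 for regressors)
score = model.score(X_test, y_test)
print(score)

# 5. Make predictions from the model
preds = model.predict(X_test[:5])
```

Every scikit-learn estimator exposes the same fit/score/predict interface, which is why swapping models later requires changing only one line.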
Module 3: Find the Best Model via Cross-Validation (sklearn)
Scikit-learn offers many regressors and classifiers, including Linear Regression, Logistic Regression, Decision Trees, and Random Forests, while XGBoost and LightGBM plug in through scikit-learn-compatible wrappers. Algorithms are compared via slides and code using cross-validation to counter overfitting and obtain realistic scores. Stratifying data with k-fold cross-validation (StratifiedKFold) is included.
Time ~ 45 minutes
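A model comparison along these lines can be sketched as below. Two built-in sklearn classifiers are shown for self-containment; XGBoost's XGBClassifier and LightGBM's LGBMClassifier slot into the same loop via their scikit-learn wrappers.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Stratified k-fold keeps class proportions consistent across folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compare models on the same folds for a fair, overfitting-resistant estimate
for model in (LogisticRegression(max_iter=5000),
              RandomForestClassifier(random_state=42)):
    scores = cross_val_score(model, X, y, cv=skf)
    print(type(model).__name__, scores.mean().round(3))
```

Because every fold serves once as a validation set, the mean score is far more realistic than a single train/test split.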
Module 4: Optimize Models with Hyperparameter Fine-tuning (sklearn)
Optimizing machine learning models requires understanding the ranges of hyperparameters and finding the combinations best suited to your data. Scikit-learn includes powerful modules for full grid searches (GridSearchCV) and randomized searches (RandomizedSearchCV) to uncover strong combinations and hidden patterns within the data. XGBoost is covered in depth.
Time ~ 45 minutes
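A grid search over hyperparameter ranges can be sketched as follows. A Random Forest is used here so the example runs with sklearn alone; the identical pattern applies to XGBoost's XGBRegressor (tuning e.g. max_depth, learning_rate, n_estimators), and the parameter values shown are illustrative, not recommended settings.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# A small grid of hyperparameter ranges to search exhaustively
param_grid = {
    'max_depth': [3, 6, None],
    'n_estimators': [50, 100],
}

# GridSearchCV cross-validates every combination and keeps the best
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

For large grids, RandomizedSearchCV takes the same parameter dictionary but samples a fixed number of combinations, trading exhaustiveness for speed.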
Module 5: Finalize Models with feature_importances_ and Pipelines (sklearn)
After finalizing models, valuable information may be retrieved by identifying the most influential attributes (columns) via feature_importances_. Additionally, sklearn provides powerful machine learning pipelines that participants may use to automate preprocessing and modeling steps in the future.
Time ~ 35 minutes
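Both ideas can be sketched together. The dataset and the scaler-plus-classifier pipeline below are illustrative; feature_importances_ is available on tree-based estimators such as Random Forests and XGBoost.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Rank the most influential columns after fitting a tree ensemble
model = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(model.feature_importances_,
                        index=data.feature_names)
print(importances.sort_values(ascending=False).head())

# A pipeline chains preprocessing and modeling into one estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```

Once built, the pipeline behaves like any single estimator: it can be cross-validated, grid-searched, and reused on new data with one fit/predict call.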
Background Knowledge:
Working knowledge of Python. No experience in machine learning or data analytics is assumed.
Bio: Corey Wade, MS Mathematics, MFA Writing & Consciousness, is the director and founder of Berkeley Coding Academy, an online program with live classes where teenagers learn Python Programming, Data Analytics, and Machine Learning. Author of Hands-on Gradient Boosting with XGBoost and scikit-learn, and lead author of The Python Workshop, Corey also teaches Math, Programming, and Data Science at Berkeley Independent Study. Corey has published iPhone apps with students, designed classes to build websites, and run after-school coding programs to support girls and underserved students. A Springboard Data Science graduate and multiple grant award-winner, Corey has also worked in industry developing Data Science curricula for Pathstream and Hello World while contributing articles for Towards Data Science. When not coding or teaching, Corey reads poetry and studies the stars.

Corey Wade
Founder, Director | Author | Berkeley Coding Academy
