Introducing Machine Learning in Python with Scikit-learn


This workshop delivers the essential code and core theory needed to build machine learning models in scikit-learn (sklearn). Navigate the machine learning landscape through crucial concepts like overfitting and cross-validation as you employ powerful algorithms such as XGBoost and LightGBM. Supervised machine learning algorithms (regressors and classifiers) will be applied to primarily numerical datasets. Jupyter Notebooks will be used (Anaconda download recommended), with code provided on GitHub.

Session Outline:

Module 1: Prepare Data for Machine Learning (pandas)

Preparing data for machine learning requires cleaning and transforming the data so that algorithms can process it. We use pandas, Python’s data analysis library, to load CSV files into DataFrames, handle null values, and convert categorical columns into numerical columns. Framing data in terms of machine learning problems and solutions is also introduced.
Time: ~35 minutes
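The preparation steps above might look like the following sketch. The DataFrame here is a small hypothetical stand-in; in the workshop, data would come from a CSV file via pd.read_csv.

```python
import pandas as pd

# Hypothetical data for illustration; in practice you would load a
# CSV file with df = pd.read_csv("your_file.csv").
df = pd.DataFrame({
    "age": [25, None, 47, 33],
    "city": ["Austin", "Boston", None, "Austin"],
    "price": [120.0, 95.5, 210.0, 150.0],
})

# Clear null values by dropping incomplete rows
# (imputing values is a common alternative).
df = df.dropna()

# Convert the categorical column into numerical columns
# via one-hot encoding.
df = pd.get_dummies(df, columns=["city"])
```

After these steps, every column is numerical and free of nulls, which is what most sklearn estimators expect.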

Module 2: Build Your First Machine Learning Model (sklearn)

Build your first machine learning model using scikit-learn’s user-friendly API. Steps include splitting your data, initializing a model, fitting the model to the training data, scoring the model, and making predictions. An emphasis is placed on understanding the machine learning code itself.
Time: ~35 minutes
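The steps listed above can be sketched end to end as follows. The built-in iris dataset and LogisticRegression are illustrative choices; any dataset and sklearn estimator follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# A built-in dataset stands in for your own data.
X, y = load_iris(return_X_y=True)

# 1. Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2. Initialize a model.
model = LogisticRegression(max_iter=1000)

# 3. Fit the model to the training data.
model.fit(X_train, y_train)

# 4. Score the model on the held-out test set.
score = model.score(X_test, y_test)

# 5. Make predictions from the model.
preds = model.predict(X_test)
```

Because nearly every sklearn estimator shares this fit/score/predict API, swapping in a different algorithm usually changes only the initialization line.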

Module 3: Find the Best Model via Cross-Validation (sklearn)

Scikit-learn includes many regressors and classifiers, such as Linear Regression, Logistic Regression, Decision Trees, and Random Forests, while XGBoost and LightGBM provide scikit-learn-compatible estimators. Algorithms are compared via slides and code using cross-validation to counter overfitting and obtain realistic scores. Stratifying data with k-fold cross-validation is included.
Time: ~45 minutes
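Comparing algorithms with stratified k-fold cross-validation might look like the sketch below. Two sklearn classifiers are used for illustration; XGBoost and LightGBM plug into the same loop via their sklearn-compatible wrappers (XGBClassifier, LGBMClassifier), which are separate installs.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# A built-in classification dataset stands in for your own data.
X, y = load_breast_cancer(return_X_y=True)

# StratifiedKFold preserves class proportions in every fold,
# which matters for imbalanced targets.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for model in (LogisticRegression(max_iter=5000),
              RandomForestClassifier(random_state=42)):
    # cross_val_score fits and scores the model once per fold,
    # giving a more realistic estimate than a single train/test split.
    scores = cross_val_score(model, X, y, cv=kf)
    print(type(model).__name__, scores.mean().round(3))
```

Averaging the per-fold scores counters overfitting to any one lucky split.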

Module 4: Optimize Models with Hyperparameter Fine-tuning (sklearn)

Optimizing machine learning models requires understanding the ranges of hyperparameters and finding the combinations best suited to your data. Sklearn includes powerful modules for full grid searches and randomized searches over the hyperparameter space. XGBoost is covered in depth.
Time: ~45 minutes
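A grid search over hyperparameters might be sketched as below. A RandomForestClassifier with a deliberately tiny grid stands in here, since XGBoost is a separate install; the same GridSearchCV pattern applies to XGBoost's hyperparameters, and RandomizedSearchCV follows the same shape with an added n_iter argument.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# A small illustrative grid; real searches cover wider
# hyperparameter ranges.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# GridSearchCV cross-validates every combination in the grid
# and keeps the best one.
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=3)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```

After fitting, grid.best_estimator_ is a ready-to-use model refit on all the data with the winning combination.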

Module 5: Finalize Models with feature_importances_ and Pipelines (sklearn)

After finalizing models, valuable information may be retrieved by discovering the most influential attributes (columns) via feature_importances_. Additionally, sklearn provides machine learning pipelines that participants can use to automate their preprocessing and modeling code in the future.
Time: ~35 minutes
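Both ideas above can be combined in one sketch: a Pipeline chains preprocessing and modeling into a single estimator, and the fitted model's feature_importances_ ranks the most influential columns. The scaler and RandomForest are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# A pipeline automates the preprocessing + modeling sequence:
# calling fit runs every step in order.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])
pipe.fit(X, y)

# Rank columns by influence via the fitted model's feature_importances_.
importances = pipe.named_steps["model"].feature_importances_
top = sorted(zip(data.feature_names, importances),
             key=lambda pair: pair[1], reverse=True)[:3]
print(top)
```

Once built, the same pipeline object can be reused on new data, which is what makes pipelines valuable for automating future work.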

Background Knowledge:

Basic Python. No experience in machine learning or data analytics is assumed.


Corey Wade, MS Mathematics, MFA Writing & Consciousness, is the director and founder of Berkeley Coding Academy, an online program with live classes where teenagers learn Python Programming, Data Analytics, and Machine Learning. Author of Hands-on Gradient Boosting with XGBoost and scikit-learn, and lead author of The Python Workshop, Corey also teaches Math, Programming, and Data Science at Berkeley Independent Study. Corey has published iPhone apps with students, designed classes to build websites, and run after-school coding programs to support girls and underserved students. A Springboard Data Science graduate and multiple grant award-winner, Corey has also worked in industry developing Data Science curricula for Pathstream and Hello World while contributing articles for Towards Data Science. When not coding or teaching, Corey reads poetry and studies the stars.

Open Data Science
One Broadway
Cambridge, MA 02142
