Introducing Machine Learning in Python with Scikit-learn


This workshop delivers the essential code and core theory for building machine learning models in scikit-learn (sklearn). Navigate the machine learning landscape through crucial concepts like overfitting and cross-validation as you employ powerful algorithms such as XGBoost and LightGBM. Supervised machine learning algorithms (regressors and classifiers) will be applied to primarily numerical datasets. Jupyter Notebooks will be used (the Anaconda download is recommended), with code provided on GitHub.

Session Outline:

Module 1: Prepare Data for Machine Learning (pandas)

Preparing data for machine learning requires some data cleaning and data transformation so that the algorithms can run. We use pandas, Python's data analysis library, to load CSV files into DataFrames, handle null values, and convert categorical columns into numerical columns. Framing data in terms of machine learning problems and solutions is also introduced.
Time: ~35 minutes
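The cleaning steps above can be sketched as follows. This is a minimal, hypothetical example (the columns and values are made up, not the workshop's dataset), showing null handling and categorical conversion with pandas:

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for a CSV loaded with pd.read_csv()
# (hypothetical columns, not the workshop's data).
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["NYC", "LA", "NYC", "SF"],
    "income": [50000.0, 64000.0, 58000.0, np.nan],
})

# Handle null values in numerical columns (here: fill with the column median).
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())

# Convert the categorical column into numerical (one-hot) columns.
df = pd.get_dummies(df, columns=["city"])

print(df.columns.tolist())
```

After these two steps the DataFrame contains no nulls and only numerical columns, which is what sklearn's algorithms require.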

Module 2: Build Your First Machine Learning Model (sklearn)

Build your first machine learning model using scikit-learn's user-friendly API. Steps include splitting your data, initializing a model, fitting the model to the training data, scoring the model, and making predictions from it. An emphasis is placed on understanding the machine learning code.
Time: ~35 minutes
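The split/initialize/fit/score/predict steps can be sketched as below. The dataset and model choice here are illustrative assumptions (sklearn's built-in breast cancer dataset and logistic regression), not necessarily what the workshop uses:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 1. Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Initialize a model.
model = LogisticRegression(max_iter=5000)

# 3. Fit the model to the training data.
model.fit(X_train, y_train)

# 4. Score the model on the held-out test set.
score = model.score(X_test, y_test)

# 5. Make predictions from the model.
preds = model.predict(X_test)
print(round(score, 3))
```

Every sklearn estimator follows this same fit/score/predict pattern, which is what makes swapping algorithms in later modules so easy.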

Module 3: Find the Best Model via Cross-Validation (sklearn)

Sklearn has many regressors and classifiers, including Linear Regression, Logistic Regression, Decision Trees, and Random Forests, and sklearn-compatible libraries provide XGBoost and LightGBM. Algorithms are compared via slides and code using cross-validation to counter overfitting and obtain realistic scores. Stratifying data with k-fold cross-validation is also covered.
Time: ~45 minutes
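Comparing algorithms with cross-validation might look like the sketch below (dataset and models are illustrative assumptions). `StratifiedKFold` keeps class proportions consistent across folds, and averaging scores over five held-out folds gives a more realistic estimate than a single train/test split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Stratified k-fold: each fold preserves the overall class balance.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Each model is trained and scored 5 times, each time on a
# different held-out fold; the mean counters overfit scores.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold)
    print(f"{name}: {scores.mean():.3f}")
```

XGBoost and LightGBM models slot into this same loop because both expose sklearn-compatible estimators.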

Module 4: Optimize Models with Hyperparameter Fine-tuning (sklearn)

Optimizing machine learning models requires understanding the ranges of hyperparameters and finding the combinations best suited to your data. Sklearn includes powerful modules for full grid searches and randomized searches over hyperparameter combinations. XGBoost is covered in depth.
Time: ~45 minutes
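A randomized hyperparameter search can be sketched as below. The workshop covers XGBoost in depth; this example uses a Random Forest with illustrative parameter ranges so it stays self-contained within sklearn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameter ranges to sample from (illustrative values).
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5, 8],
    "min_samples_split": [2, 5, 10],
}

# RandomizedSearchCV tries n_iter random combinations, scoring each
# with cross-validation; GridSearchCV would try every combination.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Randomized search is often preferred when the grid is large, since it samples a fixed budget of combinations instead of exhaustively fitting them all.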

Module 5: Finalize Models with feature_importances_ and Pipelines (sklearn)

After models are finalized, valuable information can be retrieved by identifying the most influential attributes (columns) via feature_importances_. Additionally, sklearn provides powerful machine learning pipelines that participants can use to automate their code in the future.
Time: ~35 minutes
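Both ideas can be sketched together (dataset and steps are illustrative assumptions): a Pipeline chains preprocessing and modeling into one estimator, and the fitted tree-based model exposes feature_importances_ for ranking columns:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# A pipeline chains preprocessing and modeling into one estimator,
# so future data can be transformed and predicted in a single call.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])
pipe.fit(X, y)

# feature_importances_ ranks columns by their influence on predictions.
importances = pd.Series(
    pipe.named_steps["model"].feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.head())
```

The importances sum to 1.0, so each value can be read as that column's share of the model's decision-making.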

Background Knowledge:

Python. No experience in machine learning or data analytics is assumed.


A BCA and UC Berkeley graduate instructor, Ishika Prashar has degrees in Data Science and Cognitive Science with an emphasis in Business and Industrial Analytics. Ishika is passionate about the interdisciplinary nature of data science and about using technology to solve relevant problems in education and healthcare. A first-generation woman in STEM, Ishika advocates for women, especially in the field of technology, by increasing representation. Her data science research topics include voting preferences, crime rates, and air quality.

Open Data Science
One Broadway
Cambridge, MA 02142
