Thomas J. Fan
Senior Machine Learning Engineer at Union.ai
Thomas J. Fan is a Senior Machine Learning Engineer at Union.ai and a maintainer for scikit-learn, an open-source machine learning library for Python. He led the development of scikit-learn's set_output API, which allows transformers to return pandas DataFrames. Previously, Thomas worked at Columbia University to improve interoperability between scikit-learn and AutoML systems. He also maintains skorch, a neural network library that wraps PyTorch. Thomas has a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University.
All Sessions by Thomas J. Fan
Introduction to scikit-learn: Machine Learning in Python
Machine Learning | Beginner
scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. We start this training by learning about scikit-learn's API for supervised machine learning, which mainly consists of three methods: fit to build models, predict to make predictions from models, and transform to modify data. This consistent and straightforward interface abstracts away the algorithm, allowing us to focus on our domain-specific problems. Next, we learn why splitting data into train and test sets matters for model evaluation. Then, we explore preprocessing techniques for numerical, categorical, and missing data, and see how preprocessing affects different machine learning models. For example, linear and distance-based models generally need standardized features, but tree-based models do not. We explore how to use the pandas output API, which allows scikit-learn's transformers to output pandas DataFrames and lets us connect feature names with the state of a machine learning model. We then learn about the Pipeline, which connects transformers with a classifier or regressor to build a data flow where the output of one step is the input of the next. Lastly, we look at scikit-learn's Histogram-based Gradient Boosting model, which can natively handle numerical and categorical data with missing values. After this training, you will have the foundations to apply scikit-learn to your machine learning problems.