Data Science and Machine Learning in the Cloud for Cloud Novices


In this half-day hands-on training, we will use free-tier resources in the Google Cloud Platform (GCP) to introduce learners to the practical use of cloud computing resources in data science and machine learning. Learners should have some experience with data analytics, data science, or machine learning. Learners should also have a Gmail account with no prior GCP use associated with it, or be willing to create such an account. While fluency in R or Python will be very helpful, it is not strictly required, as well-annotated scripts will be provided. No previous exposure to or use of cloud computing is required; the training is introductory in its cloud computing content. This training will be useful for those considering cloud adoption, interested in data engineering, or interested in working with public data as citizen scientists.

Topics covered will include:

• Cloud computing concepts and vocabulary
• Cloud providers
• Free tier and cost considerations
• Public datasets and citizen science
• Redundancy, security, and privacy
• Continuum of management levels
• Cloud data storage and analytics
• Machine learning in the cloud

While this is not principally a course in how to conduct data analysis, useful scripts, queries, and visualizations will be provided as scaffolding. During the half-day hands-on training, learners will:

• Create a new GCP account and explore documentation and tutorials offered
• Explore public datasets hosted on GCP’s BigQuery service
• Use SQL to do data analysis on a public dataset
• Create a Jupyter notebook on a free-tier compute environment and use Python to analyze data
• Create an RStudio Community server environment on a free-tier compute environment and use R to analyze data
• Create a machine learning predictive model on public data
• (If time permits: Dashboard creation in the cloud)
• (If time permits: Special considerations for sensitive data)
• (If time permits: Advanced data pipelines)

• Google Cloud Platform, particularly BigQuery
• Google Colaboratory


Joy Payton is a cloud engineer, data scientist, and adjunct professor who specializes in helping biomedical professionals conduct reproducible computational research. In addition to moving medicine forward through principles of open science and reproducibility, Joy also enjoys teaching citizen scientists how to use public data repositories to understand their own communities better and advocate for change from a data-centric perspective. Her various roles allow Joy to lead efforts to teach people how to write their first line of code and help anyone who's interested climb the data science learning curve. Currently employed by the Children's Hospital of Philadelphia and Yeshiva University, Joy is always open to hearing about open-source, data-centric volunteer opportunities for herself and her students.