Abstract: Organizations have long used relational databases for a wide variety of data-intensive applications. Data scientists and analysts need to understand how to work with relational databases, particularly for common data science tasks such as finding, exploring, analyzing, and extracting data within a relational database. SQL is an expressive declarative language designed for working with tabular data. Since its inception, the language has gained features that allow for complex expressions and computation, such as regular expressions, sliding window functions, and analytical operations that create non-relational results. Understanding these advanced features of SQL will help data scientists reduce the need to program custom data manipulation functionality.
The tutorial begins with a brief overview of SQL and then delves into the five major topics a data scientist should understand when working with relational databases: basic statistics in SQL, data preparation in SQL, advanced filtering and data aggregation, window functions, and using relational databases with analytics tools such as Jupyter Notebooks and RStudio. The first segment of the tutorial will review descriptive statistical functions, filtering, and basic aggregation. Next, we will turn our attention to data preprocessing and review extracting and reformatting strings, reformatting numeric data, and filtering with regular expressions. The third segment examines how to use subqueries, joins, rollups, and cubes, as well as tips for finding Top N results. The fourth segment of the tutorial will show how to work with streaming and other windowed data, including specifying partitions, computing rank, lag, and lead, and creating buckets with the NTILE function. In the final segment, we will demonstrate how to query relational databases using SQL with Python and R.
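As a rough illustration of the window functions and the Python-based querying mentioned above, the following minimal sketch runs RANK, LAG, and NTILE over a partitioned table. It is not taken from the tutorial materials: the `sales` table, its columns, the sample rows, and the `run_window_query` helper are all hypothetical, and it assumes Python's standard `sqlite3` module linked against SQLite 3.25 or later (the first release with window function support).

```python
import sqlite3

# Hypothetical sample data: (region, month, amount). Not from the tutorial.
ROWS = [
    ("east", 1, 100.0),
    ("east", 2, 150.0),
    ("east", 3, 120.0),
    ("west", 1, 200.0),
    ("west", 2, 180.0),
    ("west", 3, 260.0),
]

# RANK and LAG are computed within each region (PARTITION BY);
# NTILE(3) splits all rows into three buckets ordered by amount.
QUERY = """
SELECT region,
       month,
       amount,
       RANK()      OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
       LAG(amount) OVER (PARTITION BY region ORDER BY month)       AS prev_amount,
       NTILE(3)    OVER (ORDER BY amount)                          AS bucket
FROM sales
ORDER BY region, month
"""


def run_window_query():
    """Load the sample rows into an in-memory database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales (region TEXT, month INTEGER, amount REAL)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ROWS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_window_query():
        print(row)
```

Each returned tuple holds the original columns followed by the rank within the region, the previous month's amount (`None` for the first month of a region), and the bucket number. The same query text could be sent to another relational database through its own Python driver, provided that database supports these window functions.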
Bio: Dan Sullivan is a Principal Engineer and Architect with New Relic, where he focuses on cloud architecture, data science, machine learning, and data architecture. He is the author of six books, most recently on cloud and NoSQL databases, as well as several online courses on machine learning, data science, and cloud computing that have had over 1 million views. Dan holds a doctorate in genetics, bioinformatics, and computational biology. His peer-reviewed research has been published in PLoS One, Nucleic Acids Research, Infection and Immunity, and the Journal of Proteomics and Genomics Research.