Visual Elements of Data Science

Abstract: "Above all, show the data" (Edward Tufte)

Data visualization is fundamental not only to data exploration, but to addressing data science problems in general. It is a key technique in the descriptive statistics (e.g., boxplots, histograms, distribution charts, heatmaps), diagnostics (e.g., scatterplots, Geiger counter charts, digital elevation models) and predictive (e.g., decision trees, artificial neural networks) layers of the data science stack. For example, visualization is a means to understand relationships between variables, to recognize patterns, to detect outliers and to break down complexity. Effective ways to describe and summarize data sets are also very helpful in communicating with clients and collaborators in a more quantitative and rational way. Implementing and using data visualizations is therefore a key skill that every data scientist must have in their repertoire.
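As a small illustration of how a descriptive chart supports outlier detection, the sketch below computes the numbers a Tukey-style boxplot would draw. It is not taken from the session materials: the function name `boxplot_summary`, the 1.5 × IQR fence rule and the sample values are assumptions chosen for illustration, and plotting libraries may compute quartiles slightly differently.

```python
import statistics


def boxplot_summary(values):
    """Return the quantities a Tukey-style boxplot displays (hypothetical helper)."""
    data = sorted(values)
    # Quartiles with the "inclusive" method: the data is treated as the full population.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inliers = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "whisker_low": inliers[0],    # smallest value inside the fences
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inliers[-1],  # largest value inside the fences
        "outliers": outliers,
    }


# Made-up sample values, for illustration only.
sample = [12, 14, 14, 15, 16, 17, 18, 19, 21, 48]
summary = boxplot_summary(sample)
print(summary)
```

In this made-up sample, the value 48 falls above the upper fence and would be drawn as a separate point beyond the whisker, which is what makes the outlier visible at a glance.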

While enterprises and businesses across industries now widely use dashboards and other (often commercial) business intelligence software to generate data visualizations, data scientists usually still rely heavily on creating charts from scratch in scripting languages and other open source coding environments. This is because they need not only to explore raw data and data aggregates, but also to review model outputs visually and to prepare charts for presentations and publications. The most widely used tools currently include ggplot2, plotly and Shiny (R), as well as matplotlib, Seaborn and Bokeh (Python).

This session reviews key elements of the effective use of data visualizations in data science industry applications. These include (1) a narrative, that is, a story to tell about the data, (2) simplicity, and (3) conciseness, achieved by balancing information against complexity and avoiding excessive decoration (the aesthetics concept). It also addresses how to choose the right chart for a given data set, depending on the context and the question being asked. What are some simple rules to follow for a good graphic, and which common errors need to be avoided? How do you know if your graph accurately represents the underlying data set? This is particularly important for high-dimensional data sets and growing data volumes in the age of Big Data.

In this workshop, state-of-the-art scripts and packages in R and Python will be used to demonstrate how to plot heatmaps, time series charts and network graphs, as well as representations and maps for geospatial data sets.

Bio: Olaf Menzer is a Data Scientist in the Decision Analytics team at Pacific Life in Newport Beach, California. His work focuses on enabling business process improvements and generating insights through data synthesis and, more broadly, the application of advanced analytics and technology. He is also a Visiting Researcher at the University of California, Santa Barbara, contributing to primary research articles and statistical applications in Ecosystem Science.

Prior to working at Pacific Life, Olaf was a Predictive Analyst at Ingram Micro, designing, implementing and testing sales forecasting models, lead generation engines and product recommendation algorithms for cross-selling millions of technology products. He also held several Research Assistant roles at the Lawrence Berkeley National Lab and the Max Planck Institute in Germany, where he supported scientific computing, data analysis and machine learning applications.

Olaf was a speaker at the INFORMS Business Analytics conference in 2016, at Predictive Analytics World in 2018 and at several academic conferences. He received an M.Sc. in Bioinformatics from Friedrich Schiller University in Germany (2011) and a Ph.D. in Geographic Information Science from the University of California, Santa Barbara (2015).

Open Data Science Conference