Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive


This talk introduces Apache Hivemall, a scalable machine learning library for Apache Hive, Spark and Pig, in the context of real-world large-scale data science.

Most importantly, Hivemall significantly simplifies machine learning workflows, including feature engineering, algorithm implementation and evaluation, because Hive lets us access distributed storage through handy SQL-like queries (HiveQL). Today, data scientists and machine learning engineers commonly struggle with pipelines that are stitched together from numerous tiny code fragments and scale poorly, because such pipelines are difficult to implement. By contrast, once Hivemall is installed, we can run a wealth of machine learning algorithms in a scalable manner by writing just dozens of lines of queries.

Over the course of this session, the speaker talks about:

* Which parts of modern, realistic machine learning and data science are painful
* When Hivemall is notably preferable to other implementations of machine learning algorithms, and why
* Who can benefit from the scalability and simplicity of Hivemall
* What kinds of machine learning techniques are implemented in Hivemall, including classification, regression, anomaly detection, natural language processing and recommendation
* How to install and use Hivemall, and how Hivemall implements a wide variety of machine learning algorithms in a scalable manner

Additionally, this talk provides some tips for using Hivemall more effectively, illustrated with a workflow engine. For example, Digdag, a distributed workflow engine, provides a simple way to run, organize and schedule complex, highly dependent tasks either sequentially or in parallel; that is, the workflow engine makes real-world machine learning pipelines nicely manageable. Since the workflow definition itself is written in the easy-to-use YAML format, engineers can handle the pipelines much as they handle their own source code, in terms of deployment, version control and modularity.


Takuya Kitazawa is a data science engineer at Treasure Data, Inc., a company developing a large-scale enterprise-grade customer data platform, and a committer of Apache Hivemall, a scalable machine learning library for Apache Hive and Spark. He is interested in the theory and practice of real-world data science and engineering, especially recommender systems and scalable machine learning.
