Under The Hood: Creating Your Own Spark Datasources

Abstract: Apache Spark has become the tool of choice for data engineers and data scientists for data discovery, data munging and pipelining, general ETL, and many other kinds of scalable distributed data processing tasks. One of the key contributing factors to this success is a consistent, easy-to-use distributed framework supporting Scala, Java, Python and R, along with the ability to connect Spark to a variety of data sources. Connectors come bundled with Spark and are available as part of the wider ecosystem from other projects and vendors. However, there is often a need to integrate Spark with a source, destination or system for which no connector exists.

This workshop addresses that problem by providing an under-the-hood understanding of static and structured streaming sources in Spark, built up by implementing a rudimentary data source. With the knowledge gained, attendees will also be able to understand, appreciate and optimally use existing data connectors and integrations, such as those for Kafka and Hadoop.
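To give a taste of what building a data source involves, here is a minimal sketch of a read-only source using Spark's stable `org.apache.spark.sql.sources` API (the names `DefaultSource`, `NumberRelation` and the generated schema are illustrative assumptions, not part of the workshop material):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical minimal read-only data source that produces the integers 0..9.
// Spark discovers it via a class named DefaultSource in the package passed
// to spark.read.format(...).
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new NumberRelation(sqlContext)
}

class NumberRelation(val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // Schema Spark reports to the query planner: a single int column "n".
  override def schema: StructType =
    StructType(StructField("n", IntegerType, nullable = false) :: Nil)

  // Full-table scan: parallelize the numbers and wrap each in a Row.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until 10).map(Row(_))
}
```

With the classes above on the classpath, the source would be loaded as `spark.read.format("com.example.numbers").load()`, assuming `DefaultSource` lives in the hypothetical package `com.example.numbers`. The workshop goes further than this sketch, covering structured streaming sources as well.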

Bio: Jayesh Thakrar is a senior software engineer at Conversant, where he designed and built systems covering Hadoop, HBase, Cassandra, Flume, Kafka, Hive and OpenTSDB. For the past year he has been working on Spark application development using the big data systems he created. Jayesh is an avid learner, passionate about big data, and often speaks at meetups and conferences, sharing his experiences with Apache and other open source projects.