Hyunsik Choi is Director of Research at Gruter Inc., a big data platform startup located in Palo Alto, CA. He is the founder of the Apache Tajo project and a member of the ASF.
Abstract:
Tajo is an advanced open source data warehouse system on Hadoop. Tajo has evolved rapidly over the past couple of years. In this talk, I’ll give an overview of Tajo and present how it has improved over that time. In particular, this talk will introduce the new features of the recent major release, Tajo 0.10: HBase storage support, a thin JDBC driver, direct JSON support, and Amazon EMR support. Then, I will present the upcoming features the Tajo community is currently working on: multi-tenant scheduler (take 1), allowing multiple users to submit multiple queries to one cluster; nested schema support, allowing users to handle complex data types directly without flattening; and more advanced SQL features such as window frames and subqueries.
Takeaway for audience:
Tajo provides ANSI SQL with scalable, low-latency SQL processing on various data sources such as HDFS, S3, OpenStack Swift, RDBMSs, and HBase. Users will learn that Tajo is a strong choice for a unified SQL-on-Hadoop system handling batch as well as low-latency workloads. They will also learn that Tajo can be a good solution for users who already run RDBMSs and want to introduce a Hadoop-based data warehouse system or migrate existing RDBMSs to Hadoop.
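As a rough illustration of the access path described above, the following Scala sketch queries Tajo through its thin JDBC driver. The driver class name, master host, port, and the sales table used here are assumptions for illustration only; check the Tajo documentation for the exact driver class and connection URL.

    import java.sql.DriverManager

    object TajoJdbcExample {
      def main(args: Array[String]): Unit = {
        // Register the Tajo thin JDBC driver (class name assumed here;
        // verify against the Tajo documentation for your release).
        Class.forName("org.apache.tajo.jdbc.TajoDriver")

        // Connect to the TajoMaster; host, port, and database are placeholders.
        val conn = DriverManager.getConnection("jdbc:tajo://tajo-master-host:26002/default")
        try {
          val stmt = conn.createStatement()
          // Ordinary ANSI SQL, regardless of whether the table lives on
          // HDFS, S3, or HBase. The "sales" table is hypothetical.
          val rs = stmt.executeQuery(
            "SELECT region, count(*) AS cnt FROM sales GROUP BY region ORDER BY cnt DESC")
          while (rs.next()) {
            println(rs.getString(1) + "\t" + rs.getLong(2))
          }
          rs.close()
          stmt.close()
        } finally {
          conn.close()
        }
      }
    }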
Speaker:
Hyunsik Choi, VP of Apache Tajo
Workshop on Apache Spark covering the following topics:
- Overview of Apache Spark
- Spark execution model
- Programming in Spark
- Spark SQL
- Spark Streaming
Level: Intermediate
Register now - Entry is free for a limited period.
Apache Spark started as a project at the AMPLab at UC Berkeley. Today it is one of the most active open source projects. In this workshop, we will uncover the key components of Apache Spark.
Apache Spark builds a Directed Acyclic Graph (DAG) of tasks split into stages, where tasks are parallelized as much as possible before data is shuffled. We will discuss Spark's execution paradigm along with its in-memory, cache-based persistence and processing of data, which gives it a unique performance advantage.
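As a minimal Scala sketch of that execution model (the input path and word-count logic are illustrative assumptions), the snippet below pipelines narrow transformations into a single stage, triggers a shuffle at reduceByKey, and caches the shuffled result so later actions reuse it instead of recomputing the DAG.

    import org.apache.spark.{SparkConf, SparkContext}

    object DagSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

        // Narrow transformations (flatMap, filter) are pipelined into one stage.
        val words = sc.textFile("hdfs:///data/events.log")   // path is a placeholder
          .flatMap(_.split("\\s+"))
          .filter(_.nonEmpty)

        // reduceByKey requires a shuffle, so Spark inserts a stage boundary here.
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

        // Persist the shuffled result in memory so later actions reuse it.
        counts.cache()

        println(counts.count())             // triggers the first job
        counts.take(10).foreach(println)    // served largely from the cache

        sc.stop()
      }
    }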
We will discuss in detail how to program with Spark and use its API. Resilient Distributed Datasets (RDDs) are parallelized collections that can be operated on in parallel as distributed datasets. A rich set of operators is available for applying transformations and actions to an RDD.
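A small Scala example of the transformation/action split (the numbers and the particular operators chosen are arbitrary): transformations such as filter and map are lazy, while actions such as count and reduce trigger execution on the cluster.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddOps {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))

        // parallelize turns a local collection into a distributed RDD (4 partitions).
        val nums = sc.parallelize(1 to 100, 4)

        // Transformations are lazy: nothing runs until an action is called.
        val evens   = nums.filter(_ % 2 == 0)
        val squares = evens.map(n => n * n)

        // Actions trigger execution and return results to the driver.
        println(squares.count())          // 50
        println(squares.reduce(_ + _))    // sum of the squared even numbers
        squares.take(5).foreach(println)

        sc.stop()
      }
    }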
Spark SQL is becoming a preferred choice for exploratory analytics on data. DataFrames are available as a programming abstraction and provide a convenient way of performing complex operations through their API. Coupled with the powerful Catalyst optimization engine, Spark SQL can run with or without Hive; we will discuss it in detail as part of this workshop.
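The following Scala sketch, written against the Spark 1.x API contemporary with this workshop, loads a DataFrame from JSON and expresses an aggregation both through the DataFrame API and as plain SQL; Catalyst plans and optimizes both. The file path and the customer/amount columns are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object SparkSqlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sql-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)   // no Hive metastore required

        // Load a JSON file into a DataFrame; path and column names are placeholders.
        val orders = sqlContext.read.json("hdfs:///data/orders.json")
        orders.printSchema()

        // Same aggregation via the DataFrame API...
        orders.groupBy("customer").count().show()

        // ...and via plain SQL against a temporary table.
        orders.registerTempTable("orders")
        sqlContext.sql(
          "SELECT customer, sum(amount) AS total FROM orders GROUP BY customer ORDER BY total DESC"
        ).show()

        sc.stop()
      }
    }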
Streaming is one of the hottest areas in the big data arena, with wide scope for near-real-time application development. Easily integrated with other Spark modules, Spark Streaming provides a powerful way of doing micro-batch analytics. We will look at some examples of real-time streaming.
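A classic micro-batch sketch in Scala: word counts over a TCP text stream in 5-second batches. The host and port are placeholders (a local "nc -lk 9999" can feed it for testing); note how the familiar RDD-style operators apply to each micro-batch.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
        // Each micro-batch covers 5 seconds of incoming data.
        val ssc = new StreamingContext(conf, Seconds(5))

        // Read lines from a TCP socket; host and port are placeholders.
        val lines = ssc.socketTextStream("localhost", 9999)

        // The same RDD-style operators apply to each micro-batch.
        val counts = lines.flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.print()    // print a sample of each batch's counts

        ssc.start()
        ssc.awaitTermination()
      }
    }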
Michael D. Thomas is Senior Software Architect in the SAS R&D Technology Office. In addition to developing and architecting software for over twenty years, he has authored three books on programming topics and numerous conference papers. He blogs at SAS Voices on emerging technologies such as IoT, VR and AR and advocates for elementary school chess.
Session: Augmented reality and virtual reality with Big Data
Abstract:
For all the talk about the growth of Big Data, there hasn’t been a commensurate increase in the amount of data going to human perception. Parallel to the rise of Big Data, augmented reality (AR) and virtual reality (VR) technologies are enhancing reality like never before. VR and AR can be applied to Big Data, especially real-time data streams; after all, reality is in real time. This talk surveys the exciting developments in AR and VR and how analytics can be combined with these technologies to yield data immersion.
Takeaway for audience:
The audience will learn about market developments in the AR and VR space and how they can be applied to the perception of data using reality metaphors, going well beyond traditional data visualization metaphors, some of which date back to the late 1700s.