Volker Markl, Technical University-Berlin
Mosaics: Stratosphere, Flink and beyond
Abstract: The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define “big data”, i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on varying fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies, and augments it with concepts drawn from compilers (e.g., iterations) and data stream processing, as well as (2) forming a community of researchers and institutions to create the Stratosphere platform to realize our vision. One major result from these activities was Apache Flink, an open-source big data analytics platform and its thriving global community of developers and production users. Although much progress has been made, when looking at the overall big data stack, a major challenge for database research community still remains. That is, how to maintain the ease-of-use despite the increasing heterogeneity and complexity of data analytics, involving specialized engines for various aspects of an end-to-end data analytics pipeline, including, among others, graph-based, linear algebra-based, and relational-based algorithms, and the underlying, increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to just the privileged few, who are coveted experts.
Bio: Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität (TU) Berlin, Director of the Intelligent Analytics for Massive Data Research Group at the German Research Center for Artificial Intelligence (DFKI), and Director of the Berlin Big Data Center (BBDC). He has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 19 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been Speaker and Principal Investigator of the DFG funded Stratosphere research project that resulted in the Apache Flink Big Data Analytics System. Currently, he serves as the Secretary of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on “Implicit Parallelism Through Deep Language Embedding.”
Laura Haas, IBM
Leveraging data and people to accelerate data science
Abstract: Doing data science – extracting insight by analyzing data – is not easy. Data science is used to answer interesting questions that typically involve multiple diverse data sources, many different types of analysis, and often, large and messy data volumes. To answer one of these questions, several types of expertise may be needed to understand the context and domain being served, to import and transform individual data sets, to implement effective machine learning and/or statistical methods, to design and program applications and interfaces to extract and share data and insights, and to manage the data and systems used for analysis and storage. In the IBM Research Accelerated Discovery Lab, we are studying how data scientists work, and using what we learn to help them gain insights faster. In this talk, we will look at what we have learned to date, through user studies and experience with tens of analytics projects, and the environment that we’ve built as a result. In particular, I will describe how we capture information to enable contextual search, provenance queries, and other functionality to afford teams faster progress in data-intensive investigations. I will also touch on our efforts to leverage data and people to explain what happens during an investigation, with an ultimate goal of moving from descriptive to prescriptive analytics in order to accelerate data science and the analytic process. I will illustrate these various efforts using an ambitious current project on applying metagenomics to food safety, and will conclude with a discussion of where more work is needed and our future directions.
Bio: Laura Haas is an IBM Fellow and Director of IBM Research’s Accelerated Discovery Lab. She was Director of Computer Science at IBM’s Almaden Research Center from 2005 to 2011, and had worldwide responsibility for IBM Research’s exploratory science program from 2009 through 2013. From 2001-2005, she led the Information Integration Solutions architecture and development teams in IBM's Software Group. Previously, Dr. Haas was a research staff member and manager at Almaden. She is best known for her work on the Starburst query processor, from which DB2 LUW was developed, on Garlic, a system which allowed integration of heterogeneous data sources, and on Clio, the first semi-automatic tool for heterogeneous schema mapping. She has received several IBM awards for Outstanding Innovation and Technical Achievement, an IBM Corporate Award for information integration technology, the Anita Borg Institute Technical Leadership Award, and the ACM SIGMOD Codd Innovation Award. Dr. Haas was Vice President of the VLDB Endowment Board of Trustees from 2004-2009, and served on the board of the Computing Research Association from 2007-2016 (vice chair 2009-2015); she currently serves on the National Academies Computer Science and Telecommunications Board (2013-2019). She is an ACM Fellow, a member of the National Academy of Engineering, the IBM Academy of Technology, and a Fellow of the American Academy of Arts and Sciences.