- Volker Markl, 8:30 AM - 9:30 AM, Wednesday, April 19
- Laura Haas, 8:30 AM - 9:30 AM, Thursday, April 20
- Pavel Pevzner, 9:00 AM - 10:00 AM, Friday, April 21
- Magdalena Balazinska, 1:30 PM, Thursday, April 20
Volker Markl, Technical University-Berlin
Mosaics: Stratosphere, Flink and beyond
Abstract: The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define “big data”, i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on varying fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies, and augments it with concepts drawn from compilers (e.g., iterations) and data stream processing, as well as (2) forming a community of researchers and institutions to create the Stratosphere platform to realize our vision. One major result from these activities was Apache Flink, an open-source big data analytics platform and its thriving global community of developers and production users. Although much progress has been made, when looking at the overall big data stack, a major challenge for the database research community still remains: how to maintain ease of use despite the increasing heterogeneity and complexity of data analytics, involving specialized engines for various aspects of an end-to-end data analytics pipeline, including, among others, graph-based, linear algebra-based, and relational-based algorithms, and the underlying, increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to just the privileged few, who are coveted experts.
Bio: Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität (TU) Berlin, Director of the Intelligent Analytics for Massive Data Research Group at the German Research Center for Artificial Intelligence (DFKI), and Director of the Berlin Big Data Center (BBDC). He has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 19 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been Speaker and Principal Investigator of the DFG-funded Stratosphere research project that resulted in the Apache Flink Big Data Analytics System. Currently, he serves as the Secretary of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on “Implicit Parallelism Through Deep Language Embedding.”
Website: http://www.dima.tu-berlin.de
Laura Haas, IBM
Leveraging data and people to accelerate data science
Abstract: Doing data science – extracting insight by analyzing data – is not easy. Data science is used to answer interesting questions that typically involve multiple diverse data sources, many different types of analysis, and often, large and messy data volumes. To answer one of these questions, several types of expertise may be needed to understand the context and domain being served, to import and transform individual data sets, to implement effective machine learning and/or statistical methods, to design and program applications and interfaces to extract and share data and insights, and to manage the data and systems used for analysis and storage. In the IBM Research Accelerated Discovery Lab, we are studying how data scientists work, and using what we learn to help them gain insights faster. In this talk, we will look at what we have learned to date, through user studies and experience with tens of analytics projects, and the environment that we’ve built as a result. In particular, I will describe how we capture information to enable contextual search, provenance queries, and other functionality to afford teams faster progress in data-intensive investigations. I will also touch on our efforts to leverage data and people to explain what happens during an investigation, with an ultimate goal of moving from descriptive to prescriptive analytics in order to accelerate data science and the analytic process. I will illustrate these various efforts using an ambitious current project on applying metagenomics to food safety, and will conclude with a discussion of where more work is needed and our future directions.
Bio: Laura Haas is an IBM Fellow and Director of IBM Research’s Accelerated Discovery Lab. She was Director of Computer Science at IBM’s Almaden Research Center from 2005 to 2011, and had worldwide responsibility for IBM Research’s exploratory science program from 2009 through 2013. From 2001 to 2005, she led the Information Integration Solutions architecture and development teams in IBM's Software Group. Previously, Dr. Haas was a research staff member and manager at Almaden. She is best known for her work on the Starburst query processor, from which DB2 LUW was developed, on Garlic, a system which allowed integration of heterogeneous data sources, and on Clio, the first semi-automatic tool for heterogeneous schema mapping. She has received several IBM awards for Outstanding Innovation and Technical Achievement, an IBM Corporate Award for information integration technology, the Anita Borg Institute Technical Leadership Award, and the ACM SIGMOD Codd Innovation Award. Dr. Haas was Vice President of the VLDB Endowment Board of Trustees from 2004 to 2009, and served on the board of the Computing Research Association from 2007 to 2016 (vice chair 2009-2015); she currently serves on the National Academies Computer Science and Telecommunications Board (2013-2019). She is an ACM Fellow, a member of the National Academy of Engineering and of the IBM Academy of Technology, and a Fellow of the American Academy of Arts and Sciences.
Website: http://www.research.ibm.com/client-programs/accelerated-discovery-lab/index.shtml
Pavel Pevzner, University of California San Diego
Life After MOOCs: Online Science Education Needs a New Revolution
Abstract: Universities continue to pack hundreds of students into a single classroom, despite the fact that this “hoarding” approach has little pedagogical value. Hoarding is particularly objectionable in STEM courses, where learning a complex idea is comparable to navigating a labyrinth. In the large classroom, once a student takes a wrong turn, the student has limited opportunities to ask a question, resulting in a learning breakdown, or the inability to progress further without individualized guidance.
A recent revolution in online education has largely focused on making low-cost equivalents of hoarding classes, as many MOOCs are mirror images of their offline counterparts. I propose to transform MOOCs into a more efficient educational product called a Massive Adaptive Interactive Text (MAIT) that can prevent individual learning breakdowns and even outperform a professor in a classroom. I argue that computer science is a unique discipline where this transition is about to happen and describe our first steps towards transforming a MOOC into a MAIT that has already outperformed me. Unlike MOOCs, MAITs will capture digitized individual learning paths of all students and will transform educational psychology into a digital science. I will argue that the future MAIT revolution, unlike the ongoing MOOC revolution, will profoundly affect the way we all teach and will generate huge population-wide datasets containing individual learning paths through various Intelligent Tutoring Systems.
Bio: Pavel Pevzner is Ronald R. Taylor Professor of Computer Science and Engineering and Director of the NIH National Center for Computational Mass Spectrometry at University of California, San Diego. He holds a Ph.D. (1988) from Moscow Institute of Physics and Technology, Russia. He was named Howard Hughes Medical Institute Professor in 2006. He was elected an ACM Fellow (2010) for "contribution to algorithms for genome rearrangements, DNA sequencing, and proteomics", an International Society for Computational Biology Fellow (2012), and to the European Academy of Sciences (Academia Europaea) in 2016. He was awarded an Honoris Causa degree (2011) from Simon Fraser University in Vancouver and was a recipient of the Senior Scientist Award from the International Society for Computational Biology (2017). Dr. Pevzner authored the textbooks "Computational Molecular Biology: An Algorithmic Approach", "Introduction to Bioinformatics Algorithms" (jointly with Neil Jones), and "Bioinformatics Algorithms: an Active Learning Approach" (jointly with Phillip Compeau). In 2015, jointly with Phillip Compeau, he developed a Bioinformatics specialization on Coursera (a series of 7 courses) that is now being transformed into a MAIT and that has already had over 300,000 enrollments. In 2016, he co-developed the first Algorithms specialization on Coursera, which has already had over 100,000 enrollments.
Magda Balazinska, University of Washington
Research with Real Users
Abstract: There are many potential benefits to ensuring that our research prototypes benefit real users. Users help to define requirements and identify important areas for innovation. Users help to test our systems and verify their utility. Working with real users, however, presents its own challenges and requires extra effort. In this talk, we will present some of these benefits, challenges, and lessons learned from working with real users from the sciences in the context of database systems work in data analytics.
Bio: Magdalena Balazinska is the Jean Loup Baer Associate Professor of Computer Science and Engineering at the University of Washington. She's the director of the IGERT PhD Program in Big Data and Data Science and the director of the associated Advanced Data Science PhD Option. She's also a Senior Data Science Fellow of the University of Washington eScience Institute. Magdalena's research interests are in the field of database management systems. Her current research focuses on data management for data science, big data systems, and cloud computing. Magdalena holds a Ph.D. from the Massachusetts Institute of Technology (2006). She is a Microsoft Research New Faculty Fellow (2007), received the inaugural VLDB Women in Database Research Award (2016), an NSF CAREER Award (2009), a 10-year most influential paper award (2010), a Google Research Award (2011), an HP Labs Research Innovation Award (2009 and 2010), a Rogel Faculty Support Award (2006), a Microsoft Research Graduate Fellowship (2003-2005), and multiple best-paper awards.
Website: http://www.cs.washington.edu/people/faculty/magda