Jie Xu – Big Data: The Long Tail Problem
Big Data often requires massive-scale parallel processing and data analysis. Typically, a job is divided into many tasks that run in parallel on distributed processing nodes, and the job finishes only when all of its tasks (or, in some cases, a subset of them) have completed. In practice, a major challenge for this style of massive-scale parallel processing is the so-called “Long Tail” problem: certain tasks of a job suffer from unexpectedly slow execution and take noticeably longer than the other parallel tasks, thereby delaying the completion of the whole job. This talk will investigate the root causes of the long tail phenomenon and introduce some of the latest research results that offer cost-effective solutions to the problem. An unusually slow task of a job is called a “straggler”. We will discuss 1) how to identify potential stragglers, together with cost-effective schemes for mitigating the effects of slow tasks, 2) how to identify “weak” processing nodes (or servers) by using machine learning to forecast server performance, and 3) how to deal with “data skew”, that is, slow tasks caused by unbalanced data partitions. For example, we have developed a dynamic method for identifying and mitigating stragglers that uses 50% fewer replicas than the standard replication-based or clone-based methods, reduces response time by up to 20%, and achieves a speculation success rate of up to 66.67%, compared with 16.67% for the static method (i.e. it dramatically reduces resource waste).
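To make the notion of a straggler concrete, the following minimal sketch flags tasks whose duration is far above the median duration of their job. It is an illustration only, not the dynamic method described in the talk: the function name `find_stragglers`, the sample durations, and the 1.5x-median threshold are all assumptions chosen for the example.

```python
from statistics import median


def find_stragglers(durations, threshold=1.5):
    """Return the ids of tasks that ran longer than `threshold` times the median.

    `durations` maps a task id to its execution time in seconds.
    The 1.5x default is an assumed cut-off for illustration only.
    """
    if not durations:
        return []
    typical = median(durations.values())
    return sorted(tid for tid, d in durations.items() if d > threshold * typical)


# Hypothetical job with five parallel tasks; one of them is unusually slow.
durations = {"t1": 10.0, "t2": 11.0, "t3": 9.5, "t4": 31.0, "t5": 10.5}

# The job finishes only when its slowest task does, so the tail sets the job time.
job_time = max(durations.values())
stragglers = find_stragglers(durations)

print("job completion time:", job_time)
print("stragglers:", stragglers)
```

In this sample the single slow task determines the completion time of the whole job, which is why identifying such tasks early matters.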
Professor Jie Xu is a Chair of Computing at the University of Leeds and the director of the UK EPSRC WRG e-Science Centre. He is also the chief scientist of the Beijing Advanced Innovation Center for Big Data and Brain Computing (BDBC), a National Professor of the 1000-Talent Program and a Yangtze River Scholar, China. He currently leads a collaborative research team that investigates fundamental theories and models for distributed systems (e.g. the theory for fault-diagnosable and large-scale distributed systems, and the formalisation for dynamic multi-party authentication) and develops advanced Internet and Cloud technologies with a focus on complex system engineering (e.g. with Rolls-Royce and JLR), energy-efficient computing (e.g. with Google and Alibaba), dependable and secure collaboration (e.g. large-scale data processing and analysis for social science and e-healthcare applications with TPP and X-Lab Ltd), and evolving system architectures (e.g. with BAE Systems). He is a Steering/Executive Committee member of IEEE SRDS, ISORC, HASE and SOSE, and a co-founder of IEEE IC2E. He has published in excess of 300 academic papers, book chapters and edited books. His major work has appeared in IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Computer, IEEE Transactions on Services Computing, IEEE Transactions on Cloud Computing, VLDB, IEEE DSN, IEEE SRDS and IEEE Internet Computing, among others.