Implementing a Lambda Architecture to perform real-time updates

Master of Science === Department of Computing and Information Sciences === William Hsu === The Lambda Architecture is the new paradigm for big data, that helps in data processing with a balance on throughput, latency and fault-tolerance. There exists no single tool that provides a complete solution...

Full description

Bibliographic Details
Main Author: Gudipati, Pramod Kumar
Language:en
Published: Kansas State University 2016
Subjects:
Online Access:http://hdl.handle.net/2097/34517
Description
Summary:Master of Science === Department of Computing and Information Sciences === William Hsu === The Lambda Architecture is the new paradigm for big data, that helps in data processing with a balance on throughput, latency and fault-tolerance. There exists no single tool that provides a complete solution in terms of better accuracy, low latency and high throughput. This initiated the idea to use a set of tools and techniques to build a complete big data system. The Lambda Architecture defines a set of layers to fit in a set of tools and techniques rightly for building a complete big data system: Speed Layer, Serving Layer, Batch Layer. Each layer satisfies a set of properties and builds upon the functionality provided by the layers beneath it. The Batch layer is the place where the master dataset is stored, which is an immutable and append-only set of raw data. Also, batch layer pre-computes results using a distributed processing system like Hadoop, Apache Spark that can handle large quantities of data. The Speed Layer captures new data coming in real time and processes it. The Serving Layer contains a parallel processing query engine, which takes results from both Batch and Speed layers and responds to queries in real time with low latency. Stack Overflow is a Question & Answer forum with a huge user community, millions of posts with a rapid growth over the years. This project demonstrates The Lambda Architecture by constructing a data pipeline, to add a new “Recommended Questions” section in Stack Overflow user profile and update the questions suggested in real time. Also, various statistics such as trending tags, user performance numbers such as UpVotes, DownVotes are shown in user dashboard by querying through batch processing layer.