Data Lake for enterprises : leveraging Lambda architecture for building Enterprise Data Lake / Tomcy John, Pankaj Misra.

A practical guide to implementing your enterprise data lake using Lambda Architecture as the base About This Book Build a full-fledged data lake for your organization with popular big data technologies using the Lambda architecture as the base Delve into the big data technologies required to meet mo...

Bibliographic Details
Online Access: Full Text (via O'Reilly/Safari)
Main Authors: John, Tomcy (Author), Misra, Pankaj (Author)
Format: eBook
Language: English
Published: Birmingham, UK : Packt Publishing, 2017.

MARC

LEADER 00000cam a2200000 i 4500
001 b10304036
006 m o d
007 cr |||||||||||
008 170623s2017 enka o 000 0 eng d
005 20240829145615.0
015 |a GBB7E7321  |2 bnb 
016 7 |a 018399333  |2 Uk 
020 |a 1787282651 
020 |a 9781787282650  |q (electronic bk.) 
020 |z 9781787281349 
029 1 |a GBVCP  |b 1004864566 
029 1 |a UKMGB  |b 018399333 
035 |a (OCoLC)safo991530196 
035 |a (OCoLC)991530196 
037 |a safo9781787281349 
040 |a UMI  |b eng  |e rda  |e pn  |c UMI  |d TOH  |d OCLCF  |d IDEBK  |d OCLCQ  |d CEF  |d KSU  |d NLE  |d UKMGB  |d UAB  |d UKAHL  |d DST  |d OCLCO  |d OCLCQ  |d N$T  |d INARC  |d OCLCQ  |d OCLCO  |d OCLCQ  |d DXU 
049 |a GWRE 
050 4 |a QA76.9.D5 
100 1 |a John, Tomcy,  |e author. 
245 1 0 |a Data Lake for enterprises :  |b leveraging Lambda architecture for building Enterprise Data Lake /  |c Tomcy John, Pankaj Misra. 
264 1 |a Birmingham, UK :  |b Packt Publishing,  |c 2017. 
300 |a 1 online resource (1 volume) :  |b illustrations 
336 |a text  |b txt  |2 rdacontent 
337 |a computer  |b c  |2 rdamedia 
338 |a volume  |b nc  |2 rdacarrier 
588 |a Description based on online resource; title from title page (Safari, viewed June 23, 2017). 
520 |a A practical guide to implementing your enterprise data lake using Lambda Architecture as the base About This Book Build a full-fledged data lake for your organization with popular big data technologies using the Lambda architecture as the base Delve into the big data technologies required to meet modern day business strategies A highly practical guide to implementing enterprise data lakes with lots of examples and real-world use-cases Who This Book Is For Java developers and architects who would like to implement a data lake for their enterprise will find this book useful. If you want to get hands-on experience with the Lambda Architecture and big data technologies by implementing a practical solution using these technologies, this book will also help you. What You Will Learn Build an enterprise-level data lake using the relevant big data technologies Understand the core of the Lambda architecture and how to apply it in an enterprise Learn the technical details around Sqoop and its functionalities Integrate Kafka with Hadoop components to acquire enterprise data Use Flume with streaming technologies for stream-based processing Understand stream-based processing with reference to Apache Spark Streaming Incorporate Hadoop components and know the advantages they provide for enterprise data lakes Build fast, streaming, and high-performance applications using ElasticSearch Make your data ingestion process consistent across various data formats with configurability Process your data to derive intelligence using machine learning algorithms In Detail The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can make use of it in deriving meaningful insights that can be used by businesses to redefine or transform the way they operate. 
Lambda architecture is also emerging as one of the very eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable business to take critical decisions. This book tries to bring these two important aspects - data lake and lambda architecture - together. This book is divided into three main sections. The first introduces you to the concept of data lakes, the importance of data lakes in enterprises, and getting you up-to-speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces yo... 
505 0 |a Cover -- Copyright -- Credits -- Foreword -- About the Authors -- About the Reviewers -- www.PacktPub.com -- Customer Feedback -- Table of Contents -- Preface -- Part 1 -- Overview -- Part 2 -- Technical Building blocks of Data Lake -- Part 3 -- Bringing It All Together -- Chapter 1: Introduction to Data -- Exploring data -- What is Enterprise Data? -- Enterprise Data Management -- Big data concepts -- Big data and 4Vs -- Relevance of data -- Quality of data -- Where does this data live in an enterprise? -- Intranet (within enterprise) -- Internet (external to enterprise) -- Business applications hosted in cloud -- Third-party cloud solutions -- Social data (structured and unstructured) -- Data stores or persistent stores (RDBMS or NoSQL) -- Traditional data warehouse -- File stores -- Enterprise's current state -- Enterprise digital transformation -- Enterprises embarking on this journey -- Some examples -- Data lake use case enlightenment -- Summary -- Chapter 2: Comprehensive Concepts of a Data Lake -- What is a Data Lake? -- Relevance to enterprises -- How does a Data Lake help enterprises? -- Data Lake benefits -- How Data Lake works? -- Differences between Data Lake and Data Warehouse -- Approaches to building a Data Lake -- Lambda Architecture-driven Data Lake -- Data ingestion layer -- ingest for processing and storage -- Batch layer -- batch processing of ingested data -- Speed layer -- near real time data processing -- Data storage layer -- store all data -- Serving layer -- data delivery and exports -- Data acquisition layer -- get data from source systems -- Messaging Layer -- guaranteed data delivery -- Exploring the Data Ingestion Layer -- Exploring the Lambda layer -- Batch layer -- Speed layer -- Serving layer -- Data push -- Data pull -- Data storage layer -- Batch process layer -- Speed layer -- Serving layer -- Relational data stores. 
505 8 |a Distributed data stores -- Summary -- Chapter 3: Lambda Architecture as a Pattern for Data Lake -- What is Lambda Architecture? -- History of Lambda Architecture -- Principles of Lambda Architecture -- Fault-tolerant principle -- Immutable Data principle -- Re-computation principle -- Components of a Lambda Architecture -- Batch layer -- Speed layer -- CAP Theorem -- Eventual consistency -- Serving layer -- Complete working of a Lambda Architecture -- Advantages of Lambda Architecture -- Disadvantages of Lambda Architectures -- Technology overview for Lambda Architecture -- Applied lambda -- Enterprise-level log analysis -- Capturing and analyzing sensor data -- Real-time mailing platform statistics -- Real-time sports analysis -- Recommendation engines -- Analyzing security threats -- Multi-channel consumer behaviour -- Working examples of Lambda Architecture -- Kappa architecture -- Summary -- Chapter 4: Applied Lambda for Data Lake -- Knowing Hadoop distributions -- Selection factors for a big data stack for enterprises -- Technical capabilities -- Ease of deployment and maintenance -- Integration readiness -- Batch layer for data processing -- The NameNode server -- The secondary NameNode Server -- Yet Another Resource Negotiator (YARN) -- Data storage nodes (DataNode) -- Speed layer -- Flume for data acquisition -- Source for event sourcing -- Interceptors for event interception -- Channels for event flow -- Sink as an event destination -- Spark Streaming -- DStreams -- Data Frames -- Checkpointing -- Apache Flink -- Serving layer -- Data repository layer -- Relational databases -- Big data tables/views -- Data services with data indexes -- NoSQL databases -- Data access layer -- Data exports -- Data publishing -- Summary -- Chapter 5: Data Acquisition of Batch Data using Apache Sqoop -- Context in data lake -- data acquisition. 
505 8 |a Data acquisition layer -- Data acquisition of batch data -- technology mapping -- Why Apache Sqoop -- History of Sqoop -- Advantages of Sqoop -- Disadvantages of Sqoop -- Workings of Sqoop -- Sqoop 2 architecture -- Sqoop 1 versus Sqoop 2 -- Ease of use -- Ease of extension -- Security -- When to use Sqoop 1 and Sqoop 2 -- Functioning of Sqoop -- Data import using Sqoop -- Data export using Sqoop -- Sqoop connectors -- Types of Sqoop connectors -- Sqoop support for HDFS -- Sqoop working example -- Installation and Configuration -- Step 1 -- Installing and verifying Java -- Step 2 -- Installing and verifying Hadoop -- Step 3 -- Installing and verifying Hue -- Step 4 -- Installing and verifying Sqoop -- Step 5 -- Installing and verifying PostgreSQL (RDBMS) -- Step 6 -- Installing and verifying HBase (NoSQL) -- Configure data source (ingestion) -- Sqoop configuration (database drivers) -- Configuring HDFS as destination -- Sqoop Import -- Import complete database -- Import selected tables -- Import selected columns from a table -- Import into HBase -- Sqoop Export -- Sqoop Job -- Job command -- Create job -- List Job -- Run Job -- Create Job -- Sqoop 2 -- Sqoop in purview of SCV use case -- When to use Sqoop -- When not to use Sqoop -- Real-time Sqooping: a possibility? -- Other options -- Native big data connectors -- Talend -- Pentaho's Kettle (PDI -- Pentaho Data Integration) -- Summary -- Chapter 6: Data Acquisition of Stream Data using Apache Flume -- Context in Data Lake: data acquisition -- What is Stream Data? -- Batch and stream data -- Data acquisition of stream data -- technology mapping -- What is Flume? -- Sqoop and Flume -- Why Flume? -- History of Flume -- Advantages of Flume -- Disadvantages of Flume -- Flume architecture principles -- The Flume Architecture -- Distributed pipeline -- Flume architecture -- Fan Out -- Flume architecture. 
505 8 |a Fan In -- Flume architecture -- Three tier design -- Flume architecture -- Advanced Flume architecture -- Flume reliability level -- Flume event -- Stream Data -- Flume agent -- Flume agent configurations -- Flume source -- Custom Source -- Flume Channel -- Custom channel -- Flume sink -- Custom sink -- Flume configuration -- Flume transaction management -- Other flume components -- Channel processor -- Interceptor -- Channel Selector -- Sink Groups -- Sink Processor -- Event Serializers -- Context Routing -- Flume working example -- Installation and Configuration -- Step 1: Installing and verifying Flume -- Step 2: Configuring Flume -- Step 3: Start Flume -- Flume in purview of SCV use case -- Kafka Installation -- Example 1 -- RDBMS to Kafka -- Example 2: Spool messages to Kafka -- Example 3: Interceptors -- Example 4 -- Memory channel, file channel, and Kafka channel -- When to use Flume -- When not to use Flume -- Other options -- Apache Flink -- Apache NiFi -- Summary -- Chapter 7: Messaging Layer using Apache Kafka -- Context in Data Lake - messaging layer -- Messaging layer -- Messaging layer - technology mapping -- What is Apache Kafka? -- Why Apache Kafka -- History of Kafka -- Advantages of Kafka -- Disadvantages of Kafka -- Kafka architecture -- Core architecture principles of Kafka -- Data stream life cycle -- Working of Kafka -- Kafka message -- Kafka producer -- Persistence of data in Kafka using topics -- Partitions - Kafka topic division -- Kafka message broker -- Kafka consumer -- Consumer groups -- Other Kafka components -- Zookeeper -- MirrorMaker -- Kafka programming interface -- Kafka core API's -- Kafka REST interface -- Producer and consumer reliability -- Kafka security -- Kafka as message-oriented middleware -- Scale-out architecture with Kafka -- Kafka connect -- Kafka working example -- Installation. 
505 8 |a Producer - putting messages into Kafka -- Kafka Connect -- Consumer - getting messages from Kafka -- Setting up multi-broker cluster -- Kafka in the purview of an SCV use case -- When to use Kafka -- When not to use Kafka -- Other options -- RabbitMQ -- ZeroMQ -- Apache ActiveMQ -- Summary -- Chapter 8: Data Processing using Apache Flink -- Context in a Data Lake -- Data Ingestion Layer -- Data Ingestion Layer -- Data Ingestion Layer -- technology mapping -- What is Apache Flink? -- Why Apache Flink? -- History of Flink -- Advantages of Flink -- Disadvantages of Flink -- Working of Flink -- Flink architecture -- Client -- Job Manager -- Task Manager -- Flink execution model -- Core architecture principles of Flink -- Flink Component Stack -- Checkpointing in Flink -- Savepoints in Flink -- Streaming window options in Flink -- Time window -- Count window -- Tumbling window configuration -- Sliding window configuration -- Memory management -- Flink API's -- DataStream API -- Flink DataStream API example -- Streaming connectors -- DataSet API -- Flink DataSet API example -- Table API -- Flink domain specific libraries -- Gelly - Flink Graph API -- FlinkML -- FlinkCEP -- Flink working example -- Installation -- Example -- data processing with Flink -- Data generation -- Step 1 -- Preparing streams -- Step 2 -- Consuming Streams via Flink -- Step 3 -- Streaming data into HDFS -- Flink in purview of SCV use cases -- User Log Data Generation -- Flume Setup -- Flink Processors -- When to use Flink -- When not to use Flink -- Other options -- Apache Spark -- Apache Storm -- Apache Tez -- Summary -- Chapter 9: Data Store Using Apache Hadoop -- Context for Data Lake -- Data Storage and lambda Batch layer -- Data Storage and the Lambda Batch Layer -- Data Storage and Lambda Batch Layer -- technology mapping -- What is Apache Hadoop? -- Why Hadoop? -- History of Hadoop. 
650 0 |a Electronic data processing  |x Distributed processing  |x Management. 
650 0 |a Big data. 
650 0 |a Information storage and retrieval systems. 
650 7 |a Big data  |2 fast 
650 7 |a Electronic data processing  |x Distributed processing  |x Management  |2 fast 
650 7 |a Information storage and retrieval systems  |2 fast 
700 1 |a Misra, Pankaj,  |e author. 
856 4 0 |u https://go.oreilly.com/UniOfColoradoBoulder/library/view/~/9781787281349/?ar  |z Full Text (via O'Reilly/Safari) 
915 |a - 
956 |a O'Reilly-Safari eBooks 
956 |b O'Reilly Online Learning: Academic/Public Library Edition 
994 |a 92  |b COD 
998 |b Subsequent record output 
999 f f |i d0354aa6-7c85-581a-ae14-027c2a295677  |s 134ac131-8aa9-51b4-aeff-2ce72e825e84 
952 f f |p Can circulate  |a University of Colorado Boulder  |b Online  |c Online  |d Online  |e QA76.9.D5  |h Library of Congress classification  |i web  |n 1