Fr. 73.20

Spark - Big Data Cluster Computing in Production

English · Paperback / Softback

Description

Production-targeted Spark guidance with real-world use cases
 
Spark: Big Data Cluster Computing in Production goes beyond general Spark overviews to provide targeted guidance toward using lightning-fast big-data clustering in production. Written by an expert team well known in the big data community, this book walks you through the challenges in moving from proof-of-concept or demo Spark applications to live Spark in production. Real use cases provide deep insight into common problems, limitations, challenges, and opportunities, while expert tips and tricks help you get the most out of Spark performance. Coverage includes Spark SQL, Tachyon, Kerberos, MLlib, YARN, and Mesos, with clear, actionable guidance on resource scheduling, database connectors, streaming, security, and much more.
 
Spark has become the tool of choice for many big data problems, with more active contributors than any other Apache Software Foundation project. General introductory books abound, but this book is the first to provide deep insight and real-world advice on using Spark in production. Specific guidance, expert tips, and invaluable foresight make this guide an incredibly useful resource for real production settings.
* Review Spark hardware requirements and estimate cluster size
* Gain insight from real-world production use cases
* Tighten security, schedule resources, and fine-tune performance
* Overcome common problems encountered using Spark in production
 
Spark works with other big data tools including MapReduce and Hadoop, and uses languages you already know like Java, Scala, Python, and R. Lightning speed makes Spark too good to pass up, but understanding limitations and challenges in advance goes a long way toward easing actual production implementation. Spark: Big Data Cluster Computing in Production tells you everything you need to know, with real-world production insight and expert guidance, tips, and tricks.

List of contents

Introduction

Chapter 1 Finishing Your Spark Job

Installation of the Necessary Components

Native Installation Using a Spark Standalone Cluster

The History of Distributed Computing That Led to Spark

Enter the Cloud

Understanding Resource Management

Using Various Formats for Storage

Text Files

Sequence Files

Avro Files

Parquet Files

Making Sense of Monitoring and Instrumentation

Spark UI

Spark Standalone UI

Metrics REST API

Metrics System

External Monitoring Tools

Summary
 
Chapter 2 Cluster Management

Background

Spark Components

Driver

Workers and Executors

Configuration

Spark Standalone

Architecture

Single-Node Setup Scenario

Multi-Node Setup

YARN

Architecture

Dynamic Resource Allocation

Scenario

Mesos

Setup

Architecture

Dynamic Resource Allocation

Basic Setup Scenario

Comparison

Summary
 
Chapter 3 Performance Tuning

Spark Execution Model

Partitioning

Controlling Parallelism

Partitioners

Shuffling Data

Shuffling and Data Partitioning

Operators and Shuffling

Shuffling Is Not That Bad After All

Serialization

Kryo Registrators

Spark Cache

Spark SQL Cache

Memory Management

Garbage Collection

Shared Variables

Broadcast Variables

Accumulators

Data Locality

Summary
 
Chapter 4 Security

Architecture

Security Manager

Setup Configurations

ACL

Configuration

Job Submission

Web UI

Network Security

Encryption

Event Logging

Kerberos

Apache Sentry

Summary
 
Chapter 5 Fault Tolerance or Job Execution

Lifecycle of a Spark Job

Spark Master

Spark Driver

Spark Worker

Job Lifecycle

Job Scheduling

Scheduling within an Application

Scheduling with External Utilities

Fault Tolerance

Internal and External Fault Tolerance

Service Level Agreements (SLAs)

Resilient Distributed Datasets (RDDs)

Batch versus Streaming

Testing Strategies

Recommended Configurations

Summary
 
Chapter 6 Beyond Spark

Data Warehousing

Spark SQL CLI

Thrift JDBC/ODBC Server

Hive on Spark

Machine Learning

DataFrame

MLlib and ML

Mahout on Spark

Hivemall on Spark

External Frameworks

Spark Package

XGBoost

spark-jobserver

Future Works

Integration with the Parameter Server

Deep Learning

Enterprise Usage

Collecting User Activity Log with Spark and Kafka

Real-Time Recommendation with Spark

Real-Time Categorization of T

About the author

Ilya Ganelin is a data engineer working at Capital One Data Innovation Lab. Ilya is an active contributor to the core components of Apache Spark and a committer to Apache Apex. Ema Orhian is a big data engineer interested in scaling algorithms. She is the main committer on jaws-spark-sql-rest, a data warehouse explorer on top of Spark SQL. Kai Sasaki is a software engineer working in distributed computing and machine learning. He is a Spark contributor who works mainly on the MLlib and ML libraries. Brennon York has been a core contributor to Apache Spark since 2014, including development on GraphX and the core build environment.
