Practical Data Science with Hadoop and Spark
Designing and Building Effective Analytics at Scale

Format: Paperback, 256 pages
Published: United States, 1 December 2016

This book offers a practical perspective on applying data science with Hadoop. It explains what data science with Hadoop is and how it applies to business problems, then goes into the details with a hands-on tutorial and a showcase of real-world use cases. The authors bring together the practical knowledge students need to do real, useful data science with Hadoop.



Foreword
Preface
Acknowledgments
About the Authors

Part I: Data Science with Hadoop - An Overview

Chapter 1: Introduction to Data Science
What Is Data Science?
Example: Search Advertising
A Bit of Data Science History
Becoming a Data Scientist
Building a Data Science Team
The Data Science Project Life Cycle
Managing a Data Science Project
Summary

Chapter 2: Use Cases for Data Science
Big Data - A Driver of Change
Business Use Cases
Summary

Chapter 3: Hadoop and Data Science
What Is Hadoop?
Hadoop's Evolution
Hadoop Tools for Data Science
Why Hadoop Is Useful to Data Scientists
Summary

Part II: Preparing and Visualizing Data with Hadoop

Chapter 4: Getting Data into Hadoop
Hadoop as a Data Lake
The Hadoop Distributed File System (HDFS)
Direct File Transfer to Hadoop HDFS
Importing Data from Files into Hive Tables
Importing Data into Hive Tables Using Spark
Using Apache Sqoop to Acquire Relational Data
Using Apache Flume to Acquire Data Streams
Manage Hadoop Work and Data Flows with Apache Oozie
Apache Falcon
What's Next in Data Ingestion?
Summary

Chapter 5: Data Munging with Hadoop
Why Hadoop for Data Munging?
Data Quality
The Feature Matrix
Summary

Chapter 6: Exploring and Visualizing Data
Why Visualize Data?
Creating Visualizations
Using Visualization for Data Science
Popular Visualization Tools
Visualizing Big Data with Hadoop
Summary

Part III: Applying Data Modeling with Hadoop

Chapter 7: Machine Learning with Hadoop
Overview of Machine Learning
Terminology
Task Types in Machine Learning
Big Data and Machine Learning
Tools for Machine Learning
The Future of Machine Learning and Artificial Intelligence
Summary

Chapter 8: Predictive Modeling
Overview of Predictive Modeling
Classification Versus Regression
Evaluating Predictive Models
Supervised Learning Algorithms
Building Big Data Predictive Model Solutions
Example: Sentiment Analysis
Summary

Chapter 9: Clustering
Overview of Clustering
Uses of Clustering
Designing a Similarity Measure
Clustering Algorithms
Example: Clustering Algorithms
Evaluating the Clusters and Choosing the Number of Clusters
Building Big Data Clustering Solutions
Example: Topic Modeling with Latent Dirichlet Allocation
Summary

Chapter 10: Anomaly Detection with Hadoop
Overview
Uses of Anomaly Detection
Types of Anomalies in Data
Approaches to Anomaly Detection
Tuning Anomaly Detection Systems
Building a Big Data Anomaly Detection Solution with Hadoop
Example: Detecting Network Intrusions
Summary

Chapter 11: Natural Language Processing
Natural Language Processing
Tooling for NLP in Hadoop
Textual Representations
Sentiment Analysis Example
Summary

Chapter 12: Data Science with Hadoop - The Next Frontier
Automated Data Discovery
Deep Learning
Summary

Appendix A: Book Web Page and Code Download

Appendix B: HDFS Quick Start
Quick Command Dereference

Appendix C: Additional Background on Data Science and Apache Hadoop and Spark
General Hadoop/Spark Information
Hadoop/Spark Installation Recipes
HDFS
MapReduce
Spark
Essential Tools
Machine Learning

Index

Our Price: £32.54
Ships from the UK. Estimated delivery date: 11th Apr - 15th Apr.

Product Details
EAN: 9780134024141
ISBN: 0134024141
Dimensions: 23.1 x 17.5 x 2 centimeters (0.32 kg)

Table of Contents

  • Part I: Data Science with Hadoop—An Overview
  • Chapter 1: Introduction to Data Science
  • Chapter 2: Use Cases for Data Science
  • Chapter 3: Hadoop and Data Science
  • Part II: Preparing and Visualizing Data with Hadoop
  • Chapter 4: Getting Data into Hadoop
  • Chapter 5: Data Munging with Hadoop
  • Chapter 6: Exploring and Visualizing Data
  • Part III: Applying Data Modeling with Hadoop
  • Chapter 7: Machine Learning with Hadoop
  • Chapter 8: Predictive Modeling
  • Chapter 9: Clustering
  • Chapter 10: Anomaly Detection with Hadoop
  • Chapter 11: Natural Language Processing
  • Chapter 12: Data Science with Hadoop—The Next Frontier
  • Appendix A: Book Web Page and Code Download
  • Appendix B: HDFS Quick Start
  • Appendix C: Additional Background on Data Science and Apache Hadoop and Spark

About the Authors

Ofer Mendelevitch is Vice President of Data Science at Lendup, where he is responsible for Lendup’s machine learning and advanced analytics group. Prior to joining Lendup, Ofer was Director of Data Science at Hortonworks, where he was responsible for helping Hortonworks’ customers apply data science with Hadoop and Spark to big data across industries including healthcare, finance, and retail. Before Hortonworks, Ofer served as Entrepreneur in Residence at XSeed Capital, VP of Engineering at Nor1, and Director of Engineering at Yahoo!.

Casey Stella is a Principal Software Engineer focusing on Data Science at Hortonworks, which provides an open source Hadoop distribution. Casey’s primary responsibility is leading the analytics/data science team for the Apache Metron (Incubating) Project, an open source cybersecurity project. Prior to Hortonworks, Casey was an architect at Explorys, a medical informatics startup spun out of the Cleveland Clinic. In the more distant past, Casey served as a developer at Oracle, Research Geophysicist at ION Geophysical, and as a poor graduate student in Mathematics at Texas A&M.

Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering many aspects of HPC and Hadoop computing. Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief for ClusterWorld Magazine and was senior HPC editor for Linux Magazine. He has practical hands-on experience in many aspects of HPC and Apache Hadoop, including hardware and software design, benchmarking, storage, GPU, cloud computing, and parallel computing. Currently, he is a writer and consultant to the HPC/analytics industry and leader of the Limulus Personal Cluster Project (http://limulus.basement-supercomputing.com). He is the author of the Apache Hadoop® Fundamentals LiveLessons and Apache Hadoop® YARN Fundamentals LiveLessons videos from Pearson, co-author of Apache Hadoop® YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2, and author of Hadoop® 2 Quick Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem, also from Addison-Wesley. He is also the author of High Performance Computing for Dummies.

 
Item ships from and is sold by Fishpond World Ltd.
