 
This book gives managers and decision makers the tools to make more informed decisions about big data purchasing initiatives. Big Data Analytics: A Practical Guide for Managers not only describes common tools, but also surveys the various products and vendors that serve the big data market.
Comparing and contrasting the different types of analysis commonly conducted with big data, this accessible reference presents clear-cut explanations of the general workings of big data tools. Instead of spending time on HOW to install specific packages, it focuses on the reasons WHY readers would install a given package.
The book provides authoritative guidance on a range of tools, including open source and proprietary systems. It details the strengths and weaknesses of incorporating big data analysis into decision-making and explains how to leverage the strengths while mitigating the weaknesses.
The book further explores basic statistical concepts that, when misapplied, can be the source of errors. Time and again, big data is treated as an oracle that discovers results nobody would have imagined. While big data can serve this valuable function, all too often these results are incorrect, yet are still reported unquestioningly. The probability of having erroneous results increases as a larger number of variables are compared unless preventative measures are taken.
The authors explain these concepts so that managers can ask their analysts and vendors better questions about whether the methods used to reach a conclusion are appropriate. Because science and medicine have been grappling with similar issues in the publication of studies, the authors draw on the efforts of those fields and apply them to big data.
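The point about erroneous results growing with the number of variables compared can be illustrated with a little arithmetic. The sketch below is not taken from the book. It assumes that every comparison is an independent hypothesis test run at the same significance level (0.05 here, an assumed value), and it uses the Bonferroni correction as one common example of a preventative measure. The function names are hypothetical.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across num_tests tests.

    Assumes the tests are independent and each uses the same
    significance level alpha (an illustrative assumption).
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 10, 100):
        uncorrected = familywise_error_rate(m)
        corrected = familywise_error_rate(m, bonferroni_alpha(m))
        print(f"{m:>3} tests: uncorrected {uncorrected:.3f}, "
              f"Bonferroni-corrected {corrected:.3f}")
```

Under these assumptions the chance of at least one spurious finding rises quickly as comparisons are added, while the corrected per-test threshold keeps the overall rate at or below the chosen level.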
Introduction 
So What Is Big Data?
Growing Interest in Decision Making
What This Book Addresses
The Conversation about Big Data
Technological Change as a Driver of Big Data
The Central Question: So What?
Our Goals as Authors
References
The Mother of Invention’s Triplets: Moore’s Law, the Proliferation of Data, and Data Storage Technology 
Moore’s Law
Parallel Computing, Between and Within Machines
Quantum Computing
Recap of Growth in Computing Power
Storage, Storage Everywhere
Grist for the Mill: Data Used and Unused
Agriculture
Automotive
Marketing in the Physical World
Online Marketing
Asset Reliability and Efficiency
Process Tracking and Automation
Toward a Definition of Big Data
Putting Big Data in Context
Key Concepts of Big Data and Their Consequences
Summary
References
Hadoop 
Power through Distribution 
Cost Effectiveness of Hadoop
Not Every Problem Is a Nail 
Some Technical Aspects
Troubleshooting Hadoop
Running Hadoop
Hadoop File System 
MapReduce
Pig and Hive
Installation
Current Hadoop Ecosystem
Hadoop Vendors 
Cloudera
Amazon Web Services (AWS)
Hortonworks
IBM
Intel
MapR
Microsoft 
To Run Pig Latin Using PowerShell
Pivotal
References 
 
HBase and Other Big Data Databases 
Evolution from Flat File to the Three V’s
Flat File 
Hierarchical Database 
Network Database 
Relational Database 
Object-Oriented Databases 
Relational-Object Databases
Transition to Big Data Databases 
What Is Different about HBase?
What Is Bigtable? 
What Is MapReduce? 
What Are the Various Modalities for Big Data Databases?
Graph Databases 
How Does a Graph Database Work? 
What Is the Performance of a Graph Database?
Document Databases
Key-Value Databases
Column-Oriented Databases 
HBase 
Apache Accumulo
References
Machine Learning 
Machine Learning Basics
Classifying with Nearest Neighbors
Naive Bayes
Support Vector Machines
Improving Classification with Adaptive Boosting
Regression
Logistic Regression
Tree-Based Regression
K-Means Clustering
Apriori Algorithm
Frequent Pattern-Growth
Principal Component Analysis (PCA)
Singular Value Decomposition
Neural Networks
Big Data and MapReduce
Data Exploration
Spam Filtering
Ranking
Predictive Regression
Text Regression
Multidimensional Scaling
Social Graphing
References
Statistics 
Statistics, Statistics Everywhere
Digging into the Data
Standard Deviation: The Standard Measure of Dispersion
The Power of Shapes: Distributions
Distributions: Gaussian Curve
Distributions: Why Be Normal?
Distributions: The Long Arm of the Power Law
The Upshot? Statistics Are Not Bloodless
Fooling Ourselves: Seeing What We Want to See in the Data
We Can Learn Much from an Octopus
Hypothesis Testing: Seeking a Verdict 
Two-Tailed Testing
Hypothesis Testing: A Broad Field
Moving on to Specific Hypothesis Tests
Regression and Correlation 
p Value in Hypothesis Testing: A Successful Gatekeeper?
Specious Correlations and Overfitting the Data
A Sample of Common Statistical Software Packages 
Minitab 
SPSS 
R 
SAS 
Big Data Analytics 
Hadoop Integration 
Angoss 
Statistica 
Capabilities
Summary
References 
 
Google 
Big Data Giants
Google 
Go 
Android 
Google Product Offerings 
Google Analytics 
Advertising and Campaign Performance 
Analysis and Testing
Facebook
Ning
Non-United States Social Media 
Tencent 
Line 
Sina Weibo 
Odnoklassniki 
Vkontakte 
Nimbuzz
Ranking Network Sites
Negative Issues with Social Networks
Amazon
Some Final Words
References 
 
Geographic Information Systems (GIS) 
GIS Implementations
A GIS Example
GIS Tools
GIS Databases
References 
 
Discovery 
Faceted Search versus Strict Taxonomy
First Key Ability: Breaking Down Barriers
Second Key Ability: Flexible Search and Navigation
Underlying Technology
The Upshot
Summary
References 
 
Data Quality 
Know Thy Data and Thyself
Structured, Unstructured, and Semistructured Data
Data Inconsistency: An Example from This Book
The Black Swan and Incomplete Data
How Data Can Fool Us 
Ambiguous Data 
Aging of Data or Variables 
Missing Variables May Change the Meaning 
Inconsistent Use of Units and Terminology
Biases 
Sampling Bias 
Publication Bias 
Survivorship Bias
Data as a Video, Not a Snapshot: Different Viewpoints as a Noise Filter
What Is My Toolkit for Improving My Data? 
Ishikawa Diagram 
Interrelationship Digraph 
Force Field Analysis
Data-Centric Methods 
Troubleshooting Queries from Source Data 
Troubleshooting Data Quality beyond the Source System 
Using Our Hidden Resources
Summary
References 
 
Benefits 
Data Serendipity
Converting Data Dreck to Usefulness
Sales
Returned Merchandise
Security
Medical
Travel 
Lodging 
Vehicle 
Meals
Geographical Information Systems 
New York City 
Chicago CLEARMAP 
Baltimore 
San Francisco 
Los Angeles 
Tucson, Arizona, University of Arizona, and COPLINK
Social Networking
Education 
General Educational Data 
Legacy Data 
Grades and Other Indicators
Testing Results 
Addresses, Phone Numbers, and More
Concluding Comments
References 
 
Concerns 
Part Two: Basic Principles of National Application 
Collection Limitation Principle 
Data Quality Principle 
Purpose Specification Principle 
Use Limitation Principle 
Security Safeguards Principle 
Openness Principle 
Individual Participation Principle 
Accountability Principle
Logical Fallacies 
Affirming the Consequent 
Denying the Antecedent 
Ludic Fallacy
Cognitive Biases 
Confirmation Bias 
Notational Bias 
Selection/Sample Bias 
Halo Effect 
Consistency and Hindsight Biases 
Congruence Bias 
Von Restorff Effect
Data Serendipity 
Converting Data Dreck to Usefulness
Sales
Merchandise Returns
Security 
CompStat 
Medical
Travel 
Lodging 
Vehicle 
Meals
Social Networking
Education
Making Yourself Harder to Track 
Misinformation 
Disinformation 
Reducing/Eliminating Profiles 
Social Media 
Self Redefinition 
Identity Theft 
Facebook
Concluding Comments
References
Epilogue 
Michael Porter’s Five Forces Model 
Bargaining Power of Customers 
Bargaining Power of Suppliers 
Threat of New Entrants 
Others
The OODA Loop
Implementing Big Data
Nonlinear, Qualitative Thinking
Closing
References
Kim H. Pries has four college degrees: a bachelor of arts in history from the University of Texas at El Paso (UTEP), a bachelor of science in metallurgical engineering from UTEP, a master of science in engineering from UTEP, and a master of science in metallurgical engineering and materials science from Carnegie Mellon University.
Pries worked as a computer systems manager, a software engineer for an electrical utility, and a scientific programmer under a defense contract for Stoneridge, Incorporated (SRI). He has worked as software manager, engineering services manager, reliability section manager, and product integrity and reliability director.
In addition to his other responsibilities, Pries has provided Six Sigma training for both UTEP and SRI, as well as cost reduction initiatives for SRI. Pries is also a founding faculty member of Practical Project Management. Additionally, in concert with Jon Quigley, Pries was a cofounder and principal with Value Transformation, LLC, a training, testing, cost improvement, and product development consultancy.
He trained with Project Lead the Way for Introduction to Engineering Design and for Computer Science and Software Engineering. He currently teaches biotechnology, computer science and software engineering, and introduction to engineering design at the beautiful Parkland High School in the Ysleta Independent School District of El Paso, Texas.
 
Robert Dunnigan is a manager with Janus Consulting Partners and is based in Dallas, Texas. He holds a bachelor of science in psychology and in sociology with an anthropology emphasis from North Dakota State University. He also holds a master of business administration from INSEAD, "the business school for the world," where he attended the Singapore campus.
As a Peace Corps volunteer, Robert served more than three years in Honduras developing agribusiness opportunities. As a consultant, he later worked on the Afghanistan Small and Medium Enterprise Development project, traveling the country with his Afghan colleagues and friends in search of opportunities to develop a manufacturing sector.
Robert is an American Society for Quality–certified Six Sigma Black Belt and a Scrum Alliance–certified Scrum Master.