“Torture the data enough and it will confess to anything.” Ronald Coase, Nobel Prize Laureate in Economics
As Foster Provost and Tom Fawcett explain in the Preface, the concepts they examine fall into one of three types:
“1. Concepts about how data science fits into the organization and the competitive landscape, including ways to attract, structure, and nurture data science teams; ways for thinking about how data science leads to competitive advantage; and tactical concepts for doing well with data science projects.
2. General ways of thinking data-analytically. These help in identifying appropriate data and consider appropriate methods. The concepts include the data mining process as well as the collection of different high-level data mining tasks.
3. General concepts for actually extracting knowledge from data, which undergird the vast array of data science tasks and their algorithms.”
There you have the nature and extent of the WHAT on which the information, insights, and counsel focus. Provost and Fawcett devote most of their attention to explaining HOW to apply these concepts to achieve high-impact data mining driven by data-analytic thinking. I share their belief “that explaining data science around such fundamental concepts not only aids the reader, it also facilitates communication between and among business stakeholders and data scientists. It provides a shared vocabulary and enables both parties [data scientists and non-data scientists such as I] to understand each other better. The shared concepts lead to deeper discussions that may uncover critical issues otherwise missed.”
These are among the dozens of business subjects and issues of special interest and value to me, also listed to indicate the scope of Provost and Fawcett’s coverage.
o From Big Data 1.0 to Big Data 2.0 (Pages 9-13)
o From Business Problems to Data Mining Tasks (19-23)
o The Data Mining Process (26-34)
o Other Analytics Techniques and Technologies (35-41 and 187-208)
o Selecting Informative Attributes (49-56)
o Supervised Segmentation with Tree-Structured Models (62-67)
o Class Probability Estimation and Logistic “Regression” (97-100)
o Overfitting (113-119)
Note: Overfitting is the tendency to tailor models to the training data at the expense of generalizing to previously unseen data.
o Correlation of Similarity and Distance (142-144)
o Some Important Technical Details Relating to Similarities and Neighbors (157-161)
o Stepping Back: Solving a Business Problem Versus Data Exploration (183-185)
o A Key Analytical Framework: Expected Value (194-204)
o A Model of Evidence “Lift” (244-246)
o Decision Analytic Thinking II: Toward Analytic Engineering (279-289)
o Co-occurrences and Associations: Finding Items That Go Together (292-298)
o Bias, Variance, and Ensemble Methods (308-311)
o Sustaining Competitive Advantage with Data Science (318-323)
As I worked my way through the book a second time while preparing to write this review, I was again reminded of comments by Eric Schmidt, executive chairman of Google: “From the dawn of civilization until 2003, mankind generated five exabytes of data. Now we produce five exabytes every two days…and the pace is accelerating.” The challenges created by this accumulation of data will grow correspondingly greater. Provost and Fawcett wrote this book not only for those who must manage that process but also for the instructors who are now preparing them to do so.