The Unreasonable Effectiveness of Data

Here is a brief excerpt from an article co-authored by Alon Halevy, Peter Norvig, and Fernando Pereira, written for IEEE Intelligent Systems magazine and featured at the computer.org website. IEEE Intelligent Systems is a bimonthly publication of the IEEE Computer Society that provides peer-reviewed, cutting-edge articles on the theory and applications of systems that perceive, reason, learn, and act intelligently. The editorial staff collaborates with authors to produce technically accurate, timely, useful, and readable articles as part of a consistent and consistently valuable editorial product. To learn more about the Society, please click here.

* * *

Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” [1] examines why so much of physics can be neatly explained with simple mathematical formulas such as F = ma [Newton’s Second Law: the force acting on an object equals the object’s mass times its acceleration] or E = mc² [mass-energy equivalence, from Einstein’s special theory of relativity: energy equals mass times the speed of light squared]. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages [2]. Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

One of us, as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words [3]. Since then, our field has seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long [4]. In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus (along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions) captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks, if only we knew how to extract the model from the data.
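To make "frequency counts for all sequences up to five words long" concrete, here is a minimal Python sketch that tallies every contiguous word sequence of length one through five in a toy sentence. The function name `ngram_counts`, the sample text, and the whitespace tokenization are illustrative assumptions, not a description of how the Google corpus was actually built; counting at Web scale would need distributed processing and far more careful tokenization.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every contiguous word sequence of length 1..max_n.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy stand-in for a corpus (an assumption); real Web text is far messier.
text = "the cat sat on the mat and the cat slept"
tokens = text.lower().split()
counts = ngram_counts(tokens)

print(counts[("the",)])        # occurrences of the single word "the"
print(counts[("the", "cat")])  # occurrences of the two-word sequence "the cat"
```

Even this tiny example shows why such tables grow quickly: a text of N words contributes roughly 5N sequence occurrences, and most longer sequences appear only once.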

[1] E. Wigner, “The Unreasonable Effectiveness of Mathematics in the Natural Sciences,” Comm. Pure and Applied Mathematics, vol. 13, no. 1, 1960, pp. 1–14.

[2] R. Quirk et al., A Comprehensive Grammar of the English Language, Longman, 1985.

[3] H. Kucera, W.N. Francis, and J.B. Carroll, Computational Analysis of Present-Day American English, Brown University Press, 1967.

* * *

To read the complete article, please click here.

Alon Halevy is a research scientist at Google. Contact him at halevy@google.com.

Peter Norvig is a research director at Google. Contact him at pnorvig@google.com.

Fernando Pereira is a research director at Google. Contact him at pereira@google.com.
