Why Data Science Teams Need Generalists, Not Specialists


Here is an excerpt from an article written by Eric Colson for Harvard Business Review and the HBR Blog Network. To read the complete article, check out the wealth of free resources, obtain subscription information, and receive HBR email alerts, please click here.


*     *     *

In The Wealth of Nations, Adam Smith demonstrates how the division of labor is the chief source of productivity gains using the vivid example of a pin factory assembly line: “One [person] draws out the wire, another straights it, a third cuts it, a fourth points it, a fifth grinds it.” With specialization oriented around function, each worker becomes highly skilled in a narrow task, leading to process efficiencies. Output per worker increases manyfold; the factory becomes extremely efficient at producing pins.

This division of labor by function is so ingrained in us even today that we are quick to organize our teams accordingly. Data science is no exception. An end-to-end algorithmic business capability requires many functions, and so companies usually create teams of specialists: research scientists, data engineers, machine learning engineers, causal inference scientists, and so on. Specialists’ work is coordinated by a product manager, with hand-offs between the functions in a manner resembling the pin factory: “one person sources the data, another models it, a third implements it, a fourth measures it” and on and on.

Alas, we should not be optimizing our data science teams for productivity gains; that is what you do when you know what it is you’re producing—pins or otherwise—and are merely seeking incremental efficiencies. The goal of assembly lines is execution. We know exactly what we want—pins in Smith’s example, but one can think of any product or service in which the requirements fully describe all aspects of the product and its behavior. The role of the workers is then to execute on those requirements as efficiently as possible.

But the goal of data science is not to execute. Rather, the goal is to learn and develop profound new business capabilities. Algorithmic products and services like recommendation systems, client engagement bandits, style preference classification, size matching, fashion design systems, logistics optimizers, seasonal trend detection, and more can’t be designed up-front. They need to be learned. There are no blueprints to follow; these are novel capabilities with inherent uncertainty. Coefficients, models, model types, hyperparameters, all the elements you’ll need must be learned through experimentation, trial and error, and iteration. With pins, the learning and design are done up-front, before you make them. With data science, you learn as you go, not before you go.

In the pin factory, when learning comes first, we neither expect nor want the workers to improvise on any aspect of the product, except to produce it more efficiently. Organizing by function makes sense since task specialization leads to process efficiencies and production consistency (no variations in the end product).

But when the product is still evolving and the goal is to learn, specialization hinders our goals in several ways.

[Here is the first.]

1. It increases coordination costs. Those are the costs that accrue in time spent communicating, discussing, justifying, and prioritizing the work to be done. These costs scale super-linearly with the number of people involved. (As J. Richard Hackman taught us, the number of relationships (r) grows as a function of the number of members (n) per this equation: r = (n^2 - n) / 2. And each relationship bears some amount of coordination cost.) When data scientists are organized by function, the many specialists needed at each step, and with each change, and each handoff, and so forth, make coordination costs high. For example, statistical modeling specialists who want to experiment with new features will have to coordinate with data engineers who augment the data sets every time they want to try something new. Similarly, every new model trained means the modeler will need someone to coordinate with for deployment. Coordination costs act as a tax on iteration, making it more difficult and expensive, and more likely to dissuade exploration. That can hamper learning.
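
To make the super-linear growth concrete, here is a minimal sketch that evaluates the equation r = (n^2 - n) / 2 for a few team sizes. The function name relationship_count and the sample team sizes are illustrative choices, not part of the original article.

```python
def relationship_count(n: int) -> int:
    """Pairwise relationships among n team members: r = (n^2 - n) / 2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # n^2 - n = n * (n - 1) is always even, so integer division is exact.
    return (n * n - n) // 2


# Hypothetical team sizes, chosen only to illustrate the growth.
for n in (2, 5, 10, 20):
    print(f"{n} members -> {relationship_count(n)} relationships")
# 2 members -> 1 relationships
# 5 members -> 10 relationships
# 10 members -> 45 relationships
# 20 members -> 190 relationships
```

Doubling the team from 10 to 20 members roughly quadruples the number of relationships (45 to 190), which is the sense in which the costs grow faster than headcount.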

*     *     *

Here is a direct link to the complete article.

Eric Colson is Chief Algorithms Officer at Stitch Fix. Prior to that he was Vice President of Data Science and Engineering at Netflix. @ericcolson

