For more than a decade now I have been underwhelming MBA students, in the first weeks after they arrive at my school, with a workshop entitled “The Aqueduct Challenge.” The challenge is simple enough: given the existence of the Roman aqueduct in Segovia (pictured above), what can we conclude was the population of the city?
I originally conceived the exercise as an offbeat estimation problem, of the kind then much in vogue at top-shelf employers – McKinsey, Google, Microsoft, etc. – like “how many hairdressers are there in Singapore?” (Incidentally, this is a poor strategy for candidate screening, as I’ll discuss briefly at the end of this post.) The idea was to use an example from the Roman period (since I am a historian) to exercise the critical thinking and analytical tool-sets needed to come up with some kind of an answer. Students, working in groups, apply their critical thinking skills to this question of Roman-era Segovian demographics for an hour, after which we have a lively discussion dissecting the answers from each group.
Which were all invariably wrong.
For the first maybe five years of running the exercise I didn’t think too much about this 100% failure rate in getting the right answer. But after more than a hundred iterations of the exercise (and thousands of wholly incorrect answers), it dawned on me that this was statistically anomalous.
It seems I had inadvertently created the world’s worst estimation question.
Ordinarily, the way to resolve estimation questions is to establish a simple framework and then to fill in the missing variables by making assumptions. For example, to answer how many hairdressers there are in Singapore, we can (1) start by estimating how often an individual visits a hairdresser per year, (2) work out a total number of visits per year using a population estimate for Singapore, and (3) estimate how many clients a hairdresser can see in a day. Doing simple arithmetic, we can work out how many hairdressers are needed to meet annual demand, and then add in a few tweaks to look clever (e.g. skew by age & gender, types of hairdresser visits, or account for hairdresser downtime/vacation days) and come up with a credible answer. Et voilà.
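The three-step framework above can be sketched as a few lines of arithmetic. Every figure here is an illustrative assumption of my own, not researched data – the point is the structure, not the numbers:

```python
# Back-of-the-envelope: how many hairdressers does Singapore need?
# All figures below are illustrative assumptions, not researched data.

population = 5_500_000        # rough population of Singapore (assumed)
visits_per_person_year = 8    # assume a haircut roughly every 6-7 weeks

total_visits_year = population * visits_per_person_year

clients_per_day = 10          # assume one hairdresser sees ~10 clients a day
working_days_year = 250       # assume weekends and vacations trim the year

visits_per_hairdresser_year = clients_per_day * working_days_year
hairdressers_needed = total_visits_year / visits_per_hairdresser_year

print(round(hairdressers_needed))  # 17600 under these assumptions
```

Swap in different assumptions at any step (the age/gender skew, downtime, and so on) and the final figure moves, which is exactly why more eyeballs on each blank tend to sharpen the answer.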
In general, for estimation questions such as these, I would predict that the more people who collaborate on answering the question, the more accurate it will become. This is because at every point where we have to fill in a blank, such as how often an individual visits the hairdresser per year, the more eyeballs on the problem, the more accurate the estimation.
For example, let’s suppose I ask “estimate the possible market size of people who will embrace the metaverse, grateful that tech has finally solved the misery of interacting with people in the real-world now that we can hide behind an eternally-cheery jack-in-the-box pikachu avatar (now with legs!).” If the one person I ask is the CEO of Meta then the answer will be: everyone. Or if the person I ask is a particularly ornery professor of humanities at IE Business School I know, I will get another answer: nobody. Put both those answers together and we’re certain to arrive at a closer approximation of the truth (which is six).
Back to my aqueduct problem. As any student who can vaguely recall suffering through the experience can attest, the parameters are very straightforward and the number of variables you need to take into account very small. As a group exercise, then, it ought to follow this rule: greater accuracy from more inputs. Yet since no group ever gets the question right, it would seem quite useless – not just as a means for evaluating the kind of “critical thinking” logic needed to solve such problems, but indeed as an exercise in exploring critical thinking skills generally. It’s just a mean-spirited exercise designed to waste two hours of someone’s life that they are not getting back. And, worse: that they are paying for. Not nice.
What was the problem?
Well, indeed, my aqueduct challenge would be completely useless as an interview question. But it turns out there is an interesting insight that comes from that zero success-rate, and it has nothing to do with any individual’s ability to parse numbers to make credible estimation guesses. Instead, my little exercise reveals a problem we can encounter in the context of collective problem solving.
Now, there are broadly two ways of thinking about collective problem solving.
In the opening of his book, The Wisdom of Crowds, James Surowiecki recounts the experience of Francis Galton, an eminent Victorian scientist, who attended a country fair where a contest was being held to guess what an ox would weigh once slaughtered and dressed. Being old, grouchy, and firmly convinced of the general idiocy of the population as a whole, Galton expected the answers to be wildly inaccurate. Instead, after tabulating all of the guesses (almost 800 of them) and taking their average, he found that:
The crowd had guessed that the ox, after it had been slaughtered and dressed, would weigh 1,197 pounds. After it had been slaughtered and dressed, the ox weighed 1,198 pounds. In other words, the crowd’s judgment was essentially perfect.
This was not a one-off result. It has been repeatedly demonstrated that there is an astonishing accuracy in the collective result of individual choices. This captures one way of thinking about collective problem solving: that is, as the sum of individual problem-solving collectivized.
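The mechanism behind Galton’s result is easy to demonstrate: if individual errors are independent and not systematically biased, they largely cancel out in the average. A minimal simulation, using made-up error figures (the 150-pound spread is my assumption, not Galton’s data):

```python
import random

random.seed(42)

true_weight = 1198  # the ox's dressed weight, per Surowiecki's account

# Simulate ~800 fair-goers, each guessing with substantial individual error
# (normally distributed around the truth, standard deviation 150 pounds).
guesses = [random.gauss(true_weight, 150) for _ in range(800)]

crowd_estimate = sum(guesses) / len(guesses)
print(round(crowd_estimate))  # lands within a few pounds of 1198
```

Most individual guesses here are off by 100 pounds or more, yet the average is strikingly close to the truth. The crucial assumption is independence: the guessers do not talk to one another – which is precisely the condition my aqueduct exercise violates, as we shall see.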
But then there is truly collaborative collective problem solving. As Steven Sloman and Philip Fernbach note in their book The Knowledge Illusion,
Intelligence is not a property of an individual; it's a property of a team. The person who can solve hard math problems can certainly make a contribution, but so can the person who can manage group dynamics, as can a person who can remember the details of an important encounter. We can't measure intelligence by putting a person alone in a room and giving him or her a test; we can measure it only by evaluating the production of groups that the person is a part of.
This is where my exercise comes in. Unlike individuals in a crowd looking at an ox and guessing a weight, my problem requires a kind of collective thinking out loud. If we follow Sloman and Fernbach’s logic, we would conclude that there is no intelligence whatsoever, given the inaccurate results of the group output. But that obviously cannot be right. Instead, it reveals precisely the kind of problems we are likely to encounter when we leverage our critical thinking skills in professional contexts, which typically involve collaborative collective effort. Simply put, individual ideas and contributions will be affected by the social context in which they occur.
As Surowiecki suggests, a flaw in current management thinking is the over-democratization of decision-making. Rather than harness the power of the collective accuracy of individual choices, this has led instead to the increased bureaucratization of the firm, driven by layers of input derived through meetings and consultations. “Companies that tried to make the decision-making process more ‘democratic,’” he notes, “[ended up with] endless discussion rather than a wider distribution of decision-making power.” (That probably sounds familiar.)
Individuals in such instances are not making individual choices in isolation that are later tabulated and collectivised. Instead, they are making individual contributions that are shaped by the collective social environment in which they are being formulated.
So, to get to the heart of the matter, what I think my aqueduct exercise demonstrates is a simple and harmless instance of a much larger issue. When an individual has to make an assumption in order to solve a problem, they will of course have a reason for believing that assumption to be broadly true. If this leads, however, to a very poor outcome, they might, at the individual level, be prompted to rethink their ideas, or at least feel rather flustered – a sense that something is wrong.
But when that same person solves the same problem collectively with other people who also share that assumption, then its power to inform their collective thinking increases exponentially, even if the outcome remains highly dubious. The ability to then rethink assumptions that are derived collectively becomes exponentially reduced as a function of collective thinking itself.
To be clear, I am not talking about the kind of assumptions that might inform, for example, our political views. Those are rarely agnostic, as they often have social, emotional and even intellectual attachments. Instead, I am talking about the assumptions we make that are generally neutral, or at least do not engage us strongly at an emotional or intellectual level.
The really devious thing about this is that, as I noted above, in general the more eyeballs on a problem, the more we should have confidence in the resulting conclusion. It’s a good thing! More input = better output. So broadly speaking, when we engage in forms of collective reasoning, we have a strong basis for expecting that the outcomes will be more robust.
This can understandably lead to a kind of confidence bias. This itself is a subspecies of confirmation bias – our tendency to place greater weight on things that we are already inclined to support or believe. Testing out assumptions is never easy, but testing them in a collective context can, paradoxically, be harder still, if those assumptions are broadly shared. We thus end up with a misplaced confidence in our approach and conclusions, precisely because of the collective environment in which they have been derived.
For instance, I recently wrote about the basic use-case problem in crypto, and this might serve as an example. While there are any number of potential use-cases to support belief in the future of crypto, the prevailing one is built from the assumption that major currencies are unsustainable over the long-term because of systemic conflicts of interest and mismanagement in state-driven monetary policy. The assumption is that fiat currencies are not a good long-term store of value. For any number of reasons, that is a tenuous assumption. But a primary driver in pushing up the value of crypto was precisely this shared underlying assumption, thus essentially elevating it from highly uncertain variable to established fact. And that creates a dangerous foundation.
We need, therefore, to bear in mind that the heuristics and assumptions we use almost daily to navigate the world can at times undermine our critical abilities – and that, paradoxically, these failures can be dangerously amplified when others agree with us, even as the answers they lead to (e.g. that we should go all in on the metaverse as the future of humanity) suggest we should reconsider.
That, then, is, I think, the real value of my otherwise terrible exercise. It forces us to confront not just the limits of our critical thinking, but how our critical thinking works in a collective setting. Since many bad corporate outcomes are arrived at through a formal collective decision-making process, perhaps it is not the worst lesson.
By way of conclusion: I noted above that estimation-type questions used at various firms are a poor screening strategy, and in this I follow the insights of Laszlo Bock, the former head of HR at Google. As he points out in his excellent book Work Rules!,
the case interviews and brainteasers used by many firms [are worthless]. These include problems such as: “Your client is a paper manufacturer that is considering building a second plant. Should they?” or “Estimate how many gas stations there are in Manhattan.”… Performance on these kinds of questions is at best a discrete skill that can be improved through practice, eliminating their utility for assessing candidates. At worst, they rely on some trivial bit of information or insight that is withheld from the candidate, and serve primarily to make the interviewer feel clever and self-satisfied. They have little if any ability to predict how candidates will perform in a job.
He provides some links to the relevant literature if you are interested in learning more. But as someone who does quite well at such questions yet would also indubitably be a very poor performer, I can say that seems about right.