Some days just seem to have a theme that nags at or colors everything that happens, and yesterday was one of those days for me. I’m finishing up a project and need to complete a book chapter on the validation of ABM-based analysis for national security. Importantly, the issue is less about the validation of the models themselves and more about whether analysts can use models to make responsible inferences in support of decisionmaking. It also happened that the director of the Krasnow Institute, where I work, posted this on his own blog, and I came across this older piece on the Wired website about some DARPA work that I’m familiar with. To add one more small piece, a former senior government executive spoke to my department yesterday about the challenges of ‘wicked problems’ in government, particularly in the design, development, and employment of new technologies and the difficulty of getting them to work within the social structure of the government (building a new widget was rarely the difficult part).
So, yesterday was a healthy dose of thinking about how modeling and simulation, particularly social science models, can be used to support the national security community, and about whether we can tell if the work is any good.
I’m not entirely sure how to answer the larger question asked by my institute’s director: whether computational social science can or should be judged by Popper’s criterion of falsifiability. Several years ago my answer would have firmly been ‘yes,’ but today I’m on the fence. My hesitation comes increasingly from the posting about DARPA’s models, the complex and diverse ways in which models can be used, and even what the term ‘scientific’ means. The problem becomes harder still when the realities and needs of policymaking enter into consideration, where choices are among alternative courses of action, only one of which (assuming the decisionmaker chooses from the set analyzed a priori) will actually play out, while the others remain counterfactuals beyond the reach of empirical study.
In fact, after starting some readings on the philosophy of science (something I plan to spend a lot more time on after the summer), it seems that Popper’s definition of science, with its strict falsifiability criterion, has diminished in standing and been increasingly challenged within the scientific community. A simple example from Samir Okasha’s introductory text on the subject: Popper accused Freudian and Marxist theories of being unscientific, noting that observations of people and societies did not conform to the theories’ predictions, and that any adherence to Freudian or Marxist ideas was therefore not rational in light of the evidence. However, Okasha notes that the same problem arose when Adams and Leverrier used Newtonian physics to make predictions about Uranus’ orbit and the prediction turned out to be wrong. Rather than abandon Newton’s theory, they searched for intervening factors that could explain the observed anomaly and discovered Neptune. The impulse to save the theory led to new discoveries that eventually closed the gap between expectation and observation in spite of the initial falsification. So the question arises: how much tinkering, adapting, and additional searching is allowed to save a theory by bolting on new concepts and imagining intervening factors before we throw it out? In the strictest sense, Okasha argues, Popper would have thrown out Newtonian theory the moment its prediction failed, and yet with some additional work it was saved and even played an essential role in the discovery of another planet.
The example above may seem trivial, but I’m convinced that the problem at its core is central to the social sciences, where the complexity of the system and the incompleteness of our observations ensure the survival of multiple competing, plausible explanations for events and ideas. Moreover, in spite of evidence to the contrary, there seems to be a legitimate need to simultaneously save theories that appear to underperform and to develop alternatives to them, so it is tough to declare one pursuit scientific and the other not.
What is more troubling to me than the long-run question of whether computational social science will develop according to Popper’s criteria, or perhaps evolve according to some other standard, is the attitude towards models characterized by the piece on DARPA. The article mentions several people and programs that I’ve worked with, on, or around over the years, so I’m not impartial about any of this. Indeed, these programs were responsible for directing me away from mainstream international relations and towards computational social science and ABM in the first place, when in October of 2001 I started working at National Defense University’s Center for Technology and National Security Policy on a program called Pre-Conflict Management Tools (see here, here, and here for descriptions). This program attempted to tie together a series of technologies, such as web scraping, data mining, computational models, and collaborative planning, into a single system that could give a multilateral set of users the ability to detect emerging conflict and state failure in enough time to mobilize resources and intervene before a major, perhaps irreversible, problem emerged. Indeed, our work was probably best characterized as an effort to implement the methods for robust, adaptive planning developed by RAND’s Pardee Center. Obviously, social science models of conflict and state failure were a core technology, and we took great interest in how to develop and use them.
One significant difference existed between our project and those that DARPA and others have funded more recently: our emphasis was on how to use models in a collaborative and highly uncertain decisionmaking environment. Our interest was not in whether any single model was good at predicting anything, but in whether an ensemble of many alternative models could generate enough cases for different policies to be explored against alternative futures. For our effort, the point was to bring modeling and simulation into the interagency decisionmaking process in order to make it more collaborative, forward looking, and robust to uncertainties and disagreements about the current state of the world, the structure of the system and its future states, and even the goals and motives of stakeholders. From our perspective, it was more important that the ensemble’s models be credible to the users, capturing their own mindsets and organizational beliefs, than predictive in any objective sense. After all, if decisionmakers would reject the findings of any model whose internal structure or logic seemed foreign and not credible to them, then what good would it do to provide them with predictions they would ultimately ignore? We therefore emphasized the need to generate information that was useful to the decision; models played a significant role, but the decision was the focus of our work, not the models. This meant aligning the models with the beliefs of the policymakers and their organizations, so that the theory in the models was deemed consistent with how they thought, rather than introducing alien concepts that would be rejected by users regardless of their precision.
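To make the ensemble idea a bit more concrete, here is a minimal sketch, not our actual system, of how a handful of deliberately different toy models can generate alternative futures against which candidate policies are stress-tested for robustness rather than for predictive accuracy. Every model, policy name, parameter, and score below is invented for illustration.

```python
# Minimal sketch of ensemble-based policy exploration (all models and policies hypothetical).
# The point is not prediction: each toy model encodes a different, plausible belief about how
# instability arises, and policies are judged by how well they hold up across all of them.

import random
from statistics import mean

def model_structural(policy, rng):
    """Toy model: instability driven mostly by economic stress."""
    stress = rng.uniform(0.3, 0.9)
    return stress * (0.5 if policy == "early_aid" else 1.0)

def model_elite_bargain(policy, rng):
    """Toy model: instability driven by elite bargaining breaking down."""
    breakdown = rng.random() < (0.2 if policy == "mediation" else 0.5)
    return 1.0 if breakdown else 0.1

def model_diffusion(policy, rng):
    """Toy model: instability spilling over from neighboring states."""
    spillover = rng.uniform(0.0, 1.0)
    return spillover * (0.7 if policy == "border_support" else 1.0)

MODELS = [model_structural, model_elite_bargain, model_diffusion]
POLICIES = ["early_aid", "mediation", "border_support", "do_nothing"]

def explore(n_futures=1000, seed=1):
    rng = random.Random(seed)
    results = {}
    for policy in POLICIES:
        # Instability scores for this policy across every model and every sampled future.
        scores = [m(policy, rng) for m in MODELS for _ in range(n_futures)]
        results[policy] = {"mean": mean(scores), "worst": max(scores)}
    return results

if __name__ == "__main__":
    for policy, r in explore().items():
        print(f"{policy:>15}: mean instability {r['mean']:.2f}, worst case {r['worst']:.2f}")
```

A policy that looks merely adequate in every model may be preferred over one that is optimal in a single model and disastrous in another; that is the sense of robustness we were after, not a single best prediction.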
This shift in emphasis carried with it a host of challenges and major differences that other programs have poorly understood. Time and again I’ve seen highly skilled modelers present “accurate” predictions only to get zero traction with the decisionmakers they are trying to support. First and foremost, it is important to understand that many, indeed most, of the models are fundamentally flawed for policy purposes. Social science models are generally constructed to explain the largest possible number of cases, and they therefore search for common elements that provide some explanatory or predictive power across multiple cases. As a result, they tend to focus on structural variables, or perhaps procedural ones if they involve the operational practices or rules of important organizations. For example, conflict may be predicted from variables such as a state’s GDP, or perhaps the ratio of GDP between a state and its neighbors, energy consumption, level of military expenditures, population sizes, gender ratios, age distributions, cultural homogeneity or heterogeneity, and so on. In each case, the goal is to identify and operationalize variables that can be measured across time and space, so that we can compare the situation of Germany in 1900 with that of 1914, 1924, and 1939, each time taking new measurements to produce new estimates.
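For readers less familiar with what “structural” means in practice, a toy illustration follows: the same few variables measured for one country at several points in time and combined by a simple scoring rule. The numbers and weights are invented; nothing here is a real dataset or an estimated model.

```python
# Toy illustration of structural operationalization (all values and weights invented).
# The same variables are measured for one country at several points in time and combined
# into a single "conflict risk" score; nothing about specific leaders or decisions appears.

import math

# (year, gdp_ratio_vs_neighbors, military_spend_share_of_gdp, youth_bulge_share)
GERMANY = [
    (1900, 1.1, 0.03, 0.18),
    (1914, 1.2, 0.05, 0.20),
    (1924, 0.8, 0.01, 0.17),
    (1939, 1.3, 0.12, 0.19),
]

WEIGHTS = {"gdp_ratio": 0.8, "mil_share": 25.0, "youth": 4.0, "intercept": -4.0}

def risk(gdp_ratio, mil_share, youth):
    """Logistic score built from structural variables only."""
    z = (WEIGHTS["intercept"]
         + WEIGHTS["gdp_ratio"] * gdp_ratio
         + WEIGHTS["mil_share"] * mil_share
         + WEIGHTS["youth"] * youth)
    return 1.0 / (1.0 + math.exp(-z))

for year, g, m, y in GERMANY:
    print(f"{year}: estimated conflict risk {risk(g, m, y):.2f}")
```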
The structural or procedural emphasis becomes problematic for two reasons: the problem of agents, and the problem of counterfactuals. The agent problem is that structural and procedural models allow for the search and comparison of what is being measured across time and space for a set of cases; what is not measured is essentially set to zero. So it is possible to construct a model that explains World War II as an inevitable result of structural forces and organizational behaviors, but that implies that particular people (Hitler, Stalin, Churchill, Roosevelt, Mussolini, Hirohito, and others) simply didn’t matter. While I can accept that structural and operational forces set the context that limited and motivated the choices of the agents themselves, I cannot accept that their choices were irrelevant, inconsequential, or preordained. Indeed, many of World War II’s most strategically important and far-reaching consequences may only be explainable by considering the individuals involved, such as the decisionmaking that led to the Holocaust or Stalin’s agreement to divide Poland with Hitler.
The question of counterfactuals is related to agents, but not exclusively so. Take any historical event, such as the assassination of Archduke Ferdinand that set the events of World War I in motion, and imagine the existence of three models, each predicting the consequences of the assassination. Model 1 predicts that 100% of the time the assassination occurs, a continental war breaks out. Model 2 makes the same prediction 50% of the time, while no great-power war results in the other 50% of cases. Finally, model 3 predicts the outbreak of major war in only 1% of cases, while the remaining 99% show considerable qualitative diversity. Which is the ‘correct’ model? Can we even determine which is the better one?
The answers to these questions are largely contingent on why we built the model, and they show how academic and policy modeling diverge. On the academic side, the more strongly a model conforms to the known empirical outcome, the more highly it is regarded, so model 1 will likely be judged the best. On the policy side, however, the purpose of the modeling is often to generate many alternative cases to show policymakers what could happen, so model 3 may provide the most valuable information to planners by giving analysts and decisionmakers the most to consider. Indeed, a model 4 that never showed a general war breaking out, and would therefore be rejected under academic standards, may nevertheless provide very valuable insights into the problem and an extended range of cases to explore in the policy process.
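The thought experiment can be made concrete with a small sketch that scores the three hypothetical models two ways: by how often they reproduce the historical outcome (the academic view) and by how many qualitatively different cases they generate for planners to consider (the policy view). All models, outcomes, and probabilities are the hypothetical ones described above, with invented outcome labels.

```python
# Sketch of the thought experiment above (hypothetical models, outcomes, and scoring).
# Model 1 always predicts war, model 2 does so half the time, model 3 rarely does but
# produces a wide spread of qualitatively different futures. "Academic" scoring rewards
# matching the known outcome; a "policy" view also values the diversity of cases generated.

import random
from collections import Counter

HISTORICAL_OUTCOME = "continental_war"
OTHER_OUTCOMES = ["localized_war", "diplomatic_crisis", "status_quo", "regional_realignment"]

def model_1(rng):
    return "continental_war"

def model_2(rng):
    return "continental_war" if rng.random() < 0.5 else "status_quo"

def model_3(rng):
    return "continental_war" if rng.random() < 0.01 else rng.choice(OTHER_OUTCOMES)

def evaluate(model, runs=10_000, seed=7):
    rng = random.Random(seed)
    outcomes = Counter(model(rng) for _ in range(runs))
    academic_fit = outcomes[HISTORICAL_OUTCOME] / runs   # share of runs matching what happened
    distinct_cases = len(outcomes)                       # crude measure of qualitative diversity
    return academic_fit, distinct_cases

for name, m in [("model 1", model_1), ("model 2", model_2), ("model 3", model_3)]:
    fit, diversity = evaluate(m)
    print(f"{name}: fit to history {fit:.2f}, distinct outcome types {diversity}")
```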
Additionally, it is important to realize that the purpose of policymaking, and the nature of policymakers, is to act on and change the world. To imply that the world is structurally deterministic is to tell them their choices do not matter because forces beyond their control have already set the future. It also means that policymakers are not interested in a model’s performance in explaining a population of cases, e.g. all civil wars; they are focused on the one case they must manage and make very difficult decisions about. The specific character of their accountabilities and responsibilities makes them more likely to seek the advice and wisdom of specialists such as historians, anthropologists, and area experts than of generalists (which modelers tend to be), because of the need for very specific insights and knowledge.
Finally, it is critical to note that explanation and prediction are not the same. Time and again, the assumption is that policymakers want to know what will happen, when in actuality they want to know why. Thus, analysts spend more time tracing possible causal paths within the system than simply deriving numerical forecasts. The result is that many of the most powerful predictive tools, such as statistical models, are surprisingly ineffective in support of strategic decisionmaking. Many of the modeling tools pursued by organizations like DARPA and IARPA may prove difficult to transition, regardless of their laboratory success, because they pursue prediction at the expense of explanation.
How does this come back to DARPA? The discussion of batting averages for predictions (who predicts at 90% accuracy, who gets it mostly right, etc.) is simply not helpful or relevant, and the idea that models and humans are in some sort of bake-off to see which predicts better serves no actual policymaking purpose. As every student of intelligence knows, better predictions do not lead to better policies, because it is often the institutional, organizational, and personal relationships between analysts and policymakers that determine whether policymakers will entertain, or even listen to, information that doesn’t conform to their existing beliefs. Unless there is a desire to replace the policymaker (not the analyst) with computers, better models will be surprisingly unhelpful to the policy community across the full range of issues that need to be addressed. I’m even less comfortable with this because, as Herbert Simon argued, models can’t be responsible for ethical judgment, regardless of their accuracy.
There is also a perversion in the entire process as described in the DARPA piece (and I realize it is talking about more agencies than DARPA alone). The piece simultaneously notes that the models must be populated with expert opinion and that they are then run against those same experts in an effort to outperform them. What exactly should modelers tell a junior or senior analyst: “fill out my spreadsheet so we can feed this model and put you out of a job?” Having spent the last decade working this problem, I can definitively state that analysts do not appreciate having their experience and expertise reduced to a data set, particularly to feed models that presume that the majority of what they study and analyze (the nuances and peculiarities of specific people and places) does not matter in the face of structural concerns.
Again, all of my experience comes back to the belief that the modeling community should be working with analysts to help them support decisionmakers by expanding the range of ideas and cases they can consider, particularly in the face of uncertainty about the world and given the lack of empirical means for considering alternative scenarios or counterfactuals (I’ve spent a couple of weeks debating whether it is possible to have a counterfactual about the future, but that will be the topic of another posting at some point far, far away). My heart sinks the more I see well-intentioned and very intelligent people thrash about, not recognizing that they are focusing their energies in ways that run against the actual needs of analysts and the policymakers they support. I was disheartened to see our original project from 2001 shift from model use to model development, simply because I think it has reoriented the community towards work that is less relevant to the needs of decisionmakers who are in desperate need of the support that computational social science can provide.
OK, this is only a comment on one small part of this posting. Is it reasonable to put Popper in the same bucket as Snow’s “Two Cultures” in terms of the damage done? Even Popper did not think most of what we would call physical science met his criteria for falsifiability. The underpinnings of Popperian “science” are purely theoretical and of little true utility. McCloskey argues that statistical tests of significance based on falsifiability have done as much harm as good. It is frustrating that the need to conform to a mathematical definition of science consumes so much potentially useful thinking. In a field as new and poorly understood and defined as computational social science, becoming obsessed with falsifiability is a red herring of significant size.
Or, in a more humorous and nasty vein: as the question is being asked, I can see the Zen master raising his staff, ready to strike the questioner and proclaim, “Stupid question.”