Abstract
Probability plays a ubiquitous role in decision-making through a process in which we use data from groups of past outcomes to make inferences about new situations. Yet in recent years, many forensic mental health professionals have become persuaded that overly wide confidence intervals render actuarial risk assessment instruments virtually useless in individual assessments. If this were true, the mathematical properties of probabilistic judgments would preclude forensic clinicians from applying group-based findings about risk to individuals. As a consequence, actuarially based risk estimates might be barred from use in legal proceedings. Using a fictional scenario, I seek to show how group data have an obvious application to individual decisions. I also explain how misunderstanding the aims of risk assessment has led to mistakes about how, when, and why group data apply to individual instances. Although actuarially based statements about individuals' risk have many pitfalls, confidence intervals pose no barrier to using actuarial tools derived from group data to improve decision-making about individual instances.
Over the past two decades, forensic mental health professionals have developed several actuarial tools for assessing the risk that an individual will engage in future criminal or aggressive behavior.1–4 An actuarial risk assessment instrument (ARAI5) implements a procedure for obtaining, weighting, and combining a relatively small number of prespecified items to yield a numerical judgment concerning the probability of future violence. The empirical underpinnings of these algorithms and probability judgments come from studies of reference groups in which the same data items and outcomes were gathered and evaluated.
ARAIs have received much criticism. By their very design, they depend on relationships established in specific populations at specific times in the past, and these relationships may not apply, or may not apply in exactly the same way, to future populations living in different social contexts and circumstances.6,7 The creators of some ARAIs recommend that evaluators accept their risk estimates rigidly (e.g., Ref. 8, p 182), without allowing for the potential presence of other factors with clear relationships to risk that “a prudent evaluator will always consider” (Ref. 9, p 3). Practicing clinicians can be tempted by the apparent definitiveness of numerical values to apply ARAIs uncritically or beyond their limited areas of established application, with results that can be misleading and prejudicial in legal contexts.10–12
The criticism of ARAIs that has aroused the most professional consternation in recent years involves a “controversy [that] relates to the applicability of group-derived risk estimates to an individual case” (Ref. 7, p 180). The controversy stems from mathematical claims set out in three publications by Hart, Cooke, and Michie (HCM), that the confidence intervals (CIs) for individual risk estimates are so wide “as to render risk estimates virtually meaningless” (Ref. 5, p s60). HCM initially made their case5 using previously published data for the Violence Risk Appraisal Guide (VRAG)8 and the STATIC-99.9 More recently, Hart and Cooke used logistic regression methods to conclude that ARAIs cannot “estimate the specific probability or absolute likelihood of future violence with any reasonable degree of precision or certainty” (Ref. 13, p 81). If correct, this conclusion would represent a “brick wall limiting predictive accuracy at the individual level” (as one commentator put it14). Hart and Cooke concluded that “it is difficult to understand how ARAIs can be found legally admissible under Daubert or similar criteria … when the margins of error for individual risk estimates made using the tests are large, unknown, or incalculable” (Ref. 13, p 97).
Critics15 have pointed out that the assertions of Hart and colleagues imply that people are mistaken when they do things that seem perfectly logical and rational. Yet the HCM argument has perplexed or persuaded many psychologists and psychiatrists. For example, DeClue and Zavodny have advised forensic mental health professionals not to report estimates of individual risk because “Hart and Cooke persuasively show that the lack of precision is not a limitation in one sample or one tool, but is endemic to attempts to make such predictions about individuals” (Ref. 16, p 149).
There are many good reasons for not making ARAI-based statements about individuals' risk of recidivism, but the mathematical argument offered by HCM is not one of them. The HCM argument errs in assuming implicitly that the purpose of risk assessment and probabilistic judgment is to make a prediction of something. Usually, however, we assess probabilities and risks to decide what to do, given the information we have, when the outcome is uncertain. Once this main purpose is clarified, the problems with the HCM argument become easier to see.
In this article, I summarize Hart and Cooke's most recent publication,13 which they regard as an improvement on their earlier statements of the HCM thesis. Then, using a data set discussed by HCM, I describe a hypothetical betting scenario to convince readers that useful risk estimates (or probabilities) for individual instances flow naturally and obviously from information about groups of outcomes. Having convinced readers that it is practical and sensible to use group-derived probabilities for decisions about individual instances, I examine several key assertions by HCM to explain where their notions were valid and where their mathematical assertions led them astray.
The HCM Argument
As Hart and Cooke explain, ARAIs are tools “designed to estimate the likelihood of future criminal or violent behavior” (Ref. 13, p 81) and to “make individual risk estimates” (p 83) of the form “the risk that Jones will commit future violence is similar to the risk of people” (p 82) in a group with characteristics similar to those of Jones. Not all members of a group look or behave alike, however. HCM therefore “tried to distinguish between the precision of risk estimates at the aggregate or group level versus precision at the individual level” (p 83).
Suppose one draws a random sample of size n from a much larger group of persons, some of whom carry a particular trait. One can then estimate the proportion of persons in the group who have the trait by counting the number of persons in the sample with the trait, then dividing by n. “Like all sample statistics,” state Hart and Cooke, “the proportion estimated is associated with a degree of uncertainty, or ‘margin of error’” (Ref. 13, p 83). The size of the error depends, in part, on the size of n; the larger n is, the smaller the calculated margin of error, and vice versa.
To make statements about whether an individual from the group has the trait, say Hart and Cooke, one might want “to calculate the margin of error for individual propensities inferred from group risk estimates” (Ref. 13, pp 84–5). In their first article, HCM “employed an ad hoc procedure” to estimate a confidence interval for this individual propensity: they chose a formula developed by Wilson17 and set n = 1 to calculate the precision of risk estimates for individuals. The resulting intervals were so broad as to encompass most of the possible 0-to-1 probability range, leading HCM to conclude that ARAIs “appeared to have some (albeit weak) predictive validity at the group level,” but “the margins of error for individual risk estimates made using ARAIs are either large, unknown, or incalculable” (Ref. 13, p 85).
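To see concretely why setting n = 1 has this effect, consider the Wilson score interval, which I take to be the Ref. 17 formula that HCM applied; here p̂ is the observed proportion and z ≈ 1.96 for a 95 percent interval:

$$
\frac{\hat{p} + z^2/(2n)}{1 + z^2/n} \;\pm\; \frac{z}{1 + z^2/n}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}
$$

Every term that narrows the interval carries n in a denominator, so once n is set to 1, the size of the group that actually produced p̂ plays no further role and the interval balloons toward the full 0-to-1 range.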
In response to what HCM interpreted as criticisms of their “ad hoc procedure,” Cooke and Michie18 used multivariate logistic regression “to predict the probability of a categorical outcome variable” (Ref. 13, p 86). They produced what they interpreted as prediction intervals and found that “the corresponding precision of individual probability estimates for offenders … was very low” (Ref. 13, p 86), with values again spanning nearly the entire possible 0-to-1 probability range.
In the third article, Hart and Cooke19 used four item scores on the Sexual Violence Risk-20 administered to 90 sex offenders as the independent variables in a logistic regression “to evaluate the precision of individual risk estimates [here, of sex offense recidivism] made using ARAIs” (Ref. 13, p 88). They then generated 90 “individual risk estimates and their margins of error” (i.e., a 95 percent confidence interval for each individual's risk estimate). These intervals “overlapped completely” within the low- and high-risk groups, “and almost completely across groups save for a handful of cases …. These findings clearly illustrate that it was virtually impossible to make meaningful distinctions among subjects based on individual risk estimates made using ARAI scores” (Ref. 13, p 93). Hart and Cooke concluded that ARAIs are mathematically incapable of “estimat[ing] the specific probability or absolute likelihood that an individual person will commit violence in the future with any reasonable degree of precision or certainty” (Ref. 13, p 95).
Were HCM correct? To answer this, let us place some group data used by HCM5 in a concrete (but hypothetical) context to see whether such data can yield probabilities precise enough to prove useful in individual instances.
Aunt Dorothy's Bequest
The morning after their Aunt Dorothy's funeral, her surviving kin—nephews Jim and Steve, and nieces Kathy and Mary—sat at Dorothy's kitchen table as Jim, the executor of Dorothy's estate, read provisions of their late aunt's will. After covering disposition of major financial assets and other items, Jim read this paragraph:
I hereby bequeath my penny collection to my nieces Kathy and Mary, on condition that, during the six months after my death, they will honor my memory and amuse themselves by using the collection to engage in low-stakes games of chance.
Dorothy's penny collection occupied nine large jars with labels indicating that the contents of each were collected in one of the nine years before Dorothy's death. Kathy and Mary decided that for the next half year, they would make small bets on whether individual pennies drawn from the collection came from the Philadelphia or the Denver mint. (Denver pennies bear a D just below the year of mintage; pennies minted in Philadelphia have no letter below the year.)
Each evening over the next six months, Steve, Kathy, and Mary held three-way phone calls during which Steve took a jar, mixed its contents thoroughly, reached in blindly, drew a penny, and held it while Kathy and Mary used the following betting process:
The sisters took turns naming a price for a ticket like the one shown in Figure 1 that paid $1.00 for a Denver penny, but nothing otherwise. (Prices in fractional cents were allowed.)
After one sister set the price, the other sister announced whether she would buy or sell the ticket. Then Steve announced the outcome (D or no D).
Six such bets occurred each evening, with Mary and Kathy each naming three prices, and after determining the result of each bet, Steve returned the penny to the jar from which he had drawn it.
Figure 1. Ticket of the type bought and sold by Mary and Kathy as they bet on outcomes of penny drawings.
From the outset, Kathy and Mary played only for bragging rights; rather than keep the money, they planned to use their betting proceeds to buy dinner for everyone after six months.
Before the betting started, Jim had told Steve, “You know, Dorothy mentioned something about the penny collection a few months ago. When she started collecting the pennies, almost all of them had Denver mint marks. But over the nine years, Denver pennies became less and less common. When you draw each penny, tell Kathy and Mary which jar it came from before they place their bets. It might affect the betting odds that they agree on.” Steve told Kathy what Dorothy had said, but he forgot to tell Mary. So, although both sisters heard which jars each of the pennies came from, only Kathy knew about the trend her aunt had observed.
Fifteen weeks and 103 phone calls later, 618 bets had taken place. From the outset, Kathy and Mary kept track of the drawings' results and accumulated the data shown in the first three columns of Table 1. The first column, labeled y, indicates how many years (y = {1, 2, …, 9}) before Dorothy's death the jar's pennies were collected. The second column, labeled r, shows the number of Denver pennies observed, and the third column, labeled n, shows the number of draws from each jar. (By amazing coincidence, the numbers of Denver pennies and draws for each year equal, respectively, the numbers of seven-year violent recidivists and of individuals in each risk category as reported by Quinsey and colleagues (Ref. 20, p 240) and by HCM.5)
Table 1. Results and Inferences About πy, the Proportion of Denver Pennies in Jar y, After n Draws (With Replacement) From Each Jar in the Penny Collection
Throughout the betting, both sisters examined their data to see what they could reasonably say about the proportion of Denver pennies in Jar y (y = {1, 2, …, 9}), given the outcomes thus far. Before we consider each sister's analysis after 618 bets, let's think about their aims. Both sisters sought to determine as precisely as possible what proportion of each jar's pennies came from the Denver mint, because this proportion would equal the probability of drawing a Denver penny. Notice also that this probability would equal the price in dollars at which each sister would be indifferent between buying and selling the ticket and the price that she would propose for the ticket. Why?
Suppose it was Mary's turn to name a price for a penny from Jar 4, and suppose that her best estimate (based on the data she had accumulated so far) was that 20 percent of that jar's pennies came from the Denver mint. If Mary named a price above $0.20, she would know that by selling the ticket, Kathy would gain a small advantage; if Mary named a price below $0.20, Kathy could buy the ticket and gain an advantage. So, unless Mary knew that Kathy was making systematic errors in estimating the proportion of Denver pennies in the jars, she could avoid giving Kathy an advantage only by naming a price equal to her best estimate of the proportion. Because Kathy was in exactly the same position as Mary, she also adopted the strategy of price equals best estimate of the proportion.
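A minimal bit of arithmetic makes the indifference point explicit. Writing π for the true proportion of Denver pennies in the jar and p for the ticket price in dollars, the expected net gains for the two sides of the bet are

$$
\mathbb{E}[\text{buyer's gain}] = \pi \cdot \$1.00 - p, \qquad \mathbb{E}[\text{seller's gain}] = p - \pi \cdot \$1.00 .
$$

Both expectations equal zero only when p = π; at any other price, the sister who chooses whether to buy or sell can take the side with the positive expectation.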
Now, neither sister knew exactly how the other was estimating proportions, but they had in fact used different procedures. Because Mary had not heard Dorothy's statements to Jim about the declining frequency of Denver pennies, Mary treated the results from each jar as independent of one another. Using y subscripts to designate the jars (again, y = {1, 2, …, 9}), Mary could simply have estimated each year's Denver-penny proportion πy as π̂y = ry/ny, the number of Denver pennies, ry, divided by the total number of drawings, ny, from Jar y, in the belief that this unbiased value21 should represent her expectation for future outcomes.
Mary took a different approach, however. She realized first that having observed no Denver pennies in 11 draws from Jar 1 did not necessarily mean that the jar contained no Denver pennies, so that setting a price of $0.00 seemed illogical. Similarly, getting nine Denver pennies in the first nine draws from Jar 9 did not imply that all the pennies in that jar came from Denver and that the fair ticket price must be $1.00. Mary also knew, however, that the data from each jar placed some numerical restrictions on what she should believe about the jar's likely contents. Therefore, Mary was interested both in the best single-number estimate for each πy and in the plausible range of values for πy that her data supported. The technical background for Mary's reasoning and calculations appears in Appendix I, and her estimates and 95 percent credible intervals appear in the fourth and fifth columns of Table 1.
A different perspective informed Kathy's Bayesian analysis. Based on Dorothy's statement, Kathy believed that the proportion of Denver pennies in each jar was related to y and could therefore be linked mathematically to it. Appendix II describes the details of Kathy's calculations, and her estimates and 95 percent credible intervals appear in the sixth and seventh columns of Table 1.
Because the sisters brought different prior assumptions to the same data, their estimates of πy for each jar were not identical. After 111 drawings from Jar 4, for example, Mary's estimates left her indifferent about buying or selling a $0.17 ticket that paid $1.00 upon her drawing a Denver penny. Kathy would have bought such a ticket gladly, however, because she was fairly confident that π4 was more than .17.
An important conclusion from this discussion is that Mary and Kathy have posited different subjective probabilities22–25 regarding the next draw from Jar 4. Realizing that probability is subjective makes it reasonable to utter a phrase such as “Kathy's probability of drawing a Denver penny from Jar 4,” because the phrase refers not to the contents of Jar 4, but to Kathy's degree of belief about the likelihood of drawing a Denver penny from that jar. The sisters' subjective probabilities determined how they proposed and accepted wagers, and their betting behavior represented a concrete illustration of the following general principle: “probability … is a rate at which an individual is willing to bet on the occurrence of an event. Betting rates are the primitive measurements that reveal your probabilities or someone else's probabilities, which are the only probabilities that really exist” (Ref. 26, p 90).
The preceding paragraphs let us distinguish among related but different quantities that the sisters might describe, based on their assumptions and the available data:
Asked to describe πy, the proportion of Denver pennies in the jar from year y, each sister might respond with her single-number estimate of πy listed in Table 1 (that is, her expectation based on the methods described in the appendices), or she might instead say that she was 95 percent sure that πy lay within the intervals shown in Table 1.
Asked to describe the proportion of pennies that would have Denver mint marks were a large number of subsequent drawings to occur, the sisters might give similar answers (i.e., either reporting the single-number values for π̂y or the 95% ranges that describe their beliefs).
Asked about the price of a Figure 1–type ticket that would leave them indifferent between buying and selling it, the sisters would use the single-number values for π̂y listed in Table 1. These values are the sisters' expectations for each of the nine jars, the bases for their decisions about bets, and their single-event probabilities that the next penny from Jar y will bear a Denver mint mark.
Using her data in Table 1, Kathy might say, “The probability of drawing a Denver penny from Jar 4 is 16 to 24 percent.” Though this assertion seems to refer to a single instance (the next drawing), Kathy's statement implicitly refers either to (a) a proportion of the jar's contents, or to (b) a plausible frequency for a certain type of event over the long run. If (a) is what Kathy intends, the assertion means, “Based on the data and my background assumptions, I'm pretty sure that 16 to 24 percent of pennies in Jar 4 bear Denver mint marks.” If (b) is Kathy's intention, her assertion means, “In a very large series of drawings, I'm 95 percent sure that 16 to 24 percent of the pennies will bear Denver mint marks.”
Responding to HCM
In evaluating the penny data, the sisters used Bayesian statistical methods that are known to produce results very close to those yielded by the traditional, frequentist methods used by Hart, Cooke, and Michie in their 2007 and 2013 publications. If the mathematical arguments of HCM are correct, however, then the sisters, as Hart and Cooke suggest, “should consider whether it is best to give up altogether on the idea of calculating probability estimates of” drawing Denver pennies from each jar (see Ref. 13, p 98).
One point of telling the penny-betting story was to recast the calculations of Hart, Michie, and Cooke5 in a context that makes it easy to see the relevance of group data to individual decisions. After 618 drawings, Mary would be foolish to think she knew next to nothing about the jars' contents or about how to establish a price for a ticket whose payoff depended on the next draw from Jar y. Mary's Bayesian analyses of the penny data yielded virtually the same group-level results (i.e., nearly the same 95 percent intervals) as those that Hart, Michie, and Cooke reported in their Table 1,5 and Mary felt 95 percent sure that her intervals contained the true proportion of Denver pennies in each jar. For purposes of making a betting decision, however, the relevant probability for Mary was the value at which she would be indifferent about buying or selling a ticket like the one shown in Figure 1. This value should equal her best point estimate of the proportion of Jar y's pennies that came from Denver.
Thinking about probabilities as expressions of beliefs that can be the basis for decisions helps one to avoid a mistake that HCM make in their discussion of group data. They ask readers to imagine a game of chance and write, “Suppose that Dealer, from an ordinary deck of cards, deals one to Player. If the card is a diamond, Player loses; but if the card is one of the other three suits, Player wins. After each deal, Dealer replaces the card and shuffles the deck” (Ref. 5, p s62). Over 10,000 games, say Hart and colleagues, Player can be 95 percent confident he will win 74 to 76 percent of the games. But if Dealer and Player play this game just once, “the estimated probability of a win is still 75 percent but the 95 percent CI is 12 to 99 percent. The simplest interpretation of this result is that Player cannot be highly confident that he will win—or lose—on a given deal” (Ref. 5, p s62).
Indeed, Player should not be confident, but not because of the interval that Hart and colleagues provided, which is what Player would calculate (using Wilson's method) for plausible values of the deck's nondiamond proportion if Player began knowing nothing about decks of cards and learned only that, in a single draw, one-fourth of the card was a diamond. Leaving aside the impossibility of such a draw, HCM elided the distinction between one's confidence about a single yes-or-no outcome and one's probability concerning that outcome. Player should know that in a standard 52-card deck, three-fourths of the cards are not diamonds. If the payoff for a nondiamond is $1.00, the fair price for each round of the Dealer-Player game described above is $0.75, because the probability of getting a nondiamond on any deal is .75. The problem with Hart and colleagues' evaluation of ARAI data5 is not their “ad hoc procedure” for interpreting information about a group of outcomes, but their misuse of Wilson's method. Their interval calculations effectively throw out all the information about each group, just as Player would be doing if he threw out his knowledge of 52-card decks, drew one card, and learned (somehow) that it was one-fourth a diamond.
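A short computational sketch illustrates the point, assuming (as above) that HCM applied Wilson's score interval. With the deck's nondiamond proportion of .75, the calculation reproduces the figures HCM reported: roughly 74 to 76 percent over 10,000 games, and roughly 12 to 99 percent when n is set to 1, even though the probability of winning any single deal is known exactly.

```python
from math import sqrt

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed in n trials."""
    center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half_width = (z / (1 + z**2 / n)) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

p_win = 0.75  # proportion of nondiamonds in a standard 52-card deck

# Group level: 10,000 deals yield a narrow interval, about (0.74, 0.76)
print(wilson_interval(p_win, 10_000))

# The "individual" calculation: the same proportion with n set to 1 yields
# an interval of about (0.12, 0.99), because the formula has discarded
# everything known about the deck except a single fractional "observation"
print(wilson_interval(p_win, 1))
```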
Kathy's Bayesian analysis used the statistical procedure (logistic regression) that Hart and Cooke13 employed to model probabilities for individuals. Kathy's credible intervals for each jar's Denver-penny proportions were narrower than those that Hart and Cooke13 described, in part because Kathy used a larger data set. Had her data set been smaller, Kathy's credible intervals would have been wider, but her single-value probability estimates still would have helped her to make decisions about bets.
The HCM articles often refer to a predicted probability or an estimated probability, and they produce calculations that, the authors assert, show these quantities to be too imprecise to be useful. But just what HCM were trying to calculate is puzzling. To see why, imagine a lunchtime discussion among Kathy, Mary, and their longtime friend Jane, who questioned whether they could apply their group data to individual bets.
“For the jars from which Steve has drawn lots of pennies,” said Jane, “you know within fairly narrow intervals what fraction of Denver pennies you'd get in a very large number of future drawings from the jar. Those intervals are group results, however. On any individual penny drawing, you cannot predict the outcome with much confidence.”
“We aren't trying to predict what will happen on any particular draw,” replied Mary. “We're simply setting prices and deciding how to bet on each drawing.”
“The probabilities we've calculated aren't predictions,” added Kathy. “They represent degrees of belief based on rational mathematical strategies and our knowledge and experience. We don't know how many pennies are in each jar or how many came from Denver, but we're implementing the best possible strategy based on what we do know.”
“But you still can't tell me the precise probability you predict for the next penny, nor can you give me a confidence interval for your prediction,” protested Jane.
“We aren't trying to predict a probability, or anything else!” responded Mary. “I don't even understand what you mean.”
“Maybe this will help,” offered Kathy. “We aren't trying to predict anything: neither the outcome of the next penny drawing, nor our probability for that drawing. Probabilities aren't something we predict; they are degrees of belief that we ascribe to possible outcomes. Based on our data, we have formed beliefs about the intervals within which each jar's Denver proportion probably lies. But for purposes of making bets, our single-value estimates of the jars' proportions of Denver pennies are our probabilities for drawing a Denver penny. If I say my probability for getting a Denver penny is .20, that means I'm indifferent between selling or buying a $0.20 ticket that pays $1.00 if the next penny has a D mint mark.”
Hart and Cooke believe that “the state of knowledge is arguably more advanced” in medicine than in psychology, yet “it is not common for physicians to give individual risk estimates” (Ref. 13, p 98) for outcomes. They quoted Henderson and Keiding,27 who believe that while “‘models and statistical indices can be useful at the group or population level, … human survival is so uncertain that even the best statistical analysis cannot provide single-number predictions of real use for individual patients’” (Ref. 13, p 99, quoting Ref. 27, p 703).
But it's easy to think of counterexamples. Suppose a 50-year-old man learns that half of people with his diagnosis die in five years. He would find this information very useful in deciding whether to purchase an annuity that would begin payouts only after he reached his 65th birthday. When you purchase insurance coverage, you may not tell yourself that you're making a bet, but that's what insurance is, and insurers find actuarial data very useful in deciding whether to offer you coverage and what your premium will be.
Hart and Cooke also state that “the definition of individual risk estimate used by ARAIs assumes that every person has a propensity for violence that is stable, dispositional, or trait-like” (Ref. 13, p 87). This characterization is incorrect for reasons that the penny story helps us understand. Each penny is unique: it has characteristics (e.g., its position in the jar) that make it theoretically distinguishable from other pennies and that influence whether it is the next one drawn by Steve. Other characteristics (e.g., mintage year) affect a penny's likelihood of coming from Denver. When the sisters set prices and made bets, however, the only information they had about each penny was its source jar, so the pennies' other characteristics could not affect their probability estimates. Similarly, if all one knew about an individual was his Static-99R score and that he came from a population for which the Static-99R data and rates were relevant, the individual's Static-99R score would be the best and the only basis for making a probabilistic judgment about his future behavior. This is true even though many factors not considered by the Static-99R (e.g., employment status, substance use, and family relationships) affect a sex offender's likelihood of recidivism.
Making Predictions Versus Assessing Risks
Making individual predictions is neither the aim of ARAIs nor the purpose for which they are designed. As the term actuarial risk assessment instrument suggests, a validated actuarial tool provides a numerical value for the risk of an event.
In states that authorize civil commitment of so-called sexual predators (see, for example, Ref. 28), courts typically solicit expert testimony relevant to whether an individual is “likely to engage in acts of sexual violence” if not confined. Courts disagree about exactly what numerical value is entailed by the word “likely.”29–31 Yet under a plain-English interpretation as well as the interpretations that U.S. courts have provided, this phrase requests a statement regarding an individual's probability of engaging in a certain kind of behavior, not a prediction.
When Kathy and Mary bet on Dorothy's pennies, the results of previous drawings from a particular jar were obviously relevant to the likelihood that the next penny from that jar would bear a Denver mint mark. In providing so-called norms—for example, rates of recidivism or violent acts—for translating ARAI scores to probabilities of recidivism, ARAI designers are saying that their source data (the bases for the norms) are as relevant to any evaluee as are the source data used by Mary and Kathy to calculate probabilities of drawing Denver pennies.
This claim is questionable. An ARAI may do equally well at ranking the risks of individuals from two populations, yet have probabilities associated with particular scores that differ because the populations' overall base rates differ.6,32 Helmus and colleagues33 and Singh and colleagues7 have shown that the offending rates associated with particular ARAI scores differ across locales, but this should not surprise anyone. Social, economic, and political conditions in different places are likely to influence interpersonal behaviors such as acting violently or committing a sex offense.
Thus, one can disagree with the HCM mathematical argument, yet agree with Hart and Cooke that “it is arbitrary and therefore inappropriate to rely solely on a statistical algorithm … professionals [should] recognize that their decisions ultimately require consideration of the totality of circumstances—not just the items of a particular test” (Ref. 13, p 98). Sensible exponents of actuarially validated risk assessment know that factors besides those considered by the instrument may influence risk, but because actuarially based risk assessment methods typically outperform other judgment methods (especially unstructured clinical judgment), the onus rests with those who propose adjusting estimates to prove that their adjustments yield results that are superior to those based on actuarial judgment alone.
Final Comments
The term probability causes confusion because it has many uses. Readers interested in exploring these uses would do well to start with the recent article by Buchanan,34 which provides a short, elegant discussion of probability in the context of forensic risk assessment.
Thinking carefully about probability involves thinking carefully about numbers, and many people, including judges and jurors, have trouble understanding numerical information and using it rationally. Even numerically sophisticated people can get confused by the statistics that describe probabilities, estimates of proportions, and risks of events, and also by the relationships between these mathematical quantities, people's predictions about individual events, and optimal decisions about what to do in uncertain individual circumstances.
To make these relationships clearer and to dispel the misunderstandings generated by the well-intended efforts of HCM, I have explicated the relationship between rational use of data and probabilities with a story about betting. Some people object to betting on moral grounds,35 and some mental health professionals may disapprove of describing psycholegal matters as though the clinicians involved were making bets,36 but since the 17th century, writers have used gambling as a standard metaphor for deriving and illustrating basic principles of probability.37
I have used the same metaphor to give readers an intuitive feel for why HCM's mathematical analyses must contain errors. Some readers may take offense at legal schemes that impose confinement based on principles that govern rational betting. If you are one of those readers, I agree with you, but such opinions reflect moral or legal positions about the proper basis for confining people, not mathematical arguments about the precision of risk assessments.
Appendix I
Mary approached the problem of estimating πy from a Bayesian perspective. She sought to establish a probability distribution p(πy|y, ry, ny) for each jar, based on her data. She treated the series of penny drawings as a set of Bernoulli trials—that is, as independent, random experiments with exactly two possible outcomes—in which the probability of a Denver penny was the same every time a drawing occurred. Starting with the Jeffreys prior for binomial data from Bernoulli trials, Mary's posterior distributions for p(πy|y, ry, ny), after she had observed ry Denver pennies in ny draws from each jar, were Beta(ry + ½, ny – ry + ½). The Jeffreys prior (a Beta(½, ½) distribution) produces intervals with good coverage properties from a traditional (frequentist) statistical standpoint.38 (For further explanations of the rationale for using this prior and examples applied to psychiatric contexts, see Refs. 39,40.)
The expectation for a Beta(α,β) distribution on the interval [0,1] is α/(α+β). In Table 1, Mary's Bayesian estimates for πy reflect this calculation, so they differ a bit from what calculating estimates of πy as π̂y = ry/ny would yield. Because Mary used the data to assign Bayesian probability distributions to πy, having observed (for example) that r2 = 6 and n2 = 71, she could say, “I am 95 percent sure that π2 lies between .036 and .166.”
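For readers who wish to check the arithmetic, the following sketch (written in Python with the SciPy library, one of many packages that provide beta-distribution quantiles) reproduces Mary's calculation for Jar 2:

```python
from scipy.stats import beta

r, n = 6, 71                 # Jar 2: Denver pennies observed, total draws
a, b = r + 0.5, n - r + 0.5  # posterior Beta parameters under the Jeffreys prior

posterior_mean = a / (a + b)   # Mary's single-number estimate, about 0.090
lower = beta.ppf(0.025, a, b)  # 2.5th percentile, about 0.036
upper = beta.ppf(0.975, a, b)  # 97.5th percentile, about 0.166

print(posterior_mean, lower, upper)
```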
Appendix II
From among several candidates for link functions (see Ref. 41, § 6.5), Kathy chose this logistic regression model to fit the data:

ry ∼ Binomial(ny, πy)

logit(πy) = α + β(y − ȳ)
The first part of Kathy's model states that ry, the number of Denver pennies out of ny drawings from Jar y, results from a series of Bernoulli trials that follow (or are distributed as) a binomial distribution, where πy is the actual (but unobserved) proportion of Denver pennies in Jar y. (If taken by itself, this part of Kathy's model would be the same as Mary's.) The second part posits a standard logistic model in which y (years before death) is the independent variable and ȳ (the mean years in the data sample; here, ȳ = 3.74) aids in convergence because it reduces the dependence of α on β (see Ref. 41, p 115). Kathy's model was implemented using WinBUGS, a free statistical software program for effecting Bayesian analyses using Markov chain Monte Carlo methods.41 For additional details, see the footnote to Table 1.
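Kathy's actual computation used WinBUGS and Markov chain Monte Carlo. As a rough sketch of the same model structure (though not of her Bayesian analysis), the code below fits the frequentist maximum-likelihood analogue, a binomial regression with a logit link on mean-centered y, using Python's statsmodels library. The counts shown are placeholders for the r and n columns of Table 1: a few values mentioned in the text are included (0 of 11 for Jar 1, 6 of 71 for Jar 2, 111 draws from Jar 4), and the rest are invented for illustration only.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data: substitute the actual r (Denver pennies) and n (draws)
# columns of Table 1. Only a few of these counts are mentioned in the text;
# the remaining values are invented for illustration.
y = np.arange(1, 10)                                   # years before Dorothy's death
r = np.array([0, 6, 12, 19, 34, 36, 33, 30, 18])       # Denver pennies drawn from each jar
n = np.array([11, 71, 101, 111, 100, 80, 60, 40, 20])  # total draws from each jar

ybar = np.average(y, weights=n)  # the text reports 3.74 for the real data

# Binomial GLM with logit link: logit(pi_y) = alpha + beta * (y - ybar)
endog = np.column_stack([r, n - r])  # successes and failures for each jar
exog = sm.add_constant(y - ybar)     # intercept plus centered year
fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()

print(fit.params)         # maximum-likelihood estimates of alpha and beta
print(fit.predict(exog))  # fitted probability of a Denver penny for each jar
```

A fully Bayesian version of the same model, with priors on α and β and posterior credible intervals for each πy, could be fit with WinBUGS as described above or with an MCMC package such as PyMC.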
Footnotes
Disclosures of financial or other potential conflicts of interest: None.
© 2015 American Academy of Psychiatry and the Law