Abstract
Structured risk assessment tools are essential in forensic psychiatry to evaluate the likelihood of recidivism. The Violence Risk Appraisal Guide-Revised (VRAG-R) was developed as an update to the VRAG, but its predictive validity across offender populations remains underexamined. Our study aimed to examine the predictive validity of the VRAG-R for general, violent (including and excluding sexual offenses), and sexual recidivism. We conducted a systematic review and meta-analysis, searching 10 databases and gray literature sources for studies reporting psychometric outcomes for the VRAG-R published since 2013. Risk of bias was assessed using Prediction Model Risk of Bias Assessment (PROBAST) and data extraction followed the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies (CHARMS) checklist. Area under the curve (AUC) values were pooled using random-effects meta-analysis. In total, 15 studies comprising 3,932 participants were included. The VRAG-R showed acceptable predictive validity for general recidivism (pooled AUC = .71, 95% CI: .67 to .75) and violent recidivism (AUC = .72, 95% CI: .69 to .75). Predictive validity for sexual recidivism was modest (AUC = .65, 95% CI: .61 to .68). In conclusion, the VRAG-R demonstrates acceptable predictive validity for general and violent recidivism, comparable with other tools. Its performance in predicting sexual recidivism, however, is limited and concerns about generalizability remain. Future research should prioritize diverse samples, reporting of calibration, and continued evaluation of performance.
Assessing the risk of criminal recidivism is a crucial task in forensic psychiatry that directly informs the practices of the criminal justice system to protect the public. Over time, the methodological approach for risk assessment has shifted toward structured professional judgment (SPJ) and actuarial tools, moving away from unstructured clinical judgment, which relied solely on individual clinical impressions.1,2 Actuarial risk assessment relies on both discrimination, to distinguish between high- and low-risk individuals, and calibration, to ensure the predicted probabilities reflect the observed outcomes. In forensic risk assessment, the area under the receiver operating characteristic curve (AUC) is used to assess a tool’s ability to distinguish between individuals who do or do not recidivate, with values ranging from .5 to 1. For example, an AUC of .5 indicates that the tool is no better than chance, whereas an AUC of 1 indicates that the tool perfectly distinguishes individuals each time.3 The authors of the VRAG tool have characterized an AUC of .56 as small, .64 as moderate, and .71 or higher as high.4 In contrast, SPJ involves a combination of actuarial methods and unstructured clinical judgment, allowing clinicians to consider defined risk factors while also exercising professional judgment in weighing and integrating them into the overall risk determination.
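As a concrete illustration of the discrimination measure described above, the AUC can be read as the probability that a randomly chosen recidivist receives a higher risk score than a randomly chosen nonrecidivist, with ties counted as one half. The following minimal sketch computes it by pairwise comparison. The function name and the scores are hypothetical, invented only for illustration, and are not taken from any included study.

```python
from itertools import product


def auc_from_scores(recidivist_scores, nonrecidivist_scores):
    """Return the AUC as the share of recidivist/nonrecidivist pairs
    in which the recidivist has the higher score (ties count 0.5)."""
    pairs = list(product(recidivist_scores, nonrecidivist_scores))
    if not pairs:
        raise ValueError("both groups need at least one score")
    wins = sum(1.0 if r > n else 0.5 if r == n else 0.0 for r, n in pairs)
    return wins / len(pairs)


# Hypothetical risk scores (illustrative only).
recidivists = [12, 18, 7, 21]
nonrecidivists = [3, 9, 12, 1, 5]
print(auc_from_scores(recidivists, nonrecidivists))  # 0.875
```

A value of .5 from this function corresponds to chance-level discrimination, and 1 to perfect separation of the two groups.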
The Violence Risk Appraisal Guide (VRAG) was a pioneering tool in the field, as it was among the first actuarial instruments developed. Clinicians completed a 12-item scale that placed an offender in one of nine risk categories. The scale was developed based on offender characteristics that were found to most strongly correlate with violent recidivism.5 Although the instrument was originally developed using a sample of male offenders who were assessed or treated at a maximum-security psychiatric facility in Ontario, Canada, its findings have since been replicated across a range of populations internationally. Additionally, the tool has been extensively validated, with more than 60 successful replication studies since its creation demonstrating an average AUC of .72.6,7
Since the original release of the VRAG, the field has seen a proliferation in the development of risk assessment tools.8,9 This growth reflects the emerging recognition of the limited accuracy of unstructured clinical judgment, which has been shown to be unreliable and prone to biases.2 Moreover, the superiority of structured and standardized prediction methods has been consistently replicated in forensic research.1 Today, more than 200 structured tools are utilized within criminal justice systems globally, providing clinicians with an abundance of options but also creating challenges in determining which tool is most appropriate to use.10,11
Despite these developments, the VRAG has maintained its status as one of the most commonly used risk assessment tools in the field.10 In 2013, the Violence Risk Appraisal Guide-Revised (VRAG-R) was released, incorporating a larger sample that included most of the original cohort. The purpose of the VRAG-R was to simplify the scoring system of the VRAG by replacing the item requiring the total Psychopathy Checklist-Revised (PCL-R)12 score with only Facet 4 of the PCL-R.6 Additionally, the authors introduced an item that evaluated sexual offending, recommending the VRAG-R as a replacement for both the VRAG and the Sex Offender Risk Appraisal Guide (SORAG),7,13 which had been developed as a modification of the VRAG to predict recidivism among sexual offenders.14
It is important to examine the predictive validity of the VRAG-R in its own right, as the adoption has likely been influenced by the extensive validation of its predecessor.6 This systematic review and meta-analysis aims to critically evaluate the predictive performance of the VRAG-R. We specifically assess whether the VRAG-R demonstrates predictive accuracy comparable with that of the original VRAG across diverse offender populations and for different types of recidivism, including violent, sexual, and general reoffending. Additionally, where reported in studies of the VRAG-R, we compare the VRAG-R’s performance against some of the most validated and commonly used risk assessment tools, including structured professional judgment tools, such as the Historical Clinical Risk Management-20, Version 3 (HCR-20V3) and Sexual Violence Risk-20 (SVR-20), as well as actuarial tools, such as the Psychopathy Checklist-Screening Version (PCL:SV) and STATIC-99R.15,16 By synthesizing the existing literature, this study seeks to critically examine the tool’s performance, which can assist practitioners and policymakers in better understanding the strengths and limitations of the VRAG-R in forensic risk assessment and how the tool performs compared with other available tools.
Methods
We developed a protocol following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline17 and registered it on the International Prospective Register of Systematic Reviews (PROSPERO; CRD42024599060). We used the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies (CHARMS)18 to guide data extraction and critical appraisal of the included studies. Furthermore, we used the Prediction Model Risk of Bias Assessment (PROBAST)19 to review each included study for risk of bias and applicability.
Search Strategy
A systematic search strategy was designed in conjunction with a librarian experienced in systematic review searching (SB). We searched Medline, Criminal Justice Abstracts, Cumulative Index to Nursing and Allied Health Literature (CINAHL), Embase, Education Resources Information Center (ERIC), Health and Psychosocial Instruments (HaPI), Mental Measurements Yearbook, PsycINFO, Scopus, and Web of Science for articles published from 2013 onward. No study type or language limits were applied to the search results. In addition, we carried out a manual review of the citations of included studies, a search of Google Scholar limited to the first 100 references, and a gray literature search of conference papers, dissertations, and preprints.
Study Selection Criteria
All studies that reported psychometric data on the VRAG-R published in English were eligible for inclusion, including cohort, case-control, and observational studies. Studies of populations of people convicted of a criminal offense were included, with no exclusions based on gender, age, setting, or type of offense.
All identified references were uploaded to Covidence20 to facilitate reference management. Two of the study authors (VA and AN) independently screened all titles and abstracts. Full-text review was completed by both reviewers only if one or both reviewers indicated the study met eligibility criteria based on preliminary review of the title and abstract. Any discrepancies at each stage were resolved through discussion and consensus between the two reviewers, consulting a third reviewer where necessary.
Data Extraction and Coding
Two of the study authors (VA and AN) independently extracted data using the CHARMS Checklist template.18,19 For all eligible articles, study information and design, participant characteristics, outcome measures (including general, violent, or sexual recidivism), VRAG-R performance (i.e., calibration, discrimination, and overall measures), and interpretation of the predictive validity of the VRAG-R were extracted. Within the calibration and discrimination measures, we additionally aimed to extract the observed:expected ratio (O:E) and the hazard ratio (HR) where available. The O:E is the ratio of the number of recidivists actually observed to the number of recidivists predicted by the VRAG-R, whereas the HR compares the risk of recidivism between two groups.21
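To make the O:E definition concrete, the sketch below computes the ratio from a count of observed recidivists and a list of predicted probabilities, whose sum gives the expected number of recidivists. The function name and all values are hypothetical. A ratio above 1 would suggest that the tool underpredicts recidivism in that sample, and a ratio below 1 that it overpredicts.

```python
def observed_expected_ratio(observed_recidivists, predicted_probabilities):
    """Ratio of observed recidivists to the number expected from the
    predicted probabilities (expected = sum of the probabilities)."""
    expected = sum(predicted_probabilities)
    if expected <= 0:
        raise ValueError("expected number of recidivists must be positive")
    return observed_recidivists / expected


# Hypothetical sample: four individuals with predicted recidivism
# probabilities summing to 1.5 expected recidivists; 3 were observed.
print(observed_expected_ratio(3, [0.25, 0.25, 0.5, 0.5]))  # 2.0
```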
Risk of Bias Assessment
Two of the study authors independently completed an assessment of the methodological quality of the studies according to the four domains of PROBAST using the template created by Fernandez-Felix et al.19 The studies were examined for the risk of bias introduced by the inclusion and exclusion of participants as well as the definition and assessment of predictors and outcomes. Additionally, the articles were assessed for their methods of analysis, including their handling of missing data, reporting of results, and statistical methods.
Statistical Methods Used for Meta-Analysis
For a comprehensive analysis of the VRAG-R’s performance, we aimed to examine both its discrimination and calibration measures. To enable consistent scaling across studies, we performed a logit transformation of the AUC values. Where data were available, we carried out separate meta-analyses for different follow-up periods to explore temporal effects on VRAG-R performance.
Meta-Analysis
To quantitatively synthesize the predictive validity of the VRAG-R, we conducted meta-analyses on reported area under the curve (AUC) values across eligible studies. AUC was chosen as the primary performance metric because it was the only metric consistently reported across included studies. Although we intended to summarize calibration metrics, these were infrequently or inconsistently reported, precluding meaningful meta-analytic synthesis.
All meta-analyses were conducted using Stata (version 17.0; StataCorp LLC). AUC values were logit-transformed prior to analysis to stabilize variances and normalize distributions. Pooled estimates were calculated using a random-effects model (DerSimonian and Laird method)22 to account for expected heterogeneity across studies. Between-study heterogeneity was assessed using the I2 statistic, with values above 50 percent interpreted as moderate to high heterogeneity.
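The pooling steps described above can be sketched in a few lines. The analyses themselves were run in Stata; this Python version is only an illustration of the general technique (logit transformation, DerSimonian and Laird random-effects pooling, and I2). It assumes that each study's standard error on the logit scale can be approximated from its reported 95% CI, which is an assumption of this sketch and not a detail stated in the methods. The function names and input values are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a 95% CI


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pool_auc_dl(studies):
    """Pool AUCs with a DerSimonian-Laird random-effects model.

    studies: list of (auc, ci_lower, ci_upper) tuples.
    Assumption: the logit-scale SE is recovered from the CI width.
    """
    if len(studies) < 2:
        raise ValueError("need at least two studies")
    y = [logit(auc) for auc, _, _ in studies]
    se = [(logit(hi) - logit(lo)) / (2.0 * Z_95) for _, lo, hi in studies]
    w = [1.0 / s ** 2 for s in se]  # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(studies) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_re = [1.0 / (s ** 2 + tau2) for s in se]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "auc": inv_logit(pooled),  # back-transformed pooled estimate
        "ci_lower": inv_logit(pooled - Z_95 * se_pooled),
        "ci_upper": inv_logit(pooled + Z_95 * se_pooled),
        "tau2": tau2,
        "i2_percent": i2,
    }


# Hypothetical inputs (not data from the included studies).
example = [(0.66, 0.58, 0.73), (0.74, 0.69, 0.79), (0.70, 0.61, 0.78)]
print(pool_auc_dl(example))
```

Pooling on the logit scale and back-transforming keeps the pooled estimate and its interval inside the 0 to 1 range, which is one reason the transformation is commonly used for AUC values.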
To examine the influence of follow-up duration, we conducted subgroup analyses stratifying studies by follow-up length (less than five years versus greater than or equal to five years). Because of limited sample sizes, formal subgroup analyses by gender, age, or setting were not feasible.
Publication bias was visually assessed for asymmetry using funnel plots and statistically tested with Egger’s test23 where applicable.
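For readers unfamiliar with Egger's test, the following sketch shows the regression that underlies it: each study's standardized effect (effect divided by its standard error) is regressed on its precision (one divided by the standard error), and an intercept far from zero suggests funnel plot asymmetry. This is a generic illustration using only the standard library. It returns the intercept, its standard error, and the t statistic, but not a p value, which would require a t distribution. Whether the effects are entered on the logit scale is an assumption here, and the names and numbers are hypothetical.

```python
import math


def egger_intercept(effects, standard_errors):
    """Egger regression: standardized effect on precision.

    Returns (intercept, se_intercept, t_statistic).
    """
    n = len(effects)
    if n < 3 or n != len(standard_errors):
        raise ValueError("need at least three studies with matching SEs")
    x = [1.0 / s for s in standard_errors]  # precision
    y = [e / s for e, s in zip(effects, standard_errors)]  # standardized effect
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid_var = sum(
        (yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)
    ) / (n - 2)
    se_intercept = math.sqrt(resid_var * (1.0 / n + mx ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, se_intercept, t_stat


# Hypothetical logit-scale effects in which smaller studies (larger SEs)
# report larger effects, the pattern associated with asymmetry.
effects = [0.5, 0.7, 0.9, 1.2]
ses = [0.05, 0.1, 0.2, 0.4]
print(egger_intercept(effects, ses))
```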
Results
Our literature search identified 1,050 records, of which 429 duplicates were identified and removed by Covidence. A total of 621 titles and abstracts were screened, which excluded 569 studies. After reviewing the full text of 52 studies, 15 studies met inclusion criteria (Fig. 1) for the systematic review. Of these, 13 were retrospective cohort studies and two were prospective cohort studies. In total, 3,932 participants were included in the analysis, with most participants being adult White males. The studies were conducted across seven countries, including Canada, Switzerland, Austria, Mexico, Germany, Belgium, and Australia, which all have different criminal justice systems and population compositions. Only one study focused exclusively on female participants (n = 525) and three on youth samples below the age of 25 (n = 822).

Figure 1. PRISMA flow diagram.
Recidivism was most often defined as any new criminal charges or conviction following a previous conviction. Recidivism data were extracted from various sources depending on the scope of the study, including federal and provincial databases and institutional-level records from correctional facilities, hospitals, or programs tailored for specific populations (i.e., sex offenders and individuals found to be not criminally responsible). Three studies focused solely on participants who were in institutional settings. For these participants, recidivism was defined as violent incidents as per the Modified Overt Aggression Scale (MOAS) or the Staff Observation Aggression Scale-Revised (SOAS-R), although only incidents that were threatening or did harm to others were coded. For participants who were in the community, recidivism definitions differed across studies, particularly for violent recidivism, where there was no consistency between studies as to whether sexual offenses were included in the definition. For example, although some studies categorized contact sexual offenses as violent recidivism,24–29 others explicitly excluded them from this category.1,30–35 To account for this variability, recidivism outcomes were categorized as general recidivism, nonsexual violent recidivism, violent and sexual recidivism, or sexual recidivism. Three studies were conducted solely with sexual offenders,26,29,31 examining the predictive validity of the VRAG-R for general, violent, and sexual recidivism.
All 15 studies reported the AUC as a measure of the ability of the tool to distinguish between recidivists and nonrecidivists. The follow-up times in all 15 studies varied, ranging from 28 days to 17.75 years. We carried out a meta-analysis of reported AUCs in these studies.
Across the 15 included studies, the AUCs of the VRAG-R varied depending on the recidivism category and follow-up duration. Focusing on long-term predictive validity, the pooled AUC values using the longest available follow-up period reported in each study were examined, according to the categories of general, nonsexual violent, violent and sexual, and sexual recidivism (Fig. 2).
For general recidivism, 10 studies were included, with AUC values ranging from .60 to .86 and a pooled estimate of .71 (95% confidence interval (CI) .67 to .75; I2 = 68.6%). For nonsexual violent recidivism, eight studies were included, with AUC values ranging from .60 to .80 and a pooled estimate of .72 (95% CI .68 to .76; I2 = .0%). Similarly, seven studies were included for violent and sexual recidivism, with AUC values ranging from .66 to .75 and a pooled estimate of .72 (95% CI .69 to .75; I2 = .0%). Finally, for sexual recidivism, five studies were included, with AUC values ranging from .63 to .69 and a pooled estimate of .65 (95% CI .61 to .68; I2 = .0%).

Figure 2. Meta-analysis of the predictive validity of the VRAG-R for general, nonsexual violent, violent and sexual, and sexual recidivism for the longest available follow-up periods. AUC = area under the curve; CI = confidence interval.
The degree of heterogeneity varied across outcome categories. For general recidivism, heterogeneity was moderate to high (I2 = 68.6%), suggesting substantial variability in effect sizes across studies. This may be attributed to differences in study populations, definitions of recidivism, or follow-up durations. In contrast, heterogeneity was negligible for nonsexual violent recidivism (I2 = .0%), violent and sexual recidivism (I2 = .0%), and sexual recidivism (I2 = .0%), indicating consistent effect sizes across studies for these outcomes.
To explore potential sources of heterogeneity, particularly for general and sexual recidivism, we conducted subgroup analyses based on follow-up duration (less than five years versus greater than or equal to five years). In our secondary analysis, the pooled AUC values for general, violent and sexual, and nonsexual violent recidivism were stratified by follow-up duration (Figs. 3 and 4). These analyses suggested that follow-up time did not account for the heterogeneity observed in general recidivism, and pooled AUC values remained consistent across time points.

Figure 3. Meta-analysis of the predictive validity of the VRAG-R for general, nonsexual violent, violent and sexual, and sexual recidivism for less than five years of follow-up. AUC = area under the curve; CI = confidence interval.

Figure 4. Meta-analysis of the predictive validity of the VRAG-R for general, nonsexual violent, violent and sexual, and sexual recidivism for more than five years of follow-up. AUC = area under the curve; CI = confidence interval.
Although there were insufficient data to support formal subgroup analyses for specific populations, some insights can be taken from the available data. In the single study that focused on female forensic psychiatry patients, the VRAG-R was found to have an AUC of .69 (95% CI .65 to .74) for general recidivism and .68 (95% CI .61 to .75) for violent recidivism.4 Women were also a minority of the sample in Canadian studies, with one study concluding that women had lower VRAG-R scores overall than men.28 Three studies focused solely on youth offenders, although the age definition of youth varied. Two of the three studies found that the VRAG-R had acceptable predictive validity for violent and general recidivism in youth, whereas the other did not, reporting an AUC of only .60 (95% CI .47 to .73) for general recidivism and only .63 (95% CI .50 to .77) for violent recidivism.27,30,34 Of the three studies focused on youth offenders, only one examined sexual recidivism in youth sexual offenders, reporting an AUC of .69 (95% CI .53 to .85).30 Two studies examined the predictive validity of the VRAG-R for inpatient violence, an outcome for which the VRAG was previously validated. In contrast to the VRAG, neither study found that the VRAG-R demonstrated predictive validity for inpatient violence.32,33 Finally, the predictive validity of the VRAG-R for sexual recidivism in sexual offenders was examined in three studies. All three concluded that the VRAG-R had poor to fair predictive validity for sexual recidivism, with AUC values ranging from .56 to .63.26,29,31
All studies were rated as having high risk of bias overall using PROBAST, primarily because of deficiencies in the analysis domain. There was incomplete reporting of calibration measures, with only two studies reporting a Hosmer-Lemeshow test.36 Additionally, a funnel plot was generated to assess for potential publication bias across the included studies (Fig. 5). The mild asymmetry, particularly among smaller studies, may indicate that studies with weaker predictive validity of the VRAG-R were not published or included.

Figure 5. Funnel plot assessing publication bias. CI = confidence interval.
Discussion
This systematic review and meta-analysis examined the predictive validity of the VRAG-R for general, violent, and sexual recidivism. The results indicated acceptable to good predictive validity of the VRAG-R for general and violent recidivism, with 95% CIs around the pooled AUC estimates spanning .67 to .75 for general recidivism and .68 to .76 for nonsexual violent recidivism. Contrary to the developmental sample, the VRAG-R demonstrated only poor to fair predictive validity for sexual recidivism, with a 95% CI around the pooled AUC estimate of .61 to .68.
To contextualize our findings, it is important to compare the pooled estimated AUC values for recidivism with other commonly used risk assessment instruments. For general recidivism, our pooled estimated AUC across studies based on the longest follow-up periods available was .71 (95% CI .67 to .75; I2 = 68.6%). This estimate is similar to the predictive validity of other commonly used risk assessment tools, such as the HCR-20V3 and the PCL:SV, which have pooled AUC values of .69 (95% CI .65 to .72) and .67 (95% CI .56 to .77), respectively.16 This finding is replicated when examining violent (including sexual offenses) recidivism, where our pooled estimated AUC across studies based on the longest follow-up periods available was .72 (95% CI .69 to .75; I2 = .0%), which is similar to the pooled estimated AUC values for the HCR-20V3 (.69, 95% CI .65 to .72), Static-99 (.64, 95% CI .53 to .73), and the original VRAG (.69, 95% CI .63 to .75).16
The pooled estimated AUC based on the longest follow-up periods available for sexual recidivism was more modest, at .65 (95% CI .61 to .68; I2 = .0%). This estimate is in keeping with the established predictive validity of other commonly used instruments for predicting sexual recidivism. The Static-99 has a reported AUC value of .66 (95% CI .57 to .74), the SORAG ranges from .64 to .66, and the STABLE-2007 has an AUC of .67 (95% CI .65 to .70).16,37–40 The SVR-20 has the highest predictive validity, with reported AUC values ranging from .72 to .80.41–43 This comparison highlights a broader pattern among actuarial risk assessment tools for sexual recidivism, including the VRAG-R, that their predictive accuracy is generally modest.
Overall, our findings suggest that the VRAG-R performs similarly to other commonly used risk assessment tools in forensic psychiatry evaluations. Despite the similar predictive validity observed, it remains pertinent to interpret these findings within the broader context of evidence that continues to advocate for the integration of actuarial risk assessment tools with clinical judgment.44 This suggestion is based on numerous studies showing that, although clinical judgment alone underperforms in predictive accuracy when directly compared with actuarial risk assessment, the combination of the two approaches yields higher predictive accuracy than either method in isolation.45–47
Furthermore, although actuarial risk assessment typically reduces sources of bias and inconsistencies among and within assessors and removes variability introduced by irrelevant factors, there are many areas relevant to risk assessment that are not well captured by existing models or tools.44 One of the earliest and most persistent critiques of actuarial risk assessment tools, however, is their limited consideration of the diversity in the populations they aim to assess. This concern was clearly illustrated in our findings, where only one of 15 studies focused solely on females, the majority of the population was White, and only one study was conducted entirely in a country with a majority Black, Indigenous, or people of color (BIPOC) population. This concern, although contentious, is supported by a study involving 25,980 participants, in which multiple actuarial risk assessment tools showed greater predictive validity when used in populations similar to their original validation samples, which are majority middle-aged White men.10 Finally, there were no validation studies completed in the United States of America. This may have been related to the poor generalizability of the original VRAG in American populations as well as the abundance of other risk assessment tools that are more commonly used.48,49
Limitations of the Review of Literature
As previously mentioned, the population in the studies was largely homogenous, which raises concerns about the generalizability of the VRAG-R across gender identities, ages, and ethnicities. Further exacerbating this limitation was our decision to only include English language studies, which could exclude relevant findings from non-English-speaking populations.
Examining the included studies, there were several methodological limitations that could be addressed in future research. First, there was significant variability in the definition of violent recidivism, particularly regarding whether sexual offenses were accounted for in this measure and which types of sexual offenses were included. Although we separated our pooled estimates for violent recidivism by its inclusion or exclusion of sexual offenses, this heterogeneity may still affect the validity of our meta-analytic conclusions and limit the comparability of our estimates with other risk assessment tools. Second, although all the included studies reported predictive validity, almost none reported calibration data. Although this is not a problem isolated to our meta-analysis, it continues to represent concerns with how well predicted risks align with actual outcomes. Furthermore, although most studies reported limited demographic data, more comprehensive reporting of variables such as age, race or ethnicity, gender, mental health diagnoses, and socioeconomic status would have provided more opportunity for subgroup analyses to examine the predictive validity of the VRAG-R in different populations.
Finally, there is an urgent need for more diversity and inclusion in research participation and recruitment. Given that Indigenous and other ethnocultural groups are disproportionately represented within North American criminal justice systems, it is problematic that most recidivism risk assessment tools have been developed and validated in predominantly White populations.50 Different ethnocultural groups may have different offense patterns and risk factors for recidivism that are unaccounted for in current risk assessment tools.51 Apart from the possibility of inaccurate risk assessment, it also reduces opportunities to identify risk factors that could be modifiable to reduce an individual’s risk and allow for culturally competent interventions. More research is also needed to examine the tool’s predictive validity across the lifespan and in different genders. Of the three studies focused on youth, only one examined sexual offending, whereas another reported that the VRAG-R demonstrated comparatively weaker predictive validity. Similarly, few studies had female participants, and the only study that focused solely on females demonstrated only moderate predictive validity. Thus, longstanding concerns remain regarding the generalizability of risk assessment tools that have only been validated in a homogenous sample.
Given the comparably lower predictive validity of the VRAG-R for sexual recidivism, more data on sexual offenses is needed to assess the tool’s performance in this domain. This limitation also extends to the original VRAG-R validation study, where violent recidivism was inclusive of sexual assault.6 Moreover, as the VRAG-R was introduced as a replacement of the SORAG, more research is needed to examine the performance of the VRAG-R in predicting violent and sexual recidivism in sexual offenders, which was the original purpose behind the creation of the SORAG. Although this approach is in keeping with our increasing recognition of sexual violence as a distinct form of violence, it also addresses concerns raised by the even lower AUC values observed for sexual recidivism in sexual offenders compared with our pooled estimates for sexual recidivism in all offenders.
Finally, the mild asymmetry on visual inspection of our funnel plot suggests the possibility of publication bias, whereby smaller studies with negative findings on the VRAG-R were not included, which may inaccurately inflate the pooled estimated AUC values across all recidivism categories.
Conclusion
To our knowledge, this is the first systematic review and meta-analysis examining the predictive validity of the VRAG-R across general, violent, and sexual recidivism since its introduction in 2013. Similar to its predecessor, the VRAG-R demonstrates acceptable to good discriminative accuracy for general and violent recidivism, although its performance in predicting sexual recidivism is more modest. These findings underscore its continued clinical relevance but also its limitations, including concerns regarding its generalizability.
As the validation samples remained demographically homogenous, their generalizability and applicability across culturally and contextually diverse populations remain ambiguous.
Future research in this area would benefit from methodological improvements, such as the reporting of both discrimination and calibration metrics. In addition, future research should prioritize examining the tools across diverse populations, including different age groups, women, racialized groups, and non-English-speaking populations.
Although the predictive validity of the VRAG-R for sexual recidivism is comparable with that of the other commonly used tools, it showed only modest ability to discriminate between those who do and do not reoffend, highlighting the need for more research in the variety of factors that contribute to sexual recidivism. Furthermore, future progress may be most impactful if focused on rigorous and inclusive validation practices instead of incremental refinement of existing tools designed for the same homogenous populations.
Footnotes
Disclosures of financial or other potential conflicts of interest: None.
- © 2026 American Academy of Psychiatry and the Law







