Abstract
Growing concern about the use of incarceration is driving significant reform in juvenile legal system decision-making and is likely to have a substantial impact on the role residential options play in the future continuum of care. It appears inevitable that surviving institutions or alternative residential models will be increasingly scrutinized for their impact on youth development. While rehabilitative models focused on youth development are a promising and growing part of residential institutions, few tools are available to measure quality. For institutions to sustain a focus on quality assessment, programs should use an organized and specified treatment model against which staff behavior can be assessed. This study examined the concurrent validity and item functioning of corresponding youth and expert ratings of social and therapeutic climate across multiple sites in a state-wide juvenile residential setting (n = 225 paired observations). Results suggest that the reliability of expert ratings of therapeutic climate exceeds the reliability of youth ratings, whereas reliability for other indicators of social climate is roughly equal between rater types. In addition, youth and expert ratings had weak concurrent validity. Implications for the use of youth versus expertly trained raters for measuring social and therapeutic environment are discussed.
Secure placement continues to be a common sanction for youth involved in the legal system. According to the Census of Juveniles in Residential Placement (United States), over 60,000 youth were placed in some type of secure facility in 2015.1 About half of these youth spent time in facilities self-classified as residential treatment centers or long-term secure placements, and, as a result, spent a significant amount of time exposed to institutional programming. Youth in secure placements tend to have more severe behavioral health needs than the general population2 and many institutions are not adequately prepared to meet these needs.3,4 Growing concern about the effectiveness and ethics of incarceration heightens the importance of attending to the impacts of these settings.5 To date, little research is available on the quality of forensic institutions generally, and for youth settings in particular.6
Although rehabilitative models focused on youth development (e.g., skills-based) are having a growing impact on juvenile corrections,7 little is known about the effective components of these models and the environments in which they are implemented.8–11 The available research typically evaluates therapeutic residential programs as black-box interventions, yielding little information on how variation in staff competencies, youth characteristics, clinical components, dosage, and adherence to the model affects effectiveness.9,12–14 Program quality monitoring tools can serve the dual function of supporting onsite implementation while contributing broader knowledge regarding the elements that drive successful outcomes in residential programs, including social climate and specific intervention techniques.
Social Climate
Social climate, or the “feel” of a unit’s social environment, is considered an important aspect of a rehabilitative milieu in adult and youth psychiatric and forensic psychiatric settings.15–17 Positive climate in these settings is associated with higher staff and resident satisfaction,18–20 lower institutional violence,21 stronger therapeutic alliance,22 and negative attitudes toward offending.23 Perceived safety and order are also important for facilitating positive social climate and outcomes. For example, research suggests that overcrowding is associated with youth violence toward staff and suicidal behavior.24 The correspondence between social climate and institutional adjustment suggests this is a valuable measurement construct to use for quality performance monitoring and research.17 An early measure of social climate for forensic settings, the Correctional Institutions Environment Scale,25 is widely used but is also criticized for poor internal consistency and unreliable factor structure.26–27
A more recently introduced measure, the Essen Climate Evaluation Scheme (EssenCES), was designed to address these shortcomings in addition to being shorter and easier to administer.28 The EssenCES is administered to both clients and staff and measures three domains of social climate in forensic settings: Therapeutic Hold (TH) measures client perceptions of staff support and care; Experience Safety (ES) measures how safe staff and clients feel in the unit; and the Prisoners’ Cohesion and Mutual Support subscale measures whether clients exhibit care toward each other. These three domains were developed to reflect face validity28 and provide evidence of the value of peer support in therapeutic communities on outcomes.29 A validation study of the EssenCES across adult prison and secure psychiatric settings found that social climate can be reliably measured using these domains.17
Adherence and Treatment Quality
Although social climate appears to be a strong predictor of institutional adjustment and treatment outcomes, it does not specifically capture adherence or the delivery of expected treatment components. Adherence to an expected treatment approach is strongly related to youth outcomes in general, including treatment addressing disruptive disorders,30 complex behavioral health treatment,31 and reoffending.32 Adherence in clinical trials and real-world monitoring is typically measured through the use of expert clinicians;33 however, the use of external raters to assess social or therapeutic climate is infrequent. We did not find any published forensic studies examining the validity of observational climate ratings of secure youth placements by external raters. As secure settings increasingly consider delivering complex, multicomponent therapeutic models, it will be important to ensure that self-report or more passive models of adherence assessment have adequate sensitivity. For example, dialectical behavioral therapy (DBT), a cognitive–behavioral therapy (CBT)-based intervention originally developed to treat borderline personality disorder, is now being widely used in rehabilitative placements and implemented at a pace that has exceeded the ability of research to assess its appropriateness for forensic contexts.34 Research on the sensitivity and validity of tools to assess quality implementation of therapeutic models is needed.
Current Study
The current study examines the concurrent validity and item functioning of two approaches used to assess the therapeutic environment across multiple sites within a state-run residential placement system for legally involved youth. As the extant literature has established the validity of youth report as a measure of social climate, we examined the correspondence of youth and trained external raters on social climate and related domains to replicate previous research39 and examine the validity of expert rater scores. We further examine the reliability of items specifically related to therapeutic environment to assess the relative performance of youth versus expert rater measures of social climate. We hypothesized that item functioning, reliability, and associations between youth and expert raters would be comparable on measures of social climate and that expert rater scores would demonstrate superior reliability on therapeutic domains.
The study procedures were approved by the Washington State Institutional Review Board.
Methods
Sample
The study sample included 1,740 youth from 13 youth residential placements from December 2008 through December 2013 in Washington State. All youth in the sample either had a Category A felony (e.g., manslaughter, assault in second degree, robbery in second degree) or had numerous prior criminal adjudications. The demographics of the sample included 979 youth of color (56%), including African American (n = 297, 17%), Latino/a (n = 325, 19%), American Indian/Alaskan Native (n = 67, 4%), Asian/Pacific Islander (n = 42, 2%), mixed race (n = 233, 13%), and other (n = 15, 1%). Approximately 44 percent of the sample was White (n = 758) and 0.2 percent did not report race/ethnicity in the sample (n = 3). Youth were between 11 and 20 years old (M = 16.62, SD = 1.62), and primarily male (n = 1,571, 90%). Institutional settings included four secure facilities (n = 2,115, 82%), eight community group homes (n = 441, 17%), and one boot camp (n = 14, 0.5%); these setting counts sum to 2,570 and therefore appear to refer to youth survey ratings rather than to individual youth.
Data and Procedures
Data for the study came from two administrative databases managed by the state juvenile residential agency, hereafter referred to as JR. The first database included youth survey ratings of environmental quality. The second database included environmental quality ratings conducted by highly trained quality assessment staff employed by JR who were external to the residential setting. During the study timeframe, all youth living in the residential units within each facility (i.e., secure facility, community group home/transitional program, or boot camp) were administered institutional quality surveys every two months. The surveys were developed to align with the institution’s externally-rated environmental quality assessment. These items were created by JR but align with published measures of institutional quality.6,35 Youth completed the surveys by hand, which were then collected by the environmental assessment team and subsequently entered into a centralized database. Collection by the environmental assessment team was expected to provide a higher level of anonymity than collection by the unit staff. Youth survey forms were not entirely anonymous, however, in that they included the youth’s JR number. How youth perceived the confidentiality of these forms is unknown, but we judge the risk of bias is low as unit staff did not receive punishment or reward as a result of youth responses.
Residential units were assessed by three different expert raters approximately every two months. Raters were full-time employees who received formal training in the assessment process by first observing and then being shadowed by existing raters until they achieved acceptable interrater reliability as determined by the assessment supervisor. The environmental rating process included a day-long site visit by trained staff raters to observe unit climate and residential staff practices. Each living unit was rated by two experts who then compared ratings, discussed discrepancies, and came to a consensus score.
For this study, items from the youth survey and quality assessment tool were organized to match the social climate domains of previous published studies of institutional climate (Table 1). Items from the JR tools were grouped to match the domains of these previous studies by the first and second authors who sorted items individually and then met to compare results and develop the final grouping.36 This was followed by confirmation from the remaining coauthors regarding the final item placement. Item sets aligned with four subscales of organizational functioning, including overall organization, staff connectedness, social support, and future orientation of the program. In addition to these validated domains, items reflective of the therapeutic orientation of the unit were grouped within a new domain the authors termed “treatment milieu.” These items focused on clinical components of the institutional treatment program and staff readiness to support treatment in the therapeutic milieu.
Measures
Overall Facility Organization
Overall organization was created using three items from the youth survey and three items from the expert tool (Likert scales). Example items include “Do you know what structure/activities to expect on a daily basis?” (youth, ranging from 0 [never] to 4 [always]) and “Important treatment specific information is communicated among staff daily” (staff, ranging from 0 [poor implementation] to 3 [strong implementation]).
Staff Connectedness
Staff connectedness was measured with one item from the youth survey and three items from the expert tool. Example items include “Does the staff’s voice remain firm and supportive when a youth is not following directions?” (youth) and “Staff are respectful in their communication with youth” (staff).
Social Support
Social support was measured using three youth items and four expert tool items. Example items include “Are staff working with you to accomplish your treatment goals?” (youth) and “Staff convey genuine regard and liking toward youth” (staff).
Future Orientation of the Program
The youth survey contained one item that directly aligned with the “future orientation of the program” subscale: “Do staff work with you on how to apply your community/home setting?” This item was added to the youth survey when the tool underwent minor revisions after a pilot testing phase (January–March 2012). Consequently, youth ratings for this item are only available for a subset of cases (n = 908).
Treatment Milieu
Two items were used from the youth survey and three items from the expert tool to measure treatment milieu. Example items include “Do staff help coach you on how to use your skills?” (youth) and “Staff apply [treatment] strategies in the milieu” (staff).
Analytic Strategy
During the study period, youth provided 2,570 living unit ratings and expert raters provided 677 living unit ratings across 36 living units (hereafter, units). This included 26 units across four secure institutions, nine community group homes, and one boot camp. Because youth and expert rater assessments were not always captured in the same month over the two-month period, scale scores for the expert rater and youth data were aggregated at three-month intervals to ensure the time period captured at least one mean rating from each source. For example, if youth survey scores were conducted at Month 1 and 3 and expert rating scores were available in Month 2, youth survey scores were averaged for Months 1 and 3 and the expert rating score from Month 2 was assigned to the aggregated 3-month time period (Months 1, 2 and 3). This aggregated score was then treated as a single time point (hereafter, time). After aggregation, the total number of observations included 243 time/unit expert ratings and 228 time/unit youth ratings (termed “analysis units”). Of these, 225 analysis units had both expert and youth ratings. As a result, correlation analyses between youth and expert scales were conducted using a sample size of n = 225 time/unit cases. Within this sample, there was an average of 3.49 expert ratings per time/unit (SD = 2.66; median = 3, range = 1–13), and 12.25 youth ratings per time/unit (SD = 8.83; median = 10, range = 1–48). Individual item performance as well as composite scores aggregated across living unit and three-month time periods (total of 14 time points) separately for youth and expert ratings were examined.
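The three-month aggregation described in this paragraph can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the procedure used in the study: the function names (`window_index`, `aggregate_scores`), the tuple layout, and all rating values are hypothetical, and months are assumed to be numbered consecutively from 1.

```python
from collections import defaultdict
from statistics import mean


def window_index(month, width=3):
    """Map a 1-based study month to a 1-based window (months 1-3 -> 1, 4-6 -> 2)."""
    return (month - 1) // width + 1


def aggregate_scores(ratings, width=3):
    """Average scale scores within each (unit, window) cell.

    `ratings` is an iterable of (unit, month, score) tuples (assumed layout).
    """
    cells = defaultdict(list)
    for unit, month, score in ratings:
        cells[(unit, window_index(month, width))].append(score)
    return {cell: mean(scores) for cell, scores in cells.items()}


# Hypothetical ratings: youth surveys in Months 1 and 3, expert visit in Month 2.
youth = [("A", 1, 2.0), ("A", 3, 3.0), ("B", 2, 1.5)]
expert = [("A", 2, 1.8), ("B", 5, 2.2)]

youth_agg = aggregate_scores(youth)
expert_agg = aggregate_scores(expert)

# Only time/unit cells rated by both sources can enter a youth-expert correlation.
paired = {
    cell: (youth_agg[cell], expert_agg[cell])
    for cell in youth_agg.keys() & expert_agg.keys()
}
```

In this toy example, unit A has both sources in the first window and is retained as a pair, whereas unit B is dropped because its youth and expert ratings fall in different windows. This mirrors how 225 of the analysis units ended up with both rating types.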
The analytic strategy was consistent with previous assessments of measure reliability for institutional quality (for example, see Ref. 6). Descriptive statistics were used to assess item functioning using SPSS version 24. Intraclass correlation coefficients (ICCs) were used to assess rater reliability for youth within the five subscales of organizational functioning. ICCs were computed using the following formula: ICC = covariance intercept variance/(covariance intercept variance + covariance residual estimate).37,38 ICCs were not computed for expert ratings because the environmental rating process required that raters reach consensus even though both rater scores are recorded as separate assessments.
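The ICC formula above is a ratio of variance components. The sketch below shows that arithmetic only; it assumes the intercept (between-unit) and residual variance estimates have already been obtained from a random-intercept model, as in the covariance parameter output the formula refers to. The function name and the numeric inputs are hypothetical.

```python
def icc_from_variance_components(intercept_variance, residual_variance):
    """Return the share of total variance attributable to the grouping factor.

    Both arguments are variance estimates from a fitted random-intercept
    model (assumed to be supplied by the caller; no model is fitted here).
    """
    if intercept_variance < 0 or residual_variance < 0:
        raise ValueError("variance components must be non-negative")
    total = intercept_variance + residual_variance
    if total == 0:
        raise ValueError("total variance is zero; ICC is undefined")
    return intercept_variance / total


# Hypothetical estimates: between-unit variance 1.0, residual variance 3.0.
example_icc = icc_from_variance_components(1.0, 3.0)  # 0.25
```

A value near 0 means ratings vary almost as much within a living unit as across units, while a value near 1 means raters within the same unit largely agree.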
Bivariate correlations were used to assess inter-relations among subscales as well as the relationship between youth ratings and expert ratings of organizational functioning. Independent samples t-tests were used to determine whether youth and expert ratings varied by institution type.
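For readers who want to see the two statistics named here in explicit form, the following sketch computes a Pearson correlation for paired scores and a pooled-variance independent samples t statistic. The study's analyses were run in SPSS; this code is only an illustration, the function names are hypothetical, and the pooled (equal-variance) form is an assumption inferred from the reported degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return sxy / sqrt(sxx * syy)


def pooled_t(group_a, group_b):
    """Independent samples t statistic (equal variances assumed) and its df."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Hypothetical paired time/unit scores (youth vs. expert).
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])  # 1.0

# Hypothetical standardized scores for two facility types.
t_stat, t_df = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

The p-value for the t statistic would be obtained from a t distribution with `t_df` degrees of freedom, which is omitted here to keep the example within the standard library.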
Results
Reliability × Rater Type
ICC results suggested substantial variation among youth as raters of organizational functioning across living units, with ICCs for individual items ranging from poor (ICC = .10) to excellent (ICC = .97).39 Compared with previously published youth ratings,6 our sample of youth raters demonstrated higher consistency in ratings of Overall Organization (ICC = .37 vs. .16, respectively), Social Support (.50 vs. .36, respectively), and Future Orientation of the Program (.50 vs. .37, respectively), and lower consistency in ratings of Staff Connectedness (ICC = .29 vs. .50). Youth in the current sample were found to be relatively consistent raters of the treatment milieu (ICC = .34).
In aggregate (mean of all quality assessment items), expert raters had high interrater reliability (ICC = .93) across living units and time, which closely resembled findings from a prior study using a subset of the same data (ICC = .98).40
Reliability by Facility Type
Independent samples t-tests using standardized means (z scores) indicated significant differences in ratings of the treatment milieu by facility type (secure vs. community group homes) among both youth and expert raters, with community group homes scoring lower on treatment milieu when rated by youth, t(222) = −3.68, p < .0001, and by expert raters, t(240) = −2.23, p < .05 (see Fig. 1). Descriptively, youth scored both secure facilities (M = .19, SD = .57) and community group homes (M = −.11, SD = .49) lower on treatment milieu compared with expert ratings of these facility types (secure facilities: M = .30, SD = .98; community group homes: M = .03, SD = .50).
We found no significant differences in youth or expert ratings of institutional order, caring adults, or reentry planning across facility type (secure vs. group home).
Item Functioning within Scales
Table 2 displays the descriptive statistics and reliability coefficients for the individual items and subscales of organizational functioning for both youth and expert ratings. The subscales demonstrated good reliability in our sample, with Cronbach’s α coefficients ranging from .61 to .78 for youth ratings and .69 to .83 for expert ratings. Youth ratings demonstrated comparable reliability for the Overall Organization subscale (α = .67) compared with the extant literature using youth ratings (α = .63),6 while expert rater scores demonstrated stronger reliability (α = .80). The new treatment milieu scale demonstrated acceptable reliability for both the youth (α = .61) and expert rater (α = .78) samples, with stronger reliability for expert raters. The range among subscale scores and items was greater for expert ratings, suggesting greater precision in measurement by experts compared with youth. This is demonstrated by comparing the distance of the lowest mean scale score from the average score aggregated across all scales. For expert raters, the lowest item score was “program effectively reinforces behaviors” (M = 1.41, SD = .71). Standardized, this item was .58 standard deviations from the mean of all expert rating items. In comparison, the lowest item score for youth ratings was “Would you describe staff as ‘excited’ to work with youth during interactions?” (M = 2.21, SD = 1.07). Standardized, this item was .36 standard deviations from the mean of all youth items.
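Cronbach’s α, the reliability coefficient reported in this section, can be computed from item-level scores with the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The sketch below is a generic illustration with hypothetical data and a hypothetical function name; it is not the SPSS routine used for Table 2.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, one row of k item scores per rating.

    Uses sample variances (n - 1 denominator) for items and total scores.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows of item scores")
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("every row must contain the same k >= 2 item scores")
    item_variance_sum = sum(variance(column) for column in zip(*rows))
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical two-item scale in which both items always agree.
consistent = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
alpha_consistent = cronbach_alpha(consistent)  # 1.0

# Hypothetical two-item scale with some disagreement between items.
mixed = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
alpha_mixed = cronbach_alpha(mixed)
```

When items move together perfectly, as in the first example, α reaches 1; disagreement between items lowers the coefficient.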
Convergent and Concurrent Validity
Scale intercorrelations are reported in Table 3. All subscales of the organizational functioning dimensions were moderately inter-correlated between rater groups, demonstrating weak concurrent validity: Overall Organization (r = .41), Staff Connectedness (r = .43), Social Support (r = .40), and Treatment Milieu (r = .49). Within raters, all scales were expected to demonstrate convergent validity given the interdependence and conceptual overlap among domains. Convergent validity was strongest for the expert raters with all scales highly correlated (r = .75 to r = .82). Youth ratings were moderately to highly correlated (r = .66 to r = .78).
Discussion
This study examined the concurrent validity of youth and expert ratings of social climate and treatment milieu in a state-wide juvenile residential system. Consistent with the study hypothesis, we found that the reliability of expert ratings of treatment milieu exceeded the reliability of youth ratings, although both were in the acceptable range and concordant with previous reliability studies of institutional climate ratings by youth (References 6 and 35, for example). The analysis also revealed that youth and expert ratings were only modestly correlated, at the lower bound for acceptable concurrent validity.
Low concurrent validity raises questions about the adequacy of measurement. External staff ratings demonstrated higher observed reliability in scale scores. This strongly suggests that expert ratings were the more accurate measure of institutional social and therapeutic climate. At the same time, the findings also replicated previous analyses demonstrating acceptable consistency and reliability of youth as raters of environmental quality. As between-youth reliability scores (as measured by intraclass correlations) were comparable with previously published youth scores on the same domains,6 we see the results of the current study as replicating and confirming extant research on youth as adequate raters of institutional quality. Together, results suggest that while youth ratings are adequate measures of institutional social climate, greater reliability will likely be achieved with expert raters.
Greater range in standardized item scores among expert ratings also suggests their reviews yielded higher precision in assessment. This was most apparent in the measurement of the new treatment milieu construct as measured by deviation from the averaged subscale scores. The mean expert rating of treatment milieu was lower than other subscales while the youth rating of treatment milieu did not differ from other subscales. The reliability of measurement was also higher for experts than youth in this domain, which supported our hypothesis that experts would have more knowledge of acceptable levels of fidelity or adherence to the therapeutic approach, which led them to be more precise, and consequently, harsher raters. This aligns with other clinical research in which trained, expert coders of specialized treatment approaches rate therapist competence more harshly than ratings using self-report or client assessment.41
At the same time, more precise measurement may not yield meaningful difference in predicting youth skill improvement or outcomes. Although there is some indication that treatment fidelity is related to client symptom improvement,17,22,30 the overall literature is mixed. A number of studies demonstrate the importance of nonspecific factors, like therapeutic rapport, independent of specific skills as predictors of client recovery. For juvenile residential settings, there is a small literature demonstrating the predictive strength between youth ratings of institutional climate and recidivism outcomes.6 Further research is needed to determine whether more precise and reliable ratings of social and therapeutic climate made by external raters within an established quality assurance infrastructure translate into more robust indicators of youth outcomes.
Limitations
The study is limited to the therapeutic environments in six residential units across one state. The findings are expected to generalize only to those youth institutions that are using a structured therapeutic approach in which staff are expected to engage in routine positive reinforcement and coaching of youth behavior and social-emotional skills (e.g., problem-solving, emotion regulation, stress management). Although more than half of the youth were youth of color, the largest single demographic was White, and the youth ratings may reflect perceptions that align with systematic differences in experiences of these environments by race/ethnicity and gender that are not fully accounted for in our models. We note the lack of multivariate models to examine possible moderators of youth or staff response by race/ethnicity as a limitation and the results should be generalized with caution. Given the significant variability in youth inter-rater reliability, it is likely that variance due to youth factors not accounted for may affect some of the associations identified between youth and expert rater validity.
Conclusion
Youth rehabilitation and treatment models are growing in importance in the operations of secure juvenile placements. This study found that expert raters provide more precise and reliable assessments of institutional social and therapeutic climate, suggesting that the investment in expert-led quality monitoring is important for valid measurement of therapeutic placements. As youth ratings were still in the acceptable range of reliability, institutions not able to invest in expert-led review should continue with youth ratings of quality. As expert rating requires more workforce training and resources, the benefit of more precision may be outweighed by the greater feasibility of obtaining youth ratings. Future research will need to examine whether more precise ratings of institutional quality by experts are also better predictors of youth outcomes.
Footnotes
Data for this study was collected with the support of a grant from the National Institute of Justice (2012-IJ-CX-0040).
Disclosures of financial or other potential conflicts of interest: None.
© 2022 American Academy of Psychiatry and the Law