2010 6th Annual Scientific Meeting – Translation and Cultural Adaptation of Key Outcome Measures in International CNS Trials – Session Summary
International Society for CNS Clinical Trials and Methodology Annual Meeting
Washington, DC
22-24 February 2010
Chairs: Amir Kalali, MD; Richard Keefe, PhD
Speakers: Amir Kalali, MD; David Daniel, MD; Richard Keefe, PhD; Philip Harvey, PhD; Tom Laughren, MD; Karl Broich, MD; Ravi Anand, MD
The Current Reality of Outcome Measures in International Clinical Trials: Should We Be Doing Things Differently?
Amir Kalali
Dr. Kalali framed the questions to be addressed by the session: What is our current practice in the use of outcome measures in international trials, and what should we be doing? Dr. Kalali noted that there is variability in the quality, and heterogeneity in the focus, of symptom outcome measures used in clinical trials. Some scales, such as the HAM-D, were not originally designed to assess outpatients in clinical trials and contain items that are not very sensitive to change. Patient-reported outcome measures (PROs) are a different kind of measure from clinician-rated scales and may require different approaches to translation and adaptation. Guidelines for the translation and validation of PROs are available (the ISPOR Principles of Good Practice), but there are no equivalent guidelines for clinician-rated scales.
Dr. Kalali defined some commonly used terms surrounding the adaptation of scales for international use. If translation is thought of as conceptually accurate conversion from one language to another, linguistic validation is the validation of that conceptual accuracy by a pilot study. Cultural adaptation adds to linguistic validation both item customization and analysis of the scale's psychometric properties in the new culture. Standardization is accomplished when there is complete psychometric validation and establishment of standard scores in the new language and culture. Because of practical budgetary and time constraints, translation is often the only step taken. At present there is a lack of consensus on when translation is necessary and variability among sponsors in their approaches. It would be useful if 1) the field developed consensus on these issues for clinician-reported outcome measures, 2) official translated versions were available to all, and 3) more methodological research were conducted addressing these issues.
Challenges and Best Practices in Training and Monitoring Measurement of Psychopathology in Linguistically and Culturally Diverse Settings
David G. Daniel
Dr. Daniel gave an example of the challenges associated with the use of psychiatric rating scales by presenting data showing different patterns of YMRS ratings among US, European, and Indian raters. The key question he posed was: What are the most appropriate endpoints for training, certification, and measures of inter-rater reliability? This question comprises several others. Should rater training be unified across all cultures in a study, or should it emphasize “synchronization” within a cultural region? Should an effort be made to synchronize each individual item on a rating scale, or should the focus be on the total score? Should efforts be focused on standardizing cross-sectional measurement or on standardizing the assessment of change? Should the emphasis be on agreement with a “gold standard,” whether centrally or regionally determined, or on concordance among raters, whether global or regional?
While these questions do not lend themselves to easy or comprehensive answers, Dr. Daniel made some observations that could guide the planning of cross-cultural rater training and measurement. The measurements that may be the most difficult to standardize involve sexuality, aggression, and attitudes toward authority. Items based solely on interviewer observations are especially hard to standardize. Forced shifts from entrenched cultural practices tend to regress back to habitual practices over time. The manner in which feedback is given, in particular whether it leads to loss of face, affects its acceptability; native-language experts synchronized to global practices may be the best channel for giving feedback to raters. Culturally neutral measures of concordance, such as agreement with the mode, may be useful in international trials. Total scores may be more relevant, and less sensitive to cultural influences, than individual items.
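The “agreement with the mode” index mentioned above is straightforward to compute. A minimal sketch in Python (the function name and the 0–1 scaling are illustrative assumptions, not specifics from the session):

```python
from collections import Counter

def modal_agreement(ratings):
    """Fraction of raters whose score matches the modal (most frequent)
    rating for an item. Unlike agreement with a designated gold-standard
    rater, the mode is defined by the rater pool itself, which is what
    makes the index culturally neutral."""
    counts = Counter(ratings)          # score -> number of raters giving it
    mode_count = max(counts.values())  # size of the largest agreeing group
    return mode_count / len(ratings)
```

For example, if five raters score an item 3, 3, 3, 4, 2, the modal agreement is 3/5 = 0.6, with no reference to any external standard.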
Translating and Implementing Measures of Neurocognition in Eastern Europe and Asia
Richard S. E. Keefe
Dr. Keefe began by reviewing the components of the MATRICS battery and of the BACS and BAC-A batteries. Guidelines for adapting psychological tests to different languages and cultures have been established and refined over several decades. They delineate four aspects of test equivalence: construct equivalence, functional equivalence, translational equivalence, and metric equivalence. Back translation alone is not enough to establish translational equivalence, since it may not account for meaning in the new culture. The question “where does a bird with webbed feet usually live?” provides an example: in Swedish, “webbed” is translated as “swimming,” making the question easier to answer. To the extent that a test is not culturally sensitive or linguistically equivalent, cognitive outcome measures will increasingly depend on features unrelated to the intended construct, and thus will be correspondingly less sensitive to real treatment change. Furthermore, the US population is used to testing, and American culture values speed; in other cultures there is more emphasis on being correct, and speed may not be valued and may even be considered rude. If a test depends on understanding English or American culture, a drug relying on that test as an outcome measure will not demonstrate an effect. Dr. Keefe cited the example of MATRICS testing in Singapore, where t-scores varied widely across ethnic groups.
Social cognition and emotional recognition tests are particularly sensitive to cultural differences. Cultural issues with other key tests include problems of translation for verbal memory tests; issues with letter-number sequencing in Mandarin, for example, which does not have a sequential ordering of letters; different emotional valences associated with the same word in different cultures; and differences in the level of comfort with computers.
There are good reasons to collect community normative data for translated versions of cognitive batteries. Differences in the difficulty level between English and translated tests and cultural variations in the meaning of terms will be detected and adjusted for. Scores are put on a common metric across languages, based on the mean and standard deviation of stratified community samples. This enables comparison across samples using different languages.
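The common-metric step described above is standard score conversion. A minimal sketch, assuming t-scores (mean 50, SD 10) computed against a language-specific community normative sample; the function name is an illustrative assumption:

```python
import statistics

def to_t_scores(raw_scores, community_norms):
    """Convert raw test scores to t-scores (mean 50, SD 10) using the
    mean and standard deviation of a community normative sample collected
    for the same language version of the test. Scores normed this way sit
    on a common metric and can be compared across language versions."""
    mean = statistics.mean(community_norms)
    sd = statistics.stdev(community_norms)
    return [50 + 10 * (score - mean) / sd for score in raw_scores]
```

A raw score equal to the community mean maps to 50 regardless of the absolute difficulty of that language version, which is what allows pooling of samples tested in different languages.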
Detailed attention to the issues surrounding the development and implementation of culturally valid neurocognitive tests is likely to enhance any real signal in a clinical trial with cognitive endpoints.
Cultural Considerations of Functional Outcomes in Chinese and Other Cultures
Philip D. Harvey
For years, behaviorally oriented therapists have appreciated the “competence-performance distinction,” that is, the distinction between “can I do it?” and “do I do it?” This distinction is critical when attempting to enhance outcomes by targeting disability directly rather than targeting its determinants. For practical reasons, real-world outcomes such as marriage, employment, and residential status are unrealistic targets for treatment studies. Thus measures of competence may be the most important outcome measures. Whether competence can be measured in a valid way, and whether it can be measured across cultures, are open questions.
Dr. Harvey presented data showing no correlation between cognitive ability and work or marital status. A large-scale study in patients with schizophrenia found that residential status could be predicted with good sensitivity and specificity using the UPSA. The evidence from several studies is that cognitive ability and functional capacity are closely correlated and that the UPSA correlates well with cognitive ability.
There have been two studies to date measuring performance-based disability across cultures. The first measured cognition and disability in rural Sweden and the Bronx. Patients in Sweden were more likely to be financially independent and to be living independently, but this seemed to be due to greater government involvement in caring for these patients. The second was a study in China that used the UPSA-B to compare healthy subjects to patients with unipolar depression, bipolar disorder, and schizophrenia. At the lowest level of education there was no significant difference among the groups. However, when subjects with less than 6 years of education were excluded, the performance of the healthy subjects was clearly better than that of the patients.
Cultural Aspects of Outcome Measures – European and US Regulatory Viewpoints
Tom Laughren
Dr. Laughren began by noting that this whole area has not received a lot of attention from the Division, especially questions surrounding instruments that measure function. Laurie Burke and the SEALD (Study Endpoints & Label Development) group at the FDA have recommended to the Division that for multinational trials the FDA ask for information on item validity and probably on language subsets. Dr. Laughren said that a guidance on these issues will be coming. A lot of work has already been done, such as the ISPOR standards and work by the International Society for Quality of Life Research and the International Test Commission. The FDA’s guidance on patient-reported outcomes begins to address these issues as well, and many of the same basic principles apply. Translation and cultural adaptation issues will be discussed at end-of-phase-two meetings and pre-NDA meetings. Two domains in particular, sexual function and suicidality, stand out as challenges for assessment across cultures. Dr. Laughren noted in conclusion that the integrity and validity of multinational clinical trials depend on the validity of the measures.
Karl Broich
Dr. Broich said that Dr. Laughren had nicely summarized the regulatory perspective and that European regulators had similar views. There is an EMA document on clinical data collected outside the EU, but it is written primarily for use by EMA assessors. The EMA has seen differences in outcome across countries and regions. One schizophrenia study found significant differences for the drug over placebo in Russia but not in the EU or US. Three drugs are approved for fibromyalgia in the US but not in the EU, because large effects were observed in the US population but not in the EU population.
Discussion
Ravi Anand
Dr. Anand began by noting that the use of diagnostic psychiatric, cognitive, or functional ability instruments outside the settings they were developed for is fraught with significant issues. Abnormal scores may be wrongly attributed to pathology rather than to factors such as educational level, literacy, and cultural differences in cognitive and perceptual information processing, leading to overdiagnosis or, for similar reasons, underdiagnosis. The issue is complicated by differing cognitive and perceptual development in populations with different cultural and physical environments: members of different cultures are exposed to different perceptual and cognitive problems in daily life. Thus test materials not validated for a cultural group carry different psychometric properties and normative cut-offs than they do for the original group.
The ‘gold standard’ solution is to conceptualize, develop, test, and validate instruments in the actual population to be studied. The ‘silver standard’ is to translate with cultural adaptation, perform reconciliation testing to check for linguistic distortion, and conduct a validation study to support the results. The ‘iron standard’ is to translate, back-translate, and reconcile. Diagnostic instruments require far greater standardization than outcome instruments; diagnosis should not be made with translated instruments.
