Neuropsychopharmacology: The Fifth Generation of Progress
Methodological and Statistical Progress in Psychiatric Clinical Research: A Statistician's Perspective
Helena Chmura Kraemer, Ph.D.
A generation or two ago, there was still active debate as to whether clinical questions in psychiatry could be addressed with scientific research. Such questions arose from a premise that the interaction between clinician and patient in a psychiatric framework was so specific to that particular clinician and that particular patient that it was inappropriate to generalize from any sample of patients or any sample of clinicians. One hears little such debate now, and recent years have seen the standards of excellence in psychiatric clinical research become as stringent as those in any other area of clinical research.
One result of this evolution is that the methodological and statistical tools developed in other areas of medical research have become available to psychiatric research. Such tools are relevant to issues of a) research question conceptualization and specification, b) representativeness of samples and of generalization from particular samples to populations, c) measurement and classification, d) powerful and cost-effective research designs, e) statistical analysis and f) documentation and presentation of results. At the same time, research methods developed within the context of psychiatry to cope with the special problems of studying human behavior are being "exported" to other areas of medical research. This development has occurred because behavioral issues (e.g., diet, exercise, affect, smoking, coping with stress, quality of life) have become more relevant to the prevention and to the successful treatment of physical disorders such as cancer, heart disease, and acquired immune deficiency syndrome (AIDS). This blurring of the lines between clinical research in psychiatry and in other fields of medicine promises to continue and to increase, with benefits to all fields concerned.
In recent years, psychiatric research journals have published randomized clinical trials, including several multisite studies, assessing the efficacy and, with growing frequency, the effectiveness of pharmacological and psychotherapeutic treatments for mental disorders. We have seen epidemiologic studies with a national and international perspective and application of new and powerful genetic models to psychiatric disorders. We have seen the DSM-IV move toward a more empirical basis than was true of earlier DSM versions, as a result of serious and thoughtful reconsideration of the problems of accurate diagnosis and prognosis of psychiatric disorders. A refocusing of thinking about psychiatric disorders has occurred, with a more developmental perspective and a growing emphasis on prevention, detection of disorders, and cost-effectiveness of both detection and treatment methods. Cross-disciplinary interactions have brought new insights and strengths to the study of mental disorders.
That's the good news. Now for the bad: the availability of these methodological riches has not translated into their wide, consistent, and successful usage. While there are many methodological "gems" in the psychiatric literature, the statistical methods used in most research papers in the psychiatric research literature are generally those that were current in medical research 50 or more years ago: one- and two-sample t tests, Pearson product moment correlation coefficients, simple chi-square tests of independence in contingency tables, simple analysis of variance, or linear regression models (7). To make matters worse, frequently even these classic statistical procedures are misapplied, yielding potentially invalid results (1, 34). Even more frequently, these procedures are applied when much more powerful and informative methods are available (23).
Who cares whether the statistical analysis is done right or not? Because a primary goal of statistical methodology is to identify strategies to ensure the reproducibility, validity, and generalizability of research results, the natural result of misuse and abuse of statistical methods is nonreplicability. It is thus important to note how frequently the introductory sections of psychiatric research proposals or papers extensively tally the inconsistencies within the body of previous work on a particular issue: five papers say "yea," five say "nay," and the remaining 20 are inconclusive. Practice guidelines in psychiatry (e.g., see Stress) sound a recurrent theme: "... it is unclear ..."; "... currently no clear consensus ..."; "... more definitive strategies cannot be suggested ..." The basis for many of the guidelines for psychiatric practice continues to be not clinical research results but, instead, what is "intuitively appealing," "expert opinion," and "conventional wisdom." Over and over again, the plea is made for more definitive data and for more specific and convincing studies.
Even more disturbing is the spate of results published by prestigious research groups, particularly in psychiatric genetics, that have later been retracted or contradicted. Such events carry serious messages about the inadequacy of the design, sampling, measurement, or analytic procedures used in those studies. Who cares whether the statistical and methodological issues are well dealt with? We all should—patients, policy makers, psychiatric clinicians, researchers, and statisticians.
We who consider ourselves statisticians with a special interest in psychiatric research must share in the responsibility for this situation. It may well be that there are too few of us, and those few make themselves inaccessible to psychiatric researchers (23), or that we communicate or teach badly or reluctantly. There may be a certain arrogance and contemptuousness in our interactions and communications that can be off-putting. Perhaps we are impatient dealing with the real problems of psychiatric researchers, and sometimes we do ask for the impossible: sample size in the thousands where only tens are available, absolutely reliable measures where the best fall short of that, and so on. Perhaps we are sometimes more concerned with elegant and complex mathematical methods than with whether those methods serve the needs of psychiatric research.
Perhaps so, but the major source of the problems lies instead in the quantity and quality of training of psychiatric researchers, which often impairs interdisciplinary communication. Many psychiatrists receive little or no training at all in statistical methodology. Psychiatrists who lack sensitivity to the ways in which poor statistical approaches can produce misleading results are poorly equipped to interact effectively even with the most well-motivated statistician, let alone to read the psychiatric research literature or to produce important and reproducible research findings. For example, too many have no clear understanding of what "statistically significant" does and does not mean (4, 29); they often mistakenly interpret the term to mean true, big, or important (see Pharmacological Treatment of Obesity).
When psychiatrists are given statistical training, the emphasis of that training is frequently on "number crunching": how to compute a t-test statistic or a product moment correlation coefficient and, most of all, how to generate those all-important but frequently misleading "stars" indicating statistical significance. There is little coverage of the more fundamental concepts: the vital issues of sampling, measurement, and analysis. Instead, particularly with today's ready access to computer statistical packages and the greater comfort of today's psychiatrists with the use of computers, the emphasis is on which computer buttons to push. Very complex statistical methods are frequently implemented by computer programs that are widely distributed to researchers who know little or nothing about their underlying assumptions, the robustness of the results to deviations from those assumptions, and hence any of the caveats to using the methods. The disquietude statisticians feel about this situation is analogous to what physicians might feel if all drugs currently available only with a physician's prescription were made available over the counter. Under these circumstances, nonreproducibility and nonvalidity of results are inevitable (6).
In the journal review process, not infrequently a paper submitted with "state-of-the-art" statistical approaches, clearly explained and documented, is rejected by reviewers and editors on the claim that the readers of the journal would not understand or would not be interested. Reviewers and editors sometimes even demand as a condition of publication that t tests or Pearson r tests be used when these simple methods are neither appropriate nor optimal. Frequently, but not always, the same results can be validly achieved with less complex statistical approaches, in which case they should be (an application of Occam's razor). However, routinely rejecting papers for necessary complexity, or routinely sacrificing results by demanding less-than-optimal approaches merely because they are simpler, insulates the psychiatric community from ever upgrading its methodological skills. Thus, the "horse and buggy" methods continue as the mainstay of the field, even with the ready availability not only of "horseless buggies" but also of modern high-speed cars, trains, jets, and perhaps even spaceships.
The focus of this discussion will not be on the general range of statistical methods commonly seen in the psychiatric research literature in the last generation, nor will it be on the range of statistical methods available. That information is readily available in other sources to those interested. Instead, the focus will be on those statistical and methodological issues that within this last generation have become most salient in addressing psychiatric clinical research questions. It is not the practice but the pitfalls and the potential of statistical methods in psychiatric clinical research that will be considered—namely, those that are likely to have the greatest impact, positive or negative, on future progress in this field.
Most fundamental to all psychiatric research is the reliability (precision) and validity (accuracy) of a psychiatric diagnosis. Psychiatric diagnoses are used for selection of patients for a study, defining both inclusion and exclusion criteria. They are used as outcome measures in epidemiological assessments of risk factors, in genetic studies, and in clinical trials. Diagnoses are used in monitoring subjects over time, whether to (a) assess the development of a disorder, (b) assess the natural history of a disorder or its resolution with treatment, or (c) remove subjects from a study because of the emergence of side effects. In short, the quality of diagnosis affects every aspect of every type of psychiatric clinical research. If a diagnosis is not valid (accurate), subjects may be inappropriately included in (or excluded from) a study, and the outcome measures may simply be wrong, possibly biasing all the conclusions based on that study. On the other hand, if the diagnosis is valid but not very reliable (precise), signals are muddled. Sample sizes necessary to detect what is going on must be very large, because one result of low reliability is low power. Therefore, research becomes less efficient. That problem can sometimes be overcome by enormous expenditure of time or money to generate sample sizes in the hundreds or thousands. But then what is seen in the results of such studies may seem very weak and unimpressive, because effect sizes, as well as power, are attenuated by unreliability (19). Inadequate reliability and validity of diagnoses used in different studies may be one of the major sources of inconsistent and nonreproducible results. For all these reasons, the potential of statistical methodology in psychiatric research is fundamentally related to how well the issue of quality of diagnosis can be addressed (see The Psychopharmacology of Sexual Behavior).
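One standard way to quantify this attenuation (the classical correction-for-attenuation relation, offered here as an illustration rather than as the formula used in ref. 19) is

$$ r_{\text{observed}} = r_{\text{true}} \sqrt{\rho_{XX}\,\rho_{YY}}, $$

where $\rho_{XX}$ and $\rho_{YY}$ are the reliabilities of the two measures being correlated. If each measure has reliability 0.6, a true correlation of 0.5 is observed, on average, as about 0.5 × 0.6 = 0.30; and because the sample size needed to detect a correlation grows roughly as the inverse square of its magnitude, the attenuated effect requires nearly three times as many subjects.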
A Diagnosis Is Not the Same Thing as a Disorder
A disorder represents something wrong (a disease, an abnormality, an injury, etc.) in the patient. A diagnosis, on the other hand, represents the opinion of a clinician as to whether or not that disorder is present. One can think of a diagnosis as having three independent components, one of which is the presence of the disorder:
Diagnosis = A: Disorder in the patients
+ B: Characteristics of the patients irrelevant to the disorder (contaminants)
+ C: Random error (noninformative about the patients).
Among patients of a population, the proportion of variability in the diagnosis directly due to variability among the patients in the presence of the disorder is the classical definition of "validity" or accuracy of the diagnosis (schematically A/(A + B + C)). The proportion of variability in the diagnosis among patients of a population that is due not to error but to true differences among the patients is the classical definition of "reliability" or precision of the diagnosis (schematically (A + B)/(A + B + C)). It is important to note that under these definitions (24) one cannot speak of the reliability or validity of an individual patient's diagnosis, but only of the reliability or validity of a diagnosis in a population of subjects. It is clear that, by these definitions, reliability is always at least as great as validity. A totally unreliable diagnosis cannot be valid, but a totally reliable diagnosis can be totally invalid; that is, you cannot be completely inconsistent yet always right, but you can be consistently wrong.
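Restating these definitions schematically in variance-component form (notation introduced here for clarity, not taken from the chapter):

$$ \text{Validity} = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_B^2 + \sigma_C^2}, \qquad \text{Reliability} = \frac{\sigma_A^2 + \sigma_B^2}{\sigma_A^2 + \sigma_B^2 + \sigma_C^2}. $$

Their difference is $\sigma_B^2/(\sigma_A^2 + \sigma_B^2 + \sigma_C^2) \ge 0$, which is why reliability can never fall below validity, and why a diagnosis dominated by contaminants (large $\sigma_B^2$) can be highly reliable yet nearly invalid.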
Moreover, both the reliability and the validity of a diagnosis can and do vary from one population to another. In a population in which the disorder is either very rare or very common, there will be little variability among the subjects that is due to the disorder itself (A), and this variability is easily overwhelmed even by moderate contamination (B) or minor error (C), leading to the so-called "base-rate problem."
However, the variations of reliability from one clinical population to another depend not on the prevalence in the population, but on the degree of homogeneity of the population. Low- and high-prevalence populations tend to be homogeneous, and thus tend to have low reliability, although very homogeneous moderate-prevalence populations will also demonstrate low reliability. Consequently, it is more difficult to develop highly reliable procedures for very-low-prevalence or very-high-prevalence populations. That sounds a warning that it will be very difficult to find true effects of any clinical or policy significance in populations in which the disorder of interest is very rare without investing major effort to find diagnostic procedures for that population that are very reliable.
The types of major effort that are necessary are well known. The diagnostic protocol (definitions and rules) must be made clear, the conditions under which the diagnostic protocol can validly be applied should be stipulated and adhered to, and the diagnosticians should be well-trained and consistent in their application of the diagnostic protocol. If all this effort does not produce acceptable reliability in a low-prevalence population, it has been known since 1910 (3, 32), and most clinicians intuitively understand, that consensus of repeated independent assessments has greater reliability than any one such assessment. This principle has been shown to apply to diagnosis (19), and methods have been proposed as to how to form the appropriate consensus to accomplish this purpose (21). The procedure of having a second or third opinion on each subject in a study may well be costly and time-consuming, but it may spell the difference between clear, unambiguous findings and non-statistically significant results difficult to publish, statistically significant results that appear to have little clinical significance, or simply inconsistent and confusing results. This becomes all the more important in studying low-prevalence disorders.
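The 1910 result alluded to here is presumably the Spearman–Brown formula: under its assumptions, the reliability of the consensus (average) of $k$ independent assessments, each with reliability $\rho$, is

$$ \rho_k = \frac{k\rho}{1 + (k-1)\rho}. $$

For example, if a single diagnostic assessment has reliability 0.6, the consensus of three independent assessments has reliability $1.8/2.2 \approx 0.82$.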
The Importance of Validity
While a great deal of attention has been focused on reliability, the issue of the validity of psychiatric diagnoses has been, to some extent, neglected. Contaminants of a diagnosis (B, above) may include such factors as level of education, facility with the language, level of cooperation of the patients or their families, and so on. Random error (C, above) may include inconsistencies in the patient's manifestation of symptoms or discrepancies between the reports of the patient and of the family used in the diagnosis (when the patient is examined in the morning she says or does one thing, whereas in the afternoon she says or does another; the disorder, however, remains present or absent). It may also include observer (diagnostician) error or instrumental error (i.e., some problem intrinsic to the diagnostic method). How one designs a study to assess the reliability and validity of a diagnostic procedure in a population depends on which B and C sources one is concerned about, and in what population.
Consequently, how strong the influence of irrelevant information (B) is may also vary from population to population, and thus the validity of a diagnosis may vary as does the reliability. If a major source of such irrelevant information is related to sociodemographic differences, the reliability or validity in a population of white, middle-class subjects may be considerably higher than in a population more heterogeneous in terms of race, education, and income level. It is an inconvenient fact of life that the reliability and validity of a diagnosis must be reestablished for each different population to which it is applied, and by each different research unit that seeks to apply it.
Questions of the quality of diagnosis have attracted far more research attention in psychiatry than in other medical fields, which might seem to suggest that there are more problems with psychiatric diagnosis than with medical diagnosis in general. Psychiatric clinicians and researchers are typically very surprised to find out that what evidence there is on the reliability of diagnosis in other medical fields suggests strongly that the reliability of psychiatric diagnosis in general is neither much better nor much worse than in other fields of medicine (17, 18). This is an important observation, because one recurrent argument used to delay looking into the validity of psychiatric diagnoses has been that the reliability remains too low to start such studies. It may well be that reluctance to pursue the issue of validity aggressively will, in the long run, make the difference.
In principle, it is relatively easy to establish the reliability of a diagnosis, to increase that reliability if it is not adequate, and to cope with its effects on power and effect size in research. The real challenge is the assessment of validity (27). Because the validity of a diagnosis depends on how closely the diagnosis corresponds to the disorder (A, above), some criterion of the presence/absence of the disorder must exist—that is, a "gold standard"—against which the diagnosis can be assessed. It is repeatedly and correctly stated that such "gold standards" for psychiatric diagnoses do not exist. Nor do they for other medical diagnoses!
What differentiates the situation for psychiatric versus other medical disorders seems to be some sort of process of "triangulation." Generally one starts with a "face valid" diagnosis, a diagnostic definition that incorporates the clinical view of the range and importance of symptoms involved in the disorder (as in the DSM and most psychiatric approaches), often suggested by clinical observation of subjects and their responses to treatment. Then, as one uses that diagnosis in research studies, one slowly gains greater understanding of the etiology, the risk factors, the biological concomitants, the symptomatology, the natural course, or response to treatment that characterizes the disorder and differentiates it from other disorders. This empirical knowledge is then "folded into" an upgraded diagnostic procedure, thus moving closer and closer to the goal of a completely valid diagnosis. As the diagnosis improves, the quality of the research information improves, which then spawns further improvement in the diagnosis until some upper limit (usually considerably short of perfection) is reached.
What seems to differentiate psychiatric diagnosis from other medical diagnoses is that process of "folding in" the results of research studies to upgrade diagnostic procedures for the next generation of research studies. For example, the diagnosis of coronary artery disease today is far different from what it was a generation ago; it has progressed far beyond simple observation and classification of clinical signs and symptoms, and it continues to be updated. In contrast, the process of diagnosis of schizophrenia and depression over the same period of time has changed very little. It is still based only on "face valid" observation and classification of signs and symptoms.
Fundamentally, there is no one "gold standard" for the validity of any diagnosis. There are many. None of them are pure "gold," and all of them change over time. It is reasonable to use the DSM-III-R as a "gold standard" for the DSM-IV if the accumulated clinical evidence supports that the DSM-III-R "works," but then to propose to supplant the DSM-III-R with the DSM-IV diagnosis if it can also be shown, using other "gold standards," that the DSM-IV is better in some respects than the DSM-III-R (more reliable; more consistent over sites; more closely related to functional impairment; less subject to certain types of contamination; more homogeneous in etiology, course, response to treatment, etc.). If that can be done, then the DSM-IV becomes the "gold standard" for the next generation of studies. Then, if the next generation of studies based on the DSM-IV discloses biological concomitants of a disorder (perhaps blood and urine tests, imaging, genetic screens), these procedures should ultimately become part of the DSM-V or DSM-VI.
The implementation of such validation studies is greatly facilitated by the application of signal detection methods (20). These are a body of exploratory procedures in which a "gold standard" (acknowledged to be less than perfect) is used to assess a set of signs, symptoms, contaminants, biological responses, and any other readily obtained information (collectively called "tests") thought to be relevant to the disorder. The goal is to determine the best choice of "tests," the optimal way of combining these "tests" (symptoms a and b, for example, or symptoms a or b?), and the optimal cutpoints (5 out of 7 symptoms, for example, or 3 out of 7?) so as to generate tests that are optimally sensitive, specific, or efficient relative to the "gold standard," depending on the relative clinical importance of false negatives and false positives. Moreover, one can include considerations of test cost as well as accuracy. Signal detection methods provide tools to perform all of the types of tasks that have been done subjectively, and without empirical scientific documentation, in the formulation of the DSM and other psychiatric diagnostic systems.
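A minimal sketch of the cutpoint-selection step is given below; the data, the weight parameter w, and the simple weighted index are illustrative assumptions, not the full signal-detection methodology of ref. 20.

```python
# Minimal sketch: choose a symptom-count cutpoint against an imperfect "gold standard".
# Data are hypothetical; the weight w encodes the relative cost of false negatives
# versus false positives, as discussed in the text.

def evaluate_cutpoints(symptom_counts, gold_standard, w=1.0):
    """Return (cutpoint, sensitivity, specificity, index) for each candidate cutpoint,
    ranked by a weighted Youden-style index: w*sensitivity + specificity - 1."""
    results = []
    for cut in sorted(set(symptom_counts)):
        test_pos = [c >= cut for c in symptom_counts]
        tp = sum(t and g for t, g in zip(test_pos, gold_standard))
        fn = sum((not t) and g for t, g in zip(test_pos, gold_standard))
        tn = sum((not t) and (not g) for t, g in zip(test_pos, gold_standard))
        fp = sum(t and (not g) for t, g in zip(test_pos, gold_standard))
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        results.append((cut, sens, spec, w * sens + spec - 1))
    return sorted(results, key=lambda r: r[-1], reverse=True)

# Hypothetical example: symptom counts (0-7) and gold-standard disorder status (1/0).
counts = [1, 5, 6, 2, 7, 3, 4, 6, 0, 5]
gold   = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
best = evaluate_cutpoints(counts, gold, w=2.0)[0]   # false negatives twice as costly
print(f"best cutpoint >= {best[0]}: sensitivity={best[1]:.2f}, specificity={best[2]:.2f}")
```

Setting w above 1 penalizes false negatives more heavily, mirroring the point that the "best" cutpoint depends on the relative clinical costs of the two kinds of error.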
Clearly, if such a process were ever to take place in the development of psychiatric diagnoses, what we "know" about the disorders will also change. If the sensitivity of the diagnosis of the disorder increases (and specificity holds steady), one would expect the prevalence to increase, the onset time to occur earlier, and the duration of illness to increase. If the specificity of the diagnosis of the disorder increases (and sensitivity holds steady), the opposite effects will occur. If both sensitivity and specificity increase, smaller sample sizes will be needed to detect effects of all kinds in research studies, and the effects detected will seem of greater clinical significance. Most of all, it will be easier to replicate true findings and to achieve consistent results across studies.
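The direction of these shifts follows from simple screening arithmetic (a standard relation, not a formula from the chapter): if the true prevalence is $P$, the apparent prevalence produced by a diagnosis with sensitivity $Se$ and specificity $Sp$ is

$$ P_{\text{apparent}} = Se \cdot P + (1 - Sp)(1 - P), $$

which increases as $Se$ rises with $Sp$ held fixed, and decreases as $Sp$ rises with $Se$ held fixed. For example, with $P = 0.10$ and $Sp = 0.95$, raising $Se$ from 0.6 to 0.8 moves the apparent prevalence from $0.06 + 0.045 = 0.105$ to $0.08 + 0.045 = 0.125$.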
Currently, a great deal of emphasis is placed on setting criteria so that the prevalence does not change, or so that the ICD and the DSM systems are more comparable, or on setting criteria to reflect social norms. What if the decision not to use mammography were based on the argument that, with its increased sensitivity to early cancers, the prevalence of breast cancer would increase? Or if we proposed not to use mammography because it is generally not used in Europe (or some other locale), and we want to ensure comparability between the diagnosis of cancer in the United States and in Europe? Or if we proposed not to use mammography because some powerful political pressure group opposed it? Guidelines for the use of routine mammography are still controversial, but the basis of the controversy is the accuracy and value of the test in protecting the interests of patients, not considerations such as those above. The same should be true of psychiatric diagnoses.
Meta-analysis (5, 14, 35), the synthesis of the results of many studies related to a particular research question, is a statistical procedure as old as the field of statistics. The term "meta-analysis," however, was introduced only in the 1970s. Since then, the approach has been simultaneously lauded as the greatest new methodological development and decried and derided as "meta-analysis, meta-garbage." Nevertheless, all designers of clinical research projects, in effect, do meta-analysis. They do it well or badly, superficially or in depth, formally or informally, but they do it. The vast majority of such meta-analyses are never submitted for publication. Many that are published are quite badly done, and therein lies the source of the controversy.
The basic principles of a meta-analysis require that 1) an attempt is made to locate all studies undertaken that address a particular research question; 2) each study contributes a summary statistic, an "effect size", that quantitatively describes the results of that study relative to the research question; 3) a consensus effect size is estimated, from which a conclusion is drawn as to the "state of the field" in regard to that research question.
The problem lies in how each step is accomplished.
1. It is simply not possible to locate and gain access to all studies undertaken on a particular research question. Studies published in English might be relatively easy to locate; studies not submitted or accepted for publication, studies undertaken but not completed, and even studies published in other languages might not be readily accessible to the meta-analyst. The accessible studies may not be representative of all studies undertaken. Given the importance attached to "statistically significant" results, the accessible studies may overestimate the true effect size, particularly accessible studies having low a priori power to detect a clinically significant effect size, or studies that report serendipitous findings (the "file drawer problem" [28]). Thus, meta-analysis may produce seriously biased conclusions if the search is not both very careful and very complete.
2. Some studies that are accessible may suffer serious methodological flaws. If such studies contribute an effect size that is included on the same basis as effect sizes from valid and powerful studies, these flawed studies can seriously bias the results. Thus, the meta-analyst bears the same burden as does a peer-reviewer of a proposal or a reviewer of a manuscript submitted for publication—to ascertain whether adequate scientific standards are met. Fatally flawed studies should be set aside (given zero weighting in the synthesis), and others might be weighted according to a rating reflecting scientific adequacy. Not to do so has been referred to as the "garbage in, garbage out" problem.
Moreover, it is not always easy to delimit which studies are "relevant" to the research question. If, for example, one sought to find the relative effectiveness of psychotherapy versus psychopharmacology in the treatment of a disorder, what would be included under the title of "psychotherapy", what under the title of "psychopharmacology", and what under the definition of the "disorder"? A meta-analyst, overly anxious to cast a broad net, might define any treatment delivered without ingestion or injection as "psychotherapy", anything ingested or injected under medical advice as "psychopharmacology", and propose a definition of the "disorder" so broad as to be essentially meaningless. This has been called the "apples and oranges" problem.
3. Not all studies related to the same research issue use comparable outcome measures. The concept of an effect size is based on finding some interpretable measure that can be calculated to be comparable for a variety of different measures of the same outcome, not for a variety of measures of different outcomes. Once again, a meta-analyst may include measures that actually relate to very different constructs (e.g., combining measures of improvement of clinical depression with measures of cost of treatment).
4. It is as important to detect the heterogeneities among effect sizes from various studies as it is to detect the homogeneities. Thus, the analytic procedure of "pooling" effect sizes should start with testing the homogeneity of effect sizes, finding homogeneous subsets of effect sizes, and pooling only those. There should then be some effort to explain any heterogeneity found. Doing so often mitigates the problems listed above. For example, if one did indeed take too broad a view of what constituted "psychotherapy," that might be disclosed in the finding that certain classes of included therapies produced effect sizes quite different from those of other classes. Alternatively, if one applied generous standards as to the design of efficacy or effectiveness trials and included randomized, blinded, controlled studies along with those that were non-randomized, non-blinded, or uncontrolled, it would not be surprising to find that the effects found in the non-randomized, non-blinded, or uncontrolled studies were more optimistic than those in the randomized, blinded, and controlled studies.
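A minimal sketch of this pool-then-check logic, using standard inverse-variance (fixed-effect) weighting and Cochran's Q statistic on hypothetical study results:

```python
# Minimal sketch of fixed-effect pooling with a homogeneity check (Cochran's Q).
# Effect sizes and variances are hypothetical; in practice each would come from one study.
from math import sqrt
from scipy.stats import chi2

def pool_fixed_effect(effects, variances):
    """Inverse-variance pooled effect, its 95% CI, and Cochran's Q homogeneity test."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = sqrt(1.0 / sum(weights))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    p_homogeneity = chi2.sf(q, df=len(effects) - 1)   # small p -> effects heterogeneous
    return pooled, ci, q, p_homogeneity

effects   = [0.30, 0.45, 0.10, 0.38, 0.95]   # e.g., standardized mean differences
variances = [0.02, 0.03, 0.05, 0.02, 0.04]
pooled, ci, q, p = pool_fixed_effect(effects, variances)
print(f"pooled d = {pooled:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), Q = {q:.1f}, p = {p:.3f}")
```

A small homogeneity p-value is the cue, as argued above, to look for homogeneous subsets or study-level explanations rather than to report a single pooled estimate.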
Yet, done rigorously, thoughtfully, and carefully, meta-analysis answers many vital questions. First of all, an overview of past studies may reveal which methodological approaches are successful and which unsuccessful for a particular research question. Then, since a pooled effect size is much more precisely estimated than is the effect size from any single study, its confidence interval would strongly suggest whether the research question has already been resolved, and, if so, whether the effect size is of a magnitude to be of clinical or policy, as well as statistical, significance. If the confidence interval suggests that the result is still in doubt, the meta-analysis may often provide information on how best to design future studies that might, in combination with those of the past, resolve the issue. This is exactly the kind of information that is needed to conceptualize and design cost-effective and powerful future studies. Such information should constitute the background and rationale for each proposed new study, as well as the basis for the study design.
One lesson from past studies is that it is much cheaper and easier to do cross-sectional than prospective, longitudinal clinical research studies, but that such studies frequently provide misleading answers to research questions related to process or development. By far, most psychiatric clinical research studies are, if not strictly cross-sectional (observation of a subject at only one point of time), then, at most, very short term. Even in longitudinal studies, data may be analyzed in a cross-sectional manner. Such studies then have, at best, the same value as serial, cross-sectional studies, which would have been cheaper, less time-consuming, and less subject to sampling bias (due to dropouts over the follow-up time in longitudinal studies). Many of the studies that attempt to present longitudinal perspectives do so with cross-sectional designs using retrospective recall data or data from records to recreate the history of a subject.
Research and clinical evidence indicates that there are major individual differences among those who at some time during their lifetimes suffer from a psychiatric disorder. Some have quite early onset, others quite late. Some may have a single episode, whereas others may have a succession of episodes. Some, but not all, patients may seek and receive treatment and do so during some episodes and not others. What treatment is received may differ from one episode to another for a subject, and from one subject to another. Whether the effect of the treatment is evanescent or influences all that is to follow for that patient may vary. Duration of episodes and/or remissions may differ both within and between subjects. Some subjects may completely recover, whereas others may not.
Thus, for each subject who ever experiences a psychiatric disorder at any time, the disorder is a process taking place over the lifetime of that individual and can only be fully understood by following that patient over time and observing the time course or trajectory (see Pharmacological Treatment of Obesity). Differences in age of onset, number and duration of episodes, duration of remissions, and occurrence of recovery may be manifestations of fundamentally different disorders (each with a different etiology, course, and/or response to treatment) which we may be inadvertently lumping under one diagnostic title. Conversely, certain disorders to which we now assign different diagnostic titles may in fact be manifestations of essentially the same disorder. Without access to a lifetime perspective on the disorder, which could only be gained in longitudinal studies, it is difficult to see how such problems can be resolved.
Attempts to study certain aspects of lifetime course using retrospective recall in cross-sectional studies generate all the usual problems, both with retrospective sampling (e.g., is the availability of the subjects enrolled in the study itself affected by the disorder?) and with retrospective data and recall. What one recalls may be affected by events that occur between the event and the report. For example, a brief bout of depression during one's teen years, followed by many disorder-free years, may be more likely to be forgotten by the age of 50 than one that was followed by unpleasant related events (e.g., a suicide attempt) or by many later episodes of depression or depression-related outcomes. Analysis of retrospective recall data might then suggest that early-onset depression has a higher association with unpleasant subsequent related events and many recurrences than does later-onset depression. But that may be purely a statistical artifact.
Use of clinical records to recreate history generates other problems. Experience, particularly in multisite clinical research studies, demonstrates the difficulty of training diagnosticians to adhere to acceptable standards of consistency and reliability. That experience makes it less credible that psychiatric diagnoses recorded by psychiatrists and psychologists (and others) who are not trained to uniform standards and who operate over many sites and many years could be expected to be consistent, reliable, and valid enough to serve as a basis for clinical research studies done 5, 10, or 20 years later. Moreover, with current health care funding problems, the diagnoses listed in clinical records may relate more closely to what is covered by health insurance than to what the patient actually has.
Yet it is unrealistic to propose prospective lifetime follow-up of nationally representative samples for a variety of reasons. Not the least of these is the prohibitive cost and the long delay in achieving even partial answers to crucial clinical questions. However, it is quite feasible to follow each subject in a study for 3–5 years or so (much longer than current follow-up time), and then (depending on the nature of the research question) to piece together information from different cohorts of subjects to gain an understanding of long-term patterns revealed over time. Such designs have come to be called "accelerated lifetime studies" or "cohort-sequential designs" (2) or "overlapping cohort designs" (26). It is also quite possible to follow the subject not only to the end of treatment, as is typically done, but for a period of 1–3 years, to check for delayed effects of treatment or maintenance or relapse effects.
Within recent years, excellent and exciting statistical analytic methods have become readily available to deal with such follow-up studies. Two general approaches deserve mention.
Survival methods (13, 16, 30, 31) deal with research questions concerning time to an event: age of onset, duration of episode or remission, latency to treatment response, and so on. These methods require a well-defined zero point (birth for age of onset; beginning of episode or remission for duration; beginning of treatment for latency to response) and a well-defined event (onset, remission, relapse, response). These methods deal well with irregular follow-up frequency or duration and with censored follow-ups (e.g., patients lost to follow-up or death from a competing cause), which are some of the most common problems in longitudinal research designs. Such methods provide a complete description of survival (Kaplan–Meier curves) under minimally restrictive assumptions, as well as comparisons of survival curves and identification of factors that predict survival (e.g., the Cox proportional hazards model).
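As a minimal illustration of the first of these (hypothetical data; a hand-rolled Kaplan–Meier estimate rather than production survival software):

```python
# Minimal sketch of a Kaplan-Meier survival estimate with right-censoring.
# times: follow-up time for each subject; events: 1 if the event (e.g., relapse)
# was observed, 0 if the subject was censored (lost to follow-up, study ended, etc.).
# Data are hypothetical.

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        removed = sum(1 for tt, e in data if tt == t)   # events plus censorings at t
        if deaths > 0:
            survival *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
        i += removed
    return curve

# Hypothetical: months to relapse; 0 = censored.
times  = [3, 5, 5, 8, 12, 12, 15, 20, 20, 24]
events = [1, 1, 0, 1,  1,  0,  1,  0,  1,  0]
for t, s in kaplan_meier(times, events):
    print(f"t = {t:>2} months: S(t) = {s:.2f}")
```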
Random regression models (growth models, hierarchical linear models, etc.) deal with repeated measures of subjects over time (9, 12) [see also Pharmacological Treatment of Obesity]. These methods require a well-defined outcome measure that can be validly, reliably, and repeatedly obtained from each subject over the follow-up time. The basic principle of all such methods is that the first step in the analysis examines the trajectory of each individual subject and characterizes that trajectory in one or more clinically meaningful ways (e.g., rate of change over time, peak response over time). The second step compares subjects on those characterizations (e.g., are subjects in the treatment group likely to improve more rapidly than subjects in the control group?) or assesses the predictors of those characterizations (e.g., is the initial severity of the disorder predictive of the rate of response to treatment?). These methods may be quite simple or, depending on which mathematical assumptions can reasonably be made in a particular context, quite mathematically complex. However, when such methodologies are suitably employed, they cope well with the problems typical of longitudinal research (irregular follow-up, dropouts, unreliability of measurements, etc.) and offer great power and sensitivity to detect the type of effects most important to psychiatric advances.
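A sketch of the two-step intuition on simulated data follows; real random-regression or hierarchical-linear-model software estimates both steps jointly and handles measurement error more gracefully, so this is only an illustration of the logic, with all names and numbers hypothetical.

```python
# Minimal sketch of the two-step idea: (1) summarize each subject's trajectory by an
# individual slope (rate of change), (2) compare the slopes between groups.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def subject_slope(weeks, scores):
    """Ordinary least-squares slope of symptom score on time for one subject."""
    slope, _intercept = np.polyfit(weeks, scores, deg=1)
    return slope

def simulate_subject(weekly_change, n_visits=6):
    """Hypothetical subject: irregular visit times, noisy symptom scores."""
    weeks = np.sort(rng.choice(np.arange(0, 13), size=n_visits, replace=False))
    scores = 30 + weekly_change * weeks + rng.normal(0, 3, size=n_visits)
    return subject_slope(weeks, scores)

# Step 1: one slope per subject (treated improve ~1.5 points/week, controls ~0.5).
treated_slopes = [simulate_subject(-1.5) for _ in range(20)]
control_slopes = [simulate_subject(-0.5) for _ in range(20)]

# Step 2: compare the subject-level slopes between groups.
t_stat, p_value = ttest_ind(treated_slopes, control_slopes)
print(f"mean slope (treated) = {np.mean(treated_slopes):.2f}/week, "
      f"(control) = {np.mean(control_slopes):.2f}/week, p = {p_value:.4f}")
```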
Perhaps the most crucial issue to be dealt with in longitudinal studies is that of onset of disorders—time of onset and factors predicting onset. This is urgent for three reasons: i) Understanding the etiology (causes) of psychiatric disorders may fundamentally require observations before, at, or immediately after the onset of the disorder; ii) efforts to prevent mental disorders absolutely require knowing the time and risk factors for onset; and iii) it may be, as it is in cancer, heart disease, and other disorders, that responsivity to treatment is closely related to how early in the course of the disorder the treatment is initiated. The lack of effectiveness of many psychiatric treatments may be related simply to the fact that they are initiated too late in the disease process.
Many problems of inconsistency or non-replicability relate to inconsistent and ambiguous use of the terms of science (11). What is reported often goes far beyond what was empirically demonstrated in a study. An important particular case in point is that of the terminology surrounding risk assessment, in particular the use of the terms "cause" or "causal."
It has been proposed (22) that a factor that is shown to correlate with an outcome (e.g., onset of disease) should be called simply a "correlate." If a correlate is shown to precede the outcome, then it can be called a "risk factor." If a risk factor cannot be shown to change within the individuals of a population, either spontaneously or in response to intervention, it can be called a "fixed marker." If the risk factor can be shown to change, it can be called a "variable risk factor." If one can show that manipulation of the variable risk factor produces change in the risk of the outcome, then it can be called a "causal risk factor." A variable risk factor that cannot be manipulated or, when manipulated, cannot be shown to change the risk of the outcome is called a "variable marker."
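This chain of distinctions can be written as a simple decision rule; the function below merely restates the terminology of ref. 22, with argument names chosen here for illustration.

```python
def classify_factor(correlate, precedes, changeable, manipulation_changes_risk=None):
    """Return the strongest label the evidence supports (terminology of ref. 22).
    manipulation_changes_risk: True if manipulation was shown to change risk,
    False if the factor cannot be manipulated or manipulation did not change risk,
    None if this has not yet been established."""
    if not correlate:
        return "not a correlate"
    if not precedes:
        return "correlate"
    if not changeable:
        return "fixed marker"
    if manipulation_changes_risk is True:
        return "causal risk factor"
    if manipulation_changes_risk is False:
        return "variable marker"
    return "variable risk factor"   # changes, but manipulation not yet demonstrated

# Examples paraphrasing the text:
print(classify_factor(True, True, False))              # gender -> fixed marker
print(classify_factor(True, True, True, False))        # age -> variable marker
print(classify_factor(True, True, True, True))         # dirty-needle use -> causal risk factor
```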
In this terminology, many "causal risk factors" may not actually directly cause the disorder, but they do play some role in the causal chain that leads to the disorder. For example, unsafe sex practices or use of dirty needles are causal risk factors for AIDS, but they are not the cause. The primary goal of risk factor research is at least to identify causal risk factors, for only intervention on these factors (not on fixed or variable markers) can effect change in risk.
As currently used in the research literature, factors merely shown to be correlates are often referred to as risk factors or even as causes. Many of these may be concomitants or results of a disorder, and many may simply be pseudocorrelations. Fixed markers cannot be shown to be causal risk factors (and therefore causes), but they are invaluable for the identification of "at risk" populations, and their correlates may be causal risk factors. For example, race and gender are fixed markers for many disorders, but it may be the poverty and disadvantage associated with race, or the life events associated with gender, that are the causal risk factors. Variable markers have been demonstrated not to be causal risk factors, but they, too, often provide hints as to possible causal risk factors. Age, for example, is a variable marker for many disorders, but what aspect of aging is actually the causal risk factor is the issue of interest and importance. All risk factors are important, in both the clinical and the research context. However, a clear path to understanding the causes of disorders, and thus their prevention and treatment, requires that we apply clear and unambiguous terminology that reflects only what has been empirically demonstrated. Anything else can be misleading.
FITTING MATHEMATICAL MODELS TO THE "REAL WORLD" VERSUS FITTING THE "REAL WORLD" TO MATHEMATICAL MODELS
A Problem in Statistical Analysis
Long ago, Feinstein (10) eloquently summarized an issue now most crucial to the future of psychiatric clinical research: "... mathematics has only the secondary role of providing lines and colors for the map; the main goal is the identification of different clinical terrains. Mathematics has no value in helping us understand nature unless we begin by understanding nature. To start with mathematical formulations and to alter nature so that it fits the assumptions is a procrustean 'non sequitur' unfortunately all too prevalent in 'contemporary science.' What emerges is tenable and sometimes even elegant as mathematics, but is too distorted by its initial assumptions to be a valid representation of what goes on in nature."
Without exception, every statistical inference procedure (i.e., any procedure by which we try to understand what goes on in a population of subjects by studying a sample from that population) is based on a certain number of mathematical assumptions, a mathematical model. These assumptions are of four types:
1. Those guaranteed by the design. The assumption that subjects were randomly assigned to treatment and control groups is guaranteed by implementing a random process to do so. The assumption that errors of measurement are independent is guaranteed by blinding all assessors to data coming from other sources.
2. Those empirically shown to correspond reasonably well to reality in the context. Thus the assumption of equal variance in two groups of subjects can be checked by comparing the observed variances in the two groups. The assumption that two variables are linearly related can be checked by examining the scatter of those variables.
3. Those assumptions for which it can be mathematically demonstrated that deviations do not seriously compromise the inference (robustness). For example, if the sample sizes are equal, the two-sample t test is remarkably robust to deviations from the equal-variance assumption. There are many studies showing that the Pearson product moment correlation coefficient is quite robust to deviations from the assumption that both variables being correlated have normal distributions (but not robust to deviations from the assumptions of linearity and equal variance).
4. Those assumptions that are fantasies; that is, they do not correspond to reality, and there is no robustness to protect the inferences. Fantastic assumptions are dangerous. In a purely mathematical or logical exercise, one can make any assumptions one wishes, because what follows is understood to be true only conditional on the assumptions (i.e., the conclusions are just as fantastic as the assumptions). Thus in a mathematical or logical exercise, if one wished to assume that the sun rose in the west, or that objects fall up instead of down, there is no impediment to doing so. It is clearly understood that what follows has no validity in the real world because the assumptions are clearly untrue in the real world. It is only a mathematical or logical exercise, no more.
However, when one is dealing with issues related to the health and well-being of patients, as one does in clinical practice or research, one cannot be casual about what assumptions are made. It is the responsibility of the ethical medical researcher to check those assumptions, lest fantastic ones do harm, and it is the responsibility of the ethical biostatistician to bring those assumptions to the attention of the medical researchers to make sure they are checked.
The consequences of entertaining fantasies in psychiatric clinical research have been given a resounding empirical demonstration in recent years. So far, every reported finding of a genetic basis for a psychiatric disorder has been refuted upon attempts to replicate or confirm it, or, worse yet, the results have had to be retracted by the researchers who reported the finding. This situation is exactly what sound biostatistics in medical clinical research is meant to prevent.
It should be noted that the mathematical models used in these studies were fundamentally sound, having proven their value in other fields of medicine in finding linkages for breast cancer, for Huntington's disease, and so on. It is their application to psychiatric disorders that foments the problem. What, then, is so different about their application to psychiatric disorders?
First of all, genetic analyses are based on the assumption that those labeled "affected" have the disorder and those labeled "nonaffected" do not—that is, that diagnoses are quite valid and reliable. In psychiatric genetics studies, whatever the disorder, there are multiple diagnostic procedures that can be used to create these labels, and these often conflict with one another. Indeed, many research proposals suggest trying several different ways of labeling those with and without the disorder in the same study. That suggestion alone calls into question the validity of the labeling of "affected" versus "nonaffected."
For any one of these diagnostic procedures, the evidence is clear that there is also substantial unreliability of diagnosis; that is, if someone labeled "affected" were independently evaluated by several other expert diagnosticians using the same diagnostic system, there would frequently be disagreement. When the label of "affected" is applied based on either self-report, family report, or abstracted clinical records from the distant past, one encounters even more serious questions about validity and reliability. The fact that multiple readers of the same, possibly flawed, clinical report or record would draw the same conclusion, often reported as the "reliability" of the diagnosis, is not assurance of either the precision or accuracy of what appears in the record or the report as to the clinical status of the patient.
The usual effect of unreliability (taking validity for granted for the moment) is to attenuate the power of statistical tests and effect sizes. Consequently, if unreliability were the only problem, and the only purpose of a genetics analysis was to decide whether there is evidence sufficient to support a claim of some genetic basis, the assumption of reliability would tend to fall under the third category of assumptions above, because the reports of positive results (i.e., whether the p-value would indicate statistical significance or not) would be robust to deviations from this assumption.
However, such robustness provides little real protection here, because that protection applies only if a single analysis were performed, using only one diagnostic rule and examining only one locus. In that case the current methods would have less than a 5% chance of a false-positive finding. However, when multiple different diagnostic rules are used, multiple loci are tested, and, frequently, multiple assessments of each locus are made, the probability of a false-positive finding increases with the number of tests performed. Given enough families, enough different diagnostic rules, enough different loci, and enough different analytic approaches, it is almost certain that a false-positive finding will result. This has been called the "Gambler's Ruin" problem in classical statistics, for good reason.
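The arithmetic behind this claim is standard: with $m$ independent tests, each carried out at level $\alpha = 0.05$, the probability of at least one false-positive finding is

$$ 1 - (1 - 0.05)^m, $$

which is roughly 0.40 for $m = 10$, 0.64 for $m = 20$, and 0.92 for $m = 50$. Correlated tests inflate the error rate less, but typically still substantially.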
Finally, the statistical methods underlying genetic analyses (and many of the other most common statistical tests) are based on likelihood ratio theory. The fundamental step in such methods is multiplying together the probabilities of independent outcomes (e.g., see ref. 25). In practice, this means that the diagnosis of each member of a family must be made "blind" to all other diagnoses and to genetic typing, and that genetic typing must be made "blind" to diagnoses, so that the errors of classification are independent.
One cannot label one subject with an ambiguous clinical picture "affected" because the diagnostician knows that the subject's mother and both sisters have already been labeled "affected," while labeling another subject with a similar clinical picture "unaffected" because there are no other known "affected" members in that family. Obtaining diagnoses from recall and clinical records becomes even more troublesome than usual, because one can never guarantee the independence necessary. How can one be sure that Dr. X, 20 years ago, did not decide to label this subject "depressed" partially because he already knew that depression "ran in the family"?
Without independence of errors, guaranteed by blinded diagnoses, the likelihood functions are invalid, and all the estimation and testing procedures based on them are also invalid. Thus if "blindness" is not guaranteed in the design of a genetic study, any existing methods requiring blindness are not robust to deviations from this assumption.
In short, because of multiple testing, poor quality diagnosis, and poorly designed studies, the occurrence of false-positive reports of psychiatric genetic linkages is no surprise. False-positive and false-negative results can be anticipated in any research context in which mathematical assumptions so crucial to the validity of results have been so casually dealt with. It is, of course, the false-positive results that are published.
This problem is probably not restricted to psychiatric genetic studies. Complex mathematical models, such as structural equation models, LISREL models, path analysis models, latent class models, and so on, are based on many mathematical assumptions that either must be guaranteed in the design of the research or must be checked against empirical evidence, or the inferences must be mathematically demonstrated to be robust against deviations from those assumptions (15). However, there are so many such assumptions that generally few are guaranteed and few are checked. It may be that the only reason this problem has become so salient in psychiatric genetics, and not much earlier in other such complex mathematical modeling, is simply that it is customary in genetic research, but not in other research areas, to seek immediate replication of results.
The easiest and simplest solution to these problems is to regard studies that use such complex analytic models as exploratory or hypothesis-generating studies, intended to generate very specific hypotheses (one diagnosis, one mode of inheritance, one locus, one analytic approach). Only when one or two specific hypotheses appear clear to the researchers from the exploratory study would it be appropriate to test those few hypotheses in a carefully designed, independent, focused study before publication of the "finding." In the psychiatric genetic framework, for example, this would have meant that the initial studies reporting linkage would have been considered pilot studies, and publication of conclusions would have been delayed until independent confirmation was obtained. However, such a strategy would require a new willingness to invest major funding resources in pilot studies, and perhaps new design strategies would have to be developed that provide for independent replication concomitant with the pilot study proposal.
A Problem in Design
Another aspect of problems with the relationship between what clinical research does and the "real world" has to do with efficacy, effectiveness, and efficiency of treatments (e.g., ref. 8). These terms carry important implications for clinical research, but they are not precisely and consistently defined in clinical research publications and often seem to be used interchangeably. Therefore, let us first propose some definitions.
To demonstrate efficacy, a study must show that in ideally selected subjects under ideal conditions the treatment "works" in one way or another. To demonstrate effectiveness, a study must show that in typical subjects with the disorder under conditions that can apply in the real world, the treatment "works" in some clinically meaningful sense. To demonstrate efficiency, a study must show that the effectiveness is achieved at reasonable cost to the individual and/or to society.
For a variety of reasons, most clinical research studies of treatment have focused on efficacy. Exclusion criteria used in sampling are frequently so strict that the sample may actually represent a small and unrepresentative minority of the clinical population with the disorder. The conditions of delivering treatment are often those that would be impossible to enact in "real-world" application. Such restrictions (a) maximize the probability that the treatment will "work" in the study and (b) minimize the probability that it will work in the real world. In an efficacy study, the sample sizes needed to detect effects will be smaller, both because the size of effects is maximized and the heterogeneity is minimized. Consequently the time and effort needed to do such studies are more limited.
With no requirement that the outcome measures (which define what it means for the treatment to "work") have clear clinical interpretation (which would be needed to demonstrate effectiveness) and no requirement for cost/risk measures (needed to demonstrate efficiency), treatments found to "work" in efficacy studies run a high risk that, when applied in the "real world," they will not live up to their billing. Thus the result of a "successful" efficacy study may be a recommendation for an ineffective and inefficient treatment. In the absence of an emphasis on checking such results for effectiveness and efficiency, there is no protection against noneffective treatments going into widespread use, and there is little protection against unnecessary costs and risks to the patients.
This problem is probably no more common in evaluations of psychiatric treatments than in those of other medical treatments. However, in this era of health care reform, the impact on psychiatry may be greater than on other medical areas. The somewhat negative perception of the cost–benefit balance of psychiatric treatments in "real-world" applications may have a major impact in the years to come on the coverage and availability of psychiatric treatments and on funding for psychiatric research.
This is not to argue against studies of efficacy. It is extraordinarily difficult to design cost-effective research studies of effectiveness or efficiency of treatments without preliminary studies of efficacy. Efficacy studies are needed to (a) establish the feasibility of effectiveness and efficiency studies, (b) estimate clinical effect sizes needed for power calculations, and (c) field-test the methodologies (e.g., measurements, designs). For this reason, efficacy studies must continue to be supported. However, they should be short-term and low-cost studies, done in preparation for effectiveness and efficiency studies, not as an end in themselves. Effectiveness and efficiency studies should a) deal with samples representative of the clinical population with the disorder, b) deliver treatments under the conditions that can potentially be duplicated in ordinary clinical settings, c) evaluate the effects of the treatments using measures that have clinical meaning, and d) strive to evaluate the costs and risks to the patients. The consequent results are more likely to be generalizable to the "real world." The risk, of course, is that many treatments now favored may prove to be ineffective or inefficient, and studies documenting this will be used to decrease patients' access to such treatments. Perhaps it is this fear that makes studies of effectiveness or efficiency so unattractive to psychiatric researchers.
A major unresolved problem in the future of psychiatric clinical research is that of the value placed on exploratory research and the support afforded for such studies. Exploratory studies are often labeled with derogatory terms such as "fishing expeditions," "data dredging," and so on, and are not generally favorably viewed for funding, nor are their results favorably viewed for publication. Consequently, there is little incentive to propose or to do such studies.
Yet, as has been repeatedly noted above, there is little chance of designing cost-effective, replicable, confirmatory research projects without the data and experience that can only be gained from exploratory studies of various kinds. The discouragement of well-designed and well-executed exploratory studies may have contributed to the tendency to make premature jumps to confirmatory analyses (e.g., jumping to linkage studies without segregation studies) or to make premature generalizations from limited confirmatory analyses (e.g., recommending for or against treatments on the basis of efficacy studies in absence of effectiveness or efficiency studies). Well-done exploratory studies submitted for review are often inappropriately reported as confirmatory studies, often at the suggestion of reviewers and editors, by presenting hypotheses as if they were a priori when they are in fact post hoc or serendipitous, or by presenting invalid p values.
Clearly there are exploratory studies that merit derogatory labels, studies whose sampling, design, measurement, and analytic approaches are so poorly formulated or executed that one should not accept any of their results, even as preliminary indications or hypotheses generated for later testing. But there are poorly done confirmatory studies as well. The time has come, the need is urgent, to draw the distinction only between good and bad studies. A good study may be either exploratory or confirmatory, but a bad study may be either exploratory or confirmatory as well. Both approaches, exploratory and confirmatory, have their important place in clinical research, and neither approach can be excellently done without the other.
To resolve the methodological/statistical problems that have been identified or have arisen in the last generation of psychiatric research, there seem to be three urgent needs:
1. Better training of psychiatrists in statistical principles and better training of statisticians in psychiatric principles.
2. Changes in funding and publication policy to foster a greater tolerance for well-done exploratory research and a greater intolerance for badly done research, whether exploratory or confirmatory. This would require a clear understanding of the differences between exploratory and confirmatory research approaches and the place and value of both.
3. Better communication between psychiatric researchers and statisticians working in psychiatric research areas. Clear and precise terminology is required for such communication. Proposals of mathematical models based on assumptions that misrepresent psychiatric situations should not be tolerated by either the psychiatrists or the statisticians involved. Because one would not expect the usual statistician to be expert in psychiatry, nor would one expect the usual psychiatrist to be expert in statistics, this means a very close interactive collaboration, so that each expertise is well-represented.
George Santayana said: "Progress, far from consisting in change, depends on retentiveness . . . . When experience is not retained, as among savages, infancy is perpetual. Those who cannot remember the past are condemned to repeat it." It would be most interesting to see what the next generation of progress in psychiatric clinical research might be if we remembered and learned from our methodological errors of the past.
published 2000