Assessment of Perceived Functional Capacity: Using Rasch Analysis to Evaluate the Measurement Properties of Four Perceived Pain & Disability Scales

Lois Lochhead
B.S.R., University of British Columbia, 1984

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Community Health Sciences

The University of Northern British Columbia
August 2009

© Lois Lochhead, 2009

Library and Archives Canada, Published Heritage Branch, 395 Wellington Street, Ottawa ON K1A 0N4, Canada
ISBN: 978-0-494-60828-9

NOTICE: The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author’s permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis.
While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

Abstract

Functional Capacity Evaluations (FCEs) include comparisons of self-report and performance-based measures. A difference between the two scores can be interpreted as symptom magnification, which can affect eligibility for benefits. FCEs typically include scales such as the Oswestry Disability Index (ODI), the Dallas Pain Questionnaire (DPQ), the Spinal Function Sort (SFS) and the Neck Disability Index (NDI). Rasch modeling was used to evaluate their original classification categories. Examination included fit of data to model expectations, threshold ordering of items, differential item functioning and item difficulty. None of the scales demonstrated unidimensionality. For the ODI and DPQ, rescaling and/or eliminating items improved the scales. The SFS is not a unidimensional scale and demonstrates differential item functioning. The NDI demonstrates unidimensionality when two of the items are eliminated, but disordered thresholds could not be fixed. Health professionals using these measures should be aware that these scales do not perform as well as expected.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
Index of Tables
Table of Figures
Glossary
Acknowledgement
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: LITERATURE REVIEW
  Instruments
    Oswestry Disability Index (ODI)
    Dallas Pain Questionnaire (DPQ)
    Spinal Function Sort (SFS)
    Neck Disability Index (NDI)
  Measurement Theory
    Item Response Theory (IRT)
    Rasch Analysis
      Dichotomous Model
      Rating Scale Model
      Partial Credit Model
  Research Questions
  Significance of Proposed Study
CHAPTER THREE: DESIGN AND METHODOLOGY
  Subjects
  Instrumentation
  Procedures
  Data Analysis
CHAPTER FOUR: RESULTS
  The Oswestry Disability Index
    Diagnostic Measures
    Item Difficulty
    Differential Item Functioning (DIF)
  Dallas Pain Questionnaire (DPQ)
    Diagnostic Measures for the DPQ
    Item Difficulty
    Differential Item Functioning (DIF)
  PACT Spinal Function Sort
    Diagnostic Measures for the SFS
    Item Difficulty
    Differential Item Function (DIF) for the SFS
  Neck Disability Index
    Diagnostic Measures for the NDI
    Item Difficulty
    Differential Item Functioning
CHAPTER FIVE: DISCUSSION
  The Oswestry Disability Index (ODI)
  Dallas Pain Questionnaire (DPQ)
  Spinal Function Sort (SFS)
  Neck Disability Index (NDI)
  Conclusion
  Limitations of Design
  Recommendations for Practitioners
  Recommendations for Future Research
    Oswestry Disability Index
    Dallas Pain Questionnaire
    Spinal Function Sort
    Neck Disability Index
Appendix 1 - Oswestry Disability Index - Version 1.0
Appendix 2 - Dallas Pain Questionnaire
Appendix 3 - Spinal Function Sort Instructions, Sample & Score Sheet
Appendix 5 - Letter of Consent
Appendix 6 - Consent to Evaluate
Appendix 7 - DOT Physical Demand Characteristics of Work
Appendix 8 - Oswestry Disability Index - Rescaled

Index of Tables

Table 1 Physical Occupational Demands
Table 2 Diagnostic Measures for the Oswestry Disability Index
Table 3 Oswestry Categories Collapsed
Table 4 Diagnostic Measures for Rescaled Oswestry Disability Index
Table 5 Diagnostic Measures for the Dallas Pain Questionnaire
Table 6 Spinal Function Sort Misfitting Items
Table 7 Diagnostic Measures for NDI

Table of Figures

Figure 1 Age distribution of study sample
Figure 2 Standing item #6 Oswestry Disability Index
Figure 3 Pain item #1 Oswestry Disability Index
Figure 4 Rescaled pain item #1
Figure 5 Item characteristic curves depicting item difficulty on the rescaled ODI
Figure 6 DIF by Gender for the ODI
Figure 7 DPQ Item 3 Lifting Category Probability Curve
Figure 8 Category probability curve of DPQ item 10 “Vocational”
Figure 9 Category probability curve for the 4-point DPQ rating scale
Figure 10 Item characteristic curves depicting item endorsement for the 4-point DPQ
Figure 11 Item Person Map for Spinal Function Sort
Figure 12 Spinal Function Sort Item DIF size by gender
Figure 13 Category Probability Curves NDI “Lifting” item #3
Figure 14 Category probability curves for the NDI “Headache” item #5
Figure 15 Item Difficulty for the NDI
Figure 16 DIF by Gender for the NDI

Glossary

Ability Estimate: The location of a person on a variable, inferred by using the collected observations.

Calibration: The process of estimating item difficulty/person ability by converting raw scores to logits on a measurement scale.

DIF: Differential Item Functioning is the loss of invariance of item estimates across testing situations, such as when an item functions differently with men and women. DIF is evidence of item bias.

Infit Mean Square: Indicates the degree of fit of an item or person to the Rasch Model and is a transformation of the residuals, the difference between predicted and observed responses. The expected value is 1, with values from .6 to 1.4 deemed acceptable for rating scale survey items. The infit statistic is more sensitive to inlier patterns, i.e. unexpected response patterns by persons on items that are targeted on them.

Item Separation Index: An estimate of the spread of items on the measured variable expressed in standard error units, i.e.
the adjusted item standard deviation divided by the average measurement error.

Latent Trait: Attribute of an individual that can be inferred from observation of behavior.

Logit: The unit of measurement resulting from the transformation of raw scores from ordinal data to log-odds ratios on a common interval scale. The log-odds of an event is the logit of the probability of the event.

Outfit Mean Square: Unstandardized estimate of degree of fit that is more sensitive to outliers, i.e. unexpected responses by persons on items that are distant from the subject’s ability. Values of .6 to 1.4 are acceptable for rating scale items.

Partial Credit Model: Masters’ Rasch Model for polytomous data, which allows the item categories and/or threshold values to vary from item to item.

Person Separation Index: Estimate of the spread of persons on the measured variable expressed in standard error units.

Rating Scale Model: Andrich’s Rasch Model for polytomous data generated from Likert scales. It applies one set of threshold values to all items on the test.

Threshold: The point of equal probability of adjacent categories, where the likelihood of not endorsing the item turns to the likelihood of endorsing the item.

ZSTD: Tests the significance of a particular mean square value. Values from -2.0 to +2.0 are acceptable.

Acknowledgement

It is not often that we are given a second chance to fulfill a dream. I am truly grateful for this opportunity and it is a pleasure to thank those who made this possible.

Dr. Peter MacMillan, with his undying enthusiasm for all things Rasch, has been the ideal supervisor. His sage advice, insightful criticisms and patient encouragement aided the writing of this thesis from the formative stages to the final draft. I could not have done it without him.

My committee members, Dr. Henry Harder and Dr. Ken Prkachin, kindly made time in their busy schedules to review the thesis and to prepare for the defense on short notice. Dr.
Saif Zahir generously offered his services as an external examiner so the defense could go forward. I am indebted to all of you.

My husband, Chuck Attwater, patiently allowed me to study at the expense of household chores, social and family obligations as well as companionship; all of which only love could endure. In addition, he proof-read the final text, eliminated numerous errors and made valuable suggestions to improve the clarity of the work.

My sons, Anthony and Patrick Daniele, have believed and trusted in me and have given their love, understanding and companionship throughout the years as we navigated some difficult waters. To them, I dedicate this thesis.

“Lately it occurs to me... What a long strange trip it’s been.” (Grateful Dead, 1970)

CHAPTER ONE: INTRODUCTION

Determination of an individual’s readiness to return to work following injury or illness often involves having the individual participate in a Functional Capacity Evaluation (FCE) or Work Capacity Evaluation (WCE). These two terms are used interchangeably within the rehabilitation literature. Isernhagen (1995) defines Functional Capacity Evaluation as follows:

    FCE is a standardized battery of clinical tests that purport to measure a patient’s safe physical ability for work-related activity. Physical capacity as found in the FCE testing is compared to required physical job demands of the patient’s occupation. Critical job demands are assessed by a job analysis involving collecting relevant information by either direct observation, an interview with employer or employee, or existing job descriptions. (p. 410)

FCEs are typically performed by Physical Therapists or Occupational Therapists who have specialized training and certification in the administration of the test batteries that make up the evaluation. One such certification is the Certified Work Capacity Evaluator (CWCE) designation available through Roy Matheson and Associates.
The training consists of a five-day education program in which therapists are taught the protocols for administration and scoring of the individual tests. Evaluation of perception of ability, level of pain, physical effort, musculoskeletal evaluation, mobility, positional tolerances, dexterity, cardiovascular fitness and material handling are all included in the five days of training. There is little or no discussion about the psychometric properties of the tests; the focus is on proper administration. The participants in the training are assured that each test has undergone extensive reliability and validity testing.

The referring agencies (Worker’s Compensation Board (WCB), Insurance Corporation of British Columbia (ICBC), Long Term Disability carriers, lawyers, etc.) usually provide written questions to the therapist who will be performing the FCE. Matheson (2006) gives the following examples of standardized referral questions:

    A. Did the client demonstrate full physical effort during the evaluation?
    B. Are the client’s subjective reports reliable?
    C. Is the client able to return to work at this time?
    D. If unable to return to his/her usual and customary job: What physical deficits hinder the worker’s ability to return to work at this time? What modifications are needed to return to modified work? Which rehabilitation options exist at this time?
    E. What are the client’s current functional abilities?
    F. What is the client’s loss of function as compared to pre-injury ability?
    G. Would the client benefit from additional rehabilitation services at this time? (p. 6)

To answer question B above, one or more pen and paper tests such as the Oswestry Disability Index (ODI), the Dallas Pain Questionnaire (DPQ), the Performance Assessment and Capacity Testing (PACT) Spinal Function Sort (SFS) and the Neck Disability Index (NDI) are completed by the client. The scores on these questionnaires are compared with the client’s actual performance during the evaluation.
All of these measures use a Likert-type or rating scale to score each item. The Dallas Pain Questionnaire uses a Visual Analogue Scale of varying lengths for each item, whereas the other three instruments (ODI, SFS, and NDI) use scales of a consistent length for each item. For instance, in the PACT Spinal Function Sort (SFS) the items are scored on a “1-5” Likert-type scale. The range is from “1” (Able) to “5” (Unable), with “2”, “3” and “4” being used to depict abilities in the range between able and unable. The rating of “2” relates to a mild degree of restriction of ability, “3” relates to moderate restriction of ability and “4” relates to significant restriction of ability on the task. Scoring is done by counting the responses in each column and multiplying the number of “1” responses by four, “2” responses by three, “3” responses by two and “4” responses by one. No points are awarded for “5” or for items where the individual selected “?”. This presumes that selection of “4” (significant restriction) indicates twice as much restriction as a selection of “2” (mild restriction). It also means that an answer of “don’t know” is equivalent to an answer of “unable”.

The ODI and NDI offer a set of six statements for each item. The respondent checks off the statement that best represents his/her feelings. Each of the statements was designed to sequentially represent more disability than the previous statement. These four measures have been developed using classical test theory (CTT).

Measurement Theory

The definition of measurement as the “assignment of numerals to objects or events according to some rule”, as proposed by Stevens (1946), has been widely adopted in the human sciences and forms the basis of questionnaire development and analysis. Responses are summed to form a score that represents the individual’s level of ability or agreement using classical test theory (CTT).
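As a concrete instance of summed scoring, the SFS rule described earlier in this chapter can be sketched in a few lines of Python. This is a minimal illustration only, not the publisher's scoring procedure: the function name `sfs_score` and the encoding of responses as strings are assumptions introduced here, and the number of items is left open because it is not stated above.

```python
# Hypothetical helper (not from the SFS manual). Point values follow the rule
# stated in the text: "1" -> 4 points, "2" -> 3, "3" -> 2, "4" -> 1,
# and no points for "5" (Unable) or "?" (don't know).
SFS_POINTS = {"1": 4, "2": 3, "3": 2, "4": 1, "5": 0, "?": 0}


def sfs_score(responses):
    """Return the summed score for a sequence of SFS item responses."""
    total = 0
    for response in responses:
        if response not in SFS_POINTS:
            raise ValueError(f"unexpected response: {response!r}")
        total += SFS_POINTS[response]
    return total


if __name__ == "__main__":
    unable = ["1", "2", "3", "4", "5"]
    unsure = ["1", "2", "3", "4", "?"]
    # Both patterns receive the same total because "5" and "?" both score zero.
    print(sfs_score(unable), sfs_score(unsure))
```

The two example response patterns differ only in the last item, which shows how the rule treats “don’t know” and “unable” identically in the total.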
CTT is based on the classical true score model as outlined by Crocker & Algina (1986), who say that “any observed test score could be envisioned as the composite of two hypothetical components - a true score and a random error component” (p. 66). This is expressed as X = T + E, where X is the observed test score, T is the individual’s true score and E is the error that occurs between the true score and the observed score for that individual on the given test.

CTT examines the success rate of a group of examinees on an item. The success rate of this group on any given item is known as the p value of the item and is used as the measurement of item difficulty. The higher the p value, the easier the item is to endorse. Item discrimination refers to the ability of an item to discriminate between higher and lower levels of the ability or trait we are attempting to measure. This is often expressed as the Pearson product-moment correlation coefficient (r) between the scores on the item and the total scores on the test. Alternatively, a discrimination index, D, is defined as the difference between endorsement/difficulty values of a high trait group and a low trait group.

Fan (1998) summarized the limitations of CTT as follows:

    The major limitation of CTT can be summarized as circular dependency: (a) The person statistic (i.e., observed score) is (item) sample dependent, and (b) the item statistics (i.e., item difficulty and item discrimination) are (examinee) sample dependent. This circular dependency poses some theoretical difficulties in CTT’s application in some measurement situations such as test equating, computerized adaptive testing (p. 357).

This means that the item parameters, even their order of difficulty/ease of endorsement, change depending on the population to which the items are administered; the standard error of measurement (SEM) is the same for all scores; and reliability changes with population. This limits the ability to generalize from one population to another.
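The CTT item statistics described above can be illustrated with a short sketch. It computes, for each item, the p value (proportion endorsing), the item-total Pearson correlation r, and a discrimination index D taken as the difference in endorsement between a high-scoring and a low-scoring group. The response matrix is made-up data, and the function names and the half-and-half group split are assumptions chosen for illustration rather than anything used in this study.

```python
from math import sqrt

# Made-up dichotomous responses: rows are persons, columns are items.
RESPONSES = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]


def pearson(x, y):
    """Pearson product-moment correlation (undefined if x or y is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_statistics(data):
    """Return a list of {'p', 'r', 'D'} dictionaries, one per item."""
    totals = [sum(person) for person in data]
    # Rank persons by total score and split into low and high halves.
    order = sorted(range(len(data)), key=lambda i: totals[i])
    half = len(data) // 2
    low, high = order[:half], order[-half:]
    stats = []
    for j in range(len(data[0])):
        item = [person[j] for person in data]
        p = sum(item) / len(item)        # proportion endorsing the item
        r = pearson(item, totals)        # uncorrected item-total correlation
        d = (sum(data[i][j] for i in high) - sum(data[i][j] for i in low)) / half
        stats.append({"p": p, "r": r, "D": d})
    return stats


if __name__ == "__main__":
    for number, s in enumerate(item_statistics(RESPONSES), start=1):
        print(number, round(s["p"], 2), round(s["r"], 2), round(s["D"], 2))
```

Re-running the function on a different subset of persons changes all three statistics, which is the sample dependence that Fan (1998) describes.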
Each of the four measures, the DPQ, ODI, SFS, and NDI, has been extensively researched in terms of reliability and validity using CTT statistical analyses, with the limitations described above. Inherent in these four measures is a scoring scale that assumes an equal interval between each score and that each item contributes equally to the total score. While these tools are built from ordinal items, the scoring is done in a manner that assumes the data are interval in nature. Bond and Fox (2007) point out that:

    while classification and seriation are necessary precursors to the development of measurement systems, they are not sufficient for measurement. The distinctive attribute of a measurement system is the requirement for an arbitrary unit of difference that can be iterated between successive lengths. (p. 4)

In contrast to CTT, Item Response Theory (IRT), also known as latent trait theory, focuses on item-level information as opposed to test-level information. IRT is applied in the development and refining of measurement instruments as well as for equating tests. The probability of a correct response to an item is expressed in terms of person and item parameters. Person parameters may be the ability of the individual or the strength of his convictions. Item parameters include difficulty and, in some models, discrimination and pseudo-guessing. Items may be in the form of right/wrong responses, statements that relate to level of agreement, or presence/absence/degree of symptoms. IRT models scale the ability of examinees and item difficulty on the same metric, allowing meaningful comparison of an item and the ability of the person.

Within IRT there are one-, two- and three-parameter logistic models. The three-parameter (3PL) model is so named because it employs three item parameters: item difficulty, item discrimination and pseudo-guessing.
It is a logistic model as it converts the raw score summary into its natural logarithmic odds ratio to produce a linear (interval) measure. The two-parameter (2PL) model assumes minimal guessing, but items can vary in terms of difficulty and discrimination. The one-parameter model assumes that there is minimal guessing and equivalent item discrimination, so that items are described by a single parameter, item difficulty (Crocker & Algina, 1986).

Karabatsos (1999) reported to the 32nd annual conference of the Society for Mathematical Psychology that:

    There is strong support that almost 100% of the time, the parameters of the 2 PL and 3 PL violate interval scaling. On the other hand, the theoretical probabilities of the Rasch models will always support a stable, interval scale structure. (p. 18)

There are two approaches to evaluating the single parameter of item difficulty: the One-parameter Logistic model (Birnbaum, 1968) and the Rasch Model (Rasch, 1960). For practical purposes, when the person sample is parameterized by a mean and standard deviation for item estimation, it is a 1PL IRT model. When each individual in the person sample is parameterized for item estimation, it is Rasch. While the 1PL model is primarily a descriptive, computationally simpler approximation to the Normal Ogive Model of L.L. Thurstone (1927), as developed by Lord (1952), the Rasch Model is prescriptive, offering distribution-free person ability and item difficulty estimates on a linear latent variable.

The Rasch model was developed by Georg Rasch (1901-1980), a Danish mathematician whose initial work was in the field of educational measurement. For several decades these methods have been applied in the health care field to improve the psychometric reliability and validity of self-report test batteries.
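The relationship among the one-, two- and three-parameter logistic models described above can be made concrete with a small sketch. The function below is the standard 3PL item response function; setting the pseudo-guessing parameter to 0 gives the 2PL form, and additionally fixing discrimination at 1 gives the one-parameter form, which shares its functional form with the dichotomous Rasch model. In that case the log-odds of endorsement equal person ability minus item difficulty. The symbol names (theta, b, a, c) are the conventional ones and the numeric values are arbitrary illustrations, not estimates from this study.

```python
from math import exp, log


def irt_probability(theta, b, a=1.0, c=0.0):
    """Probability of success/endorsement under the 3PL model.

    theta: person ability (logits)
    b:     item difficulty (logits)
    a:     item discrimination (a=1 and c=0 give the one-parameter form)
    c:     pseudo-guessing lower asymptote (c=0 gives the 2PL form)
    """
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


def logit(p):
    """Natural log-odds of a probability p, with 0 < p < 1."""
    return log(p / (1.0 - p))


if __name__ == "__main__":
    theta, b = 1.5, 0.5   # arbitrary illustrative values
    p_one = irt_probability(theta, b)                # one-parameter form
    p_two = irt_probability(theta, b, a=2.0)         # 2PL
    p_three = irt_probability(theta, b, a=2.0, c=0.2)  # 3PL
    print(p_one, p_two, p_three)
    # In the one-parameter form the log-odds recover theta - b.
    print(logit(p_one))
```

Because person and item locations enter only through the difference theta - b in the one-parameter form, persons and items can be placed on the same logit scale, which is the property the surrounding text relies on.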
Rasch analysis facilitates the calibration of ordinal measures to interval measures and therefore can improve confidence in scores obtained on these self-report tests. Rasch analysis is also useful for equating two or more instruments that purport to measure the same construct. By equating instruments, it can be determined whether the instruments measure the same construct and, if so, whether they measure it at the same level. Analysis with the Rasch model allows the researcher to order persons according to their perceived level of the latent trait and items according to their perceived difficulty/ease of endorsement. From this analysis, a method of scoring may be developed that improves the sensitivity and specificity of the tests. If all tests measure the same construct equally, a case can be made for administering only one of the tests to save time in the evaluation. It may also be determined that certain tests are more effective at measuring certain populations, thus assisting the examiner with test selection.

The purpose of this study is to evaluate the psychometric properties of the four instruments - the Oswestry Disability Index (ODI), the Dallas Pain Questionnaire (DPQ), the PACT Spinal Function Sort (SFS) and the Neck Disability Index (NDI) - using Rasch analysis to establish evidence for the reliability and validity of each instrument, as well as to attempt to equate the four instruments to evaluate the abilities of the persons and the difficulty of the items. This is important as these measurement instruments are currently being used by Work Capacity Evaluators to determine the reliability of the client’s subjective reports as compared to demonstrated abilities. This practice can potentially affect the continuation of disability benefits and the allotment of funds for rehabilitation, as well as monetary awards in litigation.
Since these instruments are frequently used as pre- and post-treatment measurements, efficacy of treatment is often established using a change in score on the questionnaire as the indicator. Improving the sensitivity and specificity of these instruments will, in turn, improve the accuracy of measurement of the trait in question.

CHAPTER TWO: LITERATURE REVIEW

Thew (2007) identified:

    a multitude of purposes for which Functional Capacity Evaluations are completed in Prince George. These include: (a) determining safe return to work to current employment; (b) pre-employment physical ability assessments; (c) determining current functional abilities or level of functioning or physical abilities; (d) comparing functional abilities to job demands; (e) developing a rehabilitation treatment plan; (f) assisting with return to work planning either to current occupation or alternative occupation; and (g) determining if evaluees are accurately reporting their abilities. (p. 20)

Self-report measures are used as part of Functional Capacity Evaluations to measure the individual’s perception of pain and disability, and the results are compared to clinical examination and functional testing scores. This comparison gives the evaluator insight into the accuracy of the individual’s perception, i.e. whether they are magnifying or minimizing their abilities/symptoms. Clinicians have come to trust the reliability and validity of these self-report measures. In the transcript of an interview in the study by Thew (2007), the Clinician was quoted as saying:

    We try to use the forms as much as possible. The research that was done to validate these questionnaires can back up that they’re valid and reliable. If the person says they can’t do this and then they demonstrate it, you know that the questionnaire is a valid questionnaire. (p. 52)

In a case such as the one described above by the Clinician, the individual might be considered to be magnifying his symptoms.
Symptom Magnification Syndrome is a term which was coined by Leonard Matheson (1988) and it refers “to the conscious or subconscious tendency of an individual to under-rate his/her abilities and/or over-state his/her limitations” (p. 11). An opinion rendered by a professional indicating symptom magnification can have tremendous negative consequences for a disabled person including loss of access to rehabilitation and loss of financial support. Conversely, if the individual is minimizing his symptoms and he overestimates his abilities compared to his true abilities, he can be returned to work and injure himself or others. For these reasons, perceived functional capacity is an important part of every Functional Capacity Evaluation. The Matheson System for Functional Capacity Evaluation (2006): embodies the professional value system endorsed by the American Psychological Association, American Physical Therapy Association, and the National Institute of Occupational Safety and Health (NIOSH), which states that each evaluation must address these five hierarchical components; Safety, Reliability, Validity, Practicality and Utility. (Chapter 2 - page 3). Since the FCE is a battery of tests including self report measures such as the Oswestry Disability Index (ODI), the Dallas Pain Questionnaire (DPQ), the PACT Spinal Function Sort (SFS) and the Neck Disability Index (NDI), each of these measures needs to satisfy the components as outlined above.

Instruments

Oswestry Disability Index (ODI)

The Oswestry Disability Index (ODI), see Appendix 1, is one of the most commonly used condition-specific assessments. The ODI is a ten item scale that measures pain intensity, personal care, standing, sleeping, lifting, walking, sitting, sex life, social life and ability to travel. Each item includes six potential responses such that each response is presumed to describe a greater degree of disability ranging from no disability to total disability.
Individual items are scored from “0” to “5” and then summed and doubled to obtain a percentage score. Fairbank, Couper, Davies and O’Brien (1980) designated five categories to interpret the Oswestry score. Low percentage scores represented less disability, whereas higher percentage scores indicated more disability. These five categories of disability are 0% to 20%, minimal; 20% to 40%, moderate; 40% to 60%, severe; 60% to 80%, crippled; and 80% to 100%, bed bound or exaggerating. The overlap at the transition points of each category is concerning; a score of 20% could relate to minimal or moderate disability. Despite the wide use of this scale and the many validation studies that have been done, this issue has not been addressed. Categories of disability provide the context from which to interpret a score and this can be important in the awarding of disability benefits. The scores are frequently used in case studies and clinical research as a reference point to interpret outcomes. The Oswestry Disability Index is also cited in the literature as the Oswestry Pain Questionnaire and the Oswestry Low Back Pain and Disability Index, the latter being its original name (Fairbank et al., 1980). The name has now been shortened to the Oswestry Disability Index (ODI). It has become one of the most frequently used outcome measures for spinal disorders. A Medline review done on October 25, 2008 using a simple name search revealed 402 hits in the database compared to 2 hits for the Spinal Function Sort, 33 hits for the Dallas Pain Questionnaire and 175 hits for the Neck Disability Index. There are several versions (1.0, 1.1 and 2.0) of the ODI with small wording differences. There is also a revised version developed by a chiropractic study group in the United Kingdom with the intent to improve the sensitivity of the scale for less disabled persons (Hudson-Cook, Tomes-Nicholson & Breen, 1989). In the revised version,
Section 8, the sex life question was omitted and a new section was added that was called “changing degree of pain” - this related to whether the pain was improving or getting worse overall. The original version focused on current pain levels, not changes in pain. Several other sections were reworded. For instance in Section 4 - Walking - the ODI version 2.0 offered the following choices:
1. Pain does not prevent me walking any distance.
2. Pain prevents me walking more than 1 mile.
3. Pain prevents me walking more than ½ of a mile.
4. Pain prevents me walking more than 100 yards.
5. I can only walk using a stick or crutches.
6. I am in bed most of the time and have to crawl to the toilet.
The Revised Oswestry Disability Index offered these choices instead:
1. I have no pain on walking.
2. I have some pain with walking but it does not increase with distance.
3. I cannot walk more than 1 Mile without increasing pain.
4. I cannot walk more than ½ Mile without increasing pain.
5. I cannot walk more than ¼ Mile without increasing pain.
6. I cannot walk at all without increasing pain.
Fairbank and Pynsent (2000) had this to say about the revised version: Its (the Revised Oswestry Disability Index) objective was to increase the sensitivity of the scale for less disabled patients, but it confuses impairment with disability...In the authors’ view, this version is not acceptable, because it confuses impairment questions with disability questions. Its wording is often complex, and some sections do not allow for no symptoms. It allows a measurement of changing symptoms, however.
The statements “Pain does not prevent me walking any distance” from the ODI 2.0 and “I have no pain on walking” as used in the Revised ODI are not equivalent in my view. One statement refers to how pain affects the individual and the other refers to the amount of pain the person is experiencing.
An individual could have pain with walking but not let that pain prevent him or her from walking any distance. Also, the increments between distances walked are different. With the utilization of the metric system in Canada, younger respondents may have difficulty with distances in Imperial units. All versions are scored in the same manner for a score out of 50 possible points. The score is then doubled to provide a percentage which relates to the amount of perceived disability. An individual taking the Revised ODI might have a score that indicates a higher or lower level of disability than a similar individual would score on the ODI. The Matheson Functional Capacity Evaluation Certification Program uses only the ODI 1.0 and therefore only this version will be discussed with regard to development, testing, reliability and validity.

Development of the Oswestry Disability Index

In 1976, the development of this condition specific outcome measure was initiated by John O’Brien. An orthopedic surgeon, Stephen Eisenstein, and an occupational therapist, Judith Couper, interviewed patients with back pain and, from responses obtained regarding limitations to activities of daily living, several drafts of the questionnaire were tried. The final version was tested on 22 patients with a one day test-retest reliability of r = .99, p < .001 (Fairbank et al., 1980). This high correlation may contain a memory effect and in fact subsequent studies showed lower correlations when the test-retest interval was increased. With a four day interval, Kopec et al. (1996) reported a reliability of r = .91 and with a one week interval, Gronblad et al. (1993) found a reliability of r = .83. With longer intervals between tests, natural symptom fluctuation may influence the results.
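As a worked illustration of the ODI scoring rule described above (ten items scored 0 to 5, summed and doubled) and of the five interpretation bands attributed to Fairbank et al. (1980), the following Python sketch may be helpful. The function names are hypothetical and are not part of any published ODI tool, and the bands are treated as closed ranges exactly as they are quoted, which is an assumption made in order to expose the overlap at the transition points.

```python
# Illustrative sketch only: function names are hypothetical, not from the ODI literature.

# Interpretation bands as quoted from Fairbank et al. (1980), read as closed ranges.
ODI_BANDS = [
    (0, 20, "minimal"),
    (20, 40, "moderate"),
    (40, 60, "severe"),
    (60, 80, "crippled"),
    (80, 100, "bed bound or exaggerating"),
]


def odi_percentage(item_scores):
    """Sum ten item scores (each 0-5) and double the total to give a percentage."""
    if len(item_scores) != 10:
        raise ValueError("the ODI has ten items")
    if any(not 0 <= s <= 5 for s in item_scores):
        raise ValueError("each item is scored from 0 to 5")
    return 2 * sum(item_scores)


def odi_bands(percent):
    """Return every quoted band whose range contains the percentage score."""
    return [label for low, high, label in ODI_BANDS if low <= percent <= high]


if __name__ == "__main__":
    score = odi_percentage([1] * 10)   # ten items each scored 1
    print(score, odi_bands(score))     # a boundary score falls in two bands
```

A score exactly on a boundary is returned with two labels, which is the ambiguity in the published categories noted earlier.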
Within CTT internal consistency is commonly calculated using Cronbach’s α, which is defined as:

$$\alpha = \frac{N}{N-1}\left(1 - \frac{\sum_{i=1}^{N}\sigma^2_{Y_i}}{\sigma^2_X}\right)$$

where $N$ is the number of items, $\sigma^2_X$ is the variance of the observed total test scores, and $\sigma^2_{Y_i}$ is the variance of the $i$-th of the $N$ individual items (Cronbach, 1951). When α is large it can be assumed that the total score is a reasonable representation of the individual item scores. Strong, Ashton and Large (1994) reported an internal consistency for the ODI of α = .71 on a sample of 100. The sample size of the original study was 22, so this sample size is an improvement. Given that the formula for standard error is $SE = \sigma/\sqrt{n}$, where sigma is the standard deviation and n is the number of participants, it is easy to understand that the larger the sample, the greater the precision in the results. One validation study of the ODI was reported by Fairbank et al. in 1980. They evaluated a group of 25 individuals with a first episode of low back pain over a three week period. Since this was a first episode of low back pain, there was a reasonable expectation of improvement over the three week period. They reported that the results of a t-test using the scores at the beginning and end of the three week period were significant at a level of p < .05. Overall the percentage of disability had dropped by 28%. Again the sample size is small, allowing a greater chance of error. The sample is not representative of all back pain patients, which is the population the instrument is currently used to assess. Beurskens, de Vet, & Koke (1996) performed a study with 81 subjects who had suffered with non-specific back pain for at least 6 weeks. These subjects were tested with the ODI before and after treatment. Using an external criterion for improvement, it was determined that 38 of the subjects had improved. The effect size was calculated to be 0.8. Gronblad et al. (1993) showed a moderate correlation of r = .62 with Huskisson’s (1974) Visual Analogue Scale, an established pain measure, on a sample of 94 patients.
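To make the two formulas discussed above concrete, the following Python sketch computes Cronbach's alpha from a small respondent-by-item score table and the standard error of the mean as the standard deviation divided by the square root of the sample size. It is a minimal illustration under stated assumptions: the function names and the toy data are hypothetical, sample variances (n - 1 denominator) are used for both item and total variances, and nothing here reproduces data from the studies cited.

```python
# Illustrative sketch only: names and toy data are hypothetical.
import math
from statistics import variance  # sample variance (n - 1 denominator)


def cronbach_alpha(scores):
    """scores: list of respondents, each a list of N item scores."""
    n_items = len(scores[0])
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the observed total scores.
    total_var = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


def standard_error(sd, n):
    """Standard error of the mean: standard deviation over sqrt(sample size)."""
    return sd / math.sqrt(n)


if __name__ == "__main__":
    toy = [[1, 1], [2, 2], [3, 3]]          # two perfectly consistent items
    print(cronbach_alpha(toy))
    # Same standard deviation, sample sizes of 22 and 100:
    print(standard_error(10, 22), standard_error(10, 100))
```

For a fixed standard deviation, moving from 22 to 100 participants shrinks the standard error by a factor of roughly two, which is the sense in which the larger sample gives more precise results.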
Following these studies, the ODI became the “gold standard” in the rehabilitation field. It was then used to validate other instruments such as the Pain Disability Index (Pollard, 1984), the Low Back Outcome Score (Greenough & Fraser, 1992) and many others. The issues with small sample size in the original studies seem to have been overlooked. Additionally, the use of differing populations to validate the instrument, i.e. patients presenting with a first episode of back pain vs. non-specific back pain of more than 6 weeks duration, limits the generalizability of the results due to the circular dependency inherent in CTT. Fairbank and Pynsent (2000) proposed that “The wide use of the ODI is part of the validation process.” Thom Walsh (2000) responded to this claim as follows: The thought that wide use and reasonable performance as expected on a small sample are synonymous with validation and a rigorous review is one that falls short of current capabilities in the field. It should no longer be enough to simply report findings that turned out as expected, or that a gold-standard measure is crowned as a result of widespread use. Good validation studies should state a clear hypothesis and test it using a rigorous design and statistical analysis. This review article nicely compiles a wide range of work utilizing the ODI over the past 20 years. While the breadth of this compilation is notable, and the validation steps taken at various times have raised interesting questions, it has not, in my opinion, established a gold-standard measure (p. 2953). The current researcher’s enthusiastic endorsement of Walsh’s (2000) views should be recognized as the raison d'être for this study. This is not simply for investigation of the ODI but for other instruments as well. In addition to the reliability and validity issues, the ODI consists of ordinal items totaled as a sum of equal-valued items to produce a disability rating.
However, item values have been assigned rationally rather than empirically, resulting in total scores that do not proportionally indicate the trait. There have been many changes in the field of measurement since the Oswestry Disability Index was first developed. Applying newer techniques such as Rasch analysis can improve the ability to assess self-reported disability levels with the ODI as well as with the DPQ, SFS and NDI.

Dallas Pain Questionnaire (DPQ)

Lawlis, Cuenas, Selby and McCoy (1989) developed the Dallas Pain Questionnaire (DPQ) to measure the impact of chronic spinal pain on four aspects (daily and work-leisure activities, anxiety-depression, and social interest) of a respondent’s life; see Appendix 2 for the full questionnaire. The DPQ is considered to measure two factors: Functional Activities and Emotional Capacities. For this study only the items that report on the Functional Activities Factor are included. This relates to sections I to X, which include Pain Intensity, Personal Care, Lifting, Walking, Sitting, Standing, Sleeping, Social Life, Travelling and Vocational. It should be noted that a factor analysis conducted by Lawlis et al. (1989) showed loadings of .495 and .202 for the pain item on the Functional Activities Factor and the Emotional Capacities Factor respectively. Other item factor loadings were .615 and above. Correlation studies of the item scores with total scores showed that the pain item correlated .65 with daily activities and .52 with work/leisure activities (p < .0001). Correlation coefficients for all the other items were .78 and higher. Each item is scored on a visual analog scale (VAS). The standard VAS is a 10 cm continuous line between two points where the respondents indicate their level of agreement to a statement by indicating a position on the line. The score is then determined by measuring the distance from the end point to the indicated position.
In the DPQ discrete values are created for each item by adding delineating breaks to the scale. The scales are anchored at the beginning with words such as “I can lift as much as I did” and at the end with words such as “I cannot lift at all”. The respondents indicate their level of agreement by placing an “x” in one of the delineated segments. The length and number of segments of each scale vary - Sections I-VI are six units in length and Section VII is five units in length. The scores for Sections I-VII are added and multiplied by three to obtain a percentage score for that aspect (daily activities) of the Functional Activities Factor. Similarly for the second aspect, work-leisure activities, the scores on items VIII to X are added. These items are eight, seven and eight units in length, scored “0-7”, “0-6” and “0-7” respectively. The score is multiplied by five to obtain a percentage. Lawlis et al. (1989), who developed this instrument with varying scale lengths, explain the rationale for the lack of uniformity in scale length as follows: Using previous pilot studies, differential weighting of each segment accounted for variances of total scores; therefore, by applying different numbers of segments with respect to high predicting variables, the scoring could be done without complicating the process by multiplying each segment before summing. For example, “lifting interference” was weighted slightly more than “sleeping interference” and hence was segmented into six rather than five scoring weights[ui]. (p. 512) Speed of administration and scoring are identified by the authors as positive features of the measurement tool and they comment that the test can be scored “in 60 seconds or less” (Lawlis et al., 1989). Lawlis et al. (1989) reported a “stability reliability coefficient of 0.970 using the method described by Anastasi and Cronbach in 1961.” Their analysis included a total of 143 subjects divided into “pain” and “non-pain” groups.
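The DPQ scoring rule for the Functional Activities Factor, as described above, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the developers' procedure: the function and constant names are hypothetical, and it is assumed that a scale of six units is scored 0-5 and a scale of five units is scored 0-4, by analogy with the eight- and seven-unit scales that the text states are scored "0-7" and "0-6".

```python
# Illustrative sketch only: names are hypothetical, and the 0-based scoring of
# Sections I-VII is an assumption inferred from the scoring stated for VIII-X.

DAILY_MAX = [5, 5, 5, 5, 5, 5, 4]   # Sections I-VI (six units), VII (five units)
WORK_LEISURE_MAX = [7, 6, 7]        # Sections VIII, IX, X


def _checked_sum(scores, maxima):
    """Validate each item against its own scale length, then sum."""
    if len(scores) != len(maxima):
        raise ValueError("wrong number of items")
    for score, top in zip(scores, maxima):
        if not 0 <= score <= top:
            raise ValueError("item score outside its scale")
    return sum(scores)


def dpq_daily_activities(scores):
    """Sections I-VII: raw total multiplied by three."""
    return 3 * _checked_sum(scores, DAILY_MAX)


def dpq_work_leisure(scores):
    """Sections VIII-X: raw total multiplied by five."""
    return 5 * _checked_sum(scores, WORK_LEISURE_MAX)


if __name__ == "__main__":
    print(dpq_daily_activities([2, 1, 3, 2, 2, 1, 0]))
    print(dpq_work_leisure([4, 3, 5]))
```

Under these assumed segment scores the maximum raw total for work-leisure activities is 20, giving exactly 100, while the maximum raw total for daily activities is 34, so the multiplied score can slightly exceed 100.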
The pain group consisted of 104 chronic back pain patients, 48 women and 56 men, undergoing pain management training and treatment in an inpatient program who had been medically diagnosed and referred, as well as 15 patients, five women and ten men, who had been discharged from the inpatient program to work and who were working. (Many people are discharged from chronic pain programs to work but do not return to work.) The comparison group consisted of 24 controls recruited from clinic staff and local airline employees. The problems with small samples (the working chronic pain group and the control group) and short time frames such as these were discussed in the Oswestry Disability Index section. Also, no mention is made in the published results of any attempt to match the controls with the cases for demographics such as race, age or gender, nor to screen for prior back injury or current back pain. They report on a t-test that demonstrated that chronic pain patients have significantly higher DPQ scores than “normals”. Concurrent functional validity was tested by comparing the scores of a small sample of 15 patients who were returning to work with scores obtained from functional capacity evaluation on tasks such as manual material handling. There was a negative correlation between this measured ability and scores on the DPQ. Citing this, Lawlis reported that “because these findings support its statistical properties, the DPQ appears to have utility for clinical and research purposes.” There is a large population of back pain patients who have not undergone extensive inpatient rehabilitation and who were not represented in the sample. In fact, it could be argued that the largest population of individuals with chronic back pain remains in the workforce.
Despite the fact that few studies have been done to further establish the psychometric properties of this instrument, the DPQ is now widely used to assess the consequences of chronic low back pain (LBP) and outcomes of treatment. Christensen, Laursen, Gelineck, Hansen and Bünger (2001) used it to assess the functional outcomes of posterolateral spinal fusion at unintended levels due to bone-graft migration. They reported that there was no significant difference in functional ability following this complication based on the results from the DPQ. Roche et al. (2007) used the DPQ as one of the measures for comparison of a functional restoration program with active individual physical therapy for patients with chronic low back pain. This population of working individuals with long-standing back pain was not included in the instrument development but the DPQ is being used to evaluate them. In France, Marty, Blotman, Avouac, Rozenberg and Valat (1998) developed a French language version which they reported to be reproducible, valid and sensitive. Ozguler et al. (2002) used the French DPQ to classify individuals with low back pain in a working population into 4 groups ranging from slightly disabled to disabled with emotional consequences. They first changed the scoring from the visual analogue scale to a numerical rating scale of 1-10 for all items. They felt that this made it more homogeneous but did no validation studies to support their conclusion. There are numerous other examples of use (and misuse) of this measurement tool within the literature.

Spinal Function Sort (SFS)

The PACT Spinal Function Sort (SFS) is a self-report measurement of physical work capacity that employs the use of pictorial activity and task sorts (PATS). See Appendix 3 for a copy of this instrument. According to Matheson (2004): the PATS approach is an efficient means of gathering information about ability to perform a wide variety of work activities and tasks in a brief period of time.
In addition to providing information about abilities, these measures can provide information about the evaluee’s psychological status that may be valuable for rehabilitation planning (p. 175). The Spinal Function Sort (SFS) was developed by Leonard Matheson and Mary Matheson (1989). The items are a set of 50 pen and ink drawings presented in booklet format, 2 drawings to a page. Each drawing depicts a person involved in a work task with a brief description of the activity below the drawing. Standard instructions are read to the individual regarding how they should score each item. The examinee indicates his/her ability to perform each task as “Able” (scored as “1”) to “Unable” (scored as “5”) with “2”, “3” and “4” indicating slightly restricted (scored “2”) to very restricted (scored “4”). There is also a “Don’t Know” category. Two pairs of items within the 50 items are identical to gauge response consistency. To obtain a score on the instrument, the assessor counts the number of responses in each category “1” through “5”. All of the “1” responses are multiplied by a factor of four, the “2” responses are multiplied by a factor of three, the “3” responses are multiplied by a factor of two, and the “4” responses are multiplied by a factor of one. These products are then added to determine an overall rating of perceived capacity (RPC) score which is then related to the Physical Demand Characteristics chart found in Appendix 5. The Spinal Function Sort was developed by the Mathesons in 1989 in response to a perceived need for an assessment that emphasized material handling tasks or activities of daily living tasks that involved spinal movements or loading. To develop the instrument, 500 pictures of men and women performing such tasks were collected. Line drawings were made of the photographs for each task. The 208 resulting tasks were made into a card sort deck and given to 5 evaluators who grouped tasks according to biomechanical demands of the task.
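The rating of perceived capacity (RPC) scoring rule described above (responses of "1" weighted by four, "2" by three, "3" by two and "4" by one, then summed) can be sketched in Python as follows. The names are hypothetical and the sketch rests on two assumptions that the text does not state explicitly: that "5" (Unable) responses contribute nothing to the total, and that "Don't Know" responses, represented here as None, are likewise given no weight.

```python
# Illustrative sketch only: names are hypothetical. The zero weight for "5" and
# the handling of "Don't Know" (None) are assumptions, not stated in the source.
from collections import Counter

RPC_WEIGHTS = {1: 4, 2: 3, 3: 2, 4: 1, 5: 0}


def rpc_score(responses):
    """responses: iterable of ratings 1-5, with None for "Don't Know"."""
    counts = Counter(r for r in responses if r is not None)
    unknown = set(counts) - set(RPC_WEIGHTS)
    if unknown:
        raise ValueError("ratings must be 1-5 or None")
    # Multiply the count in each category by its weight and add the products.
    return sum(RPC_WEIGHTS[rating] * n for rating, n in counts.items())


if __name__ == "__main__":
    print(rpc_score([1] * 50))               # every task rated "Able"
    print(rpc_score([1, 2, 3, 4, 5, None]))  # one response in each category
```

With 50 items the score under these assumptions ranges from 0 (all tasks rated "Unable") to 200 (all tasks rated "Able").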
During the sorting process 43 groups were determined and a representative task was selected for each group by the test developers. Five tasks were added based on suggestions by the evaluators to give a final count of 48 tasks, two of which would be replicated within the set to measure consistency. Test-retest reliability was established with a two day test-retest Pearson product moment correlation of r = .85 (Matheson et al., 1989). The developers report a variety of test-retest reliabilities ranging from r = .85 for the two day test-retest to r = .77 for the eight day test-retest. Matheson, Matheson, and Grant (1993) reported further reliability studies in the Journal of Occupational Rehabilitation and suggested that additional research was needed: The validity of this instrument in terms of its relationship to functional performance status and to other measurable changes in status that occur with treatment is also an appropriate focus of investigation. Finally, research is needed to analyze the factor structure of the SFS. This will be useful to better understand the underlying dynamics of the components of perceived functional ability that are sampled by the SFS. (p. 27) Gibson and Strong (1996) published a study evaluating the reliability and validity of the SFS. The sample consisted of 34 men and eight women presenting for functional capacity evaluation who had diagnoses including lumbar, thoracic, neck and/or shoulder sprain, along with chronic illnesses such as systemic lupus erythematosus. A sub sample of 14 of the 42 subjects (ten men and four women) in the study attended for a second administration of the SFS four to fourteen days later. No indication of why this sample size was used or how participants were selected is given in the article. Test-retest reliability was established using the intraclass correlation coefficient (ICC = .89). Internal consistency was measured using Cronbach’s alpha (α = .97).
Again with this study, I have issues with the sample size and with the high internal consistency, which can indicate item redundancy. To further examine the validity of the Spinal Function Sort as a measure of perceived capacity for work-related tasks in persons with chronic back pain, Gibson and Strong (1996) used correlational methods to determine the relationship between scores on the Spinal Function Sort and scores on other scales with established validity for measuring similar constructs in persons with chronic pain. Multiple regression was used to examine the prediction of scores on the Spinal Function Sort by the other measures used in the study. The Spinal Function Sort correlated significantly (p < .001), adjusted R² = .64, df = 5, with the Pain Self-Efficacy Questionnaire, Self-Efficacy Scale, Pain Disability Index, and Work Reentry Questionnaire 24. With Bonferroni adjustment for multiple comparisons requiring probability of less than .003, the Spinal Function Sort still correlated significantly with each of these measures and, of course, in the anticipated direction. Gibson and Strong (1996) reported that the study results supported the test-retest reliability of this instrument as well as its internal consistency. They expressed that some support for construct validity of the SFS as a measure of perceived capacity for work had been obtained. They comment that “the SFS depicts tasks that are compatible with those assessed in a functional capacity evaluation (FCE), thus allowing comparison of perceived capacity with the capacity observed in a functional capacity evaluation.” They administered the SFS to the 42 clients presenting for FCE but did not report on any relationship between perceived capacity and actual performance. In the clinical setting this difference is often used as an indicator of symptom magnification.
Robinson et al. (2003) evaluated the clinical utility of the Spinal Function Sort with a group of postoperative and non-operative back patients who had completed a functional restoration program. The SFS was administered both before and after the functional restoration program and was found to measure change effectively. They reported that “Overall, the SFS was found in the present study to be sensitive enough to detect improvement in the functioning capacity of a postoperative spinal group as a result of a functional restoration program”. Certified Work Capacity Evaluators (CWCEs) are trained to use this instrument and it is part of the standard test battery in the Matheson Functional Capacity Evaluation, yet the method for scoring this instrument seems to have been arbitrarily developed. The application of Rasch Analysis can determine if the current method of scoring is effective or if items should be scaled differently. Test equating could confirm the assumption that perceived physical work capacity as measured by the SFS is a similar construct to perceived disability as measured by the Oswestry Disability Index, Dallas Pain Questionnaire and the Neck Disability Index.

Neck Disability Index (NDI)

Vernon and Mior (1991) developed the NDI to assess how neck pain in individuals affects their activities of daily living. See Appendix 4 for a copy of the NDI. It was adapted from the Oswestry Disability Index for use with populations who have neck pain rather than low back pain. The ten items measure various levels of neck pain, headache, personal care, work, driving, lifting, recreational activities, reading, sleeping and concentration. Each item includes six potential responses, each describing a greater degree of disability, ranging from no disability to total disability. The NDI's total percentage score is calculated by adding the individual item scores (which range from 0 to 5), doubling the total and expressing the result as a percentage.
A higher score is indicative of greater perceived disability associated with the neck disorder. Vernon and Mior (1991) reported on the reliability and validity of the NDI after it had been tested on a small cohort of 17 patients. Small sample size seems to be a recurring issue with these measurement instruments. They asserted that the test-retest reliability of r = .89 (over 48 hours) showed response stability, and that the instrument was responsive to change in condition, as assessed by comparing the percentage of change on a subset of 10 patients before and after treatment. They also found that it correlated significantly with the Pain Visual Analogue Scale (Huskisson, 1974) and the McGill Pain Questionnaire (Melzack, 1975). They also postulated that the NDI might be assessing two different factors which were represented by tasks that were voluntary vs. obligatory. They reported an internal consistency of α = .80 but did not address the possibility of item redundancy as a possible contributor to the high correlation and did not divulge the inter-item correlations. Inter-item correlation of .9 or more indicates item redundancy (Streiner & Norman, 1989). In 1998, Hains, Whalen and Mior postulated that the NDI may contain response set bias, as all of the items start with the lowest degree of difficulty and progress to the highest. The patient could be responding consistently by selecting the same level on each question regardless of the question. They developed seven variations of the NDI to determine if this was an issue. However, they determined that the responses obtained from the 237 subjects were related to content rather than response set bias. Hains et al. (1998) concluded that, “This study supports the use of the NDI as a homogeneous instrument possessing stable psychometric characteristics that could provide a means of assessing the disability and the response to treatment over time for individual patients suffering from neck pain” (p.
77). Ackelman and Lindgren (2002), when validating a Swedish translation of the NDI, reported that “The NDI for the neck pain subjects was well distributed and neither ceiling nor floor-effects could be seen” (p. 286). Cleland, Childs and Whitman (2008) developed a study with 137 mechanical neck pain participants “to examine the psychometric properties including test-retest reliability, construct validity, and minimum levels of detectable and clinically important change for the Neck Disability Index (NDI).” They found that: the NDI and NRS (Numeric Pain Rating Scale) exhibit fair to moderate test-retest reliability in patients with mechanical neck pain. Both instruments also showed adequate responsiveness in this patient population. However, the MCID (minimal clinically important difference) required to be certain that the change in scores has surpassed a level that could be contributed to measurement error for the NDI was twice that which has previously been reported. Therefore the ongoing analyses of the properties of the NDI in a patient population with neck pain are warranted (p. 73). This publication appeared to upset one developer of the instrument (Vernon) and he responded in the July 2008 Letters to the Editor of the Archives of Physical Medicine and Rehabilitation, citing numerous other studies which reported better test-retest reliability than the .50 reported by Cleland et al. He felt that the treatment interval of 2-4 days was too short to show change in the patient and that an interval of 2 weeks would be more appropriate. In a rebuttal to Vernon’s defense of the NDI, Cleland et al.
(2008) defended their experimental design in the same issue of the Archives of Physical Medicine & Rehabilitation and stated that “Examination of the psychometric properties is an ongoing process and we urge more investigation into the NDI, as well as continued work in the research community on some sort of standardized approach to examining the psychometric properties of self-report questionnaires in general” (p. 1416). Recently, a study by van der Velde et al. (2009) evaluated the measurement properties of the Neck Disability Index using Rasch Modeling for a sample of 521 subjects with neck pain. They reported that the NDI in its original form was not a unidimensional interval-level scale but that this could be, and was, accomplished with the removal of two misfitting items: headaches and lifting. They found disordered thresholds in five of the items (personal care, lifting, headaches, work and recreation) but chose not to correct the disordered thresholds because doing so would preclude the possibility of providing a straightforward exchange between the everyday summed ordinal score and its corresponding interval score. They also felt that collapsing the categories would result in a varying number of categories across items, which was a significant departure from the original design. They recommended further examination of this instrument with consideration being given to collapsing the categories in a systematic and clinically relevant way. I agree with Cleland et al. (2008) regarding the need for a standardized approach to examining the psychometric properties of self-report questionnaires and I am heartened to see the NDI, which was developed and validated with small samples, rigorously examined using a sample of 521 patients. I am concerned that researchers can, and do, alter the questionnaires to suit their needs and apply the values for reliability and validity that were obtained using another form of the instrument, as was done by Ozguler et al.
(2002) with the Dallas Pain Questionnaire. Measurement in the social sciences needs to meet the standards of the hard sciences when it comes to the development and use of measurement tools. We do not change the length or scale of a ruler because it fits better in our pocket, so neither should we change the scale of a pen and paper measurement tool because it suits us at that time. We know that with a twelve inch ruler, each inch contributes equally to the one foot measurement; we need to endeavor to develop instruments that can be relied upon in this fashion in the social sciences.

Measurement Theory

The four measures (ODI, DPQ, NDI and SFS) to be studied were all developed using classical test theory (CTT). Using this model, item values are assigned rationally rather than empirically. This results in scores that do not proportionally indicate the trait, cannot be compared proportionally over time and cannot be linked to an external standard. Matheson et al. (2008) clarify the problems with items constructed using classical test theory:

While scientific measures rely on proportional values, many self-report instruments used in healthcare do not. Many count each item selected by the patient without assurance that all items have the same unit value. Others add number values assigned to ordinal scale items without assurance that the proportionality indicated by the numbers reflects the item’s true value. Both types of measures also sum the item scores without the assurance that the measure’s total scores have proportional value. Addition of item scores is used to derive a total score or division of the total by the number possible is used to derive a percent score, suggesting that these instruments have mathematical qualities that they do not have. The absence of proportional value calibration in items limits the ability of such instruments to quantify a patient’s status dependably across the range of reported scores. (p.
46)

Item Response Theory (IRT) addresses the issues raised by Matheson, and some researchers have begun to develop and refine measurement instruments for the social sciences using these models. Using Rasch analysis, Davidson (2008) compared three versions of the Oswestry Disability Questionnaire given to 100 patients at their first admission to one of seven outpatient hospital clinics or one of nine private practice physiotherapy practices in Australia. The initial questionnaire was completed in person at admission and the follow up was done by mail four weeks later. Her results showed unidimensionality on all items except for the “changing degree of pain” item that had been added on the chiropractic form.

Page, Shawaryn, Cernich, & Linacre (2002) applied Rasch modeling to the Revised Oswestry Disability Questionnaire (RODQ), which was the version developed by the chiropractors. Their findings were as follows: “Several Rasch analyses were performed, with Item 1 Pain deleted and 2 response categories collapsed, creating a better test without increased error. A schema for item administration and evaluation was also developed” (p. 1579). Page et al. (2002) suggest that the revised instrument “boasts good psychometric characteristics, although future researchers may want to subject it to further analysis” (p. 1583).

White and Velozo (2002) applied the Rasch Model to original Oswestry Disability Questionnaire responses from 942 patients with the following results:

All items from the Oswestry except the pain item fit the Rasch model. Construct validity of the scale using the Rasch model required the structure of the rating scale to be modified from 6 response levels to 4. A hierarchical representation of LBP disability was supported. A comparison of the disability categories based on Likert and Rasch scaling revealed them to be non-equivalent. The new scaling changed the disability categories for 44% of patients. (p.
822)

It should be noted that the White and Velozo (2002) results are similar to the Page et al. (2002) results even with a sample size that was ten times that of the latter sample. No further analysis has been done to validate the resulting ODI revision. This is necessary before it is used as a clinical or research measurement tool. A literature search for Item Response Theory or Rasch Modeling with the DPQ and SFS returned no results.

Item Response Theory (IRT)

Item response theory (IRT) is a psychometric theory that consists of a series of mathematical models which relate person and item parameters to the probability of the responses on a discrete outcome, such as a correct response to an item or an endorsement of a category of a trait. The attraction of Item Response models lies in their promise of invariant item and person parameters provided there is data-model fit. That is, estimates of ability are not dependent on the difficulty of items. IRT also provides a basis for estimating several item parameters (i.e., difficulty, threshold, guessing, and category intersection), ascertaining how well data fit a model, and investigating the psychometric properties of assessments. As there is a single person characteristic assumed to account for the responses, the model is described as “unidimensional”. “Compared with classical test theory, IRT generally provides more sophisticated information regarding the psychometric properties of individual assessment items. The application of IRT has been wide, including the measure of personality traits, moods, behavioral dispositions and attitudes, as well as cognitive traits. Moreover, IRT is frequently applied to many health measurements” (Tsutsumi et al., 2008, p. 110).

Within IRT there are three probabilistic measurement models: the 1-parameter (1PL), 2-parameter (2PL) and 3-parameter (3PL), named by the number of item parameters estimated in each model.
All three models can be derived from the equation below for the 3-parameter model:

$$P(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-1.7\,a_i(\theta - b_i)]}$$

where the three item parameters are
$c_i$ = low asymptote of ogive (guessing)
$b_i$ = median intercept of ogive (difficulty)
$a_i$ = slope of ogive at inflection (discrimination),
and the one person parameter is
$\theta$ = ability of a person on the variable

To derive the 2-parameter model, “$c_i$” is held constant at zero, eliminating “guessing”. For the 1-parameter model, “$a_i$” is held constant for all items and is often scaled to equal one. When “$a_i$” is held constant this implies that all items on a test are equally discriminating. This leaves item difficulty as the sole parameter being estimated. The 1PL is expressed as follows:

$$P(\theta) = \frac{1}{1 + \exp[-1.7\,a_i(\theta - b_i)]}$$

Mathematically, the Rasch Dichotomous Model is identical to the 1-parameter IRT model with a formula of

$$P(\theta) = \frac{e^{(\beta_n - \delta_i)}}{1 + e^{(\beta_n - \delta_i)}}$$

where
$e$ = base of natural logarithm or Euler’s number; 2.7183
$\beta_n$ = person’s ability
$\delta_i$ = item or task difficulty

However there are some important differences. As Shaw (1991) explains:

This approach seems to imply that the Rasch model is just a stripped-down version of more complicated models which “must be better” because they account for more of the “presumed reality” of traditional test theory. Quite apart from Occam’s razor (that entities are not multiplied beyond necessity), this interpretation is shallow in an essential way. That the Rasch model can be reached by simplifying more complicated models has nothing to do with its genesis or rationale, or with the theory of measurement. (p. 131)

The Rasch model is based on measurement principles that provide sample-free item calibrations and test-free person measures on a common linear scale that can be analyzed statistically. Introducing the parameters for item discrimination and guessing violates these principles of measurement as outlined by Shaw (1991) that: 1.
the measures of objects be free of the particulars of the agents used to estimate these measures and the calibrations of agents be free of the particulars of the objects used to estimate these calibrations.
2. the measures of objects and calibrations of agents function according to the rules of arithmetic on a common scale so they can be analyzed statistically.
3. linear combinations of measures and calibrations correspond to plausible concatenations of objects and agents. (p. 131)

The mathematical elegance (simplicity) of the model allows for superior estimation capabilities. This makes Rasch, with its sole parameter of item difficulty, a more viable proposition for practical testing. In Rasch model thinking, the model is superior and data which does not fit the model is discarded.

Rasch Analysis

Rasch analysis is a statistical procedure used to transform ordinal-scaled measures into interval-scaled measures that provide good reliability and acceptable quantitative validity measured with fit characteristics. A primary advantage of using Rasch analysis is that the interval scaling scheme establishes standardized distances between points, thus allowing for more accurate interpretation of the levels measured. Items are distributed according to their difficulty and subjects are distributed according to their abilities. This results in a single linear scale that represents the underlying trait in question. Rasch analysis also evaluates item fit and thereby helps to determine which items are most useful in assessing the construct under discussion. It reduces item redundancy and can shorten measurement tests to reduce the time needed for test administration and scoring. Rasch techniques can provide psychometric information that was previously unavailable with CTT techniques. Without converting the data into an interval scale, clinicians might mistakenly treat a participant’s total score as a sum of equal-valued items.
After the data are converted, a researcher can utilize Rasch analysis to assess several psychometric characteristics, including unidimensionality, item hierarchy, and person reliability and separation statistics (Pomeranz, Byers, Moorhouse, Velozo & Spitznagel, 2008). Patient ability (from least to most able) and item difficulty (from least to most difficult) can be calibrated into a common underlying scale measured in logits (log odds units) (Davidson, Keating & Eyres, 2004).

Rasch Analysis looks at data fit and examines the agreement between the model’s predicted responses and the observed responses. Fit statistics are provided that highlight poorly constructed items or indicate that some items do not measure the desired attribute. The researcher looks at how well the data fit the Rasch Model rather than the conventional approach of how well the model fits the data. “The Rasch model is a mathematical description of how fundamental measurement should operate with social/psychological variables. Its task is not to account for the data at hand, but rather to specify what kinds of data conform to the strict prescriptions of fundamental measurement” (Bond & Fox, 2007, p. 235). Rasch models include dichotomous and polytomous models.

Dichotomous Model

The model from which all other Rasch models have grown is the dichotomous model, which is expressed in the logarithmic odds on success form as follows:

$$\ln\left(\frac{P_{ni}}{1 - P_{ni}}\right) = B_n - D_i$$

where
$P_{ni}$ is the probability of examinee n correctly answering item i,
$1 - P_{ni}$ is the probability of examinee n incorrectly answering item i,
$B_n$ is the proficiency level of examinee n, and
$D_i$ is the difficulty level for item i.

This is applied to dichotomous or yes/no data to obtain examinee proficiency and item difficulty.

Rating Scale Model

The first of the polytomous models is an extension of the dichotomous model for items that have more than 2 response choices such as the ODI, DPQ, SFS and the NDI.
The Rasch Rating Scale model (RSM) (Andrich, 1978) is the recommended model. It can be expressed as follows:

$$\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = B_n - D_i - F_k$$

where
$P_{nik}$ is the probability of examinee n scoring at level k on scale i,
$P_{ni(k-1)}$ is the probability of examinee n scoring at level k-1 on scale i,
$B_n$ is the proficiency level of examinee n,
$D_i$ is the difficulty level for item i, and
$F_k$ is the difficulty of the step from level k-1 to k.

Essentially it is the dichotomous model with thresholds added between steps. Likert scales are often presented in a format such as SD - Strongly Disagree, D - Disagree, N - Neutral, A - Agree, SA - Strongly Agree, or on scales such as the one found on the Spinal Function Sort, which ranges from Able to Unable with varying degrees of reduced ability in between. A five-category scale such as this would be modeled as having 4 thresholds. Rasch analysis would determine if there are this many distinct thresholds or if categories could be collapsed. Other instruments might be modeled as a rating scale, such as the ODI (version 1) (Fairbank et al., 1980), where there are 6 statements in each section as shown below:

Section 3 - Lifting
1. I can lift heavy weights without extra pain.
2. I can lift heavy weights but it gives extra pain.
3. Pain prevents me from lifting heavy weights off the floor, but I can manage if they are conveniently positioned, e.g. on a table.
4. Pain prevents me from lifting heavy weights but I can manage light to medium weights if they are conveniently positioned.
5. I can lift only very light weights.
6. I cannot lift or carry anything at all.

The assumption is that the statements are a series of uniformly increasing steps.

Within Rasch modeling, instruments are assessed with regard to unidimensionality, fit and differential item functioning (DIF). Unidimensionality is the extent to which all items on a given scale are measuring the same construct or latent trait. Unidimensionality, together with the closely related assumption of local independence, is a requirement of all Rasch models.
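As a minimal sketch of the Rating Scale Model equation given earlier in this section, the adjacent-category log-odds can be converted into the probability of each response category. The function name and all numeric values below are illustrative assumptions, not estimates from this study; the example uses a five-category scale, which has four thresholds.

```python
import math

def rsm_category_probabilities(b_n, d_i, thresholds):
    """Return P(score = k) for k = 0..K under the Rasch Rating Scale Model.

    b_n        -- proficiency of examinee n, in logits
    d_i        -- difficulty of item i, in logits
    thresholds -- [F_1, ..., F_K], step difficulties shared by all items
    """
    # The log-numerator of category k is the cumulative sum of (B_n - D_i - F_j)
    # for j = 1..k; category 0 has an empty sum, i.e. 0.
    log_num = [0.0]
    for f_k in thresholds:
        log_num.append(log_num[-1] + (b_n - d_i - f_k))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_num)
    weights = [math.exp(v - peak) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values: five categories, therefore four thresholds.
steps = [-1.5, -0.5, 0.5, 1.5]
probs = rsm_category_probabilities(b_n=0.5, d_i=-0.2, thresholds=steps)
for k, p in enumerate(probs):
    print(f"category {k}: {p:.3f}")
```

With a single threshold fixed at zero, the same function reduces to the dichotomous model, which is one way to see that the Rating Scale Model is the dichotomous model with thresholds added between steps.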
Fit statistics estimate how much each item adheres to the modeled expectations and indicate if each item on the instrument contributes to the measurement of that single latent trait. Infit statistics give more weight to the responses of persons whose abilities are close to the item value. The outfit statistic is unweighted and therefore is more sensitive to outliers in the data.

Overall fit is the extent to which the data for the class intervals fit the Rasch model. It is tested with a chi-square statistic, where a chi-square probability larger than the selected alpha value (e.g. p > .05) indicates no significant deviation of the data from what is expected.

The person separation index is an estimate of the spread of respondents on the variable. It is the adjusted person standard deviation divided by the average measurement error. The person separation index provides information regarding the number of groups the test can discriminate amongst (Wright & Masters, 1982).

Differential Item Functioning (DIF) is a method for detecting test items that function differently across subgroups of examinees as delineated by such parameters as age or gender. Items are expected to perform similarly across all groups at the same level of the trait; failure to do so indicates item bias. DIF is described as uniform when the difference between groups is constant across the range of the trait.

Partial Credit Model

The partial credit model (Wright & Masters, 1982) is a version of the rating scale model wherein the threshold estimates, including the number of estimates, are free to vary from item to item. This may be the better method of analysis for at least some of this data as the item series barely suggest equal spacing between choices. It is expressed by the formula:

$$\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = B_n - D_{ik}$$

where
$P_{nik}$ is the probability of examinee n scoring at level k on scale i,
$P_{ni(k-1)}$ is the probability of examinee n scoring at level k-1 on scale i,
$B_n$ is the proficiency level of examinee n, and
$D_{ik}$ replaces $D_i + F_k$ in the Rating Scale equation, where $D_i$ is the difficulty level for item i and $F_k$ is the difficulty of the step from level k-1 to k.
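The Partial Credit Model formula above can be sketched in the same way, with each item carrying its own list of thresholds, so that items may differ in their number of response categories. The item names and threshold values are hypothetical and chosen only for illustration.

```python
import math

def pcm_category_probabilities(b_n, item_thresholds):
    """Return P(score = k) for k = 0..K_i under the Partial Credit Model.

    b_n             -- proficiency of examinee n, in logits
    item_thresholds -- [D_i1, ..., D_iK], thresholds belonging to this item only
    """
    log_num = [0.0]  # category 0 has an empty sum
    for d_ik in item_thresholds:
        log_num.append(log_num[-1] + (b_n - d_ik))
    peak = max(log_num)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(probabilities):
    """Model-expected item score: sum over categories of k * P(k)."""
    return sum(k * p for k, p in enumerate(probabilities))

# Hypothetical items with different numbers of categories.
items = {
    "item_a": [-1.0, 0.2, 1.4],              # 4 categories, 3 thresholds
    "item_b": [-2.0, -0.8, 0.3, 0.9, 2.1],   # 6 categories, 5 thresholds
}
for name, thresholds in items.items():
    item_probs = pcm_category_probabilities(0.0, thresholds)
    print(name, len(item_probs), round(expected_score(item_probs), 3))
```

Passing the list [D_i + F_1, ..., D_i + F_K] as the item thresholds gives the Rating Scale Model as a special case, which is the relationship the replacement of $D_i + F_k$ by $D_{ik}$ expresses.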
The replacement of $D_i + F_k$ with $D_{ik}$ signifies that in the Partial Credit model each set of threshold estimates is related uniquely to its own item instead of being shared across the entire set of items. This model is robust in situations such as the DPQ, where there are differing lengths of visual analogue scale, as this model does not require the same number of response categories for each item. It allows for an empirical test of whether the distances between response choices are constant. The other three measures, the NDI, ODI and SFS, do have consistent numbers of response categories, so the data may fit the rating scale model for the individual analyses of each instrument. However, while the numeric values are the same, for the ODI and NDI each item has varied response choices, making the Partial Credit model a superior model for these instruments as well as the DPQ. Therefore, the rating scale model will be used with only the SFS and the partial credit model will be used with the ODI, DPQ and NDI. This is necessary because in Rasch modeling the first tenet is that the data fit the model.

Research Questions

1. Does each instrument measure one unidimensional trait?
2. Do all of the items within an instrument fit (belong) on that instrument?
3. Are the scale categories within items appropriate?

Limitation - these instruments will not be examined for their ability to operate as one unified whole, i.e. test equating.

Significance of Proposed Study

An improved scoring system for each of the instruments would result in better sensitivity and specificity of test scores. This would give the examiner more confidence in the scores obtained from these measures.
CHAPTER THREE: DESIGN AND METHODOLOGY

Subjects

An intact data set for 298 individuals who had participated in a Functional Capacity Evaluation at Central Interior Disability Management Services (CIDMS) between 1998 and 2008 was supplied by CIDMS once the University of Northern British Columbia Ethics Committee approval was obtained. Authorization to use this data (pending ethics approval) was obtained from CIDMS and a letter authorizing the use of the data is attached (Appendix 5). A copy of the “Consent to Evaluate” that was signed by each subject prior to assessment is attached (Appendix 6). Ages of participants ranged from 21 to 64 years, as depicted in Figure 1; both genders were represented (125 females and 173 males). Table 1 reports the physical occupational demands of the participants, which ranged from Sedentary to Very Heavy according to the DOT (1991) classification found in Appendix 7.

[Figure 1. Age distribution of study sample. x-axis: Age, in groups 21-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60+.]

Table 1
Physical Occupational Demands

Demand Level | Description | Participants
No job | Unemployed | 1
Sedentary | Material handling up to 10 pounds on an occasional basis, up to 1/3 of the day*. Office-type work. | 15
Light | Material handling up to 20 pounds on an occasional basis, up to 1/3 of the day*. Laboratory Workers, Lumber Graders, Teachers, etc. | 69
Medium | Material handling of 21 to 50 lbs. on an occasional basis.* Mill Workers, Care Aides, etc. | 137
Heavy | Material handling of 51 to 100 lbs. on an occasional basis.* Electrician, Manual Laborer, etc. | 40
Very Heavy | Material handling of 100+ lbs. on an occasional basis.* Trades such as Millwright, Heavy Duty Mechanic, Planer Operator, etc. | 36
*Detailed physical demands in Appendix 7.

Diagnoses ranged from multiple soft tissue injuries following motor vehicle accident to chronic conditions such as fibromyalgia.
The criterion for administering the questionnaires to these clients was their identification, on the Ransford Pain Drawing (Ransford, Cairns & Mooney, 1979), of pain in the neck or back. Their response determined which questionnaire was appropriate: either the Neck Disability Index or the Oswestry Disability Index. If they identified pain in both areas, they were given both. For pain in other areas, they were asked to complete the Dallas Pain Questionnaire, and if they identified a loss of ability to perform activities of daily living, they were asked to complete the Spinal Function Sort.

Instrumentation

The four instruments administered were the Oswestry Disability Index, Dallas Pain Questionnaire, Spinal Function Sort, and the Neck Disability Index. These instruments were chosen as they are part of the standard Matheson FCE battery (Roy Matheson and Associates, 2006). Full descriptions can be found in the Literature review. Samples of each instrument can be found in Appendices 1-4.

Procedures

A full data set was obtained from Central Interior Disability Management Services in Excel format. There were 298 lines of data, with each line representing data obtained for one person on one testing occasion. Some individuals were tested on more than one occasion and in that situation, this was identified. Information regarding gender, age and diagnosis was also provided. The file for each client was retained by CIDMS and only the completed data set without any patient names or other specific identifiers was provided to this researcher. The criterion for inclusion in this study was that the file contained any of the four completed instruments: Oswestry Disability Index, Spinal Function Sort, Dallas Pain Questionnaire and/or Neck Disability Index.

Data Analysis

The Oswestry Disability Index and the Neck Disability Index have been analyzed with the Rasch Partial Credit Model using the WINSTEPS computer program (Linacre, 2009).
The Partial Credit Model was selected for the analysis of these two questionnaires because, although the numeric values of the rating scale were the same for all items, the individual response choices differed. Wright and Masters (1982) advise the use of the Partial Credit Model in these situations. In contrast, the Dallas Pain Questionnaire and the Spinal Function Sort can be and were analyzed using the Rating Scale Model as the scale for each item is identical. The DPQ does have varying lengths of scale and while according to Andrich (1978) the Rating Scale Model is robust in this situation, I found that since the items with the longer scales were functioning poorly, it was useful to do the initial analysis using the Partial Credit Model. Once the items were rescaled to equal lengths, the Rating Scale model was used.

While the original intent of this work was to equate the four instruments, Linacre (2009) outlines conditions that must exist to equate tests. First, the tests must meet the criteria of unidimensionality with good item fit and ordered thresholds. He states that the latent variable has to be “invariant” across the instruments to be equated or linked. If these criteria are not met, equating should not be attempted.

CHAPTER FOUR: RESULTS

The Oswestry Disability Index

The Oswestry Disability Index (ODI) is a perceived disability scale consisting of ten items. Each item has 6 possible responses in statement form, which are arranged in ascending order from least impairment to most impairment (Appendix 1). Data collected from 133 patients were analyzed using WINSTEPS Version 3.68.0 (Linacre, 2009). The partial credit model, which treats the category structure of each item separately, was used because, although the numeric values of the rating scale for each item are the same, the response choices differ (Wright & Masters, 1982).

Rasch analysis uses two types of fit statistics, infit and outfit, to analyze the internal validity of items in the scale.
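The two fit statistics named above can be sketched for the simplest case. This is a minimal illustration that assumes a single dichotomous item and known person measures; it is not a reproduction of the WINSTEPS computations used in this study, and the person measures and responses are invented. Outfit is taken as the unweighted mean of squared standardized residuals and infit as the information-weighted version, following the usual definitions.

```python
import math

def rasch_probability(b_n, d_i):
    """Probability of a score of 1 under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(b_n - d_i)))

def item_fit_mean_squares(responses, abilities, d_i):
    """Return (infit_mnsq, outfit_mnsq) for one dichotomous item."""
    squared_residuals = []
    variances = []
    for x, b_n in zip(responses, abilities):
        p = rasch_probability(b_n, d_i)
        squared_residuals.append((x - p) ** 2)   # (observed - expected)^2
        variances.append(p * (1.0 - p))          # model variance of the response
    # Outfit: plain average of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(responses)
    # Infit: residuals weighted by model variance (sensitive to well-targeted persons).
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

# Hypothetical person measures (logits) and an item of difficulty 0.
abilities = [-3.0, -1.0, 0.0, 1.0, 3.0]
surprising = [1, 0, 0, 1, 1]     # the least able person unexpectedly scores 1
orderly = [0, 0, 1, 1, 1]        # responses follow the expected order

infit_s, outfit_s = item_fit_mean_squares(surprising, abilities, 0.0)
infit_o, outfit_o = item_fit_mean_squares(orderly, abilities, 0.0)
print(f"surprising: infit={infit_s:.2f} outfit={outfit_s:.2f}")
print(f"orderly:    infit={infit_o:.2f} outfit={outfit_o:.2f}")
```

In the first response pattern a single unexpected response from a poorly targeted person inflates outfit far more than infit, while the second pattern gives mean squares below 1.0, the direction associated with overfit.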
When investigating data, Linacre (2009) recommends the following approach to assessing the results. Negative point-measure or point-biserial correlations should be investigated first, and if negative correlations are noted, look for miskeys and data entry errors. If all point-measure correlations are positive, investigate outfit before infit, mean square before z-scores and high values before low values. Positive point-measure correlations indicate that the expectation that individuals with high amounts of the latent trait will score in the higher range on this item is met. Outfit statistics are more responsive to outliers, and high outfit mean-squares can simply be the result of a few random responses by low performers. Infit mean-squares are more responsive to inliers and are sensitive to responses in which estimated person ability values are similar to item difficulty values. High infit mean-squares indicate that the items are mis-performing on the targeted population. Mean squares indicate the amount of distortion in the measurement system. These infit and outfit statistics are reported as mean squares (MNSQ) and standardized z-scores (ZSTD). While a mean square of 1 and a z-score of 0 is ideal, Wright et al. (1994) found that for rating scales, a range of .6 to 1.4 for the infit and outfit mean squares is reasonable. The expected value is 1.0 and values greater than that indicate unpredictability or noise in the data; values less than 1.0 indicate overfit or redundancy of the item. Linacre (2009) states that “if mean-squares are acceptable, the ZSTD can be ignored”.

Diagnostic Measures

The initial ODI data run showed the results in Table 2.

Table 2
Diagnostic Measures for the Oswestry Disability Index

Item | Pt-Measure (r) | Outfit MNSQ | Infit MNSQ
1. Pain | .63 | 1.26 | 1.32
2. Self care† | .68 | 0.66 | 0.62
3. Lifting | .59 | 1.08 | 1.01
4. Walking | .69 | 0.83 | 0.85
5. Sitting† | .69 | 0.64 | 0.65
6. Standing | .66 | 0.82 | 0.87
7. Sleeping* | .61 | 1.43 | 1.48
8. Sex Life* | .67 | 1.39 | 1.41
9. Social Life | .71 | 0.87 | 0.90
10. Travelling | .60 | 0.94 | 0.89
*Equal to or greater than Linacre’s 1.4 upper limit. †Equal to or less than Linacre’s 0.6 lower limit.

All correlations are between .59 and .71, which is acceptable. All are positive, indicating that each item is positively associated with the measure. The item “sleeping” has an outfit mean square of 1.48. The items “sleeping” and “sex life” have problematic mean square infit statistics of 1.48 and 1.43. “Pain” is the next highest but remains within the .6 to 1.4 range recommended by Linacre. The significance of these mean squares is expressed as a ZSTD, and values outside the range of -2 to +2 are associated with p < .05. The ZSTD associated with these mean squares were 3.5 and 3.0 respectively.

Once overall item statistics have been evaluated, items can be examined using the Modal Probability Curves. This pictorial representation illustrates the category function and boundaries for each item. The modal perspective on category boundaries on the latent variable identifies a mode as lying between intersections of the category probability curves. This simplifies inference about which category is most likely to be observed for any item at any point along the latent variable. For example, in Figure 2, Category 1 “I can stand as long as I want but it gives me extra pain” is the most probable response (= .42) for a range of -3 to -2 logits below the item’s mean difficulty. When all categories are modal, they look like the graph below for item 6 “standing”. Their thresholds are ordered and the performers with the least amount of the latent trait (disability) are most likely to choose “0” and the performers with the most disability are most likely to choose “5”. Note that all categories must have a range of difficulty at which the category is modal.

[Figure 2. Standing item #6, Oswestry Disability Index. Category probability curves for categories 0 to 5; x-axis: measure relative to item difficulty.]

Figure 2 illustrates the probability of responding to any particular category (y-axis) given the differences in estimates between person ability and item difficulty/endorsability (x-axis). For example, if a person’s ability was 1 logit lower than the difficulty of the item (-1 on the x-axis), the probability of endorsing a “0”, “4” or “5” would be close to zero, the probability of endorsing a “1” or “3” would be close to 0.22, and the probability of endorsing a “2” would be close to 0.42. This person is most likely to endorse Category 2 on this item. For the person with higher ability estimates, such as +1 to +5 on the x-axis, the most likely response is a 4. The graph shows that each response category is the most probable for some level of the variable. While item 6 “Standing” performed well and fit the Rasch principles, several others did not. Item 1 “Pain” (Figure 3) did not function as well as item 6.

[Figure 3. Pain item #1, Oswestry Disability Index. Category probability curves; x-axis: measure relative to item difficulty.]

In the graph depicted in Figure 3, for the individual whose ability is one logit lower than the item difficulty, i.e. -1 on the x-axis, the probability of endorsing a “0”, “2”, “4” or “5” is close to 0 whereas the probability of endorsing “1” or “3” is approximately 0.40. Since the categories are supposed to be ordered with increasing amounts of the latent variable, this demonstrates disordered thresholds. The Rasch-Andrich thresholds are the intersections between adjacent categories (Andrich, 1978). Thresholds for categories 1-2 and 2-3 are disordered. Andrich (1978) is adamant that “disordered thresholds are a violation of the principles underlying the Rasch Model and must be eliminated”. The category intervals on the latent variable must correspond with the modal intervals of the categories.
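The link between disordered thresholds and a category that is never modal can be checked numerically. The sketch below is illustrative only: the threshold values are invented rather than taken from the ODI analysis, and the helper names are hypothetical. It flags adjacent Rasch-Andrich thresholds that are out of order and scans a grid of measures to find categories that are never the most probable response.

```python
import math

def category_probabilities(measure, thresholds):
    """Category probabilities at a measure expressed relative to item difficulty."""
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + (measure - tau))
    peak = max(log_num)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]

def disordered_threshold_pairs(thresholds):
    """1-based pairs (k, k+1) where threshold k is higher than threshold k+1."""
    return [(k + 1, k + 2) for k in range(len(thresholds) - 1)
            if thresholds[k] > thresholds[k + 1]]

def never_modal_categories(thresholds, low=-6.0, high=6.0, step=0.05):
    """Categories that are not the most probable response anywhere on the grid."""
    modal = set()
    n_points = int(round((high - low) / step)) + 1
    for i in range(n_points):
        probs = category_probabilities(low + i * step, thresholds)
        modal.add(probs.index(max(probs)))
    return [k for k in range(len(thresholds) + 1) if k not in modal]

# Hypothetical five-category items (four thresholds each).
ordered = [-2.0, -1.0, 1.0, 2.0]
disordered = [-2.0, 0.5, -0.5, 2.0]   # thresholds 2 and 3 are reversed

print(disordered_threshold_pairs(ordered), never_modal_categories(ordered))
print(disordered_threshold_pairs(disordered), never_modal_categories(disordered))
```

In the second example the category bounded by the two reversed thresholds is never modal, which is the pattern described for Category 2 of the pain item.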
Since category 2 is never modal, it must be removed. Similarly, category 2 did not function for the items “Walking” - Pain prevents me from walking more than 1/2 mile and “Sex Life” - My sex life is nearly normal but it is very painful. Category 1 did not function for “Sitting” - I can only sit in my favorite chair for as long as I like, “Sleeping” - I can sleep well only by using tablets and “Social Life” - My social life is normal but increases the degree of pain. To correct the disordered scale problems outlined herein, the ODI was rescaled from six categories to five. Each item was assessed individually to determine which two categories should be collapsed into one; the result is seen in Table 3. The rescaled instrument performed well.

Table 3
Oswestry Categories Collapsed

Item | Malfunctioning Category | Combined With | New Collapsed Category
1 | Painkillers give complete relief from pain | Painkillers give moderate relief from pain | Painkillers give moderate to complete relief from pain
2 | I do not get dressed, wash with difficulty and stay in bed | I need help in most aspects of self care | I need help in most aspects of self care
3 | Pain prevents me from lifting heavy weights but I can manage light to medium weights if they are conveniently positioned | Pain prevents me from lifting heavy weights from the floor but I can manage if they are conveniently positioned | Pain prevents me from lifting medium to heavy weights from the floor but I can manage if they are conveniently positioned
4 | Pain prevents me from walking more than 1/2 mile | Pain prevents me from walking more than 1/4 mile | Pain prevents me from walking more than short distances
5 | I can sit in my favorite chair as long as I like | I can sit in any chair as long as I like | I can sit as long as I like
6 | Pain prevents me from standing at all | Pain prevents me from standing more than 10 minutes | Pain prevents me from standing more than 10 minutes
7 | I can sleep well only by using tablets | Even when I take tablets I have less than 6 hours sleep | Even when I take tablets I have less than 6 hours sleep
8 | My sex life is nearly normal but is very painful | My sex life is normal but causes some extra pain | My sex life is normal but causes some extra pain
9 | My social life is normal but increases the degree of pain | Pain has no significant effect on my social life apart from limiting my more energetic interests such as dancing etc. | Pain has no significant effect on my social life apart from limiting my more energetic interests such as dancing etc.
10 | Pain prevents me from traveling except to the doctor or hospital | Pain restricts me to short necessary journeys of less than 30 minutes | Pain restricts me to short necessary journeys of less than 30 minutes

Table 4
Diagnostic Measures for Rescaled Oswestry Disability Index

Item | Pt-Measure (r) | Outfit MNSQ | Infit MNSQ
1. Pain | .61 | 1.09 | 1.10
2. Self care | .71 | 0.79 | 0.79
3. Lifting | .59 | 1.25 | 1.26
4. Walking | .70 | 0.75 | 0.77
5. Sitting | .66 | 0.90 | 0.92
6. Standing | .68 | 1.00 | 1.05
7. Sleeping | .59 | 1.14 | 1.15
8. Sex Life | .64 | 1.12 | 1.04
9. Social Life | .71 | 0.84 | 0.84
10. Travelling | .62 | 1.09 | 1.13

The diagnostic measures outlined in Table 4 show positive point-measure correlations and that the infit MNSQ for “Sleeping” and “Sex Life” have improved from 1.48 and 1.43 to 1.15 and 1.04, and the ZSTD has improved from 3.5 and 3.0 to 1.0 and 0.8, indicating improvement in function of these items. With the rescaled categories, “Pain” is functioning better, with an infit MNSQ of 1.10 reduced from 1.32 and a ZSTD reduced from 2.0 to 0.7. The Modal Probability Curves for this item are shown in Figure 4.

[Figure 4. Rescaled pain item #1. Category probability curves for categories 0 to 4; x-axis: measure relative to item difficulty.]

Now each category is functioning properly with ordered thresholds.
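The rescaling from six categories to five amounts to a recoding of the raw responses. The sketch below assumes, purely for illustration, that categories are coded 0 to 5 and that the merged pair is codes 2 and 3; the statements actually combined for each item are those listed in Table 3, and the function name is hypothetical.

```python
def collapse_adjacent(score, lower):
    """Merge category `lower` with `lower + 1`; higher scores shift down by one."""
    return score if score <= lower else score - 1

# Assumed coding 0..5; merging codes 2 and 3 leaves five categories, 0..4.
original_scores = [0, 1, 2, 3, 4, 5]
rescaled = [collapse_adjacent(s, lower=2) for s in original_scores]
print(rescaled)  # [0, 1, 2, 2, 3, 4]
```

Because a different pair may be merged for each item, the value of `lower` would be chosen item by item, which is consistent with the item-by-item assessment described above.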
In Figure 4, for the example of a person with 1 logit less ability than the item difficulty, the probability of endorsing “0”, “3” or “4” approaches 0, whereas the probability of selecting “1” is .35 and of selecting “2” is .5. The same was observed for the other items which previously demonstrated disordered thresholds.

Item Difficulty

To examine item difficulty, the revised scale was assessed with the Rasch Rating Scale model; see Figure 5. When examining item difficulties, coverage for a wide range of abilities should be evident. The items are arranged on a common scale from easiest to endorse to most difficult, which shows the order of ease of endorsement of the items. It is interesting to note that the easiest item to endorse was “Standing”, i.e. standing tolerance is affected first with back pain. The item most resistant to endorsement was “Self Care”, indicating that individuals with back pain do not perceive a loss of ability to perform self care activities as readily. The order of the items from easiest to most difficult to endorse was Standing, Lifting, Travelling, Pain, Social Life, Sitting, Sex Life, Walking, Sleeping and Self Care. This is consistent with function observed in clients with back pain in the clinical setting.

[Figure 5. Item characteristic curves depicting item difficulty on the rescaled ODI. One curve per item (1. Pain to 10. Travelling), plotted against measure.]

Differential Item Functioning (DIF)

Differential Item Functioning (DIF) is an indicator of possible bias and is the result of a lack of invariance across testing situations. For instance, one subgroup (e.g. males) with a given level of a latent trait responds differently to an item compared with another subgroup (e.g. females) with a similar level of the latent trait. This was investigated with respect to gender and age and no DIF was found.
The results for gender are depicted in Figure 6.

[Figure 6. DIF by Gender for the ODI. Person DIF plot showing DIF measure by item for each gender.]

Dallas Pain Questionnaire (DPQ)

The Dallas Pain Questionnaire is a popular measure for use with individuals suffering from chronic low back pain. For the purposes of this study, only the responses to the “daily activities” and “work/leisure activities” sections, which comprise the functional activities factor of this instrument, are considered. Data collected from 241 participants were analyzed using the Rasch partial credit model. Andrich’s rating scale model is reported to be robust with scales of varying lengths, but in this case the items with the largest scales demonstrated more than one disordered threshold, and therefore the initial analysis was done with Masters’ Partial Credit Model. Once the items were functioning better, the scale was assessed with Andrich’s Rating Scale Model.

Table 5 Diagnostic Measures for the Dallas Pain Questionnaire

ITEM | PT-MEASURE (r) | OUTFIT MNSQ | INFIT MNSQ
1. Pain | .58 | 1.33 | 1.30
2. Self care | .64 | 0.95 | 0.95
3. Lifting* | .48 | 1.45 | 1.21
4. Walking | .61 | 1.08 | 0.99
5. Sitting | .61 | 0.99 | 1.01
6. Standing | .65 | 0.76 | 0.84
7. Sleeping | .62 | 0.93 | 0.95
8. Social Life | .70 | 0.84 | 0.86
9. Travelling | .68 | 0.91 | 0.92
10. Vocational | .55 | 0.92 | 1.02

*Equal to or greater than Linacre's 1.4 upper limit.

Diagnostic Measures for the DPQ

In the initial analysis, all items fit the model except Item 3 “Lifting”, which was misfitting with an outfit mean square of 1.45 and a corresponding ZSTD of 3.4. The Category Probability Curves showed disordered thresholds for at least one level of endorsement for each item. Lifting is shown in Figure 7. It can be observed that there is no point on the latent variable at which category 1 or 2 is the most likely to be selected.
[Figure 7. DPQ Item 3 Lifting Category Probability Curve. Curves for categories 0 to 5, plotted against measure relative to item difficulty.]

For the items in the “work/leisure” activities, where there is a seven or eight point scale, the Category Probability Curves demonstrate disordered thresholds on several categories. The “Vocational” item is shown in Figure 8. In this study, 144 people selected the final category next to “I cannot work”.

[Figure 8. Category probability curve of DPQ item 10 “Vocational”. Curves for categories 0 to 7, plotted against measure relative to item difficulty.]

An attempt was made to create a five point scale for all items. For the first six items, the “0” and “1” categories were collapsed. Item 7 “Sleeping” consisted of five response options on the original DPQ, so it was left as is. The final three items, Item 8 “Social Life”, Item 9 “Travelling” and Item 10 “Vocational”, were modified by collapsing adjacent categories “0” and “1”, “2” and “3”, and “4” and “5”; categories “6” and “7” were not collapsed. This improved the scale: with further analysis using the Partial Credit Model, there were no items with mean squares outside the range of .6 to 1.4. However, in the first six items category “1” demonstrated a disordered threshold, as did category “7” on the final three items. It was then determined that a four point scale might work better. A four point scale was created by collapsing the new category “1” with category “2” for the first seven items and by collapsing categories “6” and “7” in the final three items.
This corrected the disordered thresholds in items 2 through 10, but Item 1 “Pain” remained problematic. Since the rating scale now met the criteria for use of the Andrich Rating Scale Model, analysis using this model, in which one set of threshold values is applied to all the items on the test, was undertaken. Once this analysis was done, it was noted that “Pain”, with an outfit mean square of 1.66 and a ZSTD of 6.3 as well as an infit mean square of 1.72 and a ZSTD of 7.0, did not fit the scale.

The DPQ and Oswestry Disability Index have essentially the same items, the ODI using progressive response choices and the DPQ using a visual analogue scale. White and Velozo (2002) found that the “pain” item did not fit in the Oswestry Disability Index and postulated that it was because “the responses to the pain item relate to the use of pain medications differing from the responses of the other items that all relate to function (physical, social)” (p. 825). While I did not find, as White and Velozo had, that the pain item did not fit the scale on the ODI, it is clear from the fit statistics reported above that it did not fit on the DPQ. The use of pain medication is not the issue in the DPQ; it is more likely that this item does not fit because pain is a symptom and the other items relate to function/ability. The “Pain” item was removed, leaving a 9-item scale.

The Category Probability Curve for Item 2 “Personal Care” is shown in Figure 9. In the rating scale model, the same set of threshold values is applied to all items, so all item category probability curves are as depicted in Figure 9. Now each category is functioning properly with ordered thresholds. For the example of a person with 1 logit less ability than the item difficulty, the probability of endorsing “3” approaches 0, whereas the probability of selecting “0” or “2” is .22. Category 1 is the most likely choice for individuals in this group, with a probability of .54.
[Figure 9. Category probability curve for the 4-point DPQ rating scale, shown for Item 2 “Personal Care”. Curves for categories 0 to 3, plotted against measure relative to item difficulty.]

An improvement from 5.86 to 10.93 in the item separation index was seen following the transformation of the original DPQ to a four point scale with the “Pain” item removed.

Item Difficulty

Item difficulty is illustrated with the Item Characteristic Curves shown in Figure 10. In contrast to the results on the ODI for similar items, the “Vocational” item was easiest to endorse, followed by “Social Life” and then “Lifting”. “Standing”, which had been the easiest to endorse on the ODI, was fifth on the DPQ. “Personal Care” was hardest to endorse on both scales. Results from test equating could prove interesting.

[Figure 10. Item characteristic curves depicting item endorsement for the 4-point DPQ. One curve per retained item (2. Personal Care to 10. Vocational), plotted against measure.]

Differential Item Functioning (DIF)

DIF analysis showed no significant difference in scale function for men or women.

PACT Spinal Function Sort

The PACT Spinal Function Sort is a pictorial activity sort consisting of 48 Likert scale items with two items repeated as a reliability check. The test-taker grades his or her ability to perform each task on a five point scale from “able” to “unable”. Category scores are added and multiplied by factors from one to four and then summed to give a total score out of a maximum of 200.
The result is presented as the individual’s Rating of Perceived Capacity (RPC), which is based on the DOT Physical Demand Characteristics of Work outlined in Appendix 7. Data for 260 respondents were analysed using the WINSTEPS computer program. Since this is a true Likert scale with each item having an equal number of response options, Andrich’s Rating Scale Model was the appropriate model. The initial analysis revealed six misfitting items with mean squares greater than 1.4. These are shown in Table 6.

Table 6 Spinal Function Sort Misfitting Items

ITEM | PT-MEASURE (r) | OUTFIT MNSQ | INFIT MNSQ
02. Retrieve/tool/floor | .59 | 1.75 | 1.27
49. Paint brush/eye level | .60 | 1.74 | 1.39
21. Light Bulb Overhead | .62 | 1.66 | 1.46
19. Wash Dishes Sink | .62 | 1.64 | 1.23
37. Climb Step Ladder | .62 | 1.59 | 1.50
22. Install Face Plate | .65 | 1.43 | 1.03

Diagnostic Measures for the SFS

It should also be noted that Item 49 “Paint brush at eye level” is the duplicate of Item 17, included as a reliability check. Item 17 showed mean squares of 1.04 and 1.01 with associated ZSTDs of 0.4 and 0.1, whereas Item 49 had mean squares of 1.39 and 1.74 with associated ZSTDs of 2.7 and 4.1. The items in the other reliability check, Items 6 and 50 “Place/retrieve 5# weight waist to overhead”, performed differently as well, which calls into question their use as a reliability check. Item difficulty varied for “Place/retrieve 5#...” as well. Most respondents found it easier to endorse this item when it appeared as Item 50, after responding to questions regarding weights ranging from 20 to 100 lbs., than as Item 6, where it was anchored by items of a light nature.

The six items shown in Table 6 were removed and the data were analysed again. In this second analysis, three new items demonstrated mean squares above 1.4. Each time the offending items were removed, new ones cropped up. The attempt to reduce items to represent a unidimensional scale was a failure.

Item Difficulty

With the number of items on the SFS, it is not feasible to look at the ICCs to assess item difficulty. Another pictorial method is the Item Person Map, as depicted in Figure 11.

[Figure 11. Item Person Map for Spinal Function Sort.]

In Figure 11, items are displayed on the right and people on the left. Each “#” represents two people. The items are ranked from most difficult on the bottom to easiest on the top. It is not surprising to see Items 41 to 48 at the bottom. These relate to material handling of weights in the 50 to 100 lbs. range. Item 30 “Get into driver’s seat” is the easiest. Items 6 and 50, despite being identical, are at differing levels of difficulty.

Differential Item Functioning (DIF) for the SFS

Most items were problematic due to their outfit mean squares. Outfit mean squares are sensitive to unexpected responses by persons to items that are distant from the person’s ability level, that is, items that are very easy or very hard for them. Removal of 20 individuals from the data set based on an outfit mean square and ZSTD greater than 2.0 did not improve the functioning of the items in Table 6.
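To make the sensitivity of outfit to distant, unexpected responses concrete, the following sketch computes infit and outfit mean squares for a single item. It is a simplified illustration under stated assumptions: it uses the dichotomous Rasch model rather than the rating scale model applied to the SFS, and the person measures and responses are hypothetical rather than taken from the study data.

```python
import math

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean squares for one dichotomous item.

    outfit = mean of squared standardized residuals (unweighted)
    infit  = sum of squared residuals / sum of model variances (information-weighted)
    """
    squared_residuals, variances, z_squared = [], [], []
    for x, theta in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))  # model probability of a 1
        variance = p * (1.0 - p)                           # model variance of the response
        sq = (x - p) ** 2
        squared_residuals.append(sq)
        variances.append(variance)
        z_squared.append(sq / variance)
    outfit = sum(z_squared) / len(z_squared)
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

# Hypothetical person measures (logits) and an item located at 0 logits.
abilities = [-3.0, -1.0, 0.0, 1.0, 3.0]

expected_pattern = [0, 0, 1, 1, 1]   # responses in line with the model
one_outlier = [1, 0, 1, 1, 1]        # the lowest-ability person unexpectedly succeeds

print(item_fit(expected_pattern, abilities, 0.0))
print(item_fit(one_outlier, abilities, 0.0))
```

In this toy case a single surprising response from a person far below the item raises outfit well above infit, because outfit gives that response the same weight as every other while infit down-weights it by its small model variance.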
To determine if the items were functioning differently for subgroups, a differential item functioning (DIF) analysis was undertaken. Figure 12 visualizes the differences in item DIF size by gender.

[Figure 12. Spinal Function Sort item DIF size by gender. Person DIF plot showing DIF measure by item for each gender.]

Figure 12 shows the size of the item DIF in logits for each group relative to the difficulty of each item. It shows that the items do function differently by gender. For Item 19 “Wash Dishes at Sink”, women reported a higher estimation of their ability at this task than the men did. Similarly, for the medium to heavy material handling tasks (Items 41 to 48), the men estimated higher ability than the women did. In fact, a pattern emerged during the analysis where males reported lower abilities on traditionally female tasks such as dishwashing (Item 19) and kitchen floor sweeping (Item 40), despite these tasks being less physically demanding than other tasks that they indicated they were capable of performing. It was interesting to note that the same men who were restricted in ability with regard to sweeping with a kitchen broom reported a higher perception of ability when sweeping with the push broom, which is commonly used in industrial settings as well as in the garage and workshop. One item that contradicted this trend was Item 3 “Push and pull a vacuum cleaner”. In this item a man is depicted performing the task.

These results may be specific to the population tested. Many of the participants were from more traditional cultures and the bulk of the participants were over 40 years old. The sample of males under 40 was not large enough for a comparison. All participants were from the northern interior of British Columbia. This instrument may perform more effectively in other geographic regions.
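The size of a DIF effect of the kind plotted for gender can be summarized as a contrast between the item difficulty estimated separately in each group. The sketch below shows that arithmetic under stated assumptions: the item measures and standard errors are hypothetical, and the cut-offs of 0.5 logits and |t| of 2 are commonly quoted rules of thumb assumed here, not criteria stated in this section.

```python
import math

def dif_contrast(measure_a, se_a, measure_b, se_b):
    """Difference between two group-specific item difficulties (logits)
    and an approximate t statistic based on the joint standard error."""
    contrast = measure_a - measure_b
    joint_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return contrast, contrast / joint_se

def flag_dif(items, min_contrast=0.5, min_t=2.0):
    """Return names of items whose contrast is both large and statistically notable."""
    flagged = []
    for name, (m_a, se_a, m_b, se_b) in items.items():
        contrast, t = dif_contrast(m_a, se_a, m_b, se_b)
        if abs(contrast) >= min_contrast and abs(t) >= min_t:
            flagged.append(name)
    return flagged

# Hypothetical (difficulty for women, SE, difficulty for men, SE) in logits.
items = {
    "Item 19 Wash Dishes at Sink": (-0.9, 0.20, 0.1, 0.22),
    "Item 30 Get into driver's seat": (-1.5, 0.20, -1.4, 0.21),
}

print(flag_dif(items))
```

An item is flagged only when the difference is both large enough to matter and unlikely to be sampling noise, which guards against over-reading small contrasts in small subgroups.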
Neck Disability Index

The 10-item Neck Disability Index (NDI) is the most widely used measure for assessing the effect of neck pain on activities of daily living. It was developed by Vernon and Mior (1991) by adapting five of the scales from the Oswestry Disability Index (Pain, Self Care, Lifting, Sleeping, Travelling (Driving)) and developing five new scales identified by literature review and consultation with clinicians. There was minimal input from patients in the development of this scale. Scoring is done exactly as for the ODI, with the same resultant disability categories. Data collected from 76 participants were evaluated using the WINSTEPS program in a manner similar to the analysis of the ODI. The initial NDI data are reported in Table 7.

Table 7 Diagnostic Measures for NDI

ITEM | PT-MEASURE (r) | INFIT MNSQ | OUTFIT MNSQ
1. Pain | .74 | 0.76 | 0.76
2. Self care | .69 | 0.96 | 0.96
3. Lifting* | .36 | 1.41 | 1.85
4. Reading | .70 | 0.71 | 0.70
5. Headache* | .54 | 1.41 | 1.40
6. Concentration† | .78 | 0.66 | 0.74
7. Work | .53 | 1.20 | 1.20
8. Driving† | .76 | 0.63 | 0.63
9. Sleeping | .62 | 1.03 | 1.00
10. Recreation | .53 | 1.13 | 1.18

*Equal to or greater than Linacre’s 1.4 upper limit. †Equal to or less than Linacre’s 0.6 lower limit.

Diagnostic Measures for the NDI

The point measure correlations range from .36 to .78. At .36, “Lifting” correlates poorly with the overall measure. The items “Lifting” and “Headache” have outfit and infit mean squares of 1.4 or higher with corresponding ZSTD greater than 2. Both items display disordered thresholds when the Modal Probability Curves are displayed.
[Figure 13. Category Probability Curves for the NDI “Lifting” item #3, plotted against measure relative to item difficulty.]

In Figure 13 for the “Lifting” item, no one has chosen category 0, “I can lift heavy weights without extra pain”. Category 2, “Pain prevents me from lifting heavy weights from the floor...”, shows disordered thresholds. On the “Headache” item of the Neck Disability Index, depicted in Figure 14, it can be noted that category 2, “I have moderate headaches which come infrequently”, is not functioning well. The fourth category, “I have severe headaches which come frequently”, is also not functioning well.

[Figure 14. Category probability curves for the NDI “Headache” item #5, plotted against measure relative to item difficulty.]

Several unsuccessful attempts were made to rescale the NDI in a way that would allow the retention of the “Lifting” and “Headache” items. Examination of the items themselves shows that the response category labels for each are poorly worded and contain more than one concept; for example, severity and frequency of headache are contained in the same choice. Items should contain a single statement that allows for degrees of endorsement. With the above-mentioned items removed, a new 8-item NDI scale remained. For this new scale, person reliability improved from .82 to .87 and item reliability improved from .95 to .96. However, disordered categories were seen in the “work”, “sleeping” and “reading” items, which differs from the findings of van der Velde et al. (2009), who found disordered thresholds for the “personal care”, “work” and “recreation” items. With this small data set, many of the choices had fewer than the 10 observations suggested by Linacre (2009), so meaningful rescaling could not be performed.
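The check against the guideline of at least 10 observations per category amounts to a simple frequency count for each item. A minimal sketch follows; the response vector is hypothetical (it only mirrors the sample size of 76) and the function name is an assumption made for illustration.

```python
from collections import Counter

def sparse_categories(responses, categories, min_count=10):
    """Return {category: count} for every category observed fewer than
    min_count times, including categories that were never chosen."""
    counts = Counter(responses)
    return {c: counts.get(c, 0) for c in categories if counts.get(c, 0) < min_count}

# Hypothetical responses from 76 people to one six-category item (scored 0 to 5).
responses = [0] * 3 + [1] * 25 + [2] * 8 + [3] * 22 + [4] * 12 + [5] * 6

print(sparse_categories(responses, range(6)))   # {0: 3, 2: 8, 5: 6}
```

When several categories of an item fall below the guideline, as in this invented example, threshold estimates for those categories rest on very little data, and any rescaling based on them is hard to defend.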
Item Difficulty

Item difficulty is presented in Figure 15. Similar to the findings on the ODI and the DPQ, personal care is the most difficult to endorse. Otherwise, item difficulty varied depending on the instrument used.

[Figure 15. Item Difficulty for the NDI. Item characteristic curves for the eight retained items (Pain, Self care, Reading, Concentration, Work, Driving, Sleeping, Recreation), plotted against measure.]

Differential Item Functioning

There was no DIF by gender, as illustrated by the chart in Figure 16. While there was a difference in “Driving”, it was not significant for this sample.

[Figure 16. DIF by Gender for the NDI. Person DIF plot showing DIF measure by item for each gender.]

In the analysis of misfitting persons, a pattern emerged indicating that individuals with chronic conditions or multiple injuries were more likely to be misfitting. Sample size was not large enough to divide the participants by diagnosis and run a DIF analysis, but this does suggest that the instrument may function differently with different diagnoses.

CHAPTER FIVE: DISCUSSION

The purpose of this study was to evaluate the psychometric properties of four of the self-report instruments commonly used in Functional Capacity Evaluations to answer the question “Are the client’s subjective reports reliable?” This is important as these measurement instruments are currently being used by Work Capacity Evaluators to determine reliability of the client’s subjective reports as compared to demonstrated abilities. This can potentially affect the continuation of disability benefits and allotment of funds for rehabilitation as well as monetary awards in litigation. Rasch Modeling, a prescriptive method offering distribution-free person ability and item difficulty estimates on a linear latent variable, was used to evaluate the ODI, DPQ, SFS and NDI with surprising results.
As a clinician who has used these instruments for the past ten years, I was astounded by the poor performance of the Spinal Function Sort and Neck Disability Index. The Dallas Pain Questionnaire fared somewhat better and the Oswestry Disability Index was the best performer.

The Oswestry Disability Index (ODI)

Three previous studies have reported on the measurement properties of the Oswestry Disability Index based on a Rasch analysis. Although these studies are discussed in the Literature Review, I provide brief summaries of each previous study for ease of comparison with my study.

Davidson (2008) compared three variations of the ODI, Version 1, Version 2 and the Revised ODI, using a convenience sample of 100 individuals, 40% with a duration of current episode greater than six months and 63% who had experienced more than five previous episodes of back pain. She reported that while Versions 1 and 2 met the criteria for unidimensionality with her population, the revised version did not.

Page et al. (2002) reported that with a sample of 95 patients and a mean time since initial symptom presentation of 2.3 ± 0.6 wk., the revised version of the ODI did not demonstrate unidimensionality. The authors postulated that:

    although item 1 purports to measure LBP (low back pain) disability intensity, it does so by asking patients to what extent painkillers reduce their LBP disability. Many patients in our clinic did not take painkillers, although they reported substantial LBP disability intensity. Moreover, for many patients, the extent to which painkillers reduce LBP is not a direct way to functionally assess LBP disability intensity.

The first item, “Pain Intensity”, was removed and disordered thresholds were addressed by collapsing categories “2” with “3” and “4” with “5”. The authors reported improved precision of the instrument.
White and Velozo (2002) reported on a sample of 942 patients with low back pain presenting for physical therapy, 70% of whom had experienced symptoms for less than six months and 50% of whom were working at their regular place of employment. They found that all items except Item 1 “Pain Intensity” fit the Rasch model and that “construct validity of the scale using the Rasch model, required the structure of the rating scale to be modified from six response levels to four.”

In my study, using a sample of 133 patients presenting for functional capacity evaluation, I found that all the items on the ODI Version 1 fit the Rasch model when the criterion of mean square fit indices between 0.6 and 1.4 was applied. Granted, the fit indices for items 7 and 8, “Sleeping” and “Sex Life” respectively, were marginally “noisy”, with values ranging from 1.39 to 1.48. I considered this to be insufficient evidence to warrant removal of these items, particularly as scale category ordering had not yet been addressed. Two items exhibited fit values between 0.62 and 0.65, which are constrained but acceptable values. In contrast to the results of Page et al. (2002) and White and Velozo (2002), Item 1 “Pain Intensity” fit well even before category reduction.

To correct disordered thresholds, the scale was revised from six categories to five on an item by item basis. Again, this decision to reduce to five categories differs from that of Page et al. (2002) and White and Velozo (2002). However, these authors employed a common aggregation of categories across all items. My approach was more pragmatic and data-driven, as categories were collapsed on an item by item basis. I judged this to be an appropriate strategy as each item had different descriptors for each scale point, making the instrument more Partial Credit than Rating Scale in nature. The revised instrument met the criteria for unidimensionality with no borderline indices, either high or low.
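Collapsing categories item by item, as opposed to applying one common aggregation to every item, can be expressed as a separate recode map per item. The sketch below is an illustration only: the helper names are hypothetical, and the particular merges shown are assumed examples rather than a record of the recoding actually applied to the ODI data.

```python
def merge_adjacent(n_categories, lower):
    """Build a recode map that merges category `lower` with `lower + 1`
    and renumbers the categories above them so the scale stays contiguous."""
    if not 0 <= lower < n_categories - 1:
        raise ValueError("lower must identify a category that has a neighbour above it")
    return {k: (k if k <= lower else k - 1) for k in range(n_categories)}

def rescale(data, merges, n_categories=6):
    """Apply an item-specific merge to each item's list of responses."""
    rescaled = {}
    for item, responses in data.items():
        recode = merge_adjacent(n_categories, merges[item])
        rescaled[item] = [recode[x] for x in responses]
    return rescaled

# Assumed example: merge categories 2 and 3 for one item, 0 and 1 for another.
merges = {"Walking": 2, "Sitting": 0}
data = {"Walking": [0, 1, 2, 3, 4, 5], "Sitting": [0, 1, 2, 3, 4, 5]}

print(rescale(data, merges))
# {'Walking': [0, 1, 2, 2, 3, 4], 'Sitting': [0, 0, 1, 2, 3, 4]}
```

Each item still ends up with five categories, but which pair is merged depends on where that item's own categories malfunctioned, which is the sense in which the approach treats the instrument as closer to Partial Credit than Rating Scale.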
Rasch analysis purports to be a sample-free measurement model, so this variation in whether the scale including Item 1 fits or does not fit the model was initially of concern to me. One possible explanation is that the latent trait of perceived disability might be different in acute vs. chronic or working vs. not working populations. Davidson and I both examined populations which were largely chronic, i.e. more than five recurrences or longer than six months duration of symptoms, whereas White and Velozo (2002) and Page et al. (2002) report on patients with shorter durations of symptoms who were presenting for physiotherapy treatment. Almost 100% of my participants were taking some form of pain medication and pain management was a significant part of their daily activities, often the most significant part. In the clinical setting, my colleagues and I have noted that pain rather than ability becomes the limiting factor for participation of individuals who have chronic pain. Further assessment of the ODI with the two populations would be recommended.

Dallas Pain Questionnaire (DPQ)

There are no published works to date on Rasch analysis of the DPQ. Lawlis et al. (1989) developed this instrument with varying scale lengths, which they justified by differential weighting of items. They indicated that in pilot studies, some of the items seemed to impact the construct of perceived disability more than others. They gave the ones with the highest impact more segments to reflect this weighting. I do not understand how giving an individual more selection options in a given item accomplishes the authors’ goal as outlined above. Initial Rasch analysis of this instrument revealed a lack of unidimensionality of the scale and disordered thresholds, particularly in the items with more segments. When the scale was reduced to 9 items with 4 categories each, the DPQ demonstrated satisfactory internal reliability and construct validity as indicated by the Rasch analysis.
In contrast to the ODI findings, where the pain item did fit the Rasch model, in this case the pain item did not function as part of a unidimensional scale. One explanation could be that while the ODI item refers to pain management, the DPQ simply asks the individual to indicate the level of pain they are experiencing, from “No Pain” at one end of the scale to “Worst Imaginable Pain” at the other.

Ozguler et al. (2002) had changed the scale to 10 segments, reporting that this new homogeneous instrument functioned well to measure the impact of spinal pain on behavior. While uniform scale length was definitely a step in the right direction, ten segments might be excessive. In my analysis, I found that the scale functioned well with 4 segments.

Spinal Function Sort (SFS)

For me, the poor performance of the SFS when analyzed with the Rasch model was the most disappointing. Within the rehabilitation field, the creator of the SFS, Leonard Matheson, is a well-respected researcher who was a pioneer in functional capacity evaluation and who has contributed a great deal to the advancement of the field. The problems with the SFS (lack of unidimensionality, disordered thresholds and lack of local independence) could not be overcome by eliminating items or persons or by rescaling items.

This was the only instrument that demonstrated DIF, where items functioned differently by gender. It appears that men do not think that they can do dishes even if they are physically capable of much heavier tasks, and likewise women do not like to use any tools. These types of task items seem to measure gender roles more than ability.

The SFS does not meet the requirement of local independence, a basic tenet of the Rasch model. Because the items are presented in order from lighter to heavier tasks and similar tasks are clustered together, the response to one item influences the response to a similar item.
The reliability check, which consists of two items repeated at the end of the test, is ineffective. The difficulty of Items 6 and 50, “place or retrieve a 5 lb weight between waist and overhead”, differed depending on where the item was placed in the test. Respondents score it as an easier task after completing the questions related to material handling of 50 to 100 lbs. In my experience, respondents quite often remember that this item has been presented earlier in the test and flip back to check their previous answer.

Neck Disability Index (NDI)

Since the conception of this project, a Rasch analysis of the NDI has been published. Van der Velde et al. (2009) reported on a sample of 521 trial subjects fit to the Rasch model. They reported a lack of fit of the data to the model and disordered response thresholds in “personal care”, “lifting”, “headaches”, “work” and “recreation”. They eliminated two items, “headaches” and “lifting”, and developed an eight item scale that demonstrated unidimensionality. They chose not to address the disordered thresholds as:

    this would have precluded the possibility of providing a straightforward exchange between the everyday summed ordinal score and its corresponding interval score. Furthermore, collapsing the scale would have resulted in a varying number of categories across items, which represents a considerable change from the original design of the NDI scale.

They suggest that the disordered response thresholds be examined in other samples to see if the problem is generic.

In my analysis, I also found that “lifting” and “headache” did not form part of a unidimensional construct relating to perceived disability. I found disordered thresholds, albeit in different items, but I also found that I was unable to effectively collapse categories to improve these items. I believe that the disordered thresholds cannot be fixed because the instrument itself, despite its popularity and widespread use, is poorly designed.
The response categories are confusing to respondents as they often contain more than one concept, such as pain and function. For instance, in the driving item, two adjacent categories are presented as follows:

    “I can drive my car as long as I want with moderate pain in my neck.”
    “I can’t drive my car as long as I want because of moderate pain in my neck.”

It was noted that the response categories designed to measure the highest levels of neck disability, such as “I cannot read at all”, were rarely or never endorsed by participants.

Conclusion

The self-report questionnaires included as part of the Matheson Functional Capacity Evaluation Software package did not meet expectations. The original intent of this work had been to equate these measures to see if questionnaire selection and administration could be streamlined, but this was not possible due to the poor performance of these instruments when analyzed with a modern psychometric approach, Rasch Modeling. As Linacre (2009) says, “if tests don’t make sense separately, they won’t make sense together.”

Limitations of Design

The focus of this measurement thesis is on the internal validity of the tools. There is no content validation by experts in the field in relation to scale reduction decisions. There is no linkage between patient scores and function in the job setting. No follow up with subjects is possible.

Recommendations for Practitioners

These instruments should never be used to replace clinical observation and judgment. Scores should be used as a contribution toward decisions regarding symptom magnification, but only as a small part of a bigger picture. Clinicians should become familiar with the measurement properties of any instrument they use and critically evaluate the methods used to obtain the results. Better instruments may be available for use as part of the Functional Capacity Evaluation.
It is the responsibility of the clinician to use reliable and valid instruments when measuring and reporting on symptom magnification.

Recommendations for Future Research

Oswestry Disability Index

Pilot testing of the scale outlined in Appendix 8 is recommended, as is testing of the new instrument with acute and chronic back pain populations.

Dallas Pain Questionnaire

Since the new scale length was developed by collapsing categories post hoc, further research is required to test the proposed new four-segment, nine-item homogeneous scale.

Spinal Function Sort

A better model for an activity sort such as this would be a computer-administered test in which pictures are displayed in a random order and gender neutrality is maintained. A test battery for women that depicts all tasks being performed by women, and likewise a similar test battery for men, would reduce or eliminate bias by matching the test to the gender of the test taker. Further effort to identify tasks with given physical demands that are equally likely to be performed by both genders would improve the instrument; alternatively, parallel tests specific to gender could be developed. With the advancement of technology, computer tests can be easily adapted to fit any given situation.

Neck Disability Index

Although DIF by gender was not significant in my sample nor in the much larger sample used by van der Velde et al. (2009), it would be interesting to see if the NDI functions differently by diagnosis. Van der Velde et al. (2009) excluded patients with neck pain that was not mechanical in nature. They also excluded clients with third-party liability or compensation claims as well as individuals with co-existing problems. My sample was too small to divide by diagnosis, but analysis of acute vs. chronic and of multiple areas (i.e. neck and back, or neck and shoulder) vs. a single area (neck) would be recommended.
I was unable to correct the disordered thresholds in this instrument despite numerous attempts to collapse the categories. The NDI is often confusing to respondents, and its response categories contain more than one concept. It may be unsalvageable. There is a plethora of other measurement tools for neck disability, so it would be worthwhile to further investigate alternatives to the NDI.

REFERENCES

Ackelman, B.H., & Lindgren, U. (2002). Validity and reliability of a modified version of the neck disability index. Journal of Rehabilitation Medicine, Vol. 34: 284-287.
Andrich, D. (1978). Application of a psychometric rating model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2(4), 581-594.
Beurskens, A., de Vet, H., & Koke, A. (1996, April). Responsiveness of functional status in low back pain: a comparison of different instruments. Pain, 65(1), 71-76.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Bond, T.G. & Fox, C.M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum Associates, Publishers.
Christensen, F.B., Laursen, M., Gelineck, J., Hansen, E.S. and Bünger, C.E. (2001). Posterolateral spinal fusion at unintended levels due to bone-graft migration. Acta Orthopaedica Scandinavica, Vol. 72 (4): 354-358.
Cleland, J.A., Childs, J.D., Whitman, J.M. (2008). Psychometric properties of the Neck Disability Index and numeric pain rating scale in patients with mechanical neck pain. Archives of Physical Medicine & Rehabilitation; 89: 69-74.
Cleland, J.A., Childs, J.D., Whitman, J.M. (2008). Response to Vernon H. Letter to the Editor re Psychometric properties of the Neck Disability Index and numeric pain rating scale in patients with mechanical neck pain. Archives of Physical Medicine & Rehabilitation; 89: 1415-1416.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart & Winston.
Cronbach, Lee J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.
Davidson, M. (2008). Rasch analysis of three versions of the Oswestry Disability Questionnaire. Manual Therapy, 13(3), 222-231.
Davidson, M.B., Keating, J.L., Eyres, S. (2004). A low back-specific version of the SF-36 Physical Functioning Scale. Spine, Vol. 25(5), 586-594.
Fairbank, J.C., Couper, J., Davies, J.B., O’Brien, J.P. (1980). The Oswestry low back pain questionnaire. Physiotherapy, Vol. 66: 271-3.
Fairbank, J.C.T. & Pynsent, P.B. (2000). The Oswestry disability index. Spine, Vol. 25, No. 2: 2940-2952.
Fan, X. (1998). Item response theory and classical test theory: an empirical comparison of their item-person statistics. Educational and Psychological Measurement, 58(3), 357-381.
Gibson, L. & Strong, J. (1996). The reliability and validity of a measure of perceived functional capacity for work in chronic back pain. Journal of Occupational Rehabilitation, Vol. 6, No. 3: 159-175.
Greenough, C.G. & Fraser, R.D. (1992). Assessment of outcome in patients with low-back pain. Spine, Vol. 17: 36-41.
Gronblad, M., Hupli, M., Wennerstrand, P., Jarvinen, E., Lukinmaa, A., Kouri, J., et al. (1993). Intercorrelation and test-retest reliability of the Pain Disability Index (PDI) and the Oswestry Disability Questionnaire (ODQ) and their correlation with pain intensity in low back pain patients. The Clinical Journal of Pain, 9(3), 189-195.
Hains, F., Waalen, J., Mior, S. (1998). Psychometric properties of the neck disability index. Journal of Manipulative and Physiological Therapeutics, Vol. 21 (2), 75-80.
Hudson-Cook, N., Tomes-Nicholson, K., Breen, A. A revised Oswestry disability questionnaire. In: Roland M, Jenner JR, eds. Back Pain: New Approaches to Rehabilitation and Education (pp. 187-204). Manchester: Manchester University Press.
Huskisson, E.C. (1974). Measurement of pain. Lancet, Vol. 2, 1127-1131.
Isernhagen, S. (1995). Contemporary issues in functional capacity evaluation. In S. Isernhagen (Ed.): The Comprehensive Guide to Work Injury Management (pp. 410-429). Gaithersburg: Aspen Publishers.
Karabatsos, G. (1999). Axiomatic measurement theory as a basis for model selection in item response theory. Paper presented at the 32nd annual conference of the Society for Mathematical Psychology, Santa Cruz, CA.
Kopec, J., Esdaile, J., Abrahamowicz, M., Abenhaim, L., Wood-Dauphinee, S., Lamping, D., et al. (1995). The Quebec Back Pain Disability Scale: measurement properties. Spine, 20(3), 341-352.
Lawlis, G.F., Cuenas, R., Selby, D., and McCoy, C.E. (1989). The development of the Dallas Pain Questionnaire. An assessment of the impact of spinal pain on behavior. Spine, May; Vol. 14 (5), pp. 511-6.
Linacre, J.M. (2009). Winsteps (Version 3.68.0) [Computer Software]. Chicago: Winsteps.com.
Lord, F.M. (1952). A theory of test scores. Psychometric Monograph No. 7. Psychometric Society, New York.
Marty, M., Blotman, F., Avouac, B., Rozenberg, S., & Valat, J. (1998). Validation of the French version of the Dallas Pain Questionnaire in chronic low back pain patients. Revue du Rhumatisme (English Ed.), 65(2), 126-134.
Matheson, L.N. (1988). How do you know he tried his best? Journal of Industrial Rehabilitation Quarterly, 1, 10-12.
Matheson, L., Mayer, J., Mooney, V., Sarkin, A., Dreisinger, T., Verna, J., Leggett, S. (2008). A method to provide a more efficient and reliable measure of self-report physical work capacity for patients with spinal pain. Journal of Occupational Rehabilitation, Vol. 18 (1): 46-57.
Matheson, L.N., Matheson, M.L., Grant, J. (1993). Development of a measure of perceived functional ability. Journal of Occupational Rehabilitation, Vol. 3: 15-30.
Matheson, L.N. (2004). History, design characteristics, and uses of the pictorial activity and task sorts.
Journal of Occupational Rehabilitation, Vol. 14, No. 5, 175-195.
Matheson, L.N., & Matheson, M.L. (1989). Spinal function sort. Rancho Santa Margarita, CA: Performance Assessment and Capacity Testing.
Matheson, Roy & Associates (2006). The functional capacity evaluation certification program manual. Keene, NH: Published in-house.
Melzack, R. (1975). The McGill Pain Questionnaire: major properties and scoring methods. Pain, Vol. 1(3), 277-299.
Ozguler, A., Gueguen, A., Leclerc, A., Landre, M., Piciotti, M., Le Gall, S., Morel-Fatio, M., and Boureau, F. (2002). Using the Dallas Pain Questionnaire to classify individuals with low back pain in a working population. Spine, Vol. 27, No. 16: 1783-1789.
Page, S., Shawaryn, M., Cernich, A., & Linacre, J. (2002, November). Scaling of the Revised Oswestry Low Back Pain Questionnaire. Archives of Physical Medicine & Rehabilitation, 83(11), 1579-1584.
Pollard, C.A. (1984). Preliminary validity study of the pain disability index. Journal of Perceptual Motor Skills, Vol. 59(3), 974.
Pomeranz, J.L., Byers, K.L., Moorhouse, M.D., Velozo, C.A., Spitznagel, R.J. (2008). Rasch analysis as a technique to examine the psychometric properties of a career ability placement survey subtest. Rehabilitation Counseling Bulletin, Jul; 51 (4): 251-9.
Ransford, A., Cairns, D., & Mooney, V. (1979). The pain drawing as an aid to the psychological evaluation of patients with low back pain. Spine, Vol. 1, 127-134.
Robinson, R.C., Kishino, N., Matheson, L.N., Woods, S., Hoffman, K., Unterberg, J., Pearson, C., Adams, L., Gatchel, R. (2003). Improvement in postoperative and nonoperative spinal patients on a self-report measure of disability: The Spinal Function Sort (SFS). Journal of Occupational Rehabilitation, Vol. 13, No. 2, 107-113.
Roche, G., Ponthieux, A., Parot-Shinkel, E., Jousset, N., Bontoux, L., Dubus, V., Penneau-Fontbonne, D., Roquelaure, Y., Legrand, E., Colin, D., Richard, I., Fanello, S. (2007).
Comparison of a functional restoration program with active individual physical therapy for patients with chronic low back pain: a randomized controlled trial. Archives of Physical Medicine and Rehabilitation, Vol. 88 (10), 1229-35.
Shaw, F. (1991). Descriptive IRT vs. Prescriptive Rasch. Rasch Measurement Transactions, Vol. 5:1, 131.
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Streiner, D.L. & Norman, G.R. (1989). Health Measurement Scales: A Practical Guide to Their Development and Use. New York: Oxford University Press, Inc., pp. 64-65.
Strong, J., Ashton, R., & Large, R. (1994). Function and the patient with chronic low back pain. The Clinical Journal of Pain, 10(3), 191-196.
Thew, Kimberley A. (2007). An examination of the perceptions of functional capacity evaluations in Prince George, British Columbia: A case study. M.A. dissertation. University of Northern British Columbia (Canada).
Thurstone, L.L. (1927). The unit of measurement in educational scales. Journal of Educational Psychology, 16, 433-451.
Tsutsumi, A., Iwata, N., Wakita, T., Kumagai, R., Noguchi, H. & Kawakami, N. (2008). Improving the Measurement Accuracy of the Effort-Reward Imbalance Scales. International Journal of Behavioral Medicine, 15:2, 109-119.
van der Velde, G., Beaton, D., Hogg-Johnston, S., Hurwitz, E., & Tennant, A. (2009). Rasch analysis provides new insights into the measurement properties of the neck disability index. Arthritis & Rheumatism, 61(4), 544-551.
Vernon, H., Mior, S. (1991). The neck disability index: A study of reliability and validity. Journal of Manipulative Physiological Therapeutics, Vol. 14: 409-15.
Vernon, H. (2008). Letter to the Editor in Response to Cleland et al. Psychometric properties of the Neck Disability Index and numeric pain rating scale in patients with mechanical neck pain. Archives of Physical Medicine & Rehabilitation; 89: 1414-1415.
Walsh, Thom (2000).
Point of view re: the Oswestry disability Index. Spine, Vol. 25, No. 2: 2953.
White, L., Velozo, C. (2002). The use of Rasch measurement to improve the Oswestry classification scheme. Archives of Physical Medicine & Rehabilitation, 83(6), 822-831.
Wright, B.D., Linacre, M., Gustafsson, J-E. and Martin-Löf, P. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370. Available from http://www.rasch.org/rmt/rmt83.htm, accessed March 29, 2009.
Wright, B.D., Masters, G.N. (1982). Rating scale analysis. Chicago: Mesa Press.

Appendix 1 - Oswestry Disability Index - Version 1.0

This questionnaire has been designed to give the doctor information as to how your back pain has affected your ability to manage in every day life. Please answer every section, and mark in each section only the one box which applies to you. We realize you may consider that two of the statements in any one section relate to you, but please just mark the box which most closely describes your problem.

Section 1 - Pain Intensity
□ I can tolerate the pain I have without having to use painkillers
□ The pain is bad but I manage without taking painkillers
□ Painkillers give complete relief from pain.
□ Painkillers give moderate relief
□ Painkillers give very little relief from pain.
□ Painkillers have no effect on the pain and I do not use them.

Section 2 - Personal Care (Washing, Dressing etc)
□ I can look after myself normally without causing extra pain.
□ I can look after myself but it causes extra pain.
□ It is painful to look after myself and I am slow and careful
□ I need some help but manage most of my personal care
□ I need help every day in most aspects of self-care
□ I do not get dressed, wash with difficulty and stay in bed

Section 3 - Lifting
□ I can lift heavy weights without extra pain
□ I can lift heavy weights but it gives me extra pain
□ Pain prevents me from lifting heavy weights off the floor, but I can manage if they are conveniently positioned e.g.
on a table.
□ Pain prevents me from lifting heavy weights but I can manage light to medium weights if they are conveniently positioned.
□ I can lift only very light weights
□ I cannot lift or carry anything at all.

Section 4 - Walking
□ Pain does not prevent me from walking any distance
□ Pain prevents me walking more than 1 mile.
□ Pain prevents me walking more than ½ mile.
□ Pain prevents me walking more than ¼ mile
□ I can only walk using a stick or crutches
□ I am in bed most of the time and have to crawl to the toilet.

Section 5 - Sitting
□ I can sit in any chair as long as I like
□ I can only sit in my favorite chair as long as I like
□ Pain prevents me from sitting more than 1 hour
□ Pain prevents me from sitting more than ½ hour.
□ Pain prevents me from sitting more than 10 minutes
□ Pain prevents me from sitting at all.

Section 6 - Standing
□ I can stand as long as I want without extra pain
□ I can stand as long as I want but it gives me extra pain
□ Pain prevents me from standing for more than 1 hour
□ Pain prevents me from standing for more than 30 minutes
□ Pain prevents me from standing for more than 10 minutes
□ Pain prevents me from standing at all.

Section 7 - Sleeping
□ Pain does not prevent me from sleeping
□ I can sleep well only by using tablets
□ Even when I take tablets I have less than 6 hours of sleep
□ Even when I take tablets I have less than 4 hours of sleep
□ Even when I take tablets I have less than 2 hours of sleep
□ Pain prevents me from sleeping at all.

Section 8 - Sex Life
□ My sex life is normal and causes no extra pain
□ My sex life is normal but causes some extra pain.
□ My sex life is nearly normal but is very painful
□ My sex life is severely restricted by pain.
□ My sex life is nearly absent because of pain
□ Pain prevents any sex life at all.

Section 9 - Social Life
□ My social life is normal and gives me no extra pain
□ My social life is normal but increases the degree of pain
□ Pain has no significant effect on my social life apart from limiting my more energetic interests such as dancing etc.
□ Pain has restricted my social life and I do not go out as often
□ Pain has restricted my social life to my home.
□ I have no social life because of pain.

Section 10 - Travelling
□ I can travel anywhere without extra pain
□ I can travel anywhere but it gives me extra pain
□ Pain is bad but I manage journeys over 2 hours
□ Pain restricts me to journeys of less than 1 hour
□ Pain restricts me to short necessary journeys of less than 30 minutes
□ Pain prevents me from traveling except to the doctor or hospital.

Name ____________________ Date ____________________

Appendix 2 - Dallas Pain Questionnaire

Instructions
Mark an “X” along the line that expresses your thoughts from 0% to 100% in each section. Read each statement carefully. There are words to help you with each statement. If you need help, please ask.

Section I: Pain Intensity
To what degree do you rely on pain medications or pain relieving substances for you to be comfortable?
0% (None ____ Some ____ All the time) 100%

Section II: Personal Care
How much does pain interfere with your personal care (getting out of bed, teeth brushing, dressing, etc.)?
0% (None (no pain) ____ Some ____ I cannot get out of bed) 100%

Section III: Lifting
How much limitation do you notice in lifting?
0% (None (I can lift as I did) ____ Some ____ I cannot lift anything) 100%

Section IV: Walking
Compared to how far you could walk before your injury or back trouble, how much does pain restrict your walking now?
0% (I can walk the same ____ Almost the same ____ Very little ____ I cannot walk) 100%

Section V: Sitting
Back pain limits my sitting in a chair to:
0% (None, pain same as before ____ Some ____ I cannot sit at all) 100%

Section VI: Standing
How much does your pain interfere with your tolerance to stand for long periods?
0% (None, same as before ____ Some ____ I cannot stand) 100%

Section VII: Sleeping
How much does pain interfere with your sleeping?
0% (None, same as before ____ Some ____ I cannot sleep at all) 100%

____ x 3 = ____% (Daily Activities Interference)

Section VIII: Social Life
How much does pain interfere with your social life (dancing, games, going out, eating with friends, etc.)?
0% (None, same as before ____ Some ____ No activities, total loss) 100%

Section IX: Traveling
How much does pain interfere with traveling in a car?
0% (None, same as before ____ Some ____ Cannot travel) 100%

Section X: Vocational
How much does pain interfere with your job?
0% (None, no interference ____ Some ____ I cannot work) 100%

____ x 5 = ____% (Work/Leisure Activities Interference)

Appendix 3 - Spinal Function Sort

Instructions

This is a test of your current ability to perform work tasks. There are 50 drawings of work tasks in this booklet. Each drawing has a short description of a work task. Look at each drawing and read the description. On the separate answer sheet, indicate your current level of ability to perform the task in the written description. You do not have to do the task exactly as the drawing. The drawing is meant to help explain the written task description.

If you can perform the task with no difficulty, circle #1, “Able”.
[Rating scale: Able (1), Restricted (2, 3, 4), Unable (5), ?; #1 circled]

If you cannot perform the task at all, circle #5, “Unable”.
[Rating scale with #5 circled]

If you can perform the task, but you have some difficulty, circle #2, #3, or #4, “Restricted”.
[Rating scale with #2, #3 and #4 circled]

Be sure to circle only one number.
If you circle #2, this would indicate that you are only slightly restricted.
[Rating scale with #2 circled]

If you circle #4, this would indicate that you are very restricted, almost unable to perform the task.
[Rating scale with #4 circled]

If you don't know whether or not you can perform the task, circle “?”, which stands for “I don't know”.
[Rating scale with “?” circled]

Work quickly. Do not spend too much time on one drawing. Your first impression is usually the best.

Performance Assessment and Capacity Testing, Product #SFS-TBE-1. © Copyright PACT 1989.

[Sample drawings]
1. Place a glass bottle on the floor.
2. Retrieve a small tool from the floor.

[Answer sheet: P.A.C.T. Spinal Function Sort]
Work quickly. Do not spend too much time on any one item. Your first impression is usually the best.

Appendix 4 - Neck Disability Index

This questionnaire has been designed to give your therapist information as to how your neck pain has affected you in your everyday life activities. Please answer each section, marking only ONE box which best describes your status today.

Section 1 - Pain Intensity
□ I have no pain at the moment
□ The pain is very mild at the moment
□ The pain is moderate at the moment
□ The pain is fairly severe at the moment
□ The pain is very severe at the moment
□ The pain is the worst imaginable at the moment

Section 2 - Personal Care (Washing, Dressing etc)
□ I can look after myself normally without causing extra pain.
□ I can look after myself normally but it causes me extra pain.
□ It is painful to look after myself and I am slow and careful
□ I need some help but manage most of my personal care
□ I need help every day in most aspects of self-care
□ I do not get dressed, wash with difficulty and stay in bed

Section 3 - Lifting
□ I can lift heavy weights without extra pain
□ I can lift heavy weights but it gives me extra pain
□ Pain prevents me from lifting heavy weights off the floor, but I can manage if they are conveniently positioned e.g. on a table.
□ Pain prevents me from lifting heavy weights but I can manage light to medium weights if they are conveniently positioned.
□ I can lift only very light weights
□ I cannot lift or carry anything at all.

Section 4 - Reading
□ I can read as much as I want to with no pain in my neck
□ I can read as much as I want to with slight pain in my neck
□ I can read as much as I want to with moderate pain in my neck
□ I can’t read as much as I want because of moderate pain in my neck
□ I can hardly read at all because of severe pain in my neck
□ I cannot read at all

Section 5 - Headache
□ I have no headache at all
□ I have slight headaches which come infrequently
□ I have moderate headaches which come infrequently
□ I have moderate headaches which come frequently
□ I have severe headaches which come frequently
□ I have headaches almost all the time

Section 6 - Concentration
□ I can concentrate fully when I want to with no difficulty
□ I can concentrate fully when I want to with slight difficulty
□ I have a fair degree of difficulty in concentrating when I want to.
□ I have a lot of difficulty in concentrating when I want to.
□ I have a great deal of difficulty in concentrating when I want to.
□ I cannot concentrate at all.

Section 7 - Work
□ I can do as much work as I want to
□ I can only do my usual work but no more
□ I can do most of my usual work, but no more
□ I cannot do my usual work
□ I can hardly do any work at all
□ I can’t do any work at all

Section 8 - Driving
□ I can drive my car without any neck pain
□ I can drive my car as long as I want with slight pain in my neck
□ I can drive my car as long as I want with moderate pain in my neck
□ I can’t drive my car as long as I want because of moderate pain in my neck
□ I can hardly drive at all because of severe pain in my neck
□ I can’t drive my car at all

Section 9 - Sleeping
□ I have no trouble sleeping
□ My sleep is slightly disturbed (less than 1 hour sleep loss)
□ My sleep is mildly disturbed (1-2 hour sleep loss)
□ My sleep is moderately disturbed (2-3 hours sleep loss)
□ My sleep is greatly disturbed (3-5 hours sleep loss)
□ My sleep is completely disturbed (5-7 hours sleep loss)

Section 10 - Recreation
□ I am able to engage in all my recreational activities with no neck pain at all
□ I am able to engage in all my recreational activities with some pain in my neck
□ I am able to engage in most but not all of my usual recreational activities because of pain in my neck
□ I am able to engage in a few of my usual recreational activities because of pain in my neck
□ I can hardly do any recreational activities because of pain in my neck
□ I can’t do any recreational activities at all.

Comments: ____________________
Name ____________________ Date ____________________

Appendix 5 - Letter of Consent

Phone: (250)564-3077 Fax: (250)564-3008 Email: lois.lochhead@cicims.com
210-1811 Victoria Street
Prince George, BC V2L 2L6

October 24, 2008

University of Northern British Columbia
3333 University Way
Prince George, BC V2N 4Z9

Attention: Research Ethics Committee

Dear Sirs:

As an officer of Central Interior Disability Management Services, I have authorized your student, Lois Lochhead, Student Number 200002080, to use data collected at our facility for research related to her Master’s Thesis entitled Assessment of Perceived Functional Ability: Using Rasch Analysis to Evaluate the Measurement Properties of Four Perceived Pain & Disability Scales.

An intact data set will be provided to Ms. Lochhead once Ethics Approval has been obtained from UNBC. The data was collected by our staff and entered into an Excel spreadsheet in house. This data set contains no personal identifiers of the clients who took part in Functional Capacity Evaluations at our facility between 1998 and 2008. Each client signed a consent to evaluate and a sample of this form has been provided to Ms. Lochhead. Original material remains in the offices of Central Interior Disability Management Services.

If you have any questions or concerns, please do not hesitate to contact me.

Charles J. Attwater
Human Resources Manager

Appendix 6 - Consent to Evaluate

Phone: (250)564-3077 Fax: (250)564-3008 Email: lois.lochhea