"Pick Me, Pick Me, I Want to Be a Counsellor": Assessment of a MEd-Counselling Application Selection Process using Rasch Analysis and Generalizability Theory by Stefanie Sebok B.A., University of Victoria, 2008 THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF EDUCATION IN COUNSELLING THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA July 2010 © Stefanie Sebok, 2010 1*1 Library and Archives Canada Bibliotheque et Archives Canada Published Heritage Branch Direction du Patrimoine de I'edition 395 Wellington Street OttawaONK1A0N4 Canada 395, rue Wellington OttawaONK1A0N4 Canada Your file Votre reference ISBN: 978-0-494-75128-2 Our file Notre reference ISBN: 978-0-494-75128-2 NOTICE: AVIS: The author has granted a nonexclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or noncommercial purposes, in microform, paper, electronic and/or any other formats. L'auteur a accorde une licence non exclusive permettant a la Bibliotheque et Archives Canada de reproduire, publier, archiver, sauvegarder, conserver, transmettre au public par telecommunication ou par I'lnternet, prefer, distribuer et vendre des theses partout dans Ie monde, a des fins commerciales ou autres, sur support microforme, papier, electronique et/ou autres formats. The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission. L'auteur conserve la propriete du droit d'auteur et des droits moraux qui protege cette these. Ni la these ni des extraits substantiels de celle-ci ne doivent etre imprimes ou autrement reproduits sans son autorisation. In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. Conformement a la loi canadienne sur la protection de la vie privee, quelques formulaires secondaires ont ete enleves de cette these. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis. Bien que ces formulaires aient inclus dans la pagination, il n'y aura aucun contenu manquant. 1+1 Canada ii Abstract The purpose of this research project was to evaluate the effectiveness of the Many-Facet Rasch Model and Generalizability Theory as applied to the application selection committee for the Masters of Education in Counselling Program at UNBC. These two models investigated the items used to score applicants and assessed the rater characteristics of each member on the application selection committee. This evaluation was used to inform the School of Education and provide feedback to refine the selection process in the future. Overall, the applicant selection process at UNBC produced a unitary score that can be used to rank all individuals applying to the counselling program. The 5-point rating scale used to evaluate applicants served as an appropriate measurement tool for assessing applicants. The raters who participated as members on the selection committee were fitting both as groups and as individuals in selecting applicants for the counselling program. To conclude, the Many-Facet Rasch Model and Generalizability Theory served as appropriate measurement tools for describing the details of items, raters, and applicants. 
Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Dedication
Acknowledgements
Chapter One: Introduction
    Rationale for the Study
    Statement of the Problem
    Theoretical Framework
        Rasch Model
        Generalizability Theory
    Research Questions
    Definition of Terms
Chapter Two: Review of the Literature
    Rasch Measurement
        Dichotomous Rasch Model
        Rasch-Andrich Rating Scale Model
        Rasch-Masters Partial Credit Model
        Many-Facet Rasch Model
    The Effect of a Rater's Influence
        Leniency/severity effect (or generosity error)
        Central tendency effect
        Restriction-of-range effect
        Halo effect
    Generalizability Theory
        Relative and absolute decisions
        Random and fixed facets
        Decision studies
    Rasch Meets Generalizability Theory
Chapter Three: Research Methods
    Participants
        Raters
        Applicant pool
        Issues of access to applicant pool
    Ethical Considerations
    Instruments
    Procedures
    Measures used for Analyzing Data
        Rasch analysis
        Generalizability analysis
Chapter Four: Results
    Many-Facet Rasch Analysis
        Applicant pool
        Items
        Raters
        Applicants
    Generalizability Analysis
        Variance components for items
        Variance components for raters
        Residual variance component
        G-Facets analysis
        Decision studies
    Part-time Applicants Revisited
        Rasch revisited
    Full-time and Part-time Applicants Revisited
        Probability
        G-Theory revisited
Chapter Five: Discussion and Conclusions
    Items
    Raters
    Applicants
    Student Raters
    Conclusions
    Limitations of the Design
    Recommendations for the Application Selection Committee
        Faculty agreement
        Northern perspective
        Interviews
        Applicant waitlist
        Good measurement practice
    Recommendations for Future Research
References
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F

List of Tables

Table 1. Items Measurement Report
Table 2. Full-time Applicants Items Measurement Report
Table 3. Part-time Applicants Items Measurement Report
Table 4. Rater Measurement Report
Table 5. Applicant Measurement Report
Table 6. Estimated G-Study Variance Components
Table 7. G-Facets Analysis
Table 8. Estimated G-Study Variance Components for Full-Time Applicants
Table 9. Estimated G-Study Variance Components for Part-Time Applicants
Table 10. D-Studies for Raters
Table 11. D-Studies for Items
Table 12. Estimated G-Study Variance Components for Combined Part-Time Applicants
Table 13. Combined Part-time Applicants Items Measurement Report
Table 14. Combined Part-time Applicants Rater Measurement Report
Table 15. Combined Prince George and Northwest Applicants Rater Measurement Report
Table 16. Estimated G-Study Variance Components for Combined Prince George and Northwest Applicants

List of Figures

Figure 1. Wright variable map for relationships among facets for Prince George applicants
Figure 2. Alternative D-Studies for determining the optimal number of raters and items
Figure 3. Wright variable map for relationships among facets for Prince George and Northwest applicants
Figure 4. Probability curves for the 5-point rating scale used to evaluate MEd-Counselling applicants from both the Prince George and Northwest campuses
Figure 5. Scale structure for applicants from Prince George and Northwest campuses

Dedication

For my mom who provided the encouragement, the belief, and the love to get me here.

Acknowledgements

Peter "the Great".
Thank you for being the best supervisor, colleague and friend that a person could ever ask for. Being your student has been the most satisfying experience and the memories we share will stay with me forever. Continue to be the great educator you are by sharing your visions and bringing out the best qualities in the students you teach.

Linda. Thank you for all the time you invested in my thesis project. I really appreciate that you were able to be there with me in Whitehorse for my thesis defence.

Kenneth. Thank you for being a part of my thesis project. I found your questions at the defence to be both challenging and insightful.

Robert. Thank you for supporting my thesis work and for sharing your knowledge and expertise of Rasch. I am very honoured to say you were my external examiner.

I would like to acknowledge the School of Education at the University of Northern British Columbia for allowing me to access the data used in this analysis.

John. I appreciate how supportive you were throughout my time in the Masters of Education program. Your calm demeanour and insightful words are great examples of how professors can encourage their students.

Serena. You were my partner in crime throughout this Masters of Education program. Thank you for being there whenever I needed you. You are a very gifted counsellor and a truly wonderful friend.

To my amazing and talented editors: Jeff, Dee, Brandon, and Michael. I appreciate all the time each of you spent to help make my work a masterpiece.

Finally, I want to thank my family and friends who have always been there for me and who continue to encourage me to reach for the stars.

Chapter One: Introduction

Rationale for the Study

People all over the world make decisions about which individuals should be promoted at a job, which individuals are most in need of financial assistance or medical attention, and which individuals are outperforming others in a classroom setting. Each day, people make judgements and decisions based on some kind of formal or informal assessment criteria. When people are making judgements, there is often a bias, whether identified or not, that influences how each individual personally views and interprets the dynamics of a situation. Johnson and Johnson (2003) suggest that when people work together as a group there is more opportunity for the bias that influences their decisions to be identified and addressed. For this particular reason, high-stakes decision making usually involves a group approach rather than an individual approach.

Within the academic setting, a group approach to decision making is often used to evaluate the quality and content of work that individuals are trying to research and publish. Groups are also used to assess the effectiveness of particular programs, courses, and instructors. Committees consisting of a variety of individuals with a wealth of experience are constructed so that the operations of a university can be carried out with confidence and ease. Through the use of committees, university policies can be changed or implemented, instructors can be hired or fired, and prospective students can be offered an opportunity to study at an institution.
The purpose of this research study was to assess the overall effectiveness of the MEd-Counselling applicant selection process as it currently exists at the University of Northern British Columbia (UNBC) through an analysis of the effectiveness of the items used by the application selection committee to score applicants and of the rater characteristics of each of the group members on the application selection committee. This evaluation will inform the School of Education and provide feedback that could be used to further enhance and refine the quality of its application selection process in the future.

Statement of the Problem

Since UNBC first started admitting counselling students, there has been no formal assessment of the overall application process. There has been no attempt to collect empirical evidence to support the validity of the instruments and rating scales used by the selection committee to assess applicants who apply to the counselling program. Therefore, the question of whether the applicant selection process at UNBC produces a unitary score based on the rating scales that can be used to rank all individuals applying to the counselling program remains unanswered. Furthermore, there has never been any formal assessment of the 5-point rating scale used to evaluate the applicants. Without further investigation, it is indeterminate whether the items' levels of difficulty are appropriately matched to the population of applicants that apply to the MEd-Counselling program each year.

In the present study, given that item characteristics were assessed, it logically followed that rater variation must also be addressed (Smith, E. V., 2004). The Many-Facet Rasch Model (Linacre, 1989) can be used to address the question of whether the raters on the selection committee are behaving in a way that demonstrates their ratings of the individuals applying to the counselling program fit the model. There is also the question of how the student raters behave in comparison to the faculty raters. If the Rasch model is going to be employed to investigate the rating behaviour of the participants in this study, then further assessment of the Rasch model needs to be conducted to see if it is a viable method of identifying and compensating for rater differences. The evaluation of the Many-Facet Rasch Model (Linacre, 1989) will be conducted through a comparison with Generalizability Theory, another measurement methodology that the literature has shown can be used to investigate applicant data, item characteristics, and rater behaviour.

As UNBC continues to develop and establish itself as a top educational institution, the process of obtaining a seat is going to become more competitive, especially in graduate programs where there are usually a limited number of seats available. Often there are more than enough suitable candidates for these programs; therefore, the selection process becomes about deciding which applicants have the strongest qualities and would be the best fit for the program. Consequently, the selection committee faces added pressure to carefully determine who should be offered a seat in the program. Demonstrated reliability of the instruments and scales used to evaluate applicants on the pre-admission criteria would reassure members of the selection committee and thus allow them to better perform as raters. UNBC was fortunate enough to use a fully crossed design for rating applicants.
This means that every member of the application selection committee was responsible for evaluating all applicants on all of the items in the pre-admissions criteria. In situations where a fully crossed design might not have been feasible, information about alternative measures that could be used to adjust for the resulting variability and bias would have been valuable. In conclusion, evaluating the procedure that was used to select individuals for a competitive program was worthwhile because it provided evidence that UNBC, as an academic institution, is doing everything possible to ensure that the best-suited applicants are being granted letters of admission.

Theoretical Framework

Rasch Model. The Danish mathematician Georg Rasch (1901-1980) created the first Rasch model that researchers in the field of measurement use today. In the 1950s, he was asked to analyze data that was collected from children for the purposes of testing intelligence. Rasch decided to use the multiplicative Poisson model because he felt that it would suitably fit the data (Rasch, 1980). He later applied the Poisson model to measure other data sets. Rasch was able to make a mathematical connection between statistical probabilities and objective measurement, which led him to develop his own model, which uses log-odds transformations and implements "additive" measurement (Linacre, 2010). For further clarification about the fundamentals of additive measurement see Appendix A. The Rasch model employs the principles of interval measurement to objectively measure the data by taking raw scores of an ordinal nature and performing a series of logarithmic transformations to produce data that supports linearity (Linacre, 2010). According to Linacre (2010), if the data does not fit the Rasch model, then another model should be used. This is a revolutionary way of thinking because it gives priority to the model rather than to the data.

Generalizability Theory. In 1972, Cronbach, Gleser, Nanda, and Rajaratnam first introduced the concepts of Generalizability Theory (G-Theory) by extending the work previously done by Hoyt in 1941 (Kieffer, 1999). Using traditional ANOVA methods, G-Theory evaluates multiple sources of measurement variance separately through one single analysis (Atilgan, 2008). In G-Theory, "the object of measurement cannot be a facet" (Kieffer, 1999, p. 161) because facets are defined as measures that create unknown error variance. Any sources of variance that do not come from the individuals themselves can be considered a facet. G-Theory looks at how each facet individually contributes to variation in the measurement of a person's overall score in order to obtain a better account of the person's true ability, and thus makes inferences that can be generalized back to the population. Not only does G-Theory look at these facets individually, but it also looks at the interactions between each facet using an analysis of variance (Shavelson & Webb, 1991).

Research Questions

The purpose of this study was to answer the following questions: 1. What are the strengths and weaknesses of the Rasch Model and Generalizability Theory when applied to an application selection process? 2. Does the applicant selection process at UNBC produce a unitary score based on the rating scales that can be used to rank all individuals applying to the counselling program? 3. Is the 5-point rating scale used to evaluate applicants an appropriate measurement tool? 4.
What is the rating behaviour of the raters that participated in selecting applicants for the MEd-Counselling program both as individuals and as groups? 5. Is the Rasch model a viable method for dealing with rater differences? Definition of Terms Many of the terms that will be used throughout this study have specific meaning as they are applied to the Rasch model. The following definitions were taken from the book, Applying the Rasch model: Fundamental measurement in the human sciences by Bond and Fox (2007). Ability Estimate: The location of a person on a variable, inferred by using the collected observations. In this study it would be the applicant's raw score on all of the evaluation items. 6 Calibration: The procedure of estimating person ability or item difficulty by converting raw scores to logits on an objective measurement scale. DIF (Differential Item Functioning): The loss of invariance of item estimates across testing occasions, such as items functioning differently for Prince George applicants and Terrace applicants or differently for males and females. DIF is evidence of item bias. Facet: An aspect of the measurement condition. In Rasch measurement, the three facets are person ability, item difficulty, and rater severity. In Generalizability Theory, the two facets are item difficulty and rater severity; person ability is not a facet as it is considered the object of measurement. Infit mean square: One of the two alternative measures that indicate the degree of fit of an item or a person (the other being standardized infit). Infit mean square is a transformation of the residuals, the difference between the predicted and the observed, for easy interpretation. Its expected value is 1. As a general rule, values between 0.70 and 1.30 are regarded as acceptable. Values greater than 1.30 are labelled as "misfitting" and those less than 0.70 as "overfitting." Item difficulty: The level of resistance to successful performance of the object of measurement on the latent variable. An item with a high level of difficulty should produce a low marginal score. An estimate of an item's underlying difficulty is calculated from the total number of persons in an appropriate sample who succeeded on that item. Latent trait: A characteristic or attribute of a person that can be inferred from the observations of the person's behaviours. 7 Logit: The unit of measurement that results when the Rasch model is used to transform raw scores obtained from ordinal data to log odds ratios on a common interval scale. The value of 0.0 logits is routinely allocated to the mean of the item difficulty estimates. Many-facets Model: In this model, a version of the Rasch model developed in the work of J. M. Linacre (1989), facets of testing situation in addition to person ability and item difficulty are estimated. Rater, test, or candidate characteristics are often-estimated facets. Missing data: Data to which there are non-responses for items. Typically, these are items that an applicant did not answer (in this case did not submit), items that were not administered to the applicant, or items that were not judged by a rater. Noise: Randomness in the data as suggested by the Rasch model or excessive unpredictability in the data, perhaps due to excessive randomness or multidimensionality. Outfit mean square: The measure of degree of fit that is sensitive to outliers, unexpectedly correct responses on hard items or unexpectedly incorrect responses on easy items. 
Generally, values between 0.60 and 1.40 are regarded as acceptable.

Raters: Faculty and students who evaluate candidates' test performances in terms of performance criteria.

Residual: The residual values represent the difference between the Rasch model's theoretical expectations and the actual performance.

Unidimensionality: A basic concept in scientific measurement where one attribute of an object is measured at a time. The Rasch model requires a single construct to be underlying the items that form a hierarchical continuum.

Part of the problem with having multiple raters assess and evaluate applicants is the ambiguity that is involved in the process. Rating and scoring the performance of individuals is a difficult task because presenting individuals with a rating scale and asking them to use it in the same way that they would use measuring cups is unrealistic. For this reason, existing rating scales need to be evaluated to see if they are appropriate measurement tools for what the institution is hoping to measure in potential applicants. Examining the rating behaviour of those individuals involved in the process is the other half of evaluating rating scales because information about a rating scale's effectiveness comes from how individuals are interpreting the scale. The ultimate goal of this study was to look at the applicants, items, and rater behaviour to demonstrate that the School of Education at UNBC is engaging in an application selection process that results in selecting the top-ranking applicants who have the background, knowledge, and experience to become good counsellors in the future.

Chapter Two: Review of the Literature

Measurement is assessing and recording observations of things that happen all around us. Everybody uses measurement at some point in their lives. Measurement is observing how fast a car is going, how much of a particular medication is being administered, or how many onions should be added to make the perfect spaghetti sauce. Measurement occurs when a person is concerned with the outcome of what they are doing. Many people will use methods of measurement to provide security and structure to their lives. Measuring something makes it credible and important, as people often measure things that matter in some way or another.

Those who are interested in applied forms of measurement know that there are many different types of measurement models available, including factor analysis, general linear models, regression, item response theory, and psychometrics. Often, the challenge with measurement is finding the measurement model that fits with what it is that an individual is interested in measuring. In this study, the researcher was interested in measuring human performance, rating behaviour, and item difficulty and fit. A model that employed a multi-facet approach seemed most appropriate for the intent of this research. After investigating different measurement models that could be used to answer the research questions put forth by the researcher involved in this study, it appeared that Generalizability Theory and the Many-Facet Rasch Model were the most appropriate.

Rasch Measurement

Rasch measurement can be used to measure any aspect of a situation. The greatest advantage of employing the Rasch model is how flexible and successful it has been in analyzing data over the years (Fox & Jones, 1998; Kim & Wilson, 2010; Liu, Minsky, Ling, & Kyllonen, 2009; MacMillan, 2000a; McHorney, Haley, & Ware, 1997).
One thing that allows the Rasch model to be so effective is that the model controls the researcher's thinking (Linacre, 2010). Due to the expectation that the data must fit the model, large sample sizes are not required for Rasch analysis. Researchers are required to examine how well the data fits the model by examining the differences between the observed scores and the expected scores (Lochhead, 2009). McHorney, Haley, and Ware (1997) explain that when data are missing, researchers can use the expected score information to calculate the missing data. The robust nature of the Rasch model seems highly appropriate to application selection because university programs never know exactly how many applicants are going to apply to the program each year. The principles of Rasch measurement are advantageous in situations, like applicant selection, where it may not be feasible for all raters to evaluate every applicant on each item of measurement.

Fox and Jones (1998) explain that a Rasch analysis produces a set of Infit and Outfit values, also referred to as Fit statistics, for all facets of a data set. This means that estimated parameters were calculated for each applicant, item, and rater involved in this research study. These Fit statistics are useful to researchers as they provide information about which applicants, items, or raters behaved unexpectedly (Fox & Jones, 1998). Fit statistics have an expected mean-square value of 1.0 and a positive infinite range (Linacre, 1995). Mean-square values that are greater than 1.0 are labelled "underfit", while mean-square values less than 1.0 are labelled "overfit" (Linacre, 2010). There is no agreed-upon range for mean-square values; however, Wright and Linacre (1994) state that a reasonable range would be between 0.6 and 1.4, although it can vary depending on what exactly is being measured. Engelhard (1992) suggests that an acceptable range for Infit and Outfit statistics is 0.5 to 1.5. Linacre (1999) warns researchers to observe extreme caution with Fit statistics that are greater than 2.0 because such values are so unexpected that there is hardly any useful information that can be reliably inferred. R. M. Smith (2004) as well as others are deliberately conscious of paying attention to the t-tests, |t| > 2, that accompany Infit and Outfit values, as they feel the interpretation means more than a specific number generated by the computer software. Infit mean-square statistics provide information about unexpected inliers and examine the region where a person's ability generally is. Outfit mean-square statistics provide information about unexpected outliers and would be more sensitive to situations where a person answered a really easy item incorrectly or a really difficult item correctly (Linacre, 2010).

If for some reason the data were problematic, Fit statistics would be the first place the researcher would detect the problem. The Rasch model is more descriptive of problems that may not be observed if the researcher was using another model. For instance, an ANOVA could be used to provide the same sort of analysis; however, an ANOVA would have a difficult time adjusting for incomplete data and raw scores that have not been standardized (Lunz, Wright, & Linacre, 1990). Linacre (1997) suggests that even incomplete data are not a problem for the Rasch model because usually a best estimate can be calculated. It appears that at the present time, the Rasch model is the optimum tool used for measurement in the human sciences, more specifically Education and Health Sciences (Bond & Fox, 2007).
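To make the fit calculations concrete, a small sketch is given below. It is not the computation performed by the FACETS program, the function name mean_square_fit is my own, and the numbers are invented; it simply shows how infit and outfit mean squares can be formed once observed scores, Rasch-expected scores, and model variances are in hand.

def mean_square_fit(observed, expected, variances):
    # observed:  the scores actually awarded
    # expected:  the scores the Rasch model predicts for the same observations
    # variances: the model variance of each observation
    # Outfit is the unweighted mean of the squared standardized residuals;
    # infit weights each squared residual by its model variance (information).
    z_squared = [(x - e) ** 2 / v for x, e, v in zip(observed, expected, variances)]
    outfit = sum(z_squared) / len(z_squared)
    infit = sum((x - e) ** 2 for x, e in zip(observed, expected)) / sum(variances)
    return infit, outfit

# Hypothetical ratings on one item: observed scores, model expectations, model variances.
infit, outfit = mean_square_fit([4, 3, 5, 2, 4],
                                [3.6, 3.1, 4.2, 2.8, 3.9],
                                [0.8, 0.9, 0.7, 0.9, 0.8])
print(infit, outfit)  # values near 1.0 indicate responses behaved much as the model expected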
Dichotomous Rasch Model. When Georg Rasch (1980) initially described the basis of his model in 1953, he used correct/incorrect, true/false, and yes/no examples to illustrate a person's responses to individual items. Rasch hypothesized that statistics generated from the testing process could be filtered down to person ability and item difficulty. He believed that the probability of getting a correct answer was equal to the ratio between person ability and item difficulty. The dichotomous Rasch model is known for being the basic starting point for Rasch analysis. The dichotomous Rasch model for person ability and item difficulty is shown below:

\log_e\left(\frac{P_{ni}}{1 - P_{ni}}\right) = B_n - D_i

where
\log_e(P_{ni}/(1 - P_{ni})) = log-odds of person n succeeding on item i
P_{ni} = the probability that person n correctly answers item i
1 - P_{ni} = the probability that person n incorrectly answers item i
B_n = ability of person n
D_i = difficulty of item i (Linacre, 2010)

Georg Rasch employed a natural log-odds ratio that represents numeric values using two symbols, 1 (indicating success) and 0 (indicating failure). He used this idea to create an additive system, rather than a multiplicative system that was used by mathematicians previously (Wright, 1997). The Rasch model transforms qualitatively ordered data into interval data using the mathematical principles of logarithms. Logit scores are the units produced by this conversion. When P_{ni} = 1 - P_{ni} (the chance of success on an item equals the chance of failure on an item), the logit value is 0.0. The logit scale is an interval scale of measurement where each individual logit unit has meaning because of the measurable differences that can be observed (Bond & Fox, 2007). This transformation from ordinal data to interval data is necessary because researchers interested in measurement need observable measures, not raw scores, to make inferences about the data. Ordinal and interval data both reflect logical order. Hence, the main difference between ordinal and interval is that interval measures have equal units of measurement, and those equal units of measurement imply equal differences in value (Hurlburt, 2006). The reason why inferences cannot be made from ordinal data is because raw scores include unwanted parameters (Wright, 1997). Through Rasch analysis these raw scores are calibrated so that the logit scores create a distribution of person abilities and item difficulties that can be compared and measured (Lochhead, 2009).
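A minimal numerical sketch of the dichotomous model, using invented logit values and a hypothetical function name, may help illustrate the log-odds relationship above: the probability of success depends only on the difference between person ability and item difficulty.

import math

def probability_correct(ability, difficulty):
    # Dichotomous Rasch model: the log-odds of success is ability minus difficulty (in logits).
    log_odds = ability - difficulty
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

print(probability_correct(1.0, 1.0))  # ability equals difficulty: probability 0.50
print(probability_correct(2.0, 1.0))  # person one logit above the item: about 0.73
print(probability_correct(0.0, 1.0))  # person one logit below the item: about 0.27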
Rasch-Andrich Rating Scale Model. In 1978, David Andrich took the principles of the dichotomous Rasch model and created the rating scale model, which he believed was made possible through a series of Rasch dichotomies (Linacre, 2010). This extension of the dichotomous Rasch model was designed to deal with items that have more than two response options. A common application of this model is observed with the use of Likert scales (Strongly agree, Agree, Neutral, Disagree, Strongly disagree) where the distance between each response is designed to be the same (Linacre, 2010). The Rasch-Andrich rating scale model for person ability and item difficulty is shown below:

\log_e\left(\frac{P_{nij}}{P_{ni(j-1)}}\right) = B_n - D_i - F_j

where
P_{nij} = probability of person n scoring at level j on item i
P_{ni(j-1)} = probability of person n scoring at level j-1 on item i
B_n = ability level of person n
D_i = difficulty of item i
F_j = difficulty of the step from level j-1 to j (Linacre, 2010)

Andrich (1996) explained that the scaled responses are similar to dichotomous measurement; however, the only difference is that the rating scale model partitions responses into intervals across the latent unidimensional linear construct.

Rasch-Masters Partial Credit Model. By 1982, Geoff Masters advanced the rating scale beyond the work of Andrich to reflect partial credit for partial correctness, specific to each individual item (Linacre, 2010). According to Masters (1982), the rationale for creating partial scoring is to provide as much information as possible about a person's overall ability. Think of being asked to solve a math problem that requires five steps. Say a person calculated all the steps correctly with the exception of a slight calculation error in the final step, causing them to get the final answer wrong. If the question was scored using a dichotomous model, the person would get the question wrong because they failed to get the final answer correct. With a partial credit model, the person would be able to get 4/5, suggesting that he or she was close to getting the final answer correct. In this version of the rating scale, each individual item has the freedom to vary in its number of estimates (Lochhead, 2009). The Rasch-Masters partial credit model for person ability and item difficulty is shown below:

\log_e\left(\frac{P_{nij}}{P_{ni(j-1)}}\right) = B_n - D_i - F_{ij}

where
P_{nij} = probability of person n scoring at level j on item i
P_{ni(j-1)} = probability of person n scoring at level j-1 on item i
B_n = ability level of person n
D_i = difficulty of item i
F_{ij} = difficulty of the step from level j-1 to j in the rating scale specific to item i (Masters, 1982)

This equation shows that each item has its own rating scale that is specific to the difficulty level of that particular item. If the researcher was looking solely at how the applicants were performing on various items, then a partial credit model could be used. However, because the applicants were being rated by faculty and students on these various items, the Many-Facet Rasch Model needs to be employed.
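The sketch below, with invented threshold values and a hypothetical function name, shows how the rating scale model turns the difference between a person and an item into probabilities for each category of a 5-point scale. Under the partial credit model the same calculation applies, except that each item would carry its own list of thresholds.

import math

def category_probabilities(ability, difficulty, thresholds):
    # Rasch-Andrich rating scale model: the probability of each category is proportional
    # to the exponential of the cumulative sum of (ability - difficulty - threshold).
    numerators = [1.0]  # bottom category; its cumulative sum is zero
    cumulative = 0.0
    for step in thresholds:
        cumulative += ability - difficulty - step
        numerators.append(math.exp(cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical values: a 5-point scale has four step thresholds (in logits).
probabilities = category_probabilities(ability=1.0, difficulty=0.3,
                                       thresholds=[-1.5, -0.5, 0.5, 1.5])
for rating, p in zip([1, 2, 3, 4, 5], probabilities):
    print(rating, round(p, 2))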
Many-Facet Rasch Model. Research over the last couple of decades has demonstrated that the Many-Facet Rasch Model (MFRM) can be successfully applied in various settings (Chang & Chan, 1995; Engelhard, 1992; Kim & Wilson, 2010; Linacre & Wright, 2004; MacMillan, 2000a; Smith & Kulikowich, 2004). The MFRM is an extension of the original Rasch measurement model as it goes beyond person ability and item difficulty to measure other factors that interact with a testing situation. The MFRM was developed from the work of John Michael Linacre (Bond & Fox, 2007). An example of what a three-facet Rasch model would look like is featured below:

\log_e\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k

where
P_{nijk} = probability of examinee n being graded k by judge j on item i
P_{nij(k-1)} = probability of examinee n being graded k-1 by judge j on item i
B_n = performance measure of examinee n
D_i = difficulty of item i
C_j = severity of judge j
F_k = difficulty of grading step (category) k relative to step (category) k-1 (Lunz, Wright, & Linacre, 1990)

Using the Many-Facet Rasch Model, Chang and Chan (1995) examined the functional independence measure of patients in a stroke rehabilitation program, while MacMillan (2000b) applied it towards assessing Curriculum Based Measurement reading scores. Nevertheless, one of the places where the Many-Facet Rasch Model seems to be most successful is with studies that have observed raters and judges (Engelhard, 1994; Linacre, 1997; Lunz, 1999; Linacre, Wright, & Lunz, 1990; O'Neill, 1999). It is worthwhile to include rater behaviour in the assessment of a person's overall ability because the severity of individual raters could have a significant impact on how a person's ability is articulated.
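To show what the severity term adds, the sketch below (again with invented values and a hypothetical function name) extends the category calculation so that a rater's severity is subtracted alongside item difficulty. The expected rating an applicant receives then shifts with the rater they happen to draw, which is exactly the kind of variation the model is intended to identify and adjust for.

import math

def expected_rating(ability, difficulty, severity, thresholds):
    # Three-facet model: the linear component is ability - difficulty - severity.
    numerators = [1.0]
    cumulative = 0.0
    for step in thresholds:
        cumulative += ability - difficulty - severity - step
        numerators.append(math.exp(cumulative))
    total = sum(numerators)
    # Categories are scored 1 to 5, so weight each category by its model probability.
    return sum((k + 1) * n / total for k, n in enumerate(numerators))

thresholds = [-1.5, -0.5, 0.5, 1.5]  # hypothetical 5-point scale structure
print(expected_rating(1.0, 0.3, -0.5, thresholds))  # lenient rater: higher expected rating
print(expected_rating(1.0, 0.3, 0.5, thresholds))   # severe rater: lower expected rating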
The Effect of a Rater's Influence

Human beings all perceive the world subjectively through their own set of standards. This subjectivity is significant in the context of looking at rater behaviour because an individual's rating of another person's ability can be difficult to measure. Some studies, like the one conducted by Liu and colleagues (2009), have put forward the unsubstantiated view that as long as raters share a similar background and are motivated by what they are doing, then rater variability is not an issue. Other studies carried out by Engelhard (1994) and O'Neill (1999) clearly demonstrate that each rater differs significantly based on his or her own personal standards of excellence. Research in this area has demonstrated that providing training to raters does not alter rater evaluations because each rater's level of severity seems to be engrained in his or her personal view of what is being assessed (Lunz, Wright, & Linacre, 1990). However, some researchers state that training raters to become aware of the effects, biases, and errors involved in the process of rating minimizes rater errors in a variety of settings (Edward W. Wolfe, personal communication, April 27, 2010).

Leniency/Severity effect (or generosity error). Leniency is used to describe the behaviour of individuals who rate above the midpoint of the scale, while severity is used to describe the behaviour of those who rate below the midpoint of the scale (Myford & Wolfe, 2004a). Myford and Wolfe (2004a) describe these effects as tending to occur when raters know, or have been able to identify in some way with, whom they are rating. There are a few ways that a researcher could detect rater effects from a Rasch analysis. Myford and Wolfe (2004b) suggest looking at the following statistical indicators for signs of rater leniency or severity effects: fixed chi-square values for raters, rater separation ratio and reliability, the Wright variable map, rater severity, and rater fair average measures. Alternatively, Engelhard (1994) suggests looking at the Fit statistics for each rater; Fit statistics could be used to determine how much each rater would need to be calibrated.

Central tendency effect. Central tendency effect is when a rater avoids using the outermost categories (Linacre, 2010). It can also occur when a rater overuses the middle categories (Myford & Wolfe, 2004a). Central tendency effects often occur because the rater is afraid to make a mistake. The problem with raters who overuse the middle categories is that every applicant they rate looks average and thus their ratings become information poor. Myford and Wolfe (2004b) suggest looking at the following statistical indicators for signs of central tendency effects: fixed chi-square values for applicants, applicant separation ratio and reliability, rater Fit statistics, and rating scale category thresholds and probability curves for raters.

Restriction-of-range effect. Range restriction is closely related to central tendency in that it occurs when all of the raters avoid using certain categories or have scores that are clustered together, not necessarily around the midpoint (Myford & Wolfe, 2004a). One of the problems with restriction-of-range effects is that the rating scale is not fully represented, which means that the item on which an applicant is being rated is using a different rating scale than initially intended. Myford and Wolfe (2004b) suggest looking at the following statistical indicators for signs of restriction-of-range effects: the standard deviation across all applicants for a specific item, ANOVA interactions, frequency counts for how many times each rater used each point on the scale, and probability curves.

Halo effect. Halo effect can occur when the evaluation given by the first rater influences the ratings given by the next rater (Linacre, 2010). Engelhard (1994) defines the halo effect as the rater assessing a person's ability holistically rather than on an item-by-item basis. Myford and Wolfe (2004a) suggest that out of all the rater effects, the halo has been most studied and received the largest amount of attention in the literature. Halo effects result when a rater is unable to separate independent aspects of a person's behaviour. Myford and Wolfe (2004b) suggest looking at the following statistical indicators for signs of halo effects: fixed chi-square values for items, item separation ratio and reliability, rater Fit statistics, and rater-by-item interaction analysis.

Myford and Wolfe (2004a) have identified other rater effects that are less likely to be encountered, such as randomness, inaccuracy, logical error, contrast error, influences of rater biases, beliefs, attitudes, and personality characteristics, influences of rater/applicant background characteristics, proximity error, primacy error, and order effects. These effects are also less prevalent in the research literature, probably because they are more difficult to measure. With a better understanding of how the effects of a particular rater can bias the evaluation of individuals, researchers are better able to attain a more accurate account of a person's true abilities. Without some measure adjusting for rater variation, a person's ability score could be heavily influenced by whether the person rating them was severe or lenient (Lunz et al., 1990). It would be ideal if all raters on a committee could rate each individual separately in a fully crossed design; however, in many cases that is not the most feasible approach considering the quantity of applications certain programs receive and the fast turnaround time that is generally required.
According to Linacre (1997), fully crossed designs are not essential: "the only requirement on the judging plan is that there be enough linkage between all elements of all facets that all parameters can be estimated without indeterminacy with one frame of reference" (p. 1). Therefore, in situations where it is not feasible to have every rater evaluate all applicants, Rasch analysis would still be possible as long as there is some overlap among raters.

Generalizability Theory

Over the last decade there have been countless studies carried out that have assessed and analyzed a person's ability using Generalizability Theory (Atilgan, 2008; Oosterveld & ten Cate, 2004; Pedersen, Hagtvet, & Karterud, 2007; Winne et al., 2006). Atilgan (2008) used G-Theory to assess and score applicants who would later be selected as students for music education programs. Oosterveld and ten Cate (2004) used G-Theory to assess the applicant selection procedure for students wishing to attend medical school. G-Theory describes how reliable one person's score is when generalized to the greater universe of all scores. G-Studies are designed to provide researchers with information about the sources of variation that contribute to error in measurements by providing estimates for each source of error (Shavelson & Webb, 1991). By examining all of the identified main effects and interactions of facets that are involved in a study individually, researchers are able to account for the unexplained sources of variability, and thus produce a G coefficient that reflects the true amount of variance associated with a person's score (Shavelson & Webb, 1991). Every G-Theory analysis produces a set of G coefficients, one for relative decisions and one for absolute decisions.

Relative and absolute decisions. Relative and absolute decisions are made based upon how the researcher wishes to generalize his or her findings. Relative decisions involve interpreting an individual's overall placement in relation to other individuals (Kieffer, 1999; Shavelson & Webb, 1991). Within this particular study, which looked at the application selection process of MEd-Counselling applicants, all the G-Study decisions would be relative given that the researcher was looking at how well applicants, raters, and items place in relation to each other. Absolute decisions are made when an individual's level of performance is determined by achieving a minimal level regardless of how it sits in relation to other individuals (Kieffer, 1999; Shavelson & Webb, 1991). A good example of an absolute decision within G-Theory analysis is how individuals are evaluated for their learner driver's license; once an individual gets 40 questions correct, they pass regardless of how many other people also got 40 questions correct on that day.

Random and fixed facets. A crucial concept associated with G-Theory is the distinction between random and fixed facets. A facet is considered random if the researcher is willing to interchange one for another (Kieffer, 1999). For example, say that a researcher interested in examining rater behaviour has selected two second year counselling students to rate applications from the population of all second year counselling students. If the researcher was willing to exchange those two students initially selected for two other students from the population of second year counselling students, then the rater facet would be considered random.
Theoretically in this case, the rater behaviour of the students selected could be generalized to the population of which the students were drawn from. A facet is described as fixed when the researcher has included exactly those in the area of interest and is not willing to interchange them for another (Kieffer, 1999). Using the example mentioned previously, if 21 the researcher was inflexible with the students who were selected as raters, then the rater facet would be considered fixed. Gender (male and female) is another example of a fixed facet as there are no other genders of interest available to the researcher. Decision studies. Within the framework of G-Theory the researcher takes on the role of a decision maker. The first part of a G-Study is to estimate variance components that would allow the researcher to examine the differences between an observed score and the universe of all possible scores (Matt, 2010). The second part of a G-Study is to make decisions about the optimal measurement condition which would allow the researcher to sufficiently generalize the results obtained from the G-Study (Pedersen, Hagtvet, & Karterud, 2007). Decision studies provide indicators for how a study could be refined and better developed in the future. Within the context of this study, the decision studies would reflect the number of raters and items that would be required to make the results of the study reliable based on the variance components of the G-Study. Rasch Meets Generalizability Theory Linacre (1989) was one of the first researchers to put the Many-Facet Rasch Model (MFRM) together with Generalizability Theory (G-Theory). MacMillan (2000b) introduced the combination of Classical Test Theory, MFRM, and G-Theory. MacMillan found that all approaches were effective in detecting rater variability. However, the MFRM identified more variation than G-Theory did. G-Theory considers not only the facets, but also interactions among the facets whereas the MFRM assumes there are no interactions and treats each facet independently (MacMillan, 2000b). Countless researchers have been exploring the relationships between multiple methods to enhance the ability to fully answer research questions. The combination of the MFRM and G-Theory became popularized 22 around 2004/2005. Using the MFRM and G-Theory combination, Smith and Kulikowich (2004) assessed the problem-solving skill level of fourth grade students. They found it useful to use both MFRM and G-theory with the same data set because although both measurement models provide information about variability that exists among facets, the approach used to obtain such information differed greatly (Smith & Kulikowich, 2004). Linacre and Wright (2004) published a book chapter that explored the construction of the MFRM and G-Theory by using data that required judges to rate examinees on various items. Linacre and Wright (2004) concluded: Generalizability theory (G-Theory) and Many-facet Rasch measurement (MFRM) appear to be competing methodologies aimed at solving the same empirical problems. But this is not the case. Though the data sets specified by the two methodologies may be similar, or even identical, their purposes are fundamentally different, (p. 312) Furthermore, G-Theory attempts to correct for variability in future sets of data collection, while the MFRM attempts to correct for variability in the current set of data (Linacre & Wright, 2004). 
Given the nature of the present study, the Rasch model seems to fulfill the needs of the institution better than G-Theory at this time. Sudweeks, Reeve, and Bradshaw (2005) suggest that G-Theory is a holistic approach to analyzing data as it examines main effects and interactions, while the MFRM is a narrow approach that focuses on the individual facets for analysis. Both G-Theory and the MFRM have strengths when analyzing data, and Sudweeks and colleagues (2005) propose that using both forms together yields a more comprehensive analysis of what is really occurring within a given data set. Most recently, Kim and Wilson (2010) used data from individuals assessing student writing ability to expand on the notion that the MFRM and G-Theory are more than alternative methodologies to measuring data. Although both measurement methodologies will provide 23 information about the applicants, items, and raters, Kim and Wilson (2010) suggest that the researcher needs to clearly outline what exactly it is that they are interested in studying. If the researcher is interested in looking at how groups are behaving, then G-Theory would be most helpful; however, if the researcher is concerned with individual performance, then the MFRM or another item response model should be employed. Precise and accurate measurement requires the right set of tools. Many measurement models, like the Rasch model, have evolved to meet the needs and demands of the facets that individuals want to measure. Although the expansion of research and its associated literature has grown to the point where researchers can measure almost anything, error and bias are still likely to occur. The more complex the object that researchers are set out to measure, the more creative researchers need to be in their measurement approach. Combining different measurement methodologies is one way in which researchers can gain a more holistic approach and capture the true representation of what they wish to measure. 24 Chapter Three: Research Methods The Many-Facet Rasch Model (MFRM) and Generalizability Theory (G-Theory) are two highly researched and well established measurement methodologies that will be used to assess the MEd-Counselling application selection process. These two methodologies were particularly applicable considering the intended purpose of this study. To begin the analysis, it was interesting to look at how well the items used as pre-admissions criteria were measuring an applicant's overall level of ability. Another worthwhile area to examine was the rating behaviour of the members on the application selection committee from both individual and group perspectives. The information generated from the evaluation of specific raters would determine how much each rater influences a particular applicant's chances of being offered a letter of acceptance. Research has found that most institutions are willing to accept that there are differences in scoring from one rater to another and they may even try to correct for this by selecting raters who share a similar background (Liu, Minsky, Ling, & Kyllonen, 2009). However, studies examining interrater reliability argue that regardless of any attempts made, there still exist significant differences between raters (Lunz, Wright, & Linacre, 1990). The present study tried to examine exactly where these differences were located. 
For instance, looking at whether rater differences were exhibited among genders (male and female), education level (faculty and students) or discipline (counsellors and non-counsellors) could have important implications for determining which individuals should sit on the selection committee in the future. Finally, using the MFRM and G-Theory to measure these differences alleviated any concerns about the overall effectiveness of the application selection process and ensured fairness for all applicants applying to the program. 25 Participants Raters. The participants for this study were three faculty members and two graduate students. All of the participants agreed to take part in this study when they signed a consent form to participate as a rater (to view the consent form please see Appendix B). The faculty sample consisted of three faculty members from the School of Education. All three faculty members teach in the counselling program, although only two out of the three faculty members were educated as counsellors. The graduate students consisted of two graduate students currently in the MEd-Counselling Program. The two graduate students were both near completion of their degrees and are expected to have graduated before the successful applicants enter the program in September 2010. The researcher is one of the two students; the other student was approached after agreement between the supervisor and the researcher. In order to ensure anonymity of applicants for the purposes of this research, all application packages were stripped and coded for the participants by the Chair of the selection committee. The five participants agreed to examine the data from individuals who applied to the MEd-Counselling program for intake in September 2010. Following the UNBC's established procedure, all applications were collected by the University's registrar's office. Applicant packages of individuals who met the GPA requirement (as well as those individuals who did not meet the GPA requirement, but were specifically requested by the Counselling Coordinator) were then forwarded to the Counselling Coordinator in the School of Education who checked them over and prepared them for the selection committee. Applicant pool. The probable applicant pool consisted of applicants from two campuses: Prince George and the Northwest region. The population applying was roughly 80% female and 20% male. Applicants ranged in age from 22 years old to 55 years old. Most applicants 26 obtained a Bachelor's degree in Psychology, Social work, Criminology, or Education. Finally, the level of relevant work experience of those applying to the program ranged from volunteer experience in the helping arena to those who have been employed in the counselling field for over 30 years. Issues of access to applicant pool. There were no issues of access to the applicant data that the participants rated as consent was inherent with the application (implied consent). Individuals who applied to the University have given consent for the information they supply to be used for research purposes when they apply to the program ("UNBC Graduate Calendar," 2009). Ethical Considerations Ethical issues are always a critical consideration when conducting research. The challenge of ethics with this research was ensuring that the principle of confidentiality was honoured for the protection of the applicant data set. 
To ensure that confidentiality was maintained throughout this research study, identification numbers rather than names were used to code all of the applicant packages. The Chair of the committee stripped and coded the data so that all raters were given application packages with no identifiers. The average number of applications that the Counselling Program in the School of Education had typically received in past years was roughly 30-40, suggesting that there should be no added risk of successful applicants being later identified by the student raters. Both of the student raters used in this study have successfully completed the counselling ethics course at UNBC and are fully expected to uphold the values of confidentiality set out by the licensing organization, Canadian Counselling and Psychotherapy Association; this includes maintaining confidentiality in both therapeutic and research settings. Furthermore, 27 both of the students who rated applications have completed all course requirements and are expected to graduate before September 2010; therefore, neither of the student raters would be expected to encounter any of the successful applicants as a student in the future. Instruments An application package consisted of (a) a grade-point-average (GPA), (b) relevant degree information (c) written evidence of involvement with people in appropriate settings, (d) a written personal statement, and (e) three letters of reference. This pre-admissions criterion was developed by faculty members from UNBC's School of Education. Corey, Corey, and Callanan (2003) recommended all these pre-admissions criteria as suitable for screening potential students for a counselling program. For the present study, the application packages contained all of the same information that was collected in the previous years to select counselling students. Members of the Counselling Program application selection committee rated all of the information, with the exception of GPA, on a series of 5-point scales. Every member of the selection committee, as well as both of the student raters, rated and scored the entire application package for each applicant that applied to study at the Prince George campus. An overall score based on all of the application criteria was used to rank the applicants. The rank ordering information generated throughout this study was given to the faculty members to use for their final decisions about who would be offered a seat in the program. Although the two student raters were participants in the research and rated the applicants for the purpose of this study, the final decisions as to letters of offer came from the admissions committee that did not include the student raters. 28 Procedures This study was reviewed and supported by the UNBC Research Ethics Board. The Research Ethics Board stated that ethics approval was not required as they found this study to be a typical program evaluation that would not interfere with the established protocol for selecting MEd-Counselling applicants (See Appendix C). Upon receiving this decision both verbally and in writing, the researcher contacted the Chair of the application selection committee who held all of the applications that were received. The Chair of the application selection committee had these applications photocopied with all identifiers removed for each of the committee members. 
All members of the selection committee, as well as the two student raters who participated in this research study, were given a brief training session in which the established selection process was explained. The researcher and her supervisor described the 5-point scales used for rating the applicants (see Appendix D). The researcher also briefly explained to the participants some of the most common rating behaviours that have been shown to be problematic. Each rater participating in this study was given copies of all the application packages that needed to be evaluated. The students' application packages were exactly the same as the ones given to the faculty selection committee members. All applications were read and scored within a two-week period, as agreed upon by the selection committee. Once all five raters finished scoring every application package, the packages were returned to the Chair of the selection committee. Each applicant's GPA and all rater scores were entered into an EXCEL file by the researcher.

Measures used for Analyzing Data
There are different statistical programs that could be used to analyze the data, such as Microsoft EXCEL, SPSS, or FACETS. For the purpose of this study, the data obtained from the raters were compiled in EXCEL and analyzed using FACETS (Linacre, 1996) and EDUG (Swiss Society for Research in Education Working Group, 2010).

Rasch analysis. The research design was a fully crossed three-facet Rasch analysis examining applicant ability (the quality of each counselling application), the difficulty of the items rated on the 5-point scale, and the severity of the raters on the selection committee. The many-facet Rasch rating scale model used for this analysis of the data set was the same one described earlier in the literature review. This measurement model was used for the analysis because it allowed the researcher to examine interactions between multiple facets. Each facet was examined to see the level of influence it had on the probability of a particular applicant scoring the way they did on specific items rated by various raters.

Generalizability analysis. In addition to the Rasch analysis, the research data were examined using a fully crossed two-facet Generalizability analysis. For the purpose of this study, item difficulties and rater behaviour are the two facets that were analyzed. The applicants are described as the objects of measurement in G-Theory; therefore, the three-facet Rasch study and the two-facet G-Theory study describe essentially the same design. A two-facet G-Theory design is associated with six sources of variability, which can be examined as follows (adapted from Shavelson & Webb, 1991, p. 9):

Source of Variability    Type of Variation
Persons (p)              Universe-score variance (object of measurement)
Raters (r)               Constant effect for all persons due to the stringency of raters
Items (i)                Constant effect for all persons due to differences in difficulty from one item to another
p * r                    Inconsistencies of raters' evaluation of particular persons' behaviour
p * i                    Inconsistencies from one item to another in particular persons' behaviour
r * i                    Constant effect for all persons due to differences in raters' stringency from one item to another
p * r * i, e             Residual consisting of the unique combination of p, r, and i; unmeasured facets that affect the measurement; and/or random events

G-Theory allowed the researcher to partition the different sources of variability that exist within measurement situations. This process was mathematically possible using the same logic and mechanics as a factorial ANOVA (Shavelson & Webb, 1991).
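To make this partitioning concrete, the following is a minimal sketch of how variance components for a fully crossed persons x raters x items design can be estimated from mean squares using the standard random-effects expected-mean-square equations described by Shavelson and Webb (1991). It is not the EDUG implementation used in this study, and the simulated ratings, facet sizes, and effect sizes in the example are hypothetical.

```python
import numpy as np

def g_study(x):
    """Estimate variance components for a fully crossed p x r x i design
    with one observation per cell (random-effects model)."""
    n_p, n_r, n_i = x.shape
    grand = x.mean()
    m_p = x.mean(axis=(1, 2))   # person means
    m_r = x.mean(axis=(0, 2))   # rater means
    m_i = x.mean(axis=(0, 1))   # item means
    m_pr = x.mean(axis=2)       # person-by-rater means
    m_pi = x.mean(axis=1)       # person-by-item means
    m_ri = x.mean(axis=0)       # rater-by-item means
    # Mean squares from the usual three-way factorial ANOVA sums of squares
    ms_p = n_r * n_i * np.sum((m_p - grand) ** 2) / (n_p - 1)
    ms_r = n_p * n_i * np.sum((m_r - grand) ** 2) / (n_r - 1)
    ms_i = n_p * n_r * np.sum((m_i - grand) ** 2) / (n_i - 1)
    ms_pr = n_i * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2) / ((n_p - 1) * (n_r - 1))
    ms_pi = n_r * np.sum((m_pi - m_p[:, None] - m_i[None, :] + grand) ** 2) / ((n_p - 1) * (n_i - 1))
    ms_ri = n_p * np.sum((m_ri - m_r[:, None] - m_i[None, :] + grand) ** 2) / ((n_r - 1) * (n_i - 1))
    resid = (x - m_pr[:, :, None] - m_pi[:, None, :] - m_ri[None, :, :]
             + m_p[:, None, None] + m_r[None, :, None] + m_i[None, None, :] - grand)
    ms_pri = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1) * (n_i - 1))
    # Solve the expected-mean-square equations for the variance components
    v = {"pri,e": ms_pri}
    v["pr"] = (ms_pr - ms_pri) / n_i
    v["pi"] = (ms_pi - ms_pri) / n_r
    v["ri"] = (ms_ri - ms_pri) / n_p
    v["p"] = (ms_p - ms_pr - ms_pi + ms_pri) / (n_r * n_i)
    v["r"] = (ms_r - ms_pr - ms_ri + ms_pri) / (n_p * n_i)
    v["i"] = (ms_i - ms_pi - ms_ri + ms_pri) / (n_p * n_r)
    return v

# Hypothetical example: 49 applicants, 5 raters, 10 items scored 1-5
rng = np.random.default_rng(0)
p_eff = rng.normal(0, 0.4, size=(49, 1, 1))   # simulated applicant proficiency
r_eff = rng.normal(0, 0.1, size=(1, 5, 1))    # simulated rater severity
i_eff = rng.normal(0, 0.3, size=(1, 1, 10))   # simulated item difficulty
noise = rng.normal(0, 0.6, size=(49, 5, 10))
ratings = np.clip(np.round(3.9 + p_eff - r_eff - i_eff + noise), 1, 5)
for source, comp in g_study(ratings).items():
    print(f"{source:>6}: {comp: .4f}")
```

Because the ratings are rounded and clipped to the 1-5 scale, the recovered components are slightly attenuated relative to the simulated effects; the point of the sketch is only to show how each source of variability in the list above maps onto a variance component.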
Chapter Four: Results

Many-Facet Rasch Analysis
The results of the many-facet Rasch analysis are shown in Figure 1. The far left column, titled "Measr," is the logit scale used to measure all of the facets within the design. The second column is the distribution of the applicants; most of the applicants were situated within the 0 to 2 region on the logit scale, indicating that they were proficient applicants. The third column contains the program status: full-time or part-time studies. The fourth column is the rater facet. Notice that all of the raters were positioned around the 0 logit mark. Raters above 0 logits would be considered more severe, while those below 0 logits would be considered less severe. The raters are examined individually later in this section. The fifth column represents the item difficulties; more difficult items appear in the positive logit region and less difficult items in the negative logit region. The item difficulties are also discussed later in this section. The final column, "S.1," shows the ratings on the 5-point scale.

Figure 1. Wright variable map for relationships among facets for Prince George applicants.

The many-facet Rasch analysis was completed using FACETS software, version 3.03 (Linacre, 1996). According to Linacre (2010), the data need to fit the Rasch model if they are to support linearity. A unidimensional Rasch analysis operates under the assumption that multiple observations can be viewed as one theoretical construct (Bond & Fox, 2007). By conducting a many-facet Rasch analysis, the researcher was able to see that all of the criteria items used to assess prospective MEd-Counselling students (degree, writing ability, goals, work experience, referee quality and suitability) fit within a unidimensional construct. A summary of the item characteristics and their facet statistics is located in Table 1.

Table 1
Items Measurement Report

Obsvd Avg  Fair Avg  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd  Nu  Item
3.8        2.96        .26       .08         0.6       -4       0.6       -4    1  Degree
3.8        3.02        .19       .08         1.0        0       1.0        0    2  Writing Ability
4.1        3.38       -.24       .09         0.9        0       0.9        0    3  Fit of Goals
3.4        2.40        .84       .07         1.5        4       1.5        4    4  Work Experience
4.4        3.82       -.87       .10         1.2        1       1.1        1    5  R1:Suitability
3.9        3.08        .13       .08         0.9        0       1.0        0    6  R1:Quality
4.3        3.66       -.63       .09         1.2        2       1.2        2    7  R2:Suitability
3.7        2.84        .38       .08         0.8       -2       0.8       -2    8  R2:Quality
4.2        3.46       -.35       .09         1.2        1       1.2        1    9  R3:Suitability
3.8        2.93        .30       .08         0.8       -2       0.8       -2   10  R3:Quality
Adj S.D. .48   Separation 5.67   Reliability .97
Fixed (all same) chi-square: 320.7, d.f.: 9, significance: .00
Random (normal) chi-square: 9.0, d.f.: 8, significance: .34

Recall from the literature review that Engelhard (1992) suggests an acceptable range for Infit and Outfit statistics of 0.5 to 1.5. R. M. Smith (2004), as well as others, also pays deliberate attention to the accompanying t-tests, flagging values with |t| > 2. High Infit and Outfit statistics may be viewed as an indication of multidimensionality. The work experience item had the highest Infit and Outfit values (1.50, with t = 4); this value borders on what is considered the acceptable range for Infit and Outfit statistics.
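As a rough illustration of how these mean-square fit statistics are formed (this is not the FACETS computation itself), Outfit is the unweighted mean of squared standardized residuals, while Infit weights each squared residual by the model variance of the observation. The observed ratings, expected scores, and model variances below are hypothetical values chosen only to show the calculation.

```python
import numpy as np

def fit_mean_squares(observed, expected, variance):
    """Outfit (unweighted) and Infit (information-weighted) mean-square
    fit statistics for one element (e.g., one item) across observations."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)
    z2 = (observed - expected) ** 2 / variance                    # squared standardized residuals
    outfit = z2.mean()                                            # sensitive to outlying ratings
    infit = ((observed - expected) ** 2).sum() / variance.sum()   # information-weighted
    return infit, outfit

# Hypothetical ratings of one item, with model-expected scores and variances
obs = [4, 5, 3, 4, 2, 5, 4, 3]
exp = [3.8, 4.2, 3.5, 3.9, 3.4, 4.1, 3.7, 3.6]
var = [0.9, 0.7, 1.0, 0.9, 1.0, 0.8, 0.9, 1.0]
infit, outfit = fit_mean_squares(obs, exp, var)
print(f"Infit MnSq = {infit:.2f}, Outfit MnSq = {outfit:.2f}")
```

Values near 1.0 indicate that the observed ratings vary about as much as the model predicts; values well below 1.0 indicate overly predictable ("information poor") ratings, and values well above 1.0 indicate erratic ratings.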
Typically, one misfitting item would not be considered an indication of a lack of unidimensionality or model fit. However, ongoing discussion between the primary researcher and the other raters indicated differing views of the work experience and writing ability items. All raters were aware that there were two types of applicants: those applying for full-time studies and those applying for part-time studies. For this reason, the applicant pool was divided according to full-time or part-time status and the item analysis repeated.

Applicant pool. By dividing the applicants into two groups, those wishing to pursue full-time studies and those wishing to pursue part-time studies, it became clear that two populations existed within the applicant sample. The item summary for the full-time applicants is featured in Table 2 and the item summary for the part-time applicants in Table 3.

Table 2
Full-time Applicants Items Measurement Report

Obsvd Avg  Fair Avg  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd  Nu  Item
3.9        3.12        .08       .10         0.5       -5       0.5       -5    1  Degree
3.9        3.15        .04       .10         1.0        0       1.1        0    2  Writing Ability
4.1        3.35       -.24       .10         1.0        0       1.0        0    3  Fit of Goals
3.0        1.96       1.42       .08         1.2        1       1.2        1    4  Work Experience
4.4        3.76       -.89       .12         1.2        1       1.2        1    5  R1:Suitability
3.8        3.06        .16       .10         1.0        0       1.0        0    6  R1:Quality
4.3        3.68       -.76       .11         1.2        1       1.2        1    7  R2:Suitability
3.7        2.81        .46       .09         0.9        0       0.9       -1    8  R2:Quality
4.2        3.59       -.61       .11         1.2        2       1.2        1    9  R3:Suitability
3.7        2.91        .34       .10         0.9       -1       0.9       -1   10  R3:Quality
Adj S.D. .64   Separation 6.33   Reliability .98
Fixed (all same) chi-square: 454.5, d.f.: 9, significance: .00
Random (normal) chi-square: 9.0, d.f.: 8, significance: .34

The misfit for the work experience item disappeared in both analyses. However, a new feature became apparent. The full-time applicants found the work experience item to be the most difficult item (logit score = 1.42) and were rated lowest on this item, while the part-time applicants found the work experience item to be the least difficult (logit score = -1.38) and were rated highest on this item. An item's level of difficulty, for the purpose of this study, was defined by the logit measure, which indicates the difficulty of endorsement by the raters on each particular item. The "Obsvd Avg," shown in the far left column, gives the average of the raw observed scores. The second column, "Fair Avg," is the interval-based adjustment of the observed average score, calculated from the linear transformation of the raw score.

Table 3
Part-time Applicants Items Measurement Report

Obsvd Avg  Fair Avg  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd  Nu  Item
3.5        2.41       1.00       .17         1.0        0       1.0        0    1  Degree
3.7        2.56        .80       .17         1.0        0       1.0        0    2  Writing Ability
4.2        3.28       -.27       .19         0.9        0       0.9        0    3  Fit of Goals
4.6        3.91      -1.38       .24         1.1        0       1.1        0    4  Work Experience
4.5        3.71      -1.02       .22         1.2        1       1.2        0    5  R1:Suitability
4.0        3.01        .14       .18         1.2        1       1.2        1    6  R1:Quality
4.2        3.32       -.35       .19         1.3        1       1.3        1    7  R2:Suitability
3.9        2.86        .37       .18         0.7       -2       0.6       -2    8  R2:Quality
3.9        2.86        .37       .18         1.1        0       1.1        0    9  R3:Suitability
3.9        2.88        .34       .18         0.8       -1       0.8       -1   10  R3:Quality
Adj S.D. .70   Separation 3.63   Reliability .93
Fixed (all same) chi-square: 124.5, d.f.: 9, significance: .00
Random (normal) chi-square: 8.9, d.f.: 8, significance: .35

Among the part-time applicants, meeting the degree requirements appeared to be the most difficult item (logit score = 1.00), followed closely by level of writing ability (logit score = 0.80).
In contrast, degree requirements and writing ability were of average difficulty (0.08 and 0.04 logits) for the full-time applicants. This seeming lack of invariance of item difficulty was judged to be a legitimate population difference rather than a lack of fit of the data to the Rasch model. Examination of the fit statistics separately for applicants applying to full-time or part-time studies (Table 5) yielded the same number of applicants with Infit and Outfit mean squares outside what would be considered the acceptable range for fit statistics. When the applicants were viewed as two distinct populations, both of the outliers, applicant 17 (Infit = 1.70, t = 3; Outfit = 1.70, t = 2) and applicant 27 (Infit = 2.10, t = 4; Outfit = 1.90, t = 3), were from the population applying for full-time studies, and all of the applicants applying for part-time studies fit within the 0.50 to 1.50 range. However, when all of the applicants were considered as one population there were still two outliers: applicant 27 (Infit = 1.80, t = 3; Outfit = 1.70, t = 2), who applied for full-time studies, and applicant 48 (Infit = 1.80, t = 3; Outfit = 1.70, t = 2), who applied for part-time studies. Although examining the full-time and part-time applicants separately revealed some interesting results, there was no strong evidence to suggest that the two groups must be analyzed separately. The data demonstrate sufficient fit to the specified three-facet Rasch rating scale model. Therefore, the results featured below were generated from the 49 MEd-Counselling applicants competing for a seat at the Prince George campus.

Items. The mean-square fit indices have already been discussed in relation to unidimensionality. The items are now described in more detail. The items "R1, R2, R3: Suitability" are ratings of the suitability of the referees who provided a reference for the various applicants. Most referees were rated as well suited to comment on the appropriateness of the applicants. Conversely, the raters interpreted the referees' comments relatively severely, producing Fair Average measures of 2.84 to 3.08 on the corresponding quality items. Overall, the "appropriateness of the first degree," the "writing ability," and the "fit of the applicant's stated goals" with the nature of the counselling program were all items of average difficulty. The fixed (all same) chi-square of 320.7, df = 9, was statistically significant (p < .005). This simply indicates that the items differ from one another in their level of difficulty. Furthermore, all of the item scores together produced a separation ratio of 5.67 and a reliability coefficient of 0.97. The separation ratio expresses the spread of the item measures relative to the measurement error (Fisher, 1992), and the reliability coefficient is a measure of the consistency of the item differences. Since the separation ratio and reliability coefficient are both high, these results indicate that the ten items used to evaluate applicants vary in their level of difficulty, thus capturing a wide range of applicant suitability to the MEd-Counselling program.

Raters. The rater measurement report shown in Table 4 describes the behaviour of each of the five raters. All of the Infit and Outfit statistics for the raters fell within an acceptable range. Myford and Wolfe (2004) suggest that in situations that involve high-stakes decision making, the fit mean-square indices should be held to a more stringent range, in this case 0.8 to 1.2 for judges.
The Infit scores for the raters ranged from 0.80 to 1.30 and the Outfit scores ranged from 0.80 to 1.20. However, one faculty rater and one student rater both displayed ratings (Infit = 0.8; Outfit = 0.8) that would be considered "cramped" or "information poor" by the t-test criterion: -3 and -2 respectively for Infit, and -2 and -3 respectively for Outfit. The other student rater demonstrated the opposite rating behaviour (Infit = 1.30, t = 4; Outfit = 1.20, t = 3), suggesting that the ratings given by this rater were somewhat more erratic.

Table 4
Rater Measurement Report

Obsvd Avg  Fair Avg  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd  Nu  Rater
3.8        3.06        .15       .06         1.0        0       1.1        1    1  Faculty Counsellor (New)
3.9        3.18        .01       .06         0.8       -3       0.8       -2    2  Faculty Counsellor
3.9        3.11        .09       .06         1.0        0       1.0        0    3  Faculty Non-Counsellor
4.0        3.26       -.09       .06         0.8       -2       0.8       -3    4  Student Counsellor I
4.0        3.32       -.16       .06         1.3        4       1.2        3    5  Student Counsellor II
Adj S.D. .10   Separation 1.63   Reliability .73
Fixed (all same) chi-square: 18.1, d.f.: 4, significance: .00
Random (normal) chi-square: 4.0, d.f.: 3, significance: .26

The most severe rater (R1) had a measure of 0.15 and the most lenient rater (R5) had a measure of -0.16, a spread of roughly ±0.16 logits around the mean, with 2 S.E. being 0.12. This small spread suggests that the raters were fairly homogeneous when it came to rating the applicants. However, the fixed (all same) chi-square of 18.1, df = 4, was statistically significant (p < .005), which indicates some rater differences. The separation ratio of 1.63 and the reliability coefficient of 0.73 indicate that the raters were somewhat reliably different. This does not matter for this study because the researcher used a fully crossed design, but in situations where the design is not fully crossed it would be ideal for this reliability coefficient to be lower. The lower the reliability coefficient, the more confident the researcher can be in the results, as a reliability coefficient of zero indicates that there is no difference between any of the raters (Sudweeks et al., 2005).

Applicants. The data used for this study consisted of raw scores from 49 MEd-Counselling applicants across ten items (rated on a scale of 1-5). The results for the applicants are shown in Table 5.

Table 5
Applicant Measurement Report

Obsvd Avg  Fair Avg  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd  Nu  Applicant
4.6        4.66       2.69       .26         0.7       -1       0.7       -1    1  11001
4.0        3.99       1.16       .19         0.9        0       0.9        0    2  11002
4.2        4.26       1.64       .20         1.5        2       1.4        1    3  11003
4.2        4.26       1.64       .20         1.3        1       1.2        0    4  11004
3.9        3.89        .99       .18         1.3        1       1.3        1    5  11005
4.0        3.99       1.16       .19         0.9        0       0.9        0    6  11006
4.1        4.09       1.33       .19         1.0        0       1.0        0    7  11007
4.1        4.15       1.45       .20         0.9        0       0.9        0    8  11008
3.6        3.63        .59       .17         0.7       -1       0.8       -1    9  11009
3.8        3.77        .80       .18         1.0        0       1.0        0   10  11010
4.5        4.50       2.20       .23         0.7       -1       0.7       -1   11  11011
4.2        4.17       1.48       .20         0.9        0       0.9        0   12  11012
3.8        3.77        .80       .18         0.8       -1       0.8        0   13  11013
4.2        4.26       1.64       .20         1.2        0       1.3        1   14  11014
3.9        3.95       1.09       .18         0.6       -2       0.6       -2   15  11015
3.9        3.91       1.02       .18         1.0        0       1.1        0   16  11016
3.1        3.16       -.03       .15         1.5        2       1.4        2   17  11017
4.1        4.09       1.33       .19         1.0        0       1.0        0   18  11018
3.8        3.77        .80       .18         0.9        0       1.0        0   19  11019
4.1        4.13       1.41       .19         0.6       -2       0.6       -2   20  11020
3.6        3.62        .56       .17         1.0        0       1.0        0   21  11021
4.1        4.15       1.45       .20         0.8        0       0.9        0   22  11022
2.9        2.89       -.33       .15         1.0        0       1.1        0   23  11023
3.7        3.67        .65       .17         0.8        0       0.9        0   24  11024
4.1        4.07       1.30       .19         0.5       -2       0.6       -2   25  11025
3.3        3.28        .11       .16         0.7       -1       0.7       -1   26  11026
4.2        4.26       1.64       .20         1.8        3       1.7        2   27  11027
3.8        3.77        .80       .18         1.5        1       1.4        1   28  11028
4.0        4.05       1.26       .19         1.1        0       1.2        0   29  11029
3.6        3.63        .59       .17         0.8       -1       0.8        0   30  11030
3.5        3.52        .42       .16         0.9        0       0.9        0   31  11031
3.8        3.85        .92       .18         0.6       -2       0.6       -2   32  11032
3.6        3.60        .53       .17         1.4        1       1.4        1   33  11033
4.3        4.34       1.82       .21         0.6       -2       0.6       -2   34  11034
4.3        4.30       1.73       .21         0.8       -1       0.8       -1   35  11035
3.5        3.52        .42       .16         0.5       -2       0.6       -2   36  11036
3.9        3.87        .96       .18         0.8       -1       0.8       -1   37  11037
3.9        3.95       1.09       .18         1.2        0       1.1        0   38  21038
4.5        4.51       2.24       .23         1.2        0       1.2        0   39  21039
4.0        4.01       1.19       .19         1.2        0       1.1        0   40  21040
4.2        4.19       1.52       .20         0.7       -1       0.7       -1   41  21041
4.1        4.09       1.33       .19         1.5        1       1.4        1   42  21042
4.0        3.99       1.16       .18         1.3        1       1.3        1   43  21043
3.9        3.91       1.02       .18         0.7       -1       0.7       -1   44  21044
4.1        4.15       1.44       .19         1.1        0       1.1        0   45  21045
4.1        4.17       1.48       .20         1.3        1       1.2        1   46  21046
4.0        3.99       1.16       .18         0.8       -1       0.8        0   47  21047
3.9        3.95       1.09       .18         1.8        3       1.7        2   48  21048
3.9        3.93       1.05       .18         1.4        1       1.4        1   49  21049
Adj S.D. .52   Separation 2.79   Reliability .89
Fixed (all same) chi-square: 436.6, d.f.: 48, significance: .00
Random (normal) chi-square: 47.5, d.f.: 47, significance: .45

The Infit scores for the applicants ranged from 0.50 to 1.80, while the Outfit scores ranged from 0.60 to 1.70. As previously discussed, only two of the 49 applicants had Infit and Outfit scores that did not fall within what would be considered the acceptable range for fit statistics. The applicants' ability measures ranged from -0.33 to 2.69 logits (mean = 1.14, SD = 0.56). The fixed (all same) chi-square of 436.6, df = 48, is statistically significant (p < .005). The separation ratio of 2.79 and the reliability coefficient of 0.89 indicate that the applicants are moderately heterogeneous, but nevertheless separable. As discussed earlier, these results are likely due to the combination of two groups: the full-time applicants and the part-time applicants.

Generalizability Analysis
The Generalizability analysis was completed using EDUG software, version 6.0 (Swiss Society for Research in Education Working Group, 2010). The design was fully crossed, involving applicants (the object of measurement), raters (five individuals), and items (ten separate items). In the Generalizability analysis, this study is considered a two-facet fully crossed design, the two facets being the items and the raters. The applicants are not defined as a separate facet because they are considered the "object of measurement." In the Rasch analysis, the same study is considered a three-facet fully crossed design, where the items, raters, and applicants are all treated as distinct facets. Using some of the same principles as traditional ANOVA, G-Theory uses variance components to represent the amount of error that comes from generalizing from an observed score to a universal score (Swiss Society for Research in Education Working Group, 2010). These variance components are shown in Table 6.
Table 6
Estimated G-Study Variance Components

Source        Variance Component    %      df    SE
Persons (P)        0.095766        11.3    48    0.022414
Raters (R)         0.004273         0.5     4    0.003557
Items (I)          0.066511         7.9     9    0.037895
P*R                0.077503         9.2   192    0.007936
P*I                0.257369        30.4   432    0.021688
R*I                0.041624         4.9    36    0.010971
P*R*I              0.303342        35.8  1728    0.010314
Coef_G relative: 0.86
Coef_G absolute: 0.85

Variance components for items. The variance component for items reflects differences between each of the ten items. The variance component for the main effect of items was 0.067, accounting for approximately 8% of the total variance. Similarly, the variance component for the main effect of applicants was 0.096, accounting for approximately 11% of the total variance. Ideally, the variance component for applicants should be higher, indicating a more heterogeneous population. The more heterogeneous a population, the higher the value of the G coefficient: in homogeneous populations, raters have a harder time differentiating between applicants and thus typically produce lower G coefficient values. Nevertheless, the variance component for the applicant-by-item interaction was 0.26 (30.4%), which indicates that applicants were behaving differently from one item to the next.

Variance components for raters. The variance component for raters was 0.0043 (0.5%), indicating little variation due to rater differences. The rater-by-item interaction was also a comparatively small percentage of the overall variance, 4.9% (0.042), indicating that the raters used the scale consistently on each item. Interestingly, the variance component for the rater-by-applicant interaction was 0.078 (9.2%), suggesting that the raters used the same standards but disagreed on how they applied those standards to individual applicants. These results were further reinforced through written comments and personal communication between the primary researcher and each of the raters in the study.

Residual variance component. The residual variance component combines the unique person-by-rater-by-item combination with other, unmeasured sources of error. The residual should theoretically be small compared to all of the other variance components in a G-Study. Unfortunately, in this study the residual variance component accounts for the largest portion of variance: 0.30, or 35.8% of the total variance. This means that, after accounting for the variation in the main effects and the two-way interactions, roughly 36% of the variance remains unaccounted for. This result would be more worrisome if the design had not been fully crossed; in a less than fully crossed design, the two-way interactions would become sources of error, depending on the nature of the missing data.

G-Facets analysis. The results from this study produced a G coefficient of 0.86, which reflects the proportion of variance in applicants' scores attributable to the universal score. The Swiss Society for Research in Education Working Group (2010) suggests that an acceptable G coefficient is one that is greater than or equal to 0.80. According to this standard, the study produced a G coefficient that adequately supports the precision of the measures produced. The relative G coefficient was used as opposed to the absolute G coefficient because the behaviour of each individual applicant is viewed in relation to the behaviour of all the other applicants.
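For reference, the following minimal sketch shows how relative and absolute G coefficients are composed from the variance components of a fully crossed persons x raters x items random design, using the standard Generalizability Theory formulas (e.g., Shavelson & Webb, 1991). The variance components below are hypothetical placeholders rather than the EDUG estimates in Table 6, and the coefficients EDUG reports also depend on how the facets are declared in the software.

```python
def g_coefficients(var, n_r, n_i):
    """Relative and absolute G coefficients for a fully crossed
    persons x raters x items random design (standard formulas)."""
    rel_err = var["pr"] / n_r + var["pi"] / n_i + var["pri,e"] / (n_r * n_i)
    abs_err = rel_err + var["r"] / n_r + var["i"] / n_i + var["ri"] / (n_r * n_i)
    e_rho2 = var["p"] / (var["p"] + rel_err)   # relative (norm-referenced) coefficient
    phi = var["p"] / (var["p"] + abs_err)      # absolute (criterion-referenced) coefficient
    return e_rho2, phi

# Hypothetical variance components, for illustration only
components = {"p": 0.25, "r": 0.01, "i": 0.08,
              "pr": 0.05, "pi": 0.10, "ri": 0.02, "pri,e": 0.20}
rel, ab = g_coefficients(components, n_r=5, n_i=10)
print(f"Coef_G relative = {rel:.2f}, Coef_G absolute = {ab:.2f}")
```

The relative coefficient counts only person-by-facet interactions and the residual as error (rank-ordering decisions), whereas the absolute coefficient also counts the rater and item main effects and their interaction (decisions about absolute score levels), which is why the absolute value is never larger than the relative one.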
The G-Facets analysis showed relative G coefficient values that ranged from 0.81 to 0.85 for individual items and 0.79 to 0.86 for individual raters. Table 7 presents the G coefficient values associated with the G-Facets analysis for each individual rater and item.

Table 7
G-Facets Analysis

Facet    Level                       Coef_G rel.   Coef_G abs.
Raters   Faculty Counsellor (New)     0.836198      0.830243
Raters   Faculty Counsellor           0.790445      0.779478
Raters   Faculty Non-Counsellor       0.858087      0.848327
Raters   Student Counsellor I         0.817899      0.809834
Raters   Student Counsellor II        0.852909      0.849115
Items    Degree                       0.841459      0.828426
Items    Writing Ability              0.834080      0.819554
Items    Fit of Goals                 0.830586      0.814528
Items    Work Experience              0.815632      0.801002
Items    R1:Suitability               0.840492      0.833948
Items    R1:Quality                   0.832698      0.817089
Items    R2:Suitability               0.836034      0.829761
Items    R2:Quality                   0.813943      0.798169
Items    R3:Suitability               0.845146      0.835270
Items    R3:Quality                   0.816160      0.800983

Given that the Rasch analysis revealed interesting results when the applicants were divided according to full-time or part-time studies, separate G-Studies for full-time and part-time applicants were conducted. The G-Study summary for the full-time applicants is featured in Table 8.

Table 8
Estimated G-Study Variance Components for Full-Time Applicants

Source        Variance Component    %      df    SE
Persons (P)        0.120499        13.5    36    0.031017
Raters (R)         0.011866         1.3     4    0.008122
Items (I)          0.122905        13.8     9    0.066886
P*R                0.070837         7.9   144    0.008364
P*I                0.222118        24.9   324    0.021929
R*I                0.062909         7.0    36    0.016184
P*R*I              0.282136        31.6  1296    0.011075
Coef_G relative: 0.89
Coef_G absolute: 0.88

The G-Study summary for the part-time applicants is featured in Table 9.

Table 9
Estimated G-Study Variance Components for Part-Time Applicants

Source        Variance Component    %      df    SE
Persons (P)        0.013894         2.1    11    0.012038
Raters (R)         0.005114         0.8     4    0.006905
Items (I)          0.081811        12.1     9    0.051668
P*R                0.073803        10.9    44    0.015521
P*I                0.155859        23.1    99    0.030207
R*I                0.060311         8.9    36    0.019347
P*R*I              0.284411        42.1   396    0.020161
Coef_G relative: 0.48
Coef_G absolute: 0.47

When the applicants were divided according to whether they requested full-time or part-time program status, a G coefficient of 0.89 was observed for the full-time applicants and a G coefficient of 0.48 for the part-time applicants. These results suggest that the applicants who applied for full-time studies were moderately heterogeneous, while the applicants who applied for part-time studies were relatively homogeneous. Notice that the full-time applicants produced a persons variance component of 0.120499 (13.5%), while the part-time applicants produced a persons variance component of 0.013894 (2.1%). As mentioned previously, homogeneity presents a problem in Generalizability analyses because the goal of G-Theory is to describe the reliability of generalizing from a person's observed score to a universe of scores. If the sample selected for a G-Study is not representative of the population, it becomes difficult to generalize the results back to the population. The other consideration that arises from separating the applicants by program status is that only 12 applicants applied for part-time studies. Without a suitable number of persons in the sample, G-Theory cannot produce a G coefficient that supports the precision of the measures.

Decision studies. D-Studies were conducted as part of this study. D-Studies use data from a G-Study to provide information about the optimal conditions for future research designs (Shavelson & Webb, 1991).
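The core of a D-Study is simply re-evaluating the error terms for alternative numbers of raters and items. The sketch below projects the relative G coefficient over a grid of rater and item counts using the same standard random-model formula as in the previous sketch; the variance components are again hypothetical placeholders, not the EDUG estimates reported for this study.

```python
def projected_g(var, n_r, n_i):
    """Project the relative G coefficient for alternative numbers of raters
    and items (random-model D-Study for a crossed p x r x i design)."""
    rel_err = var["pr"] / n_r + var["pi"] / n_i + var["pri,e"] / (n_r * n_i)
    return var["p"] / (var["p"] + rel_err)

# Hypothetical variance components, for illustration only
components = {"p": 0.25, "pr": 0.05, "pi": 0.10, "pri,e": 0.20}
print("raters  items  Coef_G rel.")
for n_r in (1, 2, 3, 4, 5):
    for n_i in (6, 10, 16):
        print(f"{n_r:>6}  {n_i:>5}  {projected_g(components, n_r, n_i):.2f}")
```

Scanning such a grid is how a committee can judge, before the next intake, whether dropping a rater or adding items would push the projected coefficient above a chosen threshold such as 0.80.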
The results from the D-Studies reflect results for raters when the number of items is fixed at ten, and for items when the number of raters is fixed at five. Table 10 presents the D-Studies for the desired number of raters. Table 11 presents the D-Studies for the desired number of items.

Table 10
D-Studies for Raters

Level            G-study    Option 1   Option 2   Option 3   Option 4   Option 5
P                   49          49         49         49         49         49
R                    5           1          2          3          4          6
I                   10          10         10         10         10         10
Rel. Err. Var.   0.015501    0.077503   0.038751   0.025834   0.019376   0.012917
Coef_G rel.      0.860690    0.552703   0.711924   0.787548   0.831724   0.881149
  (rounded)        0.86        0.55       0.71       0.79       0.83       0.88
Abs. Err. Var.   0.016355    0.081776   0.040888   0.027259   0.020444   0.013629
Coef_G abs.      0.854130    0.539401   0.700793   0.778431   0.824078   0.875413
  (rounded)        0.85        0.54       0.70       0.78       0.82       0.88

Table 11
D-Studies for Items

Level            G-study    Option 1   Option 2   Option 3   Option 4   Option 5
P                   49          49         49         49         49         49
R                    5           5          5          5          5          5
I                   10           6          8         12         14         16
Rel. Err. Var.   0.031804    0.053006   0.039755   0.026503   0.022717   0.019877
Coef_G rel.      0.714167    0.599860   0.666537   0.749891   0.777677   0.799907
  (rounded)        0.71        0.60       0.67       0.75       0.78       0.80
Abs. Err. Var.   0.040026    0.066711   0.050033   0.033355   0.028590   0.025016
Coef_G abs.      0.665022    0.543621   0.613633   0.704345   0.735406   0.760561
  (rounded)        0.67        0.54       0.61       0.70       0.74       0.76

The results of the D-Studies revealed that the most desirable measurement condition for the applicant evaluation process would be obtained by using four raters and ten items, although it could be argued that three raters and ten items would also be acceptable. Keeping the number of raters fixed and varying the number of items instead, it would take sixteen items to reach an adequate measurement condition. The application selection committee has previously established the criteria it feels best predict which individuals are suited to the MEd-Counselling program; therefore, adjusting the number of items would be redundant and impractical in this testing situation. Figure 2 shows the results of the D-Studies in relation to one another.

Figure 2. Alternative D-Studies for determining the optimal number of raters and items.

The graphic representation of the D-Studies, which compared the most feasible options for the desired number of raters and items, suggests that the G coefficient responds more to adjusting the number of raters than to adjusting the number of items. To achieve a G coefficient of approximately 0.80, either a minimum of three raters or a minimum of sixteen items is required. As stated earlier, adjusting the number of items would not be an ideal option because adding six irrelevant items would decrease the probability that all of the items measure one unidimensional construct, and five raters would still be required. Conversely, it would be reasonable to adjust the number of raters, especially in this case, since it would involve removing one or possibly two raters while keeping only ten items.

Part-time Applicants Revisited
Prompted by the small sample size in the G-Studies of the part-time applicants, the researcher asked whether any of the original five raters would be willing to evaluate the Northwest regional applicants. Four of the five original raters agreed: Faculty Counsellor (New), Faculty Non-Counsellor, Student Counsellor I, and Student Counsellor II.
The reason the Northwest applicants were not initially rated with the Prince George campus applicants was twofold. First, the Northwest applicants were not competing with the Prince George applicants for seats. Second, the Northwest intake was a non-competitive process: every applicant who applied was offered a seat in the program as a part-time student. Nevertheless, the part-time applicants at the Prince George campus and the part-time applicants at the Northwest campus should be similar; therefore, the two samples were combined to produce a part-time sample of 33 applicants. Rasch and G-Theory analyses were both conducted on the combined part-time applicant sample. The G-Study summary for the combined part-time applicants is featured in Table 12.

Table 12
Estimated G-Study Variance Components for Combined Part-Time Applicants

Source        Variance Component    %      df    SE
Persons (P)        0.083924         9.4    32    0.024424
Raters (R)         0.017857         2.0     3    0.012625
Items (I)          0.047478         5.3     9    0.035126
P*R                0.061512         6.9    96    0.008883
P*I                0.341691        38.5   288    0.034140
R*I                0.065597         7.4    27    0.019381
P*R*I              0.270177        30.4   864    0.012984
Coef_G relative: 0.85
Coef_G absolute: 0.81

When the part-time applicants from the Prince George campus and the Northwest campus were combined, a G coefficient of 0.85 was produced. Returning to Table 8, where a G coefficient of 0.89 was produced for a sample of 37 full-time applicants, the results from the full-time and part-time split are now comparable with each other. The combined part-time applicants produced a persons variance component of 0.083924 (9.4%), which is much better than the persons variance component of 0.013894 (2.1%) obtained when the sample size was only 12. The results in Table 12 demonstrate that, when examined together, all of the applicants who applied in 2010 for part-time studies, either in Prince George or in the Northwest, were moderately heterogeneous.

Rasch revisited. Further results for the combined part-time applicants are found in the Rasch rater and item reports. The Rasch items measurement report for the combined part-time applicants is shown in Table 13.

Table 13
Combined Part-time Applicants Items Measurement Report

Obsvd Avg  Fair Avg  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd  Nu  Item
3.7        2.65        .73       .11         0.7       -3       0.7       -3    1  Degree
3.7        2.62        .76       .11         0.9        0       0.9        0    2  Writing Ability
4.3        3.32       -.33       .12         0.9        0       0.9        0    3  Fit of Goals
4.5        3.73      -1.05       .14         1.4        3       1.3        2    4  Work Experience
4.3        3.43       -.52       .13         1.1        1       1.2        1    5  R1:Suitability
4.1        3.07        .09       .12         1.1        0       1.1        0    6  R1:Quality
4.1        3.18       -.09       .12         1.2        2       1.3        2    7  R2:Suitability
4.0        2.98        .23       .12         0.9        0       0.9        0    8  R2:Quality
4.1        3.17       -.08       .12         1.1        1       1.1        1    9  R3:Suitability
4.0        2.95        .27       .12         0.8       -1       0.8       -1   10  R3:Quality
Adj S.D. .51   Separation 4.18   Reliability .95
Fixed (all same) chi-square: 173.1, d.f.: 9, significance: .00
Random (normal) chi-square: 9.0, d.f.: 8, significance: .35

In regard to the Rasch analysis, increasing the sample size of the part-time applicants produced item results similar to those obtained when the part-time sample was 12. The work experience item was still the least difficult item (logit score = -1.05), and meeting the degree requirements (logit score = 0.73) and writing ability (logit score = 0.76) still appeared to be the most difficult items for part-time applicants.
The reliability coefficient increased to 0.95 from 0.93, affirming that each of the ten items used to evaluate applicants varied in its level of difficulty. All of the items still have Infit and Outfit values within the 0.5 to 1.5 range, which supports unidimensionality and model fit. Based on these results, it appears that the Rasch model was not as sensitive to sample size as G-Theory. The Rasch rater measurement report for the combined part-time applicants is shown in Table 14.

Table 14
Combined Part-time Applicants Rater Measurement Report

Obsvd Avg  Fair Avg  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd  Nu  Rater
4.2        3.28       -.26       .08         1.3        4       1.4        4    1  Faculty Counsellor (New)
4.1        3.18       -.10       .13         0.8       -1       0.8       -1    2  Faculty Counsellor
4.2        3.23       -.18       .08         1.0        0       1.0        0    3  Faculty Non-Counsellor
4.0        3.01        .19       .08         0.8       -2       0.8       -2    4  Student Counsellor I
3.9        2.90        .35       .08         1.0        0       1.0        0    5  Student Counsellor II
Adj S.D. .22   Separation 2.33   Reliability .84
Fixed (all same) chi-square: 42.1, d.f.: 4, significance: .00
Random (normal) chi-square: 4.1, d.f.: 3, significance: .25

The results for the combined part-time applicants produced some interesting findings. The most severe rater (R5) had a measure of 0.35 and the most lenient rater (R1) had a measure of -0.26. Notice, by returning to Table 4, that the student raters (R4 and R5) were the two most lenient raters there. The findings in Table 14 show the opposite pattern: the student raters (R4 and R5) are now the two most severe raters (logit scores = 0.19 and 0.35). These results suggest that the faculty raters were more severe in their evaluation of the full-time applicants and more lenient in their evaluation of the part-time applicants; correspondingly, the student raters were more severe in their evaluation of the part-time applicants and more lenient in their evaluation of the full-time applicants. Myford and Wolfe (2004a) would suggest that the severity of the students' evaluations occurred because they could better align themselves with the full-time applicants.

Full-time and Part-time Applicants Revisited
The results chapter began with a Wright variable map of the relationships among facets for the Prince George applicants (Figure 1). Given that incorporating the Prince George part-time applicants together with the Northwest part-time applicants seemed to balance out the research design, it seemed logical to conduct an analysis of all of the applicants, full-time and part-time, from both the Prince George and Northwest campuses. The results are shown in Figure 3.

Figure 3. Wright variable map for relationships among facets for Prince George and Northwest applicants.

The results of the many-facet Rasch analysis for all of the Prince George and Northwest applicants together are shown in Figure 3. The second column, "Student," describes the distribution of all 70 applicants. Consistent with the results generated in Figure 1, most of the applicants were situated within the 0 to 2 region on the logit scale, indicating that they were proficient applicants. The fifth column, representing the item difficulties, was also consistent with the results generated in Figure 1.
This means that difficult items, like work experience and degree requirements, were still challenging, and easy items, like finding suitable individuals to provide references, were still simple. The fourth column, titled "Rater," brought about interesting results: all of the raters were positioned at the 0 logit mark. Given that it is difficult to tell the exact variability of the raters from the Wright variable map, a combined Prince George and Northwest applicants rater measurement report was produced and is shown in Table 15.

Table 15
Combined Prince George and Northwest Applicants Rater Measurement Report

Obsvd Avg  Fair Avg  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd  Nu  Rater
4.0        3.21       -.02       .05         1.2        2       1.2        3    1  Faculty Counsellor (New)
3.9        3.18        .01       .06         0.8       -3       0.8       -3    2  Faculty Counsellor
4.0        3.17        .02       .05         1.0        0       1.0        0    3  Faculty Non-Counsellor
4.0        3.17        .02       .05         0.8       -3       0.8       -3    4  Student Counsellor I
4.0        3.21       -.02       .05         1.2        3       1.2        2    5  Student Counsellor II
Adj S.D. .00   Separation .00   Reliability .00
Fixed (all same) chi-square: .8, d.f.: 4, significance: .94

The beginning of this section (under the subheading "Raters") stated that a low reliability coefficient would be ideal. These results suggest two things. First, there were no significant differences between the five raters across the combined Prince George-Northwest sample. Second, the raters could effectively be interchanged, as their rating behaviour did not differ enough to be worrisome.

Probability
The Rasch analysis includes a report of probability curves that shows how well each of the five categories of the rating scale functions. The probability curves for the 5-point scale used in this study are presented in Figure 4.

Figure 4. Probability curves for the 5-point rating scale used to evaluate MEd-Counselling applicants from both the Prince George and Northwest campuses.

The probability curves showed some overlap and disordering of the steps between categories one, two, and three. Linacre (2010) would recommend asking whether the categories are different enough to merit separate categorical points on the rating scale. The lower threshold of the second probability curve indicated that the raters were unable to clearly distinguish between those rating categories. Perhaps the second point on the rating scale should be collapsed into either the first or the third category to create more distinctive boundaries between the categories. Examination of the modes in the probability curves (Figure 4) showed no separation between category one and category three. In this case, looking at the median or mean thresholds provided a better interpretation of where the second category is aligned on the logit scale relative to the other categories. Figure 5 provides the scale structure for each of the five categories used to rate all of the applicants.

Figure 5. Scale structure for applicants from Prince George and Northwest campuses.
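To illustrate the kind of category probability curves shown in Figure 4, the following sketch computes category probabilities under a many-facet rating scale formulation (applicant ability minus item difficulty minus rater severity, with Rasch-Andrich step thresholds), in the spirit of the model described in the literature review. The threshold values and facet parameters are hypothetical, not the FACETS estimates for this study.

```python
import numpy as np

def category_probabilities(ability, difficulty, severity, thresholds):
    """Category probabilities for a many-facet rating scale model:
    log(P_k / P_{k-1}) = ability - difficulty - severity - threshold_k."""
    theta = ability - difficulty - severity
    # Cumulative sums of (theta - tau_k) give the log-numerators for categories 2..K;
    # category 1 has a log-numerator of 0 by convention.
    logits = np.concatenate(([0.0], np.cumsum(theta - np.asarray(thresholds))))
    probs = np.exp(logits - logits.max())   # subtract max for numerical stability
    return probs / probs.sum()

# Hypothetical Rasch-Andrich thresholds for a 5-category (1-5) scale
thresholds = [-2.0, -0.5, 0.5, 2.0]
for theta in (-2, -1, 0, 1, 2):
    p = category_probabilities(theta, difficulty=0.0, severity=0.0,
                               thresholds=thresholds)
    print(theta, np.round(p, 2))
```

Plotting these probabilities against the logit scale produces curves like those in Figure 4; disordered or closely spaced thresholds show up as a category (such as category two here) that is never the most probable response anywhere on the scale, which is the pattern that motivates collapsing categories.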
G-Theory revisited. Unfortunately, G-Theory was unable to supply any information about the categories used in the rating scale. However, a G-Theory analysis was conducted for the combined Prince George and Northwest applicants, and the G-Study summary is featured in Table 16.

Table 16
Estimated G-Study Variance Components for Combined Prince George and Northwest Applicants

Source        Variance Component    %      df    SE
Persons (P)        0.052949         3.0    69    0.037410
Raters (R)         0.284887        16.1     4    0.170891
Items (I)          0.031622         1.8     9    0.020051
P*R                0.776699        43.8   276    0.065885
P*I                0.280245        15.8   621    0.019512
R*I                0.034006         1.9    36    0.008830
P*R*I              0.313438        17.7  2484    0.008890
Coef_G relative: 0.25
Coef_G absolute: 0.20

The G coefficient warrants special consideration here, given that this G-Study produced a value of 0.25. Recall that R2 (Faculty Counsellor) did not rate the Northwest part-time applicants. G-Theory does not handle missing data well. In this particular case, the EDUG software treated R2's missing data as "0" in the data set; this led to an analysis of what appeared to be a 6-point scale in which all of the part-time Northwest applicants received exactly the same rating of "0" from R2. The variance component for the main effect of raters was 0.284887, accounting for 16.1% of the total variance, by far the largest amount of rater variance seen in any of the G-Studies conducted with these raters. The Rasch analysis presented earlier (Figure 3) used the same data set, with the same missing data, and it was not problematic: FACETS treated the missing ratings as missing and generated estimates based on the data that R2 did provide.

Chapter Five: Discussion and Conclusions
The primary goal of this study was to assess the overall effectiveness of the graduate applicant selection process as it currently exists at UNBC. This assessment included an analysis of the items, raters, and applicants. The Many-Facet Rasch Model and Generalizability Theory were chosen because of their ability to provide relevant and credible information about rater, item, and applicant consistencies. The Rasch analysis showed that all of the items fit within a unidimensional construct. It also informed the researcher that the rating behaviour of the participants was acceptable, by providing measures that reflected the severity or leniency of each of the raters in this study. Conversely, the G-Theory analysis was able to account for the proportion of variance that each of the facets contributed. Furthermore, G-Theory provided information about alternative research designs that could best be employed in the future. Using both the Many-Facet Rasch Model and Generalizability Theory showed that the two methodologies complement each other well in their abilities to describe the elements of variability within data. Both methodologies provided the researcher with information about item, rater, and applicant characteristics in a way that could be used to make inferences about the findings of this study. The secondary goal of this study was to investigate the variability of the second-year MEd-Counselling students who were acting as raters alongside the faculty raters. When the sample size was adequate, the results indicated that the student raters behaved no differently than the faculty raters.
A complete applicant report for all 70 applicants from both the Prince George and Northwest campuses can be found in Appendix E.

Items. By conducting the Rasch analysis, the researcher showed that all of the items differed from each other in degree of endorsement (difficulty); however, they were not different enough to prevent the items from being measured as a single unidimensional construct. In analyzing the items used to rate applicants applying to the MEd-Counselling program, the Rasch model conveyed the range of each item's level of difficulty. The Rasch analysis provided precise information about the data. For instance, the Rasch analysis indicated the degree to which the work experience item was potentially misfitting by displaying large mean-square fit values (Infit = 1.5, t = 4; Outfit = 1.5, t = 4). These values were later discovered to be caused by the combination of the full-time and part-time samples of applicants. The Rasch analysis was also able to show that the relevant degree item was overly predictable, given the low Infit and Outfit scores that were produced (Infit = 0.6, t = -4; Outfit = 0.6, t = -4). Based on this information, it would be worthwhile to remove this item from the rating scale and have it coded by an administrative assistant, who would likely assign the same value as the application selection committee. The FACETS program (Linacre, 1996) provided fit statistics (Infit and Outfit values) for each item, rater, and applicant involved in this study. The Rasch analysis produced a separation ratio (5.67) and reliability index (0.97) for the items; these statistics mean that the items were highly separable and that the differences between the items were over five times greater than the error associated with the measurement model. The G-Theory analysis indicated specific variance components for each of the facets and each possible interaction identified in this study. For example, the G-Theory analysis demonstrated that the items differ slightly in difficulty by showing that 7.9% of the variance was due to the items facet. The G-Theory analysis also produced individual G coefficients, by both relative and absolute standards, for each of the ten items used to measure the applicants. In conducting a Rasch analysis and a G-Theory analysis for the full-time and part-time applicants, the researcher was able to show a seeming invariance of items across samples of applicants drawn from larger populations.

Raters. There were five raters in total who participated in this study: three faculty members and two students. As far as the rater analysis was concerned, the Rasch analysis was most informative in describing how each rater behaved individually because of its ability to transform the raw scores into logit scores and place them on an interval scale of measurement. The rater measurement report shown in Table 4 produced a 0.31 logit spread among the five raters. A 0.31 logit spread is low, considering the diversity of knowledge and experience among the five raters. The study conducted by Sudweeks and colleagues (2005) produced a logit spread of 0.51 among raters. The Rasch analysis showed that the most severe rater (0.15 logits) was a new faculty member who had a counselling background but no previous experience with this task at this institution. The next most severe rater (0.09 logits) was the faculty member who did not have a counselling background but had considerable experience with this process.
The faculty member who had a counselling background and was familiar with the application process was situated in the middle of the five raters (0.01 logits). Overall, the two student raters were the most lenient of all the raters (-0.09 and -0.16 logits), with the first being overly constrained and the other somewhat erratic according to the t statistics (Infit t = -2 and 4; Outfit t = -3 and 3). Verbal communication between the two student raters revealed that, given their limited experience in the field of counselling, neither of them felt qualified to rate the applicants on items concerning reference suitability and work experience. An interesting finding related to the behaviour of the raters was discovered when analyses were conducted separately by full-time and part-time status. The students rated the part-time applicants more severely than the full-time applicants, while the faculty members rated the full-time applicants more severely than the part-time applicants. Perhaps, as Myford and Wolfe (2004a) suggested, these effects occurred because the raters (full-time students) were able to identify in some way with the applicants whom they were rating. Further investigation of this issue, as revealed by one of the student raters, suggested that although work experience is essential in the field of counselling, a strong level of writing ability is necessary in order to meet the demands of the UNBC MEd-Counselling program. Personal communication between the researcher and the faculty member without a counselling background resulted in the faculty member asserting that part-time applicants bring a wealth of knowledge to the program, having worked in the helping profession for an extended period of time. The student rater who believed that a strong level of writing ability is critical admitted that the work experience component is a crucial one, but argued that, once admitted into the program, both the full-time and the part-time students are required to perform at the same level academically, given that each student is working towards fulfillment of the requirements for a Master of Education degree. Nevertheless, the results generated from the combined sample of Prince George and Northwest applicants suggested that the variation in rating behaviour may not have been just a difference of opinion, but rather a result of the small sample size for the part-time applicants. When the sample sizes for the full-time and part-time applicants were above 30, no significant differences were found between any of the raters (Table 15). The three faculty raters had logit scores of -0.02, 0.01, and 0.02, a 0.04 logit spread. This result is comparable with the study conducted by Smith and Kulikowich (2004), which produced a 0.04 logit spread, though for only two raters. The two student raters had logit scores identical to two of the faculty raters, which produced an overall 0.04 logit spread for all the raters. The five raters behaved in a way that made it difficult to distinguish between them, suggesting that this process may require only one or two raters to evaluate and select the successful applicants. The Generalizability analysis suggested that all of the raters used the rating scale similarly (0.5% of the variance for the rater main effect), and the relatively small rater-by-item interaction (4.9%) supports the use of a less than fully crossed rater-by-applicant design. However, the 9.2% variance component for the rater-by-applicant interaction indicated otherwise.
This was not seen as an issue when the full-time and part-time applicants were analyzed separately (Table 8 and Table 12). This claim is further supported by the Rasch analysis results. Likewise, the large applicant-by-item interaction (30.4%) yielded similar variance component values in the analyses of the two separate populations; however, the applicant-by-item interactions for the full-time (24.9%) and part-time (38.5%) applicants were still larger than a researcher would hope for. At any rate, because all applicants needed to respond to all items, this interaction does not act as a source of error in this design. The G-Facets analysis (Table 7) provided relative G coefficients for each of the raters, which indicated how accurate each rater's scoring behaviour was in relation to universal scoring behaviour. Among the decision studies conducted throughout this analysis, the ideal condition would be four raters across ten items. This design would produce a G coefficient of 0.83 and a relative error variance of 0.019. The G-Study analysis, through the sources of variation it revealed, reinforced the Rasch finding that two populations of applicants, full-time and part-time, existed within the sample that applied to the MEd-Counselling program. Given that G-Theory looks at how well a single person's score can be generalized across the universe of scores, it is important to have a high percentage of variance accounted for by persons. Larger variance components for persons indicate that the sample is more heterogeneous, and thus more representative of the universal population. In retrospect, the most informative data came from conducting both the Rasch and G-Theory analyses, then presenting the information to each of the raters and asking them why they felt they rated the applicants on the items the way that they did. Engaging in personal communication with each of the raters enhanced the quantitative data by adding the unique qualitative perspective of each rater.

Applicants. From the perspective of a researcher and counselling practitioner, it is more justified to view all of the applicants together, even though the Rasch and G-Theory analyses revealed the presence of two distinct applicant populations. The separation of 2.79 and the reliability of 0.89 for the Prince George applicants, as well as the separation of 2.57 and the reliability of 0.87 for the combined Prince George and Northwest applicants, indicate that there was relatively strong person heterogeneity among the applicants. When all of the applicants from both the Prince George and Northwest campuses were considered as one population there were only three outliers: applicant 27 (Infit = 1.70, t = 2; Outfit = 1.60, t = 2), who applied for full-time studies; applicant 48 (Infit = 1.60, t = 2; Outfit = 1.60, t = 2), who applied for part-time studies in Prince George; and applicant 66 (Infit = 1.60, t = 2; Outfit = 1.70, t = 2), who applied for part-time studies in the Northwest. Even though each population presents different aspects, the Rasch analysis suggests that how the applicants were rated on the items is measurable as a single unidimensional construct. This is ideal considering that, once offered a seat in the program, all successful applicants will be working towards completing the same degree requirements regardless of their full-time or part-time status. It was still useful to examine the two populations independently of one another because they are non-competitive with each other.
The university has a specific number of seats in the program available for full-time students and a specific number of seats available for part-time students; therefore, even though all of the applicants were assessed together, competition for letters of acceptance was based on whether the applicant indicated the intention of full-time or part-time study.

Student Raters. Based on the information presented in this study, there is a case to be made for permanently incorporating student raters into the application selection process. The two student raters, who had completed all course requirements and practicum, brought an alternative perspective to the application selection committee. Corey, Corey, and Callanan (2007) state, in relation to professional competence and training, that:

A number of programs have both faculty members and graduate students on the reviewing committee. If many sources are considered and if more than one person makes a decision about whom to select for training, there is less likelihood that people will be screened out on the basis of the personal bias of one individual. (p. 321)

The application selection committee may also consider having a student on the committee to lighten the workload of faculty members during such a busy time of the semester. There was no strong evidence that the student raters behaved in a way that would be worrisome. If the committee would like to feel more confident in assessing the ability of student raters, the report of unexpected responses (see Appendix F) produced by the FACETS program (Linacre, 1996) could be examined. When the full-time and part-time applicants from both the Prince George and Northwest campuses were analyzed together, there was no difference between the rating behaviour of the students and the rating behaviour of the faculty members. It is certainly a viable option to have second-year counselling students acting as raters alongside the faculty raters. As a matter of fact, the Social Work hiring committee at UNBC includes an undergraduate student who has the same level of influence as any other member of the committee.

Conclusions
After exploring the relationship between the Many-Facet Rasch Model and Generalizability Theory, it appears that each methodology has its prevailing strengths and weaknesses. The strengths of the Rasch model include the greater detail available when focusing on the individual elements of each facet, the error indicators supplied for each element, and a remarkable ability to handle small sample sizes and missing data. These strengths suggest that the Rasch model is robust to data conditions, such as missing data and small samples, that many other models are unable to withstand. Some of the weaknesses of the Rasch model relate to its simplicity: the Rasch model is not overly complicated, which has left some researchers unconvinced that it is a viable model. Also, the lack of concrete rules relating to matters such as sample size and fit statistics has been a documented source of frustration for researchers. The strengths of G-Theory include the ability to provide variance components for each facet's main effect and all possible interactions, the freedom to make relative or absolute decisions, and the decision-studies feature that displays reliability measures for various designs. Some of the weaknesses of G-Theory have to do with its inability to compensate for small sample sizes and missing data.
When it comes to a preference for one methodology over the other, the research questions should guide the approach used for the analysis. In asking whether the applicant selection process at UNBC produced a unitary score that can be used to rank all individuals applying to the counselling program, the Rasch analysis proved more useful than G-Theory because it was able to produce Fit statistics for each individual item. For the question of whether the 5-point rating scale used to evaluate the applicants served as an appropriate measurement tool, the Rasch and G-Theory analyses both generated satisfactory results. The Rasch analysis produced Fit statistics, severity measures, a separation ratio, a reliability score, and probability curves that provided information about the rating scale's performance. The G-Theory analysis generated variance components for items, applicants-by-items, and raters-by-items; the 8% of variance accounted for by the main effect of items further supports the appropriateness of the 5-point rating scale used to evaluate the applicants. In considering the rating characteristics of the participants chosen to serve as raters on the selection committee, both the Rasch and G-Theory analyses suggested that the raters were suitable as groups (0.31 logit spread, 0.73 reliability index, and p < .005; 0.5% variance for the rater main effect, 9.2% variance for the applicant-by-rater interaction, and 4.9% for the rater-by-item interaction). However, only the Rasch analysis provided information about how the raters behaved as individuals (Fit statistics and severity measures for each rater). Finally, the Rasch model is a viable method for dealing with rater differences because of its ability to produce severity measures, observed averages, and fair averages for each rater.

In conclusion, based on the nature of this study, the Rasch analysis seemed to be more advantageous than G-Theory. This advantage comes from the ability of Rasch analysis to transform ordinal raw data into interval measures through the logit transformation, and from its ability to produce Fit statistics for individual items, raters, and applicants that alert the researcher to possible violations within the data. The greatest benefit of using Rasch analysis, which became even more apparent as this study progressed, was the model's ability to handle large amounts of missing data and relatively small sample sizes.

Limitations of the Design

One limitation unique to this particular study is that the counselling coordinator, who regularly holds this position and serves on the selection committee, is on sabbatical this year. This absence means that the selection committee for the September 2011 intake will have no data on this particular rater unless data could be taken from a previous year when she chaired the selection committee. The large residual variance component of 35.8% for the 49 applicants from the Prince George campus is a source of concern, considering the design and the amount of information that was accumulated by simultaneously employing two different measurement models. The researcher's ability to employ a fully crossed design was beneficial in explaining the findings from the Generalizability analysis; however, a fully crossed design may not always be a realistic option in the future. The committee consisted of two males and three females, with three faculty raters and two student raters.
Given that four out of the five raters have a counselling background, further analysis regarding gender, age, academic status, and level of counselling experience would have made this study more informative.

Recommendations for the Application Selection Committee

Since the results of the analysis indicated some variation in the selection of the applicants best suited to pursue a Masters of Education in Counselling degree, the research findings warrant addressing, and possibly adding, the following components.

Faculty agreement. Based on the Fit statistics (Infit = 1.50; Outfit = 1.50) and G coefficient (0.816) values for the work experience item, it would be worthwhile to explore other steps that can be taken to ensure that this item performs consistently across different samples of applicants applying to the MEd-Counselling program. One consideration would be to create a category that looks at the fit of prospective students to Education faculty members. The term "fit" is used here to mean that the applicant has provided evidence of compatibility with some of the current faculty members with regard to theoretical approach or research interests.

Northern perspective. UNBC is interested in training counsellors who have a passion for their work, especially those who want to work in the North. Another option for the committee is to create a Northern experience item that would allow members of the application selection committee to assess an applicant's suitability and fit not only with the program, but also with the university and the community established in Prince George. This would give the raters the opportunity to judge the quality of an applicant's work and lived experience in the North, which might adjust the high Fit statistics (Infit = 1.50; Outfit = 1.50) and low G coefficient (0.816) values for the current work experience item.

Interviews. Through personal communication between the researcher and one of the faculty counsellors, another method of evaluating applicants was suggested: phone or in-person interviews. In the field of counselling, where persona and aura play a vital part in the therapeutic relationship, it would certainly benefit the School of Education to pre-screen applicants through interviewing. An interview would allow the raters to clarify any information in an application that was ambiguous. This would likely sort out some of the issues with the work experience item and reduce the number of unexpected responses generated by the raters (see Appendix F for the complete list).

Applicant waitlist. Since the program first began, the number of applications the university receives from individuals wishing to enter the MEd-Counselling program has steadily increased. This year, the university received almost 50 applications from prospective students. According to the Wright variable map displayed in Figure 1, the applicants are outperforming the items; this is evident from the large number of applicants sitting above the 1.0 logit mark. These data are strong in the sense that the university was able to select among top-quality applicants, but weak in the sense that a large number of quality applicants were not offered a seat in the MEd-Counselling program. At this time there is no waitlist policy for applicants who were not offered a letter of acceptance.
Perhaps the Masters of Education program should look at drafting a protocol for those applicants who meet the criteria but were not accepted into the program because of the level of competition.

Good measurement practice. The 4.9% rater-by-item interaction produced by the G-Theory analysis was low, but not negligible. As part of good measurement practice, the names and other personal identifiers of the applicants were blanked out in an attempt to protect confidentiality and to remove anything that could bias a particular rater. Continuing this practice in the future can reasonably be expected to help keep the rater-by-item interaction low. As mentioned in the methods chapter, the researcher provided the raters with instructions for how to use the rating scale, along with information about the most commonly identified rater effects, biases, and errors, as well as the consequences of committing such rating errors. This practice seemed to work well, as demonstrated by the 0.5% of variance accounted for by the main effect of raters in the G-Theory analysis and the 0.04 logit spread from the Rasch analysis.

Recommendations for Future Research

In the future, assessing the MEd-Counselling program using the same design and participants would be ideal. This would allow researchers to examine rater drift (changes in rating patterns and behaviour over time). One of the most common facets analyzed with both Rasch analysis and Generalizability Theory is occasions; the data for this study were gathered on a single occasion, which does not allow for the opportunity to examine item difficulty, rater behaviour, or applicant quality over any period of time. As mentioned previously, the MEd-Counselling coordinator is currently on sabbatical, which means there were no data suggesting where she fits with the other raters on the application selection committee. Replicating this study next year to include the MEd-Counselling coordinator could provide useful information about how she would have fit with the raters used in this study. The application selection committee could also investigate and experiment with other potential items, such as adding a supplementary item to reduce the variability of the work experience item, or removing the relevant degree item from the rating scale and having the relevant degree coded by one person. Another recommendation for the future, given that the data now exist, is to try using Rasch analysis and Generalizability Theory in a design that is not fully crossed, to see what degree of overlap is necessary for raters to review and score all of the application packages received each year with accuracy and precision (a simple illustration of such a linked design appears at the end of this section). The final recommendation, given that the data from this study have been made available to the School of Education, is to qualitatively and quantitatively examine the successful applicants to see what prompted them to apply to the MEd-Counselling program at UNBC. A great follow-up study to this one would be to collect data on how well each successful applicant performed in the program and to compare this with the ranking each held when entering the program.
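As a rough illustration of what a design that is not fully crossed could look like, the Python sketch below assigns each applicant to only two of the five raters while keeping the design connected, so that every rater is linked to every other rater through applicants they have in common. This is an illustrative sketch only, not a procedure used by the committee or prescribed by FACETS; the number of raters per applicant and the rotation scheme are assumptions chosen for the example.

def linked_judging_plan(n_applicants, n_raters, raters_per_applicant=2):
    """Rotate overlapping blocks of raters across applicants so the sparse design stays connected."""
    plan = {}
    for a in range(n_applicants):
        start = a % n_raters  # shift the starting rater by one for each new applicant
        plan[a + 1] = [(start + k) % n_raters + 1 for k in range(raters_per_applicant)]
    return plan

# Example: 70 applicants, 5 raters, 2 raters per applicant (28 applications per rater
# instead of 70, yet all raters remain linked through shared applicants).
for applicant, raters in list(linked_judging_plan(70, 5).items())[:5]:
    print(f"applicant {applicant}: raters {raters}")
# applicant 1: raters [1, 2]
# applicant 2: raters [2, 3]
# applicant 3: raters [3, 4]
# applicant 4: raters [4, 5]
# applicant 5: raters [5, 1]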
In conclusion, both the Many-Facet Rasch Model and Generalizability Theory have strengths. Each methodology was designed with an idea of the optimal conditions that would warrant its use. Research in the area of measurement requires researchers to make judgements as to whether the measurement context is appropriately suited to the methodology. Sometimes one methodology is not sufficient to adequately address all of the questions that a researcher has. Therefore, with any analysis, it may be necessary to find two or more measurement models that can be combined to make the most of the information contained within the data.

References

Andrich, D. (1996). Measurement criteria for choosing among models with graded responses. In A. von Eye & C. C. Clogg (Eds.), Categorical variables in developmental research: Methods for analysis (pp. 3-35). San Diego, CA: Academic Press.

Atilgan, H. (2008). Using generalizability theory to assess the score reliability of the special ability selection examinations for music education programmes in higher education. International Journal of Research & Method in Education, 31(1), 63-76. doi: 10.1080/174372708011919925

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Chang, W., & Chan, C. (1995). Rasch analysis for outcome measures: Some methodological considerations. Archives of Physical Medicine and Rehabilitation, 76(1), 934-939.

Corey, G., Corey, M. S., & Callanan, P. (2007). Issues and ethics in the helping professions (7th ed.). Belmont, CA: Thomson Books/Cole.

Engelhard, G. (1992). The measurement of writing ability with a many-facet Rasch model. Applied Measurement in Education, 5(3), 171-191.

Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-facet Rasch model. Journal of Educational Measurement, 31(2), 93-112.

Fox, C. M., & Jones, J. A. (1998). Uses of Rasch modeling in counselling psychology research. Journal of Counseling Psychology, 45(1), 30-45.

Hurlburt, R. T. (2006). Comprehending behavioral statistics (4th ed.). Belmont, CA: Thomson Wadsworth.

Johnson, D. W., & Johnson, F. P. (2003). Joining together: Group therapy and group skills (8th ed.). Boston, MA: Pearson Education.

Kieffer, K. M. (1999). Why generalizability theory is essential and classical test theory is often inadequate. Advances in Social Science Methodology, 5(1), 149-170.

Kim, S. C., & Wilson, M. (2010). A comparative analysis of the rating in performance assessment using generalizability theory and the many-facet Rasch model. In M. L. Garner, G. Engelhard, W. P. Fisher, & M. Wilson (Eds.), Advances in Rasch measurement, volume 1 (pp. 304-327). Maple Grove, MN: JAM Press.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA.

Linacre, J. M., Wright, B. D., & Lunz, M. E. (1990). A facets model for judgmental scoring. Retrieved from http://www.Rasch.org/memo61.htm

Linacre, J. M. (1995). Categorical misfit statistics. Rasch Measurement Transactions, 9(3), 450.

Linacre, J. M. (1996). FACETS: A computer program for analysis of examinations with multiple facets, version 3.03. Chicago: MESA.

Linacre, J. M. (1997). Judging plans and facets. Retrieved from http://www.Rasch.org/m3.htm

Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103-122.

Linacre, J. M., & Wright, B. D. (2004). Construction of measures from many-facet data. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 296-321). Maple Grove, MN: JAM Press.

Linacre, J. M. (2010). Rasch measurement: Core topics. Retrieved from http://courses.statistics.com/index.php3
Liu, O. L., Minsky, J., Ling, G., & Kyllonen, P. (2009). Using the standardized letters of recommendation in selection: Results from a multidimensional Rasch model. Educational and Psychological Measurement, 69(3), 475-492. doi: 10.1177/0013164408322031

Lochhead, L. (2009). Assessment of perceived functional capacity: Using Rasch analysis to evaluate the measurement properties of four perceived pain & disability scales (Master's thesis). University of Northern British Columbia, Prince George.

Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345.

Lunz, M. E. (1999). A longitudinal study of judge leniency. Popular Measurement, 47(1), 46-47.

MacMillan, P. (2000a). Simultaneous measurement of reading growth, gender, and relative age effects: Many-faceted Rasch applied to CBM reading scores. Journal of Applied Measurement, 1(4), 393-408.

MacMillan, P. D. (2000b). Classical, generalizability, and multifaceted Rasch detection of interrater variability in large, sparse data sets. The Journal of Experimental Education, 68(2), 167-190.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174.

Matt, G. E. (2010). Generalizability theory. Retrieved from http://www.psychology.sdsu.edu/faculty/matt/Pubs/GThml/GTheory GEMatt.html

McHorney, C. A., Haley, S. M., & Ware, J. E. (1997). Evaluation of the MOS SF-36 physical functioning scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. Journal of Clinical Epidemiology, 50(4), 451-461.

Myford, C. M., & Wolfe, E. W. (2004a). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 460-517). Maple Grove, MN: JAM Press.

Myford, C. M., & Wolfe, E. W. (2004b). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 518-574). Maple Grove, MN: JAM Press.

O'Neill, T. R. (1999). Adjusting for rater severity over time. Popular Measurement, 47(1), 46-47.

Oosterveld, P., & ten Cate, O. (2004). Generalizability of a study sample assessment procedure for entrance selection for medical school. Medical Teacher, 26(1), 635-639. doi: 10.1080/01421590400004874

Pedersen, G., Hagtvet, K. A., & Karterud, S. (2007). Generalizability studies of the global assessment of functioning: Split version. Comprehensive Psychiatry, 48(1), 88-94. doi: 10.1016/j.comppsych.2006.03.008

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL: University of Chicago Press.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage Publications.

Smith, E. V. (2004). Evidence for the reliability of measures and validity of measure interpretation: A Rasch measurement perspective. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 93-122). Maple Grove, MN: JAM Press.

Smith, R. M. (2004). Fit analysis in latent trait measurement models. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 73-92). Maple Grove, MN: JAM Press.

Smith, E. V., & Kulikowich, J. M. (2004). An application of generalizability theory and many-facet Rasch measurement using a complex problem-solving skills assessment. Educational and Psychological Measurement, 64(4), 617-639. doi: 10.1177/0013164404263876
Stewart, I. (2006). Letters to a young mathematician. New York, NY: Basic Books.

Sudweeks, R. R., Reeve, S., & Bradshaw, W. S. (2005). A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing, 9(1), 239-261. doi: 10.1016/j.asw.2004.11.001

Swiss Society for Research in Education Working Group. (2010). EDUG user guide, version 6.0. Neuchatel, Switzerland: Edumetrics.

Winne, P. H., Nesbit, J. C., Kumar, V., Hadwin, A. F., Lajoie, S. P., Azevedo, R. A., & Perry, N. E. (2006). Supporting self-regulated learning with g-study software: The learning kit project. Technology, Instruction, Cognition and Learning, 5(1), 105-113.

Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.

Wright, B. D. (1997). Fundamental measurement for outcome evaluation. Physical Medicine and Rehabilitation, 11(2), 261-288.

Appendix A: The Biblical Approach to Understanding Additive Measurement

This passage was taken from Ian Stewart's (2006) "Letters to a young mathematician":

"The flood has receded and the ark is safely aground atop Mount Ararat; Noah tells all the animals to go forth and multiply. Soon the land is teeming with every kind of living creature in abundance, except snakes. Noah wonders why. One morning two miserable snakes knock on the door of the ark with a complaint. 'You haven't cut down any trees.' Noah is puzzled, but does as they wish. Within a month, you can't walk a step without treading on baby snakes. With difficulty he tracks down the two parents. 'What was all that with the trees?' 'Ah,' says one of the snakes, 'you didn't notice which species we are.' Noah still looks blank. 'We're adders, and can only multiply using logs.'"

This joke is a multiple pun: you can multiply numbers by adding their logarithms.

Appendix B: Participant Consent Form

"Pick Me, Pick Me, I Want to Be a Counsellor": Assessing MEd-Counselling Applicants Using Rasch Analysis and Generalizability Theory

You are invited to participate in the evaluation of the Masters of Education-Counselling program at the University of Northern British Columbia. This evaluation is directed by Stefanie Sebok, MEd Candidate, in collaboration with the School of Education at the University of Northern British Columbia. Stefanie is a student in the School of Education at the University of Northern British Columbia and you may contact her by phoning (250) 960-5671 if you have any questions. The purpose of this research project is to provide information about the applicants who have applied to the MEd-Counselling program for intake in September 2010, to evaluate the effectiveness of the items used by the application selection committee to score applicants, and to assess the rater characteristics of each member on the application selection committee. This evaluation will inform the School of Education and provide feedback to refine the selection process in the future. All information we ask you to provide is confidential and is only seen by the researcher and her supervisor, Dr. Peter MacMillan. No names will appear on any outputs; numerical codes or coded initials will be used instead of names for presentation purposes. All information that you provide will be stored in a locked filing cabinet until the study is completed. Once this study has been completed, all application packages will be returned to the Chair of the application selection committee.
There are no known risks from participating in this research. Your participation in this study will help increase knowledge about ways to enhance the overall MEd-Counselling application selection process in the future. Your participation in this study is voluntary. You are free not to participate and there are no negative consequences for not participating. If you decide to participate and then change your mind, you may withdraw at any time without any consequences or any explanation. If you withdraw from the study, your data will be removed from the analysis. The Research Ethics Board (REB) at the University of Northern British Columbia has reviewed this study and granted permission to move forward, as this study constitutes standard program evaluation. If you have any concerns as a participant in this study you may contact the Office of Research at UNBC at (250) 960-5820, or raise any concerns you might have by contacting Stefanie Sebok or Dr. Peter MacMillan, Supervisor, at the University of Northern British Columbia at (250) 960-5828. Your signature below indicates that you understand the above-noted conditions of participation in this study and that you have had the opportunity to have your questions answered by the researchers.

I (PRINT NAME) agree to participate in the evaluation of the MEd-Counselling application selection process.

Signature                    Date

Appendix C: Research Ethics Letter

UNIVERSITY OF NORTHERN BRITISH COLUMBIA
RESEARCH ETHICS BOARD
MEMORANDUM

To: Stefanie Sebok
CC: Peter MacMillan
From: Henry Harder, Chair, Research Ethics Board
Date: April 1, 2010
RE: They were too Rasch with my application

Thank you for your application regarding the above noted project. The committee has discussed the information and is supportive of your involvement with the data analysis, and felt that the project falls outside the purview of the ethics committee as there are no human subject participants. The committee appreciates the work you have done on this application and for allowing the UNBC REB to review your project, and we are happy to perform our due diligence when a researcher becomes involved in a project.

Regards,
Henry Harder

Appendix D: Assessment Criteria Guide for Assessing MEd in Counselling Applicants: Intake 2010

These criteria will serve as a guide to informing the discussion of applicants. The final decision pertaining to acceptance of applicants into the MEd Counselling specialization will be made by committee vote.
Grade Point Average (GPA) (maximum 4.33)
*GPA is part of the applicant's overall score; however, the raters do not affect the score given for GPA

Relevant educational degrees (maximum 5 points)
• Graduate degree in Psychology, Social Work, Child/Youth Care, Education - 5 pts
• Relevant undergraduate degree in Psychology, Social Work, Child/Youth Care, Education or graduate degree in Nursing, Health Sciences, First Nations Studies, Criminology - 4 pts
• Undergraduate degree in Nursing, Health Sciences, First Nations Studies, Criminology or graduate degree that has some relevance - 3 pts
• Undergraduate degree that has some relevance or graduate degree that has little opportunity for written expression - 2 pts
• Undergraduate degree that has little opportunity for written expression - 1 pt

Statement of academic/research interests (5 + 5 = 10 points maximum)
• Competence in writing - (1 pt = very poorly written, 2 pts = poorly written, 3 pts = acceptable, 4 pts = well written, 5 pts = very well written)
• There is a fit between the goals of the applicant and the program - (1 pt = not a very good fit, 2 pts = not a good fit, 3 pts = acceptable, 4 pts = a good fit, 5 pts = a very good fit)
• An application may be rejected if writing is not of suitable quality or goals are not compatible with the program

Relevant employment/volunteer work (maximum 5 pts)
• 5+ years of full-time work experience in a helping profession - 5 pts
• 1-5 years of full-time work experience in a helping profession or 5+ years of part-time work experience in a helping profession - 4 pts
• Less than 1 year of full-time work experience in a helping profession or 1-5 years of part-time work experience in a helping profession or 5+ years of volunteer experience in a helping profession - 3 pts
• 1-5 years of volunteer experience in a helping profession or less than 1 year of part-time work experience in a helping profession - 2 pts
• Less than 1 year of volunteer experience in a helping profession - 1 pt

References (5 + 5 = 10 points each for an overall maximum of 30 points)
• Referee's suitability for the Counselling program - (1 pt = very unsuitable, 2 pts = unsuitable, 3 pts = acceptable, 4 pts = suitable, 5 pts = very suitable)
• The best references are those from a university/college instructor, employment supervisor, or referral agent that has observed the applicant in some extended capacity
• Quality of the reference based on professional judgment of relevant counselling qualities - (1 pt = poor, 2 pts = satisfactory, 3 pts = good, 4 pts = very good, 5 pts = excellent)
• An application may be rejected based on issues of serious concern in the references

[Rating sheet layout: columns for Rater #; Applicant's Name; GPA (max 4.33); Degree (max 5); Statement of Interest/Research (max 10: max 5 pts for competence in writing and max 5 pts for fit of goals; veto power); Employment (max 5); References (max 10 pts each for a total of 30 pts: max 5 pts for suitability of each referee and max 5 pts for quality of each reference; veto power); Total; Rank; C = conditional.]

Note: Students who are required to submit TOEFL results may also be asked to participate in a telephone interview.
Appendix E: Report for Prince George and Northwest Applicants

[The FACETS applicant measurement report for the 70 applicants from the Prince George and Northwest campuses appears here. For each applicant the report lists the observed score, observed count, observed average, fair average, measure (in logits), model standard error, and Infit and Outfit mean-square and standardized (ZStd) fit statistics. Summary statistics across the 70 applicants: RMSE (model) = .20; adjusted S.D. = .51; separation = 2.57; reliability = .87; fixed (all same) chi-square = 549.9, d.f. = 69, significance = .00; random (normal) chi-square = 68.0, d.f. = 68, significance = .48.]

Appendix F: Unexpected Responses Report for Prince George and Northwest Applicants

[The FACETS report of unexpected responses lists ten flagged ratings. For each flagged rating it gives the category awarded, the expected score, the residual, the standardized residual, and the applicant, rater, and item involved; the flagged ratings involve applicants 3, 5, 27, 29, 39, 49, 50, and 51, with expected scores ranging from 3.6 to 4.8 against awarded categories of 1 to 3.]