SCHOOL DISTRICT 57 MATHEMATICS ACHIEVEMENT TESTS: THEIR RELIABILITY AND VALIDITY

by

Robert Bagnall

B.Sc., University of Calgary, 1972
Dip.Ed., University of Alberta, 1973
B.Th., University of Ottawa and Saint Paul University, 1978

THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF EDUCATION IN CURRICULUM AND INSTRUCTION

THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA

April 2004

© Robert Bagnall, 2004

Abstract

This study examines the reliability and validity of locally developed math achievement tests administered to grade 5 and grade 7 students in the Prince George School District (School District 57). School District 57 has collected test scores on locally developed mathematics achievement tests for samples of students in grades 5, 7 and 9 since 1995. This information has been gathered, at the direction of the school board, as part of an ongoing commitment to evaluate the educational programs in the district. This study uses test data from the year 2000 and compares math achievement test scores, student math grades for the school year 1999-2000, and Foundation Skills Assessment results for grade 4 students in May 2000 to establish the validity of the district mathematics achievement tests.
Test items are examined using classical item analysis, an assumption-free item analysis and item response logistic models. This study provides important information about the mathematics achievement tests to the teachers and administrators of the school district. It demonstrates that item difficulty and discrimination for most test items are suitable for use in the achievement tests. It further identifies those test items which would best be used for other assessment purposes. Because the analysis of test items involved three models of item analysis, this study provides an opportunity for comparison of the three models. This study concludes that the tests exhibit internal consistency and that the procedures used for marking multipoint test items were sufficient to provide rater reliability. It further concludes that the Tables of Specifications, which show content validity, and the comparisons of test scores with teacher-produced grades, which show concurrent validity, demonstrate the overall validity of these mathematics achievement tests.

TABLE OF CONTENTS

Abstract ii
Table of Contents iii
List of Tables vi
List of Figures viii

CHAPTER 1 - INTRODUCTION 1
  Background 3
  Background to the Tests 5
  The Problem 7
  The Research Question 7
  Reliability 8
  Curriculum 9
  Concurrent 9
  Delimitations and Limitations of the Study 10
  Definition of Terms 11
  Summary 13

CHAPTER 2 - LITERATURE REVIEW 14
  Math Achievement 14
  What is Math Achievement? 14
  National Council of Teachers of Mathematics (NCTM) 15
  British Columbia Math Integrated Resource Package 16
  Other Factors 17
  Factors Affecting Test Measurement 19
  Reliability 19
  Validity 22
  Construct Validity 22
  Content Validity 24
  Criterion Related Validity 25
  Item Analysis Models 26
  Classical Analysis 26
  Item Response Theory 29
  Item Characteristic Curves 30
  One Parameter Logistic Model 32
  Two Parameter Logistic Model 33
  Three Parameter Logistic Model 34
  Summary of the Thesis Topic 35
  Contribution of this Study to the Literature 35

CHAPTER 3 - METHOD 36
  Research Design 36
  Measures 36
  DMAT Grade 5 38
  DMAT Grade 7 38
  FSA - Foundation Skills Assessment 39
  Term and Year End Grades 41
  Procedures 41
  Test Construction Process 42
  Dichotomous and Multipoint Item Scoring Procedures 42
  Data Collection 44
  Data Analysis 45
  Computer Programs Used 47
  Iteman 48
  TestGraf 48
  Bigsteps 53
  Ascal - two parameter model 55
  Ascal - three parameter model 56
  SPSS 57

CHAPTER 4 - RESULTS 58
  Analysis of Test Items 58
  Methods of Analysis 58
  Grade 5 Results 58
  Classical Analysis Using Iteman 58
  Assumption Free Analysis Using TestGraf 62
  Rasch Item Response Analysis Using Bigsteps 65
  Logistic Item Response Analysis Using Two Parameter Ascal 68
  Logistic Item Response Analysis Using Three Parameter Ascal 70
  Summary of Grade 5 Item Response Analysis 72
  Grade 7 Results 75
  Classical Analysis Using Iteman 75
  Assumption Free Analysis Using TestGraf 79
  Rasch Item Response Analysis Using Bigsteps 83
  Logistic Item Response Analysis Using Two Parameter Ascal 87
  Logistic Item Response Analysis Using Three Parameter Ascal 89
  Summary of Grade 7 Item Response Analysis 90
  Reliability of Criterion Measures 95
  Inter-rater Reliability 96
  Content-Related Validity 101
  Tables of Specifications 101
  Criterion-Related Validity 105
  Strength of the Relationships 108
  Additional Validity Evidence 109

CHAPTER 5 - DISCUSSION 111
  Summary 111
  Conclusions 116
  Reliability and Validity 116
  Limitations 121
  Data Collection 121
  Implications for Future Research 122
  Implications for Future Practice 122
Reference List 124
Appendix A. Letters of Approval 127
Appendix B. Data by School; Data Breakdown of Term, Final and FSA Results 130
Appendix C. Grade 5 DMAT and Answer Key; Grade 7 DMAT and Answer Key 133
Appendix D. Letters to Principals 199
Appendix E. Summaries of Order of Difficulty and Discrimination 205
Appendix F. Statistical Formulae 208
Appendix G. Tables of Specification 209

List of Tables

Table 1. Classical Analysis of Grade 5 DMAT Items Using Iteman, Measures of Difficulty and Discrimination 59
Table 2. Rasch Analysis of Grade 5 DMAT Using Bigsteps, Summary of the Measures - Persons and Items 66
Table 3. Rasch Analysis of Grade 5 DMAT Using Bigsteps, Measures of Difficulty, Discrimination and Fit 67
Table 4. Logistic Item Response Analysis of Grade 5 DMAT Using Two and Three Parameter Models, Measures of Difficulty, Discrimination, Pseudo-guessing and Fit 69
Table 5. Items in the Grade 5 DMAT Exhibiting Lack of Fit in the Logistic Item Response Analyses 73
Table 6. Classical Analysis of Grade 7 DMAT Items Using Iteman, Measures of Difficulty and Discrimination 76
Table 7. Rasch Analysis of Grade 7 DMAT Using Bigsteps, Summary of the Measures - Persons and Items 84
Table 8. Rasch Analysis of Grade 7 DMAT Using Bigsteps, Measures of Difficulty, Discrimination and Fit 86
Table 9. Logistic Item Response Analysis of Grade 7 DMAT Using Two and Three Parameter Models of Ascal, Measures of Difficulty, Discrimination, Pseudo-guessing and Fit 88
Table 10. Items in the Grade 7 DMAT Exhibiting Lack of Fit in the Logistic Item Response Analyses 92
Table 11. Coefficient Alpha from the Classical Analysis Using Iteman 96
Table 12. Distribution of Marks by Rater for the Short Answer Items in the Grade 5 DMAT 97
Table 13. Distribution of Marks by Rater for the Short Answer Items in the Grade 7 DMAT 99
Table 14. Correlations Between Markers 100
Table 15. [...] in the Grade 5 and Grade 7 DMATs Organized by IRP Strand 102
Table 16. Summary of the Mathematics Processes in the Grade 5 and Grade 7 DMATs Organized by IRP Strand 104
Table 17. Numbers of Students Who Wrote the Grade 5 and Grade 7 DMATs Categorized by the Data Collected 106
Table 18. Correlations for the Grade 5 DMAT Between DMAT Scores, FSA Scores and Term and Final Marks 107
Table 19. Correlations for the Grade 7 DMAT Between DMAT Scores and Term and Final Marks 108
Table 20. Data by School 130
Table 21. Data Breakdown of Term, Final and FSA Results 132
Table 22. Summary of Order of Difficulty and Discrimination, Grade 5 205
Table 23. Summary of Order of Difficulty and Discrimination, Grade 7 206
Table 24. Table of Specifications for the Grade 5 DMAT Using the Mathematics IRP 209
Table 25. Table of Specifications for the Grade 5 DMAT Using Mathematics Processes 211
Table 26. Table of Specifications for the Grade 7 DMAT Using the Mathematics IRP 213
Table 27. Table of Specifications for the Grade 7 DMAT Using Mathematics Processes 215

List of Figures

Figure 1. Sample output of Iteman for the multiple choice items in the grade 5 DMAT 49
Figure 2. Sample output of Iteman for the short answer items on the grade 5 DMAT 51
Figure 3. Sample output of TestGraf for item 26 of the grade 5 DMAT 53
Figure 4. The ICC produced by TestGraf for item 2 of the grade 5 DMAT 63
Figure 5. The ICC produced by TestGraf for item 15 of the grade 5 DMAT 64
Figure 6. The ICC produced by TestGraf for item 1 of the grade 7 DMAT 79
Figure 7. The ICC produced by TestGraf for item 9 of the grade 7 DMAT 80
Figure 8. The ICC produced by TestGraf for item 11 of the grade 7 DMAT 81
Figure 9. The ICC produced by TestGraf for item 16 of the grade 7 DMAT 83

CHAPTER 1 - INTRODUCTION

Assessment, in the classroom, is inextricably linked to instruction. It is used by teachers in a myriad of forms to find out whether or not learning has taken place. It also provides direction for future instruction. The instruments and models that teachers use to assess student learning vary from highly subjective, as with direct observation, to more objective measures such as unit, term or even standardized tests. It is true that the subjectivity or objectivity of these measures can be questioned, but that is not the intent of this thesis. Rather, it is sufficient to recognize that assessment in its many forms works hand in hand with instruction to create learning opportunities for students.

In this thesis, I am interested in examining the two mathematics achievement tests developed by teachers in School District 57. They were designed to test student achievement in mathematics at grades 5 and 7. My chief interest is in determining to what degree these tests are technically adequate and can be considered reliable and valid indicators of student achievement and, in doing so, to provide evidence of the technical abilities of the educators selected to construct these tests.

The School District Mathematics Achievement Tests (DMATs) have been administered to groups of grade 5 and 7 students since they were first developed in 1995. Most years they were administered to a representative sampling of grade 5 and grade 7 students. In the fall of 2000, the tests were administered to all the grade 5 and 7 students in the school district. This change has meant that the tests can now be subjected to a more complete analysis. Although classical test analysis can be conducted on small or large data sets, because the number of students that were tested is large (1297 grade 5 students and 1175 grade 7 students) these tests can now be analyzed using a variety of item response models in addition to the classical analyses.

Initially, these tests were designed to provide the data needed to examine the overall mathematics program in the Prince George school district. It was felt that this information would be needed because a new mathematics curriculum was scheduled to be implemented beginning in 1995. This was the initial reason for the testing program and it has remained its primary function. However, because all grade 5 and grade 7 students were tested, a new element was introduced into the testing program. School administrators and classroom teachers were provided an opportunity to examine individual and school results and make comparisons with the overall achievement rates of students throughout the district. This provided teachers an opportunity to reflect on the effectiveness of their instructional strategies by comparing their students' results with those of other students in the district. However, these tests are of assistance to teachers, administrators and students only if they are reliable and valid indicators of student achievement.

I will begin by indicating why I feel this study to be important. To do this I will start by giving a brief history of the development and implementation of the mathematics achievement tests. I will follow that with a review of what some researchers consider the important elements of mathematics achievement.
This will include reference to elements promoted by the National Council of Teachers of Mathematics (NCTM) and will also refer to the new directions found in the mathematics curriculum in place in British Columbia (Mathematics K to 7: Integrated Resource Package (IRP), 1995). Because, in this paper, I am concerned with the reliability and validity of the DMATs, I will examine what other researchers consider to be the important aspects of test reliability and validity. I will also examine item and test analysis theories and what researchers consider to be the important elements of them. There are several computer programs which I will be using to assist in these analyses. This information will provide a background for the research question which I am proposing.

Background

As mentioned in the introduction, the DMATs have over the years provided school trustees, school district personnel, school administrative officers and teachers with information about the mathematics achievements of students in the district. To communicate this information to all interested parties, members of the district mathematics committee (a committee consisting of elementary and secondary teachers) [...] of these tests, which has intensified the need for a careful analysis.

In 2000, these tests were written by all grade 5 and grade 7 students. Test results were sent to each school, and the teachers and the school administrators were able to examine student scores directly. Individual student scores could be compared to district averages. These data gave teachers and administrators an opportunity to examine the effectiveness of the instructional practices used within the school. New plans and/or strategies could be developed and tested. In this way, the information was available to be used at a school level to develop or simply fine tune each teacher's mathematics program. This means that the data should provide a clear and accurate picture of how well students achieve. If it is viewed as useful, it can provide an opportunity for thoughtful planning. In other words, the test results can be effective only in so far as they prove to be reliable and valid measures of mathematics as prescribed.

In May of 2000, the Ministry of Education began a series of tests at grades 4, 7 and 10 to assess reading, writing and numeracy. These tests, called the Foundation Skills Assessment (FSA), have called into question the usefulness of the DMATs. Does the FSA test provide sufficient information about mathematics to provide a clear picture of the mathematics achievement of students in the district? Perhaps trusting that the FSA results provide sufficient information is warranted. Perhaps the additional information provided by the DMATs is warranted. This thesis will hopefully provide information that is relevant to this debate.

It is my intention, in this study, to determine whether or not the DMATs provide a reliable measure of student achievement in mathematics for grade 5 and 7 students. It is also my intention to determine the degree to which the tests can be considered valid measures of student mathematical achievement.

Background to the Tests

In 1994-95, School District 57 undertook the planning of curriculum assessment in a number of curriculum areas. In mathematics, a committee of elementary and secondary teachers (the district mathematics committee) was asked by the Director of School Services to develop an assessment model to test mathematics achievement levels.
The committee was asked to keep in mind budget limitations as the district was experiencing funding cutbacks. Working within these constraints, the committee proposed testing students in grades 4, 6 and 8. Moreover, rather than test all students in these grades, it was felt that a representative sample could be selected that would provide sufficient data to assess student levels of ability for each of the grades. A consultant, Dr. Iris McIntyre, worked with the committee to establish criteria for the selection of the schools that would make up the sample. She also provided a statistical analysis of the pilot study conducted on a group of grade 6 students in the spring of 1995. To avoid confusion it needs to be pointed out that the DMATs were administered to students in grades 5, 7 and 9, but the curriculum being assessed was that of grades 4, 6 and 8.

Members of the mathematics committee, in designing the tests, surveyed a sample of the teachers in the district to find out when in the year the tests should be administered. The teachers who responded wanted to complete all of the curriculum before the students were assessed, and they also wanted to get the results early enough to be able to use them in planning their mathematics program. To satisfy both these requirements, it was decided that the grade 4 curriculum would be assessed by testing grade 5 students early in the fall. The results could then be reviewed and analyzed, and recommendations could go out to the schools by Christmas. Similarly, the grade 6 curriculum and the grade 8 curriculum would be tested in grade 7 and grade 9 respectively. It was initially determined that testing would be conducted every second year so that the same group of students could be tracked from grades 4 through 8.

With this plan in place, the committee invited interested teachers in grades 4, 6 and 8 to review test banks and develop tests that would match the newly mandated mathematics curriculum (IRP, 1995). The grade 6 test was developed and piloted first. The other tests were then modelled after it, and the tests were administered to grade 5, 7 and 9 students in the fall of 1996.

Although the tests were originally designed to be administered every second year, the plan has been modified over time. Students were tested in 1996, 1998, 1999 and 2000. Most years this involved small representative samples. Data for each of these years have been collected. As was already mentioned, in the fall of 2000, all grade 5 and grade 7 students were tested. There was some reliability evidence established for the initial grade 7 test; however, over the years the content of this test has changed. The reliability of the test needs to be re-established. The reliability of the grade 5 test has not yet been established. Further, there is also a need to establish their validity. This is the intent of this paper.

The Problem

The DMATs were initially developed in 1995. There have, over time, been minor revisions as test items were tried, reviewed and in some cases rewritten. The procedures for administering the tests have also been modified over time. Although the reliability of the initial pilot test (grade 7, spring 1996) was confirmed, the reliability of the current forms of the grade 5 and 7 tests has not been checked. In addition, the validity of the tests, given the changes that have been made and the adoption of a new mathematics curriculum in BC, needs to be assessed.
In considering the reliability and validity of these tests, I will review their construction and administration and assess the effectiveness of these processes. The large sample sizes in the 2000 test administration have now made it possible to establish not only test characteristics but item statistics as well. Analyses of these tests in previous years were limited to a classical approach. Analyses by a variety of item response models can now supplement a classical item analysis. This analysis provides an opportunity to compare these item analysis models.

The Research Question

It is my intention, in this study, to determine the reliability and the validity of the grade 5 and grade 7 DMATs using data from the year 2000 test scores, to review the construction and administration of the DMATs and assess the effectiveness of these processes, and to analyse test items using various item response models and in so doing compare the models.

To determine the level of reliability for these tests, I will be examining two aspects of reliability. First, I will examine the degree to which the test items give consistent results. This is a measure of the internal reliability of the tests. Then I will examine the degree to which raters agreed when marking the same tests. This will establish the reliability of the raters. The nature of the tests, mathematics questions and problems taken from existing test banks, and the structure of the tests, multiple choice and short answer questions, lead me to believe that both measures of reliability will be high.

The first section of each test was multiple choice. Student responses were recorded on a bubbled answer sheet and scanned by a computer. Marks have been established and recorded by the computer. The second section was short answer questions marked by a panel of teachers. To determine the inter-rater reliability, all markers were asked to mark one set of randomly selected tests. The marks from this set of tests will be compared and will be used to show the degree to which the markers agree.

I examine the internal consistency of the tests by examining the individual item responses. At issue is whether or not response patterns are as expected. Included in the analysis is the level of difficulty of each item (mean score) and the degree of discrimination exhibited by the item. There are two measures that can be used to calculate discrimination. First, there is a point-biserial correlation of correct responses to test scores and second, there is a discrimination index described as the difference between the proportion of correct responses of high achieving students and the proportion of correct responses of low achieving students. I also include in this analysis overall test statistics such as mean, standard deviation, skew, kurtosis and standard error of measurement (SEM). As a final measure of internal consistency, I examine the calculation of coefficient alpha for each test.

The programs I will be using to assist in this analysis include: ITEMAN, version 3.5, Assessment Systems Corporation, St. Paul, Minnesota (1993); TestGraf, developed by J. O. Ramsay, McGill University (2000); Bigsteps, constructed by John M. Linacre and Benjamin D. Wright and available through MESA Press (1996); and Ascal, available through Assessment Systems Corporation (1989) and part of the MicroCat Testing System. With Ascal I will use both the two parameter and the three parameter item response models.
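To make the coefficient alpha calculation mentioned above concrete, the following is a minimal sketch of how it can be computed directly from an examinee-by-item score matrix. The response matrix shown is invented toy data, and the sketch illustrates the formula only; it is not the output of ITEMAN or any of the other packages listed here.

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's coefficient alpha for an (examinees x items) score matrix.

    The index works for dichotomous (0/1) items as well as multipoint items,
    which is the reason given in the text for choosing it.
    """
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (n_items / (n_items - 1)) * (1.0 - item_variances.sum() / total_variance)

# Toy 0/1 response matrix: 5 examinees (rows) by 4 items (columns).
responses = np.array([[1, 1, 0, 1],
                      [1, 0, 0, 0],
                      [1, 1, 1, 1],
                      [0, 0, 0, 1],
                      [1, 1, 1, 0]])
print(round(coefficient_alpha(responses), 3))
```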
Curriculum. To gather evidence of validity, I will first match the content of each test to the curriculum in BC. In part this analysis has already been done and I will include a summary of it. Over the years various members of the district mathematics committee have reviewed the test questions to ensure that they are included in the curriculum for that grade level. As part of these reviews, committee members have deliberated on the level of difficulty of each question. The predicted degree of difficulty was recorded by the committee members on a three point scale: easy, average or hard. I include this information as part of a table of specifications referred to later. I will further analyze the questions, however, by identifying which of the mathematical processes (communication, connection, estimation, problem solving, reasoning, technology and visualization (The Common Curriculum Framework for K-12 Mathematics, Western Canadian Protocol Collaboration in Basic Education, p. 5)) are involved in each question. This information will provide a more complete description of not just the learning outcome associated with each question but also the predominant mathematical process associated with it. For each test, this will be done by drawing up a table of specifications.

Concurrent. I will examine the concurrent validity of the tests. I will do this by comparing each student's results on the DMATs to the mathematics marks grade 4 or 6 teachers assigned them during the 1999-2000 school year. An additional measure of validity for grade 4 students is the relationship between the DMAT scores and the FSA numeracy scores from the FSA test they wrote in May 2000. A high correlation between these results would demonstrate strong concurrent validity.
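As a rough illustration of the concurrent-validity comparison just described, the sketch below correlates a handful of hypothetical paired records. The student values are invented, Pearson's r is assumed as the correlation statistic, and the letter grades are assumed to have already been converted to the ordinal coding described later in this chapter.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired lists of scores."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

# Hypothetical paired records for six students: DMAT raw score, year-end
# mathematics grade (coded ordinally) and FSA numeracy level (1-5).
dmat_scores = [34, 41, 27, 45, 38, 22]
math_grades = [4, 6, 3, 7, 5, 3]
fsa_levels  = [3, 4, 2, 5, 3, 2]

print("DMAT vs. teacher-assigned grade:", round(pearson_r(dmat_scores, math_grades), 2))
print("DMAT vs. FSA numeracy level:", round(pearson_r(dmat_scores, fsa_levels), 2))
```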
Delimitations and Limitations of the Study

This study has been limited to the test results from the grade 5 and 7 tests administered in the fall of 2000. Although data from the years 1996, 1998 and 1999 were available, they will not be included. The data for these years were limited to a representative sampling of students and schools. The data from the year 2000 tests included students from all the schools in the school district. It was felt that the data from this one year provided a sufficiently large data set with which to analyze the reliability and the validity of the current test instruments. Including data from previous years would have increased the relative uncertainty of the results because of the changes made to the previous tests. Results from the grade 9 test were not used as these tests were not administered to students in the year 2000.

In 1998, the usefulness of one part of the mathematics tests was called into question. At that time, the tests at each of the levels included a section termed Performance Assessment. In this section a randomly selected group of students met with an examiner, one student at a time, to solve a series of problems. Each student was given the opportunity to use concrete instructional aides and was asked to interact with the examiner while solving the problems. They were encouraged to articulate their reasoning while they worked on each problem. Examiners could prompt students using leading questions wherever this was needed. Scores were recorded on a 4 point scoring rubric. Following the 1998 tests the members of the district mathematics committee decided to discontinue using the performance assessment part of the test. It was felt that the information provided by the test was not sufficiently reliable to warrant continuing with it.

This study compared student scores from two separate school years: the 1999-2000 school year and the fall of the 2000-01 school year. Because movement between schools (and school districts) has occurred, some students' scores could not be matched and their test scores have not been included in the study of validity.

Definition of Terms

I have used the terms reliability and validity extensively in the preceding sections of this proposal and will consider each term in greater detail in the literature review that follows. However, to help provide some understanding of these terms, I include here some general descriptions given by Sax (1997). For reliability, he states that "reliability describes the extent to which measurement can be depended on to provide consistent, unambiguous information" and that reliable measurements "...reflect true rather than chance aspects of the trait or ability measured" (p. 271). For validity, he provides the following definition: "Validity is defined as the extent to which measurements are useful in making decisions and providing explanations relevant to a given purpose" (p. 304). Each term will be considered in greater detail in the next sections of this proposal.

The year end grade for students in grades 4 and 6 is given as a letter grade: A, B, C+, C, C-, I or F. This grade is normally the average of the term grades reported for each subject during the school year. The grades reported during the year are also reported using the above noted letter grades. These grades normally correspond to the following ranges of marks: F, a final mark of between 0% and 49%; I, an incomplete mark (it can be adjusted with evidence of additional achievement) of between 0% and 49%; C-, a mark of between 50% and 59%; C, a mark of between 60% and 66%; C+, a mark of between 67% and 75%; B, a mark of between 76% and 83%; and A, a mark of between 84% and 100%. It was noted that a small number of students were working on individual education plans (IEPs). The specific nature of the IEPs and the learning objectives included in them were not noted in this study. Because the range of marks represented by the letter grades varied considerably, the relative order of the marks was deemed the most important feature, and for analysis the grades were given the following values: F - 0, I - 1, IEP - 2, C- - 3, C - 4, C+ - 5, B - 6 and A - 7. The result from the DMATs for each student is recorded as the number of correct responses compared to total possible responses.

The results for students in grade 4 who wrote the Ministry of Education FSA test are listed on a 5 point scale: 1 - "Not Yet Within Expectations", 2 - between "Not Yet Within Expectations" and "Meets Expectations", 3 - "Meets Expectations", 4 - between "Meets Expectations" and "Exceeds Expectations" and 5 - "Exceeds Expectations".
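A minimal sketch of the coding just described follows. The dictionary values simply restate the ordinal assignments listed above; the helper function is a hypothetical convenience for illustration, not part of the district's actual data handling.

```python
# Ordinal coding of reported letter grades, as described above. "IEP" is the
# value used for the small number of students working on individual education plans.
LETTER_GRADE_CODE = {"F": 0, "I": 1, "IEP": 2, "C-": 3, "C": 4, "C+": 5, "B": 6, "A": 7}

# FSA numeracy results are reported on the 5-point scale listed above.
FSA_SCALE = {
    1: "Not Yet Within Expectations",
    2: "Between Not Yet Within Expectations and Meets Expectations",
    3: "Meets Expectations",
    4: "Between Meets Expectations and Exceeds Expectations",
    5: "Exceeds Expectations",
}

def code_grade(reported_grade):
    """Convert a reported letter grade (e.g. 'C+') to its ordinal value."""
    return LETTER_GRADE_CODE[reported_grade.strip().upper()]

print(code_grade("C+"), code_grade("B"))   # -> 5 6
```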
Summary

The reliability and validity of the DMATs need to be established so that teachers and administrators can confidently use the test results to examine existing mathematics programs. These results will enable personnel in the schools to plan, refine and develop mathematical instruction.

CHAPTER 2 - LITERATURE REVIEW

Mathematics Achievement

The following are some of the observations made by researchers about the context within which mathematics is taught and learned. I will be considering five main sources in this section. The first, Webb and Romberg (1992), provides an overview of the National Council of Teachers of Mathematics (NCTM) 1989 Curriculum and Evaluation Standards and how the adoption of these new standards has led to the recent reforms we see in the British Columbia mathematics curriculum. The second source is the Province of British Columbia, Curriculum Branch, Mathematics K to 7: Integrated Resource Package (IRP) (1995). I will examine the rationale for its connections to the NCTM's standards. The third, fourth and fifth sources I will consider together as there are many elements that are common to them all. The third source is from the British Columbia Ministry of Education, Skills and Training and provides information from the 1995 British Columbia Assessment of Mathematics and Science. Its authors, Marshall et al. (1997), outline the technical information about the test and the recommendations that came from it. The fourth source, also from 1995, is the International Association for the Evaluation of Educational Achievement (IEA)'s Third International Mathematics and Science Study (TIMSS). The authors of this report, Mullis et al. (1997), describe the results of this international assessment. Canada was one of 45 countries which participated in this assessment (Mullis et al., 1997; McConaghy, 1998). There were results reported at five different grade levels. The report I reviewed gave results from the grade 3 and 4 assessments. The fifth, by Kober (1991), is a meta-study of the issues encompassing the teaching and learning of mathematics. This is supported by a more recent meta-study by Kilpatrick [...]

Factors Affecting Test Measurement

A first factor affecting reliability is the examinees themselves. Students vary in their preparation for the test. They may also vary in a myriad of other ways. Their health, mental alertness, stamina, competitiveness and willingness to take risks could vary from day to day. In fact there could be all sorts of reasons why a student's test scores could vary from test to test. Many of these factors are beyond the control of test administrators. The district achievement tests included directions to test administrators and to students designed to minimize the effects of those factors by standardizing test procedures.

A second influence could be attributed to the examiner or marker. The examiner could invalidate the test by giving extra help, pointing out mistakes, giving extra time or in other ways helping the student taking the test. This too is generally minimized with the standardization of test procedures. Markers can introduce error variance when subjective assessment enters into the marking of a test. Error variance can be minimized with well laid out marking criteria and rubrics, but it can never be completely eliminated. Because the district mathematics achievement tests have a section of short answer questions, some marker subjectivity is expected and inter-rater reliability will need to be considered.

A third factor affecting reliability is the content of the achievement tests. To test all learning outcomes for a given grade is unreasonable as the resulting test would become long and unwieldy. There is a need to establish the generalizability of the questions listed on the test. This will be considered in conjunction with the Table of Specifications.

A fourth factor affecting reliability is the influence of time. Too little time between tests means that some students will remember test items and therefore have a bigger advantage; too much time between tests means that student learning could affect the distribution of marks. This happens because different individuals learn at different
A A u rA A cA r aSecting re lia b ility is A e influences o f tim e. Too little tim e between tests, means A at some students w ill remember A st items and Aere A re have a bigger advanAge; too much tim e between Asts, means that student learning could aûect the A stribution o f marks. This hzqrpens because difA rent inA viduals leam at diSerent 20 Tates. Although this is a Ihctor aOecting re lia b ility, it's not a consideration in this study because the mathematics achievement tests are administered only once. The re lia b ility being considered in this study is the internal re lia b ility o f a single adm inistration te st A G fth 6 cto r affecting re lia b ility is the situation in w hich the test is administered. The conditions prevalent at the tim e o f test taking w ill influence the results o f the test. This factor is difG cult to enter into any test re lia b ility measures. M ost o f these influences are best addressed by the examiner. W ith students iM io have greater experience taking test, this 6 c to r is less signiScant. W ith younger, less experienced students this may be a signiGcant factor influencing Ae re lia b ility o f the tesL H istorically, there are three different approaches to computing re lia b ility coefGcients (Cunningham, 1998; Feldt & Brennan, 1989). The Grst is parallel forms. In this method two diGGaent but parallel tests are constructed and administered to students. The extent o f correlation between the test scores measures the re lia b ility. Parallel forms provide two types o f inform ation. They indicate whether students have changed (given tim e between test administrations). They provide inform ation about test items and indicate whether or not sim ilar items test the same traits. The second is test-retest re lia b ility. In this method the same test is administered tw ice. The resulting re lia b ility measure is considered weak, however, because in cases where the period o f tim e between tests is short, a higher than expected coefficient can result W ien students remember and copy their previous responses. A third method is the single adm inistration method. In this method, test items are divided into groups w hich are then tested fo r internal consistency. Because the district achievement tests are one form administered to students only once, the single adm inistration mediod is used in this analysis. There are several 21 methods o f calculating correlations to r a single administered test - Spearman-Brown form ula, tau-equivalent and coefBcient alpha to m ention three (Cunningham, 1998; Feldt & Brennan, 1989). The calculation used in this analysis is C oefficient A lpha The advantage to this method is that it can be used on tests scored dichotomously or on tests where a range o f scores is assigned. Lyman (1998) considers va h d i^ to be the most im portant attribute o f a good test. M cM illan and Schumacher (1997) point out that test va lid ity refers to the extent to w hich test scores can be held to be m eaningful. They insist that va lid ity refers to the infsrences, uses and/or consequences o f a test's measure. Traditionally, authors have referred to test va lid ity under three main aspects (Lyman, 1998; Cunningham, 1998; Gipps and Murphy, 1994): construct v a lid ity, content va lid ity and criterion validity. CoMstrwct Fa/fdïty The Grst type o f validation I wish to consider is construct-related validity. This relates to what Cunningham (1998) refers to as constructs. Constructs are broad descriptors o f behaviours. 
Examples which Cunningham gives include intelligence, creativity and reading comprehension. For the purpose of this study, constructs could also include reasoning, problem solving, estimating, communicating and investigating. Some of these behaviours can be quantified using questionnaires that give a range of responses (Likert type scale). The responses can then be assigned a concrete numerical index. Construct validity can provide an indication of how well the numerical index really reflects the construct. Its purpose is to clarify exactly what is being measured by the test. There is no single coefficient that one can establish. Instead, this validity is measured by evidence inferred from the test results. Cunningham states that, "Before we can accept the view that a test measures what it purports to measure, a logical case must be established for why inferences about a construct, based on test scores, are legitimate" (p. 40). He goes on to discuss three phases used to establish construct-related validity. The first step is to determine whether or not a single entity is being measured; the next is to describe the theory on which the construct is based, including what the construct is and how it will be observed; and the final step is to examine the results to see if its interaction is as predicted.

In the following sections two other forms of test validity will be discussed. They are content-related and criterion-related validity. There is a contemporary view of test validation which draws these two aspects of validity measures into a single unifying view. This framework, proposed by Messick (1989), unites content- and criterion-related validity with an all-inclusive concept of construct validity. Sax identifies six aspects of construct validity: "content, substantive, structural, generalizability, external and consequential" (1997, p. 314). He notes that content-relevance validation interprets how well the test items cover the domain tested. It further identifies the degree to which the items can be considered "relevant, representative and socially desirable" (1997, p. 314). The substantive aspect of validation points to the need for evidence for validity from a variety of sources. Structural validity refers to the scoring criteria itself. The values assigned to test items need to relate logically to the respondent's task. A test demonstrates generalizability validation in so far as the properties and interpretations placed on respondents' scores can be generalized across the totality of the construct domain. The external aspect of test validation is sometimes referred to as convergent and discriminant validity. In it, test results are expected to correlate strongly with related concepts but show a weak correlation to non-related concepts. Finally, the consequential aspect of validation takes into account the values of fairness, bias and social implications associated with the interpretation of test scores. Messick's position, that test validation combines traditional ideas about test validity under the unified concept of construct validity, postulates that all measures of validity, empirical and inferential, provide valuable information about the nature of the assessment instruments (Messick, 1989).

Content Validity

A test is said to have content-related validity when test items relate strongly with the domain of knowledge taught. This validity measure is non-statistical in nature. In fact, some consider it suspect in its subjectivity.
Concern has been expressed that this validity can be easily biased by test producers and vendors (Cunningham, 1998). However, this validity measure is of particular importance for achievement tests such as the mathematics achievement tests under consideration in this paper because it expresses the degree to which the items included in the test match the actual learning outcomes related to a course. This includes elements of knowledge included in the course and the skills to apply the knowledge taught. Cunningham (1998) sets out a method of testing content-related validity using a table of specifications. This table includes a list of instructional objectives (learning outcomes), the items associated with them, the number of items used to assess each objective and the cognitive level of each item.

Criterion Related Validity

The third form of validity is criterion-related validity. It provides a more concrete form of validity because it is statistically based. It provides evidence that the test measures criteria that compare with other standards that are closely related but external to the test (Lyman, 1998; Cunningham, 1998). It sounds ideal but unfortunately has some drawbacks. The first limitation is the difficulty in identifying appropriate criterion measures that are outside of the test and of a quality that gives a good point of reference. This could prove to be a difficulty in this study because, with a new mathematics curriculum, the objectives may be quite different from those of existing standardized achievement tests. The second limitation is the interpretation of the resulting correlation coefficients. There are many factors that affect the size of the correlation measures. How are they to be interpreted?

Criterion-related validity, because of its statistical nature, is often highly sought after (Lyman, 1998). In general, the higher the measure of correlation between the test and criterion standards, the better. There are factors, however, that need to be considered. Some tests may not lend themselves to establishing criteria for comparison. Alternately, the test may lend itself to comparison with criterion standards but the criterion may be altogether different or have a different emphasis. This needs to be a consideration in the present study. Even though there are standard mathematics achievement tests available, with the changes in the curriculum, is it possible to compare the current learning outcomes with criteria selected from previous curriculum models? In real terms, can criteria that focused on computation be validly compared with questions that focus on reasoning and/or communication of mathematical concepts?
Item Analysis In this study o f the DM ATs, analysis o f the test items is an im portant consideration. A general model fo r test construction described by Henrysson (1971) includes: a pretryout stage where the test is planned, a tryout stage where the test is administered to a representative sample (300 students or more) and a tria l adm inistration stage. In terms o f the DM ATs, a ll these stages have been completed and the data from the tests administered in 2000 provides an opportunity & r an in depth analysis o f the test items. A m ^o r advantage to classical item analysis is that it can be conducted on small samples ju st as w ell as large samples. The main indices examined in classical analysis are item difBcuhy aixl item discrim ination. 26 The measure o f item difS culty fo r a test item is deûned as '^the proportion o f examinees who get that item correct" (A llen & Yen, 1979, p. 120). This measure is usehil in determining whether or not an item is suitable fo r the a b ility level o f 6 e students. I f an item is too easy a ll examinees w ould get it correct and the measure o f difG culty would be 1.0. % on the other hand, an item is too difR cult, a ll examiiKes would get it incorrect and the measure o f difG culty would be 0. These values represent extremes w hich under normal circumstances one does not erKounter. Even so, they show that measures o f item difG culty approaching these values should be held suspect and the items should be examined closely fo r usefulness. AUen & Yen suggest that an item provides maximal inform ation about the difkrences between examinees when the level o f difG culty fa r the item (p j is .5. They indicate that this varies depending on the type o f question. W ith m ultiple choice questions, because there is a guessing factor that must also be considered, they suggest using a value o f .6 as a maximal value fo r a four choice test. Depending on the test, these values act as a target to provide maximum inform ation about the examinees. There are some exceptions to the p value targets suggested by AUen & Yen. The authors add thatp values used as a target need to reflect the overall purpose o f the test. In tests designed to ide ntify students in need o f remedial intervention, items w ith high p values fo r the general population (easy hems) need to be used. In tests designed to identify high achieving students fo r awards or fo r special enrichment type programs, items whh low p values (difG cult item s) are needed. Because the DM ATs are student achievement measures, a range o fp values is desirable. A llen & Yen (1979) suggest that the most suitable range fo r item di@ culty is about .3 < p < .7. 27 A measure o f item discrim ination is determined in either o f two ways —an item discrim ination index (D ,) or an item /total-test-score point-biserial correlation (A llen & Yen, 1979). These statistics norm ally produce sim ilar interpretations. Both are used to calculate the degree to w hich an item discriminates between high and low scoring examinees. O r to put it another way, either o f them can be used to indicate W iether a student who does w ell on the test as a W iole (high scoring) is more hkely to get a particular item correct than a student who does poorly on the test as a whole (low scoring). N orm ally, high values o f individual item discrim inatioris are desirable fo r a test. 
The item discrim ination index (D,) is determined by calculating the difference between the proportion o f high scoring examinees wdio correctly answer an item and the proportion o f low scoring examinees who correctly answer the item . A form ula fo r this calculation is: i> , = ^ n, where L/} is the number o f examinees in an established upper range o f scores on the whole test and w to also got the item correct, JLjis the number o f examinees in the low er range o f scores on the whole test and w iio got the item correct and n, is the number o f examinees in the upper and low er ranges (AUen & Yen, 1979). W hile the iqiper and lower ranges m ight logically be the top quarter or th ird and the low er quarter or third, the proportion that is chosen by software designers is actually 27%. Research has shown that the sensitivity and stability o f the item discrim ination index is often greatest when using the upper 27% and the low er 27% o f the examinees (Crocker & A lgina, 1986). 28 The item /total-test-score point-biserial correlation is a comparison o f scores on the item to to ta l test scores. This is a Pearson correlation and the harmula fo r this calculation is: _ where I A is the mean o f the scores among examinees who responded correctly to item i, and s% are the mean and standard deviations fo r aU examinees and is the item difG culty (A lle n & Yen, 1979). In classical analysis, knowing the level o f item discrim ination, D or is valuable. For items that are w ell behaved, the discrim ination should be positive. This means that more high scoring examinees select the correct response fo r the item than low scoring examinees. Negative discnmmation values would indicate exactly the opposite, more low scoring examinees select the correct response than high scoring examinees. Items w ith a low or negative discrim ination fo r the correct response are suspicious and in general should be removed horn the tesL Aem Response TAeory Item response theory (IR T) was created in an effort to overcome shortcomings associated w ith classical test analysis. The most notable o f these shortcomings is that in classical item analysis, characteristics that are associated w ith the examinees cannot be separated from characteristics associated w ith the test (Hambleton, Swaminathan & Rogers, 1991). In terms o f a student w ritin g a test, the classical measure o f the student's a b ility is the student's test score. However, the scores are a function o f the difG culty o f the test items. For an easy test, the student's test score could be quite high whereas iv ith 29 a difR cult test that student's test score could be quite low . Item response theory has been developed to overcome this interdependence o f examine and test characteristics. IR T rest on two basic postulates (Hambleton, Swaminathan & Rogers,1991). The frs t postulate is that an examinee's per&rmance can be predicted using factors referred to as traits o r abilities. The second postulate is that there is a relationA ip between the responses to an item and the examinees' traits that can be described by a continuous and increasing function. This function is called the item characteristic fra ctio n and when applied to examinee a b ility and the probability o f a correct response to an item , it is termed an item characteristic curve (IC C ) (Crocker & A lgina, 1986). A n ICC w hich is o f special significance is that based on a normal distribution. This is termed a normal ogive. It has several special properties. 
First, going fo m the le ft to the righ t the curve rises continually. Second, the lower asymptote approaches zero and the upper asymptote approaches one. Third, it is directly related to a normal distribution and therefore graphs proportions that are functions o f the z-scores (Crocker & A lgina, 1986). IR T models rely on maximum likelihood probabilities and as such may or may not be applicable fo r use in analyzing some sets o f data. We are cautioned to assess the f t o f the model to the data. Where f t can be ve rife d , IR T models provide the opportunity to estimate examinee a b ility independent o f test items and to establish item characterisfcs that are independent o f the group tested. (Hambleton, Swaminathan & Rogers, 1991) Aem cAwocteMsric cr/rves. A n item characterisfc curve (IC C ) is described by A lle n &Yen (1979) as a g ra ^ c display o f the re la fonship between the probability that an examinee w ill correcfy 30 respond to an item and the examinees relative score on the tesL This relationship is supported by Crocker and A lgina (1986) who note that true scores on a test are related to the latent tra it gr^ihed in an ICC. I f test scores are used as the best estimate o f the true score then sim ilarities between ICCs and classical item statistics can be noted. The ICC fo r an item w ould have estimates represented on a graph w ith total test scores ranked along the horizontal axis and the proportion o f examinees' responses located on the vertical axis. The resulting curve provides an opportunity to examine the degree o f difG culty and level o f discrim ination. Ramsay (2000) used ICC displays in the development o f the program TestG raf The program is used to display the probability that examinees w ill choose certain options depending on the prohciency o f the examinee. I use it here to illustrate how the classical measures o f difG culty and discrim ination can be related to ICCs. In TestGraf the degree o f difG culty fo r an item is defined as the prohciency level (e)q)ected score) that corresponds to a probability o f .5 on the vertical axis. In other words, it measures the estimated a b ility score (or rank) at w hich 50% o f the examinees that had that score correctly responded to the question. For an easy item more lower scoring examinees get the item correct and the .5 proportion w ould be reached at a low score or percentile. For a more difG cult item few higher scoring examinees wiU get the item correct and the score (or rank) at w hich .5 o f the examinees w ith that score get the item correct could be quite a high score or percentile. In TestGraf item discrim ination can also be measured using the ICC. The discriinination is dehned as the slope o f the ICC. This is one way in w hich the item analysis using TestGraf d i^ rs 6om item reqxmse logistic (IR ) models. In the IR models 31 the measure o f difRcuhy and the level o f discrim ination are both measured at the estimated total test score (or rank) at w hich 50% o f the examinees w ith that score (or rank) correctly respond to the item . The analysis w ith TestGraf^ however, provides the opportunity to observe how discrim ination is lik e ly to vary between the different groups o f examinees at d iff^ e n t ranges o f the expected score. An item may display an ICC which shows great discrim ination fo r low scoring examinees and have low discrim ination 6)r high scoring examinees. 
As A U at & Yen note, "ICCs can be useful in id a iti^ in g items that per&rm differently fo r different groups o f examinees" (A llen & Yen, 1979, p. 129). The one drawback to ICC fa r analysis is that large samples are required to make realistic estimations o f the response curves. This is particularly im portant fo r the extremes o f the test scores, the high scoring examinees and the lowest scoring examinees w boe fewer examinees fnovide data hrr estim ating the ICC. One pwamcter logwtzc mWel. The one parameter IR model, also called Rasch model, provides an analysis o f items where the only parameter o f interest is the item d iffic u lty . It is assumed that other parameters do not a fk c t the model. For this model a ll ICCs have an identical shape because discrim ination is assumed to be equal. The ICCs only d iffe r in the ir placement along a d iffic u lty /a b ility continuum. This model is based on the premise that the odds fo r success o f an examinee are based on the product o f an examinee's a b ility ( ^ and the easiness o f the item where easiness is dehned as 1/h w ith h being the difG culty o f the item . Hambleton (1989) shows that based on this premise, the form o f the resulting ICC can be deGned using the fmrmula: 32 1+ The f is the probability that examinee n w ill answer the rA item correctly, is the a b ility measure o f examinee n and 6, is the level o f diS iculty o f the item , i . The a b ility measures o f the groiq* o f examinees can be transformed so that the mean a b ility is 0 and their standard deviation is 1. The parameter o f interest (6,) is measured as the location on the a b ility distribution where 50% o f the examinees o f that a b ili^ would get the item correct. Negative values fo r a b ility are located to the leA o f the mean; there&re, a negative value represents an easier item . S im ilarly, a positive 6, values would indicate a more difG cult item . In Rasch analysis, it is a common practice to center the item difG culty at zero. Parameter values fo r this model then typica lly are values ranging Aom -2 to 2. Two parmneier /ogisirc modle/. The two parameter IR logistic model provides an analysis o f items where the parameters o f interest are the ito n difG culty and the item discrim ination. It is assumed that no other parameters aflect the model. It is deGned by an ICC formed by the follow ing function: 0 = 1,2, 3,..., k). f is the probability that an examinee n w ith a b ility ^ w ill answer item i correcGy and a and h are parameters that characterize item i. The variable k is the number o f items on the test and T) is a scaling factor v h ich brings the resulting curve close to a normal ogive (Hambleton, 1989). 33 In this model a b ility vaines are as deSned & r the 1 parameter logistic model. The parameters a, and 6, are usually referred to as the item discrim ination (a,) and as the item difG culty (6/). The item difG culty represents the point on the a b ility scale where an examinee has a 50% probability o f answering the item correctly. The item discrim ination is the slope o f the ICC curve at the point 6. In theory there are no upper or low er lim its to the value o f a; however, in practice items w ith a negative value fo r a would be discarded. The slope o f ICC generally is not greater than 2 so the e fkctive range fo ra is considered 0 to 2. The three parameter IR T model provides analysis fo r items where three parameters are o f interest: difG culty, discrim inatioi^ and low er asymptote (pseudo­ chance level). 
Three parameter logistic model. The three parameter IRT model provides analysis for items where three parameters are of interest: difficulty, discrimination and lower asymptote (pseudo-chance level). It is defined by an ICC formed by the following function:

$$P_i(\theta_n) = c_i + (1 - c_i)\,\frac{e^{D a_i(\theta_n - b_i)}}{1 + e^{D a_i(\theta_n - b_i)}}, \qquad i = 1, 2, 3, \ldots, k.$$

P_i(θ_n) is the probability that an examinee n with ability θ_n will answer item i correctly, and a_i, b_i and c_i are parameters that characterize item i. The variable k is the number of items on the test and D is a scaling factor which brings the resulting curve close to a normal ogive (Hambleton, 1989). Parameters D, a_i and b_i have the same meanings as for the two parameter model, except that at b_i on the ability scale the probability of a correct response is (1 + c_i)/2 rather than .50. The lower asymptote (c_i) value represents the probability of an extremely low scoring examinee getting the item correct. It can be considered a measure of guessing at an item. Because c_i is a probability of getting the item correct, its range of values is 0 to 1. For multiple choice tests the initial estimate for c_i is the inverse of the number of possible responses.

Summary of the Thesis Topic

The focus of this paper is the reliability and validity of the DMATs for grade 5 and 7 students. These tests, together with student grades from the 1999-2000 school year and results from the May 2000 FSA grade 4 tests, provide considerable data with which to undertake this study. Other information, including an analysis of the construction and administration of these tests and a comparison of the item response models, will be considered in the course of this study.

The Contribution this Study will Make to the Literature

This study is primarily an empirical study of concurrent validity examining the relationship between the locally developed DMATs and student results on classroom based assessment of mathematics achievement as well as provincial tests of student numeracy. Although the procedures followed will focus on traditional validity theory, it will examine aspects of validity more closely associated with the position taken by Messick (1989). This project will provide support to School District 57 personnel in that it will provide information about the DMATs. It will also add validation information to the general body of knowledge related to test validity.

CHAPTER 3 - METHOD

Research Design

This is an empirical study using test data and statistical analysis to establish the reliability and validity of the mathematics achievement tests developed by School District 57. The test data used for this study consist of: scores from the DMATs administered in the fall of 2000, mathematics grades for students enrolled in grades 4 and 6 during the 1999-2000 school year (term marks as well as final grades) and FSA scores for those grade 4 students who wrote the FSA numeracy test in May, 2000. This data set was collected from school principals and from School District personnel at the board office. It was cross-checked to ensure accuracy and stored in a computer database.

The subjects of this study are the grade 5 and grade 7 DMATs and, indirectly, the teachers and the SDMC members who designed and constructed them. My analysis of them will include examining the test items, examining the role played by those involved in the construction of the tests and the rating of the examinees and items, and examining the item analysis methods used to assist in the analyses of the test data. All grade 5 and grade 7 students in School District No. 57 who wrote the DMAT tests during the fall of 2000 are, indirectly, participants in this study.
A list of the schools and the numbers of the students in each school who wrote the DMAT is included in Appendix B. The instructions for administering the tests, sent to the schools with the test booklets, outlined which students were to write the test and which students were to be excluded. Principals were directed that all grade 5 and all grade 7 students were to be included except where inclusion would result in undue hardship to the student. Students were to be excluded if they exhibited moderate to severe intellectual disabilities, severe behaviour disorders, multiple disabilities or autism, had extended absences, or were not capable of responding. In the case of students on Individual Education Plans (IEP), teachers were asked to include them if they were capable of responding, provided they had appropriate assistance similar to that provided in regular classroom situations or as described in the student's IEP.

The data set gathered from the DMAT was used for the item analysis of the tests. These data were also used to establish the internal consistency of the tests. However, to establish validity, not all these data could be used, because not all of them could be matched with specific students. There were two general categories for which a match between DMAT scores and term and year end marks could not be made. The first category included all those students for whom identification on the DMAT scores was a problem. Some of these students had recorded school numbers correctly but, when the lists were sent to the schools, principals were unable to decipher the names. Others had school numbers recorded incorrectly, and tracking down their records was impossible because it was impossible to know which school they attended. In Table 22 in Appendix B, under "Miscoded Data", I have noted the numbers of students for whom DMAT results were known but for whom the school they attended could not be established.

The second category, by far the largest, included those students who wrote the DMATs in the fall of 2000 and moved before data on their term and final grades could be collected. The collection of data from the schools did not commence until February, 2002 and was not completed until March, 2003. This meant that most students had changed schools and their records had been sent to their new schools. In most instances, movement of grade 7 students was predictable: they moved to the high schools in their catchment area. Some students, however, could not be tracked. Either they had moved out of district (in some instances, out of province) or their new school was simply not known. In Table 23 in Appendix B, there is a summary of the students for whom school records were not available. This turned out to be 185 students in grade 5 (14.3%) and 177 students in grade 7 (15%).

Measures

DMAT Grade 5. The grade 5 DMAT consisted of two parts, each presented in a separate test booklet. The first part was made up of 30 multiple choice items. Each item consists of a stem followed by five possible responses. Each item was valued at one mark. The correct responses appeared randomly distributed among the five possibilities. The test booklets included a cover sheet where students were directed to identify themselves by name (last and first), grade/class, school, whether or not they were of aboriginal ancestry and whether or not they were enrolled in a Montessori program.
The cover sheet also included an instructions section that provided students with the directions required to complete the test.

A χ² goodness of fit test can be used to determine whether or not the distribution of correct responses is in fact random. There are 30 items and 5 response alternatives; therefore the most likely distribution of responses, if it were random, would be 6 correct responses for each of the 5 alternatives. For the grade 5 test, the observed distribution was: alternative A - 5, alternative B - 5, alternative C - 9, alternative D - 5 and alternative E - 6. The observed χ² value is 2. The critical χ² value at α = .05 is 9.488 for df = 4; therefore the distribution of keyed responses does not differ significantly from a random distribution.

The second part of the grade 5 DMAT was made up of six items designed to be answered directly in the test booklet. These six items were in some cases subdivided. The arrangement of the items and the distribution of marks was as follows: item 31 is divided into sections A and B, each worth 1 mark; item 32 is divided into sections A and B, each worth 1 mark; item 33 is worth 5 marks; item 34 is divided into sections A, worth 4 marks, and B and C, each worth 1 mark; item 35 is worth 2 marks; and item 36 is divided into sections A and B, each worth 1 mark. The Short Answer part of the grade 5 test was, in total, worth 19 marks. Because this part of the test was contained within a separate booklet, there was, again, a cover sheet used to record student information (the same information as for the multiple choice section). As for the multiple choice section, there was an instruction section which provided students directions for the completion of this part of the test.

A copy of the grade 5 DMAT booklets and the answer key booklet is included in Appendix C. These documents are included only for the defence of this thesis but will not be published, so as to maintain the security of the item bank.

DMAT Grade 7. The grade 7 DMAT consisted of two parts, each presented in its own test booklet. The first part was made up of 25 multiple choice items, each consisting of a stem with five possible responses. Each item was valued at one mark. The test booklets included a cover sheet where students were directed to identify themselves by name (last and first), grade/class, school, whether or not they were of aboriginal ancestry and whether or not they were enrolled in a Montessori program. It also included an instructions section which provided students with the directions required to complete the test.

A χ² goodness of fit test was again used to determine whether or not the distribution of correct responses is likely random. There were 25 items for the grade 7 test and 5 response alternatives. As with the grade 5 test, the most likely random distribution would result when there are 5 correct responses for each of the 5 alternatives. The expected frequency for each alternative is therefore 5. For the grade 7 test, the observed distribution was: alternative A - 5, alternative B - 5, alternative C - 9, alternative D - 6 and alternative E - 0. The observed χ² value is 8.4. The critical χ² value at α = .05 is 9.488 for df = 4. We can therefore conclude that the distribution of correct responses does not differ significantly from random.
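The goodness-of-fit check on the answer key can be reproduced with a few lines of code. The sketch below uses the grade 5 counts reported above and assumes the SciPy library is available; it is an illustration of the calculation, not the procedure actually used for the tests.

```python
from scipy.stats import chisquare, chi2

# Observed counts of keyed correct responses across the five alternatives (grade 5 test).
observed = [5, 5, 9, 5, 6]          # alternatives A, B, C, D, E
expected = [6, 6, 6, 6, 6]          # 30 items spread evenly over 5 alternatives

stat, p_value = chisquare(observed, expected)
critical = chi2.ppf(0.95, df=4)

print(round(stat, 2))               # 2.0, the value reported above
print(round(critical, 3))           # 9.488
print(stat < critical)              # True: no evidence the key departs from randomness
```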
The second part of the grade 7 DMAT was made up of ten items designed to be answered directly in the test booklet. These ten items were in some cases subdivided. The arrangement of the items and the distribution of marks was as follows: item 26 was worth 4 marks; item 27 was divided into sections A, worth 1 mark, and B, worth 2 marks; item 28 was worth 2 marks; item 29 was divided into sections A, B and C, each worth 1 mark; item 30 was worth 2 marks; item 31 was worth 1 mark; item 32 was worth 2 marks; item 33 was also worth 2 marks; item 34 was divided into sections A and B, each worth 1 mark; and item 35 was divided into sections A and B, each worth 1 mark. In total, the Short Answer test was out of 23 marks. As was the case with the grade 5 Short Answer test, the grade 7 test booklet had a cover sheet used to record student information (the same information as for the multiple choice section) and it provided students directions for the completion of the test.

A copy of the grade 7 DMAT booklets and the answer key booklet is included in Appendix C. As for the grade 5 DMAT test booklets and the answer key, these documents are included for the defence of this thesis but will not be published, so as to maintain the security of the item bank.

FSA - Foundation Skills Assessment. The FSA - Numeracy for 2000 consisted of four parts. The first and third parts were multiple choice items made of a stem with four possible responses. The second and fourth parts were made up of written response items where students were asked to record their answers directly into the test booklets. The correct responses for the multiple choice items were distributed with the following frequencies: Part A (items 1-16) alternative A - 4 times, alternative B - 4 times, alternative C - 5 times and alternative D - 3 times; Part C (items 19-34) alternative A - 3 times, alternative B - 4 times, alternative C - 6 times and alternative D - 3 times. Parts B and D were each made up of two items and each item was worth 4 marks.

Term and Year End Grades. The data required to test for evidence of concurrent validity were student marks for the year 1999-2000. It was felt that although final grades could present an accurate assessment of student achievement, term marks would also be gathered to provide additional information as to the relation of DMAT results to student marks. This meant, however, that because term marks for the year in question were stored in hard copy form only (on student report cards), each student's file had to be reviewed manually. The marks that were of particular interest in this study were the mathematics marks. No other marks were recorded.

Some principals submitted only the year end marks for students. In these cases I decided against pursuing principals for the term marks as well. As a consequence, for 7.9% of the grade 5s and 3.6% of the grade 7s, only final marks were collected. Table 23 in Appendix D includes numbers and percentages for each of the categories of the collection of term and/or final marks. For students who wrote the grade 5 DMAT, if the scores they received from the FSA they wrote in grade 4 were in their school files, I was able to record them; however, Ministry of Education officials had directed school principals to send the reports home to the students' parents. In some schools, copies were kept; in others, they were not. As a result, data on FSA results were available for only 47.3% of the students.

Procedures

Test Construction Process. The development of the DMATs originated in the early 1990s with the members of the school district Mathematics Committee (SDMC).
With the help of Assistant Superintendent Bendina Miller, Director of Instruction Norm Munroe and with technical assistance from Iris McIntyre, the SDMC designed a three part achievement test to be used to assess how well grade 6 students were meeting the learning objectives of the grade 6 mathematics curriculum. This test was piloted in the spring of 1995. The development of tests for grades 4 and 8 followed, and these were modelled after the already developed grade 6 test. Teachers with knowledge of the mathematics curriculum in each of the grades targeted were recruited and asked to review banks of mathematical questions and problems and to develop the desired tests. The SDMC members then reviewed these tests, made minor revisions and approved them for use.

The original tests consisted of three parts. The first part was made up of multiple choice items, the second part was made of short answer items and the third part was a performance assessment made up of three or four mathematics problems. The Performance Assessment part of the test was administered, one-on-one, to randomly selected students. Students, in this part of the test, were asked to talk about how they would solve a given mathematical problem. They were given tools and/or mathematics manipulatives which they could use to solve the problem. As they worked on the problem, the examiner asked them questions and/or encouraged them to talk about what they were thinking. A scoring rubric was established by the SDMC members to be used to rate responses.

It was originally intended that the Multiple Choice and Short Answer parts of the test would be administered to 20% of the students at each grade level. The schools in which the tests were to be administered were selected randomly but included a mix of large and small schools, inner city and community based schools, as well as rural and urban schools. It was felt that this would give a representative sample from which to assess overall progress in implementing the mathematics curriculum. The series of tests was originally designed to be administered every other year rather than annually.

The DMATs for grades 5, 7 and 9 were first administered in the fall of 1996. Each of the years that the tests were administered (1996, 1998, 1999, and 2000), the SDMC members met before the tests were sent to the schools to review the items and make any changes that were deemed necessary. The SDMC members met again, after the tests were administered and marked, to analyse the results, make recommendations about the implementation of the mathematics curriculum to teachers, school and board office administrators, and make recommendations about any changes in the tests that they thought would be necessary for subsequent years. One of the primary objectives of the SDMC was to analyse the test results over time to see what trends could be observed. That, in part, was why grades 4, 6 and 8 were chosen and why a two year cycle of testing was selected. To maximize continuity, SDMC members attempted to keep to a minimum the changes made to the test items. Some changes did occur.
Notably, the Performance Assessment part of the test was discontinued after 1998; the test was administered yearly rather than every two years beginning in 1998; all students were tested in 2000 rather than a representative sample; the whole of the grade 9 test was discontinued after 1998; the wording and the distribution of marks for some of the grade 5 short answer items were changed after 1996; and two of the grade 7 multiple choice items were replaced after 1996 by what were felt to be more suitable items. Over time, priorities within the school district have changed. In the fall of 2002 the decision was made to discontinue administering the DMATs. At the time of writing, these tests remain an assessment tool that, although not in use currently, could be brought out of storage and once again put into use.

Dichotomous and Multipoint Item Scoring Procedures. For each test, the multiple choice sections were machine marked. The short answer sections of the tests required teams of markers. Each year that the tests were administered, district teachers were recruited to mark the short answer sections. Each time, the teachers initially reviewed the answer keys and marking guides to maximize uniformity of marking. They reviewed the recommendations of previous marking teams. Then a few tests were marked together, again to ensure uniformity. Then test booklets were distributed and each marker started to mark the test booklets. Student scores were recorded on the bubble sheets. The marking teams compared all problem papers (where responses were unusual or difficult to assess because of an unusual approach). At the end of the marking sessions the marking teams drafted notes to the mathematics committee and recommendations to the next set of markers.

For the year 2000 tests, reliability between the markers was tested by having them mark the same tests for twenty students. The twenty tests were randomly selected over the space of two days; ten tests were selected on each of the two days. The data were tabulated and the degree of agreement between the markers was measured. A summary of the correlations between markers is shown in Table 14 in chapter 4.
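How agreement between markers of this kind might be summarized can be sketched as a Pearson correlation between the scores two markers award to the same booklets. The twenty score pairs below are invented placeholders for illustration only; the actual inter-rater results appear in Table 14.

```python
import numpy as np

# Hypothetical short-answer totals awarded by two markers to the same 20 booklets.
marker_1 = np.array([12, 15,  9, 18, 11, 14,  7, 16, 13, 10,
                     17,  8, 12, 19, 11, 15,  9, 14, 13, 16])
marker_2 = np.array([11, 15, 10, 18, 12, 13,  7, 17, 13, 10,
                     16,  8, 12, 19, 10, 15,  9, 15, 13, 16])

# Pearson product-moment correlation between the two sets of ratings.
r = np.corrcoef(marker_1, marker_2)[0, 1]
print(round(r, 3))
```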
Data Collection. The data that were used in this study consisted of student scores on the fall 2000 DMATs, term and final grades for the school year 1999-2000 for students who wrote the fall 2000 DMATs, and FSA scores for spring 2000 for grade 5 students who wrote the fall 2000 DMATs. Student scores on the DMATs were available through a central database located at the district school board office. The information for each student included: a twenty-one digit identifier number, first and last name, grade, gender, whether the student was of aboriginal ancestry or not, the school attended, whether the student was enrolled in a regular school program, a French immersion program or a Montessori program, choices for all items in part 1, scores on each item for part 2, a Multiple Choice score, a Short Answer score and the date on which data were recorded.

Principals in district elementary and high schools were contacted (see Appendix D for a copy of the letters) and term marks as well as final grades were requested for all those students who wrote the DMATs in the year 2000. To facilitate the collection of data, lists including the names of students who wrote the tests were sent out to the schools where the students had last attended. In some schools the information was collected by school staff and returned to me. Most school principals invited me to come to the school to review school files and copy the required data. Not all the names of students were identifiable. Some students had moved, many to other schools in the district, but there were many also who had moved out of the district. I did not attempt to follow up on students who had moved to another district. Some students had incorrect or incomplete coding for the school code and were consequently not listed with any particular school. For these students, once they were identified, school lists were revised and they were then included. Some students could not be identified or located, and information about their term and final marks could not be retrieved.

Final grades for students who wrote the DMATs were available through the schools on a school based database; however, term grades for the students required a review of each student's school file. Only the mathematics grades for the school year 1999-2000 were used. The information was taken directly from the student's report card. The grades were recorded as letter grades including: A, B, C+, C, C-, I and F. Where students were working with individual education plans, the grades were recorded as IEP. Most schools had grades for three terms plus a final grade. Two schools in School District No. 57, Highland Traditional School and Central Fort George, use a report card that records grades for five terms plus a final grade. For consistency, when I analysed the data, I converted the five grades for students at these schools to three grades by averaging the first and second grades for a first term grade, averaging the third and fourth grades for a second term grade and retaining the fifth grade for the third term grade. The final grades were left unchanged. For the overall analysis I converted the grades to numbers as follows: A - 7, B - 6, C+ - 5, C - 4, C- - 3, IEP - 2, I - 1, and F - 0. Missing data were coded as 9 and were excluded from any calculations that they would otherwise skew.

For the FSA scores for students who wrote the grade 5 DMAT, I recorded the scores students received on the Numeracy part of the FSA they wrote in 2000. Results were recorded on a scale that included Not Yet Meeting Expectations, Meeting Expectations, Exceeding Expectations and measures between each of these categories. For the purposes of this analysis the FSA scores were located on a five point scale, with Not Yet Meeting Expectations measuring 1, Meeting Expectations measuring 3, and Exceeding Expectations measuring 5. The points between these values were assigned the measures 2 and 4 respectively. The raw scores from these tests were not available to me - only the information from the final reports.
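The recodings described in this section are simple mappings; a minimal sketch is shown below, with the function and variable names chosen here for illustration rather than taken from the study itself.

```python
# Numeric coding of report-card letter grades, as described above.
GRADE_CODES = {"A": 7, "B": 6, "C+": 5, "C": 4, "C-": 3, "IEP": 2, "I": 1, "F": 0}
MISSING = 9   # missing grades were coded 9 and excluded from calculations

def code_grade(letter):
    return GRADE_CODES.get(letter, MISSING)

# Five-term report cards collapsed to three terms by averaging pairs of terms;
# the fifth term is kept as the third term grade.
def collapse_five_terms(t1, t2, t3, t4, t5):
    return [(t1 + t2) / 2, (t3 + t4) / 2, t5]

# FSA numeracy results placed on a five point scale.
FSA_CODES = {"Not Yet Meeting Expectations": 1, "Meeting Expectations": 3,
             "Exceeding Expectations": 5}      # in-between categories coded 2 and 4

print(code_grade("C+"), collapse_five_terms(7, 6, 6, 5, 6))
```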
Data Analysis. The analysis of these data involved the use of the following computer programs: Iteman version 3.05; TestGraf (Department of Psychology, McGill University, Montreal, Quebec); Bigsteps, a Rasch-model computer program constructed by John M. Linacre and Benjamin D. Wright and available through MESA Press; Ascal (Assessment Systems Corporation, 1984), part of the MicroCat Testing System; and SPSS version 11.0. These programs were used to assist in the analysis of test items and to calculate summary statistics and reliability indices.

Computer Programs Used

Iteman. Iteman provides a classical analysis of the student response data. It calculates endorsement rates as proportions (or percentages) and calculates item-total correlations for each response in order to determine the degree to which each item contributes to the reliability of the test. In a similar way, it calculates proportions for the alternate responses to determine if these options are functioning as intended. Iteman determines the degree to which student responses accurately reflect ability in two ways. It establishes the discrimination index "D" by calculating the proportion correct in the upper and lower (27%) ability groups and comparing these values. A second measure of discrimination is a choice of item-total correlations: a point biserial correlation or a biserial correlation if the data are dichotomous. Iteman also calculates statistics for the test as a whole. These include: frequency, mean, variance, standard deviation, skew, kurtosis, reliability, and median p-value.

Iteman can compute two types of item-total correlations: a point biserial correlation and a biserial correlation. The correlation chosen for the analysis of each DMAT was the point biserial. For Iteman this is a Pearson product-moment correlation between the item scores and the number-correct (total) for the test. The point biserial correlation is calculated for each alternative. This provides an opportunity to examine each alternative and assess how well or poorly it is behaving.

Iteman calculates a discrimination index for each dichotomous test item (the multiple choice items). The index is the difference between the proportion correct in the high ability group and the low ability group. These values can range from -1.0 to 1.0. It is indicated in the Iteman User's Manual that negative values and low values (less than 0.20) may indicate that the test item is flawed or performing poorly. Higher values, however, would indicate that the test item differentiates between high scoring and low scoring examinees. In general, a discrimination index of over .40 is considered great, between .30 and .40 is considered average, and although scores of between .20 and .30 can still represent an acceptable level of discrimination, scores of lower than .20 are marginal at best.
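The classical indices just described can be sketched directly from a response matrix. The function below is an illustration of the definitions given above (proportion correct, the upper-versus-lower 27% discrimination index, an item-total point biserial, and the category labels used in this study); it is not Iteman's own implementation.

```python
import numpy as np

def classical_item_stats(item_correct, total_scores, tail=0.27):
    """Proportion correct, discrimination index D and point-biserial correlation
    for one dichotomous item, given 0/1 item scores and total test scores."""
    item_correct = np.asarray(item_correct, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)

    p = item_correct.mean()                                # difficulty (proportion correct)

    n_tail = max(1, int(round(tail * len(total_scores))))  # size of the 27% groups
    order = np.argsort(total_scores)
    low, high = order[:n_tail], order[-n_tail:]
    d_index = item_correct[high].mean() - item_correct[low].mean()

    r_pb = np.corrcoef(item_correct, total_scores)[0, 1]   # item-total point biserial

    if d_index > 0.40:
        label = "great"
    elif d_index > 0.30:
        label = "average"
    elif d_index > 0.20:
        label = "acceptable"
    else:
        label = "marginal"
    return p, d_index, r_pb, label
```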
An example of the output generated by Iteman is presented in Figure 1.

Figure 1. Sample output from Iteman for the multiple choice items in the grade 5 DMAT. [The tabular output for items 25 and 26 is not reproduced legibly in the scanned source.]

The output shown in Figure 1 is taken from the Iteman analysis of the grade 5 test, items 25 and 26. In this analysis, the data were divided into two parts: the dichotomous items (Multiple Choice) and the multipoint items (Short Answer). The program assigned each item an identifier as noted in the column entitled "Scale-Item". In the above, item 26 was assigned 1-26. The correct response for this item is noted by the asterisk located in the column "Key". For item 26, the fifth response is the correct response. The three columns under "Item Statistics" show the degree of difficulty, "Proportion Correct", which for this item is .43, and two measures of the level of discrimination under the headings "Discrimination Index", equal to .59, and "Point Biserial", equal to .49. In Figure 1, it can be noted that because the discrimination index is defined as the difference between the proportion of high scoring examinees who chose the correct alternative and the proportion of low scoring examinees who did so, for item 26 this difference, .73 - .14, is .59 as indicated above.

Under the heading "Alternative Statistics", there are six columns. They show: the numbers of the different responses listed under "Alternatives", the correct response noted under "Key", the proportions of students choosing the different alternatives listed under the headings "Total", "Low" and "High", and the correlation between the selected alternative and the total test scores given in the column "Point Biserial". In Iteman, the category "Low" is defined as the 27% lowest scoring examinees and the category "High" is defined as the 27% highest scoring examinees. For item 26, we see that 43% of all examinees chose the correct response, alternative 5 (this information also appears in the "Item Statistics" section), 14% of the group categorized as "Low" chose the correct alternative and 73% of the group categorized as "High" chose the correct alternative.

The point biserial correlation appears in both the "Alternative Statistics" section and the "Item Statistics" section. In Figure 1, the point biserial values for item 26 are negative for all distractor responses and positive for the correct response. This shows that proportionally more low scoring examinees than high scoring examinees chose the alternative responses, while proportionally more high scoring examinees than low scoring examinees chose the correct response. This is how we want the responses to behave. The label "Other" in the column "Alternatives" includes data not appearing in the five alternatives listed on the test. In most instances this would be students who selected two or more alternatives for one item.

Figure 2. Sample output from Iteman for the short answer items on the grade 5 DMAT. [The tabular output for items 31 through 34 is not reproduced legibly in the scanned source.]

The output shown in Figure 2 is taken from the Iteman analysis of the grade 5 test, items 31, 32, 33 and a part of 34. As mentioned earlier, the data in this analysis were divided into two parts: the dichotomous items (Multiple Choice) and the multipoint items (Short Answer). The program assigned each item an identifier as noted in the column entitled "Scale-Item". In Figure 2, item 31 is identified as 2-1 because it is the first item in the second part, the multipoint section. The column "Item Mean" gives the average response. For this analysis, if an examinee was missing data for an item, that examinee was excluded from the calculation of the item mean. For item 31, the average response is 1.080, or very nearly 1. These data come from the 536 examinees for whom data were included.
This calculation is, however, somewhat confusing because, when the answer sheets were completed, the raters used alternative 1 to represent an incorrect attempt at solving the problem with a score of 0, alternative 2 to represent a score of 1 and alternative 3 to represent a score of 2. The answer sheets were left blank whenever an examinee did not even attempt to solve the problem. Therefore, for this item an "Item Mean" of 1.080 for 536 examinees means that relatively few examinees tried the item and, of those who did, most (92%) got this item completely wrong, a few (8%) got it partially correct and no examinees got this item completely correct. Iteman provides this information in the section labeled "Alternative Statistics". The "Other" category under "Alternative Statistics" shows the number of examinees not included in categories 1, 2 or 3. This would include all those students who did not even attempt the item (761 examinees). The + under the column "Key" shows that the scores are listed in ascending order. Because of the way the items were scored, the actual mean would be 1 point less than is noted in the column "Item Mean"; the "Item Variance" would remain unchanged, as would the "Item-Scale Correlation". The other two categories of information under the heading "Item Statistics" are also calculated based on the examinees who provided data. The category "Item Variance" contains the variance of the responses to the item, and "Item-Scale Correlation" contains a Pearson correlation between responses to the item and mean scores for the examinees.

TestGraf. This program, developed by J. O. Ramsay of McGill University, was designed to provide information in graphical form about questionnaires and conventional exams (multiple choice and short answer test items). TestGraf makes use of statistical methods to produce estimates of examinees' responses.

Figure 3. Sample output of TestGraf for item 26 of the grade 5 DMAT. [Plot of response probability against expected score, with percentile markers at 5%, 25%, 50%, 75% and 95%; not reproduced legibly in the scanned source.]

TestGraf was used to produce response curves (ICCs) for each test item by plotting the probability of response for each response option along a range of expected scores (measured in whole test scores and percentile ranking). This results in response curves as shown in Figure 3. Figure 3 shows a typical TestGraf output for a test item, in this case item 26 of the grade 5 test. The correct response, response 5, is favored by more proficient students and, for students scoring in the fifth percentile and higher, has an overall slope from the lower left corner of the graph (lowest test scores) to the upper right corner (highest test scores). Alternate responses, on the other hand, are more favored by less proficient students and exhibit an overall slope from a high point on the left to a lower point on the right. In this sample we see alternate response 4 following just this pattern. Alternate response 2 acts as a strong distractor even among low scoring examinees and is preferred up to the fortieth percentile. Alternate responses 1 and 3 are chosen by few of the students at the lowest scoring level but gain preference quickly, and by the fifth percentile they are also preferred responses, until about the fortieth percentile where the correct response becomes the preferred response.
In TestGraf, the measure of difficulty is defined as that point along the expected scores at which .50 of the examinees with that score are expected to get the item correct. In this example, about .50 of the examinees at the sixty-fifth percentile are expected to get the item correct. The level of difficulty is measured as .65 (an expected score of 23) and this test item could be considered of moderate difficulty. In TestGraf, the slope of the response curve at the point of .50 probability gives an indication of the degree of discrimination exhibited by the item. A steep slope would suggest a high degree of discrimination. A shallow slope would suggest a low degree of discrimination. In Figure 3, the steep slope of the curve at the point where the probability is .50 shows that this item has a relatively high degree of discrimination. In fact, for this item, the slope of the curve is uniformly steep through a wide range of expected scores. This indicates that this item has a high degree of discrimination through a wide range of examinee proficiency.

TestGraf provides additional information by showing how a test item may be behaving at particular points of the response curves. One should be careful, however, at the extreme ranges of the data, because the numbers of examinees providing information for creating the ICC are, at these points, small. For comparison purposes, the item shown in Figure 3 is the same item used to examine the Iteman output in Figure 1. From Figure 1 we found that item 26 is a moderately difficult item with good discrimination. The difficulty (p), the discrimination index (D) and the point biserial values are: p = .43, D = .59, r_pb (for the correct response, 5) = .49, and r_pb = -.14, -.26, -.13 and -.03 for alternative responses 1, 2, 3 and 4 respectively.

Bigsteps. The Bigsteps program is a Rasch-model computer program constructed by John M. Linacre and Benjamin D. Wright and available through MESA Press. The version used for this analysis was 2.61. Linacre and Wright (1996), in the user's guide to the program, indicate that the program is designed to provide an analysis that balances statistically the effects of item difficulty and person ability. In so doing it provides another means of examining the test items. The Bigsteps procedure uses PROX (normal approximation) and UCON (unconditional maximum likelihood, joint maximum likelihood) estimation methods to obtain progressively closer approximations of the test difficulty/ability regression curve (Linacre & Wright, 1996). For this analysis Bigsteps was programmed to begin with a central estimate for each person measure, item calibration and rating scale category step calibration. A rough convergence to the observed data pattern was obtained by several iterations of the PROX algorithm. The UCON algorithm was then used to establish more exact estimates, standard errors and fit statistics. The UCON method that was used involved progressive proportional curve fitting to find improved estimates. The measures are reported in logits (log-odds units) and the fit statistics, Infit and Outfit, are reported as mean-square residuals (these have approximate chi-square distributions). These mean-square residuals are normalized through a cube root function to provide a t-statistic for assessing the probability of a response.
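The alternating person/item estimation that Bigsteps performs can be caricatured with a much simpler routine. The sketch below is not the PROX/UCON procedure; it is a stripped-down joint estimation loop for the Rasch model, with the step size, iteration count and the unweighted mean-square fit summary chosen here purely for illustration.

```python
import numpy as np

def rasch_joint_estimates(X, n_iter=50, step=0.5):
    """Very simplified joint estimation for the Rasch model.

    X is an examinees-by-items 0/1 response matrix.  Person measures (theta)
    and item difficulties (b) are nudged in turn by scaled gradient steps on
    the log-likelihood, and b is re-centred at zero each cycle, as is common
    Rasch practice.  Illustration only, not the Bigsteps algorithm.
    """
    n_persons, n_items = X.shape
    theta = np.zeros(n_persons)
    b = np.zeros(n_items)
    for _ in range(n_iter):
        P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))   # model probabilities
        resid = X - P                                          # observed minus expected
        theta += step * resid.sum(axis=1) / n_items            # ability update
        b     -= step * resid.sum(axis=0) / n_persons          # difficulty update
        b     -= b.mean()                                      # centre difficulties at zero
    return theta, b

def outfit_mean_square(X, theta, b):
    """Unweighted (outfit-style) mean-square residual for each item."""
    P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    z2 = (X - P) ** 2 / (P * (1 - P))
    return z2.mean(axis=0)
```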
Ascal - two parameter model. Ascal is one of the analysis programs available through Assessment Systems Corporation (1984) and is part of the MicroCat Testing System. The program used for this analysis was Ascal version 3.20. The authors of the User's Manual indicate that Ascal is an item response theory calibration program which uses examinee responses to provide estimations of up to three test parameters: discrimination, difficulty and lower asymptote (pseudo-guessing). The estimation procedure involves dividing the data into 20 categories, called fractiles. A curve approximating a normal distribution is used for the initial estimation. Each item's lack of fit to the model is established using a chi-square statistic. The program repeats calculations through a series of iterations to generate a curve that progressively approximates the distribution of responses - the ICC. In the two parameter model, the item characteristic curve is used to estimate discrimination and difficulty. The lower asymptote (pseudo-guessing) parameter is eliminated by setting the number of response alternatives to zero. This program is limited to the analysis of dichotomous items only.

Ascal - three parameter model. Ascal version 3.20 was also the program used to analyze the three parameters: discrimination, difficulty and lower asymptote (pseudo-guessing). The authors of the User's Manual indicate that the procedure used in this model is the same as was noted for the two parameter model, except that a third parameter, the lower asymptote, is now included in the estimations. For this program, the initial estimate for the lower asymptote parameter is the reciprocal of the number of alternate responses for the items. Because there were five choices for each of the multiple choice items, the initial value was set at 0.200. This program, as with the two parameter model, is limited to the analysis of dichotomous items only.

SPSS. The SPSS program that was used in this analysis was SPSS version 11.0. It was used to calculate summary statistics on all data related to term and final grades. This included frequencies, t-tests and all correlations.

CHAPTER 4 - RESULTS

Analysis of Test Items

Methods of Analysis. The initial analysis conducted was a classical analysis using the program Iteman. The next analysis used TestGraf to provide an assumption free analysis most closely associated with IRT. TestGraf used the test data to produce ICCs for each test item. This information was used to supplement the findings of Iteman and to reveal some of the properties of the test items for specific groups of students, often those at the more extreme ranges such as the lowest and the highest scoring. Three programs were then used to provide logistic item response analysis. Bigsteps was used to gain additional information about the difficulty and discrimination of the dichotomous and short answer items. This program provided a one parameter, or Rasch, analysis. A two parameter and a three parameter analysis were obtained using the program Ascal. Each model was used to provide additional information about the test items through estimations of difficulty, discrimination and guessing when comparing test items. In the following sections I'll describe the information about the test items that is gained from each.

Grade 5 Results

Classical Analysis Using Iteman. In Table 1, the values calculated by Iteman for item difficulty, the item discrimination index and point-biserial correlations for each of the multiple choice test items are shown.
Values printed in bold print are those of the keyed correct responses. A high p-value (near 1) for the keyed correct response indicates a relatively easy item. Similarly, a low p-value (near 0) for the keyed correct response indicates that the item was relatively difficult.

Table 1
Classical Analysis of Grade 5 DMAT Items Using Iteman, Measures of Difficulty and Discrimination
[The table reports, for each of the 30 multiple choice items, the discrimination index D and, for each of the five response alternatives and the "Other" category, the proportion of examinees selecting it (p) and its point-biserial correlation (r_pb). The individual entries are not legible in the scanned source.]

High values for the alternate responses, depending on the degree of discrimination, could indicate that the response acts as an excellent distractor or that the item is flawed. Consider item 15 in Table 1. The correct response is the fourth response (values are shown in bold print). Because the proportion correct is .37, this is tied with item 21 as the most difficult item on the test. Response 2 is the preferred distractor, with .31 of the students selecting it. Response 1 also appears popular, with .19 of all students selecting it. Responses 3 and 5 had respectively .04 and .09 of the students select them. The category labeled "Other Responses", which for this item is .01 of the students, would include those students who selected more than one response for an item.

Consider, for comparison, item 2 in Table 1. The correct response is the third response. This time the proportion correct is .95. The item can be considered a very easy item, with almost all the students choosing this response. Responses 1, 2, 4, and 5 were respectively chosen by .01, .01, .02 and .01 of the students. For this item, there were no students included in the "Other Responses" category.
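A difficulty ordering of the kind given next can be read directly off the p-values; a minimal sketch follows, using only the handful of p-values quoted in the surrounding discussion rather than the full Table 1.

```python
# Ordering items from most to least difficult by proportion correct (p-value).
# Only a few of the p-values quoted in the text are included here.
p_values = {2: 0.95, 15: 0.37, 21: 0.37, 23: 0.89, 26: 0.43}

order = sorted(p_values, key=p_values.get)   # ascending p = most difficult first
print(order)                                 # [15, 21, 26, 23, 2]
```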
In going from most difficult to least difficult, the items would be placed in the following order: 15, 21, 1, 26, 30, 19, 13, 11, 14, 25, 16, 18, 8, 28, 22, 29, 3, 27, 17, 9, 4, 5, 6, 24, 7, 12, 10, 20, 23, and 2.

Iteman calculates endorsement rates for each response for three different groupings of the students: total score, low score and high score. This information helps in analyzing item performance. The data in Table 1 indicate that no item is so difficult and exhibits so little discrimination that it appears students are only guessing at it. Guessing would be suggested where students selected two or more responses equally and where there was a lack of discrimination, with high ability as well as low ability students selecting the possible responses. An examination of the data, especially for the most difficult items, shows that this is not the case.

Table 1 also includes the values for the discrimination index and the point-biserial correlations for each response as calculated by Iteman. A positive value for the point-biserial indicates that low scoring students chose the response less frequently than high scoring students. This is a desirable feature for all the correct responses. A negative value indicates exactly the opposite: low scoring students chose the response more frequently than high scoring students. This is what good distractors should do, and so a negative point-biserial value is desired for all alternate responses. Small values (discrimination index and point-biserial) for an item show that the item discriminates poorly between high and low scoring examinees, while large values show a good discrimination between the two groups. When we use the four categories of discrimination discussed in chapter 3, the items can be categorized as follows:

• Great discrimination (D > .40): 3, 4, 5, 8, 9, 11, 13, 14, 16, 17, 18, 19, 21, 25, 26, 27, 28, 29 and 30,
• Average discrimination (.30 < D ≤ .40): 1, 6, 7, 15, 22 and 24,
• Acceptable discrimination (.20 < D ≤ .30): 10, 12, 20 and 23, and
• Marginal discrimination (D ≤ .20): 2.

Looking closer at the items that display the least discrimination, we find that these five items are also the easiest items in Table 1. This means that a large proportion of the students (low scoring as well as high scoring) got these items correct, so a low level of discrimination can be expected. The proportions selecting the correct response for these items, comparing high scoring and low scoring students, are as follows:

• Item 2: low scoring is .89 and high scoring is .98,
• Item 10: low scoring is .75 and high scoring is .97,
• Item 12: low scoring is .75 and high scoring is .96,
• Item 20: low scoring is .77 and high scoring is .97, and
• Item 23: low scoring is .74 and high scoring is .96.

These scores show that even though the measure of discrimination overall is low, the test items still did what they were supposed to; that is, more high scoring students got the items correct and fewer low scoring students got them correct.

For a final comparison of values in Table 1, the discrimination index D and the point biserial correlations for the correct response are compared. For most items the two values are comparable. In some instances they are the same (this is coincidental). For items 2, 10, 12, 20, 23, 26 and 28 the difference between the two indices is .10 or more. An examination of the items shows that for items 2, 10, 12, 20, and 23 the value of D is quite low and the values of r_pb are higher.
These are easier items, and the difference between the proportion of high scoring examinees who got these items correct and the proportion of low scoring examinees who got these items correct is minimal. For items 26 and 28 the opposite is the case: the value of D is high and the values of r_pb are lower.

Assumption Free Analysis Using TestGraf. Figures 4 and 5 show sample output generated by the TestGraf program. They display data for items 2 and 15 respectively. In Figure 4 we see the response curves calculated for item 2. This item was identified in the previous section as the easiest item on the grade 5 test. It is included here to show the type of curve TestGraf generates for this type of item. For comparison purposes, the degree of difficulty (p), discrimination index (D) and point-biserial proportions from the Iteman analysis for this item are: p = .95, D = .09, r_pb (response 3, the correct response) = .26, and r_pb = -.10, -.13, -.11 and -.14 for responses 1, 2, 4 and 5 respectively.

The most prominent characteristic of the graph in Figure 4 is the response curve shown for response 3. This is the correct response, and it is preferred by over three quarters of the students by the fifth percentile. By the twenty-fifth percentile almost all the students selected this response. The curve is more or less flat beyond this point.

Figure 4. The ICC produced by TestGraf for item 2 of the grade 5 DMAT. [Plot of response probability against expected score; not reproduced legibly in the scanned source.]

Note that while the point biserial was low and the D index was approaching zero, the ICC produced by TestGraf showed a strong discrimination within the first quartile. Beyond the first quartile the ICC flattens out and discrimination is negligible. This response curve is representative of the type of curve TestGraf produced for the easier items on the grade 5 test. In particular, the response curves for items 10, 12, 20, 23 and 24 are very similar to what is shown in Figure 4.

Figure 5. The ICC produced by TestGraf for item 15 of the grade 5 DMAT. [Plot of response probability against expected score; not reproduced legibly in the scanned source.]

Summary of Grade 5 Item Response Analysis

Items 31 and 32 were found to be moderately difficult, with item 31 following item 30 (compared with the order taken from Iteman) and item 32 following item 9 (compared again with the order taken from Iteman). A measure of the difficulty of these two items is found in the numbers of examinees who attempted them. Of the 1297 examinees, there were 536 examinees who attempted item 31 and there were 616 who attempted item 32.

Item discrimination, between the item analysis models, was not quite as clear. Where item 2 was judged to have the lowest discrimination when using classical analysis, in the Rasch analysis it was judged second lowest (behind item 15), in the 2 parameter logistic model it was judged twenty-second lowest and in the 3 parameter logistic model it was the tenth lowest. In classical analysis, item 28 was judged to have the greatest discrimination. Considering only the Multiple Choice items, it also had the greatest discrimination in the Rasch analysis and the 2 parameter logistic model. In the 3 parameter logistic model, it had the second greatest discrimination, following item 14.
With the Rasch analysis we found that the level of discrimination for the six short answer items placed them within the half of the items with the greatest discrimination (34 - first, 33 - second, 31 - sixth, 35 - ninth, 36 - fifteenth and 32 - sixteenth). Table 24, comparing the order of difficulty and discrimination between these programs, is included in Appendix E. In general, the point biserial correlations for alternative responses for each item in the classical analysis agreed quite closely with the response curves for the alternative responses shown in TestGraf.

Several items have been flagged as lacking fit in the Rasch analysis (using Bigsteps) and the logistic model analyses (using two parameter and three parameter Ascal) of the grade 5 DMAT data. In Table 5 a summary of the items that lacked fit is shown. The items in bold print are found in all three analyses.

Table 5
Items in the Grade 5 DMAT Exhibiting Lack of Fit in the Logistic Item Response Analyses

Analysis: Items lacking fit
Bigsteps: 2 (.23), 23 (.33), 10 (.34), 20 (.34), 7 (.51), 24 (.52), 5 (.58), 6 (.58)
Ascal, 2 parameter: 1, 23, 30, 8, 15, 2, 7
Ascal, 3 parameter: 23, 14, 2

Note. The items in bold print (2 and 23) are identified as lacking fit in each of the analyses.

We find that items 2 and 23 are identified as lacking fit in all three item response models. Using values found in Table 1, we see that item 2 has p = .95, D = .09 and r_pb = .26; item 23 has p = .89, D = .22 and r_pb = .37. Both items appear to be easy, with a lack of discrimination; however, in TestGraf we find that alternative response 4 for item 23 acts as a great distractor among the low scoring group (p = .85 at the first and second percentile).

Item 7 is identified in both the Rasch analysis and the two parameter logistic model as lacking fit. Again using Table 1, we find that p = .80, D = .35 and r_pb = .38. Here again we find that the item appears easy, this time with moderate discrimination. The ICC displayed using TestGraf shows desirable characteristics for all responses.

In the Rasch analysis, items 10, 20, 24, 5, and 6 are also identified as lacking fit. Item 10 has p = .89, D = .22 and r_pb = .34. Item 20 has p = .89, D = .20 and r_pb = .32. Item 24 has p = .79, D = .34 and r_pb = .41. Item 5 has p = .74, D = .47 and r_pb = .47. Item 6 has p = .78, D = .33 and r_pb = .37. Each of these five items is easy, with the two easiest (10 and 20) exhibiting low discrimination and the three others exhibiting an average to strong discrimination. In the ICCs displayed using TestGraf, item 5 shows a slight negative correlation for the correct response prior to the fifth percentile, with a strong positive correlation thereafter. The other items display response curves as expected.

For the two parameter logistic model, items 1, 30, 8, and 15 show a possible lack of fit. Item 1 has p = .39, D = .37 and r_pb = .38. Item 8 has p = .56, D = .43 and r_pb = .33. Item 30 has p = .44, D = .43 and r_pb = .38. Item 15 has p = .37, D = .33 and r_pb = .29. Three of the four items (1, 30, and 15) appear to be difficult items with good or excellent discrimination. The fourth item is a moderately difficult item, again with excellent discrimination. In the ICCs displayed using TestGraf, we observe that for item 1 the response curves for the correct response and the alternatives show slight irregularities at the low score range (below the tenth percentile) and at the high score range beyond the ninety-fifth percentile. The response curve in item 30 for alternative response 5 shows a positive discrimination up to the fifteenth percentile and then shows the expected negative discrimination. The response curve for the correct response in item 8 shows a strong and unexpected negative discrimination for the highest scoring students.
The response curve in item 30 fo r altemaGve reqxtnse 5 shows a posiGve discriminaGon iq) to the Gfleenth percenGle and then shows an expected negaGve discriminaGon. The response curve fo r the correct response in item 8 shows a strong and 74 imexpected negative discrim ination ib r the highest scoring students. It q>pears that something in this item is causing the strongest students to choose alternative response 5. And hnally, the respmise curves in i t ^ 15 diows a strong distr^^tor in alternative response 2; however, a ll response curves appear to behave as they should. In most o f the above items it appears that lack o f fit may stem hom two sources: an easy item lacking in discrim ination or a difS culty item w ith strong distracters. Item 8 has a response curve that can be considered atypical because responses by the highest scoring examinees is contrary to what is expected. Grodk 7 C/rKsicnZvdnnZysw Using ftcTnan In Table 6, the values calculated by Iteman fa r the difB culty and the discrim ination index and point-biserial correlations fo r each o f the m ultiple choice test items are shown. Values printed in bold are those o f the keyed correct reqxmses. In the column displaying proportion selected, a high p-value (approaching 1) fo r the keyed correct response indicates a relatively easy item . In a sim ilar way, a low p-value (^proaching 0) fa r the keyed correct response indicates that the item was relatively difG cult. H igh values fo r the alternate responses, dq)ending on the degree o f discrim ination, could indicate either that the response acts as an excellent distracter or that the item is flawed. 75 Table 6 Classical Analysis o f Grade 7 D M AT Items Using Iteman, Measures o f D ifB culty and Discrim ination Responses 1Item 1 D 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 .43 .50 .38 .30 .20 .45 .31 .39 .35 .61 .34 .38 .18 .52 .20 .21 .37 .62 .46 .46 .40 .49 .31 .44 .36 1 1 1 P 1 r/,6 .11 .08 .02 -.19 -.15 -.!() .76 35 .01 .03 .01 .35 .26 .11 .32 .05 -.07 -.09 -.13 -.03 -.13 -.!() -.03 -.03 .76 .20 .07 .17 .46 -.1:1 -.11 -.04 .77 .41 .01 .06 .18 -.1C) -.24 -.24 .69 35 .12 .05 .21 -.26 -.12 -.17 35 36 2 P .08 .05 .04 .06 4 3 1 1 5 1 Other | 1 p 1 rpA 1 1 p -.19 -.19 -.17 -.15 .71 .60 .73 .41 .43 .40 -.25 -33 .88 30 .15 .05 .07 .08 .14 -.15 -.26 -.24 .70 .84 37 .41 .40 37 36 .40 37 31 .20 .06 .14 .14 -.19 -.17 -.11 -.10 .12 .17 .07 .29 .05 -.04 -.15 -.13 -.14 -.14 33 .45 .76 35 .05 -.17 .10 .07 .02 .11 .10 .16 .09 -.14 -.17 -.07 -.16 -.16 -.13 -.20 32 30 .75 34 .05 .34 -.19 -.08 .07 -.18 .61 35 .12 .16 .05 .04 .10 .10 .13 -.15 -.11 -.17 -.19 -.10 -.05 -.14 .04 .08 .19 .01 .05 .06 .05 .07 .19 .12 -.18 -.12 -.26 -.07 -.11 -.18 -.19 -.17 -.06 -.23 36 31 37 37 .03 .15 .01 .09 .02 .31 -.03 -.22 -.11 .04 -31 -.49 .67 39 .44 .43 .04 -.12 .72 .49 .04 -.17 30 .40 .06 - .iir .04 .17 .01 .01 .01 .12 .01 .04 .12 .18 .13 38 .01 .10 .01 .02 .05 .04 .03 .13 .05 .02 .05 .13 .07 -.13 -3 2 -.13 -.05 -.1:» -3 4 -.07 -.08 -.1C) -.18 -.05 -.10 -3)1 -.22 -.08 -.i:r -.16 -.05 -.18 -.01 ..IS ) -.18 -.15 -.1(5 -.03 .01 .02 .01 .01 .00 .01 .01 .04 .05 .03 .02 .01 .02 .02 .01 .02 .01 .02 .02 .03 .01 .01 .02 .02 .06 -.12 -.11 -.14 -.13 -.10 -.12 -.13 -.10 -.17 -.13 -.15 -.12 -.10 -.15 -.17 -.16 -.15 -.11 - .iir -.13 -.14 -.15 -3 0 -.16 -.16 Consider item 9 in Table 6. The correct response is the second. Because the proportion correct is .26, this is tied w ith item 11 as the most difG cult item on the test. 
Response 1 is the preferred distractor w ith .26 o f the examinees selecting it. Response 4 76 also appears popular w ith .19 o f a ll Ae examinees selecting it. Responses 3 and 5 each had .12 o f the examinees select them and the category labeled "O ther Responses", had .05 indicating that this p rc ^ rtio n o f students selected more than one response as their answer. This item warrants closer scrutiny because o f the high proportion o f students selecting each o f the responses. This item could be sufGciently difG cult that most students sim ply guess at an answer; however, the negative point biserial values fa r a ll distractoTS and a positive point biserial fo r the correct response indicate that this item is behaving as desired. Consider also item 11 in Table 6. The correct response is the fourth. It too has .26 o f the students who chose this fo r the correct response. For the alternate responses we Gnd: response 1 was the preferred response w ith .32 o f the students selecting it, response 2 had .20 o f the students selecting it, response 3 had .07 o f the students selecting it and response 5 had .13 o f the students selecting i t This tim e, the category labeled "O ther Responses", had .02 o f the students who selected more than one response as their answer. This item also behaves as desired as we shall see in the next section. Consider, fo r comparison, item 5 in Table 6. The correct response is the second. This tim e the proportion correct is .88. The item can be considered an easy item because o f the high proportion o f students choosing this response. Responses 1 ,3 ,4 , and 5 were respectively chosen by .01, .05, .05 and .01 o f the students. For this item , there were no students included in the "O ther Responses" category. In Table 6 we Gnd that the items when placed in order going Gom most difG cult to least difG cult are: 9 ,1 1 ,1 2 ,1 6 ,2 5 , 8 ,2 0 ,1 0 ,2 4 ,1 4 ,2 ,1 8 ,1 9 ,2 1 ,6 ,1 ,2 2 ,3 ,2 3 ,4 , 1 3 ,15 ,17,7 and 5. 77 In Table 6 the discrim ination index and the point-biserial calculation fo r each response are also displayed. Low values would show that the item discrhninates poorly between high and low scoring examinees. In contrast, high values w ould show that the item discriminates w ell between high scoring and low scoring students. Using the four categories h)r discrim ination discussed in chapter 3, the items can be categorized as follow s: # Great D isciim ination (D > .40)- 1 ,2 ,6 ,1 0 ,1 4 ,1 8 ,1 9 ,2 0 ,2 1 ,2 2 and 24 # Average D iscrim ination (.30 < D < .40)- 3 ,4 , 7, 8 ,9 ,1 1 ,1 2 ,1 7 ,2 3 and 25 # Acceptable D iscrim ination (.20 < D < .30)- 5,15, and 16 and # Marginal D iscrim ination (D <.20)- 13 Looking closer at the items that display the least discrim ination, we Gnd that a ll three o f the four items are listed among the easier items shown in Table 6. This means that a large proportion o f the students (low scoring as w ell as high scoring) are getting these items correct. A low level o f discrim ination can be expected fa r these items. A comparison o f the high scoring and low scoring students is as follow s: # # # # Item 5- low scoring is .76 and high scoring is .96, Item 13-low scoring is .67 and high scoring is .85, Item 15 -low scoring is .64 and high scoring is .85 and Item 16- low scoring is .24 and high scoring is .44. 
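As a minimal sketch (in Python, which was not among the programs used in this study), the discrimination index for these four items can be recovered directly from the high- and low-scoring proportions just listed and labelled with the four categories defined above. Rounding D to two decimals before categorizing is an assumption, made so that boundary values such as D = .20 fall in the intended category.

def discrimination_category(p_low, p_high):
    """Discrimination index D = p_high - p_low, with the category labels used in this chapter."""
    D = round(p_high - p_low, 2)
    if D > 0.40:
        category = "Great"
    elif D >= 0.30:
        category = "Average"
    elif D >= 0.20:
        category = "Acceptable"
    else:
        category = "Marginal"
    return D, category

# High- and low-scoring proportions correct quoted above for items 5, 13, 15 and 16.
for item, (p_low, p_high) in {5: (.76, .96), 13: (.67, .85),
                              15: (.64, .85), 16: (.24, .44)}.items():
    D, label = discrimination_category(p_low, p_high)
    print(f"Item {item}: D = {D:.2f} ({label})")

This reproduces the classification above: items 5, 15 and 16 fall in the acceptable band and item 13 in the marginal band.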
For items 5, 13 and 15, even though the measure of discrimination overall is low, the test items are still doing what they should; that is, more high scoring students got the items correct and fewer low scoring students got them correct. Item 16, however, even though it shows this same pattern and consequently may prove to be a well behaved item, has surprisingly low discrimination given that it can be considered a more difficult item. This item warrants additional consideration. In the next section, I'll be examining more closely items 9, 11 and 16 and what TestGraf shows about these items.

Figures 6, 7, 8 and 9 show the output generated by the TestGraf program for items 1, 9, 11 and 16 respectively. In Figure 6 we see the response curves calculated for item 1. I've included it for comparison purposes because all responses appear well behaved, it is moderately easy and it shows a high level of discrimination. More specifically, the degree of difficulty (p), discrimination index (D) and point-biserial correlations for this item are: p = .71, D = .43, r_pb (response 3) = .41, r_pb (response 1) = -.19, r_pb (response 2) = -.19, r_pb (response 4) = -.18 and r_pb (response 5) = -.13.

[Figure 6. The ICC produced by TestGraf for item 1 of the grade 7 DMAT. The plot shows response probability against expected score, with markers at the 5th, 25th, 50th, 75th and 95th percentiles.]

The response curve calculated for item 9 is displayed in Figure 7. This item has p = .26, D = .35 and r_pb (response 2) = .37. Although a difficult item, item 9 appears to behave just as it should. Up to approximately the seventy-fifth percentile, all alternate responses effectively distracted students; however, beyond this point the students chose the correct response. This item demonstrates a high level of discrimination and effectively shows which students are high scoring.

[Figure 7. The ICC produced by TestGraf for item 9 of the grade 7 DMAT.]

The response curve calculated for item 11 is displayed in Figure 8. This item has p = .26, D = .34 and r_pb (response 4) = .37. This item displays many of the characteristics of item 9 as displayed in Figure 7. It also appears to behave as it should. Up to approximately the eightieth percentile, alternate responses effectively distracted students; however, beyond this point the students chose the correct response. This item demonstrates a high level of discrimination and effectively shows which students are high scoring. This is the sort of item that would work well for selecting the highest scoring students.

[Figure 8. The ICC produced by TestGraf for item 11 of the grade 7 DMAT.]

The response curve calculated for item 16 is shown in Figure 9. This item has p = .32, D = .21 and r_pb (response 3) = .20. This is the only item on the grade 7 DMAT that has a distractor with a positive point biserial value: r_pb (response 4) = .04.
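TestGraf's curves are nonparametric estimates of the probability of choosing each response, plotted as a function of expected score. The Python sketch below illustrates that idea only; it is not Ramsay's (2000) TestGraf algorithm. The Gaussian kernel, the bandwidth and the use of raw total scores in place of smoothed expected scores are all assumptions.

import numpy as np

def option_curve(responses, totals, option, grid=None, bandwidth=2.0):
    """Kernel-smoothed estimate of P(examinee chooses `option`) as a function of total score.

    responses: 1-D array of the response chosen by each examinee on one item.
    totals:    1-D array of total test scores, used here as the ability proxy.
    """
    responses = np.asarray(responses)
    totals = np.asarray(totals, dtype=float)
    if grid is None:
        grid = np.linspace(totals.min(), totals.max(), 50)
    chose = (responses == option).astype(float)
    curve = np.empty_like(grid)
    for i, x in enumerate(grid):
        w = np.exp(-0.5 * ((totals - x) / bandwidth) ** 2)   # Gaussian kernel weights
        curve[i] = np.sum(w * chose) / np.sum(w)
    return grid, curve

# Fabricated illustration of an item-9-like pattern, where the keyed response
# (coded 2 here) is chosen mainly by high-scoring examinees.
rng = np.random.default_rng(1)
totals = rng.integers(5, 36, size=500)
chosen = np.where(rng.random(500) < (totals - 5) / 40, 2, 1)
grid, curve = option_curve(chosen, totals, option=2)

Plotting curve against grid would give a rising curve for the keyed response, much like the curves shown in Figures 7 and 8.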
Item 16 displays some of the characteristics of items 9 and 11, but it also appears to behave poorly for students above the ninety-fifth percentile. Up to approximately the eighty-fifth percentile, alternate responses effectively distracted students; beyond this point the students chose the correct response, except for students above the ninety-fifth percentile, where many were effectively distracted by response 1. The level of discrimination throughout the item is low and high scoring students are choosing the wrong response. This is not a desirable characteristic for a test item. The lack of discrimination could indicate that students are guessing at this item.

[Figure 9. The ICC produced by TestGraf for item 16 of the grade 7 DMAT.]

... for other items. In Table 9, we have at least one item that appears to have a chi-square value that is considerably larger than the others - item 16. Because of its apparent lack of fit, item 16 can be considered suspect. Items 9, 2, 10, 1 and 13 are less suspect, as they do not display the same dramatic difference between chi-square values as exists between items 16 and 9, but they warrant a careful look because of the high chi-square values they have. All other items appear to fit this model.

A summary of the output for the 3 parameter model of Ascal is also included in Table 9. The program was again asked to run a maximum of twenty iterations. This time it ran eleven iterations and stopped when the maximum parameter change was 0.00322. These items, when ordered by degree of difficulty going from most difficult to least difficult, are: 16, 9, 11, 25, 12, 8, 20, 10, 24, 2, 14, 18, 21, 19, 6, 22, 1, 3, 23, 17, 13, 4, 7, 15 and 5. These items, when ordered by discrimination going from greatest to least, are: 11, 16, 2, 10, 9, 20, 18, 22, 7, 12, 14, 19, 25, 6, 17, 1, 3, 5, 8, 24, 21, 4, 23, 13 and 15. These items, when ordered by the guessing parameter going from the greatest degree of guessing (the highest values) to the least degree of guessing (the lowest values), are: 2, 16, 13, 21, 23, 20, 6, 7, 25, 15, 5, 11, 3, 9, 12, 14, 4, 8, 24, 19, 1, 17, 10, 22 and 18.

The degrees of freedom for the 3 parameter model of Ascal are 17 because the model begins its estimates by breaking the data into 20 fractiles and tests the 3 parameters: difficulty, discrimination and guessing. The critical values for chi-square with 17 degrees of freedom are 27.587 (significance level .05) and 33.409 (significance level .01). A look at the data reveals that items 7, 17, 2, 4, 5, 10 and 1 lie outside a .05 confidence interval and items 5, 10 and 1 lie outside a .01 confidence interval. Of greater significance is that none of the chi-square values seems to be considerably greater than all the others. When they are ordered, the greatest difference between successive chi-square values is 6.797 (item 5 with chi-square 37.053 and item 4 with chi-square 30.256). We can conclude that for the 3 parameter model of Ascal all the items demonstrate a suitable fit.

Summary of Grade 7 Item Response Analysis

There appears to be general agreement between the item analysis models about the order of difficulty for the items in the grade 7 test. With each program, item 9 was identified as either the first or second in the order of difficulty.
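Before leaving the chi-square fit analysis above, a minimal sketch may help make it concrete. The Python fragment below is an illustration only, not Ascal's implementation: it defines a three parameter logistic item characteristic function and a fractile-based chi-square comparison of observed and model-predicted proportions correct. With 20 fractiles and 3 estimated parameters this gives the 17 degrees of freedom used above; the scaling constant 1.7 and the grouping rule are assumptions.

import numpy as np

def p_3pl(theta, a, b, c):
    """Three parameter logistic ICC: c + (1 - c) / (1 + exp(-1.7 a (theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def chi_square_fit(theta, item_scores, a, b, c, n_fractiles=20):
    """Chi-square comparing observed and predicted proportions correct in ability fractiles."""
    theta = np.asarray(theta, dtype=float)
    item_scores = np.asarray(item_scores, dtype=float)
    edges = np.quantile(theta, np.linspace(0, 1, n_fractiles + 1))
    chi2 = 0.0
    for k in range(n_fractiles):
        if k == n_fractiles - 1:
            in_group = (theta >= edges[k]) & (theta <= edges[k + 1])
        else:
            in_group = (theta >= edges[k]) & (theta < edges[k + 1])
        if not np.any(in_group):
            continue
        n = in_group.sum()
        observed = item_scores[in_group].mean()
        expected = float(np.clip(p_3pl(theta[in_group], a, b, c).mean(), 1e-6, 1 - 1e-6))
        chi2 += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi2

# Fabricated illustration: data simulated from the model itself should fit well.
rng = np.random.default_rng(3)
theta = rng.normal(size=1000)
scores = (rng.random(1000) < p_3pl(theta, a=1.0, b=0.0, c=0.2)).astype(int)
fit_statistic = chi_square_fit(theta, scores, a=1.0, b=0.0, c=0.2)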
S im ilarly, w ith each program item 5 was found to be the easiest o f the mulGple choice items and w ith the Rasch analysis using Bigsteps we found that the short answer items 26,27,29, and 33 were judged easier again than item 5. Items 31,28,30, and 32 were found to be moderately difB cult w ith item 31 being the most difG cult o f the short answer items eleventh when a ll items are ordered Gom most to least difG cu lt The other short answer ita n s, items 34 and 35, were found to be moderately easy and were twenty-second and 90 th irtie lh when placed in order o f difR culty. A measure o f the difR culty o f these items is found in the numbers o f examinees whom attempted them. O f die 1165 examinees fo r whom statistics on the short answer items were recorded, there were 636 who attempted item 31,391 who attempted item 28,386 who attempted item 30,438 who attempted item 32, 548 who attempted item 34,603 who attempted item 35,632 who attempted item 33,926 who attençted ita n 29,1049 who attempted item 27, and there were 1093 who attempted item 26. Item discrim ination, between the item analysis models, was not quite as clear. Some exception were: item 13 was ranked least discrim inating in a ll programs, item 15 was ranked ather th irty-th ird or thirty-fou rth in discrim ination, and item 8 was ranked tw enty-ninth in three o f the programs (it was twenty-second in Iteman). W ith the Rasch analysis using Bigsteps, most o f the short answer items were judged to have great discrim ination. 16)und that the ten short answer items came w ithin the top thirteen items when ranked horn greatest to least in discrim inatioiL See Table 25 in Appendix E, where the orders o f d ifh cu lty and the ordas o f discrim ination between these programs are compared. In general, the point biserial correlations fo r alternative responses fo r each item in the classical analysis using Iteman agreed quite closely w ith the respxmse curves &>r the alternative responses shown using TestG raf Several ita n s were Ragged as lacking fît in the item response analysis o f the grade 7 DM AT data using the programs Bigsteps, and the two parameter model and the three parameter model o f Ascal. A summary o f the items that lacked 6 t is shown in Table 10. The items in bold p rin t were found to lack fit in each o f the programs. 91 Table 10 Aem; m fAe Grodk 7 ZacA Zogüfzc Akm Æe^^Rge v4nafy6:gg Items Lacking F it Bigsteps 5 (.35), 7 (.37), 22 (.47), 17 (.48), 4 . (.52), 3 (.53), 1 (.54), 23 (.55), 6 (.56), 19 (.58), 15 (.59), 18 (.59), 34 (1.69), 33 (1.70), 35 (1.77), 32 (1.87), 28 (1.97), 30 (2.16) Ascal, 2 Parameter 20,4 Ascal, 3 Parameter 7,1 7,2 ,4 13,1,10, 2, 9,16 5 ,1 0 ,1 a These items ^p e a r in bold p rin t and are identified as lacking St in each o f the analyses. We Snd that items 1 and 4 are identified as lacking St in a ll three item response models. Using values S)und in Table 6, we see that item 1 hasp = .71, D = .43, and 7^» = .41; item 4 hasp = .76, D = .30, and 7),» = .35. Both items ^ypear to be moderately easy w ith good discriminaSon. In TestGraf we Snd that fo r item 1 , the discrim in a tions (measured by the 7 ^) o f the correct reqwnse and the ahemaSve response 1 are the op;x)site o f what is expected up to the tenth percentile; and in item 4 the altemaSve response 3 acts as a great distractm among the low scoring group (p=.85 at the firs t and second percentile). 
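The agreement between programs on the order of difficulty, noted at the start of this summary, can be quantified with a rank correlation. The Python sketch below was not part of the original analysis; it compares the grade 7 most-to-least-difficult orderings reported earlier for Iteman and for the 3 parameter model of Ascal.

from scipy.stats import spearmanr

iteman_order = [9, 11, 12, 16, 25, 8, 20, 10, 24, 14, 2, 18, 19,
                21, 6, 1, 22, 3, 23, 4, 13, 15, 17, 7, 5]
ascal3_order = [16, 9, 11, 25, 12, 8, 20, 10, 24, 2, 14, 18, 21,
                19, 6, 22, 1, 3, 23, 17, 13, 4, 7, 15, 5]

def ranks_from_order(order):
    """Map item number to its rank (1 = most difficult) in an ordered list."""
    return {item: rank for rank, item in enumerate(order, start=1)}

r_iteman = ranks_from_order(iteman_order)
r_ascal3 = ranks_from_order(ascal3_order)
items = sorted(r_iteman)
rho, p_value = spearmanr([r_iteman[i] for i in items],
                         [r_ascal3[i] for i in items])
print(f"Spearman rho between the two difficulty orderings: {rho:.2f}")

A rho close to 1 would confirm the general agreement described above.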
Items 5 ,7 , and 17 were identiSed in both the Rasch analysis using Bigsteps and the three parameter logistic model using Ascal as lacking St. Again using Table 6, we hnd that item 5 has p = .88, D = 2 0 and = .30, item 7 has p = .84, D = .31 and 7),* = .40; and item 17 hasp = .77, Z) = .37 and 7^6 = .41. Here again we 6nd that the item appears easy, more so even than items 1 and 4; this tim e item 5 has low discrim ination 92 w hile hems 7 and 17 have an average discrim ination. The response curves & r aU three items ^p e a r normal. The items appear w e ll behaved. In the tw o parameter and three parameter logistic models using Ascal, items 10 and 2 were identiGed as lacking Gt. Item 10, using Table 6, has p = .40, D = .61, and = .51 and item 2 hasp = .60, D = .50, and r};* = .43. Item 10 appeam to be a difB cult item w ith excellent discrim ination. Item 2 appears easier and also has excellent discrim ination. An examination o f the response curves found in TestGraf shows that fo r item 10, the correct response exhibits a negative discrim ination between the tenth and the th irtie th percentiles and the alternative response 4 acts as a great distractor fix low scoring examinees (it has p = .60 in the Brst and second percentiles); and B)r item 2, the correct response exhibits a slight negative correlation between the tenth and twenty-GAh percentiles. In the Rasch analysis using Bigsteps, items 3 ,2 3 ,6 ,1 9 ,1 5 ,1 8 ,3 4 ,3 3 , 35,32,28, and 30 are also identiBed as lacking Bt. O f these, only the Brst six items can be compared using data Bom Table 6 and item ICCs shown using TestG raf Item 3 has p = .73, D = .38 and D = .45 and = .20 and = .40. Item 23 hasp = .75, D = .31 and rp* = .34. Item 6 hasp = .70, = .41. Item 19 hasp = .67, D = .46 and = 25. Item 18 has p = .61, D = .62 and = .44. Item 15 has p = .76, D = .55. Each o f these six items can be termed moderately easy. Moreover, w ith the exception o f item 15, aU exhibit at least average discrim ination. Items 6,19 and 18 appear to have excellent discrim ination. In TestG raf the response curves fo r item 3 show that alternative response 4 is an excellent distractor fo r most low scoring examinees (p = .70 fo r examinees in the Brst and second percentile), w hile the other response curves ^p e a r normal. In ita n 23, the 93 coirect response exhibits a slight negative discriin in atio ii among the low scoring examinees (to the tenth percentile). In items 6 and 15 there is a slight negative discrim ination in the high scoring examinees. In item 19 the response curves appear to behave ju st as expected. In item 18, response 4 acts as an excellent distractor and is favored by examinees up to the 50^ percentile. The response curve fo r the correct response shows a slight negative discrim ination fo r the highest scoring students. In the tw o parameter logistic model using Ascal, items 2 0 ,1 3 ,9 , and 16 show a possible lack o f Gt. Item 20 has ^ = .39, Z) = .46 and .18 and and = 2 0 . Ito n 9 h a s = 26 , D = .35 and = .43. Item 13 has p = .76, Z) = = .37. Item 16 has^ = .32, D = 21 = .20. Three o f the four items (20,9, and 16) ^p e a r to be difG cult items. The fourth item (13) is a moderately easy item . In only tw o o f the items (20 and 9) do we Gnd average to excellent discrim inadon. In TestG raf we see that fo r item 20, the response curve fo r the correct response shows a slight negaGve discriminaGon fo r high scoring examinees (over the ninty-GAh percentile). 
The response curves in item 13 are Gat and show GtGe discriminaGon. The reqxmse curve fo r the correct response in item 9 shows a strong posiGve discriminaGon fo r high scoring examinees (above the seventyGAh percentile). And Gnally, the response curves in item 16 fo r the correct response and the strongest alternative response (response 1) show a large reversal o f the expected discnminaGons fo r the high scoring examinees. In most o f the above items it appears that lack o f Gt may stem Gom three sources: an easy item lacking in discriminaGon, a difBcuhy item w ith strong distractors or an item where the responses misbehave Grr a small proporGon o f the examinees. Item 16 has a 94 response cwve that can be considered atypical because responses by the higher scoring examinees is contrary to what is expected. R eliability o f C riterion Measures Two separate runs o f the Iteman program were conducted 6>r each o f the DM AT tests. The Grst run identified two subtests to be analyzed. The Grst subtest included a ll mulGple choice test items. The second subtest included aU short answer test items. The internal consistency estimates (coeGScient alphas) fo r these subtests are shown in Table 11. In the second run, the test data was separated into Gve subtests. The Grst four subtests consisted o f mulGple choice test items categorized by provincially defined learning strands. The Gfth subtest was a ll d io rt answer test items. The internal consistency estimates fo r these subtests are also shown in Table 11. Both runs were made w ith the grade 5 D M AT data and the grade 7 D M AT data. Overall test staGsGcs fo r the iniGal run o f the grade 5 data set using Iteman were: fo r mulGple choice test items, a mean o f 19.165 (out o f 30 item s), standard deviaGon o f 5.631, skewness o f -0.407, kurtosis o f -0,407 and a standard error o f measurement o f 2.277; fo r short answer test items, a mean o f 7.716, standard deviaGon o f 3.619, skewness o f 0.158, kurtosis o f -0.736 and standard error o f measurement o f 1.157. The test staGsGcs fo r the in itia l run o f the grade 7 data set using Iteman w oe: fa r mulGple choice test items, a mean o f 14.607 (out o f 25 item s), standard deviaGon o f 4.353, skewness o f -0.185, kurtosis o f -0.307 and a standard error o f measuranent o f 2.136; fo r short answer test items, a mean o f 10.362, standard deviation o f 5.020, skewness o f 0.184, kurtosis o f 0.726 and standard error o f measurement o f 1.146. 95 Tablell f/K CZfMfiW ^/zo/yfM [/ymg Aeman 1 Dichotomous Items 1 No o f Items | Alpha M ultipoint Items No o f Items | Alpha 0.837a Grade 5 30 6 0.707b Numbers 16 5 Patterns and Relations 0.589b Shape and Space 5 0.507b Statistics and Probability 4 0.432b Grade 7 0.759. 25 10 Numbers 13 0.665b Patterns and Relations 3 0.324b 5 Shape and Space d.322b Statistics and Probability 4 0.325b Tiese are the coefBcient alp las fo r dichotomous items in t Me Grst run. b These are the coefBcient alphas fo r the curriculum strands in the second run. c These are the grade 5 and grade 7 coefBcient alphas 6)r both runs. 0.898c 0.948c It can be noted that when the firs t subtest (in the hrst run) is sectioned into smaller units the re lia b ility decreases. This is to be expected. Sax (1997) points out that one o f the factors affecting re lia b ility is the number o f items on the test. B y sectioning iq) the dichotomous test items we are in essence creating four smaller tests. 
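The coefficient alphas in Table 11 were produced by Iteman. As a minimal sketch of the definition only - not Iteman's implementation - the Python fragment below computes coefficient alpha and illustrates Sax's (1997) point that a shorter subtest tends to produce a lower alpha. The simulated response matrix is fabricated purely for illustration.

import numpy as np

def coefficient_alpha(item_scores):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of total score)."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Fabricated 30-item response matrix driven by a single ability dimension.
rng = np.random.default_rng(2)
ability = rng.normal(size=400)
difficulties = rng.normal(size=30)
probs = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulties)))
items = (rng.random((400, 30)) < probs).astype(int)

alpha_full_test = coefficient_alpha(items)                 # alpha for all 30 items
alpha_five_item_strand = coefficient_alpha(items[:, :5])   # typically noticeably lower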
From Table 11 we observe that die coefRcient alphas fo r the tw o sets o f tests (dichotomous items and m ultipoint items) are 0.759 and higher. This indicates a strong internal consistency and leads to the conclusion that both the Grade 5 and the Grade 7 tests exhibit a high degree o f internal re lia b ility. As indicated previously, the short answer test items were graded by a team o f markers. R eliability was assessed w ith randomly selected tests copied and marked by all markers. This process was undertaken w ith both groups o f raters - grade 5 and grade 7 w ith a few m inor diflerences. There were three grade 5 markers and 6ve grade 7 96 | 1 markers. The grade 5 madcers marked 19 papas in common and Ihe grade 7 markers marked 20. And one o f the grade 5 markers was away on one o f the days that inter-rater reliabdhy was assessed. Table 12 Diytnhndon Item 31A 3 IB 32A 32B 33 34A 34B 34C 35 36A 36B Mean Total Standard Deviation ^ Marks 0 1 0 1 0 1 0 1 0 1 2 3 4 0 1 2 0 1 0 1 0 1 2 0 1 0 1 /ô r the Jÿgms m the Groiie 5 D M 4T Rater 1 42.1 57.9 94.7 5.3 632 36.8 57.9 42.1 0.0 21.1 10.5 31.6 36.8 26.3 31.6 42.1 26.3 73.7 47.4 52.6 36.8 15.8 47.4 36.8 63.2 10.5 89.5 Rater 2 33.3 66.7 100.0 0 33.3 66.7 33.3 66.7 0.0 33.3 0.0 33.3 33.3 33.3 33.3 33.3 22.2 77.8 44.4 55.6 0.0 44.4 55.6 44.4 55.6 11.1 88.9 Rater 3 42.1 57.9 94.7 5.3 36.8 63.2 52.6 47.4 0.0 27.8 5.6 22.2 44.4 27.8 27.8 44.4 27.8 722 55.6 44.4 15.8 31.6 52.6 36.8 63.2 10.5 89.5 9.32 3.33 10 3.97 9.89 3.03 97 1 1 The overall scores as w ell as Ihe item by ita n percentages o f marks awarded are shown in Table 12 fo r the grade 5 markers and Table 13 fo r the grade 7 maikers. Table 12 dmws the proportions o f students assigned the diflerent marks by each o f the raters fo r each o f the items on the grade 5 tesL That is, fo r item 31A : Rater 1 gave 42.1% o f the examinees a mark o f 0 and 57.9% o f the examinees a mark o f 1, Rater 2 gave 33.3% o f the same examinees a mark o f 0 and 66.7% o f them a mark o f 1 and Rater 3 gave 42.1% o f these examinees a mark o f 0 and 57.9% o f drem a mark o f 1. The jBnal two lines show the mean mark and the standard deviation fo r the marks each rater assigned this grorq) o f examirKes (19 in total). These Ggures show that Rater 1 can be considered the most severe and Rater 2 can be considered the easiest marker. Even so, these markers can be considered equal because using the effect size index where I ef = -—'----- , f is a marker, and p and o are the sample mean and standard deviation (T respectively(Hurlburt, 1998) and taking an even larger difference by choosing the easiest and most severe markers, we Snd that = .20. The critica l value w ith 2 is greater than 4 6)r a significance o f .05; and so we can conclude that there is little difference between the markers. In a sim ilar way. Table 13 shows the percentages o f students assigned the diSerent marks by each ofth e raters Rrr each o f the items on the grade 7 tesL As w ith the grade 5 data (Table 12), the last tw o lines o f Table 13 show the mean and standard deviation o f the overall marks each rater assigned this group o f examinees (20 in total). These Ggures show that Rater 1 can be considered the most severe and Rater 4 can be considered the easiest marker. 
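Assuming that Hurlburt's (1998) effect size index referred to above is the difference between two means divided by a standard deviation, the short sketch below approximately reproduces the d of about .20 reported for the grade 5 markers. Which standard deviation serves as the denominator is an assumption; the most severe rater's standard deviation from Table 12 is used here.

def effect_size(mean_a, mean_b, sigma):
    """Hurlburt-style effect size: absolute difference in means divided by sigma."""
    return abs(mean_a - mean_b) / sigma

# Overall grade 5 marks from Table 12: Rater 2 (easiest) versus Rater 1 (most severe).
d_grade5 = effect_size(10.00, 9.32, sigma=3.33)
print(round(d_grade5, 2))   # approximately 0.20

The same calculation applied to the grade 7 raters in Table 13 gives a value close to the d = .16 reported below.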
98 Table 13 ^ A zfgrfyôr fAe STzorf ,4?myer Ae/w m f^Ae Grodle 7 D A t^T Item 26 27A 27B 28 29A 29B 29C 30 31 32 33 34A 34B 35A 35B Marks 0 1 2 3 4 0 1 0 1 2 0 1 2 0 1 0 1 0 1 0 1 2 0 1 0 1 2 0 1 2 0 1 0 1 0 1 0 1 Mean Total Standard Deviation Rater 1 IS 20 15 10 40 20 80 40 40 20 65 10 25 45 55 45 55 50 50 70 0 30 40 60 50 15 35 55 10 35 40 60 35 65 25 75 35 65 11.70 5.33 Rater 2 15 20 15 10 40 20 80 15 60 25 65 10 25 45 55 45 55 50 50 65 0 35 40 60 35 25 40 55 10 35 35 65 35 65 25 75 45 55 12.25 4.94 99 Rater 3 15 15 20 10 40 20 80 35 50 15 65 10 25 45 55 45 55 50 50 65 0 35 40 60 40 25 35 50 15 35 40 60 40 60 25 75 40 60 1 1 .^ 4.98 Rater 4 15 20 20 5 40 20 80 20 55 25 65 10 25 50 50 45 55 60 40 65 0 35 35 65 40 25 35 55 10 35 35 65 40 60 30 70 40 60 12.55 6.03 Rater 5 15 20 15 10 40 20 80 40 40 20 65 10 25 45 55 45 55 55 45 65 0 35 37 63 50 15 35 55 10 35 35 65 40 60 25 75 40 60 12.11 4.88 As w ith the grade 5 data, we Gnd that these raters can also be considered equal. Using the same calculations w ith the easiest and most severe ratas, we 6nd that (/ = .16 and the critica l value fo r .05 significance w ith 4 is over 2. We can conclude that there is little difference betw eai the markers. In Table 14, the correlations between the markers fa r both sets o f tests - the grade 5 and tlK grade 7 is drown. For the grade 5 test, a factor that influenced these correlations was that Rater 2 was away fo r one o f the two days that inter-rater re lia b ility was measured. As a result rather than marking a ll 19 papers, this rater marked only 9. The smaUo" number o f papers w ith w hich to compare results would lead to a lesser correlation. Even though the correlation between raters 1 and 2 is a b it low , it is s till sufBciently high to indicate an acceptable level o f agreement between these raters. For the grade 7 test, the correlation betw eai raters was .95 or better, indicating excellent correlation between these markers. Table 14 Corre/afzon; RePygen AAirtgrf Grade 5 2 .82 1 Rater 1 Rater 2 Rater 3 “ 3 .91 .93 ” - Grade 7 1 Rater 1 Rater 2 Rater 3 Rater 4 Rater 5 - 2 .94 3 .98 .97 “ - 4 .95 .98 .97 “ 100 5 .99 .95 .98 .96 Content Related V a lid ity As mentioned in Chapter Three the design o f the DM ATs was determined by the SDMC members in l9 9 5 . The teachers who were recruited to w ork on the actual tests were instructed to develop items based on the learning outcomes listed in the B ritish Columbia Mathematics Integrated Resource Package (IRP), 1995. The in itia l match o f learning outcomes w ith items was completed by these subcommittee members. This was brought to the SDMC to be ratiGed. The match between learning outcomes and test items was further reviewed by the members o f the SDMC after each tim e the D M A T's were administered. Over the years 1996,1998,1999 and 2000 several changes occurred. Some items in the grade 7 test were discarded. The Performance Assessment part o f the test was discontinued. The grade 9 test was discontinued. There are 4 table; included in Appendix G w hich outline the Tables o f SpeciGcadons fo r these tests. The inform ation in Tables 24 and 26 relate test items to the Strands, Substrands and learning outcomes o f the B iiG d i Columbia IRPs. This infbrmadon summarizes the analysis o f the SDMC members. The infbrm adon in Tables 25 and 27 relates the test items to the learning outcomes outdned in the Western Canadian Protocol & r CoUaboradon in Basic Educadon. These two documents are closely related w ith many o f the same learning outcomes. 
I have included these tables because they outline the mathematical processes that are associated w ith each learning outcome. 101 Table 15 jk /w m f/ze Griadb 5 aW Gra^jb 7 D M 4Z; GrgaMfzere calculatmg the correlation between D M AT scores and school grades, I tested the data to determine whether or not there was a diSerence between the scores o f the group 5)r w hich ta rn and/or Gnal grades were available and the group fo r w hich no marks were available. This was done using a t-test on the data. The hypothesis being tested fo r each grade was the same: there w ill be no signiGcant diHerence between the D M AT scores 6 r the groiq) w ith marks recorded compared to the group that has no marks recorded. This can be ea^xressed as Ho: //i = Ha: or //i- - 0 and or jWi- /fz # 0. W ith the grade 5 data, the complete data group and the missing data groiq) varied slightly in mean score and in standard deviation. The results were 26.72 (9.177) and 23.69 (8.679). A t-test resulted in = -4.23 and w ith = :^1.96 and 1295, it showed that the sample diBerences were signiGcant. W ith the grade 7 data the results were very sim ilar. The complete data group had a mean score and standard deviaGon o f24.86 (9.063) and the missing data group had 106 22.80 (8.624). The t-test again resulted in = -2.75 and w ith tc = ±1.96 and 1165, it also Aowed that the sample difkrences were significant. In Table 18 the correlation between die D M AT scores and student grades is displayed. The data fo r this analysis was lim ited to the group o f grade 5 students fo r whom we have Gnal and/or term marks. Table 18 includes correladons between the parts o f the D M AT, between the D M AT and the FSA scores, and between the D M AT and the students' term and Gnal marks. Table 18 Corre/atzoMfybr tAe Grade 5 D M 4 T Between DM4T&orgf, FlSd &ores and Term and Fzna/M zrks MC SA .67 D M AT .95 .87 FSA .25 .19 .25 Term 1 .51 .37 .49 .26 Term 2 Term 3 Final ” MC .53 .53 .52 SA .38 .42 .41 " D M AT .51 .53 .53 FSA .27 .23 .23 * Term 1 .79 .80 .87 “ Term 2 .86 .93 “ Term 3 .93 « Final Note. MC is the mulGple choice secGon (part 1) o f the D M AT w ith staGsGcs: 19.17, 5D = 5.63, n = 30 and SA is the short answer secGon (part 2) o f the test w id i staGsGcs: M = 7.72,5Z) = 3.62, n = 16. Final re&rs to the Gnal grades. In Table 18, we observe a general increase in the correlaGon between the overall D M AT scores and the term grades going Gom the Grst term to the Gnal grade. This is to be expected as the Gnal grade provides a measure o f the curriculum studied during the whole year. There is also a strong correlaGon between the term and Gnal marks. This too is to be e)q)ected as the Gnal mark reGects the total o f the ta rn grades. Using Cohen's criteria forjudging eGect size (H urlburt, 1998), r = . 1 is a small effect size, r = .3 is a medium eGect size and r = .5 is a large eGect size, we observe that: the correlaGon 107 between the D M A T scores and fin a l grades is large; the correlation between tl^ dichotomous items scores and Gnal grades is large; the correlation between the m ultipoint items scores and Gnal grades is medium; and die correlation between the D M AT scores and FSA scores is small. In a sim ilar manner. Table 19 shows the various correlations between the D M AT and term and Gnal student grades. W ith the grade 7 data there is no reference to FSA semes because grade 6 student do not w rite FSA Numeracy tests. 
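Returning briefly to the t-test reported earlier in this section, the grade 5 result can be reproduced approximately from the published summary statistics. The sketch below is illustrative only (the original analysis used SPSS); the group sizes are taken from Table 21 in Appendix B - 1,112 students with at least final marks and 185 without - and the result is a t of roughly -4.2, close to the reported -4.23.

from scipy.stats import ttest_ind_from_stats

t_stat, p_value = ttest_ind_from_stats(
    mean1=23.69, std1=8.679, nobs1=185,    # grade 5 group with no marks recorded
    mean2=26.72, std2=9.177, nobs2=1112,   # grade 5 group with term/final marks recorded
    equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")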
Using Cohen's criteria forjudg ing effect size: correlations between D M AT scores, dichotomous items scores and m ultipoint items scores and Gnal grades are a ll large. Table 19 Corre/oTronfybr the (Trodk 7 D M 4 T D M 4 T S c o r e s and Term oWFmoZ AAartr MC SA .72 D M AT .92 .94 Term 1 .52 .49 .54 Term 2 .55 .52 .58 .80 Term 3 .54 .53 .58 .77 .80 Final .57 .55 .60 .88 .91 .92 “ MC “ SA D M AT Term 1 « Term 2 “ Term 3 “ Final Note. MC is the m ultiple choice section (part 1) o fth e D M AT w ith statistics: A f = 14.61, &D = 4.35, n = 25 and SA is the short answer section (part 2) o f the test w ith statistics: M = 10.36, SO = 5.02, n = 23. Final refers to the Gnal grades. Table 18 shows the strengths o f the various correladons that exist between the D M AT scores, the FSA scores, term and fin a l grades Gir the grade 5 students. Sevaal patterns appear to emerge. The correladons o f the total scores on the grade 5 D M AT and the term and Gnal grades are closely ^rproxim ated by the same correladons measured using the m iildple choice items only. On the whole, correladons o f D M AT scores to 108 term and ûnal grades increase as one progresses &om term 1 grades through to final grades. This is to be expected because this D M AT was designed to test the whole grade 4 curriculum . It is assumed that the fin a l grade is the most accurate measure o f achievement on a ll curriculum elements. As noted previously, the correlation o f DM AT scores to FSA scores was small. S im ilarly, the correlation o f FSA scores to term and Gnal grades was small; although a direct comparison could be misleading as the FSA scores were measured on a Gve point scale. It appears that the grade 5 D M AT acted as a better measure o f student proGciency than the FSA. For the grade 7 DM ATs, we look at Table 19. Table 19 shows the strengths o f the various correladons that exist between the D M AT scores and the FSA scores, term and Gnal grades fo r the grade 7 students. The correladons are higher fo r the m uldple choice secdon than the short answer secdon; however, the best comparison seems to be between the total D M AT scores and the Gnal grades. Again, this is W iat one would e)q)ect. On the whole, correladons o f D M AT scores to term and Gnal grades progressively increase as one goes Gom term 1 grades through to Gnal grades. It is assumed that the Gnal grade is the most accurate measure o f achievement on all curriculum elements. Addidonal V a lid ity Evidence Further va lid ity evidence des in how the DM ATs have been received and accepted in the school d is tric t For Gve years the tests were administered to students in the school district. During that dme they have been accepted by principals and teachers alike as a test o f students' mathonadcs abG i^. Re&rence was made in chapter 2 to 6ce validity. This is an example o f i t In gathering data fo r this study many principals 109 expressed an interest in hearing the Gndings o f this research. In some school teachers on hearing about the topic o f this study began conversing heely about the tests, even to the specificity o f certain items on the tests. It impressed me that not only were they aware o f the tests but in talking about them, they were accepting them as mathanatics tests. They ^p e a r to be a mathematics tests and have been received as such. 110 CHAPTER 5 - DISCUSSION Summary There were three main issues examined in this thesis, aU related to the analysis o f the grade 5 and grade 7 DM ATs. 
The first issue raised was the reliability and the validity of the tests themselves. This analysis has provided an opportunity to examine the strengths and weaknesses of the tests overall as well as those of individual test items. It also provided an opportunity to examine how these items and/or tests can best be used in future assessments.

The second issue involved the process under which the DMATs were developed and implemented. It involved the construction, administration and scoring of the tests. The tests were designed in response to the implementation of a new mathematics curriculum, and the new curriculum was itself a response to a new way of thinking about the learning of mathematics. The school district embarked on a process designed to test how well the new curriculum was being implemented in the classroom. This study provided an opportunity to study that process.

The third issue involved the analysis of the test items. I chose to examine the test items using four different approaches to test analysis: the first, a classical approach with its concentration on classical test statistics; the second, an assumption free approach with its concentration on constructing and analyzing ICCs; the third, item analysis based on Rasch analysis; and the fourth, a logistic item analysis using two and three parameter models for information about item difficulty, item discrimination and pseudo-guessing. The tools used in these analyses reflected these different approaches. For the classical analysis, I used Iteman to generate test and item statistics. For the assumption free analysis, I used TestGraf to construct ICCs that graphically presented the data without overriding assumptions about the item parameters. For the Rasch analysis, I used Bigsteps to provide information on item difficulty, and for the two parameter and three parameter logistic analyses I used Ascal. In using these different programs I had an opportunity to compare them and, indirectly, to compare the different approaches to test analysis.

Suggestions will follow my conclusions, but first I would like to summarize the steps that were taken in the analysis of the reliability and validity of the DMATs. My analysis began with observations about the internal consistency of the tests. I used Iteman to assist in the calculation of the overall test statistics needed for this analysis. There were two separate runs of Iteman for each DMAT test. The first focused on two subtests - dichotomous test items and multipoint test items. The second further divided the dichotomous test items into the four mathematics strands defined in the British Columbia, Ministry of Education, IRP for mathematics.

To study the reliability of the marking of the multipoint test items, 19 tests (for grade 5) and 20 tests (for grade 7) were photocopied and marked by each of the markers. The results were compared to observe if there were significant differences between the markers. Correlations between the markers were also calculated.

To obtain the data needed for observations about test validity, I contacted school principals and obtained FSA scores for a sample of students who wrote the grade 5 DMAT and term and final grades for a large sampling of students who wrote the grade 5 or grade 7 DMAT. The data were compared, and correlations between the DMAT scores, multiple choice scores, short answer scores, term and final grades and, in the case of the grade 5 students, the FSA scores were all included.
This analysis was conducted using the program SPSS. Inform ation 6om the SDMC was used in tabulating the test items fo r a Table o f SpeciGcations based on the B ritish Columbia Mathematics IRP. The Table o f SpeciGcations showing matbemaGcs processes was compiled using the infbrm aüon Gx)m the SDMC and cross tabulating it w ith infbrmaGon taken Gom The Western Canadian Protocol fo r CoUaboraGon in Basic EducaGon. InfbrmaGon Gom teachers, administrators and school district personnel was used in determining face vahdity observaGons. In assessh% the process involved in the designing, constructing and implementing the DM ATs, I used in&rmaGon Gom the SDMC. The analysis o f the interrater reHabihty also contributed to the overall assessment o f the process as did the analysis o f the test items. The analysis ofthe test items involved the use o f several computer programs and the data had to be organized in a way that was compatable w ith each. I began by using Iteman and a classical ^iproach. As menGon previously, there were two separate runs fo r each test using Iteman; the analysis o f the items is independent ofthe number o f subtest involved. For the analysis o f the items I used the data G"om the Grst run (dichotomous test items and mulGpoint test item s). In terms o f the individual test items, fb r the grade 5 DM AT, those items that display a poor discriminaGon were easia" items w hich w ould not norm ally be expected to show much discriminaGon. Items which were identiGed as hard items generally showed good to great discriininaGon, a desirable attribute o f an achievement test. There was no evidence that items were excessively severe and so examinees sim ply guessed nor is 113 {here any evidence that would suggest that an items were keyed incorrectly. The alternative responses appear to behave as they should. In the analysis o f the individual test items, fb r the grade 7 D M AT, most items that showed a low level o f discrim ination were easier items and a lower discrim ination level is expected. There is an exception however in question 16. This question has a diGBculty level measured as = .43 and a discrim ination index o f D = .21 ; so it a d iffic u lt question w ith poor discrim ination. N ot too surprisingly, question 16 also has an alternative response (3) that has a positive point biserial. This is not a desirable tra it fb r an achievement test. This is an item that should be considered fb r exclusion. I fallow ed (his in itia l analysis by examining the TestGraf output. I was able to con&m many o f the aspects fb r individual test items that I fbund using Iteman by examining die response curves fw each item . The added feature w ith TestGrafwas diat I was able to observe how the response curves varied over the changing student a b ility levels. A b ility levels were measured along an expected scores axis. D ifB culty and discrim ination were measured as a function o fth e students' abilities. This meant that it was possible to predict at what levels the items displayed the greatest and/or least discrim ination. In some instances, items that demonstrated poor discrim ination as identiGed by Iteman were fbund to have excellent discrim ination among low scoring students. This type o f question would be ideal fb r identifying students who are at risk. In a sim ilar way it was possible to examine some items that were idenGGed, using Iteman, as hard items and h i observe how the items showed great discriminaGon among high scoring examinees. 
This type o f item w ould be ideal fb r identifying students who may q u a li^ G)r special enrichment classes, programs or awards. 114 W ith the assumption 6ee analysis using TestGraf I was able to observe the behavior o f item s along the ICCs geno-ated. This provided an opportunity to examine discrim ination at various points along the d iS icid ty/a b ility continuum. The items in the grade 5 D M A T that discriminated w ell 6)r low scoring students (below the 25* percentile) were 2 ,4 ,6 ,7 ,9 ,1 0 ,1 2 ,2 0 ,2 3 , and 24. The items that showed a slight negative discriinination fo r these students were 1 ,4 ,5 ,1 1 , and 15. The items that discrim inated w e ll & r high scoring students (over the 75* percentile) were 1,15, and 21. The items that showed a slight negative discriinination fo r these students were 8 ,9 ,2 1 , 25, and 27. The other items discrim inate w ell & r students in the m iddle ranges. In the grade 7 D M AT, items that discrim inated w ell fo r low scoring students (same percentiles as fo r grade 5) were 3 ,4 ,5 , 7,15,17, and 23. The item that showed a slight negative discrhnination fo r these students was 21. The items that discriminated w ell fo r the high scoring students were 8, 9 ,1 0,1 1,20 , and 25. The items that showed a slight negative discrim ination were 6,1 2,1 5, and 21. Item 16 showed quite a pronounced negative discrinnnation fo r the highest scoring students. The other items discrim inate weU fo r m iddle scoring students. A fter reviewing the TestGraf output I conducted a Rasch analysis o f the data using the program Bigsteps. M y ch ie f focus was to analyze the levels o f di@ culty and discrim ination fo r the short answer questions; however, it provided an opportunity fo r me to compare a ll the items short answer and m ultiple choice. I was able to compare them and rank them in terms o f difS culty and discrim ination. In addition, I was able to id e n tic items that did not appear to St a Rasch model. Knowing which items lacked St provided the opportunity to once again examine them using TestGraf and Iteman. 115 The Gnai analysis I did was w ith the program Ascal where I examined the data using a 2 parameter model as w ell as a 3 parameter model. The resulting output gave me the opportunity to identify the items that lacked St. As w ith Bigsteps, I used this inform ation to focus examination o f these items using Iteman and TestG raf Conclusions There are conclusions that can be reached about the test items themselves. In the grade 5 test, the m ultiple choice ita n s appear to be behaving as expected. There is a broad range o f difB culty and the individual items fa r the m ultiple choice section seem to discrim inate w ell. The short answer items ^p e a r more severe. The distribution o f scores fo r the short answer items is positively skewed (.158) indicating that the items were &und to be generally difB cult. This appears to be die case eqiecially w ith items 31 and 32. I f items were to be changed, these two short answer items should be considered. There is a strong correlation between the m ultiple choice items and students' grades so it is conceivable that the m ultiple choice items alone could provide the SDMC the inform ation needed about student achievement in mathematics. In the grade 7 test, as w ith the grade 5 test, there is a broad range o f item difG culty. Most ofth e items behave as expected. The exception is ite m l6. This item has p = .32, Z)= 21 and = .20 (6om Table 6). 
The ICC as shown using TestGraf is fla t horn the Gfteendi percentile to about the eightieth percentile and even w ith high scoring examinees it behaves unusually (see Figure 9). The item , when we used the two parameter logistic model o f Ascal, showed the greatest lack o f Gt. This item should be replaced. The distribution o f scores fo r the short answer items was positively skewed 116 (. 184) indicadng that this part o f the test is generally diS icnlt. The correlations in table 20 show a strong relationship between the short answer scores and students' m arts. Even though it is a hard test it s till spears to give valuable inform ation and should therefore be retained. The manner in which marks are assigned fo r the short answer items should be reviewed. For an achievement test, is it im portant to make a distinction between students who attempt a question and get zero and students who do not even attempt the question? The analysis o fth e short answer section is made more difB cult by choosing this distinctioiL I f the inform ation is not needed and/or useful, it may be advisable to sim ply assign zero to a ll in this sort o f situation. Item 34 in the grade 5 test needs to be separated into two items i f the same scoring sheets are to be used. The item is out o f six marks and the scoring sheets have only room & r 0 up to 4. Finally, w ith the level o f difB culty known fo r each o f the test items, grade 5 and grade 7 items can be reorganized so that they progress generally 6om easier items at the beginning to the more difB cult items toward the end. Such an organization w ill provide most students a better opportunity to accurately show what they know. There appears to be solid evidence that the SDMC was successfW in constructing an assessment instrument that showed how students were achieving in mathematics. In part, whether a test was valid or not is a function o f whether or not it has met its original purpose. W ith the DM ATs, the original purpose was to assess student achievement during a period o f transition Bom the old mathematics curriculum to the new mathematics curriculum . The tables o f speciBcation, tables 26 and 28 in appendix G, show clearly how the test items matched the new curriculum . Although the test items 117 did not incliide every learning outcome (to do so would have meant including 55 o f them), the resulting correlation between the DM ATs and students' Snal maHcs are sufRciently strong to consider that the test items can be generalized across the curriculum . The tables o f q)eci6cations, tables 27 and 29 in appendix G, support this assessment They outline the mathematical processes that are in evidence fo r existing test items. These same mathematical processes are vdiat one finds in a ll the learning objectives. The one exception, in this, is the technology component in the grade 5 DM AT. The members ofthe SDMC decided to construct a mathematics achievement te st As was discussed earlier (Chapter 2), the recommended range o f difB culty fo r items in an achievement test is .30 < p < .70. For m ultiple choice test items we could skew these values slightly higher. A range o f .40 < p < .80 w ould be acceptable. For the grade 5 test and using data 6om Table 1, there are twenty-three out o f th irty items that 611 w ith in this range. O f those outside the range & u r items are easier (p > .80) and three items are harder (p < .40). The overall measure o f kurtosis (-.407) siq>ports what we hnd about the distribution o f items by difB culty. 
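A minimal sketch of this screening step follows; it is illustrative only and was not part of the original analysis. It classifies multiple choice items against the .40 to .80 difficulty band discussed above and reports the skewness and excess kurtosis of the total score distribution. Whether the band endpoints are inclusive is an assumption, and the p-values passed in would come from Table 1 or Table 6.

import numpy as np
from scipy.stats import skew, kurtosis

def screen_difficulty(p_values, low=0.40, high=0.80):
    """Split item difficulties into within-range, too easy and too hard (1-based item numbers)."""
    p = np.asarray(p_values, dtype=float)
    return {
        "within": np.where((p >= low) & (p <= high))[0] + 1,
        "too_easy": np.where(p > high)[0] + 1,
        "too_hard": np.where(p < low)[0] + 1,
    }

def score_shape(total_scores):
    """Skewness and excess kurtosis (normal distribution = 0) of the total scores."""
    return skew(total_scores), kurtosis(total_scores)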
We can expect the scores to be more w idely distributed. In a sim ilar way using data from Table 6, in the grade 7 test we Bnd that sixteen out o f twenty-hve items 611 w ith in the recommended range. O f 6ose outside the range, two are easier items and seven can be considered harda^ items. An overall kurtosis o f -.307 siq)ports d iis wider, fla tte r distribution o f scores. A comparison o f D M AT scores w ith term and fin a l grades and, in the case o f 6 e grade 5 D M AT, w ith FSA scores provides an opportunity to compare 6 e DMATs 118 and the FSA —Numeracy test The correlations in Tables 18 show that the correlation between the D M A T scores and 6nal grades is .53, a strong e fk c t using Cohen's criteria forjudging e fk c t size. S im ilarly, it shows that the correlation between the FSA scores and the fin a l grades is .23, a small to medium eGect using Cohen's criteria. It can be concluded that the D M AT is a more pow erful measure o f mathematics achievement than the FSA. The strong correlation between the markers o f the short answer items is another indicator o f how successful the SDMC members were in designing and implementing the DMATs. The raters were w ell aware o f how the tests were to be graded. It ^ypears that the grading process was such that raters clearly understood how marks were to be ^)portioned. By way o f an overall conclusion to the test construction process, the members o f the SDMC can be commended fo r the successful design, construction and im plim entation o f these mathematics achievement tests. Using a variety o f analysis programs to study the D M AT data has provided an opportunity to compare the usefidness o f each. W ith Iteman 16)und I had an excellent starting point in considering these tests. I hmnd it to be particularly valuable w ith the dichotomous test items. The data were processed quickly, overall test statistics and item statistics were available rig h t away and the ouqmt was easy to read. It fnovided a good indication as to whether or not the items were behaving as they should. I was able to examine correlation o f a ll responses but was especially able to check that the alternative responses were behaving as they should. There were however some lim itations. Because a ll calculations were taken directly horn the test data, t k test analysis and the item analysis was test dependent. It was good inform ation to have but it was lim ited to the 119 examinees and how the examinees responded to these items 6)r this adm inistration o f the tests. By using TestG raf I was able to use the data to predict probable responses and to thereby analyze how items were behaving across a range o f students abilities. The program is set up to use probabilities o f correct responses and expected scores. It is therefore possible to examine the maximum likelihoods fo r certain responses. The data were used to predict responses by studoits across the range o f a b ility levels. This was particularly valuable in examining discrim ination levels 6*r the various a b ility grotg)s. I found the TestGraf output fo r m ultipoint items difB cult to read and as a result used the program in the analysis o f the dichotomous items only. 1 used Bigsteps to examine the short answer test items. However, I found the program difB cult to setup and the output difB cult to read. There was a lo t o f o u ^u t and 1 lim ited my considerations to an analysis o f the items. It could have also been used in examining the examinees (persons). 
I used Bigsteps and Ascal to identify items that could have been causing difBcuhies. Even in id a itify in g an item 1 found that I reverted to Iteman and TestGraf to examine more closely what could have been happening w ith it. In terms o f w hich programs were most useful, Iteman and TestGraf were this researcher^ s preferences. The Iteman program provided valuable inform ation about how the grade 5s and the grade 7s students who wrote the DM ATs in 2000 responded to the test items. The TestGraf program provided valuable inform ation about item discrim ination at various levels o f a b ility. The programs Bigsteps and Ascal, although more d iË c u lt to use, provide an opportunity to compare test results in subsequent years w ith diGerent groups o f grade 5 and grade 7 students. They also provide an opportunity, should the tests be changed, to 120 reference any new items to the older items fo r w hich difB culty, discrim ination and pseudo-guessing values have already been established. They provided anchor points fo r any new test items. In this regard, although they were diGBcult to use, they also provide valuable inform ation. Lim itations ZWa CoZZectfon The cleaning up o f the data was a long and tedious a f& ir which cannot be considered complete because there were s till several students who could not be identiGed 6om the inform ation recorded from the DM AT. The biggest factor seemed to be the inaccurate coding that appeared on the bubbled answer sheets. A t the very least the inform ation should be made clear at the very beginning. An even better course o f action could be the identiGcation o f students by a bar coded or machine stamped label which can be afSxed to the answer sheet and which wiU clearly ide ntify students by name, number (PEN), and school. Such an identiGcaGon could occur even before the test booklets were sent to schools. The correlaGons, in Table 18, show to what degree there is agreement between the various mathematics measures. There appears to be a strong reladonship between the DMATs and students' marks and the relaGonship between the FSA scores and students' maiks appears much weaker. CauGon should be exercised in conclusions relating to the FSA scores. The data collected on FSA scores was not taken Gom the raw test data. Rather, it was taken Gom the reports sent to schools and to parents. This data consists o f only Gve points and is therefore quite restiicGve. Although less so, students' marks are also somewhat restricted. The system o f grades in the schools produces a 121 seven or eight point system depending on whether or not students w ith lEPs are included. I f possible raw data should be used. There was a six month tim e delay between the FSAs and the DMATs. This is a lim itahon in the study. It can mean that a lo t o f learning and/or forgetting has taken place. It is expected that as the tim e between tests increases the correlations w ill decrease. Two o f the ^xograms that were used in this analysis, Ascal, 2 parameter and 3 parameter models, can be used on dichotomous test items only. Because both these program present an IR T analysis, the overall IR T analysis was lim ited. As IR T programs are developed to process m ultipoint items, the range o f analysis option w ill be expanded. Im plications fo r Future research There are a variety o f assessment needs. Mathematics achievement is but one. I also see a need fo r good diagnostic tests. 
It should be a mathematics test that included a large number o f items, cross referenced to speciGc learning outcomes and that display maximal discnm ination characteristics fo r low a b ility students. It could be o f great assistance to teachers and administrators. The data from mathematics achievement tests can be used to measure the relative achievements o f d ifk re n t group o f students. This data could assist in the analysis o f factors that affect achievement in mathematics. Im plications fo r Future Practice There needs to be a decision as to the purpose o f the test in order that one can on a personal level decide on the va lid ity o f the tesL I f the usefulness and therein the purpose o f the test is fo r diagnosis - the identification o f students who are at risk in math then the type o f question that is needed is one d ia l relates to specific curricular objectives and shows good discriinination fo r students in the low scoring group. The overall look o f 122 the item w ould be "easy" and peAaps an overall discrim ination that is low but one where the greatest level o f discrim ination is achieved 6>r students in say the lowest twenty-6ve percent o f the population. A t the other extreme, i f the usefulness and therein the purpose o f the test is to determine which students qualifying fa r enrichment, scholarships or advancement into programs designed fa r the most capable math students then the type o f question that is needed is one that shows good discrhnination among the high scoring group. This item would probably be "hard" and may even have low discrim ination overall but must have a great level o f discrim ination fo r students in the top twenty-hve percent o f the population. The original purpose o f the DM ATs was to give an overall read o f math achievement in the d istrict and to that end they seem to be successful. The best indicator that we currently have o f math achievement is student grades. The correlaticm between the DM ATs and dnal grades is quite strong. The big issue here is that there are a variety o f expectations that practitioners have o f a test and the measure o f va lid ity is in part going to be a measure o f how successhdly the instrument meets the various expectations that are set on it. 123 Reference L W A llen, M . J. & Yen, W . M . (1979). Introduction to measurement theory. Belmont, CA: Wadsworth Inc. Assessment Systems Corporation. (1989). User's manual & r ascal: 2- and 3narameter IR T calibration program. St. Paul, Minnesota: Assessment Systems Corporation. Assessment Systems Corporation. (1993). User's manual fo r the iteman conventional item analysis prnpram. SL Paul, Minnesota: Assessment Systems Corporation. B ritish Columbia Foundation S kills Assessment H ighlights 2000 (n.d ). In& nnation about the FSA 2000 and results fo r reading comprehension, w ritin g and numeracy. Retrieved from Carpenter, J. & Gorg, S. (Eds) (2000). Principles and standards fo r school mathematics. Reston, V A : The N ational Council o f Teachers o f Mathematics, Inc. Crocker, L. M . & A lgina, J. (1986). Introduction to classical and modem test theory. Orlando, Florida: H olt, Rinehart and W inston, Inc. The Crown in R ight o fth e Governments o f Manitoba, Saskatchewan, B ritish Columbia, Yukon Territory, Northwest Territory and Alberta. (1995). The common curriculum hamework fa r K-12 mathematics: Western Canadian protocol fo r collaboration in basic education. Cunningham, G. K. (1998). 
Assessment in the classroom: Constructing and interpreting texts. Bristol, PA: The Falmer Press.

Feldt, L. S. & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement: Third edition. New York: Macmillan.

Gipps, C. & Murphy, P. (1994). A fair test?: Assessment, achievement and equity. Philadelphia: Open University Press.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement: Third edition. New York: Macmillan.

Hambleton, R. K., Swaminathan, H. & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications, Inc.

Henrysson, S. (1971). Gathering, analyzing, and using data on test items. In R. L. Thorndike (Ed.), Educational measurement: Second edition. Washington, DC: American Council on Education.

Hurlburt, R. (1998). Comprehending behavioral statistics. Toronto, ON: Brooks/Cole Publishing Company.

Kilpatrick, J., Swafford, J. and Findell, B. (2003). Adding it up: Helping children learn mathematics. Washington, DC: National Academy Press.

Kober, N. (1991). What we know about mathematics teaching and learning: EDTALK. Washington, DC: Council for Educational Development and Research. (ERIC Document Reproduction Service No. ED 343 793)

Linacre, J. M. and Wright, B. D. (1996). A user's guide to Bigsteps: Rasch-model computer program. Chicago, IL: Mesa Press.

Lyman, H. B. (1998). Test scores and what they mean: Sixth edition. Needham Heights, MA: A Viacom Company.

Marshall, M. A. et al. (1997). The 1995 British Columbia assessment of mathematics and science: Technical report. Victoria, BC: Queen's Printer for British Columbia.

McConaghy, T. (1998). Canada's participation in TIMSS. Phi Delta Kappan, 79(10), 793, 800.

McMillan, J. H. & Schumacher, S. (1997). Research in education: A conceptual introduction. New York: Addison-Wesley Educational Publishers Inc.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement: Third edition. New York: Macmillan.

Mullis, I. V. S. et al. (1997). Mathematics achievement in the primary school years: IEA's third international mathematics and science study (TIMSS). Chestnut Hill, MA: TIMSS International Study Center. (ERIC Document Reproduction Service No. ED 410 120)

Province of British Columbia, Ministry of Education, Curriculum Branch. (1995). Mathematics K to 7: Integrated resource package. Victoria, BC: Queen's Printer for British Columbia.

Ramsay, J. O. (2000). TestGraf: A program for the graphical analysis of multiple choice test and questionnaire data. Montreal: McGill University.

Sax, G. (1997). Principles of educational and psychological measurement and evaluation. Scarborough, ON: Wadsworth Publishing Company.

Stanley, J. C. (1971). Reliability. In R. L. Thorndike (Ed.), Educational measurement: Second edition. Washington, DC: American Council on Education.

Webb, N. & Romberg, T. A. (1992). Implications of the NCTM standards for mathematics assessment. In T. A. Romberg (Ed.), Mathematics assessment and evaluation: Imperatives for mathematics educators. Albany, NY: State University of New York Press. (ERIC Document Reproduction Service No.
ED 377 073 ) 126 Appendix B Table 20 DMAT Code School Code 010 020 030 037.5 040 050 060 311 313 314 312 316 317 324 070 075 080 090 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 285 290 300 310 320 330 340 350 360 318 319 321 322 323 327 328 329 331 332 333 334 336 337 338 339 340 342 343 344 345 348 350 347 351 353 354 355 358 359 360 364 Grade 5 Cases School. Name AUSTIN ROAD BEAYERLY BLACKBURN ELEMENTARY BEAR LAKE BUCKHORN CARNEY HILL FORT GEORGE CENTRAL COLLEGE HEIGHTS ELEMENTARY DOME CREEK DUNSTER EDGEWOOD FOOTHILLS GISCOME GLADSTONE GLENVIEW HALDIROAD HART HIGHLANDS HART HIGHWAY HARWIN HERITAGE HIGHGLEN HIGHLAND mxoN KING GEORGE V LAKEWOOD ELEMENTARY MACKENZIE ELEMENTARY MALASPINA MCBRIDE CENTENNIAL MEADOW MORFEE MCLEOD LAKE MOUNTAIN VIEW NECHAKO NORTH NUKKOLAKE PEDENHILL PINEVIEW PINEWOOD QUINSON RON BRENT 130 54 36 56 5 21 30 15 40 Grade 7 Cases 11 54 78 98 5 48 55 26 45 85 42 42 27 25 1 3 5 35 6 20 31 11 58 37 14 27 33 19 7 15 31 20 40 25 29 28 6 23 32 21 30 30 20 31 22 School Total 1 4 13 23 4 27 5 17 48 28 23 4 8 44 34 27 23 15 36 1 17 29 30 23 18 24 22 7 18 58 10 47 31 16 58 37 31 75 61 42 11 23 75 54 67 48 44 64 7 40 61 21 60 53 38 55 44 370 380 390 400 410 420 430 440 450 460 470 480 490 365 366 368 326 367 369 370 374 375 376 378 379 305 05 07 280 281 282 283 284 285 286 287 288 289 290 291 295 361 373 SALMON VALLEY SEYMOUR SHADY VALLEY FORT GEORGE SOUTH SOUTHRIDGE SPRINGWOOD SPRUCELAND VALEMOUNT ELEMENTARY VAN BIEN VANWAY WESTWOOD WILDWOOD HEATHER PARK CONTINUING EDUCATION CORRESPONDENCE CENTRE BLACKBURN JUNIOR COLLEGE HEIGHTS SECONDARY DUCHESS PARK KELLY ROAD LAKEWOOD JUNIOR SECONDARY MACKENZIE SECONDARY MCBRIDE SECONDARY JOHN MCINNIS PRINCE GEORGE SECONDARY D f . TODD VALEMOUNT SECONDARY CONTINUING EDUCATION YOUTH CONTAINMENT CENTRE RED ROCK UPPER FRASER 4 17 11 17 48 30 42 29 30 37 44 21 TOTAL KNOWN CASES MISCODED DATA TOTAL CASES 1297 12 1309 131 10 15 38 32 17 23 24 37 240 1175 17 1192 4 27 11 32 86 30 74 46 53 61 81 21 240 2472 2501 Table 21 DoAz fin a l aW fl& 4 KeW ly The numbers o f students 6>r which a ll tarm and final marks were recorded, ju s t Gnal marks were recorded, total number o f studait w ith at least fin a l marks, no final marks were recorded and students (grade 5 only) whose FSA results were recorded is as k llo w s: A ll term and Snal marks Final marks only A t least fin a l marks W ithout data W ith FSA results Grade 5 number 1009 103 1112 185 614 Total Number o f Students 1297 percentage 77.8% 7.9% 85.7% 14.3% 47.3% Grade 7 number 955 42 998 177 1175 132 percentage 81.4% 3.6% 84.9% 15% Appendix c # # # # # # % e test booklets fo r the DM AT grade 5 and grade 7 included: 2000 D istrict Assessment o f Mathematics —Grade 5, Part 1 - M ultiple Choice; ZOOODistrictAssesanentofMathem alics —G rades, Part 2 -S h o rt Answer; M ath Assessment - Grade 5 Answer Key; 2000 D istrict Assessment o f Mathematics - Grade 7, Part 1 - M ultiple Choice; 2000D istrictAssessm entofM athem atics —Grade 7, Part 2 —Short Answer; and M ath Assessment - Grade 7 Answer Key. They comprised pages 133 to 198 o f this thesis. They have been excluded horn the published thesis to sa&guaid the integrity o f the item database. 
Appendix D

Letter to the principals at elementary schools in the school district:

February 27, 2002

(Principal's Name)
(School Name)

Dear (Principal)

I am currently working as a classroom teacher at Beaverly Elementary School, and am working on my Master's degree in Curriculum and Instruction through the Education graduate program at UNBC. This letter outlines my research project and is a request for your assistance in its completion. The program has been approved by Bonnie Chappell, Director of Instruction for School District #57, who will be advised of any and all particulars of this project throughout its duration. The results of this study are of interest to various teachers and administrative officers in the school district.

Background

In the 1996-97 school year, grade 5 and 7 students in selected schools in the district wrote locally developed Math Achievement tests. The tests were administered in the fall and were designed to test student achievement in the grade 4 and 6 curriculum respectively. The tests were originally developed and administered as part of the district's commitment to ongoing student assessment. Some minor changes to the tests have occurred, but most items have remained unchanged and the tests have now been administered several times since they were originally developed. The most recent use of these tests was in October, 2000, when they were written by all grade 5 and grade 7 students in the district.

Current Study

To complete my master's degree thesis, I have proposed a study of the reliability and the validity of these tests. The reliability of each test is the easier aspect to establish because student scores, taken directly from the Math Achievement tests, can be used. The reliabilities can be statistically calculated using these scores. The validity of each test is the more difficult aspect to establish and will require additional information. This is where I will need your help. My study will focus on two measures of validity. The first - content-related validity - will require an item by item analysis of the tests to determine the degree to which each test item matches the curriculum. The second - criterion-related validity - will require a comparison of each student's Math Achievement test score with his/her scores in comparable curriculum areas and tests. The specific information I need is outlined in the next section.

Method

To carry out this study, I'll be analyzing students' scores from the Grade 5 and Grade 7 Math Achievement tests written in 2000. This information is available at Central Office through School District files. It will provide information about the test items and will be used to measure the reliability of each of the tests. The validity of the tests will be determined by comparing each student's year end and/or FSA scores with the scores they received on the Math Achievement tests. From each school, I will need:

• for students who wrote the Grade 5 Math Achievement test - (1) math letter grades for the 1999-2000 school year for each student who wrote this test (this would include math letter grades for each term plus their final math grade) and (2) results on the FSA-Numeracy test he/she wrote in May 2000 (Not Yet Within Expectations (1), Meets Expectations (3), Exceeds Expectations (5), or halfway between these, either (2) or (4));

• for Grade 7 students - math letter grades for the 1999-2000 school year for each student who wrote this test (this would include math letter grades for each term plus their final math grade);
• please note students who are on modified or adapted math programs.

Ethics

This project will follow all UNBC research procedures and guidelines to safeguard and maintain information confidentiality. Student names will be coded once data is collected and will be removed from all research documentation for further phases of the study. As data collection only involves examining existing school records, there will be no direct contact with the students and they will not be personally affected or identified by the study in any way. This research proposal has been presented to and approved by the UNBC Ethics Committee. Dr. Peter MacMillan from the UNBC Education Department will be supervising the project. Research results will be shared with university and school district personnel.

Summary

To facilitate the collection of this information, I've attached a list of the students who wrote the Math Achievement tests. In the case of students who wrote the grade 7 Math Achievement test, their files may have been forwarded to a junior secondary school.

Letter to the principals at secondary schools in the school district:

I am currently working as a classroom teacher at Beaverly Elementary School, and am working on my Master's degree in Curriculum and Instruction through the Education graduate program at UNBC. This letter outlines my research project and is a request for your assistance in its completion. The program has been approved by Bonnie Chappell, Director of Instruction for School District #57, who will be advised of any and all particulars of this project throughout its duration. The results of this study are of interest to various teachers and administrative officers in the school district.

In the 1996-97 school year, grade 5 and 7 students in selected schools in the district wrote locally developed Math Achievement tests. The tests were administered in the fall and were designed to test student achievement in the grade 4 and 6 curriculum respectively. The tests were originally developed and administered as part of the district's commitment to ongoing student assessment. Some minor changes to the tests have occurred, but most items have remained unchanged and the tests have now been administered several times since they were originally developed. The most recent use of these tests was in October, 2000, when they were written by all grade 5 and grade 7 students in the district.

Current Study

To complete my master's degree thesis, I have proposed a study of the reliability and the validity of these tests. The reliability of each test is the easier aspect to establish because student scores, taken directly from the Math Achievement tests, can be used. The reliabilities can be statistically calculated using these scores. The validity of each test is the more difficult aspect to establish and will require additional information. This is where I will need your help. My study will focus on two measures of validity. The first - content-related validity - will require an item by item analysis of the tests to determine the degree to which each test item matches the curriculum. The second - criterion-related validity - will require a comparison of each student's Math Achievement test score with his/her scores in comparable curriculum areas and tests. The specific information I need is outlined in the next section.

Method

To carry out this study, I'll be analyzing students' scores from the Grade 5 and Grade 7 Math Achievement tests written in 2000. This information is available at Central Office through School District files.
It will provide information about the test items and will be used to measure Ore reliability of each of the tests. The validity of the tests will be determined by comparing each student's year end and/or FSA scores with the scores tfrey received on the Math Achievement tests. From each school, I will need: » for students who wrote the Grade 5 Math Achievement test - (1) math letter grades for the 1999-2000 school year for each student who wrote this test (this would include math letter grades for each temr plus their final math grade) and (2) results on the FSA-Numeracy test he/she wrote in May 2000 (Not Yet Within Expectations (1), Meets Expectations (3), Exceeds Expectations (5) or halfway between these either (2) or (4)). * for Grade 7 students -math letter grades for the 1999-2000 school year for each student who wrote this test (this would include math letter grades for each term plus their final math grade). » please note students who are on modified or adapted math programs Ethics This project will follow all UNBC research procedures and guidelines to safeguard and maintain information confidentiality. Student names will be coded once data is collected and will be removed from all research documentation for further phases of the study. As data collection only involves examining existing school records, there vwll be no direct contact with the students and they will not be personally affected or identified by the study in any way. This research proposal has been presented to and approved by the UNBC Ethics Committee. Dr. Peter MacMillan from the UNBC Education Department will be supervising the project Research results will be shared with university and school district personnel. Summary Most students who wrote the grade 7 Math Achievement test will be in grade 8 this year and their files will have been forwarded to a secondary school. To complete this study, I will need to know, for each, the final and term grades they received in grade 6 (ie. 19992000). I will also need to know the school from which they came so that I can match 203 j^{)pendîx E Table 22 Orffer q/^D^cW fy w?(/ Dzfcn/wMoffOM, Grodle J Iteman 15 21 1 26 30 19 13 11 14 25 16 18 8 28 22 29 3 27 17 9 4 5 6 24 7 12 10 20 23 2 D ifB cully Bigsteps 12PAscal 3PAscal 15 1 15 15 1 21 1 1 21 21 26 30 30 14 30 26 31 19 19 13 13 13 26 11 11 8 25 25 19 14 14 11 16 16 18 18 18 25 8 8 16 28 28 29 22 22 29 3 29 28 3 3 27 27 17 17 17 22 27 9 4 9 32 4 5 5 4 5 9 24 7 6 24 6 24 7 7 6 23 12 23 10 10 10 12 23 12 20 20 20 2 2 2 35 36 34 33 Order 1st 2nd 3rd 4th 5th 6*^ 7th 86 9tb lOth 116 126 136 146 156 166 176 186 196 20 6 21st 22nd 23rd 25 6 266 27 6 28 6 29 6 306 31st 32nd 33rd 346 356 36 6 205 Iteman 28 25 26 16 29 3 27 18 19 17 14 11 8 13 5 21 9 4 30 1 22 7 24 15 6 23 3Ô 12 20 2 D iscrim ination Bigsteps 2PAscal 3PAscal 34 33 28 14 28 25 25 28 17 17 30 31 3 27 8 27 3 29 35 29 23 3 26 26 15 5 5 25 16 29 26 18 2 18 36 32 14 16 5 4 4 17 11 24 13 19 18 4 8 10 27 9 19 21 13 9 11 7 21 16 24 19 20 30 23 14 7 11 2 23 12 7 6 8 24 10 13 10 1 6 9 20 21 12 30 6 22 1 12 2 22 1 15 15 22 This table summarizes the order o f the difB culty and discrim ination determined by the d ifk re n t programs fo r the grade 5 DM AT. The items w ith the greatest difB culty and the items w ith the greatest discrirnination ^tpear at the top o f the table. 
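The orderings in Tables 22 and 23 were produced by Iteman, TestGraf, Bigsteps and Ascal. As a point of reference, the following is a minimal sketch of how a classical (Iteman-style) ordering by difficulty and by corrected point-biserial discrimination can be computed; the response matrix and every name in it are hypothetical assumptions for illustration, not the DMAT data or the programs' actual output.

```python
# Illustrative sketch only: classical difficulty and discrimination orderings
# computed on a hypothetical 0/1 response matrix (examinees x items).
import numpy as np

rng = np.random.default_rng(1)
responses = (rng.random((400, 25)) < rng.uniform(0.3, 0.9, size=25)).astype(int)

total = responses.sum(axis=1)
p = responses.mean(axis=0)                       # classical difficulty (proportion correct)

def point_biserial(item, total):
    # Corrected item-total correlation: remove the item from the total first.
    rest = total - item
    return np.corrcoef(item, rest)[0, 1]

r_pb = np.array([point_biserial(responses[:, i], total)
                 for i in range(responses.shape[1])])

# Item numbers ordered from most to least difficult, and from most to least discriminating.
difficulty_order = np.argsort(p) + 1              # low proportion correct = hard item
discrimination_order = np.argsort(-r_pb) + 1
print("hardest first:", difficulty_order)
print("most discriminating first:", discrimination_order)
```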
Table 23

Comparison of Order of Difficulty and Discrimination, Grade 7

Difficulty
Iteman: 9 11 12 16 25 8 20 10 24 14 2 18 19 21 6 1 22 3 23 4 13 15 17 7 5
Bigsteps, 2PAscal, 3PAscal (as printed): 9 16 16 11 9 9 12 11 11 16 12 25 25 12 25 8 8 8 10 20 20 20 10 10 24 24 24 14 2 14 31 28 2 14 18 30 18 2 18 32 19 19 21 21 22 19 6 6 6 1 1 22 1 22 21 34 3 3 3 23 17 23 4 4 17 13 7 13 23 4 15 7 17 13 7 15 15 35 5 1 5 5
Order: 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th 13th 14th 15th 16th 17th 18th 19th 20th 21st 22nd 23rd 24th 25th 26th 27th 28th 29th 30th

Discrimination
Iteman: 18 10 14 2 22 19 20 6 24 1 21 8 3 12 17 25 9 11 7 23 4
Bigsteps, 2PAscal, 3PAscal (as printed): 29 10 18 11 33 31 18 22 16 26 32 34 22 10 2 35 27 28 30 14 7 10 2 19 9 19 17 20 1 14 18 20 2 22 6 7 20 3 12 6 7 14 1 9 5 19 17 3 25 25 11 6 24 9 17 11 12 1 4 24 3 12 4 5 8 8 8 21 24 21 21 23 25 33 29 27 26 1 1 1 1 1 3^d 1 33rd 1 34th 1 356 1 1 1 1 16 5 15 13 5 15 16 13 1 1 1 1 23 15 16 13 4 23 15 13

This table summarizes the order of difficulty and discrimination determined by the different programs for the grade 7 DMAT. The items with the greatest difficulty and the items with the greatest discrimination appear at the top of the chart.

Appendix F

Coefficient Alpha can be used for dichotomous test items (in which case it is the same as Kuder-Richardson formula 20) and for multi-point test items. The formula (Sax, p. 282) is:

    α = [n / (n − 1)] × [1 − (Σ s_i² / s_x²)]

where
    n = number of items on the test
    s_x² = variance of scores on the test
    Σ s_i² = sum of the variances on each item

that is, each variance is computed as

    s² = [Σ f X² − (Σ f X)² / N] / N

where
    N = number of examinees
    f = frequency of each score X

Spearman-Brown formula:

    r_s = 2 r_h / (r_h + 1)

where
    r_s = split-half reliability
    r_h = correlation between the two halves of the test

Chi-square goodness of fit (Hurlburt, 1998):

    χ² = Σ (O − E)² / E

where
    O = the observed value
    E = the expected value

(A short computational sketch of these three formulas appears after the final table of specifications.)

Appendix G

Table 24

Table of Specifications for the Grade 5 DMAT Using the Mathematics IRP

Multiple Choice - Part 1

Question: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

IRP Strand: Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Number, Shape & Space, Shape & Space, Shape & Space, Shape & Space, Shape & Space, Statistics & Probability, Statistics & Probability, Statistics & Probability, Statistics & Probability, Patterns & Relations, Patterns & Relations, Patterns & Relations, Patterns & Relations, Patterns & Relations

Patterns Average HO

Degree of Difficulty: Average Easy Average Average Average Easy Easy Difficulty Average Easy Average Easy Average Difficulty Difficulty Difficulty Average Difficulty Average Difficulty

Ministry of Education*: K K A A K K A HO K K A K K A K A K K K HO

5 Easy K / Data Analysis 1 Average K / Data Analysis 3 Easy K / Data Analysis 2, 3 Average K / Chance & Uncertainty 1, 2 Average K / Patterns 2 Difficulty HO / Patterns 2 Average A / Patterns 2 Average HO / Patterns 1, 2 Difficulty HO

Learning Outcome: 6, 7 4 2 6 3 3

Substrand: Number Concepts, Number Concepts, Number Concepts, Number Concepts, Number Concepts, Number Concepts, Number Operations, Number Concepts, Number Operations, Number Operations, Number Operations, Number Concepts, Number Concepts, Number Operations, Number Concepts, Number Operations, Measurement, Measurement, Measurement, Measurement, 3D Objects & 2D Shapes

1 5 2, 3, 4 4 3 10 10 8 9, 10 8 2 3 9 5, 7

Short Answer - Part 2

Question: 31a 31b 32a 32b 33 34a 34b 34c 35 36a

IRP Strand: Shape & Space, Number, Shape & Space, Shape & Space, Statistics & Probability, Shape & Space, Shape & Space, Shape & Space, Shape & Space, Shape & Space, Shape & Space

Substrand
Measurement Number Operations Measurement Measurement beaming Outcome 14 1,2 14 14 Data Analysis Measurement Measurement Measurement Transformations 3D0tyectS&2D Shapes 3D Objects & 2D Shapes Degree of Difficulty Ministry of Education* A A K K 2,3 6 6 4 1.2 A HO HO HO A 2 K 2 A K is Ministry of Education Knowledge (Bloom's Knrwledge) A is Ministry of Education Application - Bloom's Comprehension and Application HO is Ministry of Education Higher Order Reasoning - Bloom's Analyse, Synthesis and Evaluation 210 Table 25 TaWe fAe Grade 5 DM4 7^ JCüfiMgMd/KMadcf frw e ffe f MuRipie Choice - Part 1 2 3 WCC Strand Number Numt)er Number Sutatrand Numt)er Concepts Number Concepts Numtier Operations 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Number Number Number Number Number Number Number Number Number Number Number Number Number Shape & Space Shape & Space Shape & Space Shape & Space Shape & Space Statisbcs & Prot)ability Statistics & Prot)abllity Statmtics & Pnot)abillty Statistics & Prob^lity Patterns & Relations Patbms & Relations Patterns & Relations Patterns & Relations Patterns & Relations Number Operations Nunnber Concepts Number Concepts Number Operations Numt)er Concepts Number OperaUons Numt)er Operations Number Operations Numt)er Concepts Number Concepts Number Operations Number Concepts Number Operations Measurement Measurement Measurement Measur^nent 30 Otyects & 2 0 Shapes Oata Analysis Oata Analysis Oata Analysis Chance & Uncertainty Patterns Patterns Patterns Patterns Patterns ion 1 Learning Outcome 6 ,7 3 12 12. 13, 14 5 5 12 8 15 15 14 11 11 19 10 19 2 3 9 5 21 1 3 2 5 2 2 2 1 2 Mathematics Processes** C .V V C, PS, R, V Learning Outcome 14 12 14 14 3 6 6 Mathematical Processes** C C, PS, R, V C C C, PS. R. V C C C, PS, V E E E C, PS, V CN, R, V CN, R, V C, PS, V c,v C, PS. V E ,R ,V C PS E. PS, R E,V PS c,v PS,V C ,R C, PS, R C, PS, R C, PS. 
R C, PS, R C, PS, R Short Answer - Part 2 Question 31a 31b 32a 32b 33 34a 34b WCC Strand Shape & Space Number Shape & Space Shape&Space StatBtics & Prot)at)ility Shape & Space Shape & Space Substrand Measurement Numt)er Operations Measurement Measurement Oata Analysis Measurement Measurement 211 c,v PS, R PS.R 34c 35 36a 36b Shape & Space Shape & Space Shape&Space Shape&Space Measurement Transformations 3D Objects & 2D Shapes 3D Objects & 2D Shapes 4 24 17 18 E, PS C, CN E, PS, V CN, V ** Mathematics Processes: Taken frwn Western Canadian Protocol for Collaboration in Basic Education p.4 C - Communication CN-Connections E - Estimation & Mental Mathematics PS - Problem Solving R - Reasoning T - Technology V - Visualization 212 Table 26 Grodle 7 D&MT %mg fAe MafAemaficf ZRf Multiple Choice - Part 1 lesdon IRP Strand 1 2 9 10 11 12 Shape & Space Number Patterns & Relations Patterns & Relations Statistics & Probability Pattems& Relations Shape & Space Shape & Space Statistics & Probat)ility Number Number Number 13 14 15 16 17 18 19 20 21 22 Shape & Space Number Number Number Number Number Numba^ Number Number Number 23 Shape & Space Statistics & ProbabiBty StatisdcsA Probability 3 4 5 6 7 8 24 25 Sut)strand 3D Objects & 2D Shapes Numt)er Operations Learning Outcome 1 Difficulty Ministry of Education* Easy K K Easy A 1 Variables & Equations 2 Patterns 5 Data Analysis 7 Easy A Patterns Measurement Measurement 5 2 3 Average Average Easy HO K K Data Analysis Number Operations Number Operations Number Operations 3D Objects & 2D Shapes Number Crmcepts Numt)erCorx%ptB Number Concepts Number Concepts Numtier Concepts Numt)er Concepts Number Concepts Number Concepts Numlaer Operations 3D Objects & 2D Shapes 8 1 HO Easy A Difficulty A 1 Average A 1 Average A 2 Difficulty 3 Difficulty 12 Easy 3 Average 2 Average 11 Easy 6 Average 3 1 Easy 1 Average K A K A K K K A K HO 10 Easy K Charx» & Uncertainty 3 Average A Data Analysis 6 Average K Learning Outcome DIfRculty Ministry of Education* Shcxt Ansvwer - Part 2 uestlon 26 27a 27b IRP Strand Patterns & Relations Stabstics& Prot>at)ility StatisUcs & Substrand Patterns 6 ,8 6 ,8 Data Analysis Data Analysis 213 28 29a 29b 29c 30 31 32 33 34a 34b 35a 35b Probability Number Number Numt)er Number Shape&Space Number Shape&Space Shape & Space Shape & Space Shape&Space Shape & Space Shape & Space Numt)er Concepts Numtrer Operations Number Operations I Number Operations | 1 Measurement Number Operations Transformations 3D Objects & 2D Shapes Transformations 3 1 1 Measurement Measurement 7 7 3 1 2 2 K is Ministry of Education Knowledge (Bloom's Knowledge) A is Ministry of Education AppRcaUon (Bloom's Comprehension and Application) HO Is Ministry of Education Higher Order (Bloom's Analysis, Synthesis and Evaluadon) 214 Table 27 7W»k fAg Grodlp 7 DM4 71 ZWïMg AWAemaficf Multiple Choice - Part 1 Quesîkm 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 WCC Strand Shape&Space Number Patterns & Relations Patterns & Relations Statktks&ProbabKty Patterns & Relations Shape&Space Shape & Space Statistics &Prob8bilNy Numt)er Number Number Shape&Space Number Number Number Number Number Number Number Number Number Shape&Space Statistics & Piobabilily Statistics & Probability Sutrstrand 3D Objects & 2D Shapes Number Operations Variables & Equations Patterns Data Analysis Patterns Meœurement Measurement Data Analysis Number Operations Number Operations Number Operations 3D Objects & 2D Shapes Numtw Concepts Number Concepts Number Conceits Number Concepts Number 
Concepts Numt)er Concepts Number Concepts NunAer Concepts Numtier Operations Measurement Chance & Uncertmnty Data Analysis Learning Outcome 14 12 6 1 7 1 1 3 8 12 13 13 15 4 10 3 8 9 5 4 1 12 12 11 6 Math Processes C PS, R. T PS,R C, R .V C, E. PS. R. V C, R ,V CN, PS. R CN, PS, R C,CN PS, R, T E, PS, R E, PS, R V C, PS, R C. CN, R, V R E C, R .V R C, PS, R. V C^CN PS, R, T E CN, R C. T .V Short Answer - Part 2 Question 26 27a 27b 28 29a 29b 29c 30 31 32 33 34a 34b WCC Strand Statistics &Prot)ab*ity Statisbcs & ProbabHlty Statistics & Probability Number Number Number Number Shape&Space Number Shape & Space Shape&Space Shape & Space Shape&Space Subsband Data Analysis Data Analysis Data Analysis Numt)er Concepts Number Operations Number Operations Number Concepts Measurement Number Operations Transformations 3D Objects & 2D Shapes Transformations Transformations 215 Learning Outcome 3 3 6 4 12 12 9 3 12 19 17 20 20 Math Processes"* C, PS, T C, PS. T C .T .V C, PS, R, V PS. R, T PS. R ,T C, R, V CN, PS. R PS, R .T C, T ,V PS, T .V PS.V PS,V 3Sa 35b Shape&Space Shape & Space Measurement Measurement 12 E 12 E Mathematical Processes: Taken from Westem Canadian Protocol for Collaboration in Basic Education p.4 C - Communication C N - Connections E - Estimation & Mental Mathematics PS - Prdalem Solving R - Reasoning T - Technology V - Visualization 216
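The formulas given in Appendix F can also be expressed compactly in code. The following is a minimal sketch, assuming a simple examinees-by-items array of hypothetical 0/1 scores; the function names and the data are illustrative assumptions only and do not reproduce the output of the programs (SPSS, Iteman, TestGraf, Bigsteps, Ascal) used in this study.

```python
# A minimal computational sketch of the Appendix F formulas: coefficient alpha,
# the Spearman-Brown correction, and the chi-square goodness-of-fit statistic.
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's alpha; equals KR-20 when every item is scored 0/1.
    scores: examinees x items array."""
    n = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=0).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=0)     # variance of total test scores
    return (n / (n - 1)) * (1 - item_vars / total_var)

def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length (split-half) reliability."""
    return 2 * r_half / (r_half + 1)

def chi_square(observed, expected):
    """Chi-square goodness of fit: sum of (O - E)^2 / E over categories."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return (((observed - expected) ** 2) / expected).sum()

# Hypothetical 0/1 item scores for 200 examinees on 20 items.
rng = np.random.default_rng(2)
scores = (rng.random((200, 20)) < 0.6).astype(int)

print("coefficient alpha:", round(coefficient_alpha(scores), 3))
print("split-half reliability for r_half = 0.70:", round(spearman_brown(0.70), 3))
print("chi-square for O=[18, 22, 20], E=[20, 20, 20]:", round(chi_square([18, 22, 20], [20, 20, 20]), 3))
```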