NOTE TO USERS

Page(s) missing in number only; text follows. Page(s) were
scanned as received.

134 -198

This reproduction is the best copy available.

UMI

SCHOOL DISTRICT 57 M ATHEM ATICS ACHIEVEM ENT TESTS:
THEIR R E LIA B ILIT Y AND V A LID IT Y
by
Robert
B.Sc., U niversity o f Calgary, 1972
DipLEd., U niveisity o f Alberta, 1973
B.Th., U niversity o f Ottawa and Saint Paul U niversity, 1978

THESIS SUBMITTED IN PAR TIAL FULFILM ENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF EDUCATION
in
CURRICULUM AND INSTRUCTION

THE U NIVERSITY OF NORTHERN BRITISH CO LUM BIA
A pnl2004
© Robert Bagnall, 2004

1

^

1

Library and
Archives Canada

Bibliothèque et
Archives Canada

Published Heritage
Branch

Direction du
Patrimoine de l'édition

395 W ellington Street
Ottawa ON K 1A 0N 4
Canada

395, rue W ellington
Ottawa ON K 1A 0N 4
Canada
Your file Votre référence
ISBN: 0-494-04626-0
Our file Notre référence
ISBN: 0-494-04626-0

NOTICE:
The author has granted a non­
exclusive license allowing Library
and Archives Canada to reproduce,
publish, archive, preserve, conserve,
communicate to the public by
telecommunication or on the Internet,
loan, distribute and sell theses
worldwide, for commercial or non­
commercial purposes, in microform,
paper, electronic and/or any other
formats.

AVIS:
L'auteur a accordé une licence non exclusive
permettant à la Bibliothèque et Archives
Canada de reproduire, publier, archiver,
sauvegarder, conserver, transmettre au public
par télécommunication ou par l'Internet, prêter,
distribuer et vendre des thèses partout dans
le monde, à des fins commerciales ou autres,
sur support microforme, papier, électronique
et/ou autres formats.

The author retains copyright
ownership and moral rights in
this thesis. Neither the thesis
nor substantial extracts from it
may be printed or otherwise
reproduced without the author's
permission.

L'auteur conserve la propriété du droit d'auteur
et des droits moraux qui protège cette thèse.
Ni la thèse ni des extraits substantiels de
celle-ci ne doivent être imprimés ou autrement
reproduits sans son autorisation.

In compliance with the Canadian
Privacy Act some supporting
forms may have been removed
from this thesis.

Conformément à la loi canadienne
sur la protection de la vie privée,
quelques formulaires secondaires
ont été enlevés de cette thèse.

While these forms may be included
in the document page count,
their removal does not represent
any loss of content from the
thesis.

Bien que ces formulaires
aient inclus dans la pagination,
il n'y aura aucun contenu manquant.

Canada

SCHOOL DISTRICT 57 M ATH ACHIEVEM ENT TESTS:
TH E IR R E LIA B ILITY AND V A LID ITY

Abstract

This study examines the re lia b ility aitd va lid ity o f locally developed math achievement
tests administered to grade 5 and grade 7 students in the Prince George School D istrict —School
D istrict 57. School D istrict 57 has collected test scores on locally developed mathematics
achievement tests 6)r samples o f students in grades 5, 7 and 9 since 1995. This in& rm ation has
been gathered, through the direction o f the school board, as an ongoing commitment to evaluate
the educational programs in the district. This study uses test data from the year 2000 and
compares math achievement test scores, student math grades fa r the school year 1999-2000 and
Foundation S kills Assessment results fo r grade 4 students in May, 2000 to establish the va lid ity
o f the d istrict mathematics achievement tests. Test items are examined using classical item
analysis, an assumption 6ee item analysis and item response logistic models.
This study provides im portant in& rm ation about the mathematics achievement tests to
the teachers and administrators o f the school district. It demonstrates that item d ifh culty and
discrim ination fo r most test items are suitable fa r use in the achievement tests. It further shows
those test items w hich w ould best be used An other assesanent purposes. Because the analysis
o f test items involved three models o f item analysis, this study provides an opportunity fo r
comparison o f the three models.
This study concludes that the tests exhibit internal consistency and that procedures used
A)r m arking m ultipoint test items were sufBcient to povide rater re lia b ility . This study further
concludes that the Tables o f Specifications, w hich show content va lid ity and comparisons o f test
scores w ith teacher produced grades, w hich show concurrent va lid ity, demonstrate the overall
va lid ity o f these mathematics achievement tests.

TABLE OF CONTENTS
Abstract

ii

Table o f Contents

üi

L is t o f Tables

vi

L ist o f Figures

v iii

CHAPTER I - INTRODUCTION
Background

1
3

A)

Tgffs

Cwrigz/Zw
ConczzrrgMf
DgZfmfAzAow and Lzmdadons q f tAg &udy
Dg^wdon q/^ZgmK
Summary

5
7
7
7
8
9
9
10
10
11
13

CHAPTER 2 -LITE R A T U R E REVIEW
M ath Achievement
R%at ff MztA v4cAzgvgmgnt?
JVadonal Cowncd q/^TgacAgrs q/"AAdkgToadcs (W C T ^
BirzdsA Co/m»Z»A% MatA W ggrafgd Rgsowgg Pactogg
O rA grfacfors
Factors A & c tin g Test Measurement
RglzaAzAfy
FaAdfty
Conftrwct Falzdzfy
Confgnt Falzdzfy
Crdgnon Rg/afgd Falzdrty
Item Analysis
AAadgAv
CZayncaZ Analysts
Pgzn Rgj(ponsg 7%gory
Agm CAaragtgrtsdc Cwwgs
Ong Paramgfgr Logtsdc Model
Two Paramgfgr Zogtsdc M odel
TArgg Paramgfgr Zogtsttc Model
Summary o f6 e Thesis Topic

14
14
14
15
16
17
19
19
22
22
24
25
26
26
26
29
30
32
33
34
35

The Problem
Agm
7%e RgsgwcA Qugstion
Rg/m&zZf(y

in

Contnbuüoa o f this Study to the Literature

35

CHAPTER 3 - METHOD
R e s^ch Design
Afeaswes
D A M TG raffeJ
DAM T Grade 7
PlSd-Pbwndadon
vLî,;ess/?%e/d
Term and Pear End G^adef
Proeedwres
Teft Cansiracdan Praeesf
DfcAaramaas and ATaZfÿaint Aem bearing Pracea^a^
Dafa Cadeedan
Data yëfWysÂr
Campider Programs Used
Peman
Tes(gr(^
P zgsf^s
vfseaZ —d fa ^w am eter madeZ
yfseaZ —ZAreeparameter madeZ
&P&S

36
36
36
38
38
39
41
41
42
42
44
45
47
48
48
53
55
56
57
57

CHAPTER 4- RESULTS
Analysis o f Test Items
AZeZAadk a/^^na^szs
Grade 5 PesaZts
CZasdeaZ^naZysZs Using Peman
vfssan^dan Free ^naZysis Using PesZgraf
PascA Pern Pespanse v4naZysis ZTsing PigsZ^s
Lagwde Pern Pespanse v4naZysis Udng Twa Parameter v4seaZ
Lagisde Pem Pespanse vinaZysis Z^ing T%ree Parameter vIscaZ
Pammary q/^Grade 5 Pem Pejpanse vfnaZysis
Grade ZPesaZts
CZassicaZ vfnaZysis ZTsing Peman
y^ssanpdan Pree /(naZysis ZTsing Testgrqf
PascA Pem Pespanse vdnaZysis ZZsi?%^Pigsteps
Zagisde Pem Pe,panse yfnaZysis Using Twa Parameter y4scaZ
Pagisdc Pem Pespanse vInaZysis ZTsing T%ree Parameter yfscaZ
Eammary a/^Gradle 7 Pem Pespanse y^naZysis
R eliability o f C riterion Measures
W er-rater PeZiaW ify
Content-Related V a lid ity
Ta6Zesa/^^$Deci/Tcadans
Criterion-Related V a lid ity
Strength a f ZAe PeZadansAps

58
58
58
58
58
62
65
68
70
72
75
75
79
83
87
89
90
95
96
101
101
105
108

IV

A dditional V a lid ity Evidence

109

CHAPTER 5 - DISCUSSION
Summary
Conclusions
ReAoôz/ity mwf Pa/idify
Lim itations
Data Co/Icction
Im plications fo r Future Research
Im plications fo r Future Practice

111
111
116
116
121
121
122
122

Reference L is t

124

Appendix A . Letters o f Approval

127

Appendix B . Data by School; Data Breakdown o f Term, Final and FSA Results

130

Appendix C. Grade 5 D M AT and Answer Key; Grade 7 D M AT and Answer Key 133
Appendix D . Letters to Principals

199

Appendix E. Summaries o f Order o f D ifG culty and D iscrim ination

205

Appendix F. Statistical Formulae

208

Appendix G. Tables o f SpeciGcation

209

List of Tables
Table 1
q/^GrWe J DA&4TTiremy [/ym g Aeman,
Afgaywef q/^D /^c wZ(y
Dzfcn/M/Moffo»

59

Table 2 J(oycA
q/^Grodie J DA64T CWng
AownaTy q/^fAe AfeaM/rg^ —fgri^ow am ï ^gi?M

66

Table 3 J&zycA vffW ygig q/^GfO& J DAdMT LWog
AfgaM/rgf q / ^ D z f c n m m o f f o n oW F ;f

67

Table 4 Zogzf/fc Agm ^gjponfg
q/^Groijk J DA64T LWog
Two oW TTygg fw am gfgr A(W gk, M eam rgf q/^D^cz/7(y,
ZXfgn/wTMffoTt ff«g& f-g« gffm g aW 2%

69

Table 5 Zfg/yw m fAg Gradle J DA^TJE% Af6f(i»gfact q / ^ m fAg
Zqgisfic Afgm j(gapowg iiw w [yfgf

73

T ^ le 6 CAzyffco/
q^Gradlg 7 DA64TZfgmj [/ym g Agmart
A fg a w g f q/"Dzj^caZfy amf DzfcnmzMafzaM

76

Table 7 j(agcA vfaafy^w q/^GraJIg 7 D M 4T CMog
fAg Afgaawgf - fg rjo w aw f Afgow

84

S^ammazy q/^

Table 8 jZaacA v4»a/yfM q/^Gradle 7 DA64T Gymg^zgyfgpf, Mzaawga^ q/"
ZX^c«/(y, Dücrzmfaafzon aW 7^;/

86

Table 9 la g ijffg Aem TZgapoMfg v4aa(yM:$ q/^Gradg 7 DA64T Gf%»g Two
aW 7%rgg faram gfgr Atadk/f q /^ c a f, ATgaawgj q/^Dzj^caZfy,
jDfATZoaaa/zan, ffagzfo-gagfjm g a»J 7^;f

88

Table 10 Agmy m fAe G^adk 7 DA64TJGcA;AifiMg fa c t q/^Fzf m fAg
ZogffAg Zfgm 7(g^iwMfg vdmafyygj:

92

Table 11 Gog^j^Cfgnf v4(pAa_^07M fAg CAzyf;ca/ vlMafyjM Gymg Agman

96

Table 12 DzfAiAafioM q/^AfarA$ By B a f e r ^ f A g SAarf ^/m ygr Tfgma: m
fAg Gradig J DA64T

97

Table 13 DigfnAafzan q/^AlarAy
fA gG ra(/g7D A t4T

99

B afgrfT^r fAg ^AaTfv^rnwgr Bgmf m

Table 14 Garrg/afzayw Bg/wggw ATarAgr.;

100

VI

Table 15
mf/K Gradie J aw / GrafAz 7 DM4
Drganizef^ 6y ÆP iSAraw6

102

Table 16 AamMary qpfAe A4afAemaficf Procef^ej m f/K Gradie 5 a»(f
Grade 7 DAt47f Grgamzed Zy 7PP 5î(raRd

104

Table 17 AWfAera a/^&wdeMff IFko ITrafe f/%e Grade 5 aad Grade 7
DM 4 Cafegarzzed 6y fAe Dafa CaZ/ecfed

106

Table 18 CarreZadow_/^r fAe Grade 5 DM4PPefweeM DA4/4PS'eoref, P1&4
Skoref and Terni and PinaZ M zrty

107

Table 19 C a rre /a d a n ;^r fAe Grade 7 DA4/4TPedeeen DM 4PPearef and
Terni andPina/A4arAÿ

108

Table 20 Da/a Ay &eAaa/

130

Table 21 Dafa PreaAdawn a/^Term, PTnaZ and T%4 P eW ff

132

Table 22 Pianmaiy a/^Grder a/^Di^eW fy and Düerimznadan, Grade 5

205

Table 23 Aannwny a/^Grder a/^D z^cidfy and Dücrimznadan, Grade 7

206

Table 24 TaA/e a/^,^ei/7cadany/âr fAe Grade 5 DA64T Gfzng fAe
MdAemadej TKP

209

Table 25 TaA/e q/^.^cÿzcadangyâr fAe Grade 5 DM4T) lü d n g
MdAemadef P raceffef

211

Table 26 TaAZe q/^.^ei^eadanyyàr fAe Grade 7 D M 4P Gdng fAe
MzfAemadci^ TRP

213

Table 27 TaAZe q /^,^e i/îe a d a n ;/à r fAe Grade 7 DM}4P TW ng
MafAenKzdegPraeeffef

215

vu

LW of Figures
Ffgw e 7. Sample oWput fo r Iteman 6>r the m ultiple choice items in Ae
grade 5 D M AT

49

Figure 2 Sample output &T Iteman fo r the short answer items on the
grade 5 D M AT

51

Ffgure 2 Sample output o f TestGraf fa r item 26 o f the grade 5 D M AT

53

Figure 4. The ICC produced by TestGraf fo r item 2 o f the grade 5 D M AT

63

Figure J. The ICC produced by TestGraf fo r item 15 o f the grade 5 D M AT

64

Figure 6. The ICC produced by TestGraf fo r item 1 o f the grade 7 DM AT

79

Figure 7. The ICC produced by TestG raf fo r item 9 o f the grade 7 D M AT

80

Figure & The ICC produced by TestGraf fo r item 11 o f the grade 7 D M AT

81

Figure 9. The ICC produced by TestGraf fo r item 16 o f the grade 7 DM AT

83

vm

CHAPTER 1 - INTRODUCTION

Assessment, in the classroom, is inextricably linked to instruction. It is used by
teachers in a m yriad o f forms to And out whether or not learning has taken place. It also
provides direction fo r future instruction. The instruments and models that teachers use to
assess student learning vary 6om highly subjective, as w ith direct observation, to more
objective measures such as unit, term or even standardised tests. It is true that the
subjectivity or o b jecdvi^ o f these measures can be questioned. B ut that is not the intent
o f this thesis. Rather, it is sufScient to recognize that assessment in its many forms
works hand in hand w ith instruction to create learning o j^ rtu n itie s fo r students.
In this thesis, I am interested in examining the two mathematics achievement tests
developed by teachers in School D istrict 57. They were designed to test student
achievement in mathematics at grades 5 and 7. M y chief interest is in determ ining to
what degree these tests are tedm ically adequate and can be ccmsidered reliable and valid
indicators o f studait achievement. And in doing so, also provide evidence o f the
technical abilities o f educators selected to construct these tests.
The School D istrict Mathematics Achievement Tests (DM ATs), have been
administered to grotqrs o f grade 5 and 7 students since they were Srst developed in 1995.
M ost years they were administered to a representative sanq)ling o f grade 5 and grade 7
students. In t k 611 o f2000, die tests were administered to a ll the grade 5 and 7 students
in the school district. This change has meant that the tests can now be subjected to a
more complete analysis. Although classical test analysis can be conducted on small or
large data sets, because the number o f students that were tested is large (1297 grade 5

students and 1175 grade 7 students) these tests can now be analyzed using a variety o f
item response models in addition to 6 e classical analyses.
In itia lly , these tests were designed to provide the data needed to examine the
overall mathematics program in the Prince Gewge school districL It was & h that this
inform ation would be needed because a new mathematics curriculum was scheduled to be
implemented beginning in 1995. This was the in itia l reason 6>r the testing program and it
has remained its prim ary Amction. However, because aU grade 5 and grade 7 students
were tested, a new element was introduced into the testing ^nogram. School
administrators and classroom teachers were provided an opportunity to examine
individual and school results and make comparisons w ith the overall achievement rates o f
students throughout the district. This provided teachers an opportunity to reflect on the
ef&ctiveness o f their instructional strategies by comparing their students' results w ith
those o f other students in the district. However, these tests are o f assistance to teachers,
administrators and students only i f they are reliable and valid indicators o f student
achievement
I w ill begin by indicating vA y I feel this study to be inqw rtant. To do this I w ill
start by giving a b rie f history o f the development and im plementation o f the mathematics
achievanent tests. I w ill fo llo w that w ith a review o f W iat some researchers consider the
im portant elements o f mathematics achievement. This w ill include reference to elements
promoted by the N ational Council o f Teachers o f Mathematics (N C TM ) and wHl also
refer to the new directions found in the mathematics curriculum in place in B ritish
Columbia (Mathematics K to 7: Integrated Resource Package (IRP), 1995). Because, in
this paper, I am concerned w ith the re lia b ility and va lid ity o f the DM ATs, I w ill examine

what other researchers consider to be the im portant aspects o f test re lia b ility and validity.
I w ill also examine item and test analysis theories and w hat researchers consider to be the
im portant elements o f them. There are several computer programs w hich I w ill be using
to assist in the amlyses o f diese. This in& rm ation w ill provide abackground fo r the
research question w hich I am proposing.

Background
The
As mentimied in the introduction, the DM ATs have over the years provided
school trustees, school d istrict personnel, school adm inistrative ofBcers and teachers
inform ation about the mathematics achievements o f students in the district. To
communicate this inform ation to a ll interested parties, members o f the d istrict
mathematics committee (a committee consisting o f elementary and sec<mdaiy teachers
and administrators w ith a particular interest in mathematics) met follow ing the
adm inistration o f the tests, reviewed the results o f the tests, reviewed the test items and
used the inform ation to draw up recommendations about the tests and make suggestions
about implementing the mathematics curriculum . These recommendations went to the
school board and to the teachers and administrators in the schools. This inform ation was
intended to assist teachers in the development o f the ir mathematics ^ g ra m s . It was also
intended to help teachers and schools id e n tic areas o f &cus fo r professional
developnM it. To be able to review the test scores and make these recommendations, the
committee members had to trust that the in& rm ation hom the test was giving an accurate
picture o f tl% students' abilities. A statistical analysis o f the o rig ioally piloted test (grade

7) was conducted to determine re lia b ility. The grade 5 test has, however, not undergone
the same statistical scrutiny. Moreover, each test has been revised since it was originally
constructed and the re lia b ility and va lid ity o f the tests has become a matter o f fa ith and
intuition.
There is, as was mentioned in 6 e last section, now a second aq>ect o f tlwse tests
w hich has intensified the need 6)r a carefid analysis. In 2000, these tests were w rittM i by
a ll grade 5 and grade 7 students. Test results were sent to each school and the teachers
and the school administrators were able to examine student scores directly. Individual
student scores could be compared to d istrict averages. This data gave teachers and
administrators an opportunity to examine the eSectiveness o f the instructional practices
used w ithin the school. New plans and/or strategies could be developed and tested. In
this way, the in& rm ation was available to be used at a school level to develop or sim ply
hne tune each teacher's maAematics program. This means that the data should provide a
clear and accurate picture o f how w ell students achieve. I f it is viewed as useful, it can
provide an o f^ rtu n ity fo r droughtful planning. In other words, the test results can be
ef& ctive only in so f ^ as they ^ v e to be reliable and valid measures o f mathematics as
prescribed.
In May o f2000, the M inistry o f Education began a series o f tests at grades 4, 7
and 10 to assess reading, w riting and numeracy. These tests, called t k Foundation S kills
Assessment (FSA), have called into question the usehilness o f the DM ATs. Does the
FSA test provide sufBcient infarm ation about mathematics to provide a clear picture o f
the mathematics achievement o f students in the district? P erh^s trustiog that the FSA
results ^ v id e sufficient inform ation is warranted. Perhaps the additional inharmation

providW by (be DM ATs is warranted. This thesis w ill hopehiUy provide inhm nation that
is relevant to this debate.
It is m y intention, in this stW y, to determine whether or w t the DM ATs provide a
reliable measure o f student achievement in mathematics fa r grades 5 and 7 students. It is
also my intention to determine the degree to which the tests can be considered valid
measures o f student mathematical achievement.
BacAgrouW to the Tertr
In 1994-95, Sclxw l D istrict 57 undertook the planning o f curriculum assessment
in a number o f curriculum areas. In mathematics, a cmnmittee o f elementary and
secondary teachers (the district mathematics committee) was asked by the D irector o f
School Services to develop an assessment m odd to test mathematics achievement levels.
The committee was asked to keep in m ind budget lim itations as the district was
experiencing funding cut backs.
W orking w ithin these constraints, the committee proposed testing students in
grades 4 ,6 and 8. Moreover, rather than test a ll students in these grades, it was fe lt that a
representative sample could be selected that w ould provide sufBcient data to assess
student levels o f a b ility fo r ^ h o f the grades. A consultant. D r. Iris M cIntyre, worked
w ith the committee to establish criteria 6>r the selection o f the schools that would make
iQ) the sample. She also provided a statistical analysis o f the p ilo t study coi&ducted on a
grorq)ofgrade 6 students in the spring o f 1995.
To avoid conAision it needs to be pointed out that the DM ATs were administered
to students in grades 5, 7 and 9, but the curriculum beit% assessed was that o f grades 4 ,6
and 8. Members o f the mathematics committee, in designing the tests, surveyed a sample

o f the teachers in the district to 6nd ont when in the year A e tests A onld be administered.
The teachers who responded wanted to complete a ll o f the cnrriculnm before the students
were assessed and they also wanted to get the results early enough to be able to use it in
planning their mathematics program. To sa tis^ both these requirements, it was decided
that the grade 4 curriculum w ould be assessed by testing grade 5 students early in the 611.
The results could then be reviewed, analyzed and recommendations could go out to the
schools by Christmas. S im ilarly, the grade 6 curriculum and the grade 8 curriculum
would be tested in grade 7 and grade 9 respectively. It was in itia lly determined that
testing w ould be conducted every second year so that the same grorq) o f students could be
tracked 6om grades 4 through 8.
Mi^th this plan in place, the committee invited interested teachers in grades 4 ,6
and 8 to review test banks and develop tests that w ould match the newly mandated
mathematics curriculum (IRP, 1995). The grade 6 test was developed and piloted hrsL
The other tests were then modelled after it and the tests were administered to grade 5 ,7
and 9 shaieots in die 611 o f 1996.
Although the tests were originally designed to be administered every second year,
the plan has beoi modiGed over tim e. Students were tested in 1996,1998,1999 and
2000. M ost years this involved small representative samples. D a6 6 r each o f these
years have been collected. As was already mentioned, in die 611 o f2000, a ll grade 5 and
grade 7 students were tested.
There was some re lia b ility evidence established 6 r the in itia l grade 7 test,
however, over the years the content o f this test has changed. The re lia b ility o f the test

needs to be re-established. The re lia b ility o f the grade 5 test has not yet been established.
Further, there is also a need to establish the ir va lid ity. This is the intent o f this paper.

The Problem
The DMATs were in itia lly developed in 1995. There have, over tim e, been m inor
revisions as test items were tried, reviewed and in some cases, rewritten. The ^ocedures
fo r administering the tests have also bear modiGed o v a tim e. Although the re lia b ility o f
the in itia l p ilo t test (grade 7 - spring, 1996) was conGrmed, the re lia b ility o f the current
form s o f the grade 5 and 7 tests has not been checked. In addition, the va lid ity o f the
tests, given the changes that have been made and the adoption o f a new mathematics
curriculum in BC needs to be assessed. In ccmsidering the re lia b ility and va lid ity o f these
tests, I w ill review the construcGon and adm inistration o f them and assess the
eGectiveness o f Aese processes.

The large sample sizes in the 2000 test adm inistration have now made it possible
to establish not only test characteristics but item statistics as w ell. Analyses o f these tests
in previous years were lim ited to a classiW approach. Analyses by a variety o f item response models can now supplement a classical item analysis. This analysis provides an
o^qxirtunity to compare these item analysis models.
77K ^se a rch
It is my intaiG on, in this study, to determine the re lia b ility and the va lid ity o f the
grade 5 and grade 7 DM ATs using data Gom the year 2000 test scores, to review the
construction and adm inistration o f the DM ATs and assess the effecGveness o f these

processes and to analyse test items using various item response models and in so doing
compare the models.

To determine the level o f re lia b ility 6>r these tests, I w ill be examining two
aspects o f re lia b ility. First, I w ill examine the degree to \^ c h the test item s give
consistent results. This is a measure o f the internal re lia b ility o f the tests. Then I w ill
examine the degree to which rates agreed when marking the same tests. This w ill
establish the re lia b ility o f the raters. The nature o f the tests, mathematics questions and
problems taken horn existing test banks, and the structure o f the tests, m ultiple choice
and short answer questions, lead me to believe that both measures o f re lia b ility wiU be
high.
The Srst section o f each test was m ultiple choice. Student responses were
recorded cm a bubbled answer sheet and scanned by a computer. Marks have been
established and recorded by the computer. The second section was short answer
questions marked by a panel o f teachers. To determine the inter-rater re lia b ility , a ll
markers were asked to mark one set o f randomly selected tests. The marks horn this set
o f tests w ill be compared and w ill be used to diow the degree to w hich die markers
agree.
I examirK the internal consistency o f the tests by examining the individual item
responses. A t issue is whether or not response patterns are as expected. Included in the
analysis is the level o f difG culty o f each item (mean score) and the degree o f
discrim ination exhibited by the item . There are tw o measures that can be used to
calculate discrimmadon. First, A cre is a point-biserial correlation o f correct response to

test scores and second, there is a discrim ination index described as the difference in the
proportion o f correct responses o f high achieving students compared to the proportion o f
correct responses o f the low achieving students. I also include in this analysis overall test
statistics such as mean, standard deviation, skew, kurtosis and standard error o f
measurement (SEM). As a fin a l measure o f intem al consistency, I examine the
calculation o f coeŒcient alpha fo r each test.
The programs I w ill be using to assist in this analysis include: ITEM AN , version
3.5, Assessment Systems Corporation, St Paul, Minnesota (1993); TestG raf developed by
J. O. Ramsay, M cGdl U niversity (2000); Bigstqrs, constructed by John M . Linacre and
Benjamin D W right and available through MESA Press (1996); and Ascal, available
through Assessment Systems Corporatitm (1989), and part o f the M icroC at Testing
System. W ith Ascal I w ill use both the two parameter and the three parameter item
response models.

Currrcwlw.
To gather evidence o f va lid ity, I w ill Erst match the content o f each test to the
curriculum in BC. In part this analysis has already been done and I w ill include a
summary o f iL Over the years various members o f the d istrict mathematics committee
have reviewed the test questions to ensure that ÛKy are included in the curriculum Arr
that grade level. As part o f these reviews, committee members have deliberated on the
level o f difG culty o f each question. The predicted degree o f difG culty was recorded by
the committee members on a three point scale - easy, average or hard. I include this
inform ation as part o f a table o f speciScations referred to later. I w ill further analyze the

questions, however, by identifying

o f the maAematical pnocesses (communication,

connection, estimation, problem solving, reasoning, technology and visualization (The
Common Cumculum Framework fo r K-12 Mathematics, W estan Canadian Protocol
Collaboration in Basic Education, p. 5)) are involved in each question. This inform ation
w ill fnovide a more complete descripticm o f not ju s t the learning outcome associated w ith
each que^on but also the predominant mathematical process associated w ith it. For each
test, diis w ill be done by drawing up a table o f q)eciûcations.
CoWWTCMt.
I w ill examine the concurrent va lid ity o f the tests. I w ill do this by comparing
each student's results on the DM ATs to the mathematics marks grade 4 or 6 teachers
assigned them during the 1999-2000 school year. A n additional measure o f va lid ity fo r
grade 4 students is the relationship between the D M AT scores and the FSA numeracy
scores 6om the FSA test they wrote in May, 2000. A high correlation between these
results would demonstrate a strong concurrent va lid ity.
DeZûMîWiow and ZzmrW row

the

This study has been lim ited to the test results from the grade 5 and 7 tests
administered in the 611,2000. Although data 6om the years 1996,1998 and 1999 were
available, they w ill not be included. The data fo r these years was lim ited to a
representative sampling o f students and schools. The data horn the year 2000 tests
included students horn a ll the schools in the school district. It was fe lt that the data horn
this one year provided a sufBciently large data set w ith w hich to analyze the re lia b ility
and the va lid ity o f the current test instruments. Including data 6om previous years would
have increased the relative uncertainty o f the results because o f the changes made to the

10

previous tests. Results 6om the grade 9 test were not used as these ksts were not
administered to students in the year 2000.
In 1998, the usefulness o f one part o f the mathematics tests was called into
question. A t that tim e, the tests at each o f the levels included a section termed
Performance Assessment. In this section a randomly selected groiq) o f studMits met w ith
an examiner one student at a tiiiK to solve a series o f problems. Each student was given
the opportuni^ to use concrete instructional aides and was asked to interact w ith the
examiner w hile solving the problems. They were encouraged to articulate the ir reasoning
w hile they worked on each problem. Examiners could prompt students using leading
questions wherever this was needed. Scores were recorded on a 4 point scoring rubric.
Follow ing the 1998 tests the members o f the d istrict mathematics committee decided to
discontinue using the perhmnance assessment part o f the test. It was & lt that the
inform ation provided by the test was not sufGciently reliable to warrant continuing w ith
it
This study compared student scores from tw o separate school years - the 19992000 school year and 611 o f the 2000-01 school year. Because movement between
schools (and school districts) has occurred, some students' scores could not be matched
and their test scores have not been included in the study o f va lid ity.

I have used the terms re lia b ility and va lid ity extensively in the preceding sections
o f this proposal and w ill consider each term in greater detail in the literature review that
fallow s. However, to help provide some understanding o f these terms, I include here
some general descriptions given by Sax (1997). For re lia b ility , he states that ''re lia b ility

11

describes Ihe extent to w hich measuremaat can be depended on to provide consistent,
unambiguous inform ation" and that reliable measurements "...re fle ct true rather than
chance aspects o f the tra it or a b ility measured" (p. 271). Fw va lid ity, he provides the
follow ing deGnition: "V a lid ity is deGned as the extent to w hich measurements are useful
in making decisions and ;woviding explanations relevant to a given purpose" (p. 304)
Each term w ill be considered in greater detail in the next sections o f this proposal.
The year end grade fo r students in grades 4 and 6 is given as a letter grade A , B,
C+, C, C-, I or F. This grade is norm ally the average o f the term grades reported fo r each
subject during the school year. The grades reported during the year are also reported
using the above noted le t^ grades. These grades norm ally correspond to the follow ing
range o f marks: F —a Gnal mark o f between 0% and 49%, I —an incomplete mark (it can
be a^usted w ith evidence o f additional achievement) o f between 0% and 49%, C- —a
mark o f between 50% and 59%, C - a mark o f between 60% and 66%, C+ - a m ark o f
between 67% and 75%, B — a mark o f between 76% and 83% and A —a mark o f between
84% and 100%. It was noted that a small number o f students were w orking on individual
education plans (lEPs). The speciGc nature o f the lEPs and the learning objectives
m cludedinthem w erenotnotedinthisstudy. Becausetherangeofm arksrepresentedby
the letter grades varied considerably, the relative order o f the marks was deemed the most
im portant Mature and fo r analysis the grades were given the A llo w in g values: F —0 ,1 1, lEP - 2, C- - 3, C - 4, C+ - 5, B - 6 and A - 7. The result 60m the DMATs fo r each
student is recorded as the number o f correct responses compared to total possible
responses. The results fo r students in grade 4 v k o wrote the M inistry o f Education FSA
test are listed on a 5 point scale: 1 - "N ot Y et W ithin Expectations", 2 - between "N ot Yet

12

W ithin Expectations" and "Meets Expectations", 3 - "Meets Expectations", 4 - between
"Meets Expectations" and "Exceeds Expectations" and 5 - "Exceeds Expectations".

Summary
The re lia b ility and va lidity o f t k DM ATs need to be established so that teachers
ami administrators can confidently use the test results to examine existing mathematics
^g ra m s . These results w ill enable personnel in the schools to plan, rehne and develop
mathematical instruction.

13

C H A P TE R !-LITE R A TU R E REVIEW
Mathematics Achievement

The fallow ing are some o f dK observations made by researchers about the context
w ithin w hich mathematics is ta u ^ it and learned. 1 w ill be crwsidering Eve main sources
in this section. The Erst, Webb and Romberg (1992), provides an overview o f the
NaEonal C ouncil o f Teachers ofMathemaEcs (NTCM ) 1989 Curriculum and EvaluaEon
Standards and how the adoption o f these new standards has lead to the recent reE)rms we
see in the BriEsh Columbia mathematics curriculum . The second source is the Province
o f B ntish Columbia, Curriculum Branch, MathemaEcs K to 7: Integrated Resource
Package (IRP) (1995). I w ill examine the raEonale Err its connecEons to the N C TM 's
standards. The th ird , Erurth and EEh sources 1 wiU consider together as there are many
elem aits that are common to them a ll. The th ird source is the B ntish Columbia M irnstry
o f EducaEon, S kills and Trairung and provides infbrmaEon Eom the 1995 Bntish
Columbia Assesanart ofMathemaEcs and Science. Its authors, M arshall et al. (1997),
outline the technical inErrmaEon about the test and the recommendaEons that came Eom
it. The E)urth source, also Eom 1995, is the IntemaEonal AssociaEon E»r the EvaluaEon
o f EducaEonal Achievem m t (lE A )'s Third IntemaEonal MathemaEcs and Science Study
(TIM SS). The authors o f this report, MuUis et al. (1997), descnbe the results o f this
intemaEonal assessment. Canada was one o f 45 countries w hich parEcipated in this
assessment (M ullis et al., 1997; McConaghy, 1998). There were results reported at Eve
d ifk re n t grade levels. The report 1 reviewed gave results Eom the grade 3 and 4
assesanaits. The EEh, by Kober (1991), is a meta-study o f the issues encompassing the

14

teaching and learning o f mathanatics. This is stg^ported by a recent meta-study by
fCilpÆdaicl^EivfaiïcMni aa<l]Findell(/&0(13). I wiU tK;(%KariibiiD{;sK)nie cdFtbwsifWRies raisex! h i
this study.

AWono/ CowMCf/

(A/CZ%dP

As a mathematics ibeadier, I Imve often beard about the standards advanced by the
NCTM . Indeed, it appears that much o f the rationale upon which current mathematics
curricula is haunded connects to the rnrdlieroadiiasnzfDrm initiated try the NCTM in the
late 1980s. Webb and Romberg (1992) recount how this reform came about. The NCTM
established a commission in 1986 to develop a set o f new curricular standards to be
"incorporated into quality school m athanatics programs" and which w ould lay out "the
instructional conditions necessary fo r students to learn mathematics." (p. 37) The
commission was also asked to develop standards fo r evaluating school programs and
student perkrm ance in l i ^ o f the new curriculum standards. This w ork was undertaken
durh% the summer o f 1987, reviewed by N CTM members during the 1987-88 year and
Gnalized during the summer o f 1988. The members identiGed 6)ur standards in each o f
Gve goals lA iiich had to be met i f students were to be termed mathematiGally literate. The
h)ur standards stipulated were: 1) mathematics as problem solving, 2) mathematics as
communication, 3) mathematics as reasoning, and 4) mathematical coimections (p. 37).
The Sve goals identiGed by the commission were: 1) learning to value mathemaGcs, 2)
becoming conGdent in one's own a b ility, 3) becoming a mathematical problem solver, 4)
learning to communicate mathematically, and 5) learning to reason m athem atically
(Webb & Romberg, p. 37). Ite s e new standards were adopted at the 1989 NCTM
Annual Meeting in Orlando, Florida.

15

The new standards set out that "m athem atital knoWedge, because o f its dynamic
and m ultiplex nature, be acquired through investigation, exploring, reasoning, making
connections, ami communicating." (Webb & Romberg, 1992, p. 38) This was a rm lical
shiA Aom a curriculum that focused extensively on computational skills deAned by
mathematical algorithms presented by the teacho^. Students were passive recipients o f
that knowledge. In contrast, the N CTM {u:oposed that students become active
participants through discussirm, reAection and communicaAon, oral and w ritten. Errors,
because o f the new focus on e:q)loration and invesAgaAon, became a recognized element
o f reasoning. MathemaAcal thinking, rather than sim ply being nght or wrong, could be
credited as showing clear logic even i f the answers happened to be wrong. In short, these
iKW standards promote cogniArm.
AriAsA

MztAemaAcs /ntggrnrcA Resource factage
The prescribed mathemaAcs curriculum in BC is outlined in the Province o f

BnAsh Columbia, M inistry o f EducaAon's Integrated Resource Packages (IRPs)
MathemaAcs K to 7 and MathemaAcs 8 to 12 (1995). This curriculum came into eAect
September 1995. It comprises, by grade, a ll the mathemaAcs learning outcomes required
o f students.
The IRPs were developed using a vanety o f sources. One resource, in parAcular,
which comes through cleaAy in the introductory raAonale to these documents, is the
Curriculum and EvaluaAon Standards for School MathemaAcs fN C TM l. The
MathemaAcs IRP states that students, in becoming mathemaAcaUy literate, must develop
"the abüity to explore, to coryecture, to reason logically, and to use a variety o f
mathemaAcal methods to solve problans." (MathemaAcs K to 7 IRP, 1995, p. 2) The

16

document continues, "The new curriculum places greater em ;*asis on {^oba bility and
statistics, reasoning and communication, measurement, and problem solving." There can
be close connecticms drawn between what is w ritten in the Mathematics IRP and what
Webb and Romberg (1992) w rite about the NCTM Standards. In particular, there is a
strong emphasis on p foblen solving. This is one o f the N CTM standards. There is also
strong emphases on communication and reason, both o f w hich one 6nds in the NCTM
standards. One also Gnds sections in th^ IRPs that discuss the importance o f developing
positive attitudes and connecting and ^ip ly in g mathematical ideas. Again, these are
6)und as goals in the NCTM sW dards. As staled previously, there are many direct
connections that can be made between the BC mathematics curriculum and the
Curriculum and Evaluation Standards fo r School Mathematics (NCTM , 1989).
Otheryhctors
Marshall et al. (1997), MuUis et al. (1997), Kober (1991) and K ilp atrick,
SwaGbrd and Findell (2003) a ll report on the im pact o f student attitudes on mathematics
achievemart. They each have indicated that students w ith a positive attitude towards
mathematics experienced positive results on their mathematics achievement tests. Kober
(1991) points out that the converse is also true, students w ith negative attitudes g œ ^a lly
achieved poorly. She maintains that this negative attitude toward mathematics arises in
tw o ways. First, some students believe that only a few talented students could achieve
excelloice in mathematics. It is therefore acceptable socially fo r most students to do
poorly at mathematics. And second, some students viewed mathematics as sim ply a
collection o f fixed facts and formulae w hich were passed on by teachers. There was
therefore no need fo r them to th in k about mathematics process and concepts. Mastering

17

maüiematics, to students w ith this m ind set, is a passive process. It becomes
memorizatioD and regurgitation. Reasoning and cognition are not required. But as has
already been described, the mathematics curriculum and die N CTM standards urge active
learning and reasoning.
This Ixings me to my next observation concerning mathematics achievement.
Kober (1991), and Webb and Romberg (1992) both point out that the mathematics
curriculum must be relevant to students. Webb and Romberg qieak o f relevance when
they w rite that students become mathematically literate 'liy learning to value
mathematics" (p. 39). In the Mathematics IRP, relevance is promoted by recognizing the
importance o f problem solving in mathematics instruction and the importance o f making
connections to everyday life . As students move through the grades they should be
exposed to increasingly diverse and complex mathematical problems to solve. Being
able to successfully communicate solutions to problems is recognized as an im portant
part o f this a b ility. Connecting and zgiplying mathematical ideas is another im portant
element o f this especially w ith in the context o f a rtgridly changing w orld and an
increasii% ty pluralistic culture. Kober (1991), in a sim ilar way, speaks o f relevance in
terms o f showing ways in w hich mathematics is a part o f the ev^yday demands o f li& ,
not ju s t in school, but on the jo b as w ell. She maintains that the best ways to promote
relevance are to build %noblem solving skills and to challenge students to reason. She
adds that w ith the increasing prmninence o f technology, use o f technology further
enhances the cre dibility o f mathematical instruction.

18

Factors A H W ing Test Measuremait

R e lia b ility can be referred to as the degree to w hich test results can be replicated
when the test is administered to the same individual under sim ilar conditions (Crocker &
A lgina, 1986). H istorically, dûs consistency o f test measures can be approached 6om
two perspectives. First, the perspective o f the individual, where the individual's test
score consists o fa true measure o fth e tra it tested and a measure o f error. Second, the
perspective o f how the individual maintains a consistent position in the total group
(Stanley, 1971). In Feldt and Brennan (1989) this same idea is expressed more
quantitatively as the standard error o f measurement and the re lia b ility coefBcient o f a test
measure. In Cunningham (1998), this is referred to as a measure o f consistency betweai
sim ilar test items in tw o tests, where the same tra it is being measured and where the
relative order o f the test scores achieved by individuals w ould remain unchanged.
Establishing the re lia b ility o f the mathœ iatics achievement tests is a necessary
part (Lyman, 1998; Cunningham, 1998) but im tthe only objective o f this study. There
are several statistical measures that can be used to compute the re lia b ility coefficients o f
the tests. D ifferent coefGcients measure different aqiects and not a ll & ctors are measured
by any given calculation. Those 6ctors that exert a varying influence on a test are said to
contribute error variance to the test.
Lyman (1998) discusses 6ve main factors that contribute to error variance. The
6rst relates to the student taking the test. Individual students w ill vary 6om test to test in
their m otivation and prq>aration fa r the test. They may also vary in a m yriad o f other
ways. Their health, mental alertness, stamina, competitiveness, w illingness to take risks

19

could vary 6om day W d ^ . In 6 c t Ihee could be a ll sorts o f reasons w hy a student's test
scores could vary 6om test to test. Many o f these Actors are beyond the control o f test
administrators. The d istrict achievement tests included directions to test administrators
and to students designed to m inim ize the efkcts o f those Actors by standardizing test
procedures.
A second influence could be attributed to the examiner or marker. The examiner
could mvalidate the test by giving extra help, pom ting out mistakes, giving extra tim e or
in oAer ways helping Ae student taking the tesL This too is generally m inim ized w iA
A e standardization o f test procedures. Markers can mtroduce error variance w hm
subjective assessment cntos into Ae marking o f a test. Error variance can be m inim ized
w iA w ell laid out m arking criteria and ruA ics but it can never be conqrletely eliminated.
Because the A strict maAematics achievement tests have a section o f short answer
questions, some marker subjectivity is expected and interrater re lia b ility w ill need A be
considered.
A th ird A ctor affecting re lia b ility is Ae content o f Ae achievement tests. To test
a ll learning outcomes A r a given grade is unreasonable as Ae resulting test would
become long and unwieldy. There is a need to establish A e generalizability o f the
questions listed on Ae test. This w ill be considered m coigunction w iA A e Table o f
SpeciGcations.
A A u rA A cA r aSecting re lia b ility is A e influences o f tim e. Too little tim e
between tests, means A at some students w ill remember A st items and Aere A re have a
bigger advanAge; too much tim e between Asts, means that student learning could aûect
the A stribution o f marks. This hzqrpens because difA rent inA viduals leam at diSerent

20

Tates. Although this is a Ihctor aOecting re lia b ility, it's not a consideration in this study
because the mathematics achievement tests are administered only once. The re lia b ility
being considered in this study is the internal re lia b ility o f a single adm inistration te st
A G fth 6 cto r affecting re lia b ility is the situation in w hich the test is administered.
The conditions prevalent at the tim e o f test taking w ill influence the results o f the test.
This factor is difG cult to enter into any test re lia b ility measures. M ost o f these influences
are best addressed by the examiner. W ith students iM io have greater experience taking
test, this 6 c to r is less signiScant. W ith younger, less experienced students this may be a
signiGcant factor influencing Ae re lia b ility o f the tesL
H istorically, there are three different approaches to computing re lia b ility
coefGcients (Cunningham, 1998; Feldt & Brennan, 1989). The Grst is parallel forms. In
this method two diGGaent but parallel tests are constructed and administered to students.
The extent o f correlation between the test scores measures the re lia b ility. Parallel forms
provide two types o f inform ation. They indicate whether students have changed (given
tim e between test administrations). They provide inform ation about test items and
indicate whether or not sim ilar items test the same traits. The second is test-retest
re lia b ility. In this method the same test is administered tw ice. The resulting re lia b ility
measure is considered weak, however, because in cases where the period o f tim e between
tests is short, a higher than expected coefficient can result W ien students remember and
copy their previous responses. A third method is the single adm inistration method. In
this method, test items are divided into groups w hich are then tested fo r internal
consistency. Because the district achievement tests are one form administered to students
only once, the single adm inistration mediod is used in this analysis. There are several

21

methods o f calculating correlations to r a single administered test - Spearman-Brown
form ula, tau-equivalent and coefBcient alpha to m ention three (Cunningham, 1998; Feldt
& Brennan, 1989). The calculation used in this analysis is C oefficient A lpha The
advantage to this method is that it can be used on tests scored dichotomously or on tests
where a range o f scores is assigned.

Lyman (1998) considers va h d i^ to be the most im portant attribute o f a good test.
M cM illan and Schumacher (1997) point out that test va lid ity refers to the extent to w hich
test scores can be held to be m eaningful. They insist that va lid ity refers to the infsrences,
uses and/or consequences o f a test's measure. Traditionally, authors have referred to test
va lid ity under three main aspects (Lyman, 1998; Cunningham, 1998; Gipps and Murphy,
1994): construct v a lid ity, content va lid ity and criterion validity.
CoMstrwct Fa/fdïty
The Grst type o f validation I wish to consider is construct-related validity. This
relates to what Cunningham (1998) refers to as constructs. Constructs are broad
descriptors o f behaviours. Examples w hich Cunningham gives include intelligence,
creativity and reading comprehension. For the purpose o f this study, constructs could
also include reasoning, problem solving, estimating, communicating and investigating.
Some o f these behaviours can be quantified using questionnaires that give a range o f
responses (L ike rt type scale). The responses can then be assigned a concrete numerical
index. Construct v a lid ity can provide an indication o f how w ell the numerical index
realty reflects the construct Its purpose is to cla rify exactly what is being measured by
the test. There is no single coefBcient that one can establish Instead, this va lid ity is

22

measured by evidence inferred 6om the rest results. Cunningham states that, "Before we
can accept the view that a test measures what it purports to measure, a logical case must
be established h)r why in&rences about a construct, based on test scores, are legitim ate."
(p. 40). He goes on to discuss three phases used to establish construct-related va lid ity.
The Grst step is to determine Wiethm^ or not a single entity is being measured; next, to
describe the theory on which the construct is based including what the construct is and
how it w ill be observed and fin a lly to examine the results to see i f its interaction is as
predicted.
In the fallow ing sections two other farm s o f test v a lid ity w ill be discussed. They
are content-related and criterion-related va lid ity. There is a contemporary view o f test
validation W iich draws these two aspects o f v a lid ity measures into a single unifying
view. This framework, proposed by Messick (1989), unites content- and criterion-related
va lid ity w ith an a ll inclusive concept o f crm struct-validity. Sax identiGes six aspects o f
construct va lid ity: "content, substantive, structural, generalizability, external and
consequential." (1997, p. 314) He notes that content-relevance validation interprets how
w e ll the test items cover the domain tested. It further identiGes the degree to w hich the
items can be considered "relevant, represaitative and socially desirable." (1997, p. 314)
The substanGve aspect o f validation points to the need fo r evidence fa r va lid ity Gom a
variety o f sources. Structural vahdity refers to the scoring criteria itself. The values
assigned to test items need to relate logically to the respondent's task. A test
demonstrates generalizability validation in so 6 r as the properGes and interpretaGons
placed on respondents scores can be generalized across the to ta lity o f the construct
domain. The external aspect o f test validaGon is sometimes referred to as convergent and

23

discrim inant validity. In it, test results are expected to correlate strongly w ith related
concepts but show a weak correlation to non-related concepts. F inally, the consequential
aspect o f validation takes into account the values o f 6im ess, bias and social im plications
associated w ith the interpretation o f test scores.
Messick's position that test validation combines traditional ideas about test
va lid ity under a unified concept - construct va lid ity - postulates that a ll measures o f
va lid ity, em pirical and inferential, provide valuable inform ation about the nature o f the
assessment instruments (Messick, 1989).
Content vofû&ty
A test is said to have content related va lid ity when test items relate strongly w ith
the domain o f knowledge taughL This va lid ity measure is non-statistical in nature. In
& ct, some consider it suspect in its subjectivity. Concern has been expressed that this
va lid ity can be easily biased by test producers and vendors (Cunningham, 1998).
However, this va lid ity measure is o f particular importance fo r achievement tests such as
the mathematics achievement test under consideration in this p^)er because it expresses
the degree to which the items included in the test match the actual learning outcomes
related to a course. This includes elements o f knowledge included in the course and the
skills to apply the knowledge taught. Cunningham (1998) sets out a method o f testing
content related va lid ity using a table o f specifications. This table includes a hst o f
instructional objectives (learning outcomes), items associated w ith them, the number o f
items used to assess each objective and the cognitive level o f each item.

24

CnfenoM reW a f
The third form o f va lid ity is criterion-related v a lid it). It provides a more concrete
form o f va lid ity because it is statistieally based. It provides evidence that the test
measures criteria that compare w ith other standards that are closely related but external to
the test (Lyman, 1998; Cunningham, 1998). It sounds ideal but unfortunately has some
drawbacks. The Êrst lim itation is the d ifh cu lty in identij^dng ^ipropriate criteria
measures that are outside o f the test and o f a quality that give a good point o f referoice.
This could prove to be a d iffic u lty in this study because, w ith a new mathematics
curriculum , the objectives may be quite diGferent 6om existing standardized achievement
tests. The second lim itation is the interpretation o f the resulting correlation coefGcients.
There are many Actors that a fk c t the size o f the correlations measure. How are they to
be interpreted?
C ritaion-related va lid ity because o f its statistical nature is o fta i highly sought
after (Lyman, 1998). In general, the higher the measure o f correlation is between the test
and criterion standards the better. There are Actors, however, that need to be considered.
Some tests may not lend themselves to establishing criterion A r comparison. A lternately,
the test may lend its e lf to comparison w ith criterion standards but the criA rion may be
altogeAer different or have different emphasis. This needs to be a consideration in Ae
present study. Even though Aere are standard mathematics achievement tests available,
w iA Ae changes m Ae curriculum , is it possible to compare A e current learning
outcomes w iA c rita io n selected 6om previous curriculum models? A real terms, can
criA rA that Acused on computation be va lidly compared w iA questions that Acus on
reasoning and/or communication o f maAematical concepts.

25

As w ith re lia b ility, a lim itation to va lid ity could be the range o f scores. Where
test scores are homogeneous, discrim ination is d ifS cn lt Where there is a wide range in
test scores, greater discrim ination is possible and test va lid ity is comparably greater.
Because a b ility levels o f studaits in the d istrict vary greatly, it is expected that this 6 cto r
w ill increase va lidity.
Finally, researchers states that criterion-related va lid ity may be concurrent or
predictive (Lyman, 1998; Sax, 1997). For concurrent va lid ity, the test scores and the
criterion values are taken at about the same tim e. For predictive va lid ity, there is a lapse
o f tim e between the test scores and the criterion values. The focus o f this study w ill be
the examining o f concurrent validity.
Item Analysis
In this study o f the DM ATs, analysis o f the test items is an im portant
consideration. A general model fo r test construction described by Henrysson (1971)
includes: a pretryout stage where the test is planned, a tryout stage where the test is
administered to a representative sample (300 students or more) and a tria l adm inistration
stage. In terms o f the DM ATs, a ll these stages have been completed and the data from
the tests administered in 2000 provides an opportunity & r an in depth analysis o f the test
items.

A m ^o r advantage to classical item analysis is that it can be conducted on small
samples ju st as w ell as large samples. The main indices examined in classical analysis
are item difBcuhy aixl item discrim ination.

26

The measure o f item difS culty fo r a test item is deûned as '^the proportion o f
examinees who get that item correct" (A llen & Yen, 1979, p. 120). This measure is
usehil in determining whether or not an item is suitable fo r the a b ility level o f 6 e
students. I f an item is too easy a ll examinees w ould get it correct and the measure o f
difG culty would be 1.0. % on the other hand, an item is too difR cult, a ll examiiKes
would get it incorrect and the measure o f difG culty would be 0. These values represent
extremes w hich under normal circumstances one does not erKounter. Even so, they show
that measures o f item difG culty approaching these values should be held suspect and the
items should be examined closely fo r usefulness. AUen & Yen suggest that an item
provides maximal inform ation about the difkrences between examinees when the level
o f difG culty fa r the item (p j is .5. They indicate that this varies depending on the type o f
question. W ith m ultiple choice questions, because there is a guessing factor that must
also be considered, they suggest using a value o f .6 as a maximal value fo r a four
choice test. Depending on the test, these values act as a target to provide maximum
inform ation about the examinees.
There are some exceptions to the p value targets suggested by AUen & Yen. The
authors add thatp values used as a target need to reflect the overall purpose o f the test. In
tests designed to ide ntify students in need o f remedial intervention, items w ith high p
values fo r the general population (easy hems) need to be used. In tests designed to
identify high achieving students fo r awards or fo r special enrichment type programs,
items whh low p values (difG cult item s) are needed. Because the DM ATs are student
achievement measures, a range o fp values is desirable. A llen & Yen (1979) suggest that
the most suitable range fo r item di@ culty is about .3 < p < .7.

27

A measure o f item discrim ination is determined in either o f two ways —an item discrim ination index (D ,) or an item /total-test-score point-biserial correlation

(A llen

& Yen, 1979). These statistics norm ally produce sim ilar interpretations. Both are used
to calculate the degree to w hich an item discriminates between high and low scoring
examinees. O r to put it another way, either o f them can be used to indicate W iether a
student who does w ell on the test as a W iole (high scoring) is more hkely to get a
particular item correct than a student who does poorly on the test as a whole (low
scoring). N orm ally, high values o f individual item discrim inatioris are desirable fo r a
test.
The item discrim ination index (D,) is determined by calculating the difference
between the proportion o f high scoring examinees wdio correctly answer an item and the
proportion o f low scoring examinees who correctly answer the item . A form ula fo r this
calculation is:
i> , = ^
n,
where L/} is the number o f examinees in an established upper range o f scores on the
whole test and w to also got the item correct, JLjis the number o f examinees in the low er
range o f scores on the whole test and w iio got the item correct and n, is the number o f
examinees in the upper and low er ranges (AUen & Yen, 1979). W hile the iqiper and
lower ranges m ight logically be the top quarter or th ird and the low er quarter or third, the
proportion that is chosen by software designers is actually 27%. Research has shown that
the sensitivity and stability o f the item discrim ination index is often greatest when using
the upper 27% and the low er 27% o f the examinees (Crocker & A lgina, 1986).

28

The item /total-test-score point-biserial correlation is a comparison o f scores on
the item to to ta l test scores. This is a Pearson correlation and the harmula fo r this
calculation is:

_

where

I A

is the mean o f the scores among examinees who responded correctly to item i,

and s% are the mean and standard deviations fo r aU examinees and

is the item

difG culty (A lle n & Yen, 1979).
In classical analysis, knowing the level o f item discrim ination, D or

is

valuable. For items that are w ell behaved, the discrim ination should be positive. This
means that more high scoring examinees select the correct response fo r the item than low
scoring examinees. Negative discnmmation values would indicate exactly the opposite,
more low scoring examinees select the correct response than high scoring examinees.
Items w ith a low or negative discrim ination fo r the correct response are suspicious and in
general should be removed horn the tesL
Aem Response TAeory
Item response theory (IR T) was created in an effort to overcome shortcomings
associated w ith classical test analysis. The most notable o f these shortcomings is that in
classical item analysis, characteristics that are associated w ith the examinees cannot be
separated from characteristics associated w ith the test (Hambleton, Swaminathan &
Rogers, 1991). In terms o f a student w ritin g a test, the classical measure o f the student's
a b ility is the student's test score. However, the scores are a function o f the difG culty o f
the test items. For an easy test, the student's test score could be quite high whereas iv ith

29

a difR cult test that student's test score could be quite low . Item response theory has been
developed to overcome this interdependence o f examine and test characteristics.
IR T rest on two basic postulates (Hambleton, Swaminathan & Rogers,1991). The
frs t postulate is that an examinee's per&rmance can be predicted using factors referred
to as traits o r abilities. The second postulate is that there is a relationA ip between the
responses to an item and the examinees' traits that can be described by a continuous and
increasing function. This function is called the item characteristic fra ctio n and when
applied to examinee a b ility and the probability o f a correct response to an item , it is
termed an item characteristic curve (IC C ) (Crocker & A lgina, 1986). A n ICC w hich is o f
special significance is that based on a normal distribution. This is termed a normal ogive.
It has several special properties. First, going fo m the le ft to the righ t the curve rises
continually. Second, the lower asymptote approaches zero and the upper asymptote
approaches one. Third, it is directly related to a normal distribution and therefore graphs
proportions that are functions o f the z-scores (Crocker & A lgina, 1986).
IR T models rely on maximum likelihood probabilities and as such may or may
not be applicable fo r use in analyzing some sets o f data. We are cautioned to assess the
f t o f the model to the data. Where f t can be ve rife d , IR T models provide the
opportunity to estimate examinee a b ility independent o f test items and to establish item
characterisfcs that are independent o f the group tested. (Hambleton, Swaminathan &
Rogers, 1991)
Aem cAwocteMsric cr/rves.

A n item characterisfc curve (IC C ) is described by A lle n &Yen (1979) as a
g ra ^ c display o f the re la fonship between the probability that an examinee w ill correcfy

30

respond to an item and the examinees relative score on the tesL This relationship is
supported by Crocker and A lgina (1986) who note that true scores on a test are related to
the latent tra it gr^ihed in an ICC. I f test scores are used as the best estimate o f the true
score then sim ilarities between ICCs and classical item statistics can be noted. The ICC
fo r an item w ould have estimates represented on a graph w ith total test scores ranked
along the horizontal axis and the proportion o f examinees' responses located on the
vertical axis. The resulting curve provides an opportunity to examine the degree o f
difG culty and level o f discrim ination.
Ramsay (2000) used ICC displays in the development o f the program TestG raf
The program is used to display the probability that examinees w ill choose certain options
depending on the prohciency o f the examinee. I use it here to illustrate how the classical
measures o f difG culty and discrim ination can be related to ICCs. In TestGraf the degree
o f difG culty fo r an item is defined as the prohciency level (e)q)ected score) that
corresponds to a probability o f .5 on the vertical axis. In other words, it measures the
estimated a b ility score (or rank) at w hich 50% o f the examinees that had that score
correctly responded to the question. For an easy item more lower scoring examinees get
the item correct and the .5 proportion w ould be reached at a low score or percentile. For
a more difG cult item few higher scoring examinees wiU get the item correct and the score
(or rank) at w hich .5 o f the examinees w ith that score get the item correct could be quite a
high score or percentile.
In TestGraf item discrim ination can also be measured using the ICC. The
discriinination is dehned as the slope o f the ICC. This is one way in w hich the item
analysis using TestGraf d i^ rs 6om item reqxmse logistic (IR ) models. In the IR models

31

the measure o f difRcuhy and the level o f discrim ination are both measured at the
estimated total test score (or rank) at w hich 50% o f the examinees w ith that score (or
rank) correctly respond to the item . The analysis w ith TestGraf^ however, provides the
opportunity to observe how discrim ination is lik e ly to vary between the different groups
o f examinees at d iff^ e n t ranges o f the expected score. An item may display an ICC
which shows great discrim ination fo r low scoring examinees and have low discrim ination
6)r high scoring examinees. As A U at & Yen note, "ICCs can be useful in id a iti^ in g
items that per&rm differently fo r different groups o f examinees" (A llen & Yen, 1979, p.
129).
The one drawback to ICC fa r analysis is that large samples are required to make
realistic estimations o f the response curves. This is particularly im portant fo r the
extremes o f the test scores, the high scoring examinees and the lowest scoring examinees
w boe fewer examinees fnovide data hrr estim ating the ICC.
One pwamcter logwtzc mWel.

The one parameter IR model, also called Rasch model, provides an analysis o f
items where the only parameter o f interest is the item d iffic u lty . It is assumed that other
parameters do not a fk c t the model. For this model a ll ICCs have an identical shape
because discrim ination is assumed to be equal. The ICCs only d iffe r in the ir placement
along a d iffic u lty /a b ility continuum. This model is based on the premise that the odds fo r
success o f an examinee are based on the product o f an examinee's a b ility ( ^ and the
easiness o f the item where easiness is dehned as 1/h w ith h being the difG culty o f the
item . Hambleton (1989) shows that based on this premise, the form o f the resulting ICC
can be deGned using the fmrmula:

32

1+

The f

is the probability that examinee n w ill answer the rA item correctly,

is the

a b ility measure o f examinee n and 6, is the level o f diS iculty o f the item , i .
The a b ility measures o f the groiq* o f examinees can be transformed so that the mean
a b ility is 0 and their standard deviation is 1. The parameter o f interest (6,) is measured as
the location on the a b ility distribution where 50% o f the examinees o f that a b ili^ would
get the item correct. Negative values fo r a b ility are located to the leA o f the mean;
there&re, a negative

value represents an easier item . S im ilarly, a positive 6, values

would indicate a more difG cult item . In Rasch analysis, it is a common practice to center
the item difG culty at zero. Parameter values fo r this model then typica lly are values
ranging Aom -2 to 2.
Two parmneier /ogisirc modle/.

The two parameter IR logistic model provides an analysis o f items where the
parameters o f interest are the ito n difG culty and the item discrim ination. It is assumed
that no other parameters aflect the model. It is deGned by an ICC formed by the
follow ing function:
0 = 1,2, 3,..., k).

f

is the probability that an examinee n w ith a b ility ^ w ill answer item i correcGy and

a and h are parameters that characterize item i. The variable k is the number o f items on
the test and T) is a scaling factor v h ich brings the resulting curve close to a normal ogive
(Hambleton, 1989).

33

In this model a b ility vaines are as deSned & r the 1 parameter logistic model.
The parameters a, and 6, are usually referred to as the item discrim ination (a,) and as the
item difG culty (6/). The item difG culty represents the point on the a b ility scale where an
examinee has a 50% probability o f answering the item correctly. The item discrim ination
is the slope o f the ICC curve at the point 6. In theory there are no upper or low er lim its to
the value o f a; however, in practice items w ith a negative value fo r a would be discarded.
The slope o f ICC generally is not greater than 2 so the e fkctive range fo ra is considered
0 to 2.

The three parameter IR T model provides analysis fo r items where three
parameters are o f interest: difG culty, discrim inatioi^ and low er asymptote (pseudo­
chance level). It is defined by an ICC formed by the follow ing function:

f ((^ J = c, + (1 f

3

,

.

.

.

,

k).

is the probability that an examinee n w ith a b ility ^ w ill answer ita n i correctly and

a,, 6; and c, are parameters that characterize item i. The variable k is the number o f items
on the test and

is a scaling 6 cto r W iich brings the resulting curve close to a normal

ogive (Hambleton, 1989). Parameters D, a and 6 have the same means as fo r the two
parameter model except that at on the abihty scale, the probability o f a correct response
is (1 + cJ/2 rather than .50. The low er asymptote (c) value represents the probability o f
an extremely low scoring examinees getting the item correct. It can be considered a
measure o f guessing at an item . Because c is a probability o f getting the item correct its
range o f values is 0 to 1. For m ultiple choice tests the in itia l estimate fo r c is the inverse
o f the number o f possible responses.

34

Summary o f the Thesis Topic
The focus o f this paper is the re lia b ility and va lid ily o f the DM ATs & r grade 5
and 7 students. These tests together w ith student grades from the 1999-2000 school year
and results 6om the May 2000 FSA grade 4 tests provide considerable data w ith \\h ich to
undertake this study. Other inform ation, including an analysis ofth e construction and
adm inistration o f Aese tests and a comparison o f the item response models w ill be
considered in the course o f this study.
The C ontribution this Study w ill Make to the Literature
This study is prim arily an em pirical study o f concurrent va lid ity examining the
relationship between loca lly developed DMATs and student results on classroom based
assessment o f mathematics achievement as w ell as provincial tests o f student numeracy.
Although the procedures follow ed w ill 6xms on traditional va lid ity theory, it w ill
examine aspects o f v a lid ity more closely associated w ith the position taken by Messick
(1989). This project w ill provide suRxrrt to School D istrict 57 personnel in that it wiU
provide inform ation about the DMATs. It w ill also add validation inform ation to the
general body o f knowledge related to test va lid ity.

35

CHAPTER 3 - METHOD
Research Design
This is an em pirical study using test data and statistical analysis to establish the
re lia b ility and va lid ity o f the mathematics achievement tests developed by School D istrict
57. The test data used fo r this study consists of: scores from the DM ATs administered in
the fa ll o f 2000, mathematics grades fo r students enrolled in grades 4 and 6 during the
1999-00 school year-term (marks as w ell as Gnal grades) and FSA scores fo r those grade
4 students who wrote the FSA numeracy test in May, 2000. This data set was collected
hom school Principals and 6om School D istrict personnel at the board ofBce. It was
cross checked to ensure accuracy and stored in a computer data base.

The subjects o f this study are the grade 5 and grade 7 DM ATs and indirectly the
teachers and the SDMC members who designed and constructed them. M y analysis o f
them w ill include examining the test items, examining the role played by those involved
in the construction o f the tests and the rating o f the examinees and items, and examining
the item analysis methods used to assist in the analyses o f the test data.
A ll grade 5 and grade 7 students in School D istrict No. 57 who wrote the D M AT
tests during the fa ll o f2000 are participants indirectly in this study. A lis t o f the schools
and the numbers o f the students in each school who wrote the D M AT is included in
Appendix B. The instructions fa r administering the tests, sent to the schools w ith the test
booklets, outlined which students were to w rite the test and w hich students were to be
excluded. Principals were directed that a ll grade 5 and a ll grade 7 students were to be
included excq>t where inclusion w ould result in undue harddiip to the student. Students

36

were to be excluded if: they exhibited moderate to severe intellectual disabilities, severe
behaviour disorders, m ultiple disabilities, autism, had extended absences, or were not
capable o f r^ponding. In the case o f students on Individual Education Plans (lEP),
teachers were asked to include them i f they were capable o f responding provided they
had ^qnopriate assistance sim ilar to that provided in regular classroom situations or as
described in the student's lEP.
The data set gathered hom the D M AT was used fo r the item analysis o f the tests.
These data were also used to establish the internal consistency o f the tests. However, to
establish va lid ity, not a ll these data could be used because not a ll o f it could be matched
w ith speciGc students. There were two general categories fo r w hich a match between
DM AT scores and term and year end marks could not be made. The Srst category
included aU those students 6)r whom identiGcation on the D M AT scores was a problem.
Some o f these students had recorded school numbers correctly but when the lists w oe
sent to the schools, principals were unable to decipher the names. Others had school
numbers recorded incorrectly and tracking down their records was impossible because it
was impossible to know w hich school they attended. In Table 22 in Appendix B, under
"Miscoded Data", I have noted the numbers o f students fo r whom D M AT results were
known but fo r whom the school they attended could not be established.
The second category, by & r the largest, included those studaits who wrote the
DMATs in the &U o f2000 and moved behrre data on their term and Gnal grades could be
collected. The collection o f data Gom the schools did not commaice u n til February,
2002 and was not completed unGl March, 2003. This meant that most students had
changed schools and their records had been sent to the ir new schools. In most instances.

37

movement o f grade 7 students was predictable - they moved to the high schools in their
catchment area. Some students, however, could not be tracked. Either they had moved
out o f d istrict (in some instances, out o f province) or their new school was sim ply not
known. In Table 23 in Appendix B, there is a summary o f the students fo r whom school
records were not available. This turned out to be 185 students in grade 5 (14.3%) and
177 students in grade 7 (15%).
M zaswgf
DA&4T Grodk 5
The grade 5 D M AT consisted o f tw o parts each presented in a separate test
booklet. The Brst part was made up o f 30 m ultiple choice items. Each item consists o f a
stem follow ed by 5ve possible responses. Each item was valued at one mark. The
correct responses ^ypeared randomly distributed among the Sve possible. The test
booklets included a cover sheet where students were directed to ide ntify themselves by
name (last and Srst), grade/class, school, whether or not they were o f aboriginal ancestry
and whether or not they were enrolled in a Montessori program. The cover sheet also
included an instructions section that provided students w ith the directions required to
complete the test
A ^

2

goodness o f h t test can be used to determine whether or not the distribution

o f correct responses is in fact random. There are 30 items and 5 response alternatives,
there&re the most lik e ly distribution o f responses fo r it to be random would be 6 correct
responses fo r each o f the 5 alternatives. For the grade 5 te st the observed distribution
was: alternative A - 5, alternative B - 5, alternative C - 9, alternative D - 5 and alternative

38

E - 6. The observed y

/"y

value is 2. The critica l y

Jv rv=.05

is 9.488 har d f = 4, therefore the

distribution o f ita n s is statistically random.
The second part o f the grade 5 D M AT was made up o f six items designed to be
answered directly in the test booklet. These six items were in some cases subdivided.
The arrangement o f the items and the distribution o f marks was as follow s: item 31 is
divided into sections A and B, each w orth 1 mark; item 32 is divided into sections A and
B, each w orth 1 mark; item 33 is worth 5 marks; item 34 is divided into sections A , w orth
4 marks, and B and C, each worth 1 mark; item 35 is w orth 2 madcs and item 36 is
divided into sections A and B, each w orth 1 mark. The Short Answer part o f the grade 5
test was, in total, wm th 19 marks. Because this part o f the test was contained w ithin a
separate booklet, there was, again, a cover sheet used to record student inform ation (the
same inhrrm ation as fo r the m ultiple choice section). As fo r the m ultiple choice section,
there was an instruction section which provided students directions fo r the completion o f
this part o f the tesL A copy o f the Grade 5 D M AT booklets and the answer key booklet
is included in Appendix C. These documents are included only fo r the defence o f this
thesis but w ill not be published so as to maintain security o f the ita n bank.
D M fT Grade 7
The grade 7 D M AT consisted o f tw o parts each presented in its own test booklet.
The Grst part was made up o f 25 m ultiple choice items each consisting o f a stem w ith
fiv e possible responses. Each item was valued at one mark. The test booklets included a
cover sheet where students were directed to ide ntify themselves by name (last and firs t),
grade/class, school, \^ e th o : or not they were o f aboriginal ancestiy^ and whether or not

39

they were enroHed in a Montessori program. It also included an instructions section
w hich provided students w ith the directions required to complete the test.
A y goodness o f f it test was again used to determine whether or not the
distribution o f correct responses is lik e ly random. There were 25 items & r the grade 7
test and 5 response alternatives. As w ith the grade 5 test the most lik e ly random
distribution would result whenever we get 5 correct responses &>r each o f the 5
alternatives. The expected distribution is there&re 5. For the grade 7 test, the observed
distribution was: alternative A - 5, alternative B - 5, alternative C - 9, alternative D - 6
2

and alternative E —0. The observed y value is 82 . The y
/i/

2

c'v=.05

is 9.488 fa r d f= 4 .

We can therefore conclude that the distribution o f responses is random.
The second part was made up o f ten items designed to be answered directly in the
test booklet. These ten items were in some cases subdivided. The arrangement o f the
items and the distribution o f marks was as follow s: item 26 was w orth 4 marks; item 27
was divided into sections A , w orth 1 mark, and B, w orth 2 maiks; item 28 was w orth 2
marks, item 29 was divided into section A , B and C, each w orth 1 mark; item 30 was
w orth 2 marks; item 31 was worth 1 mark; item 32 was w orth 2 marks; item 33 was also
w orth 2 marks; item 34 was divided into sections A and B, each w orth 1 mark; and item
35 was divided into sections A and B, each w orth 1 mark. In total, the Short Answer test
was Old o f 23 marks. As was the case w ith the grade 5 Short Answer test, the grade 7 test
booklet had a cover sheet used to record student inform ation (the same inform ation as fo r
the m ultiple choice section) and it provided students directions fo r the com pletion o f the
test. A copy o f the Grade 7 D M AT booklets and the answer key booklet is included in
Appendix C. As fo r the grade 5 D M A T test booklets and the answer key, these

40

documents aie included fo r the defence o f this thesis but w ill not be published so as to
m aintain security o f the item bank,
f& f - fbwndW ion Ski/Zf
The FSA - Numeracy fo r 2000 consisted o f 4 parts. The 6rst and third parts were
m ultiple choice items made

o f a stem w ith four possible responses. The second and

fourth parts were made up o f w ritten response items where students were asked to record
the ir answers directly into the test booklets. The correct responses 6)r the m ultiple choice
items were distributed w ith the follow ing &equencies: Part A (Items 1-16) alternative A -4
times, alternative B-4 times, alternative C-5 times and alternative D-3 tim es, Part C
(Items 19-34) alternative A-3 times, alternative B-4 times, alternative C-6 times and
alternative D-3 times. Parts B and D were each made rq) o f two items and each item was
w orth 4 marks.
Term awZ Tear

Gradies

The data required to test fo r evidence o f concurrent va lid ity were student marks
fo r the year 1999-2000. It was & lt that although hnal grades could present an accurate
assessment o f student achievement, term marks w ould also be gathered to provide
additional inhum ation as to the relation o f D M AT results to student m aiks. This meant,
however, that because term marks fo r the year in question were stored in hard copy form
only (on student report cards), each student's 61e had to be reviewed m anually. The
marks that were o f particular interest in this study were the mathematics marks. No other
marks were recwded.
Some principals submitted only the year end marks fo r students. In these cases I
decided against pursuing principals fo r the term madcs as w ell. As a consequence, fo r

41

7.9% o f the grade 5s and 3.6% o f the grade 7s, only fina l marks were collected. Table 23
in Appendix D includes numbers and percentages fo r each o f the categories o f the
collection o f term and/or 6nal marks.
For students who wrote the grade 5 D M AT, i f the scores they received horn the
FSA they wrote in grade 4 were in their school file s I was able to record it; however
M inistry o f Education ofGcials had directed school principals to send the reports home to
the student's parents. In some schools, copies were kept; in others, they were not. As a
result, data on FSA results were available 6)r only 47.3% o f the students.
frocedîwrçr

The development o f the DMATs originated in the early 1990s w ith the members
o f the school d istrict Mathematics Committee (SDMC). W ith the help o f Assistant
Siq)»intendent Bendina M ille r, D irector o f Instruction Norm Munroe and w ith technical
assistance 6om Iris M cIntyre the SDMC designed a three part achievement test to be
used to assess how w ell grade 6 students were meeting the learning otgectives o f the
grade 6 mathematics curriculum . This test was piloted in the spring o f 1995. The
development o f tests & r grades 4 and 8 follow ed and were modelled aAer the already
developed grade 6 test. Teachers, w ith knowledge o f the mathematics curriculum in each
o f the grades targeted, were recruited and asked to review banks o f mathematical
questions and problems and to develop the desired tests. The SDMC members then
reviewed these tests, made m inor revisions and approved them fo r use.
The original tests consisted o f three parts. The hrst part was made up o f m ultiple
choice items, the second part was made

o f short answer items and the th ird part was a

42

performance assessmait made up o f three or faur mathematics problems. The
Performance Assessment part o f the test was administered, one-on-one, to randomly
selected students. Students, in this part o f the test, were askW to talk about how they
would solve a given mathematical problem. They were given tools and/or mathematics
manipulatives which they could use to solve the problem. As they worked on the
problem, the examiner asked them questions and/or encouraged them to ta lk about what
they were thinking. A scming rubric was established by the SDMC membas to be used
to rate responses.
It was orig inally designed that the M ultiple Choice and Short Answer parts o f the
test would be administered to 20% o f the students at each grade level. The schools in
^ lic h the tests w œ to be administered were selected randomly but included a m ix o f
large and small schools, inner city and community based schools as w ell as rural and
urban schools. It was fe lt that this w ould give a representative sample 6om w hich to
assess overall progress in im plem aiting the mathematics curriculum . The series o f tests
were originally designed to be administered every other year rather than annually. The
DMATs fo r grade 5, 7 and 9 were firs t administered in the fa ll o f 1996.
Each o f the years that the tests were administered (1996,1998,1999, and 2000),
the SDMC members met be6)re the tests were sent to the schools to review the items and
make any changes that were deemed necessary. The SDMC members met again, after
the tests were administered and marked, to analyse the results and make
recommendatimis about the implementation o f the mathematics curriculum to teachers,
school and board ofBce administrators, and to make recommendations about any changes
in the tests that they thought would be necessary fo r subsequent years. One o f the

43

primary objectives o f the SDMC was to analyse the test results over tim e to see what
trends could be observed. That, in part, was why grades 4 ,6 and 8 were chosen and why
a two year cycle o f testing was selected. To maximize continuity, SDMC members
attempted to keep to a m inim um the changes made to the test items. Some changes did
occur. Notably, the Performance Assessment part o f the test was discontinued after
1998; the test was administered yearly rather than every two years beginning in 1998; a ll
students were tested in 2000 rather than a representative sample; the whole o f the grade 9
test was discontinued aAer 1998; the wording and the distribution o f marks fo r some o f
the grade 5 short answer items was changed after 1996 and tw o o f the grade 7 m ultiple
choice items were replaced after 1996 by what was fe lt were more suitable items.
Over tim e, priorities w ith in the school d istrict have changed. In the fa ll o f 2002
the decision was made to discontinue administering the DM ATs. A t the tim e o f w riting,
these tests remain an assessment tool that, although not in use currently, could be brought
out o f storage and once again put into use.
DrchotoMKwy aW AW Apoinf

TYoceff

For each test, the m ultiple choice sections were machine marked. The short
answer sections ofthetestsrequiredteam sofm aikers. Eachyearthatthetestsw ere
' administered, d istrict teachers were recruited to mark the short answer sections. Each
tim e, the teachers in itia lly reviewed the answer keys and m arking guides to maximize
uni& rm ity o f m arking. They reviewed the recommendations o f previous marking teams.
Then a few tests were marked togeth^, again to ensure unihrrm ity. Then test booklets
were distributed and each marker started to mark the test booklets. Student scores were
recorded on the bubble sheets. The marking teams compared a ll problem papers (\^ e re

44

responses were unnsnal or difG cnlt to assess because o f an unusual approach). A t the end
o f the marking sessions the marking teams drafted notes to the mathematics committee
and recommendations to the next set o f markers.
For the year 2000 tests, re lia b ility between the markers was tested by having them
mark the same tests & r twenty students. The twenty tests were randomly selected over
the space o f two days. Ten tests were selected on each o f the two days. The data were
tabulated and the degree o f agreement betweoi the markers was measured. A summary
o f the correlations between markers is shown in Table 14 found in chapter 4.

The data that were used in this study consisted o f student scores on the fa ll 2000
DM ATs, term and Goal grades fo r the school year 1999-2000 hn students who wrote the
6112000 DM ATs and FSA scores 6 r spring 2000 fo r grade 5 students who wrote the fa ll
2000 DMATs.
Student scores on the DM ATs were aval6ble through a central data base located
at the d istrict school board ofGce. The inform ation fo r each student included: a twentyone d ig it identiGer number, Grst and last name, grade, gender, whether the student was o f
aboriginal ancestry or not, the school attended, whether the student was enrolled in a
regular school program, a French immersion program or a Montessory program, choices
fo r a ll items in part 1, scores on each item fo r part 2, a M ultiple Choice score, a Short
Answer score and the date on w hich data were recorded.
Principals in d istrict elementary and high schools were contacted (see Appendix
D fo r a copy o f the letters) and term marks as w ell as fin a l grades were requested 6 r a ll
those students who wrote the DMATs in the year 2000. To facilitate the collection o f

45

data, lists including the names o f students who wrote the tests were sent out to the
schools where the students had last attended. In some schools the in& im ation was
collected by school sta ff and returned to me. M ost school principals invited me to come
to the school to review school files and copy the required data. N ot a ll the names o f
students were identiGable. Some students had moved, many to other schools in the
d istrict but there were many also who hM moved out o f the d is tric t I did not attempt to
fo llo w up on students who had moved to another d is tric t Some students had incorrect or
incomplete coding fo r the school code and were consequently not listed w ith any
particular school. For these students, once they were identiGed, school lists were revised
and they were then included. Some students could not be identiGed or located and
infbrmaGon about the ir term and Gnal marks could not be retrieved.
Final grades fo r students who wrote the DM ATs were available through the
schools on a school based data base; however, term grades fm the students required a
review o f each student's school Gle. Only the mathematics grades fa r the school year
1999-2000 were used. The in6>rmaGon was taken direcGy Gom the student's report card.
The grades were recorded as letter grades including: A , B, C-I-, C, C-, I and F. Where
students were w orking w ith individual educadon plans, the grades were recorded as lEP.
Most schools had grades fo r three terms plus a Gnal grade. Two schools in School
D istrict No. 57 use a report card that records grades fo r Gve terms plus a Gnal grade.
They are Highland TradiGonal School and Central Fort George. For consistency, when I
analysed the data, I converted t k Gve grades 6)r students at these schools to three grades
by averaging the Grst and second grades fo r a Grst term grade, the th ird and faurth grades
fm a second term grade and retained the GAh grade fo r the th ird term grade. The Goal

46

grades were leA unchanged. For the overaU analysis I converted the grades to numbers as
follow s: A - 7, B - 6, C+ - 5, C - 4, C- - 3, lEP - 2 ,1 -1 , and F - 0. M issing data were
coded as 9 and were excluded from any calculations that w ould skew the results.
For the FSA scores fo r students who wrote the grade 5 DM AT, I recorded the
scores student received on the Numeracy part o f the FSA they wrote in 2000. Results
were recorded on a scale that included N ot Yet Meeting Expectations, Meeting
Expectations, Exceeding Expectations and measures between each o f these categories.
For the purposes o f this analysis the FSA scores were located on a Eve point scale w ith
N ot Yet M eeting Expectations measuring a 1, Meeting Expectations measuring a 3,
Exceeding Expectations measuring a 5. The points between these values were assigned
the measures 2 and 4 respectively. The raw scores 6om these tests were not available to
me - only the inform ation from the Enal reports.
ZWovfyWysfg
The analysis o f these data involved the use o f the computer programs: Iteman
version 3.05; TestGra^ Department o f Psychology, M cG ill U niversity, M ontreal,
Quebec; Bigsteps, a Rasch-Model computer program constructed by John M . Linacre and
Benjamin D W right and available through MESA Press; Ascal, Assessment Systems
Corporation, 1984, part o f the M icroCat Testing System and SPSS version 11.0. These
programs were used to assist in the analysis o f test items and to calculate summary
statistics and re lia b ility indices.

47

Compwfer f rogramï C/setf

Iteman provides a classical analysis o f the student response data. It calculates
endorsement rates as proportions (or percentages) and calculates item -total correlations
fo r each response in order to determine the degree to which each item contributes to the
re lia b ility o f the test. In a sim ilar way, it calculates proportions fo r the alternate responses
to determine i f these options are functioning as intended. Iteman determines the degree
to w hich student responses accurately reflect a b ility in two ways. It establishes the
discrim ination index "D " by calculating the proportion correct in the upper and lower
(27%) a b ility groups and comparing these values. A second measure o f discrim ination is
a choice o f item -total correlations —a point biserial correlation or a biserial correlation i f
the data are dichotomous. Iteman calculates statistics fo r the test as a whole. These
include: Aequency, mean, variance, standard deviation, skew, kurtosis, re lia b ility, and
median p-value.
Iteman can compute two types o f item -total correlations —a point biserial
correlation and a biserial correlation. T te correlation chosen h)r the analysis o f each
D M AT was the point biserial. For Iteman this is a Pearson product-moment correlation
between the item scores and the number-correct (total) 6>r the tesL The point biserial
correlation is calculated fo r each alternative. This provides an opportunity to examine
each alternative and assess how w ell or poorly it is behaving.
Iteman calculates a discrim ination index fo r each dichotomous test item (the
m ultiple choice item s). The index is the diSerence between the proportion correct in the
high a b ility group and the low a b ility g ro ip . These values can range 6om -1.0 to 1.0. It

48

is indicated in the Iteman Users Manual that negative values and low values (less than
0.20) may indicate that the test item is flawed or perform ing poorly. Higher values,
however, w ould indicate that die test item diEferentiates between high scoring and low
scoring examinees. In general, a discrim ination index o f over .40 is considered great,
between .30 and .40 is considered average, and although scores o f between .20 and .30
can sd ll represent an acceptable level o f discrim ination, scores o f lower than .20 are
marginal at best. A n example o f the output generated by Iteman is presented in Figure 1.
Item S ta tis tic s

A lte rn a tiv e S ta tis tic s

Seq.
No.

Scale
-Item

Prop.
correct

Disc.
Index

Point
B iser.

25

1-25

. 53

.60

.52

1
2
3
4
5
Other

.02
.08
.53
.10
.24
.02

.04
.15
.23
.21
.31
.00

.01
.04
.83
.01
.11
.00

-.0 7
- . 18
.52
-.2 7
- . 18
-.2 4

26

1-26

.43

.59

.49

1
2
3
4
5
Other
Page 5

.18
.26
.04
.06
.43
.03

.23
.41
.08
.07
.14
.00

.09
.10
.02
.06
.73
.00

-.1 4
-.2 6
-.1 3
- . 03
.49
-.2 5

A lt .

Prop. Endorsi no
Total LOW Hi gh

Poi nt
B iser. Key

Ffgw e 7. Sample output fa r Iteman fo r the m ultiple choice items in the grade 5 D M AT.
The output shown in Figure 1 is taken from the Iteman analysis o f the grade 5
test, items 25 and 26. In this analysis, the data were divided into two parts —the
dichotomous items (M ultiple Choice) and the m ultipoint items (Short Answer). The
program assigned each item an identiGer as noted in the column entitled "Scale-ltem ". In
the above, item 26 was assigned 1-26. The correct response fo r this item is noted by the
asterisk located in the column "K ey". For item 26. the GAh response is the correct

49

*

response. The three columns under '*Item Statistics" show: the degree o f difG cu l^,
"Proportion Correct", w hich fo r this item is .43, and two measures o f the level o f
discriminaticm und«^ the healings "D iscrim ination Index" equal to .59 and "P oint
B iserial" equal to .49. In Figure 1, it can be noted that because the discrim ination index
is deGned as the difkrcnce between the proportion o f low scoring examinees who chose
the correct alternative and high scoring examinees who chose the correct alternative, fo r
item 26, this diSaence, .73 - .14, is .59 as indicated above.
Under the heading "A lternative Statistics", there are six columns. They show: the
numbers o f the different responses listed under "A lternatives", the correct response noted
under "K ey", the proportions o f students choosing the d ifk re n t alternatives listed under
the headings "T otal", "Lo w " and "H igb " and the correlation between the selected
alternative and the total test scores given in the column "P oint B iserial". In Iteman, the
category "Low " is deSned as the 27% lowest scoring examinees and the category "H igh "
is defined as the 27% highest scoring examinees. For item 26, we see that 43% o f a ll
examinees chose the correct req>onse alternative 5 (this inhum ation also appears in the
section "Item S tatistic"), 14% o f the group categorized as "Lo w " chose the correct
alternative and 73% o f the group categorized as "H igh " chose the correct alternative.
The point biserial correlation appears in both the "A lternative S tatistic" section and the
"Item Statistic" section. In Figure 1, the point biserial values fo r item 26 are negative fo r
a ll distractor responses and positive fo r the correct response. This shows that
proportionally more low scoring examinees chose alternative responses than high scoring
examinees and fo r the correct response proportionally more high scoring examinees
chose the reqxmse than low scoring examinees. This is how we want the responses to

50

behave. The label "O ther" in the column "A lternatives" w ill include data not appearing
in the 5 altanatives listed on the test. In most instances this w ould be students who
selected two or more ahematives fo r one item.
A lte rn a tiv e S ta tis tic s

Item S ta tis tic s
Item -Scale N per
C orrelation Item

A lt e r n ative

Proportion
Endorsing

536

1
2
3
Other

.92
.08
.00
1.38

.36

616

1
2
3
Other

.46
.54
.00
1.07

1.276

.66

1073

1
2
3
4
5
6
Other

.18
.25
.30
.24
.04
.00
.19

1.211

.67

967
Page 6

1

.24

Seq.
No.

scale
-Item

Item
Mean

Item
v a r.

31

2-1

1.080

0.074

.29

32

2-2

1.537

0.249

33

2-3

2.713

34

2-4

2.480

Key

Ffgw g 2 Sample output fo r I^m an & r the short answer items on the grade 5 D M AT.
The output shown in Figure 2 is taken from the Iteman analysis o f the grade 5
test, items 31,32,33 and a part o f 34. As mentianed earlier, the data in this analysis were
divided into two parts —the dichotomous items (M ultiple Choice) and the m ultipoint
items (Short Answer). The program assigned each item an identiGer as noted in the
column entitled "Scale-Item ". In Figure 2, item 31 is identiGed as 2-1 because it is the
Grst item in the second part, the mulGpoint secGon. The column "Item Mean" calculates
the average response. For this analysis, i f an examinee was missing data fo r an item then
the item was excluded Gom the calculaGon o f the mean. For item 31, the average

51

■f

response is 1.080 or very nearly 1. These data come 6om the 536 examinees ib r whom
the data was included.
This calculation is, however, somewhat confusing because when the answer
sheets were completed the raters used alternative 1 to represent an incorrect attempt at
solving the problem w id i a score o f 0, alternative 2 to rq)resent a score o f 1 and
alternative 3 to represent a score o f 2. The answer sheets were leA blank whenever an
examinee did not even attanpt to solve the problem. There&re, fo r this item an ''Item
Mean" o f 1.080 6 r 536 examinees means that very few examinees tried the item and o f
those w lx) did most (92%) got this item completely wrong, a few (8% ) got it partially
correct and no examinees got this item completely correct. Iteman provides this
inform ation in the section labeled "A lternative S tatistic". The "O ther" category under
"A lternative S tatistic" show the number o f examinees not included in categories 1,2 or 3.
This would include a ll those students who did not even attempt the item (761 examinees).
The + under the column "K ey" shows that the scores are listed in ascending order.
Because o f the way the items were scored, the actual mean w ould be 1 point less than is
noted in the column "Item Mean", the "Item Variance" would remain unchanged as
would the "Item-Scale C orrelation".
The other two categories o f inform ation under the heading "Item S tatistic" are
also calculated based on the items that include data. The category "Item Variance"
includes a calculation o f the variance o f the responses to the item and "Item-Scale
C orrelation" includes a Pearson correlation between responses to the item and mean
scores fo r the examinees.

52

This program, developed by J. O. Ramsay o f M cG ill U niversity, was designed to
provide inform ation in graphical farm about questionnaires and conventional exams
(m ultiple choice and short answer test items). TestGraf makes use o f statistical methods
to produce estimates o f examinees' responses.

;\Te6tGraf98\tgrafgr5 data

Ite m 26

Probability
5%

50%

25%

95%

75%

1.0

^ 5

0.

0.

0.

12

16

20

28

E xpected Score

Figwre j. Sample o u ^u t o f TestGraf 6)r item 26 o f the grade 5 DM AT.

53

TestGrafwas used to produce response curves (IC C s) fa r each test item by plotting the
probability o f response 6)r each response option along a range o f expected scores
(measured in whole tests scores and paicentile ranking). This results in response curves
as shown in Figure 3.
Figure 3 shows a typical TestGraf ou^mt to r a test item —in this case Item 26 o f
the grade 5 test. The correct response, response 5, is favored by more proGcient students
and fo r students scoring in the G flh percentile and higher has an overall slope Gom the
low er le ft comer o f the graph (lowest test scores) to the igiper rig h t comer (highest test
scores). Alternate responses, cm the other hand, are more favored by less proGcient
students and exhibit an overall slope Gom a high point on the leG to a low er point on the
rig ^t. In Giis sample we see alternate response 4 follow ing ju s t this pattern. The alternate
response 2 acts as a great distractor even among low scoring examinees and is preferred
up to the fortieth percentile. The alternate responses 1 and 3 are chosen by & w o f the
students at t k lowest scoring level but gain preference quickly and by the fifth percentile
are also preferred responses u n til about the fbrdeth percentile where the correct response
becomes the preGrred response.
In TestGraf^ the measure o f difG culty is deGned as that point along the expected
scores at w hich .50 o f those examinees are expected to get the item correct. In this
example, about .50 o f the examinees at the sixty-G für percenGle are expected to get the
item correct The level o f difG culty is measured as .65 (an expected score o f 23) and this
test ita n could be considered o f moderate difG cu l^.
In TestGraf^ the slope o f the response curve at the point o f .50 probability gives
an irKÜcaüon o f the degree o f diamminaGon exhibited by the item . A steep slope would

54

suggest a high degree o f discrim ination. A shallow slope w ould suggest a low degree o f
discrim inatioa. In Figure 3, the steep slope o f the curve at the point where tl^ probability
is .50 diows that this item has a relatively high degree o f discrim ination. In fact, fo r this
item , the slope o f the curve is uniform ly steep through a wide range o f expected scores.
This indicates that this item has a h i^ degree o f discrim ination through a wide range o f
examinee proGciency.
TestGrafprovides additional inform ation by slmwing how test item may be
behaving at particular points o f the response curves. One should be careful, however, at
the extreme ranges o f the data because die numbers o f examinees providing inform ation
fo r creating the ICC are, at these points, small.
For comparison purposes, the item shown in Figure 3, is the same iterh used to
examine the Iteman o u ^u t in Figure 1. From Figure 1 we found that item 26 is a
moderately difG cult item w ith good discrim ination. The difG culty (p), the discrim ination
index (D ) and the point biserial
response 1) = -. 14,
-.13,

values are: p = .43, D = .59, rp& (fo r alternative

(fo r alternative response 2) = -.26,

(fo r alternative response 4) = -.03 and

(fo r the correct response 3) =

(fo r alternative response 5) = .49.

The Bigsteps program is a Rasch-Model computer program constructed by John
M . Linacre and Benjamin D W right and available through MESA Press. The version
used fo r this analysis was 2.61. Linacre and W right (1996), in the user's guide to the
program, indicate that the program is designed to provide an analysis that balances
statistically the ef&cts o f item difG culty and person a b ility. In so doing it provides
another means o f examining the test items. The Bigsteps procedure uses PROX (normal

55

^p ro xim a lio n ) and UCON (uncondititm al maximum likelihood, jo in t maximum
likelihood) estim ation methods to obtain progressively closer and closer approximations
o f the test diÆ culty/ability regression curve (Linacre & W right, 1996).
For th is analysis Bigstep was programmed to begin w ith a central estimate fo r
each person measure, item calibration and rating scale category step calibration. A rough
convergence to the observed data pattern was obtained by several iterations o f the PROX
algorithm . The UCON algorithm was dien used to establish more exact estimates,
standard errors and h t statistics. The UCON method that was used involved progressive
proportional curve htting to hnd improved estimates. The m esures are reported in
Logits (log-odds units) and the h t statistics, In h t and O utht, are reported as mean-square
residuals (these have ^)proxim ate chi-square distributions). These mean-square residuals
are normalized through a cube root function to provide a t-statistic fo r assessing the
probability o f a response.
—tw o/w gm eter ynWeZ.
Ascal is (me o f the analysis programs available through Assessment Systems
Corporation, 1984 and is part o f the M icroC at Testing System. The program used fo r this
analysis was Ascal ™ version 320. The authors o f the User's Manual indicate that
Ascal is an Item Response Theory calibration program which uses examinee responses to
provide estimations o f iq) to three test parameters: discrim ination, difG culty and low erasymptote (psuedo-guessing). The estim ation procedure involves dividing the data into
20 categories, called hactiles. A curve approxim ating a normal distribution is used fo r
the in itia l estimation. Each item 's lack o f h t to the model is established using chi-square
statistic. The program repeats calculations through a series o f iterations to generate a

56

curve that progressively a p ^xim a le s the distribution o f items - the ICC. In the 2
parameter model, the item characteristic curve is used to estimate discrim ination and
difG culty. The lower-asymptote (psuedo-guessing) parameter is elim inated by setting the
number o f response alternatives to zero. This program is lim ited to analysis o f
dichotomous items only.
Æ cal —rAree pwamerer /Mode/.
Ascal ™ version 320 was the program used to analyze the diree parameters:
discrim ination, difG culty and lower-asymptote (psuedo-guessing). The authors o f the
User's Manual indicate that the procedure used in this model is the same as was noted fo r
the 2 parameter model, except that a th ird parameter, the lower-asymptote, is now
included in the estimations. For this program, the in itia l estimate fa r Ae lower-asymptote
parameter is the reciprocal o f the number o f alternate responses fo r the items. Because
there were 6ve choices fo r each o f the m ultiple choice items, the in itia l value was set at
0.200. This program, as w ith the 2 parameter model, is lim ited to analysis o f
dichotomous items only.

The SPSS program that was used in this analysis was SPSS v 11.0. It was used to
calculate summary statistics on aU data related to term and Gnal grades. This included
frequencies, t-tests and a ll correlaGons.

57

CHAPTER 4 - RESULTS
Analysis o f Test Items

The in itia l analysis conducted was a classical analysis using the program Iteman.
The next analysis used TestGraf to provide an assumption free analysis most closely
associated w ith IR T. TestGraf used the test data to produce ICCs fo r each test item . This
inform ation was used to supplement the findings o f Iteman and to reveal some o f the
properties o f the test items fa r speciGc groups o f students - often those at the more
extreme ranges such as the lowest and the highest scoring. Three programs were then
used to provide logistic item response analysis. Bigsteps was used to gain additional
inkrm ation about the difG culty and discrim ination o f the dichotomous and short answer
items. This program provided a one parameter or Rasch analysis. A two parameter and
three parameter analysis was obtained using the program Ascal. Each model was used to
provide additional inform ation about the test items by estimations o f difG culty,
discrim ination and guessing when comparing test items. In the follow ing sections TU
describe the inform ation about the test items that is gained from each.
Gradk J

In Table 1, the values calculated by Iteman fo r item difG culty, the item
discrim ination index and point-biserial correlations fa r each o f the mulGple choice test
items are shown. Values printed in bold p rin t are fo r those o f the keyed correct
responses. A high p-value (near 1) fo r the keyed correct response indicates a relatively

58

easy item. S im ilarly, a low p-value (near 0) 6)r the keyed correct response indicates that
the item was relatively d ifh c u lt
Table 1

, Item

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

1

D

2

'

P

1 /-pt

P

. .. ^

.43
.37
.01
.09
.04
.56
.12
.44
.03
.47
.15
.33
.01
.35
.07
.50
.07
.44
.89
.22
.51
.51
.04
.21
.18
.49
.51
.53
.19
.33
.17
.57
.07
.52
.54
.11
.47
.53
.03
.20
.01
.44
.06
.35
.01
.22
.03
.34
.02
.60
.18
.59
.54
.08
.15
.68
.56 .65
.43 1 .04

-.17
-.1(1
-.12:
-.26
-.19
-.20
- .iir
-.20
-.12:
J4
.44
-.14
-.09
.43
-.04
-.22
-.23
-2 0
.45
-.1^1
-.15
-.27
-.08
-.08
-.07
-.14
-.21
-.26
.48
-.09

.04
.01
.03
.04
.74
.03
.12
.18
.06
.00
.14
.03
.06
.10
.31
.08
.04
.54
.14
.02
.29
.03
.89
.79
.08
.26
.69
.06
.04
.05

-.12:
-.13
-.13
-.18
.47
-.23
-.21
-.1<)
-.22
-.09
-.22
-.1:1
-.09
-.12:
-.M
-.16)
-.18
.47
-.23
-.12:
-.18
-.15
37
.41
-.18
.50
-.17
-.11
-.12

3
. P 1
.12
.95
.08
.74
.05
.78
.80
.08
.10
.09
.10
.88
.48
.06
.04
.54
.12
.11
.29
.03
.23
25
.01
.03
33
.04
.08
.07
.03
.44

Responses
1
4
1
5
1 P
fp» 1 P
-.08
26
-.30
.44
-.22
37
38
-.11
-.19
-.24
-.Ij)
32
.42
-.15
-.11
.48
.16
-.12:
-.12:
-.15
-.06
-.05

-.07
32
-.13
-.15
-.14
-.11
38

59

39
.02
.14
.04
.05
.02
.03
36
.02
.00
.11
.02
.15
.04
37
.07
.70
.09
.04
.89
.08
.03
.03
.12
.10
.06
.06
.07
.16
.13

33
-.11
-.25
-.1()
-.If)
-.1()
-.11
.43
-.09
-.11
-.16)
-.13
-.145
-.I"?
29
-.1*7
.49
-.1/7
-.13
32
-.15
-.11
-.21
-.3(1
-.27
-.03
-.13
-.24
-.1!)
-.11

.01
.01
.68
.04
.12
.01
.03
.08
.71
.01
.10
.03
.10
.24
.09
.11
.05
.09
.05
.02
37
32
.05
.01
24
.43
.06
.61
.08
.30

Other
fp6
P
-.11
-.141
.50
-.1^1
-.25
-.1()
-2 0
-.15
.44
-.13
-.04
-.1()
-.23
-.19
-.06
-.11
-23
-.14:
-.17
-.07
.40
32
-.145
-.09
.1!(
.49
-.23
.57
-.24
-.11

.01
.00
.02
.02
.01
.01
.01
.03
.03
.01
.04
.01
.04
.03
.01
.02
.01
.05
.02
.01
.03
.02
.02
.02
.02
.03
.03
.04
.04
.04

-.1::
-.11
-.11
-.1:7
-.09
-.18
-.145
-.145
-.24
-.145
-.17
-.21
-.16
-.18
-2 0
-.21
-.24
.18
-.25
-.21
-21
-.22
-.23
-.26
-2 4
-2 5
-.27
-.26
-.27
-.27

H igh values fa r the alternate responses, depending on the degree o f discrim ination, could
indicate that the response acts as an excellent distractor or that the item is flawed.
Consider item 15 in Table 1. The correct response is the 6)urth response (values
are shown in bold p rin t). Because the proportion correct is .37, this is tied w ith item 21
as the more diJBGcult items on the test Response 2 is the preferred distractor w ith .31 o f
the students selecting i t Response 1 also appears popular w ith .19 o f aU students
selecting i t Responses 3 and 5 had respectively .04 and .09 o f the students select them.
The category labeled "O ther Response", which 5)r this item is .01 o f the students, would
include those students who selected more than one response & r an item.
Consider, & r comparison, item 2 in Table 1. The correct response is the third
response. This tim e the proportion correct is .95. The item can be considered a very easy
item w ith almost a ll the students (hoosing this response. Responses 1 ,2 ,4 , and 5 were
respectively chosen by .01, .01, .02 and .01 o f the students. For this item , there were no
students included in the "O ther Responses" category.
In going horn most d iffic u lt to least difB cult the items w ould be placed in the
follow ing order: 1 5 ,2 1 ,1 ,2 6 ,3 0 ,1 9 ,1 3 ,1 1 ,1 4 ,2 5 ,1 6 ,1 8 , 8 ,2 8 ,2 2 ,2 9 ,3 ,2 7 ,1 7 ,9 ,4 ,
5 ,6 ,2 4 ,7 ,1 2 ,1 0 ,2 0 ,2 3 , and 2.
Iteman calculates endorsement rates fo r each response fo r three diSerent
groupings o f the students —total score, low score and high score. This inform ation helps
in analyzing item performance. The data in Table 1 indicate that no item is so diB icult
and exhibits so little discrim ination that it appears students are only guessing at it. This
would be suggested where students selected tw o or more responses equally and where
there was a lack o f discrim ination showing high a b ility as w ell as low a b ility students

60

were selecting the possible responses. A n examination o f the data, especially fo r the most
difB cult items, shows that this is not the case.
Table 1 also includes the values h)r the discrim ination index and the point-biserial
correlations fo r each reqwnse as calculated by Iteman. A positive value fo r the pointbiserial indicates that low scoring students chose the item less frequently than high
scoring studmts. This is a desirable feature h>r a ll the correct responses. A negative
value indicates exactly the opposite - low scoring students chose the item more
Aequently than high scoring students. This is what good distractors should do and so a
negative point-biserial value is desired 6)r a ll alternate responses. Small values
(discrim ination index and point-biserial) 5n an item show that the item discriminates
poorly between high and low scoring examinees w hile large values show a good
discrim inatirm between the two groups.
When we use the four categories o f discrim ination discussed in chapter 3, the
items can be categorized as follow s:
#

Great D iscrim ination (D > .40)- 3 ,4 , 5, 8 ,9 ,1 1 ,1 3 ,1 4 ,1 6 ,1 7 ,1 8 ,1 9 ,2 1 ,2 5 ,2 6 ,
27,28,29 and 30,
Average D iscrim ination (.30 < D < .40)- 1 ,6 ,7 ,1 5 ,2 2 and 24,
Acceptable D iscrim ination (.20 < D ^ .30)- 10,12,20 and 23 and
M arginal D iscrim ination (D ^ .20)- 2

#
#
#

Looking closer at the items that display the least discrim ination, we Snd that the five
items are also the easiest items in Table 1. This means that a large proportion o f the
students (low scoring as w ell as high scoring) got these items correct, A low level o f
discrim ination can be e^qpected. Proportion selected fo r these items, comparing high
scoring and low scoring students, is as follow s:
#
#

Item 2- low scoring is .89 and high scoring is .98,
Item 10- low scoring is .75 and high scoring is .97,

61

# Item 12- low scoring is .75 and high scoring is .96,
# Item 20- low scoring is .77 and high scoring is .97 and
# Item 23- low scoring is .74 and high scoring is .96
These scores show that even though the measure o f discrim ination overall is low , the test
items s till did what they were supposed to, that is, more high scoring stuiknts got the
items correct and lew er low scoring students got them correct.
For a Gnal comparison o f values in Table 1, the discrim ination index D and the
point bisorial correlations

fo r the correct response are compared. For most items the

two values are comparable. In some instances they are the same (this is coincidental).
For items 2, 10,12,20,23,26 and 28 the difkrence between the tw o indices is .10 or
more. An examination o f the items shows that fa r items 2 ,1 0 ,1 2 ,2 0 , and 23 the value o f
D is quite low and the values ofr^^ are higher. These are easier items and the diSerence
between the proportion o f high scoring examinees who got these items correct and the
proportion o f low scoring examinees who got these items correct is m inim al. For items
26 and 28 the opposite is the case —the value o f D is high and the values o f

are lower.

Figures 4 and 5 show sample output generated by the Testgraf program. They
display data h)r items 2 and 15 respectively. In Figure 4 we see the response curves
calculated fo r item 2. This item was identiGed in the previous secGon as the easiest item
on the grWe 5 test. It is included here to show the type o f curve Testgraf generates fo r
this type o f item. For comparison purposes, the degree o f diŒ culty (p), discriminaGon
index (D ) and point-biserial proporGons
= .95, D = .09,

(response 1) = -.10,

(response 4) ==-.11 and

Gom Gie Iteman analysis fo r this item are: p
(response 2) = -.13,

(response 5) = -.14.

62

(response 3) = .26,

The most prominent characteristic o f the graph in Figure 4 is the response curve
shown 6)r response 3. This is the correct response and preferred by over three quarters o f
the students by the Gfth percentile. B y the tw enty-fîAh percentile almost aU the students
selected this response. The curve is more or less fla t beyond this p o in t

] : \T e B t G r a f 9 8 \t g r a f g r 5 d a ta

Ite m
P r o b a b ilit y
5%

50%

25%

75%

95%

1.0

0.8

0.6

0 .4

0.2

: i

_,

8

12

16

:

24

E xpected Score

Figure 4. The ICC produced by TestGraf fo r item 2 o f the grade 5 DM AT.

63

32

g

}
36

Note that w hile the point biserial was low and the D index was ^proaching zero, the ICC
produced by TestGraf showed a strong discrim ination w ith in the Grst quartile. Beyond
the firs t quartile the ICC flattens out and discrim ination is negligible. This response
curve is representative o f the type o f curve Testgraf produced fo r the easier items on the

];\T e s t G r a f 9 B \t g r a f g r 5 d a ta

Ite m 15
P r o b a b ilit y
5%

25%

50%

75%

95%

0

8

.6

.4

2

.0
8

28

12

Expected Score

Ô. The ICC produced by TestGraf fa r item 15 o f the grade 5 D M AT.

64

32

3€

grade 5 test. In particular, response curves 6 r items 10,12,20,23 and 24 are very sim ilar
to what is shown is Figure 4.
IuguK5sh<nvsdœn%%x%Ke(%Hvescah%dahaiforhemI5. hem 15ivasidendGed
ititlisjprevicrus sectioiiEwsixaecrFtlKsrrwostilifGxailt items (%nlAie gfzwie 5 test, ftrr
com parison,^, D and
(response 2) = -.14,

values are as fo llo w s :^ = .37, D = .33,
(response 3) = -.11,

(reqxrnse 1) = -.04,

(response 4) = .29 and

(response 5)

= -.06. In the ICC produced by TestGraf we observe that up to the SAieth percentile
most alternate responses appear preferred. Response 2 is notably the strongest distractor.
Beyond the AAieth percentile the correct response, numba^ 4, makes noticeable gains.
This graph in addition to showing that the item is a d i& c u lt item also shows high
discrim ination between high and low scoring students. In Sgure 5 we see that w ith
TestGraf the greatest level o f discrim ination occurs at about the seventy-Afth percentile.
This item appears to discrim inate best fo r high scoring students. This response curve is
representative o f the type o f curve Testgraf produced 6)r the more dilB cu lt items on the
grade 5 test. Response curves fa r item s 1 1 ,1 3 ,1 4 ,1 6 ,1 8 ,1 9 ,2 1 ,2 6 and 30 are very
sim ilar to what is shown is Figure 5.

For the grade 5 data, the Bigsteps program ran the in itia l estimates w ith the
PROX algorithm and required tw o iterations to arrive at an acceptable level o f
convergence. The UCON algorithm was programmed to conduct 25 iterations and by the
Anal iteration the maximum lo g it change had decreased to the point ^\here the Anal
change was .0007 and the Anal lo g it value was -.0239. A ll 1297 persons and 36 items
were included A)r aU estimaAons; that is, no persons or items were excluded by Bigsteps

65

due to lack o f f i t This analysis was programmed to include categories 6 r m ultiple
choice items as w ell as short answer items so that the difB culty and discrim ination levels
o f the short answer items could be examined.
The Bigsteps program includes in the output a sununary o f the test statistics fo r
pa-sons, items and global Gt. These are included in Table 2.
Table 2
Grodle J

f/n n g

o f the M zm wes —fe r f oMs

aw f Ae/Mf
Raw
Score

Count

Persons
26.7
36.0
Mean
8.5
.0
S.D.*
Item s
963.5
1297
Mean
488.8
0
ST).*
* MNSQ —mean squared

Measure

M odel
Error

-2.74
.80

.30
.04

InGt
MNSQ* ZSTD*
.88
34

.78
.05
.00
.94
.01
.35
ZSTD - standard score

-.5
1.2

OutGt
MNSQ* ZSTD*
.83
.27

-.7
1.0

-5.2
.83
-4.3
4.8
.38
5.7
S.D. - standard deviation

In Table 2, the calculated summary statistics fo r pasons and items are shown.
The mean and standard deviations were calculated. Estimated measures as w ell as
standard errors o f die estimates w ^e also calculated. O f particular interest in this
analysis are the statistics fo r items. The ioGt and outGt statistics are also shown.
The InSt statistic is an infbrmation-WGi^ted fit statistic which is sensitive to
unexpected behaviors which affect responses to items that are near to the person's ability
level. The mean-square inGt statistic is intended to be 1. For this analysis the inSt, for
the items, was .78 and the standardized score was -5.2. The outGt statistic is an outlierSŒsitive Gt stadstic and is sensiGve to unexpected behaviors by persons on items far

66

Tables
q/"Grodk J D M 4T

mg ^zggfgpj, A^eafwe^ q/"Dt^cwZfy, DMcrzmzwz^oR

W Kffzf
Raw 1Measure
Item
Number Score
505 1
1
1.24
1228 1
2
887
.03
3
961
4
-.19
962
-.19
5
1008
-.33
6
1041
-.42
7
725
.52
8
922
-.08
9
1151
10
-.71
657
11
.74
1135
12
-.67
626
.84
13
684
14
.65
475
l.S 'l
15
699
16
.60
909
-.04
17
703
.59
18
613
.88
19
1156
20
-.73
480
21
1.32
808
22
.27
1148
23
-/71
1025
-.38
24
682
.66
25
558
26
1.06
898
.00
27
796
28
.30
29
845
.16
570
1.02
30
579
.99
31
947
32
-.15
2911
-2.67
33
2393
-2.30
34
1327
l.K )
1 Î5
1673
-1.63
36
*M N S Q -m ean s(luared

Error

InGt
MNSQ* ZSTD*
.06
.96
-1.6
.05
.23
-9.9
.66
-8.3
.05
.61
“8.6
.05
.05
.58
-9.3
.58
-8.6
.05
.05
.51
-9.9
.06
.83
-5.0
.05
.66
-7.8
.34
.05
-9.9
.85
.06
-4.9
.05
.38
-9.9
.06
.87
-4.4
.06
.85
-4.9
.06
.98
-0.6
.06
.80
-6.2
.62
.05
-9.0
.82
.06
-5.6
.06
.86
-4.8
.34
.05
-9.9
.90
.06
-3.6
.87
.06
-3.5
.33
-9.9
.05
.52
-9.9
.05
.79
.06
-7.0
.06
.85
-5.6
.65
.05
.69
-8.7
.06
.06
.72
-7.3
.06
.91
- io
.06
1.07
2.3
2.29
.05
9.9
.03
i. ii;
4.9
.03
1.21
5.2
.04
1.7
i. i Î 1
.03
.73
-5.7
Z S T D - standard score

67

Oum t
MNSQ* ZSTD*
-.4
.99
.25
-9.9
.68
-7.9
.63
-8.4
.60
-9.1
.60
-8.5
.54
-9.8
.84
-4.5
.66
-7.9
.38
-9.9
.87
-3.8
.42
-9.9
.90
-3.1
.86
-4.1
1.09
2.2
.82
-5.5
.63
-9.0
.83
-5.0
.87
-4.1
.37
-9.9
.91
-2.3
.88
-3.0
.36
-9.9
.55
-9.9
.78
-6.7
.84
-4.9
.65
-8.5
.69
-8.7
.72
-7.2
.95
-1.6
1.05
1.5
2.28
9.9
1.45
8.9
1.63
9.9
1.34
4.9
1.1:2

PointBiserial
.26
.23
.42
.37
.40
.29
.31
.36
.36
.27
.37
25
.35
.37
.22
.40
.44
.39
.37
.25
.34
24
.30
.34
.44
.41
.42
.49
.41
.31
.42
.38
.50
.54
.41
.38

6om their a b ility level. For A is analysis the o u tfit, fo r the items, was .83 and the
standardized score was -4.3.
O f particular interest in this analysis is how w ell the short answer items fit this
model and how they relate to the m ultiple choice items. The level o f diEGculty in Table 3
is shown under the column entitled ^Measure". The level o f discrim ination is shown
under the column entitled "P oint-biserial". The Bigsteps program has calculated the level
o f difS culty ordered ûom most d iffic u lt to least d iffic u lt as: 1 5 ,2 1 ,1 ,2 6 ,3 0 ,3 1 ,1 9 , 13,
11.25.14.16.18, 8 ,2 8 ,2 2 ,2 9 ,3 ,2 7 ,1 7 , 9 ,3 2 ,4 ,5 ,6 ,2 4 ,7 ,1 2 ,2 3 ,1 0 ,2 0 ,2 ,3 5 ,3 6 ,3 4
and 33. S im ilarly, the level o f discrim ination (Bigsteps uses point biserial) ordered 6om
most discrim inating to least discrim inating is: 34 ,33 ,28,1 7,2 5, 3 ,2 7 ,3 1 ,2 6 ,2 9 , 35, 5,
16.18, 32, 3 6 ,4 ,1 1 ,1 4 ,1 9 , 8, 9 ,1 3 ,2 1 ,2 4 ,7 ,3 0 ,2 3 ,6 ,1 0 ,1 ,1 2 ,2 0 ,2 2 ,2 and 15.
The Bigsteps pngram also identiGes those items that are not a good Gt to the
overall response characteristic curve. Mean squared values in the in fit column o f Table 3
that are close to 1 demonstrate a good 6 t Ita ns that have a mean squared value
approximately 0.4 more or 0.4 less than this are not a good Gt. NoGce that the lowest
value o f the standard score (-9.99) and the highest value o f the standard score (9.99)
appear to correspond to a mean squared value o f approximately ±0.4. The items that do
not appear to be a good Gt are: 2 ,5 ,6 ,7 ,1 0 ,1 2 ,2 0 ,2 3 , and 24.
ZogMtic JtcTM

Two farom etcr AfbdeZ o f

Table 4 is included here and displays the grade 5 data fo r both the 2 parameter
and 3 parameter logistic models as realized by the program Ascal.

68

Table 4
logM/ic Aem

q/^GrfWe J D M 4 T [/$mgj4fcoZ Two a W 77%ree
AAzaawef q/^D^cw/fy, Difcr;mz»af;on, fwedb-gw effm gaW Fzf

r
1 Item

1

2 Parameter M odel
6
1
a
1
.725
-2.811
-.746
-1.111
-1.038
-1.520
-1.586
-.286
-1.010
-2.214
-.027
-2.267
.089
-.124
1.048
-.161
-.813
-.182
.128
-2.339
.664
-.817
-1.943
-1.418
-.094
.285
-.781
-.391
-.639
.328
**
p<.01

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
*;X .05

.406
.739
.841
.715
.808
.576
.637
.586
.653
.670
.621
.598
.576
.626
.400
.732
.860
.684
.655
.634
.569
.400
.812
.696
.875
.811
.848
1.115
.515

/

1
1

29.137*
33.128*
27.860
21.174
25.241
18.310
38.621**
30.676*
15.447
15.127
21.593
18.320
28.604
48.203**
30.786*
18.077
13.431
15.395
22.088
18.495
20.121
19.930
29.267*
25.559
17.148
20.601
11.016
10.793
22.536
30.437*

6
1.067
-2.673
-.381
-.648
-.672
-1.262
-1.169
.481
-.851
-2.017
.360
-2.083
.580
.672
1.407
.006
-.531
.274
.384
-2.180
.974
-.151
-1.938
-1.238
.168
.510
-.575
-.171
-.083
.922

3 Parameter M odel
1
a
1
c
1
.521
.736
1.076
.874
.980
.586
.713
1.308
.658
.678
.819
.575
.925
2.166
1.060
.814
.969
1.018
.772
.612
.871
.443
.748
.684
1.048
1.038
.871
1.31^
1.077
1.336

.120
.180
.190
.230
.190
.150
.210
.310
.110
.170
.170
.190
.210
.340
.250
.080
.160
.210
.110
.190
.150
.210
.140
.150
.130
.110
.140
.130
.270
.270

23.914
48.7Ï9**
18.482
18.898
10.347
13.887
19.110
18.622
23.897
18.004
19.088
22.512
17.892
26.084
14.780
24.231
15.098
19.861
10.258
14.698
17.122
14.932
29.180*
17.075
14.165
20.370
12.343
10.985
7.960
12.245

For the 2 parameter model, the program was directed to run through twenty
iterations however only six were needed to achieve a maximinn parameter change o f
0.00657.

69

These items when ordered by degree o f d ifh cn lty going 6om most difR cnlt to
least difB cultare: 1 5 ,1 ,2 1 ,3 0 ,2 6 ,1 9 ,1 3 ,1 1 ,2 5 ,1 4 ,1 6 ,1 8 , 8 ,2 8 ,2 9 ,3 ,2 7 ,1 7 ,2 2 ,9 ,
5 ,4 ,2 4 ,6 , 7 ,2 3 ,1 0 ,1 2 ,2 0 and 2. These items when ordered by discrim ination going
&om greatest to least are: 2 8 ,2 5 ,1 7 ,2 7 ,3 ,2 3 ,2 6 ,5 , 2 9 ,2 ,1 6 ,4 ,2 4 ,1 8 ,1 0 ,1 9 , 9 ,7 ,2 0 ,
14,11,12, 8 ,1 3 ,6 ,2 1 ,3 0 ,1 ,1 5 and 22.
A n examination o f the chi-square values in table 4 reveals a range horn a low o f
10.793 fo r item 28 to a high o f48.203 fo r item 14. From a table o f critica l values fo r chisquare we get the Show ing values: 28.869 fo r a .05 level o f signiGcance and 34.805 fo r a
.01 level o f signiGcance. Using this, items 1,23, 30, 8, 15,2, 7, and 14 would lie outside
the .05 conGdence interval and items 7 and 14 w ould lie outside the .01 conGdence
interval. However, the program authors point out in the program manual that i f there are
a large number o f examinees, a ll items could show a statistically signiGcant lack o f Gt.
There are 1297 examinees fo r this te st The authors suggest that a better criterion G)r
measuring the degree o f Gt is chi-square values that are considerably larger than those fo r
other items. When they are ordered, the greatest dif&rence between consecutive values
is 9.582 found between items 14 and 7, the two items w ith the highest chi-square values.
O verall, this is a relaGvely small diGerence and so it appears that no chi-square values are
considerably larger than others fo r this data. AU items ^rpear to Gt the model.
Zogfstzc ./tern Kesporwe vlnafysis Usmg

f

v

f

s

c

a

f

Table 4 also includes the grade 5 data fa r the 3 parameter model estimates
produced by Ascal. The program was again set to run through twenty iterations. This
tim e it ran eleven iteraGons and stopped when the maximum parameter change was
0.00322.

70

These items when ordered by degree o f d ifh cn lty going 6om most diÆ cult to
least difB cultare: 1 5 ,1 ,2 1 ,3 0 ,1 4 ,1 3 ,2 6 , 8 ,1 9 ,1 1 ,1 8 ,2 5 ,1 6 ,2 9 ,2 2 ,2 8 ,3 ,1 7 ,2 7 ,4 ,
5 ,9 , 7 ,2 4 ,6 ,2 3 ,1 0 ,1 2 ,2 0 and 2. These items when ordered by discrim ination going
6om greatest to least are: 14,28,30, 8 ,2 9 ,3 ,1 5 ,2 5 ,2 6 ,1 8 , 5 ,1 7 ,1 3 ,4 ,2 1 ,2 7 ,1 1 ,1 6 ,
19,23,2, 7 ,2 4 ,1 0 ,9 ,2 0 ,6 ,1 2 ,1 ,2 2 . These items when ordered by the guessing
parameter going 6om the greatest degree o f guessing (the highest values) to the least
degree o f guessing (the lowest values) are: 1 4 ,8 ,3 0 ,2 9 ,1 5 ,4 , 13, 1 8 ,2 2 ,7 ,3 , 5,12,20,
2 ,1 1 ,1 0 ,1 7 ,2 1 ,2 4 ,6 ,2 7 ,2 3 ,2 5 ,2 8 ,1 ,2 6 ,1 9 ,9 and 16.
The degrees o f freedom fo r the 3 parameter model o f Ascal are 17 because the
model begins its estimates by breaking the data into 20 fractiles. The critica l values fo r
chi-square w ith 17 degrees o f Beedom are 27.587 (signiScance level .05) and 33.409
(signiGcance level .01). When we look at the chi-square values in Table 4 we Gnd that
the values range from a low o f 7.96 fo r item 29 to a high o f 48.719 fa r item 2. Two items
have chi-square values outside the .05 signiGcance level, item 23 - 29.18 and item 2 48.719. Item 2 also lies outside a .01 conGdence interval. However, as noted w ith the 2
parameter model, die program authors point out that i f tha e are a large number o f
examinees, a ll items could show a staGsdcally signiGcant lack o f Gt. There are 1297
examinees fo r this test and a better criterion fo r measuring the degree o f Gt is, therefore,
chi-square values that are considerably larger than those f ir other items. The difference
betw eai consecuGve chi-square values when they are ordered is in almost a ll cases small
(3.096 being the largest). The excepGon is the diGerence between items 2 and 23 (the
tw o largest values). The diGerence in this instance is 19.539. This diG^ence in overall

71

terms is small, but because it is considerably diG erait 6om a ll the others and item 2 can
be held suspect. A ll items, w ith the exception o f item 2, ^ipear to 6 t this model.

There appears to be general agreement between the item analysis models about
the order o f difB culty fo r the items in the grade 5 tesL W id i each program used, item 15
was identihed as the most d ifh c u lt S im ilarly, w ith each program item 2 was found to be
the easiest o f the M ultiple Choice items and w ith Bigsteps we found that the Short
Answer items 33,34, 35, and 36 were judged easier again than item 2. Items 31 and 32
were 6>und to be moderately d ifh cu lt w ith item 31 follow ing ita n 30 (compared w ith the
order taken from Iteman) and item 32 follow ing item 9 (compared again w ith the order
taken 6om Iteman). A measure o f die difG culty o f these two items is found in the
numbers o f examinees whom attempted them. O f the 1297 examinees, there were 536
examinees who attanpted item 31 and there were 616 who attempted item 32.
Item discrim m atioii, between the item analysis models, was not quite as clear.
Where item 2 was judged to have the lowest discrim ination when using classical analysis,
in the Rasch analysis it was judged second lowest (behind item 15), in the 2 parameter
logistic model it was judged twenty-second lowest and in the 3 parameter log istic model
it was the tenth lowest. In classical analysis, item 28 was judged to have the greatest
discrim ination. It had the greatest discrim ination, when considering only the M ultiple
Choice items fo r the Rasch analysis and the 2 parameter logistic model. In the 3
parameter logistic model, it had the second greatest discriinination A llo w in g item 14.
W ith the Rasch analysis we found that the level o f discrim ination fo r the six short answer

72

items was w ith in the h a lf o f the items w ith greatest discrim ination (34 - firs t, 33 - second,
31 - sixth, 35 - ninth, 36 - GAeenth and 32 - sixteenth).
Table 24, comparing the order o f difG culty and discrim ination between these
programs, is included in Appendix E. In general, the point biserial correlations A)r
alternative responses Ib r each item in the classical analysis agreed quite closely w ith the
response curves fo r the alternative reqwnses shown in TestGraf.
Several items have been Gagged as lacking Gt in the Rasch analysis (using
Bigsteps) and die logistic model analyses (using tw o parameter and three parameter
Ascal) o f the grade 5 D M AT data. In Table 5 a summary o f the items that lacked Gt is
shown. The items in bold p rin t are Giund in a ll three analyses.
Tables
AcTMS m the Grodle 5 DA64TErMWrzng Zack

F it in the Zogirtie ttem Response

Items Lacking F it
Bigsteps

2. (.23), 23 (.33), 10 (.34), 20 (.34), 7
(.51), 24 (.52), 5 (.58), 6 (.58)
1 ,2 3 ,3 0 ,8 ,
15,2
23

Ascal, 2 Parameter
Ascal, 3 Parameter

7,14
2

a The numbers in bold p rin t are identiGed as lacking Gt in each o:'th e analyses.
We Gnd that items 2 and 23 are idendGed as lacking Gt in a ll three item response
models. Using values found in Table 1, we see that item 2 has p = .95, and D = .09, or
= .26; item 23 hasp = .89, D = .22, and f),* = .37. Both items ^p e a r to be easy w ith a
lack o f discrim ination; however, in TestG raf we Gnd that altemadve response 4 fo r item
23 acts as a great distracter among the low scoring group (p = .85 at the Grst and second
percentile).

73

Item 7 is identiGed in both the Rasch analysis and the two parameter logistic
model as lacking Gt. Again using Table 1, we Gnd tha tp = .80, D = .35 and

= .38.

Here again we Gnd that the item e^)pears easy; this tim e w ith moderate discriminaGon.
The ICC displayed using TestGraf shows desirable characterisGcs G)r a ll responses.
In the Rasch analysis, items 10,20,24,5, and 6 are also identiGed as lacking Gt.
Item 10 has p = .89, D = .22 and

= .34. Item 20 has p = .89, D = 20 and

Item 24 has

= .41. Item 5 has p = .74, D = .47 and f),» = .47. Item

= .79, D = .34 and

= .32.

6 has p = .78, D = .33 and rpt = .37. Each o f these Gve items is easy, w ith the two easiest
(10 and 20) exhibiting low discriminaGon and the three others exhibiting an average to
strong discriminaGon. In the ICCs displayed using TestGraf^ item 5 shows a slight
negaGve correlaGon Gar the correct response pnor to the GGh percentile w ith a strong
posiGve correlaGon thereafter. The other items display response curves as expected.
For die two parameter logisGc model, items 1,30, 8, and 15 show a possible lack
o f Gt. Item 1 has p = .39, D = .37 and
.38. Item 8 hasp = .56,

= .43 and

= .33. Item 30 has p = .44, D = .43 and T},* =
= .38. Item 15 hasp = .37, D = .33 and

= .29.

Three o f the four items (1,30, and 15) ^p e a r to be difG cult items w ith good or exceUent
discriminaGon. The fourth item is a moderately difG cult item again w ith excellent
discriminaGon. In the ICCs displayed using TestGraf^ we observe that fo r item 1, the
resp>onse curves G)r the correct reqionse and altemaGve show slight irregulariGes at the
low score range (below the tenth percentile) and at the high score range beyond the
ninety-GAh percenGle. The response curve in item 30 fo r altemaGve reqxtnse 5 shows a
posiGve discriminaGon iq) to the Gfleenth percenGle and then shows an expected negaGve
discriminaGon. The response curve fo r the correct response in item 8 shows a strong and

74

imexpected negative discrim ination ib r the highest scoring students. It q>pears that
something in this item is causing the strongest students to choose alternative response 5.
And hnally, the respmise curves in i t ^ 15 diows a strong distr^^tor in alternative
response 2; however, a ll response curves appear to behave as they should.
In most o f the above items it appears that lack o f fit may stem hom two sources:
an easy item lacking in discrim ination or a difS culty item w ith strong distracters. Item 8
has a response curve that can be considered atypical because responses by the highest
scoring examinees is contrary to what is expected.
Grodk 7
C/rKsicnZvdnnZysw Using ftcTnan
In Table 6, the values calculated by Iteman fa r the difB culty and the
discrim ination index and point-biserial correlations fo r each o f the m ultiple choice test
items are shown. Values printed in bold are those o f the keyed correct reqxmses. In the
column displaying proportion selected, a high p-value (approaching 1) fo r the keyed
correct response indicates a relatively easy item . In a sim ilar way, a low p-value
(^proaching 0) fa r the keyed correct response indicates that the item was relatively
difG cult. H igh values fo r the alternate responses, dq)ending on the degree o f
discrim ination, could indicate either that the response acts as an excellent distracter or
that the item is flawed.

75

Table 6
Classical Analysis o f Grade 7 D M AT Items Using Iteman, Measures o f D ifB culty and
Discrim ination

Responses
1Item 1 D

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

.43
.50
.38
.30
.20
.45
.31
.39
.35
.61
.34
.38
.18
.52
.20
.21
.37
.62
.46
.46
.40
.49
.31
.44
.36

1
1
1 P 1 r/,6
.11
.08
.02

-.19
-.15
-.!()

.76

35

.01
.03
.01
.35
.26
.11
.32
.05

-.07
-.09
-.13
-.03
-.13
-.!()
-.03
-.03

.76

.20

.07
.17
.46

-.1:1
-.11
-.04

.77

.41

.01
.06
.18

-.1C)
-.24
-.24

.69

35

.12
.05
.21

-.26
-.12
-.17

35

36

2
P
.08
.05
.04
.06

4

3

1

1

5

1 Other
|
1 p 1 rpA 1

1 p
-.19
-.19
-.17
-.15

.71
.60
.73

.41
.43
.40

-.25
-33

.88

30

.15
.05

.07
.08
.14

-.15
-.26
-.24

.70
.84
37

.41
.40
37

36
.40

37
31

.20
.06
.14
.14

-.19
-.17
-.11
-.10

.12
.17
.07
.29
.05

-.04
-.15
-.13
-.14
-.14

33

.45

.76

35

.05

-.17

.10
.07
.02
.11
.10
.16
.09

-.14
-.17
-.07
-.16
-.16
-.13
-.20

32

30

.75

34

.05
.34

-.19
-.08

.07

-.18

.61

35

.12
.16
.05
.04
.10
.10
.13

-.15
-.11
-.17
-.19
-.10
-.05
-.14

.04
.08
.19
.01
.05
.06
.05
.07
.19
.12

-.18
-.12
-.26
-.07
-.11
-.18
-.19
-.17
-.06
-.23

36
31

37
37

.03
.15
.01
.09
.02
.31

-.03
-.22
-.11
.04
-31
-.49

.67
39

.44
.43

.04

-.12

.72

.49

.04

-.17

30

.40

.06

- .iir

.04
.17
.01
.01
.01
.12
.01
.04
.12
.18
.13
38
.01
.10
.01
.02
.05
.04
.03
.13
.05
.02
.05
.13
.07

-.13
-3 2
-.13
-.05
-.1:»
-3 4
-.07
-.08
-.1C)
-.18
-.05
-.10
-3)1
-.22
-.08
-.i:r
-.16
-.05
-.18
-.01
..IS )

-.18
-.15
-.1(5
-.03

.01
.02
.01
.01
.00
.01
.01
.04
.05
.03
.02
.01
.02
.02
.01
.02
.01
.02
.02
.03
.01
.01
.02
.02
.06

-.12
-.11
-.14
-.13
-.10
-.12
-.13
-.10
-.17
-.13
-.15
-.12
-.10
-.15
-.17
-.16
-.15
-.11
- .iir
-.13
-.14
-.15
-3 0
-.16
-.16

Consider item 9 in Table 6. The correct response is the second. Because the
proportion correct is .26, this is tied w ith item 11 as the most difG cult item on the test.
Response 1 is the preferred distractor w ith .26 o f the examinees selecting it. Response 4

76

also appears popular w ith .19 o f a ll Ae examinees selecting it. Responses 3 and 5 each
had .12 o f the examinees select them and the category labeled "O ther Responses", had
.05 indicating that this p rc ^ rtio n o f students selected more than one response as their
answer. This item warrants closer scrutiny because o f the high proportion o f students
selecting each o f the responses. This item could be sufGciently difG cult that most
students sim ply guess at an answer; however, the negative point biserial values fa r a ll
distractoTS and a positive point biserial fo r the correct response indicate that this item is
behaving as desired.
Consider also item 11 in Table 6. The correct response is the fourth. It too has
.26 o f the students who chose this fo r the correct response. For the alternate responses
we Gnd: response 1 was the preferred response w ith .32 o f the students selecting it,
response 2 had .20 o f the students selecting it, response 3 had .07 o f the students selecting
it and response 5 had .13 o f the students selecting i t This tim e, the category labeled
"O ther Responses", had .02 o f the students who selected more than one response as their
answer. This item also behaves as desired as we shall see in the next section.
Consider, fo r comparison, item 5 in Table 6. The correct response is the second.
This tim e the proportion correct is .88. The item can be considered an easy item because
o f the high proportion o f students choosing this response. Responses 1 ,3 ,4 , and 5 were
respectively chosen by .01, .05, .05 and .01 o f the students. For this item , there were no
students included in the "O ther Responses" category.
In Table 6 we Gnd that the items when placed in order going Gom most difG cult
to least difG cult are: 9 ,1 1 ,1 2 ,1 6 ,2 5 , 8 ,2 0 ,1 0 ,2 4 ,1 4 ,2 ,1 8 ,1 9 ,2 1 ,6 ,1 ,2 2 ,3 ,2 3 ,4 ,
1 3 ,15 ,17,7 and 5.

77

In Table 6 the discrim ination index and the point-biserial calculation fo r each
response are also displayed. Low values would show that the item discrhninates poorly
between high and low scoring examinees. In contrast, high values w ould show that the
item discriminates w ell between high scoring and low scoring students.
Using the four categories h)r discrim ination discussed in chapter 3, the items can
be categorized as follow s:
# Great D isciim ination (D > .40)- 1 ,2 ,6 ,1 0 ,1 4 ,1 8 ,1 9 ,2 0 ,2 1 ,2 2 and 24
# Average D iscrim ination (.30 < D < .40)- 3 ,4 , 7, 8 ,9 ,1 1 ,1 2 ,1 7 ,2 3 and 25
# Acceptable D iscrim ination (.20 < D < .30)- 5,15, and 16 and
# Marginal D iscrim ination (D <.20)- 13
Looking closer at the items that display the least discrim ination, we Gnd that a ll three o f
the four items are listed among the easier items shown in Table 6. This means that a
large proportion o f the students (low scoring as w ell as high scoring) are getting these
items correct. A low level o f discrim ination can be expected fa r these items. A
comparison o f the high scoring and low scoring students is as follow s:
#
#
#
#

Item 5- low scoring is .76 and high scoring is .96,
Item 13-low scoring is .67 and high scoring is .85,
Item 15 -low scoring is .64 and high scoring is .85 and
Item 16- low scoring is .24 and high scoring is .44.

For items 5,13 and 15, even though the measure o f discrim ination overall is low , the test
items are s till doing \^ ia t they should, that is, more high scoring students got the items
correct and fewer low scoring students got them correct. Item 16, however, even though
it shows this same pattern and consequently may prove to be a w ell behaved item ,
because it can be considered a more difG cult item , has surprising low discrim ination.
This item warrants addihonal consideration. In the next section, TU be examining more
closely items 9,11 and 16 and what TestG raf shows about the items.

78

Figures 5 ,6 ,7 , and 8 show the output generated by the TestG raf program fo r
items 1, 9,11 and 16 respectively. In Figure 5 we see the response curves calculated fo r
item 1. r ve included it fo r comparison purposes because aU reqwnses appear w ell

C :\T e s tG r a E 9 8 \g r 7 tg r a f d a ta

Ite m
P ro b a b ilitz y
5%

25%

50%

75%

95%

.0

.8

6

.4

.2

20

24

28

Expected Score

f igz/re 6. The ICC produced by TestGraf for item 1 of the grade 7 DMAT.

79

36

behaved, it is moderately easy and it shows a high level o f distaiinination. M ore
speciGcally, the degree o f difR cnlty (p), discrim ination index (D) and point-biserial
proportions
2) = -.19,

& r this item are: j? = .71, D = .43,
(response 3) = .41,

(response 1) = -.19,

(response 4) = -.18 and

(response

(response 5) = -.13.

C : \T e 8 t G r a f 9 8 \g r 7 t g r a f daba

Ite m 9
P r o b a b i lit y
5%
X.U —
T

25%
)

!

50%

75%
j

95%

:

0 .8

.J

1

'

!

1

I

/"

1
1

0 .6

/

'

/

1
!
1

0 .4

0 Q

1

i'z

i

4

!

— -^1

12

/

'
____ ......i_

_ ______ ___ ___i___
8

1

16

20

24

i

1 5

'

28

32

36

E xpected Score

Mgwe 7. The ICC produced by TestGraf for item 9 of the grade 7 DMAT.

80

The response curve calculated fo r item 9 is displayed in Figure 7. This item has p
= .26, D = .35, and /pt (response 2) = .37. Although a difG cult item , item 9 appears to
behave ju s t as it should. Up to ^yproxim ately the seventy-SAh percentile, a ll alternate
response effectively distracted students; however, beyond this point the students chose
the correct response. This item demonstrates a high level o f discrim ination and
ef& ctively shows which students are high scoring.
C :\T e 8 t G r a f 9 8 \g r 7 t g r a f d a ta

Ite m 11
P r o b a b ilit y
5%
1 .0

_

0 .8

_

0 .6

_

i

;
!
!
!

\

,
1
1
I
.
^

0 .2

-/

25%
Î

X.

, \

50%
;

75%
i

95%
1

!
1
1

:
1
:

1
1
1
1
i
!

,
1
1
1
L.------/

'

'

1/

'

,

,

^7''s

?

t

i
?
!

^
1 /
1 /

i
1
1
1
1

1/

/

/

y/l
y/ i
/ i
,

...

0 ,0
4

8

12

16

20

24

28

E xpected Score

Ffgwe & The ICC produced by TestGraf for item 11 of the grade 7 DMAT.
81

32

... ' - " ï
36

The response cun'e calculated 6)r item 11 is displayed in Figure 8. This item has:
= .26, D = .34, and

(response 4) = .37. This item displays many o f the

characteristics o f item 9 as displayed in Figure 7. It also spears to behave as it should.
Up to approxim ately the eightieth percentile, alternate responses effectively distracted
students; however, beyond this point the students chose the correct reqionse. This item
demonstrates a high level o f discrirnination and e fkctive ly shows which students are high
scoring. This is the sort o f item that would w ork w ell fo r selecting the highest scoring
students.
The response curve calculated h)r ita n 16 is shown in Figure 9. This item has: ^
= .32,2) = .21, and

(response 3) = .20. This is the only item on the grade 7 D M AT that

has a distractor w id i a positive point biserial value -

(response 4) = .04. This item ,

displays some o f the characteristics o f items 9 and 11, but also appears to behave poorly
& r students above the ninety-AAh percentile. Up to (approximately the eighty-hflh
percentile, alternate responses efG xitively distracted students; beyond this point the
students chose the correct response except fo r students above the ninety-&Rh percentile
where many were effectively distracted by response 1. The level o f discrim ination
throughout the item is low and high scoring students are choosing the wrong réponse.
This is not a desirable characteristic h)r a test item. The lack o f discrim ination could
indicate that students are guessing at this item .

82

: ; \T e s L G r a f 9 8 \g r 7 t g r a f d a ta

Ite m 16

P r o b a b ilit y
25%

50%

75%

95%

/" \

0.6

0 .4

><J '

\ /

7

"l'Ük—

0. 0

24

28

32

36

E xpected Score

fYgw e P. The ICC produced by TestGraf fo r item 16 o f the grade 7 DM AT.

The summary statistics fo r persons and items fo r die Bigsteps analysis is shown in
Table 7. Also shown are die mean and standard deviations fo r persons and items.
Estimated measures as w ell as standard errors o f the estimates were also calculated and
are shown. O f particular interest are the inGt and outGt statistics fo r the test items. The

83

mdGt statistic is an information-weighted fit statistic which is sensitive to unexpected
behaviors w hich a fk c t responses to items that are near to the person's a b ility level. The
mean-sqnare inG t statistic is intended to be 1. For this analysis die inGt, fo r the items,
was .90 and the standardized score was -3.8. The calculated ioGt is w ell below the
expected value (1 ) indicating that there may be dependencies in the data. The outGt
staGsGc is an outlier-sensitive Gt staGsGc and is sensiGve to unexpected behaviours by
persons on items 6 r Gom then a b ility level. For this analysis the outGt, fo r the items,
was .94 and the standardized score was -3.0. Again this is below the expected value o f 1
and seems to indicate dependencies in the data.
For the grade 7 data, the Bigsteps program in itia lly required two iteraGons w ith
the PROX algorithm to arrive at an acceptable level o f convergence. The UCON
algorithm

programmed to conduct at most 25 iteraGons. The maximum lo g it change

decreased to the point where the Gnal iteraGon produced a change o f .0008 Gn a lo g it
value o f -.0156. There were 1175 persons and 35 items included fo r a ll estimaGons
w hich meant that no persons or items were excluded by Bigsteps due to lack o f Gt.
Table 7
Rarch

qf

7 D M 4T [Am g B/gstepf, Anw w ny

Raw 1 Count
Score 1
Persons
1Mean
24.8 1 35
8.6 1 X)
1 S.D.*
Item s
1Mean
832.9 1 1175
450.4 1 .0
1 S.D.*
MNSQ - mean squared

Measure

M odel
Ehror

-2.28
.76

.30
.04

InGt
MNSQ* ZSTD*
.95
.25

.00
.05
.90
.95
.01
.48
ZSTD —standard score

84

the Afeam ref - Pcrarow

-.3
1.0

OutGt
MNSQ* ZSTD*
.94
.23

-.3

-3.8
.94
-3.0
7.5
.46
7.5
S.D. - standard deviaGon

This analysis was programmed to include categories 6)r m ultiple choice items as w ell as
short answo" items so that the difB culty and discrim ination levels o f the short answer
items could be examined.
In Table 7, the calculated summary statistics fo r persons and items is shown. The
mean and standard deviations were calculated. Estimated measures as w ell as standard
errors o f the estimates were also calculated. O f particular interest in this analysis are the
statistics fa r items. The inGt and outGt staGsGcs are also shown.

85

Tables
Grade 7

T GfZMg Bigsfepf, Afeamreg q/^Di^cwZfy, Z);$enyMmadon

a»d Fd
Raw
Measure E rror
Score
MNSQ* ZSTD*
1
830
.54
-.15
.05
-9.9
2
708
.18 I .05
.66
-9.4
853
-21
.05
.53
3
-9.9
888
4
-.29
.52
.05
-9.9
1038
5
-.65
.05
.35
-9.9
823
.56
6
-.13
.05
-9.9
987
7
.37
-9.9
-.53
.05
8
429
1.05
.06
.89
-3.1
303
.07
.87
-3.0
9
1.56
465
.93
-8.4
10
.06
.71
304
11
1.55
.89
-2.6
.07
12
363
.90
1.30
.06
-2.5
888
.64
13
-.29
.05
-8.8
624
14
.42
.05
.71
-8.0
15
890
-.30
.59
.05
-9.9
374
16
1.26
.06
02
1.01
17
907
-.34 L .05
.48
-9.9
712
.17
18
-9.9
.05
.59
19
785
-.03
.05
.58
-9.9
20
464
.93
.06
-5.1
.81
809
.05
-9.8
21
-.09
.63
847
.05
.47
22
-.19
-9.9
878
-2 7
23
.05
.55
-9.9
.54
.80
-5.6
24
584
.06
410
1.12
.87
-3.4
25
.06
2591
-2.85
8.6
26
.03
1.341
1912
-2.11
.04
.83
-4.3
27
659
.32
1.97
9.9
.05
28
1734
-1.88
.04
3.0
29
1.141
707
.18
2.16
9.9
30
.05
31
633
.40
.05
.65
-9.9
.03
1.87
9.9
762
.05
1.70
1107
-.80
.05
9.9
33
851
1.69
34
-.20
.05
9.9
1031
-.64
1.77
35
.05
9.9
* MNSQ - mean squared
ZSTD - standard score
Item
Number

86

OutGt
MNSQ* ZSTD*
.58
-9.9
.72
-7.8
.56
-9.9
.56
-9.9
.38
-9.9
.60
-9.9
.40
-9.9
.91
-2.0
.98
-.3
.72
-7.0
.94
-1.0
.95
-1.0
.71
-7.4
.74
-7.1
.63
-9.8
1.21
3.9
.51
-9.9
.61
-9.9
.61
-9.9
.85
-3.7
.68
-8.4
.50
-9.9
.59
-9.9
.82
-4.8
.90
-2.1
8.3
1.35
.95
-1.2
1.88
9.9
1.27
5.4
2.05
9.9
-9.4
.66
1.77
9.9
Ï.7 5
9.9
1.63
9.9
1.77
9.9

PointB iserial
.35
.36
.33
.28
22
.34
.32
.26
.31
.50
.28
.26
.08
.37
.16
.09
.30
.45
.36
.34
.26
.43
.25
.29
.29
.45
.41
.39
.52
.38
.46
.45
.47
.44
.43

O f particular interest in this analysis is how w ell the short answer items 6 t this
model and how they relate to the dichotomous items. In Table 8, these values and those
fo r the m ultiple choice items are shown. The level o f difB culty in Table 8 is shown under
the column entitled "Measure". The level o f discrim ination in Table 8 is shown under the
column entitled "Point-biserial".
The level o f difB culty ordered horn most difBcuh to least difG cult as calculated
using the Bigsteps program is: 9 ,1 1 ,1 2 ,1 6 ,2 5 ,8 ,2 0 ,1 0 ,2 4 ,1 4 ,3 1 ,2 8 ,3 0 ,2 ,1 8 ,3 2 ,
19 ,21 ,6,1, 22,34, 3 ,2 3 ,1 3 ,4 ,1 5 ,1 7 ,7 ,3 5 ,5 ,3 3 ,2 9 ,2 7 and 26. S im ilarly, the level o f
discrimination (point biserial) ordered &om most discrim inating to least discriminating is:
29 ,10,33,31, 1 8 ,2 6 ,3 2 ,3 4 ,2 2 ,3 5 ,2 7 ,2 8 ,3 0 ,1 4 ,2 ,1 9 ,1 ,6 ,2 0 ,3 ,7 ,9 ,1 7 ,2 4 ,2 5 ,4 ,
11 ,8,1 2,21 ,23 , 5 ,1 5 ,16andl3.
The Bigsteps program also identiGes those items that are not a good h t to the
overall response characteristic curve w ell. Values in the inGt column o f Table 8 that are
close to 1 demonstrate a good G l As was the case w ith the grade 5 data set, items that
are 0.4 more or 0.4 less than this are not a good Gt. The sixteen items that do not appear
to Gt w ell are: 1 ,3 ,4 ,5 ,6 ,7 ,1 5 ,1 7 ,2 2 ,2 3 ,2 8 ,3 0 ,3 2 ,3 3 ,3 4 , and 35.
Zogifizc Acm üücapowc

C/rmg fAc Two fw om cicr A/W e/

vfrcoi

Table 9 displays the grade 7 D M AT data G)r both the two parameter and three
parameter logisGc models as reahzed by the program Ascal. For the two parameter
model, Ascal was programmed to run a maximum o f twenty iteraGons; however, only six
were required to achieve a maximum parameter change o f 0.00657.

87

Table 9
Zogzffzc jfem

q/^Grade 7 DAdWT Ugmg Two and T%ree f aranfefer

AfbdeZ,y qydfefd A ^aaw ef q/^Df;;^eid(y, T)zjcnn»nadon, ffaedo-gaej^a^mg and Fd

1

1
1

Item

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
*;X .05

2 Parameter Model
&
1
1
/
-1.018
-.482
-1.140
-1.458
-2.297
.984
-1.556
.758
1.271
.372
1.249
1.032
-1.818
-.139
-1.830
1.819
-1.304
-.356
-.749
.494
-1.093
-.837
-1.558
.013
.885
**-;K.01

.623
.634
.613
.532
.616
.625
.849
.495
.587
1.003
.598
.540
.400
.690
.400
.400
.697
1.223
.710
.626
.492
1.004
.463
.539
.473

5

38.356**
53.155**
28.795
34.396*
24.480
15.253
16.201
16.415
55.672**
46.579**
26.776
17.444
37.317**
10.153
28.228
130.614**
22.322
13.983
24.638
29.737*
20.532
22.307
18.658
18.639
23.027

-.823
.254
-.867
-1.199
-2.060
-.573
-1.326
1.061
1.330
.489
1.321
1.252
-1.188
.147
-1.624
1.917
-1.185
-.315
-.558
.847
-.501
-.744
-1.085
.327
1.262

3 Parameter Model
a
1
c
1
.660
1.533
.633
.556
.629
.713
.886
.617
1.282
1.465
1.962
.866
.400
.848
.400
1.629
.688
1.176
.730
1.227
.580
1.061
.492
.613
.726

.100
.310
.150
.130
.170
.190
.180
.120
.140
.080
.160
.130
.280
.130
.170
.280
.100
.030
.110
.190
.220
.060
.200
.120
.170

/
42.291**
29.988*
22.335
30.256*
37.053**
12.352
29.174*
16.648
22.379
40.941**
21.281
16.681
23.448
12.346
26.902
20.182
29.922*
22.657
26.968
5.918
15.935
12.987
12.055
11.287
19.594

The items W ien ordered by degree o f difB culty going 6om most difG cult to least
difG cult are: 1 6 ,9 ,1 1 ,1 2 ,2 5 , 8 ,2 0 ,1 0 ,2 4 ,1 4 ,1 8 ,2 ,1 9 ,2 2 ,6 ,1 ,2 1 ,3 ,1 7 ,4 ,7 ,2 3 ,1 3 ,
15 and 5.
The items when ordered by discriroination going from greatest to least are: 18,
2 2 ,1 0 ,7 ,1 9 ,1 7 ,1 4 ,2 ,2 0 ,6 ,1 , 5, 3 ,1 1 ,9 ,1 2 ,2 4 ,4 , 8 ,2 1 ,2 5 ,2 3 ,1 3 ,1 5 and 16.
A n examinaGon o f the chi-square values in Table 9 reveals Giat they range from a
low o f 10.153 fo r item 14 to a high o f 130.614 fo r item 16. In the same manner as was

88

|
1

described above (two parameter model o f Ascal - Grade 5), the two parameter model o f
Ascal has 18 degrees o f Èeedom. From a table o f critica l values lo r the chi-square
statistic we get the value 28.869 6)r a .05 level o f significance and the value 34.805 fo r a
.01 level o f signiGcance. Using Ihis infbrmaGon, items 2 0 ,4 ,1 3 ,1 ,1 0 ,2 ,9 and 16 lie
outside the .05 conGdence interval and items 1 3 ,1 ,1 0 ,2 9, and 16 lie outside the .01
conGdence interval. This in its e lf as noted fo r the grade 5 data does not automaGcally
rule out a G t The suggested cnterion, noted in the program operating manual, fo r
measuring the degree o f Gt is chi-square values that are considerably larger than Giose G>r
other items. In Table 9, we have at least one item that spears to have a chi-square value
that is considerably larger than the others - item 16. Because o f its apparent lack o f Gt,
item 16 can be considered suspect. Items 9 ,2 ,1 0 1 and 13 are less suspect as they do not
display the same dramahc diGerence between chi-square values as between items 16 and
9 but they warrant a careful look because o f the high chi-square values they have. A ll
other items appear to Gt this model.

A summary o f the ouqait fo r the 3 parameter model o f Ascal is also included in
Table 9. The program was again asked to run a maximum o f twenty iteraGons. This tim e
it ran eleven iteraGons and stoiq)ed when the maximum parameter change was 0.00322.
These items when ordered by degree o f difG culty going Gom most d iS c u lt to least
difG cult are: 1 6 ,9 ,11,25,12, 8 ,2 0 ,1 0 ,2 4 ,2 ,1 4 ,1 8 ,2 1 ,1 9 ,6 ,2 2 ,1 ,3 ,2 3 ,1 7 ,1 3 ,4 , 7,
15 and 5. These items when ordered by discriminaGon going Gom greatest to least are:
11,16,2,10, 9 ,2 0 ,1 8 ,2 2 ,7 ,1 2 ,1 4 ,1 9 ,2 5 ,6 ,1 7 ,1 , 3 ,5 , 8 ,2 4 ,2 1 ,4 ,2 3 ,1 3 and 15.
These items when ordaod by the guessing parameter going Gom the greatest degree o f

89

guessing (the highest values) to the least degree o f guessing (the lowest values) are: 2,16,
13,21,23,20, 6 ,7 ,2 5 ,1 5 ,5 , I I , 3 ,9 ,1 2 ,1 4 ,4 ,8 ,2 4 ,1 9 ,1 ,1 7 ,1 0 ,2 2 and 18.
The degrees o f freedom fo r the 3 parameter model o f Ascal are 17 because the
model begins its estimates by b ro kin g the data into 20 fractiles and tests the 3
parameters; difG culty, discrim ination and guessing. The critica l values fo r chi-square
w ith 17 degrees o f freedom are 27.587 (signihcance level .05) and 33.409 (signihcance
level .01). A look at the data reveals that items 7 ,1 7 ,2 ,4 , 5,10 and 1 lie outside a .05
conGdence interval and items 5,10 and 1 lie outside a .01 conGdence interval. O f greater
signiGcance is that none o f the chi-square values seem to be considerably greater than a ll
the others. When they are ordered, the greatest difference between successive chi-square
values is 6.797 (item 5 w ith chi-square 37.053 and item 4 w ith chi-square 30.256). We
can conclude that fo r the 3 parameter model o f Ascal a ll the items demonstrate a suitable
Gt.
A fw nary o f Gradk 7

JZgjgwwe vfwzfysü

There appears to be general agreement between the item analysis models about
the order o f difR culty fo r the items in the grade 7 te st W ith each program item 9 was
idenGGed as either the Grst or second in the order o f d iffic u lty . S im ilarly, w ith each
program item 5 was found to be the easiest o f the mulGple choice items and w ith the
Rasch analysis using Bigsteps we found that the short answer items 26,27,29, and 33
were judged easier again than item 5. Items 31,28,30, and 32 were found to be
moderately difB cult w ith item 31 being the most difG cult o f the short answer items eleventh when a ll items are ordered Gom most to least difG cu lt The other short answer
ita n s, items 34 and 35, were found to be moderately easy and were twenty-second and

90

th irtie lh when placed in order o f difR culty. A measure o f the difR culty o f these items is
found in the numbers o f examinees whom attempted them. O f die 1165 examinees fo r
whom statistics on the short answer items were recorded, there were 636 who attempted
item 31,391 who attempted item 28,386 who attempted item 30,438 who attempted
item 32, 548 who attempted item 34,603 who attempted item 35,632 who attempted
item 33,926 who attençted ita n 29,1049 who attempted item 27, and there were 1093
who attempted item 26.
Item discrim ination, between the item analysis models, was not quite as clear.
Some exception were: item 13 was ranked least discrim inating in a ll programs, item 15
was ranked ather th irty-th ird or thirty-fou rth in discrim ination, and item 8 was ranked
tw enty-ninth in three o f the programs (it was twenty-second in Iteman). W ith the Rasch
analysis using Bigsteps, most o f the short answer items were judged to have great
discrim ination. 16)und that the ten short answer items came w ithin the top thirteen items
when ranked horn greatest to least in discrim inatioiL See Table 25 in Appendix E, where
the orders o f d ifh cu lty and the ordas o f discrim ination between these programs are
compared.
In general, the point biserial correlations fo r alternative responses fo r each item in
the classical analysis using Iteman agreed quite closely w ith the respxmse curves &>r the
alternative responses shown using TestG raf
Several ita n s were Ragged as lacking fît in the item response analysis o f the
grade 7 DM AT data using the programs Bigsteps, and the two parameter model and the
three parameter model o f Ascal. A summary o f the items that lacked 6 t is shown in
Table 10. The items in bold p rin t were found to lack fit in each o f the programs.

91

Table 10
Aem; m fAe Grodk 7

ZacA

Zogüfzc Akm Æe^^Rge

v4nafy6:gg

Items Lacking F it
Bigsteps

5 (.35), 7 (.37), 22 (.47), 17 (.48), 4 . (.52),
3 (.53), 1 (.54), 23 (.55), 6 (.56), 19 (.58),
15 (.59), 18 (.59), 34 (1.69), 33 (1.70), 35
(1.77), 32 (1.87), 28 (1.97), 30 (2.16)

Ascal, 2 Parameter

20,4

Ascal, 3 Parameter

7,1 7,2 ,4

13,1,10,
2, 9,16
5 ,1 0 ,1

a These items ^p e a r in bold p rin t and are identified as lacking St in each o f the analyses.
We Snd that items 1 and 4 are identified as lacking St in a ll three item response
models. Using values S)und in Table 6, we see that item 1 hasp = .71, D = .43, and 7^» =
.41; item 4 hasp = .76, D = .30, and 7),» = .35. Both items ^ypear to be moderately easy
w ith good discriminaSon. In TestGraf we Snd that fo r item 1 , the discrim in a tions
(measured by the 7 ^) o f the correct reqwnse and the ahemaSve response 1 are the
op;x)site o f what is expected up to the tenth percentile; and in item 4 the altemaSve
response 3 acts as a great distractm among the low scoring group (p=.85 at the firs t and
second percentile).
Items 5 ,7 , and 17 were identiSed in both the Rasch analysis using Bigsteps and
the three parameter logistic model using Ascal as lacking St. Again using Table 6, we
hnd that item 5 has p = .88, D = 2 0 and

= .30, item 7 has p = .84, D = .31 and 7),* =

.40; and item 17 hasp = .77, Z) = .37 and 7^6 = .41. Here again we 6nd that the item
appears easy, more so even than items 1 and 4; this tim e item 5 has low discrim ination

92

w hile hems 7 and 17 have an average discrim ination. The response curves & r aU three
items ^p e a r normal. The items appear w e ll behaved.
In the tw o parameter and three parameter logistic models using Ascal, items 10
and 2 were identiGed as lacking Gt. Item 10, using Table 6, has p = .40, D = .61, and
= .51 and item 2 hasp = .60, D = .50, and r};* = .43. Item 10 appeam to be a difB cult item
w ith excellent discrim ination. Item 2 appears easier and also has excellent
discrim ination. An examination o f the response curves found in TestGraf shows that fo r
item 10, the correct response exhibits a negative discrim ination between the tenth and the
th irtie th percentiles and the alternative response 4 acts as a great distractor fix low
scoring examinees (it has p = .60 in the Brst and second percentiles); and B)r item 2, the
correct response exhibits a slight negative correlation between the tenth and twenty-GAh
percentiles.
In the Rasch analysis using Bigsteps, items 3 ,2 3 ,6 ,1 9 ,1 5 ,1 8 ,3 4 ,3 3 , 35,32,28,
and 30 are also identiBed as lacking Bt. O f these, only the Brst six items can be
compared using data Bom Table 6 and item ICCs shown using TestG raf Item 3 has p =
.73, D = .38 and
D = .45 and
= .20 and

= .40. Item 23 hasp = .75, D = .31 and rp* = .34. Item 6 hasp = .70,

= .41. Item 19 hasp = .67, D = .46 and
= 25. Item 18 has p = .61, D = .62 and

= .44. Item 15 has p = .76, D
= .55. Each o f these six items

can be termed moderately easy. Moreover, w ith the exception o f item 15, aU exhibit at
least average discrim ination. Items 6,19 and 18 appear to have excellent discrim ination.
In TestG raf the response curves fo r item 3 show that alternative response 4 is an
excellent distractor fo r most low scoring examinees (p = .70 fo r examinees in the Brst and
second percentile), w hile the other response curves ^p e a r normal. In ita n 23, the

93

coirect response exhibits a slight negative discriin in atio ii among the low scoring
examinees (to the tenth percentile). In items 6 and 15 there is a slight negative
discrim ination in the high scoring examinees. In item 19 the response curves appear to
behave ju st as expected. In item 18, response 4 acts as an excellent distractor and is
favored by examinees up to the 50^ percentile. The response curve fo r the correct
response shows a slight negative discrim ination fo r the highest scoring students.
In the tw o parameter logistic model using Ascal, items 2 0 ,1 3 ,9 , and 16 show a
possible lack o f Gt. Item 20 has ^ = .39, Z) = .46 and
.18 and
and

= 2 0 . Ito n 9 h a s = 26 , D = .35 and

= .43. Item 13 has p = .76, Z) =
= .37. Item 16 has^ = .32, D = 21

= .20. Three o f the four items (20,9, and 16) ^p e a r to be difG cult items. The

fourth item (13) is a moderately easy item . In only tw o o f the items (20 and 9) do we
Gnd average to excellent discrim inadon. In TestG raf we see that fo r item 20, the
response curve fo r the correct response shows a slight negaGve discriminaGon fo r high
scoring examinees (over the ninty-GAh percentile). The response curves in item 13 are
Gat and show GtGe discriminaGon. The reqxmse curve fo r the correct response in item 9
shows a strong posiGve discriminaGon fo r high scoring examinees (above the seventyGAh percentile). And Gnally, the response curves in item 16 fo r the correct response and
the strongest alternative response (response 1) show a large reversal o f the expected
discnminaGons fo r the high scoring examinees.
In most o f the above items it appears that lack o f Gt may stem Gom three sources:
an easy item lacking in discriminaGon, a difBcuhy item w ith strong distractors or an item
where the responses misbehave Grr a small proporGon o f the examinees. Item 16 has a

94

response cwve that can be considered atypical because responses by the higher scoring
examinees is contrary to what is expected.
R eliability o f C riterion Measures
Two separate runs o f the Iteman program were conducted 6>r each o f the DM AT
tests. The Grst run identified two subtests to be analyzed. The Grst subtest included a ll
mulGple choice test items. The second subtest included aU short answer test items. The
internal consistency estimates (coeGScient alphas) fo r these subtests are shown in Table
11. In the second run, the test data was separated into Gve subtests. The Grst four
subtests consisted o f mulGple choice test items categorized by provincially defined
learning strands. The Gfth subtest was a ll d io rt answer test items. The internal
consistency estimates fo r these subtests are also shown in Table 11. Both runs were
made w ith the grade 5 D M AT data and the grade 7 D M AT data.
Overall test staGsGcs fo r the iniGal run o f the grade 5 data set using Iteman were:
fo r mulGple choice test items, a mean o f 19.165 (out o f 30 item s), standard deviaGon o f
5.631, skewness o f -0.407, kurtosis o f -0,407 and a standard error o f measurement o f
2.277; fo r short answer test items, a mean o f 7.716, standard deviaGon o f 3.619, skewness
o f 0.158, kurtosis o f -0.736 and standard error o f measurement o f 1.157. The test
staGsGcs fo r the in itia l run o f the grade 7 data set using Iteman w oe: fa r mulGple choice
test items, a mean o f 14.607 (out o f 25 item s), standard deviaGon o f 4.353, skewness o f
-0.185, kurtosis o f -0.307 and a standard error o f measuranent o f 2.136; fo r short answer
test items, a mean o f 10.362, standard deviation o f 5.020, skewness o f 0.184, kurtosis o f 0.726 and standard error o f measurement o f 1.146.

95

Tablell
f/K CZfMfiW ^/zo/yfM [/ymg Aeman

1
Dichotomous Items
1 No o f Items |
Alpha

M ultipoint Items
No o f Items |
Alpha

0.837a
Grade 5
30
6
0.707b
Numbers
16
5
Patterns and Relations
0.589b
Shape and Space
5
0.507b
Statistics and Probability
4
0.432b
Grade 7
0.759.
25
10
Numbers
13
0.665b
Patterns and Relations
3
0.324b
5
Shape and Space
d.322b
Statistics and Probability
4
0.325b
Tiese are the coefBcient alp las fo r dichotomous items in t Me Grst run.
b These are the coefBcient alphas fo r the curriculum strands in the second run.
c These are the grade 5 and grade 7 coefBcient alphas 6)r both runs.

0.898c

0.948c

It can be noted that when the firs t subtest (in the hrst run) is sectioned into smaller
units the re lia b ility decreases. This is to be expected. Sax (1997) points out that one o f
the factors affecting re lia b ility is the number o f items on the test. B y sectioning iq) the
dichotomous test items we are in essence creating four smaller tests. From Table 11 we
observe that die coefRcient alphas fo r the tw o sets o f tests (dichotomous items and
m ultipoint items) are 0.759 and higher. This indicates a strong internal consistency and
leads to the conclusion that both the Grade 5 and the Grade 7 tests exhibit a high degree
o f internal re lia b ility.

As indicated previously, the short answer test items were graded by a team o f
markers. R eliability was assessed w ith randomly selected tests copied and marked by all
markers. This process was undertaken w ith both groups o f raters - grade 5 and grade 7
w ith a few m inor diflerences. There were three grade 5 markers and 6ve grade 7

96

|
1

markers. The grade 5 madcers marked 19 papas in common and Ihe grade 7 markers
marked 20. And one o f the grade 5 markers was away on one o f the days that inter-rater
reliabdhy was assessed.
Table 12
Diytnhndon

Item
31A
3 IB
32A
32B
33

34A

34B
34C
35

36A
36B

Mean Total
Standard Deviation

^

Marks
0
1
0
1
0
1
0
1
0
1
2
3
4
0
1
2
0
1
0
1
0
1
2
0
1
0
1

/ô r the

Jÿgms m the Groiie 5 D M 4T

Rater 1
42.1
57.9
94.7
5.3
632
36.8
57.9
42.1
0.0
21.1
10.5
31.6
36.8
26.3
31.6
42.1
26.3
73.7
47.4
52.6
36.8
15.8
47.4
36.8
63.2
10.5
89.5

Rater 2
33.3
66.7
100.0
0
33.3
66.7
33.3
66.7
0.0
33.3
0.0
33.3
33.3
33.3
33.3
33.3
22.2
77.8
44.4
55.6
0.0
44.4
55.6
44.4
55.6
11.1
88.9

Rater 3
42.1
57.9
94.7
5.3
36.8
63.2
52.6
47.4
0.0
27.8
5.6
22.2
44.4
27.8
27.8
44.4
27.8
722
55.6
44.4
15.8
31.6
52.6
36.8
63.2
10.5
89.5

9.32
3.33

10
3.97

9.89
3.03

97

1
1

The overall scores as w ell as Ihe item by ita n percentages o f marks awarded are shown in
Table 12 fo r the grade 5 markers and Table 13 fo r the grade 7 maikers.
Table 12 dmws the proportions o f students assigned the diflerent marks by each
o f the raters fo r each o f the items on the grade 5 tesL That is, fo r item 31A : Rater 1 gave
42.1% o f the examinees a mark o f 0 and 57.9% o f the examinees a mark o f 1, Rater 2
gave 33.3% o f the same examinees a mark o f 0 and 66.7% o f them a mark o f 1 and Rater
3 gave 42.1% o f these examinees a mark o f 0 and 57.9% o f drem a mark o f 1. The jBnal
two lines show the mean mark and the standard deviation fo r the marks each rater
assigned this grorq) o f examirKes (19 in total). These Ggures show that Rater 1 can be
considered the most severe and Rater 2 can be considered the easiest marker. Even so,
these markers can be considered equal because using the effect size index

where

I
ef = -—'----- , f is a marker, and p and o are the sample mean and standard deviation
(T
respectively(Hurlburt, 1998) and taking an even larger difference by choosing the easiest
and most severe markers, we Snd that = .20. The critica l value w ith

2 is greater

than 4 6)r a significance o f .05; and so we can conclude that there is little difference
between the markers.
In a sim ilar way. Table 13 shows the percentages o f students assigned the
diSerent marks by each ofth e raters Rrr each o f the items on the grade 7 tesL As w ith the
grade 5 data (Table 12), the last tw o lines o f Table 13 show the mean and standard
deviation o f the overall marks each rater assigned this group o f examinees (20 in total).
These Ggures show that Rater 1 can be considered the most severe and Rater 4 can be
considered the easiest marker.

98

Table 13
^ A zfgrfyôr fAe STzorf ,4?myer Ae/w m f^Ae Grodle 7 D A t^T
Item
26

27A
27B

28

29A
29B
29C
30

31
32

33

34A
34B
35A
35B

Marks
0
1
2
3
4
0
1
0
1
2
0
1
2
0
1
0
1
0
1
0
1
2
0
1
0
1
2
0
1
2
0
1
0
1
0
1
0
1

Mean Total
Standard Deviation

Rater 1
IS
20
15
10
40
20
80
40
40
20
65
10
25
45
55
45
55
50
50
70
0
30
40
60
50
15
35
55
10
35
40
60
35
65
25
75
35
65
11.70
5.33

Rater 2
15
20
15
10
40
20
80
15
60
25
65
10
25
45
55
45
55
50
50
65
0
35
40
60
35
25
40
55
10
35
35
65
35
65
25
75
45
55
12.25
4.94

99

Rater 3
15
15
20
10
40
20
80
35
50
15
65
10
25
45
55
45
55
50
50
65
0
35
40
60
40
25
35
50
15
35
40
60
40
60
25
75
40
60
1 1 .^
4.98

Rater 4
15
20
20
5
40
20
80
20
55
25
65
10
25
50
50
45
55
60
40
65
0
35
35
65
40
25
35
55
10
35
35
65
40
60
30
70
40
60
12.55
6.03

Rater 5
15
20
15
10
40
20
80
40
40
20
65
10
25
45
55
45
55
55
45
65
0
35
37
63
50
15
35
55
10
35
35
65
40
60
25
75
40
60
12.11
4.88

As w ith the grade 5 data, we Gnd that these raters can also be considered equal. Using
the same calculations w ith the easiest and most severe ratas, we 6nd that (/ = .16 and the
critica l value fo r .05 significance w ith

4 is over 2. We can conclude that there is

little difference betw eai the markers.
In Table 14, the correlations between the markers fa r both sets o f tests - the grade
5 and tlK grade 7 is drown. For the grade 5 test, a factor that influenced these
correlations was that Rater 2 was away fo r one o f the two days that inter-rater re lia b ility
was measured. As a result rather than marking a ll 19 papers, this rater marked only 9.
The smaUo" number o f papers w ith w hich to compare results would lead to a lesser
correlation. Even though the correlation between raters 1 and 2 is a b it low , it is s till
sufBciently high to indicate an acceptable level o f agreement between these raters. For
the grade 7 test, the correlation betw eai raters was .95 or better, indicating excellent
correlation between these markers.
Table 14
Corre/afzon; RePygen AAirtgrf

Grade 5
2
.82

1
Rater 1
Rater 2
Rater 3

“

3
.91
.93

”

-

Grade 7
1
Rater 1
Rater 2
Rater 3
Rater 4
Rater 5

-

2
.94

3

.98
.97

“

-

4
.95
.98
.97
“

100

5
.99
.95
.98
.96

Content Related V a lid ity
As

mentioned in Chapter Three the design o f the DM ATs was determined by

the SDMC members in l9 9 5 . The teachers who were recruited to w ork on the actual tests
were instructed to develop items based on the learning outcomes listed in the B ritish
Columbia Mathematics Integrated Resource Package (IRP), 1995. The in itia l match o f
learning outcomes w ith items was completed by these subcommittee members. This was
brought to the SDMC to be ratiGed. The match between learning outcomes and test items
was further reviewed by the members o f the SDMC after each tim e the D M A T's were
administered. Over the years 1996,1998,1999 and 2000 several changes occurred.
Some items in the grade 7 test were discarded. The Performance Assessment part o f the
test was discontinued. The grade 9 test was discontinued.

There are 4 table; included in Appendix G w hich outline the Tables o f
SpeciGcadons fo r these tests. The inform ation in Tables 24 and 26 relate test items to the
Strands, Substrands and learning outcomes o f the B iiG d i Columbia IRPs. This
infbrmadon summarizes the analysis o f the SDMC members. The infbrm adon in Tables
25 and 27 relates the test items to the learning outcomes outdned in the Western
Canadian Protocol & r CoUaboradon in Basic Educadon. These two documents are
closely related w ith many o f the same learning outcomes. I have included these tables
because they outline the mathematical processes that are associated w ith each learning
outcome.

101

Table 15
jk /w m f/ze Griadb 5 aW Gra^jb 7 D M 4Z; GrgaMfze<af 6)/ ZR f Sb-awZy

Strand

Substrand

Nnmber

Number Concepts
Number
Operations
Patterns

Patterns &
Relations
Shape &
Space

Statistics &
Probability

Measurement

17

2,3,4,5,6,7,9,14

3D Otgects & 2D
Shapes
Transformations
Data Analysis

6

2,5

1,8,10,11,12,13
,15,16,17
1,3,4,6

5
4

1,2
1,2,3

3,4,5
4

3

1,2

3

Learning
Outcomes Used

Learning
Outcomes N ot
Used
4,5,7,8,9,10,13

Substrands

Numbers

Number Concepts
Number
Operations
Patterns

Statistics
and
Probabihtv

2,3,4,5,6,7,9,10
1,2,3,4,8

Learning
Outcomes N ot
Used
1,8
5,6,7

1,2

Strands

Shape and
Space

Learning
Outcomes Used

2

Chance &
Uncertainty

Patterns &
Relations

Grade 5
Number o f
Learning
Outcomes
10

Grade 7
Number o f
Learning
Outcomes
13
1

1,2,3,6,11,12
1

5

5

1,2,3,4

Variables &
Equations
Measurement

4

2

1,3,4

11

2,3,7,10

1,4,5,6,8,9,11

3D Objects & 2D
Shapes
Transformations
Data Analysis

5

1,2

3,4,5

2
9

2
6,7,8

1
1,2,3,4,5,9

5

3

1,2,4,5

Chance &
Uncertainty

102

There are four mam strands in tlK BC Mathematics IRP (1995) - Numbers,
Patterns and Relations, Sh^re and Space, and Statistics and Probability. These are further
divided into one, tw o or three substrands. The substrands are then Anther subdivided into
speciAc learning outcomes. For the purposes o f this analysis, I numbered the learning
outcomes in the order they ^p e a r in the IRP. Hence, the iSrst learning outcome in the
strand Numbers and in the substrand Number Concepts in the grade 4 curriculum ,
"estimate and then count the number o f objects in a set (0 to 1000), and compare the
estimate w ith the actual number" became Number, Number Concepts, 1. In the grade 4
curriculum there are 55 learning outcomes. Table 15 summarizes the distribution o f
items as they relate to the IRP strands. It also shows the specific learning outcomes that
have not been assessed directly in the DM ATs. As w ith the grade 4 curriculum , the grade
6 curriculum is divided into the four strands: Numbers, Patterns and Relations, Shape and
Space and Statistics and Probability. In the grade 6 curriculum there are also 55 learning
outcomes. Table 15 summarizes this inform ation as w ell.
When the distribution o f m arts is considered in the grade 5 D M AT the
strands have follow ing weighting: Numbers —37%, Patterns and Relations -1 1 % , Sh^re
and Space - 35% arui Statistics and P robability —17%. W ith the grade 7 D M AT a
sim ilar analysis shows that the strands can be given weighting as follow s: Numbers —
38%, Patterns and Relations - 15%, S h ^ and Space - 31% and Statistics and
P robability —15%. These proportions follow ed closely the test guidelines established by
the SDMC members \^ e n the tests were being designed.

103

It should be noted that items 26,33 and 34b o f the grade 7 test are not included in
Table 15 because they relate to learning outcomes that appear to most properly f it w ith
earliar grades.
Table 16
Annmwy

tAe M ztAem uiicj froce&sgs in the Grudk 5

Grudle 7 DM 4 7^ Drgu»ize<f

Numbers

Patterns and
Relations

Sh^)e and
Space

Statistics and
Probability

1 ,3 ,4 ,5 ,6 ,7 ,1 1 ,
1 4 ,1 5 ,1 6 ,31b
12,13
8, 9,10

26 ,27,28,29,
30

23,25,33

3 ,4 , 7,11,14,16,
31b
3 ,4 ,1 2 ,13,31b

26 ,27,28,29,
30
26 ,27,28,29,
30

18,31a, 32a,
32b,35
34c, 35, 36b
17,20,21,
36a
19 ,20 ,34a,
34b, 34c, 36a
34a, 34b

17 ,21 ,36a,
36b

23,24, 33

1,32
7 ,8 ,3 0

5 ,9 ,2 5 ,2 6 ,27a,
27b
9,24

2 3 ,35a, 35b

5

3

7, 8,30,33,
34a, 34b

5 ,2 6 ,27a

3 ,4 ,6

7, 8,30

5,24

32,33

2 5 ,2 6 ,27a, 27b

13, 32, 33,
34a, 34b

5 ,2 5 ,27b

Grade 5 D M AT
Communications
Connections
Estimations
Problem Solving
Reasoning
Technology
Visualization

1 ,2 ,3 ,4 ,6 ,7 ,1 1 ,
1 2 ,1 3 ,1 4 ,15,31b

22,24
25

Grade 7 D M AT

Connections

14,18,20,21,28,
29c
15,21

Estimations

11,12,17

Problem Solving

2 ,1 0 ,1 1 ,1 2 ,1 4 ,
1 5 ,2 0 ,2 2 ,2 8 ,29a,
29b, 31
2 ,1 0 ,1 1 ,1 2 ,1 4 ,
15,16,18,19,20,
2 2 ,2 8 ,29a, 29b,
29c, 31
2 ,1 0 ,2 2 ,29a, 29b,
31
1 5 ,1 8 ,2 0 ,2 8 ,29c

Communications

Reasoning

Technology
Visualization

4 ,6

4 ,6

104

The inibnnaüon pertaimng to mathematical processes relates more directly to the
constructs referred to in Chapter 2, in the section referring to Construct Related V alidity.
As mentioned in C hu ter 2, Cunningham maintains that constructs can only be inferred.
In Table 161 have summarized the inform ation 6om Appendix G about the math
processes related to the test items. Table 16 includes inform ation on both the grade 5 and
grade 7 tests. It is interestiog to note that there does not appear to be any items that relate
to technology in the grade 5 test; however, Table 15 shows diat there is a stül an excellent
match overall w ith the B ritish Columbia M inistry o f Education prescribed curriculum .
Criterion-Related V a lid ity
The strength o f the correlation between the D M AT scores and the mathematics
grades received by the students in grade 4 or grade 6 is evidence o f criterion-related
va lid ity, in this study,. An additional correlation can be calculated in the case o f the
grade 4 students because they also wrote a Numeracy test as part o fth e provincially
mandated Foundation S kills Assessment fa r grade 4. Because these measures are o f the
same curriculum at qiproxim ately the same tim e, the correlations are described as
concurrent v a lid i^ . N ot a ll the data on students' grades and the ir DM AT scores could be
matched. The specihc numbers o f students who wrote the DM ATs in each o f the
categories, students w ith term and Bnal marks, students w ith ju s t fin a l marks, and
students w ith not recorded marks, is shown in Table 17.

105

Table 17
M /m Aerf

PFrofe fAe GrW e J aW (Trajg 7 DA64Ty Cofggorizgf/ Zy fAe

Dofa CoZZecfgfZ

Students w ith term and Gnal maiks
Students w ith Gnal marks
Students wiGi no marks recorded
Students w ith FSA results
Miscoded students
Total number o f students

Grade 5

%

Grade?

%

1009
1108
189
614
12
1308

78%
86%
14%
47%

972
999
168

83%
86%
14%

-

19
1186

Be6>re calculatmg the correlation between D M AT scores and school grades, I
tested the data to determine whether or not there was a diSerence between the scores o f
the group 5)r w hich ta rn and/or Gnal grades were available and the group fo r w hich no
marks were available. This was done using a t-test on the data. The hypothesis being
tested fo r each grade was the same: there w ill be no signiGcant diHerence between the
D M AT scores 6 r the groiq) w ith marks recorded compared to the group that has no
marks recorded. This can be ea^xressed as Ho: //i =
Ha:

or //i-

- 0 and

or jWi- /fz # 0.
W ith the grade 5 data, the complete data group and the missing data groiq) varied

slightly in mean score and in standard deviation. The results were 26.72 (9.177) and
23.69 (8.679). A t-test resulted in

= -4.23 and w ith

= :^1.96 and

1295, it

showed that the sample diBerences were signiGcant.
W ith the grade 7 data the results were very sim ilar. The complete data group had
a mean score and standard deviaGon o f24.86 (9.063) and the missing data group had

106

22.80 (8.624). The t-test again resulted in

= -2.75 and w ith tc = ±1.96 and

1165,

it also Aowed that the sample difkrences were significant.
In Table 18 the correlation between die D M AT scores and student grades is
displayed. The data fo r this analysis was lim ited to the group o f grade 5 students fo r
whom we have Gnal and/or term marks. Table 18 includes correladons between the parts
o f the D M AT, between the D M AT and the FSA scores, and between the D M AT and the
students' term and Gnal marks.
Table 18

Corre/atzoMfybr tAe Grade 5 D M 4 T Between DM4T&orgf, FlSd &ores and Term and
Fzna/M zrks
MC

SA
.67

D M AT
.95
.87

FSA
.25
.19
.25

Term 1
.51
.37
.49
.26

Term 2 Term 3
Final
”
MC
.53
.53
.52
SA
.38
.42
.41
"
D M AT
.51
.53
.53
FSA
.27
.23
.23
*
Term 1
.79
.80
.87
“
Term 2
.86
.93
“
Term 3
.93
«
Final
Note. MC is the mulGple choice secGon (part 1) o f the D M AT w ith staGsGcs:
19.17,
5D = 5.63, n = 30 and SA is the short answer secGon (part 2) o f the test w id i staGsGcs: M
= 7.72,5Z) = 3.62, n = 16. Final re&rs to the Gnal grades.
In Table 18, we observe a general increase in the correlaGon between the overall
D M AT scores and the term grades going Gom the Grst term to the Gnal grade. This is to
be expected as the Gnal grade provides a measure o f the curriculum studied during the
whole year. There is also a strong correlaGon between the term and Gnal marks. This too
is to be e)q)ected as the Gnal mark reGects the total o f the ta rn grades. Using Cohen's
criteria forjudging eGect size (H urlburt, 1998), r = . 1 is a small effect size, r = .3 is a
medium eGect size and r = .5 is a large eGect size, we observe that: the correlaGon

107

between the D M A T scores and fin a l grades is large; the correlation between tl^
dichotomous items scores and Gnal grades is large; the correlation between the m ultipoint
items scores and Gnal grades is medium; and die correlation between the D M AT scores
and FSA scores is small.
In a sim ilar manner. Table 19 shows the various correlations between the
D M AT and term and Gnal student grades. W ith the grade 7 data there is no reference to
FSA semes because grade 6 student do not w rite FSA Numeracy tests.

Using Cohen's

criteria forjudg ing effect size: correlations between D M AT scores, dichotomous items
scores and m ultipoint items scores and Gnal grades are a ll large.
Table 19
Corre/oTronfybr the (Trodk 7 D M 4 T D M 4 T S c o r e s and Term oWFmoZ AAartr

MC

SA
.72

D M AT
.92
.94

Term 1
.52
.49
.54

Term 2
.55
.52
.58
.80

Term 3
.54
.53
.58
.77
.80

Final
.57
.55
.60
.88
.91
.92

“
MC
“
SA
D M AT
Term 1
«
Term 2
“
Term 3
“
Final
Note. MC is the m ultiple choice section (part 1) o fth e D M AT w ith statistics: A f = 14.61,
&D = 4.35, n = 25 and SA is the short answer section (part 2) o f the test w ith statistics: M
= 10.36, SO = 5.02, n = 23. Final refers to the Gnal grades.

Table 18 shows the strengths o f the various correladons that exist between the
D M AT scores, the FSA scores, term and fin a l grades Gir the grade 5 students. Sevaal
patterns appear to emerge. The correladons o f the total scores on the grade 5 D M AT and
the term and Gnal grades are closely ^rproxim ated by the same correladons measured
using the m iildple choice items only. On the whole, correladons o f D M AT scores to

108

term and ûnal grades increase as one progresses &om term 1 grades through to final
grades. This is to be expected because this D M AT was designed to test the whole grade
4 curriculum . It is assumed that the fin a l grade is the most accurate measure o f
achievement on a ll curriculum elements. As noted previously, the correlation o f DM AT
scores to FSA scores was small. S im ilarly, the correlation o f FSA scores to term and
Gnal grades was small; although a direct comparison could be misleading as the FSA
scores were measured on a Gve point scale. It appears that the grade 5 D M AT acted as a
better measure o f student proGciency than the FSA.
For the grade 7 DM ATs, we look at Table 19. Table 19 shows the strengths o f
the various correladons that exist between the D M AT scores and the FSA scores, term
and Gnal grades fo r the grade 7 students. The correladons are higher fo r the m uldple
choice secdon than the short answer secdon; however, the best comparison seems to be
between the total D M AT scores and the Gnal grades. Again, this is W iat one would
e)q)ect. On the whole, correladons o f D M AT scores to term and Gnal grades
progressively increase as one goes Gom term 1 grades through to Gnal grades. It is
assumed that the Gnal grade is the most accurate measure o f achievement on all
curriculum elements.
Addidonal V a lid ity Evidence
Further va lid ity evidence des in how the DM ATs have been received and
accepted in the school d is tric t For Gve years the tests were administered to students in
the school district. During that dme they have been accepted by principals and teachers
alike as a test o f students' mathonadcs abG i^. Re&rence was made in chapter 2 to 6ce
validity. This is an example o f i t In gathering data fo r this study many principals

109

expressed an interest in hearing the Gndings o f this research. In some school teachers on
hearing about the topic o f this study began conversing heely about the tests, even to the
specificity o f certain items on the tests. It impressed me that not only were they aware o f
the tests but in talking about them, they were accepting them as mathanatics tests. They
^p e a r to be a mathematics tests and have been received as such.

110

CHAPTER 5 - DISCUSSION
Summary
There were three main issues examined in this thesis, aU related to the
analysis o f the grade 5 and grade 7 DM ATs. The Grst issue raised was the re lia b ility and
the va lid ity o f the tests themselves. This analysis has provided an opportunity to examine
the strengths and weaknesses ofthe tests overall as w ell as those o f individual test items.
It also provided an opportunity to examine how these items and/or tests can best be used
in future assessments.
The second issue involved the process under w hich the DMATs were
developed and im plim oited. It involved the construction, adm inistration and scoring o f
the tests. The tests were designed in response to the im plim entation o f a new
mathematics curriculum and the new curriculum was its e lf a response to a new way o f
thinking about the learning o f mathematics. The school d istrict embarked on a process
designed to test how w ell the new curriculum was being imphmeated in the classroom.
This study provided an opportunity to study the process.
The th ird issue involved the analysis o f the test items. I chose to examine
the test herns using four diSerent approaches to test analysis - the Grst, a classical
approach w ith its concentration on classical test statistics, the second, an assumption hee
approach w ith its concentration on constructing and analyzing ICCs, the third, item
analysis based on Rasch analysis, and the 6)urth, a logistical item analysis using two and
three parameter models 6 r in&rm ation about item diS iculty, item discrim ination and
psuedo-guessing. The tools used in these analyses reflected these dif& re nt a^oaches.
For the classical analysis, 1 used Iteman to generate test and item statistics. For the

111

assumption 6ee analysis, I used TestGraf to construct ICCs that graphically presented the
data Tvitbout over riding assumptions about the item parameters. For the Rasch analysis,
I used Bigsteps to provide in&rm ation on item diO iculty and fo r two parameter and a
three parameter logistic analysis I used Ascal. In using tk s e dif& rent programs I had an
opportunity to compare them and indirectly to compare the diSerent ^proaches to test
analysis.
Suggestions w ill fo llo w m y conclusions but firs t I w ould like to summarize
the steps that were taken in the analysis o f the re lia b ility and va lid ity o f the DM ATs. M y
analysis began w ith observations about the internal consistency o f the tests. I used
Iteman to assist in the calculation o f the overall test statistics needed fo r this analysis.
There were two separate runs o f Iteman fo r each D M AT test. The Grst &cused on two
subtests —dichotomous test items and m ultipoint test items. The second further divided
the dichotomous test items into the four mathematics strands defined in the B ritish
Columbia, M inistry o f Education, IRP fo r mathematics. To study the re lia b ility o f the
marking o f the m ultipoint test items, 19 tests (fo r grade 5) and 20 tests (fo r grade 7) were
;Aotocopied and marked by each o f the markers. The results were compared to observe
i f there were significant diSerences between die markers. Correlations between the
markers were also calculated.
To obtain the data needed A ir observations about test va lid ity, I contacted
school principals and obtained FSA scores &>r a sample o f students Wm wrote the grade
5 D M AT and term and Gnal grades fo r a large sampling o f students who wrote the grade
S or grade 7 DM AT. The data was compared and correladons between the D M AT
scores, m uldple choice scores, short answer scores, term and Gnal grades and in the case

112

o f the grade 5 students the FSA scores were a ll included. This analysis was conducted
using the program SPSS. Inform ation 6om the SDMC was used in tabulating the test
items fo r a Table o f SpeciGcations based on the B ritish Columbia Mathematics IRP. The
Table o f SpeciGcations showing matbemaGcs processes was compiled using the
infbrm aüon Gx)m the SDMC and cross tabulating it w ith infbrmaGon taken Gom The
Western Canadian Protocol fo r CoUaboraGon in Basic EducaGon. InfbrmaGon Gom
teachers, administrators and school district personnel was used in determining face
vahdity observaGons.
In assessh% the process involved in the designing, constructing and
implementing the DM ATs, I used in&rmaGon Gom the SDMC. The analysis o f the
interrater reHabihty also contributed to the overall assessment o f the process as did the
analysis o f the test items.
The analysis ofthe test items involved the use o f several computer
programs and the data had to be organized in a way that was compatable w ith each. I
began by using Iteman and a classical ^iproach. As menGon previously, there were two
separate runs fo r each test using Iteman; the analysis o f the items is independent ofthe
number o f subtest involved. For the analysis o f the items I used the data G"om the Grst
run (dichotomous test items and mulGpoint test item s).
In terms o f the individual test items, fb r the grade 5 DM AT, those items that
display a poor discriminaGon were easia" items w hich w ould not norm ally be expected to
show much discriminaGon. Items which were identiGed as hard items generally showed
good to great discriininaGon, a desirable attribute o f an achievement test. There was no
evidence that items were excessively severe and so examinees sim ply guessed nor is

113

{here any evidence that would suggest that an items were keyed incorrectly. The
alternative responses appear to behave as they should.
In the analysis o f the individual test items, fb r the grade 7 D M AT, most
items that showed a low level o f discrim ination were easier items and a lower
discrim ination level is expected. There is an exception however in question 16. This
question has a diGBculty level measured as

= .43 and a discrim ination index o f D = .21 ;

so it a d iffic u lt question w ith poor discrim ination. N ot too surprisingly, question 16 also
has an alternative response (3) that has a positive point biserial. This is not a desirable
tra it fb r an achievement test. This is an item that should be considered fb r exclusion.
I fallow ed (his in itia l analysis by examining the TestGraf output. I was
able to con&m many o f the aspects fb r individual test items that I fbund using Iteman by
examining die response curves fw each item . The added feature w ith TestGrafwas diat I
was able to observe how the response curves varied over the changing student a b ility
levels. A b ility levels were measured along an expected scores axis. D ifB culty and
discrim ination were measured as a function o fth e students' abilities. This meant that it
was possible to predict at what levels the items displayed the greatest and/or least
discrim ination. In some instances, items that demonstrated poor discrim ination as
identiGed by Iteman were fbund to have excellent discrim ination among low scoring
students. This type o f question would be ideal fb r identifying students who are at risk. In
a sim ilar way it was possible to examine some items that were idenGGed, using Iteman,
as hard items and h i observe how the items showed great discriminaGon among high
scoring examinees. This type o f item w ould be ideal fb r identifying students who may
q u a li^ G)r special enrichment classes, programs or awards.

114

W ith the assumption 6ee analysis using TestGraf I was able to observe the
behavior o f item s along the ICCs geno-ated. This provided an opportunity to examine
discrim ination at various points along the d iS icid ty/a b ility continuum. The items in the
grade 5 D M A T that discriminated w ell 6)r low scoring students (below the 25*
percentile) were 2 ,4 ,6 ,7 ,9 ,1 0 ,1 2 ,2 0 ,2 3 , and 24. The items that showed a slight
negative discriinination fo r these students were 1 ,4 ,5 ,1 1 , and 15. The items that
discrim inated w e ll & r high scoring students (over the 75* percentile) were 1,15, and 21.
The items that showed a slight negative discriinination fo r these students were 8 ,9 ,2 1 ,
25, and 27. The other items discrim inate w ell & r students in the m iddle ranges. In the
grade 7 D M AT, items that discrim inated w ell fo r low scoring students (same percentiles
as fo r grade 5) were 3 ,4 ,5 , 7,15,17, and 23. The item that showed a slight negative
discrhnination fo r these students was 21. The items that discriminated w ell fo r the high
scoring students were 8, 9 ,1 0,1 1,20 , and 25. The items that showed a slight negative
discrim ination were 6,1 2,1 5, and 21. Item 16 showed quite a pronounced negative
discrinnnation fo r the highest scoring students. The other items discrim inate weU fo r
m iddle scoring students.
A fter reviewing the TestGraf output I conducted a Rasch analysis o f the
data using the program Bigsteps. M y ch ie f focus was to analyze the levels o f di@ culty
and discrim ination fo r the short answer questions; however, it provided an opportunity
fo r me to compare a ll the items short answer and m ultiple choice. I was able to compare
them and rank them in terms o f difS culty and discrim ination. In addition, I was able to
id e n tic items that did not appear to St a Rasch model. Knowing which items lacked St
provided the opportunity to once again examine them using TestGraf and Iteman.

115

The Gnai analysis I did was w ith the program Ascal where I examined the
data using a 2 parameter model as w ell as a 3 parameter model. The resulting output
gave me the opportunity to identify the items that lacked St. As w ith Bigsteps, I used this
inform ation to focus examination o f these items using Iteman and TestG raf
Conclusions

There are conclusions that can be reached about the test items themselves.
In the grade 5 test, the m ultiple choice ita n s appear to be behaving as expected. There is
a broad range o f difB culty and the individual items fa r the m ultiple choice section seem
to discrim inate w ell. The short answer items ^p e a r more severe. The distribution o f
scores fo r the short answer items is positively skewed (.158) indicating that the items
were &und to be generally difB cult. This appears to be die case eqiecially w ith items 31
and 32. I f items were to be changed, these two short answer items should be considered.
There is a strong correlation between the m ultiple choice items and students' grades so it
is conceivable that the m ultiple choice items alone could provide the SDMC the
inform ation needed about student achievement in mathematics.
In the grade 7 test, as w ith the grade 5 test, there is a broad range o f item
difG culty. Most ofth e items behave as expected. The exception is ite m l6. This item
has p = .32, Z)= 21 and

= .20 (6om Table 6). The ICC as shown using TestGraf is

fla t horn the Gfteendi percentile to about the eightieth percentile and even w ith high
scoring examinees it behaves unusually (see Figure 9). The item , when we used the two
parameter logistic model o f Ascal, showed the greatest lack o f Gt. This item should be
replaced. The distribution o f scores fo r the short answer items was positively skewed

116

(. 184) indicadng that this part o f the test is generally diS icnlt. The correlations in table
20 show a strong relationship between the short answer scores and students' m arts. Even
though it is a hard test it s till spears to give valuable inform ation and should therefore be
retained.
The manner in which marks are assigned fo r the short answer items should
be reviewed. For an achievement test, is it im portant to make a distinction between
students who attempt a question and get zero and students who do not even attempt the
question? The analysis o fth e short answer section is made more difB cult by choosing
this distinctioiL I f the inform ation is not needed and/or useful, it may be advisable to
sim ply assign zero to a ll in this sort o f situation. Item 34 in the grade 5 test needs to be
separated into two items i f the same scoring sheets are to be used. The item is out o f six
marks and the scoring sheets have only room & r 0 up to 4.
Finally, w ith the level o f difB culty known fo r each o f the test items, grade
5 and grade 7 items can be reorganized so that they progress generally 6om easier items
at the beginning to the more difB cult items toward the end. Such an organization w ill
provide most students a better opportunity to accurately show what they know.
There appears to be solid evidence that the SDMC was successfW in
constructing an assessment instrument that showed how students were achieving in
mathematics. In part, whether a test was valid or not is a function o f whether or not it has
met its original purpose. W ith the DM ATs, the original purpose was to assess student
achievement during a period o f transition Bom the old mathematics curriculum to the
new mathematics curriculum . The tables o f speciBcation, tables 26 and 28 in appendix
G, show clearly how the test items matched the new curriculum . Although the test items

117

did not incliide every learning outcome (to do so would have meant including 55 o f
them), the resulting correlation between the DM ATs and students' Snal maHcs are
sufRciently strong to consider that the test items can be generalized across the
curriculum . The tables o f q)eci6cations, tables 27 and 29 in appendix G, support this
assessment They outline the mathematical processes that are in evidence fo r existing test
items. These same mathematical processes are vdiat one finds in a ll the learning
objectives. The one exception, in this, is the technology component in the grade 5
DM AT.
The members ofthe SDMC decided to construct a mathematics
achievement te st As was discussed earlier (Chapter 2), the recommended range o f
difB culty fo r items in an achievement test is .30 < p < .70. For m ultiple choice test items
we could skew these values slightly higher. A range o f .40 < p < .80 w ould be
acceptable. For the grade 5 test and using data 6om Table 1, there are twenty-three out
o f th irty items that 611 w ith in this range. O f those outside the range & u r items are easier
(p > .80) and three items are harder (p < .40). The overall measure o f kurtosis (-.407)
siq>ports what we hnd about the distribution o f items by difB culty. We can expect the
scores to be more w idely distributed. In a sim ilar way using data from Table 6, in the
grade 7 test we Bnd that sixteen out o f twenty-hve items 611 w ith in the recommended
range. O f 6ose outside the range, two are easier items and seven can be considered
harda^ items. An overall kurtosis o f -.307 siq)ports d iis wider, fla tte r distribution o f
scores.
A comparison o f D M AT scores w ith term and fin a l grades and, in the case
o f 6 e grade 5 D M AT, w ith FSA scores provides an opportunity to compare 6 e DMATs

118

and the FSA —Numeracy test The correlations in Tables 18 show that the correlation
between the D M A T scores and 6nal grades is .53, a strong e fk c t using Cohen's criteria
forjudging e fk c t size. S im ilarly, it shows that the correlation between the FSA scores
and the fin a l grades is .23, a small to medium eGect using Cohen's criteria. It can be
concluded that the D M AT is a more pow erful measure o f mathematics achievement than
the FSA.
The strong correlation between the markers o f the short answer items is
another indicator o f how successful the SDMC members were in designing and
implementing the DMATs. The raters were w ell aware o f how the tests were to be
graded. It ^ypears that the grading process was such that raters clearly understood how
marks were to be ^)portioned. By way o f an overall conclusion to the test construction
process, the members o f the SDMC can be commended fo r the successful design,
construction and im plim entation o f these mathematics achievement tests.
Using a variety o f analysis programs to study the D M AT data has provided
an opportunity to compare the usefidness o f each. W ith Iteman 16)und I had an excellent
starting point in considering these tests. I hmnd it to be particularly valuable w ith the
dichotomous test items. The data were processed quickly, overall test statistics and item
statistics were available rig h t away and the ouqmt was easy to read. It fnovided a good
indication as to whether or not the items were behaving as they should. I was able to
examine correlation o f a ll responses but was especially able to check that the alternative
responses were behaving as they should. There were however some lim itations. Because
a ll calculations were taken directly horn the test data, t k test analysis and the item
analysis was test dependent. It was good inform ation to have but it was lim ited to the

119

examinees and how the examinees responded to these items 6)r this adm inistration o f the
tests. By using TestG raf I was able to use the data to predict probable responses and to
thereby analyze how items were behaving across a range o f students abilities. The
program is set up to use probabilities o f correct responses and expected scores. It is
therefore possible to examine the maximum likelihoods fo r certain responses. The data
were used to predict responses by studoits across the range o f a b ility levels. This was
particularly valuable in examining discrim ination levels 6*r the various a b ility grotg)s.
I found the TestGraf output fo r m ultipoint items difB cult to read and as a
result used the program in the analysis o f the dichotomous items only. 1 used Bigsteps to
examine the short answer test items. However, I found the program difB cult to setup and
the output difB cult to read. There was a lo t o f o u ^u t and 1 lim ited my considerations to
an analysis o f the items. It could have also been used in examining the examinees
(persons). I used Bigsteps and Ascal to identify items that could have been causing
difBcuhies. Even in id a itify in g an item 1 found that I reverted to Iteman and TestGraf to
examine more closely what could have been happening w ith it.
In terms o f w hich programs were most useful, Iteman and TestGraf were
this researcher^ s preferences. The Iteman program provided valuable inform ation about
how the grade 5s and the grade 7s students who wrote the DM ATs in 2000 responded to
the test items. The TestGraf program provided valuable inform ation about item
discrim ination at various levels o f a b ility.
The programs Bigsteps and Ascal, although more d iË c u lt to use, provide
an opportunity to compare test results in subsequent years w ith diGerent groups o f grade
5 and grade 7 students. They also provide an opportunity, should the tests be changed, to

120

reference any new items to the older items fo r w hich difB culty, discrim ination and
pseudo-guessing values have already been established. They provided anchor points fo r
any new test items. In this regard, although they were diGBcult to use, they also provide
valuable inform ation.
Lim itations
ZWa CoZZectfon
The cleaning up o f the data was a long and tedious a f& ir which cannot be
considered complete because there were s till several students who could not be identiGed
6om the inform ation recorded from the DM AT. The biggest factor seemed to be the
inaccurate coding that appeared on the bubbled answer sheets. A t the very least the
inform ation should be made clear at the very beginning. An even better course o f action
could be the identiGcation o f students by a bar coded or machine stamped label which
can be afSxed to the answer sheet and which wiU clearly ide ntify students by name,
number (PEN), and school. Such an identiGcaGon could occur even before the test
booklets were sent to schools.
The correlaGons, in Table 18, show to what degree there is agreement
between the various mathematics measures. There appears to be a strong reladonship
between the DMATs and students' marks and the relaGonship between the FSA scores
and students' maiks appears much weaker. CauGon should be exercised in conclusions
relating to the FSA scores. The data collected on FSA scores was not taken Gom the raw
test data. Rather, it was taken Gom the reports sent to schools and to parents. This data
consists o f only Gve points and is therefore quite restiicGve. Although less so, students'
marks are also somewhat restricted. The system o f grades in the schools produces a

121

seven or eight point system depending on whether or not students w ith lEPs are included.
I f possible raw data should be used.
There was a six month tim e delay between the FSAs and the DMATs. This
is a lim itahon in the study. It can mean that a lo t o f learning and/or forgetting has taken
place. It is expected that as the tim e between tests increases the correlations w ill decrease.
Two o f the ^xograms that were used in this analysis, Ascal, 2 parameter and
3 parameter models, can be used on dichotomous test items only. Because both these
program present an IR T analysis, the overall IR T analysis was lim ited. As IR T programs
are developed to process m ultipoint items, the range o f analysis option w ill be expanded.
Im plications fo r Future research
There are a variety o f assessment needs. Mathematics achievement is but
one. I also see a need fo r good diagnostic tests. It should be a mathematics test that
included a large number o f items, cross referenced to speciGc learning outcomes and that
display maximal discnm ination characteristics fo r low a b ility students. It could be o f
great assistance to teachers and administrators. The data from mathematics achievement
tests can be used to measure the relative achievements o f d ifk re n t group o f students.
This data could assist in the analysis o f factors that affect achievement in mathematics.
Im plications fo r Future Practice
There needs to be a decision as to the purpose o f the test in order that one
can on a personal level decide on the va lid ity o f the tesL I f the usefulness and therein the
purpose o f the test is fo r diagnosis - the identification o f students who are at risk in math
then the type o f question that is needed is one d ia l relates to specific curricular objectives
and shows good discriinination fo r students in the low scoring group. The overall look o f

122

the item w ould be "easy" and peAaps an overall discrim ination that is low but one where
the greatest level o f discrim ination is achieved 6>r students in say the lowest twenty-6ve
percent o f the population. A t the other extreme, i f the usefulness and therein the purpose
o f the test is to determine which students qualifying fa r enrichment, scholarships or
advancement into programs designed fa r the most capable math students then the type o f
question that is needed is one that shows good discrhnination among the high scoring
group. This item would probably be "hard" and may even have low discrim ination
overall but must have a great level o f discrim ination fo r students in the top twenty-hve
percent o f the population.
The original purpose o f the DM ATs was to give an overall read o f math
achievement in the d istrict and to that end they seem to be successful. The best indicator
that we currently have o f math achievement is student grades. The correlaticm between
the DM ATs and dnal grades is quite strong.
The big issue here is that there are a variety o f expectations that
practitioners have o f a test and the measure o f va lid ity is in part going to be a measure o f
how successhdly the instrument meets the various expectations that are set on it.

123

Reference L W

A llen, M . J. & Yen, W . M . (1979). Introduction to measurement theory. Belmont,
CA: Wadsworth Inc.
Assessment Systems Corporation. (1989). User's manual & r ascal: 2- and 3narameter IR T calibration program. St. Paul, Minnesota: Assessment Systems
Corporation.
Assessment Systems Corporation. (1993). User's manual fo r the iteman
conventional item analysis prnpram. SL Paul, Minnesota: Assessment Systems
Corporation.
B ritish Columbia Foundation S kills Assessment H ighlights 2000 (n.d ).
In& nnation about the FSA 2000 and results fo r reading comprehension, w ritin g and
numeracy. Retrieved from
Carpenter, J. & Gorg, S. (Eds) (2000). Principles and standards fo r school
mathematics. Reston, V A : The N ational Council o f Teachers o f Mathematics, Inc.
Crocker, L. M . & A lgina, J. (1986). Introduction to classical and modem test
theory. Orlando, Florida: H olt, Rinehart and W inston, Inc.
The Crown in R ight o fth e Governments o f Manitoba, Saskatchewan, B ritish
Columbia, Yukon Territory, Northwest Territory and Alberta. (1995). The common
curriculum hamework fa r K-12 mathematics: Western Canadian protocol fo r
collaboration in basic education.
Cunningham, G. K. (1998). Assessment in the classroom: Constuctine and
interpreting texts B ristol, PA: The Fahner Press.
Feldt, L. S. & Brennan, R. L. (1989). R eliab ility. In R. L. Lion (Ed), Educational
Measurement: Third E dition. New Y ork: M acm illan.
Gipps, C. & M urphy, P. (1994). A 6 ir test?: Assessment, achievement and equity.
P h ila d e lp ^: Open U niyersity Press.
Hambleton, R. K. (1989). Principles and selected applications o f item response
theory. In R. L. Linn (Ed), Educational Measurement: Third E dition. New Y ork:
MacmiUan.
Hambleton, R. K ., Swaminathan, H. & Rogers, H. J. (1991). FurKlamentals o f
. Newbury Park, CA: Sage Publications, Inc.

124

Henrysson, S. (1971). Gathering, analyzing, and using data on test items. In R. L.
Thorndike (E d), Education Measurement Second E dition. Washington, DC: American
Council on Education.
H urlburt, R. (1998). Comprebendine Behavioral Statistics. Toronto, ON:
Brooks/Cole Publishing Company.
K ilp a trick, J., SwaBbrd, J. and Findell, B. (2003). Adding it un: Helpine children
learn mathematics. Washington, DC: National Academy Press.
Kober, N . (1991). W hat we know about mathematics teaching and learning:
ED TALK. Washington, DC: Council & r Educational Development and Research. (ERIC
Document Reproduction S avice NO. ED 343 793)
Linacre, J. M . and W right, B. D. (1996). A user's guide to biestens: Rasch-model
computer program. Chicago, D: Mesa Press
Lyman, H. B. (1998). Test scores and \^bat they mean: Sixth edition. Needham
Heights, M A : A Viacom Company
M arshall, M . A . et al. (1997). The 1995 B ritish Columbia assessment o f
mathematics and science: Technical reporL V ictoria, BC: Queen's P rinter 6)r B ritish
Columbia
McConaghy, T. (1998). Canada's participation in TIMSS. Phi Beta Kapoan.
79(101.793, 800
M cM illan, J. H. & Schumacher, S. (1997). Research in education: A conceptual
Introduction New York: Addison-W esley Educational Publishers Inc.
Messick, S. (1989). V a lid ity. In R. L. Linn (Ed), Educational Measurement: Third
E dition. New Y ork: M acm illan.
MuUis, 1. V . S. eL al. (1997). Mathematics achievement in the prim ary school
years: lE A 's third international mathematics and science study (T lM S S l. Chestnut H ill,
M A : TIMSS International Study Center. (ERIC Document Reproduction Service NO. ED
410 120)
Province o f B ritish Columbia, M inistry o f Education, Curriculum Branch. (1995).
Mathematics K to 7: Integrated resource package. V ictoria, BC: (Queen's Printer fo r
B ritish Columbia
Ramsay, J. O. (2(KX)). TestGraf: A nroeram fo r the graphical analysis of m ultiple
choice test and questionnaire data. M ontreal: M cG ill U niversity.

125

Sax, G. (1997). Principles o f educational and psychological measurement and
evaluation. Scarborough, ON: Wadsworth Publishing Company
Stanley, J. C. (1971). R eliability. In R. L. Thorndike (Ed), Educational
Measurement Second E dition. Washington, DC: American Council on Education.
Webb, N . & Romberg, T. A . (1992). Im plications o f the NCTM standards fo r
mathematics assessment In T. A . Romberg (Ed.). Mathematics ass^sment and
evaluadon: Imperatives & r mathematics educators. Albany, N Y : State U niversity o f New
Y ork Press. (ERIC Document Reproduction Service NO. ED 377 073 )

126

Appendix B

Table 20

DMAT
Code

School
Code

010
020
030
037.5
040
050
060

311
313
314
312
316
317
324

070
075
080
090
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
285
290
300
310
320
330
340
350
360

318
319
321
322
323
327
328
329
331
332
333
334
336
337
338
339
340
342
343
344
345
348
350
347
351
353
354
355
358
359
360
364

Grade 5
Cases

School. Name
AUSTIN ROAD
BEAYERLY
BLACKBURN ELEMENTARY
BEAR LAKE
BUCKHORN
CARNEY HILL
FORT GEORGE CENTRAL
COLLEGE HEIGHTS
ELEMENTARY
DOME CREEK
DUNSTER
EDGEWOOD
FOOTHILLS
GISCOME
GLADSTONE
GLENVIEW
HALDIROAD
HART HIGHLANDS
HART HIGHWAY
HARWIN
HERITAGE
HIGHGLEN
HIGHLAND

mxoN
KING GEORGE V
LAKEWOOD ELEMENTARY
MACKENZIE ELEMENTARY
MALASPINA
MCBRIDE CENTENNIAL
MEADOW
MORFEE
MCLEOD LAKE
MOUNTAIN VIEW
NECHAKO NORTH
NUKKOLAKE
PEDENHILL
PINEVIEW
PINEWOOD
QUINSON
RON BRENT

130

54
36
56
5
21
30
15
40

Grade 7
Cases

11

54
78
98
5
48
55
26

45

85

42
42
27
25

1
3
5

35
6
20
31
11
58
37
14
27
33
19
7
15
31
20
40
25
29
28
6
23
32
21
30
30
20
31
22

School
Total

1

4
13
23
4

27
5

17
48
28
23
4
8
44
34
27
23
15
36
1

17
29
30
23
18
24
22

7

18
58
10
47
31
16
58
37
31
75
61
42
11
23
75
54
67
48
44
64
7
40
61
21
60
53
38
55
44

370
380
390
400
410
420
430
440
450
460
470
480
490

365
366
368
326
367
369
370
374
375
376
378
379
305
05
07
280
281
282
283
284
285
286
287
288
289
290
291
295
361
373

SALMON VALLEY
SEYMOUR
SHADY VALLEY
FORT GEORGE SOUTH
SOUTHRIDGE
SPRINGWOOD
SPRUCELAND
VALEMOUNT ELEMENTARY
VAN BIEN
VANWAY
WESTWOOD
WILDWOOD
HEATHER PARK
CONTINUING EDUCATION
CORRESPONDENCE CENTRE
BLACKBURN JUNIOR
COLLEGE HEIGHTS SECONDARY
DUCHESS PARK
KELLY ROAD
LAKEWOOD JUNIOR
SECONDARY
MACKENZIE SECONDARY
MCBRIDE SECONDARY
JOHN MCINNIS
PRINCE GEORGE SECONDARY
D f . TODD
VALEMOUNT SECONDARY
CONTINUING EDUCATION
YOUTH CONTAINMENT CENTRE
RED ROCK
UPPER FRASER

4
17
11
17
48
30
42
29
30
37
44
21

TOTAL KNOWN CASES
MISCODED DATA
TOTAL CASES

1297
12
1309

131

10
15
38
32
17
23
24
37
240

1175
17
1192

4
27
11
32
86
30
74
46
53
61
81
21
240

2472
2501

Table 21
DoAz

fin a l aW fl& 4 KeW ly

The numbers o f students 6>r which a ll tarm and final marks were recorded, ju s t Gnal
marks were recorded, total number o f studait w ith at least fin a l marks, no final marks
were recorded and students (grade 5 only) whose FSA results were recorded is as
k llo w s:

A ll term and Snal marks
Final marks only
A t least fin a l marks
W ithout data
W ith FSA results

Grade 5
number
1009
103
1112
185
614

Total Number o f Students

1297

percentage
77.8%
7.9%
85.7%
14.3%
47.3%

Grade 7
number
955
42
998
177

1175

132

percentage
81.4%
3.6%
84.9%
15%

Appendix c

#
#
#
#
#
#

% e test booklets fo r the DM AT grade 5 and grade 7 included:
2000 D istrict Assessment o f Mathematics —Grade 5, Part 1 - M ultiple Choice;
ZOOODistrictAssesanentofMathem alics —G rades, Part 2 -S h o rt Answer;
M ath Assessment - Grade 5 Answer Key;
2000 D istrict Assessment o f Mathematics - Grade 7, Part 1 - M ultiple Choice;
2000D istrictAssessm entofM athem atics —Grade 7, Part 2 —Short Answer; and
M ath Assessment - Grade 7 Answer Key.

They comprised pages 133 to 198 o f this thesis. They have been excluded horn
the published thesis to sa&guaid the integrity o f the item database.

133

Af^)endix D
Letter to the priiK ipals at elementary schools in the school district:

February 27,2002
(Principal's Name)
(School Name)
Dear (Principal)
I am currently working as a classroom teacher at Beaverly Elementary School, and am
working on my Master's degree in Curriculum and Instruction through the Education
graduate program at UNBC. This letter outlines my research project and is a request
for your assistance in its completion. The program has been approved by Bonnie
Chappell, Director of Instruction for School District #57, who will be advised of any and
all particulars of this project throughout ik duration. The results of this study are of
interest to various teachers and administrative officers in the school district.
Background
In the 1996-97 school year, grade 5 and 7 students in selected schools in the district
wrote locally developed Math Achievement tests. The tests were administered in the
fall and were designed to test student achievement in the grade 4 and 6 curriculum
respectively.
Tfie tests were originally developed and administered as part of the districts
commitment to ongoing student assessment. Some minor changes to the tests have
occurred but most items have remained unchanged and the tests have now been
administered several times since they vwre originally developed. The most recent use
of tfiese tests was in October, 2000, when they were written by all grade 5 and grade 7
students in the district.
Current Studv
To complete my master's degree thesis, I have proposed a study of the reliability and
the validity of tfiese tests.
The reliability of each test is the easier aspect to establish because student scores,
taken directly from the Math Achievement tests, can be used. The reliabilities can be
statistically calculated using these scores.
The validity of each test is tfie more difficult aspect to establish and will require
additional information. This is where I will need your help. My study will focus on two
measures of validity. The first - content-related validity - will require an item by item
analysis of the tests to determine the degree to which each test item matches the

199

curriculum. The second - criterion-related validity - will require a comparison of each
student's Math Achievement test score with his/her scores in comparable curriculum
areas and tests. The specific information I need is outlined in the next section.
Method
To carry out this study, I'll be analyzing students' scores from the Grade 5 and Grade 7
Math Achievement tests written in 2000. This information is available at Central Office
through School District files. It will provide information about the test items and will be
used to measure the reliability of each of the tests.
The validity of the tests will be determined by comparing each student's year end and/or
FSA scores with the scores they received on the Math Achievement tesk. From each
school, I will need:
#

for students who wrote the Grade 5 Math Achievement test - (1) math letter
grades for the 1999-2000 school year for each student Wio wrote this test (this
would include math letter grades A>r each term plus their final math grade) and
(2) results on the FSA-Numeracy test he/she wrote in May 2000 (Not Yet Within
Expectations (1), Meets Expectations (3), Exceeds Expectations (5) or halfway
between these either (2) or (4)).

# for Grade 7 students -math letter grades for the 1999-2000 school year for each
student who wrote this test (this would include math letter grades for each term
plus their final math grade).
# please note students who are on modified or adapted math programs

Ethics
This project will follow all UNBC research procedures and guidelines to safeguard and
maintain information conhdentiality. Student names will be coded once data is collected
and will be removed from all research documentation for further phases of the study.
As data collection only involves examining existing school records, there will be no
direct contact with the students and they will not be personally affected or identified by
the study in any way. This research proposal has been presented to and approved by
the UNBC Ethics Committee. Dr. Peter MacMillan from the UNBC Education
Department will be supervising the project. Research resulk will be shared with
university and school district personnel.
Summary
To facilitate the collection of this infomnation. I've attached a list of the students who
wrote the Math Achievement tests. In the case of students who wrote the grade 7 Math
Achievement test, their files may have been forwarded to a junior secondary school.

200

Letter to the principals at secondary schools in the school district:

I am currently working as a classroom teacher at Beaverly Elementary School, and am
working on my Master's degree in Curriculum and Instruction through the Education
graduate program at UNBC. This letter outlines my research project and is a request
for your assistance in its completion. The program has been approved by Bonnie
Chappell, Director of Instruction for School District #57, who will be advised of any and
all particulars of this project throughout its duration. The results of this study are of
interest to various teachers and administrative officers in the school district.

In Ihe 1996-97 school year, grade 5 and 7 students in selected schools in the district
wrote locally developed Math Achievement tests. The tests were administered in the
fall and were designed to test student achievement in the grade 4 and 6 curriculum
The tests were originally developed and administered as part of the districts
commitment to ongoing student assessment. Some minor changes to the tests have
occurred but most items have remained unchanged and the tests have now been
administered several times since they were originally developed. The most recent use
of these tests was in October, 2000, when they were written by all grade 5 and grade 7

Current Studv
To complete my master's degree thesis,
the validity of these tests.
The reliability of each test is the easier aspect to establish because student scores,
taken directly from the Math Achievement tests, can be used. The reliabilities can be
statistically calculated using these scores.
The validity of each test is the more difficult aspect to establish and will require
additional information. This is where I will need your help. My study will focus on two
measures of validity. The first - content-related validity - will require an item by item
analysis of the tests to determine the degree to which each test item matches the
curriculum. The second - criterion-related validity - will require a comparison of each

student's Math Achievement test score with his/her scores in comparable curriculum
areas and tests. The specific information I need is outlined in the next section.
Method
To carry out this study, I'll be analyzing students' scores from the Grade 5 and Grade 7
Math Achievement tests written in 2000. This information is available at Central Office
through School District files. It will provide information about the test items and will be
used to measure Ore reliability of each of the tests.
The validity of the tests will be determined by comparing each student's year end and/or
FSA scores with the scores tfrey received on the Math Achievement tests. From each
school, I will need:
» for students who wrote the Grade 5 Math Achievement test - (1) math letter
grades for the 1999-2000 school year for each student who wrote this test (this
would include math letter grades for each temr plus their final math grade) and
(2) results on the FSA-Numeracy test he/she wrote in May 2000 (Not Yet Within
Expectations (1), Meets Expectations (3), Exceeds Expectations (5) or halfway
between these either (2) or (4)).
* for Grade 7 students -math letter grades for the 1999-2000 school year for each
student who wrote this test (this would include math letter grades for each term
plus their final math grade).
» please note students who are on modified or adapted math programs
Ethics
This project will follow all UNBC research procedures and guidelines to safeguard and
maintain information confidentiality. Student names will be coded once data is collected
and will be removed from all research documentation for further phases of the study.
As data collection only involves examining existing school records, there vwll be no
direct contact with the students and they will not be personally affected or identified by
the study in any way. This research proposal has been presented to and approved by
the UNBC Ethics Committee. Dr. Peter MacMillan from the UNBC Education
Department will be supervising the project Research results will be shared with
university and school district personnel.
Summary
Most students who wrote the grade 7 Math Achievement test will be in grade 8 this year
and their files will have been forwarded to a secondary school. To complete this study, I
will need to know, for each, the final and term grades they received in grade 6 (ie. 19992000). I will also need to know the school from which they came so that I can match

203

j^{)pendîx E
Table 22
Orffer q/^D^cW fy w?(/ Dzfcn/wMoffOM, Grodle J

Iteman
15
21
1

26
30
19
13
11
14
25
16
18
8
28
22
29
3
27
17
9
4
5
6
24
7

12
10
20
23
2

D ifB cully
Bigsteps 12PAscal 3PAscal
15
1 15
15
1
21
1
1
21
21
26
30
30
14
30
26
31
19
19
13
13
13
26
11
11
8
25
25
19
14
14
11
16
16
18
18
18
25
8
8
16
28
28
29
22
22
29
3
29
28
3
3
27
27
17
17
17
22
27
9
4
9
32
4
5
5
4
5
9
24
7
6
24
6
24
7
7
6
23
12
23
10
10
10
12
23
12
20
20
20
2
2
2
35
36
34
33

Order
1st
2nd
3rd
4th
5th
6*^
7th
86
9tb
lOth
116
126
136
146
156
166
176
186
196
20 6
21st
22nd
23rd
25 6
266
27 6
28 6
29 6
306
31st
32nd
33rd
346
356
36 6

205

Iteman

28
25
26
16
29
3
27
18
19
17

14
11
8
13
5
21
9
4
30
1

22
7
24
15
6
23
3Ô
12
20
2

D iscrim ination
Bigsteps 2PAscal 3PAscal
34
33
28
14
28
25
25
28
17
17
30
31
3
27
8
27
3
29
35
29
23
3
26
26
15
5
5
25
16
29
26
18
2
18
36
32
14
16
5
4
4
17
11
24
13
19
18
4
8
10
27
9
19
21
13
9
11
7
21
16
24
19
20
30
23
14
7
11
2
23
12
7
6
8
24
10
13
10
1
6
9
20
21
12
30
6
22
1
12
2
22
1
15
15
22

This table summarizes the order o f the difB culty and discrim ination determined by the
d ifk re n t programs fo r the grade 5 DM AT. The items w ith the greatest difB culty and the items
w ith the greatest discrirnination ^tpear at the top o f the table.

Table 23
ofO rdbr o f

Iteman
9
11
12
16
25
8
20
10
24
14

2
18
19
21
6
1
22
3
23
4
13
15
17
7
5

o w f

Dironmmotfon, Gradle 7

D ifB culty
Bigsteps 2PAscal 3PAscal
9
16
16
11
9
9
12
11
11
16
12
25
25
12
25
8
8
8
10
20
20
20
10
10
24
24
24
14
2
14
31
28
2
14
18
30
18
2
18
32
19
19
21
21
22
19
6
6
6
1
1
22
1
22
21
34
3
3
3
23
17
23
4
4
17
13
7
13
23
4
15
7
17
13
7
15
15
35
5
1
5
5

Order
1st
2nd
3rd
4th
5th
6"^
7th
8th
9th
10th
11th
12th
13th
14th
15th
16th
17th
18th
19th
20th
21st
22nd
23rd
24th
25th
26th
27th
28th
29th
30 6
30th

206

Iteman
18

10

14

2
22
19
20
6
24
1
21
8
3
12
17
25
9
11
7
23
4

D iscrim ination
Bigsteps 2PAscal 3PAscal
29
10
18
11
33
31
18
22
16
26
32
34
22
10
2
35
27
28
30
14
7
10
2
19
9
19
17
20
1
14
18
20
2
22
6
7
20
3
12
6
7
14
1
9
5
19
17
3
25
25
11
6
24
9
17
11
12
1
4
24
3
12
4
5
8
8
8
21
24
21
21
23
25

33
29
27
26

1
1
1
1

1 3^d
1 33rd
1 34th
1 356

1
1
1
1

16
5
15
13

5
15
16
13

1
1
1
1

23
15
16
13

4

23
15
13

This table summarizes the order o f the difB culty and discrim ination detam ined by the
diSerent programs fa r the grade 7 DM AT. The items w ith the greatest difB culty and the items
w ith Ae greatest discrim ination appear at the top o f the charL

207

Af^iendix F
CoefScient A lpha can be used fo r dichotomous test items (iu which case it is the same as KuderRichardson form ula 20) and 6)r m ulti-point test items. The form ula (Sax, p282) is:
« =

n
n -1

X

Where n = number o f item on the test
^^2)" —variance o f scores on the test
o f the variances on each item ; that is
:

(Z M

Z S D >
Where W = number o f examinees
/ = hequency
score

Spearman-Brown form ula:
2rA
rs = ------rA + 1
where rs = sp lit h a lf re lia b ility
rh = correlation between two halves o f the test
Chi-Square Goodness to F it (H urlburt, 1998)

^
E,
where O = is the observed value
E = the expected value

208

Appendix G

Table 24

TobZe

Gra:/e i DA<4T

Pedtems &
30
Relations
Multiple Choice - Part 1

lesbon
1
2
3
4
5
6
7

8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

IRP Strand
Numt)er
Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Shape & Space
Shape & Space
Shape & Space
Shape & Space
Shape & Space
Statistics &
Probatxiity
Statistics&
Probability
Statistics &
Probability
Statistics&
Probability
Patterns &
Relations
Patterns &
Relations
Patterns &
Relations
Patterns &
Reladons

Ma/Aemafzcf ZRP

Patterns

Average

HO

Degree of
Difficulty
Average
Easy
Average
Average
Average
Easy
Easy
Difficulty
Average
Easy
Average
Easy
Average
Difficulty
Difficulty
Difficulty
Average
DifRculty
Average
Difficulty

Ministry of
Education*
K
K
A
A
K
K
A
HO
K
K
A
K
K
A
K
A
K
K
K
HO

5

Easy

K

Data Anal^is

1

Average

K

Data Analysis

3

Easy

K

Data Analysis

2 ,3

Average

K

Chance & Uncertainty

1.2

Average

K

Patterns

2

Difficulty

HO

Patterns

2

Average

A

Patterns

2

Average

HO

Patterns

1.2

Difficulty

HO

Learning Outcome
6 ,7
4
2
6
3
3

Substrand
Numt)er Concepts
Number Concepts
Number Concepts
Numtier Concepts
Number Concepts
Numt)er Concepts
Numtier Operations
Numt)er Concepts
Number Operations
Number Operations
Number Operabons
Number Concepts
Number Concepts
Numt)erOper8tk)ns
Number Concepts
Number Operations
Measurement
Measurement
Measurement
Measurement
3D Objects & 2D
Shapes

1

5
2, 3 ,4
4
3
10
10
8
9,10
8
2
3
9
5 ,7

209

ShoaAn»M#-Pad2
Question
31a
31b
32a
32b
33
34a
34b
34c
35

IRP Strand
Shape & Space
Number
Shape & Space
Shape & Space
StatisbcsA
Probability
Shape & Space
Shape & Space
Shape & Space
Shape & Space

36a

Shape & Space
Shape & Space

Substrand
Measurement
Number Operations
Measurement
Measurement

beaming Outcome
14
1,2
14
14

Data Analysis
Measurement
Measurement
Measurement
Transformations
3D0tyectS&2D
Shapes
3D Objects & 2D
Shapes

Degree of
Difficulty

Ministry of
Education*
A
A
K
K

2,3
6
6
4
1.2

A
HO
HO
HO
A

2

K

2

A

K is Ministry of Education Knowledge (Bloom's Knrwledge)
A is Ministry of Education Application - Bloom's Comprehension and Application
HO is Ministry of Education Higher Order Reasoning - Bloom's Analyse, Synthesis and Evaluation

210

Table 25
TaWe

fAe Grade 5 DM4 7^ JCüfiMgMd/KMadcf frw e ffe f
MuRipie Choice - Part 1

2
3

WCC Strand
Number
Numt)er
Number

Sutatrand
Numt)er Concepts
Number Concepts
Numtier Operations

4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Number
Shape & Space
Shape & Space
Shape & Space
Shape & Space
Shape & Space
Statisbcs & Prot)ability
Statistics & Prot)abllity
Statmtics & Pnot)abillty
Statistics & Prob^lity
Patterns & Relations
Patbms & Relations
Patterns & Relations
Patterns & Relations
Patterns & Relations

Number Operations
Nunnber Concepts
Number Concepts
Number Operations
Numt)er Concepts
Number OperaUons
Numt)er Operations
Number Operations
Numt)er Concepts
Number Concepts
Number Operations
Number Concepts
Number Operations
Measurement
Measurement
Measurement
Measur^nent
30 Otyects & 2 0 Shapes
Oata Analysis
Oata Analysis
Oata Analysis
Chance & Uncertainty
Patterns
Patterns
Patterns
Patterns
Patterns

ion
1

Learning
Outcome
6 ,7
3
12
12. 13,
14
5
5
12
8
15
15
14
11
11
19
10
19
2
3
9
5
21
1
3
2
5
2
2
2
1
2

Mathematics
Processes**
C .V
V
C, PS, R, V

Learning
Outcome
14
12
14
14
3
6
6

Mathematical
Processes**
C
C, PS, R, V
C
C

C, PS. R. V
C
C
C, PS, V
E
E
E
C, PS, V
CN, R, V
CN, R, V
C, PS, V

c,v
C, PS. V
E ,R ,V
C
PS
E. PS, R
E,V
PS

c,v
PS,V
C ,R
C, PS, R
C, PS, R
C, PS. R
C, PS, R
C, PS, R

Short Answer - Part 2
Question
31a
31b
32a
32b
33
34a
34b

WCC Strand
Shape & Space
Number
Shape & Space
Shape&Space
StatBtics & Prot)at)ility
Shape & Space
Shape & Space

Substrand
Measurement
Numt)er Operations
Measurement
Measurement
Oata Analysis
Measurement
Measurement

211

c,v
PS, R
PS.R

34c
35
36a
36b

Shape & Space
Shape & Space
Shape&Space
Shape&Space

Measurement
Transformations
3D Objects & 2D Shapes
3D Objects & 2D Shapes

4
24
17
18

E, PS
C, CN
E, PS, V
CN, V

** Mathematics Processes: Taken frwn Western Canadian Protocol for Collaboration in Basic Education
p.4
C - Communication
CN-Connections
E - Estimation & Mental Mathematics
PS - Problem Solving
R - Reasoning
T - Technology
V - Visualization

212

Table 26
Grodle 7 D&MT %mg fAe MafAemaficf ZRf
Multiple Choice - Part 1
lesdon

IRP Strand

1
2

9
10
11
12

Shape & Space
Number
Patterns &
Relations
Patterns &
Relations
Statistics &
Probability
Pattems&
Relations
Shape & Space
Shape & Space
Statistics &
Probat)ility
Number
Number
Number

13
14
15
16
17
18
19
20
21
22

Shape & Space
Number
Number
Number
Number
Number
Numba^
Number
Number
Number

23

Shape & Space
Statistics &
ProbabiBty
StatisdcsA
Probability

3
4
5
6
7
8

24
25

Sut)strand
3D Objects & 2D
Shapes
Numt)er Operations

Learning
Outcome
1

Difficulty

Ministry of
Education*

Easy

K
K

Easy

A

1

Variables & Equations

2

Patterns

5

Data Analysis

7

Easy

A

Patterns
Measurement
Measurement

5
2
3

Average
Average
Easy

HO
K
K

Data Analysis
Number Operations
Number Operations
Number Operations
3D Objects & 2D
Shapes
Number Crmcepts
Numt)erCorx%ptB
Number Concepts
Number Concepts
Numtier Concepts
Numt)er Concepts
Number Concepts
Number Concepts
Numlaer Operations
3D Objects & 2D
Shapes

8
1

HO

Easy
A
Difficulty A
1 Average A
1 Average A

2 Difficulty
3 Difficulty
12 Easy
3 Average
2 Average
11 Easy
6 Average
3
1 Easy
1 Average

K
A
K
A
K
K
K
A
K
HO

10

Easy

K

Charx» & Uncertainty

3

Average

A

Data Analysis

6

Average

K

Learning
Outcome

DIfRculty

Ministry of
Education*

Shcxt Ansvwer - Part 2
uestlon
26
27a
27b

IRP Strand
Patterns &
Relations
Stabstics&
Prot>at)ility
StatisUcs &

Substrand
Patterns

6 ,8
6 ,8

Data Analysis
Data Analysis

213

28
29a
29b
29c
30
31
32
33
34a
34b
35a
35b

Probability
Number
Number
Numt)er
Number
Shape&Space
Number
Shape&Space
Shape & Space
Shape & Space
Shape&Space
Shape & Space
Shape & Space

Numt)er Concepts
Numtrer Operations
Number Operations
I
Number Operations
| 1
Measurement
Number Operations
Transformations
3D Objects & 2D Shapes
Transformations

3
1
1

Measurement
Measurement

7
7

3
1
2
2

K is Ministry of Education Knowledge (Bloom's Knowledge)
A is Ministry of Education AppRcaUon (Bloom's Comprehension and Application)
HO Is Ministry of Education Higher Order (Bloom's Analysis, Synthesis and Evaluadon)

214

Table 27
7W»k

fAg Grodlp 7 DM4 71 ZWïMg AWAemaficf
Multiple Choice - Part 1

Quesîkm
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

WCC Strand
Shape&Space
Number
Patterns & Relations
Patterns & Relations
Statktks&ProbabKty
Patterns & Relations
Shape&Space
Shape & Space
Statistics &Prob8bilNy
Numt)er
Number
Number
Shape&Space
Number
Number
Number
Number
Number
Number
Number
Number
Number
Shape&Space
Statistics & Piobabilily
Statistics & Probability

Sutrstrand
3D Objects & 2D Shapes
Number Operations
Variables & Equations
Patterns
Data Analysis
Patterns
Meœurement
Measurement
Data Analysis
Number Operations
Number Operations
Number Operations
3D Objects & 2D Shapes
Numtw Concepts
Number Concepts
Number Conceits
Number Concepts
Number Concepts
Numt)er Concepts
Number Concepts
NunAer Concepts
Numtier Operations
Measurement
Chance & Uncertmnty
Data Analysis

Learning
Outcome
14
12
6
1
7
1
1

3
8
12
13
13
15
4
10
3
8
9
5
4
1
12
12
11
6

Math Processes
C
PS, R. T
PS,R
C, R .V
C, E. PS. R. V
C, R ,V
CN, PS. R
CN, PS, R
C,CN
PS, R, T
E, PS, R
E, PS, R
V
C, PS, R
C. CN, R, V
R
E
C, R .V
R
C, PS, R. V
C^CN
PS, R, T
E
CN, R
C. T .V

Short Answer - Part 2
Question
26
27a
27b
28
29a
29b
29c
30
31
32
33
34a
34b

WCC Strand
Statistics &Prot)ab*ity
Statisbcs & ProbabHlty
Statistics & Probability
Number
Number
Number
Number
Shape&Space
Number
Shape & Space
Shape&Space
Shape & Space
Shape&Space

Subsband
Data Analysis
Data Analysis
Data Analysis
Numt)er Concepts
Number Operations
Number Operations
Number Concepts
Measurement
Number Operations
Transformations
3D Objects & 2D Shapes
Transformations
Transformations

215

Learning
Outcome
3
3
6
4
12
12
9
3
12
19
17
20
20

Math
Processes"*
C, PS, T
C, PS. T
C .T .V
C, PS, R, V
PS. R, T
PS. R ,T
C, R, V
CN, PS. R
PS, R .T
C, T ,V
PS, T .V
PS.V
PS,V

3Sa
35b

Shape&Space
Shape & Space

Measurement
Measurement

12 E
12 E

Mathematical Processes: Taken from Westem Canadian Protocol for Collaboration in Basic Education
p.4
C - Communication
C N - Connections
E - Estimation & Mental Mathematics
PS - Prdalem Solving
R - Reasoning
T - Technology
V - Visualization

216