Comparing different assessment formats in undergraduate mathematics

by Belinda Huntley

Submitted in partial fulfilment of the requirements for the degree Philosophiae Doctor in the Department of Mathematics and Applied Mathematics, Faculty of Natural and Agricultural Sciences, University of Pretoria

Pretoria, April 2008

© University of Pretoria

DECLARATION

I, the undersigned, hereby declare that the thesis submitted herewith for the degree Philosophiae Doctor to the University of Pretoria contains my own, independent work and has not been submitted for any degree at any other university.

Name: ………………………………… Belinda Huntley
Date: …………………………………..

ABSTRACT

In this study, I investigate how successful provided response questions, such as multiple choice questions, are as an assessment format compared with the conventional constructed response questions. Based on the literature on mathematics assessment, I firstly identify an assessment taxonomy, consisting of seven mathematics assessment components, ordered by cognitive levels of difficulty and cognitive skills. I then develop a theoretical framework for determining the quality of a question with respect to three measuring criteria: discrimination index, confidence index and expert opinion. The theoretical framework forms the foundation against which I construct the Quality Index (QI) model for measuring how good a mathematics question is. The QI model gives a quantitative value to the quality of a question. I also give a visual representation of the quality of a question in terms of a radar plot. I illustrate the use of the QI model for quantifying the quality of mathematics questions in a particular undergraduate mathematics course, in both of the two assessment formats – provided response questions (PRQs) and constructed response questions (CRQs). I then determine which of the seven assessment components can best be assessed in the PRQ format and which can best be assessed in the CRQ format. In addition, I also investigate student preferences between the two assessment formats.

Keywords: Mathematics assessment, Quality Index, good mathematics questions, assessment components, assessment taxonomies, provided response questions, constructed response questions, multiple choice questions.

DEDICATION

"Yea, if thou criest after knowledge, and liftest up thy voice for understanding; if thou seekest her as silver, and searchest for her as for hidden treasures; then shalt thou understand the fear of the Lord, and find the knowledge of God. For the Lord giveth wisdom; out of His mouth cometh knowledge and understanding".

PROVERBS 2: 3 - 6

ACKNOWLEDGEMENTS

The author would hereby like to thank all people and organisations whose assistance and co-operation contributed to the completion of this thesis, and in particular:

My supervisor, Professor Johann Engelbrecht, for setting high professional standards which provided the much-needed challenge and motivation, and for his interest and moral support.

My co-supervisor, Professor Ansie Harding, for her invaluable guidance and expert assistance throughout the period of this research.

Elsie Venter, a senior lecturer from the Centre for Evaluation and Assessment, School of Education, University of Pretoria, for introducing me to the Rasch method of data analysis and for her assistance in analysing my research data.

Marie Oberholzer, for editing and type-setting the final draft of my thesis with great care and diligence.
My parents, Roland and Daisy Hill, for their prayers of upliftment and loving support.

My husband, Brian, and children, Byron, Christopher and Cayla, for their total devotion and patience and on-going faith in my abilities.

INDEX OF TABLES

Table 1.1  Student numbers and pass rates for undergraduate mathematics courses, 2000-2004 .... 8
Table 1.2  Exit level outcomes (ELOs) .... 266
Table 1.3  Associated assessment criteria (AAC) .... 267
Table 1.4  Critical cross-field outcomes (CCFOs) .... 268
Table 2.1  MATH Taxonomy .... 26
Table 3.1  MATH109 student interviewees and their academic backgrounds .... 87
Table 3.2  Probabilities of correct response for persons on items of different relative difficulties .... 102
Table 5.1  Mathematics assessment component taxonomy and cognitive level of difficulty .... 137
Table 5.2  Mathematics assessment component taxonomy and cognitive skills .... 138
Table 5.3  Decision matrix for an individual student and for a given question, based on combinations of correct or wrong answers and of low or high average CI .... 154
Table 5.4  Classification of difficulty intervals .... 169
Table 6.1  Characteristics of tests written .... 178
Table 6.2  Misfitting and discarded test items .... 269
Table 6.3  Component analysis – trends .... 232
Table 7.1  A comparison of the success of PRQs and CRQs in the mathematics assessment components .... 244

INDEX OF FIGURES

Figure 2.1  SOLO Taxonomy .... 28
Figure 2.2  Classification according to lecturer's purpose .... 29
Figure 2.3  Learning-required classification .... 30
Figure 2.4  De Lange's level of understanding .... 31
Figure 2.5  Cycle of formative and summative assessment .... 37
Figure 2.6  Integrated assessment .... 47
Figure 3.1  Number of misreadings of nine subjects in two tests .... 92
Figure 3.2  How differences between person ability and item difficulty ought to affect the probability of a correct response .... 98
Figure 3.3  The item characteristics curve .... 99
Figure 3.4  Item characteristic curve of the dichotomous Rasch model .... 103
Figure 3.5  Mathematics I Major (MATH109) assessment programme .... 110
Figure 5.1  Illustration of confidence deviation from the best fit line between item difficulty and confidence .... 161
Figure 5.2  Illustration of expert opinion deviation from the best fit line between item difficulty and expert opinion .... 163
Figure 5.3  Visual representation of the three axes of the QI .... 164
Figure 5.4  Quality index for PRQ .... 165
Figure 5.5  A good quality item .... 166
Figure 5.6  A poor quality item .... 167
Figure 5.7  Distribution of six difficulty levels .... 168
Figure 7.1  A good quality item .... 238
Figure 7.2  A poor quality item .... 238
Figure 7.3  A difficult, poor quality item .... 239
Figure 7.4  An easy, good quality item .... 239

TABLE OF CONTENTS

DECLARATION .... i
ABSTRACT .... ii
DEDICATION .... iii
ACKNOWLEDGEMENTS .... iv
INDEX OF TABLES .... v
INDEX OF FIGURES .... vi

CHAPTER 1: INTRODUCTION
1.1 Purpose of study .... 1
1.2 Statement of problem .... 2
1.3 Significance of the study .... 4
1.4 Context of this study .... 7
1.5 Outline of study .... 11

CHAPTER 2: LITERATURE REVIEW
2.1 Terminology .... 15
2.2 The changing nature of university assessment in the South African context .... 17
2.3 Assessment models in mathematics education .... 21
2.4 Assessment taxonomies .... 24
2.5 Assessment purposes .... 33
2.5.1 Diagnostic assessment .... 33
2.5.2 Formative assessment .... 33
2.5.3 Summative assessment .... 35
2.5.4 Quality assurance .... 37
2.6 Shifts in assessment .... 38
2.7 Assessment approaches .... 39
2.7.1 The traditional approach .... 40
2.7.2 Computer-based (online) assessment .... 40
2.7.3 Workplace- and community-based/learnership assessment .... 44
2.7.4 Integrated or authentic assessment .... 44
2.7.5 Continuous assessment .... 48
2.7.6 Group-based assessment .... 49
2.7.7 Self-assessment .... 49
2.7.8 Peer-assessment .... 50
2.8 Question formats .... 51
2.9 Constructed response questions and provided response questions .... 52
2.10 Multiple choice questions .... 56
2.10.1 Advantages of MCQs .... 60
2.10.2 Disadvantages of MCQs .... 63
2.10.3 Guessing .... 67
2.10.4 In defense of multiple choice .... 69
2.11 Good mathematics assessment .... 70
2.12 Good mathematics questions .... 74
2.13 Confidence .... 77

CHAPTER 3: RESEARCH DESIGN AND METHODOLOGY
3.1 Research design .... 82
3.2 Research questions .... 84
3.3 Qualitative research methodology .... 85
3.3.1 Qualitative data collection .... 86
3.4 Quantitative research methodology .... 89
3.4.1 The Rasch model .... 89
3.4.1.1 Historical background .... 91
3.4.1.2 Latent trait .... 96
3.4.1.3 Family of Rasch models .... 101
3.4.1.4 Traditional test theory versus Rasch latent trait theory .... 105
3.4.1.5 Reliability and validity .... 107
3.4.2 Quantitative data collection .... 109
3.5 Reliability, validity, bias and research ethics .... 115
3.5.1 Reliability of the study .... 115
3.5.2 Validity of the study .... 116
3.5.3 Bias of the study .... 118
3.5.4 Ethics .... 119

CHAPTER 4: QUALITATIVE INVESTIGATION
4.1 Qualitative data analysis .... 122
4.2 Qualitative investigation .... 122

CHAPTER 5: THEORETICAL FRAMEWORK
5.1 Mathematics assessment components .... 135
5.1.1 Question examples in assessment components .... 138
5.2 Defining the parameters .... 149
5.2.1 Discrimination index .... 150
5.2.2 Confidence index .... 153
5.2.3 Expert opinion .... 157
5.2.4 Level of difficulty .... 159
5.3 Model for measuring a good question .... 160
5.3.1 Measuring criteria .... 160
5.3.2 Defining the quality index (QI) .... 163
5.3.3 Visualising the difficulty level .... 167

CHAPTER 6: RESEARCH FINDINGS
6.1 Qualitative data analysis .... 172
6.1.1 Methodology .... 172
6.2 Data description .... 178
6.3 Component analysis .... 179
6.4 Results .... 232
6.4.1 Comparison of PRQs and CRQs within each assessment component .... 232

CHAPTER 7: DISCUSSION AND CONCLUSIONS
7.1 Good and poor quality mathematics questions .... 235
7.2 A comparison of PRQs and CRQs in the mathematics assessment components .... 239
7.3 Conclusions .... 242
7.4 Addressing the research questions .... 244
7.5 Limitations of study .... 247
7.6 Implications for further research .... 248

REFERENCES .... 251

APPENDIX
Appendix A1  Declaration letter .... 265
Appendix A2  Table 1.2: Exit level outcomes (ELOs) of the undergraduate curriculum .... 266
Appendix A3  Table 1.3: Associated assessment criteria (AAC) .... 267
Appendix A4  Table 1.4: Critical cross-field outcomes (CCFOs) .... 268
Appendix A5  Table 6.2: Misfitting and discarded test items .... 269
Appendix A6  Test items Rasch statistics .... 270
Appendix A7  Confidence level items Rasch statistics .... 274
Appendix A8  Item analysis data .... 279

CHAPTER 1: INTRODUCTION

1.1 PURPOSE OF STUDY

The quickest way to change student learning is to change the assessment system (Biggs, 1994, p5).

The purpose of this research study is to investigate to what extent alternative assessment formats, such as the provided response question (PRQ) format, in particular multiple choice questions (MCQs), can successfully be used to assess undergraduate mathematics. For this purpose I firstly develop a model to measure how good a mathematics question is. To my knowledge, no such model currently exists and such a measure of the quality of a question is original. The objective is then to use the proposed model to determine whether all undergraduate mathematics can be successfully assessed. For this purpose a taxonomy of assessment components of mathematics is developed to enable us to identify those components of mathematics that can be successfully assessed using alternative assessment formats.
Where this is not the case, the proposed model is used to determine whether the conventional constructed response questions (CRQs) format is more suitable for assessment purposes. By using the proposed model to compare the PRQ assessment format with the more conventional, open-ended CRQ assessment format applied in tertiary first year level mathematics courses, I attempt to address the research question of whether we can successfully use PRQs as an assessment format in undergraduate mathematics. One of the aims of tertiary education in mathematics should be to develop proficiency within all components of mathematics. A greater knowledge of the suitability of question formats within different components can assist educators and assessors to improve their assessment programmes, enhancing problemsolving abilities, reducing misconceptions, restricting surface learning and simultaneously improving the efficacy of marking and maintaining standards in a 1 first year tertiary mathematics course with large student numbers, as described in this study. This research study aims to assist mathematics educators and assessors in reducing their large marking loads associated with continuous assessment practices in first year undergraduate mathematics courses, by determining in which of the assessment components the PRQ assessment format can be used successfully, without undermining the value of assessment of undergraduate mathematics courses. 1.2 STATEMENT OF PROBLEM In South Africa, as in the rest of the world, higher education has been forced to respond to the demands placed on the sector by two late modern imperatives, globalisation and massification of education (Luckett & Sutherland, 2000). In Southern Africa, and in particular South Africa, the accessibility of higher education to the masses has a particularly moral dimension, as it implies the need to respond to the historical inequalities of the past apartheid era, by making the higher education sector accessible to previously disadvantaged black and working class communities. The apartheid government in South Africa attempted to limit access by black students by excluding them from most higher education institutions, imposing a quota system and by establishing institutions that are now regarded to be ‘historically disadvantaged’ universities (Makoni, 2000). With the consolidation of democracy, economic and political changes are taking place at the same time as the radical rethinking of the educational philosophies underlying higher education. Higher education needs to be more open, flexible, transparent and responsive to the needs of underprepared, lifelong and part-time learners (Luckett & Sutherland, 2000). This statement has implications for appropriate assessment practices in higher education. My interest in different forms of assessment at the first year level in undergraduate mathematics grew out of my role as a lecturer and coordinator of the Mathematics I Major course at the University of the Witwatersrand. In South Africa, the socio-economic and policy contexts emerging from the post-colonial 2 and post-apartheid reconstruction, pose enormous challenges for assessment practices in higher education. With more and more students being drawn to higher education, the numbers of first year undergraduate students studying tertiary mathematics are increasing rapidly. The growth in numbers of students enrolling for first year mathematics courses is not unique to the School of Mathematics at Wits University, in which the study was based. 
In a study conducted by Engelbrecht and Harding (2002), it was observed that this increase in first year enrolment numbers in mathematics is a national trend over the past decade in South African universities. At first year level Mathematics is regarded as a pre-requisite for many courses and is considered essential for students who venture into engineering and many other fields of technology. With this increase in student numbers, one of the challenges facing academics is that the more conventional open-ended constructed response questions (CRQ) assessment format is placing increased pressure on academic staff time. The assessment load created by increasing numbers of students and the shift in thinking towards competency frameworks are among the most prominent of many pressures. Improving student learning, encouraging deep rather than surface learning and nurturing critical abilities and skills all require time. However, in an expanding higher education system with increased student numbers and large classes, the conscientious educator is faced with a problem. Larger classes lead to more marking and, if properly done, takes more time. While lecturers can usually handle many more students in a lecture, the corresponding increase in their marking loads is another matter entirely. Continuous assessment of large undergraduate mathematics classes, which is generally considered as essential, can no longer be afforded because of the corresponding huge marking load. Alternatives have to be found. As the sizes of first year mathematics classes increase, so does the teaching load and especially the marking load. Decreasing the amount of feedback to each student in order to complete the task in the limited time available is clearly undesirable, given the great potential of feedback in assessment (Boud, 1995). The notion of ‘working smarter, not harder’ (Brown & Knight, 1994) should be 3 pursued. If assessment is to be a useful part of the learning experience of students, it is beneficial to employ a fairly diverse variety of assessment types and formats. The implementation of alternative assessment formats such as provided response questions (PRQ), including multiple choice items, matching and the single-response item assessment format, amongst others, is gathering support. Firstly, their simplicity is such that implementation for marking by computer, either through optically marked response sheets, or directly online is straightforward. Processing through optically marked recorders is fast, easy and is amenable to a variety of analysis. Secondly, scoring is immediate and efficient. PRQs can be very useful for diagnostic purposes for helping students to see their strengths and weaknesses. Thirdly, as this study aims to show, PRQs can be constructed to evaluate higher order levels of thinking and learning, such as integrating material from several sources, critically evaluating data and contrasting and comparing information. 1.3 SIGNIFICANCE OF THE STUDY In South Africa, as in the rest of the world, the changes in society and technology have imposed pressures on academics to review current assessment approaches. In these years of post-colonial and post-apartheid reconstruction in South Africa, academics are tasked with ensuring that graduates are able to apply their knowledge outside of the tertiary environment and to communicate and apply that expertise in a wide range of contexts (Makoni, 2000). 
Changes in educational assessment are currently being called for, both within the fields of measurement and evaluation as well as in specific academic disciplines such as mathematics. Geyser (2004, p90) summarises the paradigm shift that is currently under way in tertiary education as follows: The main shift in focus can be summarized as a shift away from assessment as an add-on experience at the end of learning, to assessment that encourages and supports deep learning. It is now important to distinguish between learning 4 for assessment and learning from assessment as two complementary purposes of assessment…. Assessment should be seen as an integral and vital part of teaching and learning. An emerging vision of assessment is that of a dynamic process that continuously yields information about student progress toward the achievement of learning goals (NCTM, 1995). This vision of assessment acknowledges that when the information gathered is consistent with learning goals and is used appropriately to inform instruction, it can enhance student learning as well as document it (NCTM, 2000). Rather than being an activity separate from instruction, assessment is now being viewed as an integral part of teaching and learning, and not just the culmination of instruction (MSEB, 1993). Assessment drives what students learn (Hubbard, 1997). Every act of assessment gives a message to students about what they should be learning and how they should go about it. It controls their approach to learning by directing them to take either a surface approach or a deep approach to learning (Smith & Wood, 2000). Students gear their learning processes to be effective for the type of assessment they will undergo. They will seek and request teaching methods that will best fulfil their ability to respond to the assessment. Because assessment is often viewed as driving the curriculum and students learn to value what they know they will be tested on, we should assess what we value. The type of questions we set show students what we value and how we expect them to direct their time (Hubbard, 1995). This study attempts to define the concept of a ‘good’ or successful question which can be used to successfully assess mathematics in both the PRQ and CRQ formats. Assessment must be linked to and be evidence of the levels of learning and in particular the learning outcomes and competencies required. Assessment defines for students what is important, what counts, how they will spend their time and how they will see themselves as learners. If you want to change student learning, then change the methods of assessment (Brown, Bull & Pendlebury, 1997, p6). 5 The more data one has about learning, the more accurate the assessment of a student’s learning. Assessment forms a critical part of a student’s learning. Student assessment is at the heart of an integrated approach to student learning (Harvey, 1992, p139). Mathematics at tertiary level remains conservative in its use of alternative formats of assessment. As goals for mathematics education change to broader and more ambitious objectives (NCTM, 1989), such as developing mathematical thinkers who can apply their knowledge to solving real problems, a mismatch is revealed between traditional assessment and the desired student outcomes. It is no longer appropriate to assess student knowledge by having students compute answers and apply formulas, because these methods do not reveal the current goals of solving real problems and using mathematical reasoning. 
During the period of this study (2004-2006) enrolment numbers for the first year mainstream mathematics course were large, with numbers between 400 to 500 students in each year. These large numbers placed increased pressures on academic staff time. In particular, the more conventional open-ended CRQ assessment format, which was the predominant method of assessment, resulted in very large marking loads. Recent expansions in student numbers have tended to result in an increase in teaching class sizes accompanied by a reduction in small group tutorial provisions. The wider access to higher education together with increased recruitment of tertiary students, have added to the burden of making provision both for larger groups and for individuals. This challenge led me to re-evaluate current assessment practices and to explore alternative assessment approaches. I hope that, based on the research findings, more support will be gained for assessment using the provided response (PRQ) format in undergraduate mathematics. Perhaps it is time for those involved in course co-ordination and curriculum design of large undergraduate mathematics courses to examine the learning benefits and experiment with changes in assessment. Computer 6 assisted multiple choice testing can provide a means of preserving formative assessment within the curriculum at a fraction of the time-cost involved with written work. Furthermore, developing a model by which to measure the quality of a question (PRQ or CRQ) is of great benefit to the successful assessment of such large undergraduate mathematics courses, improving the efficacy of the marking with respect to both time and quality. No such measure currently exists and such a model can be used to measure the quality of questions, either in PRQ or CRQ format. A greater knowledge of the quality of questions within the assessment components can assist mathematics educators and assessors to improve their assessment programmes and enhance student learning in mathematics. 1.4 CONTEXT OF THIS STUDY In this study, I firstly investigate how we can measure whether a mathematics question is of a good quality or not. Three measuring criteria are used to develop a model for determining the quality of a question. Secondly, using this model, the quality of all PRQs and CRQs are determined. Thirdly, a comparison is made within each mathematics assessment component, between the PRQ assessment format and the CRQ assessment format. Furthermore, I investigate student preferences regarding the different assessment formats, both PRQ and CRQ, in a first year mainstream mathematics course at the University of the Witwatersrand in Johannesburg, South Africa. University of the Witwatersrand The study is set within the milieu of a first year mathematics course (Mathematics I Major) at the University of the Witwatersrand over the period July 2004 to July 2006. The University of the Witwatersrand is a major researchorientated South African institution that draws its students from diverse socioeconomic backgrounds and a wide range of high schools (Adler, 2001). For example, some students come from schools which for the last several years have had close to 100% matriculation (Grade 12) pass rate; others come from 7 schools where the overall pass rate at the matriculation level over the last few years is less than 60%. School of Mathematics The School of Mathematics at the University of the Witwatersrand offered a three-year mathematics major course in the BSc, BA and BCom degrees between 2000 and 2004. 
From 2005 onwards, two majors were offered, Mathematics and Mathematics Techniques, the latter aimed at students wishing to pursue careers in mathematics teaching; this minor academic development recognises the de facto distinction between the two essentially distinct suites of topics and their outcomes. Student registrations in the School of Mathematics have increased by 73% since 2000, in line with an increase in registrations at the University of the Witwatersrand. In 2004, over 3400 students registered in the School of Mathematics and mathematics student numbers accounted for about 18.5% of the Faculty of Science. The average pass rate in the School of Mathematics was at the 70% level over the period of this study. A summary of course registration figures is given in Table 1.1.

Table 1.1: Student numbers and pass rates for undergraduate mathematics courses, 2000-2004.

Year                            2000   2001   2002   2003   2004
Actual student course numbers   1998   2666   3203   3383   3447
Course Pass                     1439   2053   2338   2402   2413
Course Fail                      550    594    832    948   1017
Course Pass Rate (%)              72     77     73     71     70
Course Cancelled                 236    382    241    272    263

(Source: Executive Information System, School of Mathematics, Academic Review, University of the Witwatersrand)

First year Mathematics Major (MATH109)

The first year Mathematics Major course (MATH109) has a minimum entry level of a Higher Grade C symbol in Grade 12 mathematics. MATH109 has two compulsory components, Calculus and Algebra, both taught and tested throughout the year with a final examination in November. The Mathematics I Major course, MATH109, is intended both for students who wish to become professional mathematicians or high school mathematics teachers and for students who need to complete the course as a co-requisite to other courses in the Science Faculty such as Physics or Computer Science. Students who are studying the Biological Sciences do not generally take the Mathematics I Major course. They do a less theoretical, more skill-oriented first year Ancillary Mathematics course and they cannot proceed to a second year of mathematics. The MATH109 course is compulsory for students entering degree courses in mathematics, computing, actuarial science, economics and statistics, but it also attracts students from the biological sciences, humanities, education and business. This course thus attracts the kind of diversity now commonly found in undergraduate tertiary mathematics. Students' interests, levels of motivation and mathematical needs vary widely in the group. Although all students in the course have studied Grade 12 Higher Grade mathematics, the students emanate from a range of schools and thus have a range of mathematical backgrounds. For example, many students have taken Additional Mathematics as an extra subject at school and hence have covered most of the Calculus and Algebra material taught in the first semester. At the other end of the spectrum, students have achieved only the minimum entrance requirements and, due to disadvantaged educational backgrounds, demonstrate weaknesses in some areas of school mathematics such as fundamental algebra, trigonometry, functions and graphing. With the large number of students involved, the teaching in the first year is predominantly in large groups (up to 150 students per class) and each group comprises students from more than one faculty. It is also inevitable that an initial level of attainment and competence in a range of mathematical skills and knowledge is assumed of the class.
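As a side note on Table 1.1 above, the quoted pass rates are consistent with the pass rate being computed as course passes divided by the actual student course numbers, rounded to a whole percent. This is an assumption inferred from the figures themselves, not a formula stated in the source; the minimal check below reproduces the quoted percentages for all five years.

    # Illustrative consistency check on Table 1.1 (assumes pass rate = passes / registrations).
    years       = [2000, 2001, 2002, 2003, 2004]
    registered  = [1998, 2666, 3203, 3383, 3447]   # actual student course numbers
    passed      = [1439, 2053, 2338, 2402, 2413]   # course passes
    quoted_rate = [72, 77, 73, 71, 70]             # course pass rate (%) as quoted

    for year, n, p, q in zip(years, registered, passed, quoted_rate):
        computed = round(100 * p / n)   # e.g. 2000: 100 * 1439 / 1998 = 72.0
        print(f"{year}: computed {computed}% vs quoted {q}%")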
Teaching in large classes is staff-efficient, but little direct provision can be made in lectures or classes to accommodate possible initial deficiencies of individual students where precise and detailed feedback would be valuable. Supplementary assistance through tutorials is used to help students on a more individual basis. The tutorial classes are weekly 45-minute periods during which about 25 students come together in a class with a lecturer or student assistant. The tutorial classes are primarily periods in which the student can consult the lecturer or student assistant on particular tutorial problems or mathematical concepts. The tutorial problems are mathematical exercises which have been set, prior to the tutorial period, by the course co-ordinator (myself, in this instance), and are usually from the prescribed textbook. An important aspect of the MATH109 course is the prescribed Calculus textbook (Stewart, 2000). The textbook has many features advocated by the Calculus Reform Movement: for example, multiple representations of mathematical objects are presented in the textbook, as are real-life applications of many mathematical concepts. Unfortunately, the textbook is still used in a traditional and conservative way: inter alia, students are not allowed to use technology such as graphics calculators or computers in problem-solving or in examinations, and group projects are not considered acceptable components of the assessment programme. However, in 2004, a technology component was introduced in MATH109 in which students learned the rudiments of 'Mathematica'. This teaching innovation, using technology as a tool, had an impact on the assessment programme of MATH109. During the period of my study, the MATH109 assessment programme consisted of four class tests, a mid-year exam and a final examination. The October class record is the cumulative mark for all tests and assignments written before the final exam (continuous assessment). In order to pass MATH109, a student's final year mark must be at least 50%. Prior to the period of my study, assessment of the course had been very traditional, with the CRQ assessment format being the predominant method of assessment. The implementation of alternative assessment formats such as PRQs (including MCQs, matching and single item-response questions) for mathematics assessment was initially met with some resistance by the academic staff of the School of Mathematics at the University of the Witwatersrand. However, as the numbers of first year undergraduate students studying tertiary mathematics increased and the problems surrounding large-scale traditional CRQ format examinations, in particular marking them quickly and efficiently, became more and more acute, the use of the alternative PRQ assessment format gathered support.

Conformity with qualification specifications

The interim registration of the BSc degree under the South African National Qualifications Framework (NQF) requires that graduates have certain skills and abilities. The NQF may briefly be described as a flexible structure for articulating the various levels of the educational enterprise at a national level. Its main purpose is to provide a degree of standardisation and interchangeability of educational qualifications across the country (Dison & Pinto, 2000). The MATH109 course conforms to the NQF requirements. Graduates' skills and abilities are specified as Exit Level Outcomes (ELOs) in Table 1.2, found in Appendix A2.
How these ELOs are assessed constitutes a series of Associated Assessment Criteria (AAC) in Table 1.3, found in Appendix A3. The ELOs and the AAC incorporate the Critical Cross-Field Outcomes (CCFOs) listed in Table 1.4, found in Appendix A4. 1.5 OUTLINE OF STUDY In the purpose of this study outlined in Chapter 1, I indicated that my primary research focus is to develop a model to measure how good a mathematics question is and to use this model to determine to what extent provided response questions (PRQs) and constructed response questions (CRQs) can be used to successfully assess mathematics at undergraduate level. In order to develop this research focus, I discuss and compare different purposes of assessment such as diagnostic, formative and summative. These will be reviewed in the literature review in Chapter 2. Terminology relevant to this study, as well as mathematics assessment components (Niss, 1993) will also be reviewed. Important issues in assessment practices for university 11 undergraduates will be identified (Biggs, 2000). Certain interesting alternative methods of assessment and question types in undergraduate mathematics will be explored (Cretchley, 1999; Anguelov, Engelbrecht, & Harding, 2001; Hubbard, 2001; Wood & Smith, 1999, 2001). In addition, various assessment taxonomies will also be discussed (Biggs & Collis, 1982; Bloom, 1956; Crooks, 1988; De Lange, 1994; Freeman & Lewis, 1998; Hubbard, 1995; Smith, Wood, Crawford, Coupland, Ball & Stephenson, 1996). What the literature on assessment reveals about good assessment practices and the qualities of a “good” question will be presented (Fuhrman, 1996; Haladyna, 1999; Webb & Romberg, 1992). This will become relevant when considering when a question in the assessment of mathematics is considered to be successful. Literature on the issue of confidence will also be presented. Other non-mathematical studies (Hasan, Bagayoko & Kelley, 1999; Potgieter, Rogan & Howie, 2005), where a respondent is requested to provide the degree of confidence he has in his own ability to select and utilise well-established knowledge, concepts or laws to arrive at an answer, will be elaborated upon in the literature review. Having defined the necessary theoretical background in Chapter 2, I introduce new concepts pertinent to my research study in Chapter 3. In this chapter on research design and methodology, I state my research question and subquestions in a more focused way. I describe how I went about investigating my research question and subquestions. The population sample and sampling procedures are described. The organisation of the study discusses both the qualitative and quantitative research methodologies. In particular, an in-depth discussion of the Rasch model (Rasch, 1960) is presented as this is the method of quantitative data analysis used in this research study. Issues of reliability validity, bias and ethics are also discussed. Chapter 4 presents the qualitative investigation which forms part of the qualitative research methodology. The qualitative investigation is in the form of interviews conducted with a representative sample of the target population of the study. These interviews were conducted to establish student preferences regarding different assessment formats that they had been exposed to in their 12 undergraduate mathematics course. Qualitative data in the form of student opinions will be summarised. 
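As background to the Rasch model referred to in the Chapter 3 outline above: in its standard dichotomous form (a well-known formulation in the psychometric literature, not this thesis's own notation), the model expresses the probability that person n answers item i correctly in terms of the difference between the person's ability and the item's difficulty. A minimal sketch in LaTeX:

    % Standard dichotomous Rasch model: person ability \theta_n, item difficulty \delta_i.
    \[
      P(X_{ni} = 1 \mid \theta_n, \delta_i)
        = \frac{e^{\theta_n - \delta_i}}{1 + e^{\theta_n - \delta_i}}
    \]
    % The probability is 0.5 when ability equals difficulty, rises above 0.5 when the
    % person's ability exceeds the item's difficulty, and falls below 0.5 otherwise.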
In Chapter 5, a set of seven mathematics assessment components, based on Niss’s (Niss, 1993) mathematics assessment components discussed in Chapter 2, will be proposed. Further background will be given on the confidence index, together with a description of other statistical parameters pertinent to this study. In this chapter, I attempt to develop a theoretical framework to form a way of measuring the qualities of a good mathematics question. In particular, three measuring criteria: discrimination index, confidence index and expert opinion, will be described. These three parameters are used for measuring the quality of a test item. A Quality Index (QI) model, based on the measuring criteria, is developed to measure the quality of a good mathematics question. The QI model will be used both to quantify and visualise the quality of a mathematics question. The theoretical framework forms the foundation against which we address the research question and subquestions of how we can measure how good a mathematics question is and which of the mathematics assessment components can be successfully assessed in the PRQ format, and which can be better assessed in the CRQ assessment format. Chapter 6 presents the quantitative research findings and results. In the quantitative data analysis methodology, an overview of the statistical procedures followed will be given. Both the traditional statistical analysis of the quantitative data and the Rasch (Rasch, 1960) method of data analysis is discussed under the methodology section. A description of the data follows in which details of the tests written, the number of PRQs per test, the number of CRQs per test and the number of students per test are summarised. A component analysis is presented within the different assessment components. In this analysis, examples of items, both PRQs and CRQs, together with a radar plot and a table summarising the quality parameters of each item, is presented. Finally an analysis of good quality items and poor quality items in each of the PRQ and CRQ assessment formats, in terms of the quality index developed in section 5.3.2, within each of the seven assessment components will be presented. 13 In Chapter 7, I set about discussing my research results. The discussion in this chapter will include the interpretation of the results and the implications for future research. I also discuss how the research results could have implications for assessment practices in undergraduate mathematics. Furthermore, I draw conclusions from my research about which of the mathematics assessment components, as defined in section 5.1, can be successfully assessed with respect to each of the two assessment formats, PRQ and CRQ. The Quality Index model will be used both to quantify and visualise the quality of a mathematics question. In this way, I endeavour to probe and clarify my research question and subquestions as stated in section 3.2. I will signal some limitations of my research study, as well as some pedagogical implications for further research. 14 CHAPTER 2: LITERATURE REVIEW In order to set the background for furthering research knowledge in the area of assessment in tertiary undergraduate mathematics, various documents on what other researchers have produced are reviewed. These will include preliminary sources i.e. hard-copy or electronic indices to the literature; primary sources i.e. reports of research studies written by those who conducted them; and secondary sources i.e. published reviews of particular bodies of literature. 
2.1 TERMINOLOGY Some technical clarification is necessary, as in this study the terms assessment, evaluation, tests and examinations shall be used frequently. According to Niss (1993) ‘assessment in mathematics education is taken to concern the judging of the mathematical capability, performance and achievement of students whether as individuals or in groups’ (p3). Assessment has been described as the heart of the student experience, the barometer of an educational system and the quality of teaching it provides (Luckett & Sutherland, 2000). Rowntree (1987) offers another definition, which emphasises the intimacy, subjectivity and professional judgement involved: Assessment in education can be thought of as occurring whenever one person, in some kind of interaction, direct or indirect, with another, is conscious of obtaining and interpreting information about the other person. To some extent or other it is an attempt to know that person. In this light, assessment can be seen as human encounter (p4). The following two definitions by the South African Qualifications Authority (SAQA) for the registration of South African qualifications reflect only one aspect of assessment, namely the process: 15 Assessment is about collecting evidence of learners’ work so that judgements about learners’ achievements, or non-achievements, can be made and decisions arrived at. Assessment is a structured process for gathering evidence and making judgements about an individual’s performance in relation to registered national standards and qualifications (SAQA, 2001, pp15, 16). Brown, Bull and Pendlebury (1997) provide a useful, working definition of assessment: ‘Assessment consists, essentially, of taking a sample of what students do, making inferences and estimating the worth of their actions’ (p8). Assessment is thus concerned with the outcomes of mathematics teaching at the student level. In its narrowest form, assessment seeks to measure the degree to which learning objectives have been met. In a broader context, it seeks to measure the achievement of graduate attributes (Groen, 2006). Evaluation in mathematics education on the other hand, is taken to be the judging of educational systems or instructional systems as far as mathematics teaching is concerned. These systems include curricula, programmes, teachers, teacher training, schools or school districts. Thus, evaluation addresses mathematics education at the systems level. According to Scriven (1991), evaluation refers to both the methods of gathering information from students and the use of that information to make a variety of judgements (p139). Romberg (1992, p10) describes evaluation as ‘a coat of many colours’. He emphasises that to assess student performance in mathematics, one should consider the kinds of judgements or evaluations that need to be made and consequently develop assessment procedures to address those judgements. We need to view tests as ‘assessments of enablement’ (Glaser, 1988, p40). In other words, rather than merely judging whether students have learned what was taught, we should ‘assess knowledge in terms of its constructive use for further learning’ (Wiggins, 1989, p706). 16 The word test originated from a testum, which was a porous cup determining the purity of metal. Later it came to stand for any procedures for determining the worth of a person’s effort. 
The root of the word assessment reminds us that an assessor (from ad + sedere) should sit with a learner in some sense to be sure that the student’s answer really means what it seems to mean. The implication of this is that assessment is primarily concerned with providing guidance and feedback to the learner. This is ultimately still the most important function of assessment. Tests and exams should be central experiences in learning, not just something to be done as quickly as possible after teaching has ended in order to produce a final grade (Steen, 1999). To let students show what they know and are able to do is a very different business from the all too conventional practice of counting students’ errors on questions. Such assessment practices do not welcome student input and feedback. Wiggins (1989) suggests that we think of students as apprentices who are required to produce quality work and are therefore assessed on their real performance and use of knowledge. For the purpose of this study, the term assessment will be used to refer to any procedure used to measure student learning. When tests and examinations are considered to be ways of judging student performance, they are forms of assessment. On the other hand, when the outcomes of tests and examinations are used as indicators of the quality of an educational system, then examinations and tests belong to the realm of evaluation. 2.2 THE CHANGING NATURE OF UNIVERSITY ASSESSMENT IN THE SOUTH AFRICAN CONTEXT In recent years, assessment has attracted increased attention from the international mathematics education community (MSEB, 1993; CMC and EQUALS, 1989). There are numerous reasons for this increase in attention, of which one seems to predominate. During the last couple of decades, the field of mathematics education has developed considerably in the area of outcomes and objectives, theory and practice (Hiebert & Carpenter, 1992; Niss, 1993; 17 Romberg, 1992; Schoenfeld, 2002; Stenmark, 1991). These developments have not, however, been matched by parallel developments in assessment. Consequently, an increasing mismatch and tension between the state of mathematics education and current assessment practices are materialising. Changing teaching without due attention to assessment is not sufficient (Brown, Bull & Pendlebury, 1997). Changes in educational assessment in universities are currently being called for - in its intent and in its methods. While much assessment still focuses on ranking students according to the knowledge that they gained in a subject or course, pressure for change has come in at least three forms (Nightingale, Te Wiata, Toohey, Ryan, Hughes & Magin, 1996). The first is a growing need to broaden university education and to develop – and consequently assess – a much broader range of student abilities. The second is the desire to harness the full power of assessment and feedback in support of learning. The third area arises from the belief that education should lead to a capacity for independent judgement and an ability to evaluate one’s own performance – and that these abilities can only be developed through involvement in the assessment process (Luckett & Sutherland, 2000). Assessment which requires the student only to regurgitate material obtained through lectures and required reading virtually forces the student to use a surface approach to learning that material. 
On the other hand, assessment which requires the student to apply knowledge gained on the course to the solution of novel problems, not previously seen by the student,… cannot be tackled without a deeper understanding (Entwistle, 1992, p39). If one adopts an outcomes-based approach to assessment (as is required by SAQA), then one is obliged to state quite explicitly to all stakeholders concerned what knowledge and skills or learning outcomes one is assessing i.e. the assessment criteria. Students’ performances are then assessed against these criteria. SAQA requires all qualifications to include critical outcomes, which consist of a list of general transferable skills that requires the learner to integrate knowledge, skills and attitudes while carrying out a task in a context of 18 application. This type of criterion-referenced assessment encourages links with teaching and learning. In contrast, in norm-referenced assessment, the criteria against which a student’s performance is compared with that of his or her peers remain implicit. Criterion-referencing tends to be more transparent because of its explicit statement of criteria. towards criterion-referencing. Currently, the trend in assessment is to move In criterion-referenced education, more time would be spent teaching and testing the student’s ability to understand and internalise the criteria of genuine competence (Wiggins, 1989). Criterion- referencing can help establish agreement amongst different assessors, which improves the reliability of the assessment. In order to implement criterion- referenced or outcomes-based assessment, it needs to be clear what the criteria are against which judgements will be made and what will count as evidence for meeting those criteria. The socio-economic and policy contexts in South Africa have posed enormous challenges for assessment practice in higher education. Contextual criteria have led to the introduction of new assessment policies relating to education and the accreditation of qualifications through a National Qualifications Framework (NQF) (see Chapter 1, p11). Below is an extract from the document entitled “Revisions to the Senate Policy on the assessment of student learning”, approved by the Senate of the University of the Witwatersrand, 2006, reflecting the changing nature of university assessment in the South African context. Assessment should be unbiased, fair, transparent, valid and reliable (noting that there is some tension between validity and reliability). Valid methods of assessment must be employed in order to sample the range of competencies required of a student graduating from this University, at all levels. In order to do this, depending on the purpose, the use of a variety of assessment forms and methods is recommended and may be carried out throughout the year. Assessment performance. should allow students to demonstrate optimal levels of Appropriate formats must be used for the valid testing of competencies and objectives, and adequate sampling with a variety of examiners over time will assist in reliably testing a variety of competencies. It is 19 acknowledged, however, that assessment is not an overriding aspect of teaching and learning, but is integral to it. 
Therefore the assessment of students should be designed to achieve the following purposes: ● To be an educational tool to teach appropriate skills and knowledge ● To encourage continuous learning and detect learning problems ● To determine whether students are meeting, or have met the educational aims and outcomes of a course (including qualifications exit-level outcomes where appropriate) and to give students continuous feedback on their progress ● To determine levels of competence and to inform students on their current competence ● To facilitate decisions relating to student progress ● To provide a measure of student ability for future employers ● To inform teachers about the quality of their instruction ● To allow evaluation of a course (p2). This policy is premised on the principles of promoting criterion referencing, which compares performance against specified criteria and encourages links with teaching and learning. There is a responsibility to provide criteria that make explicit the constructs of the teaching and to make these available and accessible to the students in as many different ways as possible. There is a need for flexibility and variety in assessment. The shift to criterion-referenced assessment would allow education to make sound judgements about the comparability of qualifications on the basis of scrutinising assessment criteria and the evidence required for their attainment. In tertiary education in South Africa, pressure to increase the student intake in higher education as well as to improve throughput has a particularly moral dimension. It implies the need to respond to the historical inequalities of the past, by making the higher education sector accessible to previously disadvantaged black and working class communities. This requires the system to be more open, flexible, transparent and responsive to the needs of under20 prepared, adult, lifelong and part-time learners (Harvey, 1993). This in turn, has implications for appropriate assessment practices in higher education. Such assessment practices would incorporate the use of alternative forms of assessment to provide more complete information about what students have learned and are able to do with their knowledge, and to provide more detailed and timely feedback to students about the quality of their learning. 2.3 ASSESSMENT MODELS IN MATHEMATICS EDUCATION An assessment model emerges from the different aspects of assessment: what we want to have happen to students in a mathematics course, different methods and purposes for assessment, along with some additional dimensions. The first dimension of this framework is WHAT to assess, which may be broken down into: concepts, skills, applications, attitudes and beliefs. Niss (1993) uses the term assessment mode to indicate a set of items in an assessment model that could be implemented in mathematics education. These items include the following: ● The subject of assessment i.e. who is assessed ● The objects of assessment i.e. what is assessed ● The items of assessment i.e. what kinds of output are assessed ● The occasions of assessment i.e. when does assessment take place ● The procedures and circumstances of assessment i.e. what happens, and who is expected to do what ● The judging and recording in assessment i.e. what is emphasised and what is recorded ● The reporting of assessment outcomes i.e. what is reported, to whom. For the purpose of this study, the focus will be on the objects of assessment in the Niss model outlined above i.e. 
types of mathematical content (including methods, internal and external relations) and which types of student ability to deal with that content. This varies greatly with the place, the teaching level and 21 the curriculum, but the predominant content objects assessed seem to be the following: [a] Mathematical facts, which include definitions, theorems, formulae, certain specific proofs and historical and biographical data. [b] Standard methods and techniques for obtaining mathematical results. These include qualitative or quantitative conclusions, solutions to problems and display of results. [c] Standard applications which include familiar, characteristic types of mathematical situations which can be treated by using well-defined mathematical tools. To a lesser extent, objects of assessment also include: [d] Heuristic and methods of proof as ways of generating mathematical results in non-routine contexts. [e] Problem solving of non-familiar, open-ended, complex problems. [f] Modelling of open-ended, real mathematical situations belonging to other subjects, using whatever mathematical tools at one’s disposal. In mathematics, we rarely encounter [g] Exploration and hypothesis generation as objects of assessment. With regards to the students’ ability to be assessed, the first three content objects require knowledge of facts, mastery of standard methods and techniques and performance of standard applications of mathematics, all in typical, familiar situations. As we proceed towards the content objects in the higher levels of Niss’s assessment model, the level of the students’ abilities to be assessed also increase in terms of cognitive difficulty. In the proof, problem-solving, modelling and hypothesis objects, students are assessed according to their abilities to activate or even create methods of proof; to solve open-ended, complex problems; to perform mathematical modelling of open-ended real situations and to explore situations and generate hypotheses. 22 In the Niss assessment model, objects [a] – [g] and the corresponding students’ abilities are widely considered to be essential representations of what mathematics and mathematical activity are really about. The first three objects in the list emphasise routine, low-level features of mathematical work, whereas the remaining objects are cognitively more demanding. Objects [a], [b] and [c] are fundamental instances of mathematical knowledge, insight and capability. Current assessment models in mathematics education are often restricted to dealing only with these first three objects. One of the reasons for this is that methods of assessment for assessing objects [a], [b] and [c] are easier to devise. In addition, the traditional assessment methods meet the requirement of validity and reliability in that there is no room for different assessors to seriously disagree on the judgement of a product or process performed by a given student. It is far more difficult to devise tools for assessing objects [d] – [g]. Inclusion of these higher-level objects into assessment models would bring new dimensions of validity into the assessment of mathematics. Webb and Romberg (1992) argue that if we assess only objects [a], [b] and [c] and continue to leave objects [d] – [g] outside the scope of assessment, we not only restrict ourselves to assessing a limited set of aspects of mathematics, but also contribute to actually creating a distorted and wrong impression of what mathematics really is (Niss, 1993). 
Traditional assessment models, have, in many cases, been responsible for hindering or slowing down curriculum reform. We should seek alternative assessment models in mathematics education which at the same time allow us to assess, in a valid and reliable way, the knowledge, insights, abilities and skills related to the understanding and mastering of mathematics in its essential aspects; provide assistance to the learner in monitoring and improving his/her acquisition of mathematical insight and power; assist the teacher to improve his/her teaching, guidance, supervision and counselling and to assist curriculum planners, authorities, textbook authors and in-service teacher trainers in shaping the framework for mathematical instruction, while also saving time. Alternative assessment models, such as the PRQ format, can reduce marking loads for 23 mathematical educators and assessors, and does provide immediate scores to students. 2.4 ASSESSMENT TAXONOMIES According to the World Book Dictionary (1990), a taxonomy is any classification or arrangement. Taxonomies are used to ensure that examinations contain a mix of questions to test skills and concepts. A leader in the use of a taxonomy for test construction and standardisation was Ralph W. Tyler, the “father of educational evaluation” (Romberg, 1992, p19) who in 1931 reported on his efforts to construct achievement tests for various university courses. He claimed to have found eight major types of objectives: ● Type 1: information ● Type 2: reasoning ● Type 3: location of relevant data ● Type 4: skills characteristic of particular subjects ● Type 5: standards of technical performance ● Type 6: reports ● Type 7: consistency in application of point of view ● Type 8: character (Tyler, 1931). At the time, Tyler neither linked these objectives to specific behaviour nor arranged the behaviour in order of complexity. By 1949, however, he had specified seven types of behavior: [a] understanding of important facts and principles [b] familiarity with dependable sources of information [c] ability to interpret data [d] ability to apply principles [e] ability to study and report results of study [f] broad and mature interests [g] social attitudes. 24 The next step was taken by Benjamin Bloom (1956), who organised the objectives into a taxonomy (dedicated to Tyler) that attempted to reflect the distinctions teachers make and to fit all school subjects. In Bloom’s Taxonomy of educational objectives, objectives were separated by domain (cognitive, affective and psychomotor), related to educational behaviours, and arranged in hierarchical order from simple to complex: ● Level 1: Knowledge ● Level 2: Comprehension ● Level 3: Application ● Level 4: Analysis ● Level 5: Synthesis ● Level 6: Evaluation. Bloom’s taxonomy has often been seen as fitting mathematics especially poorly (Romberg, Zarinnia & Collis, 1990). It is quite good for structuring assessment tasks, but Freeman and Lewis (1998) suggest that Bloom’s taxonomy is not helpful in identifying which levels of learning are involved. They, however, give an alternative which divides into headings not far removed from Bloom’s: ● Routines ● Diagnosis ● Strategy ● Interpretation ● Generation (Freeman & Lewis, 1998). As Ormell (1974) noted in a strong critique of the taxonomy, Bloom’s categories of behaviour “are extremely amorphous in relation to mathematics. 
They cut across the natural grain of the subject, and to try to implement them – at least at the level of the upper school – is a continuous exercise in arbitrary choice" (p7). All agree that Bloom's taxonomy has proven useful for low-level behaviours (knowledge, comprehension and application), but difficult to apply at the higher levels (analysis, synthesis and evaluation). One problem is that the taxonomy suggests that lower skills should be taught before higher skills. The fundamental problem is the taxonomy's failure to reflect current psychological thinking on cognition, and the fact that it is based on "the naive psychological principle that individual simple behaviours become integrated to form a more complex behaviour" (Collis, 1987, p3). Additional criticisms have questioned the validity of the distinction between cognitive and affective objectives, the independence of content from process and the meaning of objectives isolated from any context (Kilpatrick, 1993). Nevertheless, the view of mental abilities, and consequently of mathematical thinking and achievement, as organised in a linear, hierarchical way has been powerful in 20th century assessment practice. It has deep roots in our history and our psyches (Romberg et al., 1990). Since its publication, variants of Bloom's taxonomy for the cognitive domain have helped provide frameworks for the construction and analysis of many mathematics achievement tests (Begle & Wilson, 1970; Romberg et al., 1990).
Attacking behaviourism as the bane of school mathematics, Eisenberg (1975) criticised the merit of a task-analysis approach to curricula, because it essentially equates training with education, missing the heart and essence of mathematics. Expressing concern over the validity of learning hierarchies, he argued for a re-evaluation of the objectives of school mathematics. The goal of mathematics, at whatever level, is to teach students to think, to make them comfortable with problem solving, to help them question and formulate hypotheses, investigate and simply tinker with mathematics. In other words, the focus is turned inward to cognitive mechanisms.
Smith et al. (1996) propose a modification of Bloom's taxonomy called the MATH taxonomy (Mathematical Assessment Task Hierarchy) for the structuring of assessment tasks. The categories in the taxonomy are summarised in Table 2.1.
Table 2.1: MATH Taxonomy
Group A: Factual knowledge; Comprehension; Routine use of procedures
Group B: Information transfer; Applications in new situations
Group C: Justifying and interpreting; Implication, conjectures and comparisons; Evaluation
(Adapted from Smith et al., 1996)
In the MATH taxonomy, the categories of mathematics learning provide a schema through which the nature of examination questions in mathematics can be evaluated to ensure that there is a mix of questions that will enable students to show the quality of their learning at several levels. It is possible to use this taxonomy to classify a set of tasks ordered by the nature of the activity required to complete each task successfully, rather than in terms of difficulty. Activities that need only a surface approach to learning appear at one end, while those requiring a deeper approach appear at the other end. Previous studies have shown that many students enter tertiary institutions with a surface approach to learning mathematics (Ball, Stephenson, Smith, Wood, Coupland & Crawford, 1998) and that this affects their results at university.
There are many ways to encourage a shift to deep learning, including assessment, learning experiences, teaching methods and attitudinal changes. The MATH taxonomy addresses the issue of assessment and was developed to encourage a deep approach to learning. It transforms the notion that learning is related to what we as educators do to students, to how students understand a specific learning domain, how they perceive their learning situation and how they respond to this perception within examination conditions. The MATH taxonomy has eight categories, falling into three main groups. The first group, Group A, encompasses tasks which could be successfully done using a surface learning approach. Group A tasks include tasks which students will have been given in lectures or will have practised extensively in tutorials. In Group B tasks, students are required to apply their learning to new situations, or to present information in a new or different way. Group C encompasses the skills of justification, interpretation and evaluation. Tasks in both Groups B and C require a deeper learning approach for their successful completion. The categories of the taxonomy are context specific. For example, proving a theorem when the proof has been emphasised in class is a Group A task, while proving the same theorem ab initio is a Group C task. The taxonomy encourages us to think more about our attempts at constructing exercises. Whether we act consciously on this influence or simply make changes instinctively, it provides a useful check on whether we have tested all the skills, knowledge and abilities that we wish our students to demonstrate (Smith et al., 1996).
Recently, work on how the development of knowledge and understanding in a subject area occurs has led to changes in our view of assessing knowledge and understanding. For example, in his SOLO Taxonomy (Structure of the Observed Learning Outcome), Biggs (1991) proposed that as students work with unfamiliar material their understanding grows through five stages of ascending structural complexity:
Figure 2.1: SOLO Taxonomy
Prestructural: a stage characterised by the lack of any coherent grasp of the material; isolated facts or skill elements may be acquired.
Unistructural: a stage in which a single relevant aspect of the material or skill may be mastered.
Multistructural: a stage in which several relevant aspects of the material or skills are mastered separately.
Relational: a stage in which the several relevant aspects of the material or skills which have been mastered are integrated into a theoretical structure.
Extended Abstract: the stage of 'expertise' in which the material is mastered both within its integrated structure, and in relation to other knowledge domains, thus enabling the student to theorise about the domain.
(Adapted from Biggs, 1991)
The first three stages are concerned with the progressive growth of knowledge or skill in a quantitative sense, the last two with qualitative changes in the structure and nature of what is learned (Biggs, 1991, p12). According to Biggs (1991), at one end, knowledge and understanding are simple, unstructured and unsophisticated and of use as support for higher order abilities, while at the other end, they are complex, structured and provide the basis for expert performance. In the light of this opinion, Hughes and Magin (cited in Nightingale et al., 1996) regard assessment of isolated fragments of knowledge as appropriate at the earlier stages (perhaps the first two or three) of Biggs's scheme.
Only the assessment of higher order abilities would be appropriate at the later stages. With increased interest in the assessment of higher order abilities, other classifications to improve and assess learning have been developed. In a project at the Queensland University of Technology, a hierarchy of purposes for setting exercises was proposed to the faculty of a mathematics department. The aim of the project was to encourage faculty members to look more critically at their questions and to relate their questions to learning objectives. A classification according to the lecturer's purpose was conceived as a framework for enabling faculty members to think critically about writing questions and about the signals concerning learning that the questions were sending to their students. This classification according to the lecturer's purpose is described in Figure 2.2 (Hubbard, 1995).
Figure 2.2: Classification according to lecturer's purpose
1. To learn a formula, practise manipulation, become familiar with notation, state or prove a standard theorem.
2. Any purpose in 1, but set in a context which is mathematically irrelevant.
3. Apply theory to a problem for which a specific model has been provided, show how the model can be used in different situations.
4. Apply results to a new kind of problem, develop problem-solving strategies.
5. Prepare for a new concept, lead to the development of a concept or extend a concept.
6. Draw conclusions, generalise, make conjectures, reflect on results.
(Adapted from Hubbard, 1995)
In the Queensland project, it was then decided to separate the classifications in order to emphasise the different ways in which lecturer and student might view the questions. This resulted in the learning-required classification (Figure 2.3).
Figure 2.3: Learning-required classification
1. Recognition of key words and symbols which trigger memorised, standard procedures.
2. Some understanding of standard procedures so that they can be modified slightly for new situations.
3. Ability to explain and justify procedures and to form them into a coherent system.
4. Ability to synthesise mathematical experiences into strategies for problem solving.
(Adapted from Hubbard, 1995)
This learning-required classification is based on the classification of Crooks (1988), who regards it as a simplification of Bloom's taxonomy. However, Crooks' third category 'critical thinking or problem solving' is divided into two categories. These are essentially critical thinking and problem solving, but set in a mathematical context. When applying any taxonomy, the mathematical context is important, because learning objectives which are not subject-specific are more difficult for subject specialists to apply.
If we analyse the goals of mathematics education, different levels can be distinguished. A possible categorisation of them is described by Jan de Lange (1994). Because the assessment has to reflect education, these categories can be used both for the goals of mathematics education in general and for the assessment. De Lange (1994) represents the levels of understanding in the form of a pyramid as shown in Figure 2.4.
Figure 2.4: De Lange's levels of understanding (Adapted from De Lange, 1994)
The lower level
This level concerns the knowledge of objects, definitions, technical skills and standard algorithms. Some typical examples are:
● adding (easy) fractions
● solving a linear equation with one variable
● measuring an angle using a compass
● computing the mean of a given set of data.
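As a brief illustration (these are generic examples, not items drawn from De Lange (1994) or from the assessment instruments in this study), a typical lower-level item and its routine solution might look as follows:
\[
\text{Solve } 3x + 5 = 20: \qquad 3x = 15 \ \Rightarrow\ x = 5.
\]
\[
\text{Compute the mean of } \{4,\ 7,\ 10,\ 3\}: \qquad \bar{x} = \frac{4 + 7 + 10 + 3}{4} = \frac{24}{4} = 6.
\]
Both items can be completed by executing a single memorised procedure.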
According to De Lange's categorisation, most of traditional school mathematics and traditional tests seem to be at the lower level. One might think that a question at the lower level will be easier than a question at one of the other two levels. But this need not be the case. A question at the lower level can be a difficult one. The difference is that it does not demand much insight; it can be solved by using routine skills or even by rote learning.
The second level
The second level can be characterised by having students relate two or more concepts or procedures. Making connections, integration and problem solving are terms often used to describe this level. Also problems that offer different strategies for solving, or offer more than one approach to solve, are at this level. For questions at this level careful reading and some good reasoning are needed. There is quite a lot of information to read and students have to make decisions about their selection of strategies.
The third level
The highest level has to do with complex matters like mathematical thinking and reasoning, communication, critical attitude, creativity, interpretation, reflection, generalisation and mathematising. Students' own constructions are a major component of this level.
Assessing content knowledge and understanding, usually at the lower levels of any taxonomy, is often assumed to be far less problematic than assessing the higher order skills and abilities at the higher taxonomy levels. Academic staff have a long familiarity with conventional methods of assessing knowledge and understanding, and texts on how to assess knowledge have been in existence for many years (Ebel, 1972; Gronlund, 1976; Heywood, 1989; McIntosh, 1974). However, several researchers of student learning (Dahlgren, 1984; Marton & Saljö, 1984; Ramsden, 1984) have identified an alarming phenomenon whereby numerous students who have done well in examinations intended to test understanding have been found to still have fundamental misconceptions about basic underlying principles and concepts on which they were supposed to have been tested. Some of the most profoundly depressing research on learning in higher education has demonstrated that successful performance in examinations does not even indicate that students have a good grasp of the very concepts which staff members believed the examinations to be testing (Boud, 1990, p103).
In the interests of higher quality tertiary education, a deep approach to learning mathematics is to be valued over a surface approach (Smith et al., 1996). Students entering university with a surface approach to learning should be encouraged to progress to a deep approach. Studies have shown (Ball et al., 1998) that students who are able to adopt a deep approach to study tended to achieve at a higher level after a year of university study.
2.5.1 Diagnostic assessment The purpose of diagnostic assessment is to determine the learner’s strengths and weaknesses and to determine the learner’s prior knowledge (Geyser, 2004). Diagnostic assessment can also be used to determine whether a student is ready to be admitted to a particular learning program and to determine what remedial action may be required to enable a student to progress. 2.5.2 Formative assessment Boud in Geyser (2004) defines formative assessment as: …focused on learning from assessment. Formative assessment refers to assessment that takes place during the process of learning and teaching – it is day-to-day assessment. It is designed to support the teaching and learning 33 process and assists in the process of future learning. It feeds directly back into the teaching-learning cycle. The learner’s weaknesses and strengths are diagnosed and (immediate) feedback is provided. It helps in making decisions on the readiness of the learners to do summative assessment. It is developmental in nature, therefore credits of certificates are not awarded (SAQA, 2001, p93). According to Biggs (2000), the critical feature of formative assessment is the feedback that is given to the students. This feedback is aimed at improving the learning of the student as well as the teaching of the lecturer, motivating students, consolidating work done to date and provides a profile of what a student has learnt. All formative assessment is diagnostic to a certain degree. Diagnostic assessment is an expert and detailed enquiry into underlying difficulties, and can lead to radical re-appraisal of a learner’s needs, whereas formative assessment is more developmental in assessing problems with particular tasks, and can lead to short-term and local changes in the learning work of a learner. Formative learning provides a model for self-directed learning and hence for intellectual autonomy (Brown & Knight, 1994). Students are encouraged to be more autonomous in appraising their performances, learning to be more reflective and to take responsibility for their own learning. Because formative assessment is intended as the feedback needed to make learning more effective, it cannot simply be added as an extra to a curriculum. The feedback procedures, and more particularly their use in varying the teaching and learning programme, have to be built into the teaching plans, which thereby will become both more flexible and more complex. The integration of feedback into the curriculum is emphasised very strongly by Linn (1989): …the design of tests useful for the instructional decisions made in the classroom requires an integration of testing and instruction. It also requires a clear conception of the curriculum, the goals, and the process of instruction. And it 34 requires a theory of instruction and learning and a much better understanding of the cognitive processes of learners (p5). The quote shows how much needs to be done with our current assessment system. Astin (1991, p189) was certain that ‘the best principles of assessment and feedback are seldom followed or applied in the typical lower-division undergraduate course’. It seems that there is little scope for formative assessment because too many assessments (especially examinations) do not lead to feedback to the students. In addition, there is the problem of continuous assessments placing increased pressure on staff time with an increase in marking loads. There is also dissatisfaction with the quality of feedback which students often get. 
These problems are all compounded by the fact that undergraduate classes in tertiary mathematics are usually very large. Large student numbers not only place pressure on administration and marking loads, but also on the effectiveness and quality of feedback to the students. A major improvement in assessment systems would be to examine departmental policies for generating feedback to students. There is a shortage of research into the way that students use the feedback that they do get. The practice of formative assessment must be closely integrated with curriculum and pedagogy and is central to good quality teaching (Linn, 1989). 2.5.3 Summative assessment The term ‘summative’ implies an overview of previous learning. Summative assessment is used to grade students at the end of a unit, or to accredit at the end of a programme (Biggs, 2000). Summative assessment is used to provide judgement on students’ achievements in order to: ● establish a student’s level of achievement at the end of a programme ● grade, rank or certify students to proceed to or exit from the education system ● select students for further learning, employment, etc ● predict future performance in further study or in employment ● underwrite a ‘license to practise’ (Brown & Knight, 1994, p16). 35 The overview of previous learning involved in summative assessment could be obtained by an accumulation of evidence collected over time, or by test procedures applied at the end of the previous phase which covered the whole area of the previous learning. Beneath the key phrases here of ‘accumulation’ and ‘covered’, lies the problem of selecting that information which is most relevant for summative purposes. It is through summative assessment that educators exert their greatest power over their students. Because the purposes of assessment often remain vague and implicit, there is a danger that the different assessment purposes, i.e. summative, formative or diagnostic become confused and conflated and as a consequence, assessment often fails to play a truly educational role (Harlen & James, 1997). For example, an over-stretched lecturer may set a test for formative purposes and then, through lack of time and energy, decide to use the results for summative purposes. Not only is this kind of practice unfair to students, but it also undermines the developmental potential of assessment. Students are entitled to be informed beforehand how their assessment results will be used. A further consequence of confusing the different purposes of assessment is that lecturers sometimes assume that they can add up a series of formative assessment results (eg. classmarks) in order to make a summative judgement. In assessing students it is advisable to keep the formative and summative purposes separate. This is because the reliability concerns of summative assessment are far greater than they are for formative assessment and confusion of the two may result in unfair assessment practices. A common and legitimate practice is to use the evidence derived from formative assessment indirectly to inform professional judgements made about students in difficult summative circumstances. The cycle of formative and summative assessment as illustrated in Figure 2.5 (Makoni, 2000) suggests that rather than understanding the formative and summative purposes of assessment as dichotomous, we should view them as two ends of a continuum (Brown, 1999). 
Figure 2.5: Cycle of formative and summative assessment, linking the establishment of learning outcomes and a learning contract, the learning process, evidence gathering, interpreting and recording, formative assessment with feedback into the learning process, and summative assessment leading to certification. (Adapted from Luckett & Sutherland, 2000, p112)
2.5.4 Quality assurance
One further purpose of assessment needs to be mentioned, and that is how assessment contributes to institutional management. Summative (and to a lesser extent formative) assessment can also be used for quality assurance of the educational system. Here assessment is used to provide judgement on the educational system in order to:
● provide feedback to staff on the effectiveness of their teaching
● assess the extent to which the learning outcomes of a programme have been achieved
● evaluate the effectiveness of the learning environment
● monitor the quality of an education institution over time (Brown, Bull & Pendlebury, 1997; Yorke, 1988).
Although often neglected, this type of assessment is crucial. Erwin (1991, p119) said that "for the typical faculty [lecturer] or student affairs staff member, the major value of assessment is to improve existing programmes". The results of assessment and testing for accountability should be presented and communicated so that they can serve the improvement of educational institutions.
2.6 SHIFTS IN ASSESSMENT
There are tensions between the different purposes of assessment and testing, which are often difficult to resolve, and which involve choices of the best agencies to conduct assessments and of the optimum instruments and appropriate interpretations to serve each purpose. For example, if we are clear on the purpose of each assessment we design, then we will be in a position to make sound judgements about 'the what' and 'the how' of the assessment instrument. Finally, it is worth noting that assessment, together with face-to-face teaching, course design, course management and course evaluation, is part of the generic task of teaching. The phrase 'teaching, learning and assessment' often makes assessment look like an afterthought or at least a separate entity. In fact, teaching and feedback (formative assessment) merge, while assessment is an ongoing and necessary part of helping students to learn. Geyser (2004) summarises the paradigm shift that is currently under way in tertiary education as follows:
Traditionally, assessment has been almost entirely summative in nature, with a final examination and the educator as the sole and unconditional judge. Traditional assessments have often targeted a learner's ability to demonstrate the acquisition of knowledge (that is, achievement), but new methods are needed to measure a learner's level of understanding within a content area and the organization of the learner's cognitive structure (that is, learning). The main shift in focus can be summarised as a shift away from assessment as an add-on experience at the end of learning, to assessment that encourages and supports deep learning. It is now important to distinguish between learning for assessment and learning from assessment as two complementary purposes of assessment (p90).
This shift means that we need to move away from assessing how well students can reproduce content knowledge, towards a situation where we learn how to assess the integration and application of knowledge, skills, and maybe even attitudes, in unfamiliar as well as familiar contexts.
Taking this idea one step further, Luckett and Sutherland (2000) are of the opinion that: Conventional ways of assessing students such as the unseen three hour exam, are no longer adequate to meet these demands. We can no longer justify testing again and again the same restricted range of skills and abilities; we can no longer get away with simply requiring students to write about performance, instead of getting them to perform in authentic contexts (p201). New trends in assessment in higher education demand that we begin to assess generic and applied competencies as well as traditional knowledge bases. Hence the need to collect evidence, via assessment, that shows how well (or badly, or if at all) our students have been able to understand, integrate and apply the knowledge, skills and values specified in our course outcomes. A shift in assessment is related to a shift between the types of assessment discussed in section 2.5. We will have to be innovative and try out a range of new assessment approaches and methods, ensuring that we do indeed assess all of our intended learning outcomes and that our assessments add value to students’ learning. Assessment will be seen as natural and helpful, rather than threatening and sometimes a distraction from real learning as in traditional models (Jessup, 1991, p136). 2.7 ASSESSMENT APPROACHES Assessment approaches work best where learning outcomes have been articulated in advance, shared with students and assessment criteria agreed. Questions about the purpose of assessment arise, especially questions related to formative as opposed to summative purposes. Assessment approaches which are integrated into a course, not ‘bolted-on’ are desirable – this implies both staff and curriculum development. 39 Before going on to describe alternative question formats, I will briefly outline a range of assessment approaches which are important to think about prior to selecting a specific method and designing a specific instrument. A number of different methods may be appropriate to any one approach, or combination of approaches, depending on one’s purpose, learning outcomes and teaching and learning context. 2.7.1 The traditional approach In the traditional approach it is taken for granted that assessment follows teaching and that the aim of assessment is to discover how much has been learned. Here the lecturer or examiner is usually considered to be the only legitimate assessor. Students are assessed strictly as individuals in competition with each other in a highly controlled environment and strict measures to avoid cheating are employed. Learning is viewed quantitatively in terms of the amount of teaching which has been absorbed. There is little interest in the specifics of which questions has been correctly answered. Common methods used in this approach include examinations, essays, pen-and paper tests and reports. Literature review has revealed that more recently certain interesting alternative approaches to assessment in undergraduate mathematics have been explored (Cretchley & Harman, 2001; Anguelov, Engelbrecht & Harding, 2001; Hubbard, 2001; Wood & Smith, 2001). In the overview of approaches that follow, innovative variations will be discussed. 2.7.2 Computer-based (online) assessment In an age of increasing access to computers and to university education, new technologies have become an exciting medium for the delivery and assessment of courses at the tertiary level. 
40 There can be no doubt that increasing technological support for much that had to be done by hand, will not only impact on the way we do mathematics, but even determine the very nature of some of the mathematics that we do (Cretchley & Harman, 2001, p160). Engelbrecht and Harding (2004) found that ‘many teachers of mathematics still shy away from granting technology the same significant role in the assessment process’ (p218). The following statement by Smith (as cited in Anguelov, Engelbrecht and Harding, 2001) is very descriptive with regard to the motives for technological forms of assessment: Courses in mathematics that ignore the impact of technology on present and future practices of science, engineering and mathematics perpetrate a fraud upon our students. Technology should be used not because it is seductive, but because it can enhance mathematical learning by extending each student’s mathematical power. Calculators and computers are not substitutes for hard work, but challenging tools to be used for productive ends (p190). The use of computers in assessment can solve the problem of providing detailed, individualised feedback to large student numbers. This approach is often based on a mastery learning model, in which students receive immediate feedback and can repeat or progress at their own pace. In a study conducted by Senk, Beckmann and Thompson (1997), teachers pointed out that technology allowed them to deal with situations that would have involved tedious calculations if no technology had been available. They explained that “not-sonice”, “nasty”, or “awkward” numbers arise from the need to find the slope of a line, the volume of a silo, the future value of an investment or the 10th root of a complex number. Additionally, some teachers of Algebra II classes noted how technology influenced them to ask new types of questions, how it influenced the production of assessment instruments and how it raised questions about the accuracy of results (Senk, Beckmann & Thompson, 1997, p206). 41 I think you have to ask different kinds of things… When we did trigonometry, you just can’t ask them to graph y = 2 sin x or something like that. Because their calculator can do that for them… I do a lot of going the other way around. I do the graph, and they write the equation… The thing I think of most that has changed is just the topic of trigonometry in general. It’s a lot more application type things…given some situation, an application that would be modeled by a trigonometric equation or something like that [Ms. P]. I use it [the computer] to create the papers, and I can do more things with it…not just hand-sketched things. I can pull in a nice polynomial graph from Mathematica, put it on the page, and ask them questions about it. So, in the way, it’s had a dramatic effect on me personally… We did talk about problems with technology. Sometimes it doesn’t tell you the whole story. And sometimes it fails to show you the right graph. If you do the tangent graph on the TI-81, you see the asymptotes first. You know, that’s really an error. It’s not the asymptote [Mr. M]. The role of information technology in educational assessment has been growing rapidly (Barak & Rafaeli, 2004; Beichner, 1994; Hamilton, 2000). The high speed and large storage capacities of today’s computers makes computerised testing a promising alternative to paper-and-pencil measures. Assessment tasks should include life-like, authentic or situated activities (Cumming & Maxwell, 1999). 
For many disciplines, including mathematics, computer technology can be seen as part of such a context (Groen, 2006). Web-based testing systems offer the advantages of computer-based testing delivered over the Internet. The possibility of conducting an examination where time and pace are not limited, but can still be controlled and measured, is one of the major advantages of web-based testing systems (Barak & Rafaeli, 2004; Engelbrecht & Harding, 2004). Other advantages include the easy accessibility of online knowledge databases and the inclusion of rich multimedia and interactive features such as colour, sound, video and simulations. Computer-based online assessment systems offer considerable scope for innovations in testing and assessment as well as a significant improvement of the process for all its stakeholders, including teachers, students and administrators (McDonald, 2002). In a web-based study conducted by Barak and Rafaeli (2004), MBA students carried out an online Question-Posing Assignment (QPA) that consisted of two components: Knowledge Development and Knowledge Contribution. The students also performed self- and peer-assessment and took an online examination. Findings indicated that those students who were highly engaged in online question-posing and peer-assessment activity received higher scores on their final examination than their peers. The results provide evidence that web-based activities can serve as both learning and assessment enhancers in higher education by promoting active learning, constructive criticism and knowledge sharing. Online assessment holds promise for educational benefits and for improving the way achievement is measured.
Computer technology has come to play central roles in both learning objectives and the instructional environment in tertiary mathematics. While the use of online assessment may seem a logical progression in this regard, it is perhaps not as widely used as it could be. Online assessment can be a valuable investment with efficiencies in marking, administration and resource use (Engelbrecht & Harding, 2004; Greenwood, McBride, Morrison, Cowan & Lee, 2000; Lawson, 1999). In a study conducted by Groen (2006) in the Department of Mathematical Sciences, University of Technology, Sydney, Australia, it was found that marking of computer-based tests was no more time-consuming than marking a paper-based test. Feedback was individualised, easy to supply and immediately accessible to students. Further, copying appeared no more or less possible than for a paper test. In addition, question item banks provided a valuable record of the components of assessment and a library of questions. Appropriate design of online assessment tasks and support activities can also foster other positive learning outcomes, including competence in the use of written and electronic communication, critical thought, reasoned arguments, problem solving and information management, as well as the ability to work collaboratively. Further, online assessment offers an authentic environment in which to assess the computer laboratory skills that feature strongly in many mathematics subjects and in professional practice (Groen, 2006).
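To make the mechanics of immediate, individualised feedback concrete, the sketch below is an illustrative example only; it is not code from any of the systems cited above, and the names used (NumericItem, mark_response) are hypothetical. It shows the minimal logic by which an online numeric item can be marked automatically and feedback returned to the student at once.

```python
# Illustrative sketch only: a minimal automatic marker for a numeric online item.
# Names and structure are hypothetical, not taken from the systems cited in the text.
from dataclasses import dataclass


@dataclass
class NumericItem:
    prompt: str              # the question shown to the student
    key: float               # the correct numeric answer
    tolerance: float         # accepted deviation from the key
    feedback_correct: str    # message shown for a correct response
    feedback_incorrect: str  # message shown for an incorrect response


def mark_response(item: NumericItem, response: float) -> tuple:
    """Return (score, feedback) for a single student's numeric response."""
    if abs(response - item.key) <= item.tolerance:
        return 1, item.feedback_correct
    return 0, item.feedback_incorrect


if __name__ == "__main__":
    item = NumericItem(
        prompt="Compute the mean of the data set {4, 7, 10, 3}.",
        key=6.0,
        tolerance=0.01,
        feedback_correct="Correct: the mean is the sum (24) divided by the number of values (4).",
        feedback_incorrect="Check that you divided the sum of the values by how many values there are.",
    )
    score, feedback = mark_response(item, response=6.0)
    print(score, feedback)  # immediate, individualised feedback
```

In practice such a routine would sit behind a web-based testing system and be run for every submission, which is what makes the individualised feedback to large classes described above feasible.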
2.7.3 Workplace- and community-based/learnership assessment
Where workplace- and community-based learning and assessment are used, as is the case with nursing, social work, teaching and tailor-made programmes, employers are increasingly involved in assessment issues, often coming to realise how complex and costly they can be. The workplace- and community-based learnership assessment approach gives students an opportunity to apply their knowledge and skills in a real-world context and to learn experientially. This approach is considered highly beneficial for the development of professional skills and competences as opposed to the learning of knowledge and theory in isolation from context or application. Typically, in such approaches, supervisors or mentors assess performances, but students are also required to submit a written report or portfolio to their lecturer (Brown & Knight, 1994).
2.7.4 Integrated or authentic assessment
Concerns about validity heralded the new era in assessment dating from the 1960s to the present. From the beginning of the historical record to the nineteenth century, measurement in education was quite crude. During the nineteenth century, educational measurement began to assimilate, from various sources, the ideas and the scientific and statistical techniques which were later to result in the psychometric testing period, dating from about 1900 to the 1960s. Dating from the 1960s to the present is the policy-programme evaluation period. Tyler's model of evaluation in education prevailed until the 1970s, when his approach was found inadequate as a guide for policy and practice. The earliest signs of the new era in assessment were small shifts away from norm-referenced towards criterion-referenced assessment. The standardised norm-referenced test based on behaviourism assesses whether one knows isolated pieces of knowledge. Such a test asks students to respond to a variety of questions about specific parts of mathematics, some of which the student knows and some not. Responses are processed by summing the number of correct responses to indicate how many parts of mathematical knowledge a student possesses, and the totals for an individual student are compared to those of other students. Criterion-referenced assessment is also based on behaviourism (Niss, 1993). However, criterion-referenced assessment establishes standards (criteria) for specific grades or for passing or failing. So a student who meets the criteria gets the specified result. Competency standards may be used as the basis of criterion-referenced assessment. Mastery learning is another example: students must demonstrate a certain level of achievement or they cannot continue to the next stage of a subject or programme of study. The goal is for everyone to meet an established standard. The problem with both approaches is that neither yields information about the inter-relationships among the parts of knowledge held by a student. Both approaches can reinforce the idea that mere right answers are adequate signs of achievement. What is required is authentic assessment: 'contextualised complex intellectual challenges, not fragmented and static bits or tasks' (Wiggins, 1989, p711). Authentic assessment (Lajoie, 1991), based on constructivist notions, begins with complex tasks which students are expected to work on for some period of time. Their responses are not just answers; instead they are arguments which describe conjectures, strategies and justifications.
Integrated assessment calls on the students to demonstrate that they are: …able to pull together and integrate the different bits of information, skills and attitudes that they have developed from across a [whole qualification] as a whole. Integrated assessment therefore involves the design and judgement of learner performances that can be used as evidence from which to infer capability (the integration of theory and practice) and to demonstrate that the purposes of a programme as a whole has been achieved (Luckett & Sutherland in Makoni, 2000, p111). An authentic test not only reveals student achievement to the examiner, but also reveals to the test-taker the actual challenges and standards of the field (Wiggins, 1989). To design an authentic test, we must first decide what the 45 actual performances are that we want students to be good at. Authentic assessments can be developed by determining the degree to which each student has grown in his or her ability to solve non-routine problems, to communicate, to reason and to see the applicability of mathematical ideas to a variety of related problem situations (Niss, 1993). In other words, authentic assessment tasks call on students to demonstrate the kind of skills that they will need to have in the ‘real world’. Baron and Boschee (1995) argue that authentic assessment relates to assessing complex performances and higher-order skills in real-life contexts: Authentic assessment is contextualised, involves complex intellectual changes, and does not involve fragmented and static bits or tasks. The learner is required to perform real-life tasks (p25). Authentic assessment is performance-based, realistic and set within contexts that students will encounter beyond the educational setting. Learning is multidimensional and integrated. Integrated assessment is needed to ensure that students can bring together and integrate all the knowledge, skills and attitudes they have gleaned from a programme as a whole. Outcomesbased education requires integrated assessment of competence, which is described as consisting of three dimensions: ● knowledge/foundational competence – knowing and understanding what and why ● skills/practical competence – knowing how, decision making ability; and ● attitudes and values/reflexive competence – the ability to learn and adapt through self-reflection and to apply knowledge appropriately and responsibly (Luckett & Sutherland, 2000, p111). Reflexive competence is the ability to integrate performance and decision making with understanding and with the ability to adapt to change and unforeseen circumstances, and to explain the reasons behind these adaptations. 46 Authentic or integrated assessment is particularly appropriate for professional and applied courses. It should be used throughout the curriculum, particularly at the degree exit level. It may also be used at modular level in order to ensure that the specific learning outcomes listed in course outlines are achieved holistically. A scaffolded research project in the discipline is the primary vehicle for this to happen. This could integrate skills from across various disciplines. Diagrammatically, this can be represented as: Figure 2.6: Integrated assessment. 
The figure shows integrated assessment of knowledge in use as the convergence of knowledge/foundational competence (knowing and understanding what and why), skills/practical competence (knowing how, decision-making ability) and attitudes and values/reflexive competence (the ability to learn and adapt through self-reflection and to apply knowledge appropriately and responsibly). (Adapted from Luckett & Sutherland, 2000, p111)
The controversy about this sort of assessment is centred primarily on its reliability. For assessment to be reliable, it should yield the same results if it is repeated, or different markers should make the same judgements about students' achievements. Because integrated assessment involves a complex task with many variables, the judgement of the overall quality of the performance is more likely to be open to interpretation than an assessment of a simpler task. In a truly authentic and criterion-referenced education, more time would be spent teaching and testing the student's ability to understand and internalise the criteria of genuine competence than in a norm-referenced situation. In higher education, this does not necessarily mean a shift to more external forms of assessment, but it will mean that the unquestioned relationship between a course and the assessment 'which forms part of it' will be open to critical scrutiny from an outcomes-oriented perspective. The positive aspect is that assessment will be related to outcomes in a discipline which can be publicly justified to colleagues, to students and to external bodies. We are now seeing moves to a holistic conception: no longer can we think of assessment merely as the sum of its parts; we need to look at the impact of the total package of learning and assessment (Knight, 1995). The assessment challenge we face in mathematics education is to give up old, traditional assessment methods for determining what students know, which are based on behavioural theories of learning, and to develop integrated or authentic assessment procedures that reflect current epistemological beliefs about what it means to know mathematics and how students come to know.
2.7.5 Continuous assessment
Continuous assessment takes place concurrently with, and is often integrated into, the teaching/learning unit at issue. This approach involves assessing students regularly in a manner that integrates teaching and assessment; it uses feedback from each assessment to inform further teaching and the construction of the next assessment. It is usually formative and developmental in purpose, using a range of assessment methods in which the lecturer is not always the sole judge of quality. Its primary purpose is to inform students (and their parents) about their performance so as to help them control and adjust their learning activity. An almost equally important purpose is to inform the teacher about the outcome of his/her teaching in general in order to adjust it if desirable – and specifically in relation to the individual student in order to advise and influence his/her actual or potential association with mathematics. Continuous assessment suggests a cyclical process through which a multi-faceted, holistic understanding of the learner can be developed. If used summatively, continuous assessment should involve summing up the evidence about a learner through the exercise of professional judgement. It should not simply mean adding up a series of test marks that are all given equal weight (Luckett & Sutherland, 2000).
48 2.7.6 Group-based assessment This approach recognises that all learning takes place in a social context and that professional identity is best developed through interaction with a community of professionals. In this approach, students are required to work in teams. They may be assessed as a group or individually. This approach allows one to assess the learning process as well as its product. In group-based assessment, the assessor relies on peer-assessment to tap into attitudes and skills such as accountability, effort and teamwork. A typical approach is to calculate the final mark as the sum of a peer mark for process and a group mark for product. Peers allocate a mark to each individual in the group for process skills and the lecturer allocates a group mark for the learning product (Luckett & Sutherland, 2000). 2.7.7 Self-assessment Assessment systems that require students to use higher-order thinking skills such as developing, analysing and solving problems instead of memorising facts are important for the learning outcomes (Zohar & Dori, 2002). Two of these higher-order skills are reflection on one’s own performance – self-assessment, and consideration of peers’ accomplishments – peer assessment (Birenbaum & Dochy, 1996; Sluijsmans, Moerkerke, van-Merrienboer & Dochy, 2001). Both self- and peer-assessment seem to be underrepresented in contemporary higher education, despite their rapid implementation at all other levels of education (Williams, 1992). Larisey (1994) suggested that the adult student should be given opportunities for self-directed learning and critical reflection in order to mirror the world of learning beyond formal education. In the self-assessment approach students are invited to assess themselves against a set of given or negotiated criteria, usually for formative purposes but sometimes also for summative purposes. The aim of this type of assessment is to provide students with opportunities to develop the skills of thoughtful, critical 49 self-reflection. Self-assessment gives students a greater ownership of the learning they are undertaking. Assessment is not then a process done to them, but is a participative process in which they are themselves involved. This in turn tends to motivate students, who feel they have a greater investment in what they are doing. Self-assessment can be a central aspect of the development of lifelong learning and professional competence, particularly if students are involved in the generation and development of the assessment criteria and are required to justify the marks they give themselves (Boud, 1995). Self-assessment has proved to be an excellent means of getting students to take responsibility for their own learning and to become more reflective and effective learners (Luckett & Sutherland, 2000). Boud (1995) developed this further by arguing that traditional assessment practices neither matched the world of work, nor encouraged effective learning. “Self-assessment”, he argued, “is fundamental to all aspects of learning. Learning is an active endeavour and thus it is only the learner who can learn and implement decisions about his or her own learning: all other forms of assessment are therefore subordinate to it” (Boud, 1995, p109). On graduation, students will be expected to practice self-evaluation in every area of their lives, and it is a good exercise in self-development to ensure that these abilities are extended (Brown & Knight, 1994). 
The goal of self- assessment is to promote the reflective student, one who has a degree of independence and who is therefore well placed to be a lifelong learner. 2.7.8 Peer-assessment In peer-assessment students are involved in assessing their peers using a wide range of assessment methods, always under the guidance of the lecturer. The lecturer acts more as an external examiner, checking for reliability and is ultimately responsible for the final allocation of marks. 50 Criterion-referenced assessment makes this approach possible: the explaining, discussing and even negotiating of the assessment criteria and what will count as evidence for their attainment can be an extremely valuable learning experience for students. Using peer-assessment makes the process much more one of learning, because learners are able to share with one another the experiences that they have undertaken. For peer-assessment, ideas can be interchanged and effective learning will take place (Luckett & Sutherland, 2000). Experiencing peer-assessment seems to motivate deeper learning and produces better learning outcomes (Williams, 1992). Peer-assessment can deepen students understanding of the subject, develop their evaluative and reflective skills and their groupwork and task management skills. Peer- assessment is probably the best means of assessing how individual students work in teams. Given the importance which employers put upon the ability to work as part of a team, it is important that learners in higher education are exposed to situations which require them to respond sensitively and perceptively to peers’ work. Through peer-assessment students would be learning, which is, as we repeatedly argue, the main purpose of assessment (Brown & Knight, 1994, p60). 2.8 QUESTION FORMATS New forms of assessment and question formats are not goals in and of themselves. The major rationale for diversifying mathematics assessment is the value that the diversification has as a tool for the improvement of our teaching and the students’ learning of mathematics. Lynn Steen in Everybody Counts (Mathematical Sciences Education Board, 1989, p57) makes the point that ‘skills are to mathematics what scales are to music or spelling is to writing. The objective of learning is to write, to play music, or to solve problems – not just to master skills’. As assessment policies change, so too must our assessment practices and instruments. Mathematics tests cannot only be vehicles used to assess the memorisation and regurgitation of rote skills. Assessment driven by 51 problems and applications will naturally subsume the more routine skills at the lower levels of thinking. Again from Everybody Counts, we know that: Students construct meaning as they learn mathematics. They use what they are taught to modify their prior beliefs and behaviour, not simply to record the story that they are told. It is students’ acts of construction and invention that build their mathematical power and enable them to solve problems they have never seen before (p59). Today’s needs demand multiple methods of assessment, integrally connected to instruction, that diagnose, inform and empower both teachers and students. 2.9 CONSTRUCTED RESPONSE QUESTIONS AND PROVIDED RESPONSE QUESTIONS Questions used for assessment can be classified into two broad categories – Constructed Response Questions (CRQs) where students have to construct their own response and Provided Response Questions (PRQs) where the student has to choose between a selection of given responses. 
This terminology was introduced by Engelbrecht and Harding in 2003. In a constructed response format, the student produces a product such as a case study report or lab study, engages in a process or performance such as a social work interview or a musical performance, or exhibits a personal trait such as some leadership ability (Engelbrecht & Harding, 2003; Haladyna, 1999). In mathematics, CRQs or free-response items (Braswell & Jackson, 1995) include questions in open-ended format (Bridgeman, 1992), essays, projects, short answer questions (paper-based or online), portfolios and paper-based or online assignments. Communication in mathematics has become important as we move into an era of a thinking curriculum (Stenmark, 1991). In a constructed response format, writing in mathematics becomes vital. Mathematics writing may take on many forms. It may be a separate activity, or may be part of a larger project. Journals, reports of investigations, explanations of the processes used in solving a problem, portfolios or responses to CRQs all become part of what students do daily in the mathematics class as well as what is reviewed for 52 assessment purposes. The traditional three-hour, unseen constructed response examination constitutes an important component of any undergraduate mathematic assessment programme. However, where clear criteria are absent, the marking of such examinations for summative purposes is unreliable (Luckett & Sutherland, 2000) and time-consuming. Methods of assessment within the examination framework can be varied to assess a wider range of cognitive skills and to achieve higher levels of reliability. For example, short answer questions are easier to mark reliably, can be designed to test a wide range of knowledge and are not that time consuming to mark; assignments in which students are given a specified period to deliver a product are closer to real-world conditions and allow more time for thought; open-book examinations and tests are also more authentic and assess what students can do with information. Examinations can be used as opportunities for problem-solving if an unseen exam question is, for example, linked to case studies that require students to apply the material that they have had to prepare for the examination to different situations (Hounsell, McCulloch & Scott, 1996, p115). In a provided response or fixed-response format (Ebel & Frisbie, 1986; Osterlind, 1998; Wesman, 1971), the student chooses among available alternatives. PRQs include multiple choice questions (MCQs), multiple- response questions, matching questions, true/false questions, best answers and completing statements. A true/false question can be classified as a particular type of two option multiple choice. Matching questions, in which students are asked to match items, can be designed to test knowledge and reasoning. In the ‘complete the statement’ type of PRQ, the student is given an incomplete statement. He/she must then select the choice that will make the completed statement correct. PRQs are sometimes referred to as objective tests, and such tests, far from diminishing the curriculum or distorting teaching, enable teachers to diagnose learners’ difficulties and individualise their instruction (Kilpatrick, 1993). Others argue that objective tests have driven other forms of assessment out of academic institutions, trivialised learning and warped instruction (Resnick, 1987; Romberg et al., 1990). 
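To make the distinction between the two formats concrete, consider the same differentiation task posed in each format (a constructed illustration, not an item from the assessment instruments analysed in this study):

CRQ version: Differentiate f(x) = x² sin x and simplify your answer.

PRQ (multiple choice) version: The derivative of f(x) = x² sin x is
(a) 2x sin x   (b) x² cos x   (c) 2x sin x + x² cos x   (d) 2x cos x

In the CRQ version the student must construct the product-rule calculation; in the PRQ version the same knowledge is probed, but the working can be replaced by recognising or eliminating alternatives, with distractors (a) and (b) aimed at students who apply only part of the product rule. The correct option is (c).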
A common concern is that the use of PRQs encourages rote learning and memorising of discrete bits of information, rather 53 than developing an overall deeper understanding of the topic. Many examples exist of PRQs, however, that emphasise understanding of important mathematical ideas and generally involve integrating more than one mathematical concept (Gibbs, Habeshaw & Habeshaw, 1988; Lawson, 1999; Johnstone & Ambusaidi, 2001; Smith et al., 1996). This discussion will be expanded on in subsequent sections. In a study conducted by Engelbrecht and Harding (2003), it is reported that students at the University of Pretoria performed better in online PRQs than in online CRQs, on average, and better in paper CRQs than in online CRQs. It was thus recommended that it is important to use a combination of question types when setting an online paper. In contrast to paper CRQs, online CRQs also mostly have the problem of little or no partial credit. Various strategies have been developed to adapt PRQs to give credit for partial knowledge (Friel & Johnstone, 1978), to reduce the effect of guessing (Harper, 2003) and to find indications of reasoning paths of students. CRQs offer at least three major advantages over PRQs. Firstly, they reduce measurement error by eliminating random guessing. Secondly, they allow for partial credit for partial knowledge and thirdly, problems cannot be solved by working backwards from the answer choices. Because this last advantage makes test items more like the kind of problems students must solve in their academic work, this enhances the face validity of the test. A review by Traub and Rowley (1991) suggests that there is evidence that some free-response essay tests measure different abilities from those measured by fixed-response tests, but that when the free response is a number or a few words, format differences may be inconsequential. Another study that focused on mathematical reasoning (Traub & Fisher, 1977) found that there was no evidence that provided response and constructed response mathematics tests measured different traits in eighth-grade students. Martinez (1991) found that constructed response versions of questions that relied on figural and graphical material were more reliable and discriminating than parallel provided response questions. Bridgeman (1992) found that at the level of the individual item, there 54 were striking differences between the constructed response format and the provided response format. Format effects appeared to be particularly large when the PRQs were not an accurate reflection of the errors actually made by students. In the analysis of the individual items, 71% of the examinees answered the easiest item correctly in the constructed-response format, while 92% got it correct in the multiple choice format. According to Bridgeman (1992), this is caused not only by the opportunity to guess, but also by the implicit corrective feedback that is part of the multiple choice format. In other words, if the answer computed by the examinee is not among the answer choices in a multiple choice format, the examinee knows that an error was made and may try a different strategy to compute the correct answer. Such feedback may reduce trivial computational errors. However, despite the impact of format differences at the item level, total test scores in the constructed response and provided response formats appeared to be comparable. 
Both formats ranked the relative abilities of students in the same order, gender and ethnic differences were neither lessened nor exaggerated, and correlations with other test scores and college grades were about the same. Bridgeman (1992) reminds us that tests do more than assign numbers to people; they also help to determine what students and teachers perceive as important: Test preparation for an examination with an open-ended answer format would have to emphasize techniques for computing the correct answer, not methods for selecting among five answer choices. Thus, with the grid-in format, coaching and test preparation should become synonymous with sound instructional strategies that are designed to foster understanding of basic mathematical concepts. Ultimately, the decision to accept or reject open-ended answer formats may rest as much on these non-psychometric considerations as on any small differences in test reliability or validity (Bridgeman, 1992, p271). Assessment for broader educational and societal uses calls for tests that are comprehensive in breadth and depth. Both breadth and depth can be covered by including a large number of questions and a variety of question formats, such as CRQs and PRQs, including the multiple choice format. Both open-ended and fixed-response assessment formats have a place in ensuring that assessment remains open and congenial to all students (Engelbrecht & Harding, 2004).

2.10 MULTIPLE CHOICE QUESTIONS

The multiple choice test, first introduced in 1915, was derived from the tradition of intelligence testing. Intelligence tests, which were to influence the construction of numerous subsequent tests, put mental ability on a scale from low to high. Tasks were arranged in increasing order of difficulty, and the examinee received a score based on the point at which successful performance began to be outweighed by unsuccessful performance. Intelligence tests were instituted in many societies to meet the need for selection into specialist or privileged occupations. One of the first uses of multiple choice testing was to assess the capabilities of World War I military recruits. Criticisms of multiple choice testing became prominent in the 1960s, notably with the publication of Hoffman's (1962) The Tyranny of Testing. The strongest criticisms arose from the growing body of research into effective learning (Gifford & O'Connor, 1992), whose evidence indicated that learning is a complex process which cannot be reduced to a routine of selecting small components (Black, 1998). The multiple choice test was nevertheless justified by the prevailing emphasis on managing learning through the specification of behavioural objectives. These objective tests provided an economical and defensible way of meeting the social needs of an expanding society (Black, 1998). The importance and nature of the function of objective testing changed as societies evolved, from serving education for a small elite, through working with the larger numbers and wider aspirations of a middle class, to dealing with the needs and problems of education for all. Multiple choice questions (MCQs) have been the most developed of all objective tests. They are applicable to a wide range of disciplines, and there is a long history of their use in medicine (Freeman & Byrne, 1976). In undergraduate education, they are generally used within formal examination settings in which a large number of questions are set, and they tend to be used in classes where enrolment numbers are large.
MCQs are attractive to those looking for a faster way of assessing students because of their ease of marking (Hibberd, 1996). MCQs are easy to mark by hand, by template or by computer, either through optically marked response sheets or directly online. This means that rapid feedback can be given to students; it also gives lecturers better records of what students do and do not know, which makes it easier to identify the major areas needing attention. Many variations of the multiple choice form have been used. Wesman (1971) defines the following eight types: the correct answer variety, the best answer variety, the multiple response variety, the incomplete statement variety, the negative variety, the substitution variety, the incomplete alternatives variety and the combined response variety. Extended matching items/questions are also types of multiple choice questions, the main difference being that there are two or more scenarios. The principle of this type of MCQ is that each scenario should be roughly similar in structure and content, and each scenario has one 'best' answer from amongst the series of answer options given. This variation of MCQ is often used in medical education and other healthcare subject areas to test diagnostic reasoning. Research has shown that students exposed to this variation of MCQ format have a greater chance of answering incorrectly if they cannot synthesise and apply their knowledge (Case & Swanson, 1989). MCQs are useful for both summative and formative purposes. The use of MCQs as part of an assessment portfolio is extremely valuable, and is particularly useful for initial diagnostic purposes. Their strength as a diagnostic test lies in their capacity to detect, at a very early stage, any significant gaps in the knowledge of an individual student (Hibberd, 1996). The printed or displayed individual results can be given to each student together with directions to relevant supplementary material. The global results from the tests can inform and assist in directing tutorial assistance or other help, and may also be used to assist in the future planning of lectures, seminars and classes, or more generally for revision purposes. Their use in teaching improves test-wiseness (Brown, Bull & Pendlebury, 1997) as well as learning, and thereby increases the reliability of the assessment procedure. Increasing test-wiseness is sometimes thought to be questionable, yet if one is going to assess learning in a particular way, then one should give students the opportunity to learn and to be assessed in that way. Ebel and Frisbie (1986) justified test-wiseness by stating that more errors are likely to originate from students who have too little rather than too much skill in test taking. Brown, Bull and Pendlebury (1997) indicate that the use of MCQs in improving test-wiseness can also develop the self-confidence of the students being assessed. MCQs provide an important way of evaluating the mathematical ability of a large class of students, but they need more care in setting than the more conventional CRQs requiring full written solutions (Webb, 1989). There are several well documented rules to guide the construction of such questions (Gronlund, 1988; Nightingale et al., 1996; Webb, 1989). Carefully constructed MCQs can assess a wide variety of skills and abilities, including higher-order thinking skills. MCQs involve the following terminology:
Item: the term for the whole MCQ, including all answer choices.
Stimulus material: the text, diagram, table, graph, etc. on which the item is based.
Stem: either a question or an incomplete statement presenting the problem to which a response is required.
Options or alternatives: all the choices in an item.
Key: the correct answer or best option.
Distracters: the incorrect options, that is, the options other than the key.
Item set: a number of items all of which are based on the same stimulus material.
(Adapted from Hughes & Magin, 1996, p152)

Sample item (MATH 109 Tutorial Test 3, August 2004, University of the Witwatersrand):
Stem: If u and v are orthogonal (i.e. perpendicular), then ||u − v||² =
A. (||u|| + ||v||)²  (distracter)
B. (||u|| − ||v||)²  (distracter)
C. ||u||² − ||v||²  (distracter)
D. ||u||² + ||v||²  (key)

Creating a good MCQ starts with a description of the skills, abilities and knowledge to be tested, in the form of written specifications. Once the test specifications are prepared, test questions that assess those skills, abilities and/or knowledge must be constructed. Advice on setting MCQs:
● The item as a whole should test one or more important learning outcomes, processes or skills. The commonest faults found in MCQ items are irrelevance and triviality (McIntosh, 1974). McIntosh suggests that both of these faults can be avoided only by ensuring that all questions are related to previously established learning outcomes and that answering each question requires the application of knowledge, understanding or other abilities identified as important course outcomes.
● The stem should be stated in a positive form wherever possible. Diagrams and pictures can be an economical way of setting out the question situation. A complex or lengthy stem can be justified if it can serve as the basis for several questions.
● The options should all be similar to one another in number of words and in style, both for directness and to avoid giving clues, whether genuine or false.
● Questions should be checked by several experts to ensure that there are no circumstances or legitimate lines of reasoning by virtue of which any of the distracters could be correct; to look for unintended clues to the correct option; and to ensure that the key really is correct. The main challenge in setting good MCQs is to ensure that the distracters are plausible, so that they represent a significant challenge to the student's knowledge and understanding (Kehoe, 1995).
● Hughes and Magin (1996) advocate using simple words and clear concepts in order to avoid making mathematics tests highly dependent upon students' ability to read.

2.10.1 Advantages of MCQs

MCQs, although often criticised, still form the backbone of most standardised and classroom tests (Fuhrman, 1996). There is a large literature in the field of psychometrics, the psychological theory of mental measurement, confirming that there are good reasons for using multiple choice testing (Haladyna, 1999).
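Before turning to these justifications, the item anatomy described above (stem, options, key, distracters) and the ease-of-marking advantage can be made concrete in a small, machine-markable representation. This is a minimal sketch only: the class and field names are hypothetical illustrations, not part of any standard or of the MATH109 systems discussed in this study.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    """One multiple choice item: a stem, its labelled options, and the key."""
    stem: str
    options: dict     # option label -> option text
    key: str          # label of the correct (best) option

    def distracters(self):
        """All option labels other than the key."""
        return [label for label in self.options if label != self.key]

    def mark(self, response):
        """Dichotomous marking: 1 for the key, 0 otherwise."""
        return 1 if response == self.key else 0

# The sample item above, encoded for machine marking.
item = MCQItem(
    stem="If u and v are orthogonal, then ||u - v||^2 =",
    options={"A": "(||u|| + ||v||)^2", "B": "(||u|| - ||v||)^2",
             "C": "||u||^2 - ||v||^2", "D": "||u||^2 + ||v||^2"},
    key="D",
)

responses = {"student1": "D", "student2": "B", "student3": "D"}
scores = {name: item.mark(answer) for name, answer in responses.items()}
print(scores)               # {'student1': 1, 'student2': 0, 'student3': 1}
print(item.distracters())   # ['A', 'B', 'C']
```

Automated marking of this kind is what makes the rapid feedback, record keeping and item analysis referred to above feasible for large classes.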
The major justifications offered for their widespread use include the following (Tamir, 1990):
● they permit coverage of a wide range of topics in a relatively short time
● they can be used to measure different levels of learning
● they are objective in terms of scoring and therefore more reliable
● they are easily and quickly scored and lend themselves to machine scoring
● they avoid unjustified penalties to students who know their subject matter but are poor writers
● they are suitable for item analysis, by which various attributes can be determined, such as which items on a test were too easy, too difficult or ambiguous (Isaacs, 1994; Wesman, 1971).
It is a common misconception that MCQs can test only factual recall. They can be used to test many types of learning, from simple recall to high-level skills such as making inferences, applying knowledge and evaluating (Adkins, 1974; Aiken, 1987; Haladyna, 1999; Isaacs, 1994; Oosterhof, 1994; Thorndike, 1997; Williams, 2006). These testing experts point out that while multiple choice tests are quick and easy to score, good multiple choice items which test high-level skills are more difficult and time-consuming to develop. The design of MCQs is challenging if one wishes to assess deep learning. It is possible to test higher-order thinking through well-developed and researched MCQs, but this requires skill and time on the part of those designing the test. MCQs can provide a good sampling of the subject matter of concern and, therefore, an adequate and dependable sample of student responses. Given the same time for assessment, free-response items usually sample a smaller number of topics and therefore tend not to be as reliable as tests made up of many short questions (Fuhrman, 1996). Reliable multiple choice assessments can be ideal if comprehension, application and analysis of content is what one wants to test (Johnson, 1989). Johnson (1989) suggests two ways in which higher-level MCQs can be introduced into the assessment programme for a curriculum. One way is to make sure that the curriculum includes problem-solving skills such as interpreting data, making predictions, assessing information, performing logical analyses, using scientific reasoning or drawing conclusions, and to include questions of this nature in tests. Another way is to combine mathematics content with process: the concepts currently tested in the curriculum are examined, and items are restructured so that they require students to apply concepts, analyse information, make inferences, determine cause and effect or perform other thoughtful processes. By writing questions that assess students' higher levels of ability, one is really testing their potential (Johnson, 1989). Johnson (1989) cautions that classroom tests should also include some items written at the knowledge and comprehension levels, since students need to have a certain base of facts and information 'before they are able to reach other plateaus of applying skills and analyzing and evaluating data' (p61). According to Elton (1987), the reason why MCQs demand so much more than just memory is quite different: it has to do with the brevity of the question, and not with the fact that a correct answer has to be chosen. Brief questions can be set in such a way that the student is asked to think for about two minutes. If he/she thinks wrongly, nothing much is lost, as he/she can go on to the next question.
However, if one expects the student to think constructively for 25 minutes or an hour, and he/she then goes wrong in the first five minutes, the penalty is much greater. MCQs give the instructor the ability to obtain a wide range of scores for better discrimination among students. If fine discrimination among students is desired, MCQs offer the ability to obtain a wide range of scores because the test is made up of many separately scored parts (Fuhrman, 1996). With multiple choice tests, it is easier to frame questions so that all students address the same content; the student must deal with the responses made available. Although this increases the risk of the student answering correctly by merely recognising, or even guessing, the correct answer, at least objective scoring is made easier (Hibberd, 1996). CRQs provide less structure for the student, and a common problem is that test-wise students can overwhelm the marker with pages of unrelated discourse that may at first glance appear to signify understanding (Fuhrman, 1996). A further advantage of MCQs, in particular for large groups of students, is the reduction in cost and time. The cost saving is most significant in mass testing, such as for large lecture courses or standardised testing. MCQs are quick to mark and provide for ready analyses and comparisons between groups (Hibberd, 1996). High quality MCQs are not easy to construct, but the time spent in constructing them can be offset against the time saved in marking. If one has a large number of students (and not enough tutors) to assess frequently and objectively using CRQs, MCQs can be appropriate for some assessments, especially if subject-matter knowledge is emphasised in the course. Since MCQs can be machine scored, they can be used when scoring must be done quickly, thus being both cost and time effective. In addition to being a legitimate testing mode, the problem-oriented multiple choice examination has pragmatic advantages. First, it makes cheating by copying more difficult: with the multiple choice format it is easy to create duplicate examinations with the answers and questions renumbered, making copying very difficult. Secondly, all scoring can be done by machine, eliminating unfair subjective evaluations.

2.10.2 Disadvantages of MCQs

Graham Gibbs (1992) claims that one of the main disadvantages of MCQs is that they do not measure the depth of student thinking: they are 'often used to test superficial learning outcomes involving factual knowledge, and that they do not provide students with feedback' (p31). He argues, however, that this disadvantage is not inherent in the tests themselves, in that 'it is possible to devise objective tests which involve analysis, computation, interpretation and understanding and yet which are still easily marked' (p31). A common concern expressed when using MCQs is that students are encouraged to adopt a surface learning approach, rather than developing a deep approach to learning the topic (Black, 1998; Resnick & Resnick, 1992). Bloom (1956) himself wrote that such tests 'might lead to fragmentation and atomisation of educational purposes such that the parts and pieces finally placed into the classification might be very different from the more complete objective with which one started' (p5). Many educators believe that the use of objective tests such as MCQs, while providing inexpensive assessment of large groups of students, may be a factor in lowering achievement in mathematics.
The California Mathematics Council's (CMC) analysis of publishers' tests, for example, indicated that this assessment mode did not provide information about student understanding of graphs, probability, functions, geometric concepts or logic, focusing instead on rote computation (CMC and EQUALS, 1989). In another study, Berg and Smith (1994) challenge the validity of using multiple choice instruments to assess graphing abilities. They argue that, from the viewpoint of a constructivist paradigm, multiple choice instruments are an invalid measure of what subjects can actually do and, equally important, of their reasons for doing so. However, as shown by many authors (Gronlund, 1988; Johnson, 1989; Tamir, 1990), as the focus turns away from the correct answer variety (where one of the options is absolutely correct while the others are incorrect) to the best answer variety (where the options may be appropriate or inappropriate in varying degrees and the examinee has to select the best, that is, the most appropriate option), the picture changes dramatically. The student is now faced with the task of carefully analysing the various options, each of which may present factually correct information, and of selecting the answer which best fits the context and the data given in the item's stem. MCQs of this kind cater for a wide range of cognitive abilities. Compared with open-ended CRQs, although they do not require the student to formulate an answer, they do impose the additional requirement of weighing the evidence provided by the different options. The correct answers require analytical skills, knowledge of relevant theories and judgement, all cognitively high-level demands within the assessment models. A criticism, mentioned earlier, is that MCQs are very time-consuming to write. Andresen, Nightingale, Boud and Magin (1993) estimated that the development time is such that it would take three years before a course with 50 students a year showed a saving in staff time. If reliability is at a premium, then many rewrites and plentiful piloting are needed. A department will want to build up a substantial bank of MCQs so that a cohort of students gets a different item on a topic than did the students in the past two years. One suggestion for building up a bank of MCQs is to use them for formative purposes, in peer- and self-assessment, perhaps with computer or tutor support. Such a study was conducted by Barak and Rafaeli (2004), in which graduate MBA students were required to author questions and present possible answers relating to topics taught in class. The students were required to share these questions online with their classmates. The online question-posing assignment required students to be actively engaged in constructing instructional questions, testing themselves with their fellow students' questions (self-assessment) and assessing questions contributed by their peers (peer-assessment). Although standardised item banks of mathematics questions at the tertiary level are freely available, these are problematic in that they are standardised to specific contexts and may contain linguistic features and other concepts which are unfamiliar to students attending universities in South Africa. If used, such questions have to be modified and refined to suit the South African context. Another objection to the whole principle of multiple choice is that MCQs are not characteristic of the real world (Bork, 1984).
Educators often criticise multiple choice tests because such tests are rarely 'authentic' (Fuhrman, 1996). Webb (1989) relates a comment made by Peter Hilton on this very issue: …the very idea is highly artificial. Nowhere in real-life mathematics, let alone real life, is one ever faced with a problem together with five possible solutions, exactly one of which is guaranteed to be correct (p216). Fuhrman (1996) argues that when a real-world task is one that requires choosing the 'correct' or 'best' answer from a limited universe of answers, multiple choice tests can be used; but if the real-world task is one that requires the performance of a skill, such as a laboratory skill or a writing skill, MCQs are not usually appropriate. Webb's defence is that MCQs serve as a diagnostic tool rather than as a real-life event. The distracters in a multiple choice item function much like one of the standard procedures in a classical Piagetian interview: when the interviewer is not fully satisfied, even when the child gives a correct answer, understanding is checked by suggesting an alternative answer. The distracters in a good multiple choice item serve as such alternatives. In designing MCQs, a recognised strategy is to select plausible distracters. If these are chosen on the basis of representing common errors in understanding the topic, patterns of wrong choices can have useful diagnostic value. Most test setters use their experience of frequently encountered misconceptions when deciding on plausible distracters. The danger of this practice, however, is that when a student arrives at an answer on the grounds of a misconception and finds his wrong answer among the distracters, the student believes that he has answered correctly. The student often feels that his mathematical prowess is intact until he receives feedback on his response, and the misconception is thereby reinforced (Engelbrecht & Harding, 2003). This view is supported by Webb (1989), who proposes that distracters should be devised that …look feasible, but which could not have been obtained by means of a correct strategy incorporating a minor algebraic error (p217). When distracters based on misconceptions are included, immediate feedback is advisable if MCQs are used in formative assessment. MCQs must be written in a manner that does not give away the correct answers. The MCQ test must also feature a good overall balance of well written items clearly correlated to the learning outcomes of the course (Johnson, 1989). The rigidity of the marking scheme for MCQs has also been criticised. Several authors have reported that about one third of students choosing the correct option in a multiple choice question do so for a wrong reason (Tamir, 1990; Treagust, 1988; Johnstone & Ambusaidi, 2001). We assume that when a student makes a wrong choice, it indicates a certain lack of knowledge or understanding, or that the student reveals a misconception. However, it is possible for students to have the correct understanding, but to make a minor calculation error. In general, several options are available for the modification of test items in order to address these issues (Johnstone & Ambusaidi, 2001). Treagust (1988) developed a two-tier testing methodology for probing conceptual understanding. MCQs treat minor and major errors as equal and do not make provision for partial credit.
There have been several ingenious attempts to score MCQs so as to allow for partial knowledge (Friel & Johnstone, 1978; Johnstone & Ambusaidi, 2001). Some of these ask the students to rank all the responses in the question from best to worst. In other cases students are given a tick (✓) and two crosses (✗) and asked to use the crosses to label distracters they know to be wrong and the tick to choose what they think is the best answer. They get credit for eliminating the wrong options as well as for choosing the correct one. The rank order produced when these devices are applied to multiple choice tests and the rank order produced by an open-ended test correlate to give a value of about 0.9, almost a perfect match. This underlines the importance of the examiner having the means of detecting and rewarding reasoning (Johnstone & Ambusaidi, 2001). Partial credit for a partially correct option can also be given on learning management systems such as Blackboard (Engelbrecht & Harding, 2006).

2.10.3 Guessing

Another, well researched, concern when using MCQs is the possibility of guessing. It is always possible to guess at an answer, so that the probability of obtaining the correct answer in an item comprising four options by purely random selection is 25%. The probability of choosing the correct answer randomly decreases if there is a sufficient number of distracters; true/false questions, with only two options, are rarely a good idea. Different evaluators have taken different positions regarding the way the problem of guessing should be addressed. Guessing can be counteracted by negative marking or penalty marking, whereby each wrong answer leads to marks being lost. A rational student who is not sure of the answer to a question will therefore not answer it, incurring no penalty; a wrong-answer penalty strongly discourages guessing. Aubrecht and Aubrecht (1983) argue that although they would like to discourage random guessing, there is an important pedagogical reason to encourage reasoned guessing: active involvement on the part of the student in sifting through the answers on the test, even if the wrong answer is eventually chosen, prepares the student to understand the correct answer when it is explained. If students can correctly eliminate some distracters, this reasoned guessing will leave them better off than guessing randomly. A wrong-answer penalty in MCQs reduces the effect of guessing (Harper, 2003) and gives indications of the reasoning paths of students (Johnstone & Ambusaidi, 2001). At some institutions, however, negative marking is prohibited. Using negative marking also requires knowledge of the probability of guessing the correct answer, which may be beyond the statistical competence of many question designers, particularly if the test includes multiple response questions or matching questions, for which the calculation is more complex. Harper (2003) developed a method for post-test correction for guessing, which enables the test designer to neutralise the impact of guessing after the test has been written. An alternative approach to eliminating guessing is the use of justifications (Tamir, 1990). The term justification refers to the reasons and arguments given by a respondent to a multiple choice item for the choice made. When students are required to justify their choices in MCQs, they have to consider the data in all the options and explain why a certain option is better than the others.
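Returning briefly to the guessing arithmetic above: the usual formula-scoring argument sets the wrong-answer penalty so that blind guessing has an expected mark of zero, while reasoned elimination of distracters is still rewarded. The sketch below illustrates only this general principle; the specific numbers are made up and are not taken from Harper (2003) or from the marking schemes used in this study.

```python
def guessing_penalty(num_options):
    """Penalty per wrong answer that makes blind guessing worth zero on average."""
    return 1.0 / (num_options - 1)

def expected_score(p_correct, penalty):
    """Expected mark for one item when the chance of choosing the key is p_correct."""
    return p_correct * 1.0 - (1 - p_correct) * penalty

k = 4                                   # four-option MCQ: 25% chance of a lucky guess
penalty = guessing_penalty(k)           # deduct 1/3 of a mark per wrong answer
print(expected_score(1.0 / k, penalty))     # 0.0 -> purely random guessing gains nothing

# A student who can eliminate two distracters and guesses between the remaining two:
print(expected_score(1.0 / 2, penalty))     # ~0.33 -> reasoned guessing is still rewarded
```

This is the sense in which a wrong-answer penalty discourages random guessing while still crediting the partial knowledge involved in eliminating distracters.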
Requiring justifications for multiple choice items also has a back-wash effect: students who know that they may be asked to justify their choices will attempt to learn their subject matter in a more meaningful way and in more depth, so that they are prepared to write an adequate and complete justification. Justifications of choices in multiple choice items significantly increase the information that test results provide about students' knowledge. Their contribution is made by:
● identifying misconceptions, missing links and inadequate reasoning among students who correctly choose the best answer
● giving a better understanding of the notions held by students who choose certain distracters.

2.10.4 In defence of multiple choice

Seen as part of an overall strategy of assessment, MCQs have a great deal to commend them. Much of the criticism levelled at multiple choice tests focuses on poorly worded options which penalise the better student, and on the possibility that the correct answer may be guessed. Neither of these faults is inherent in the multiple choice test itself, but only in the way in which it is used. The primary focus of a mathematics testing methodology based on an active, constructivist view of learning is on revealing how individual students think about key concepts in mathematics. Rather than comparing students' responses with a correct answer to a question, the emphasis should be on understanding the variety of responses that students make to a question and on inferring from those responses the students' level of conceptual understanding. In defence of multiple choice tests, they provide faster ways of assessing the large numbers of first year undergraduate students studying tertiary mathematics, and test scores can be highly reliable. This research study has concentrated mostly on MCQs, and not on the other types of PRQs. As discussed in the literature review, MCQs enable one to sample a student's knowledge of mathematics rapidly, and they may be used to measure deep understanding. The literature reviewed indicates that alternative types of MCQs encourage a deep approach to learning, as they require students to solve a problem by utilising their knowledge and intellectual skills. Traditional factual-recall MCQs can be modified both to assist student learning and to better assess students' progress towards understanding. A sophistication of the standard multiple choice test is available through the use of computer adaptive testing. Here, the questions to be presented to a student at any point during a test can be chosen on the basis of the quality of the answers supplied up to that point. This can mean that each student can avoid spending time on items which give little useful information because they are far too difficult or far too easy (Scouller & Prosser, 1994). Biggs (1991) points out that the use of MCQs in very large classes provides a form of continuous assessment and feedback: students knowing how they have done on a multiple choice test can provide more feedback than is otherwise available…and that it is also possible to provide computerised tutorial feedback for students when they give incorrect answers to multiple choice questions (p31). The inclusion of multiple choice formats in assessment lessens the burden of heavy teaching loads coupled with large student numbers experienced by academic staff, particularly in the early undergraduate years.
This enables academic staff to perform their duties as both teachers and researchers in academic institutions. The challenge, then, is to find out enough about student understanding in mathematics to design assessment techniques that can accurately reflect these different understandings.

2.11 GOOD MATHEMATICS ASSESSMENT

From a methodological point of view, mathematics assessment for broader educational and societal uses calls for tests that are comprehensive in breadth and depth (Ramsden, 1992). With regard to the importance of assessment, Ramsden (1992) says: From our students' point of view, assessment always defines the actual curriculum. In the last analysis, that is where the curriculum resides for them, not in the lists of topics or objectives. Assessment sends messages about the standard and amount of work required, and what aspects of the syllabus are most important. Too much assessed work leads to superficial approaches; clear indications of priorities in what has to be learned, and why it has to be learned, provide fertile ground for deep approaches (p187). Whether we focus on examinations or on other forms of assessment, we can use a range of techniques to assess the nature and extent of student learning. Our decisions about which forms of assessment to choose are likely to be affected by the particular learning context and by the type of learning outcome we wish to achieve (Wood, Smith, Petocz & Reid, 2002). Essentially, good mathematics assessment practices:
● encourage meaningful learning, when tasks encourage understanding, integration and application
● are valid, when tasks and criteria are clearly related to the learning objectives and when marks or grades genuinely reflect students' levels of achievement
● are reliable, when markers have a shared understanding of what the criteria are and what they mean
● are fair, if students know when and how they are going to be assessed, what is important and what standards are expected
● are equitable, when they ensure that students are assessed on their learning in relation to the objectives
● inform teachers about their students' learning (Biggs, 2000; Brown & Knight, 1994; Wood et al., 2002).
It is also possible (and desirable) to characterise the quality of a test as a whole. In this context, quality is defined as the extent to which the test measures what we wish it to measure, and the degree to which it is consistent as an instrument for this measurement (Niss, 1993). The first of these characterises the validity of the test; the second is its reliability. Measuring quality in terms of reliability and validity can and should be done for any type of assessment. Good assessment must be both reliable and valid (Fuhrman, 1996); this definition is part of the "common wisdom" of psychometrics (Haladyna, 1999). A reliable assessment is one which consistently achieves the same results with the same (or a similar) cohort of students. Qualitatively, a reliable measure is one that provides consistent scores. There are several ways to determine the reliability of a measure. One type of reliability is defined as the level of agreement between test scores for a test given on several occasions. Reliability can be expressed analytically and, using performance data, calculated for any scored test.
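As an illustration of how such a reliability coefficient can be calculated from performance data, the sketch below computes the Kuder-Richardson formula 20 (KR-20), a standard internal-consistency estimate for dichotomously scored items. It is a minimal sketch with made-up scores, not an analysis of the MATH109 data used in this study.

```python
def kr20(score_matrix):
    """KR-20 internal-consistency reliability for 0/1 item scores.

    score_matrix: one row per student, each row a list of 0/1 item scores.
    """
    n_students = len(score_matrix)
    n_items = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    # Sum over items of p*(1-p), where p is the proportion answering the item correctly.
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in score_matrix) / n_students
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

# Hypothetical test of five dichotomously scored items written by six students.
scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
]
print(round(kr20(scores), 2))   # 0.73 for this illustrative data
```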
Various factors affect reliability: the number and quality of the questions, including ambiguous questions and too many options within a question paper; the examination environment; the test administration directions; vague marking instructions; the objectivity of the scoring procedures; poorly trained markers; and the test security arrangements (Nightingale et al., 1996). An assessment is valid when it accurately measures what it intends to measure. Validity is determined in a variety of ways, depending on the purpose of the test. For example, for a test that is intended to assess subject matter, the validity of the test content can be confirmed by linking the items to the important concepts in the curriculum. A valid test is built by ensuring that each question is linked to a specific item that is included in the curriculum. Often the description of the skills and knowledge to be tested is too broad to permit the measurement of each and every concept listed. In this case, a valid test should sample the subject matter in a way that ensures the broadest possible representation of the subject in the examination. For a test used for predictive purposes, for example to predict success in an academic programme, the validity can be confirmed by correlating performance on the test with some measure of the actual success attained (Black, 1998). A student's mathematical understanding, for example of linear functions or of the capacity to solve non-routine examples, is a "mental concept" (Romagnano, 2001), and as such can only be observed indirectly. Objectivity in mathematics assessment would be desirable if we could have it but, according to Kerr (1991), is a myth. Romagnano (2001) is of the opinion that all assessments of students' mathematical understanding are subjective. Good mathematics assessment should therefore not be defined in terms of its objectivity or subjectivity. A more useful way to characterise good mathematics assessment methods is with respect to their consistency (or reliability) and the meaning (or validity) of the information they provide. When a consistent method is used by different teachers to assess the knowledge of a given student, the teachers' assessments will agree. When two students have roughly the same level of understanding of a set of mathematical ideas, consistent assessments of these students' understandings will be roughly equal as well. Good mathematics assessment methods provide teachers with information about student understanding of specific mathematical ideas and about how this understanding changes over time, information that can be used to make appropriate curriculum decisions.
The Assessment Principle: Assessment should support the learning of important mathematics and furnish useful information to both teachers and students. – Principles and standards for school mathematics (NCTM, 2000)
The National Council of Teachers of Mathematics (NCTM, 2000) evaluation standards suggest that:
● student assessment be integral to instruction
● multiple means of assessment be used
● all aspects of mathematical knowledge and its connections be assessed
● instruction and curriculum be considered equally in judging the quality of a programme.
According to Webb and Romberg (1992), good mathematics assessment practices are those in which students can:
● learn to value mathematics
● develop confidence
● communicate mathematically
● learn to reason mathematically
● become mathematical problem solvers (p39).
Assessment should be a means of fostering growth toward high expectations and should support high levels of student learning. When assessments are used in thoughtful and meaningful ways, students' scores provide important information that, when combined with information from other sources, can lead to decisions that promote student learning and equality of opportunity (NCTM, 2000).

2.12 GOOD MATHEMATICS QUESTIONS

The types of questions that we set reflect what we, as mathematics educators, value and how we expect our students to direct their time (Wiggins, 1989). In striving to set questions of good quality, assessors need to be able to measure how good a mathematics question is. Good mathematics questions are those that help to build concepts, alert students to misconceptions and introduce applications and theoretical questions. When students are asked to puzzle and explain, and to apply their knowledge in an unfamiliar context, they must construct meaning for themselves by relating what they know to the problem at hand. In other words, they must act like mathematicians. This kind of activity encourages them in the belief that mathematics is primarily a reasonable enterprise, founded in the relationships apparent in everyday life and accessible to all students, whatever their age or level of ability (Massachusetts Department of Education, 1987, p41). According to Romberg (1992), the criteria for measuring good mathematics questions can be traced to three main concerns:
1. Test questions must reflect the current view of the nature of mathematics. This view emphasises understanding, thinking and problem solving, requiring students to see mathematical connections in a situation-based problem and to be able to monitor their own thinking processes in order to accomplish the task efficiently. This requires that test questions have the following characteristics:
● They assess thinking, understanding and problem solving in a situational setting, as opposed to algorithmic manipulation and recall of facts.
● They assess the interconnection between mathematical concepts and the outside world.
2. Test questions must reflect the current understanding of how students learn. The current view of instruction and learning assumes that students are active learners who create their own meaning during the instructional process. This requires that test questions:
● be engaging
● be situational and based upon real-life applications
● have multiple entry points, in the sense that students at various levels of mathematical sophistication should be able to answer the question
● allow students to explore difficult problems, with students' explorations being rewarded
● allow students to answer correctly in diverse ways according to their experiences, rather than requiring a single answer.
3. Test questions must support good classroom instruction and not lend themselves to distortion of the curriculum. Good curriculum practices require that test questions have the following characteristics:
● They must be exemplars of good instructional practices.
● They should be able to reveal what students know and how they can be helped to learn more mathematics (p125).
Hubbard (2001) suggests that good mathematics questions are those that require students to reflect on results, in addition to obtaining them. Good questions specifically encourage students to develop relational understanding, a process approach and higher-level learning skills.
Further, students' solutions to good questions should indicate what kind of intellectual activity they engaged in to answer the questions. Good questions direct students to think, as well as to do (Hubbard, 2001). Asking the right question is an art to be cultivated both by educators and by students, for teaching and learning as well as for assessment. Good questions and their responses will contribute to a climate of thoughtful reflectiveness (Niss, 1993). Stenmark (1991) has suggested a list of possible characteristics of good open-ended questions that open new avenues of thinking for students:
● Problem comprehension. Can students understand, define, formulate or explain the problem or task? Can they cope with poorly defined problems?
● Approaches and strategies. Do students have an organised approach to the problem or task? How do they record their work? Do they use tools (diagrams, graphs, calculators, computers, etc.) appropriately?
● Relationships. Do students see relationships and recognise the central idea? Do they relate the problem to similar problems previously done?
● Flexibility. Can students vary the approach if one approach is not working? Do they persist? Do they try something else?
● Communication. Can students describe or depict the strategies they are using? Do they articulate their thought processes? Can they display or demonstrate the problem situation?
● Curiosity and hypotheses. Do students show evidence of conjecturing, thinking ahead and checking back?
● Self-assessment. Do students evaluate their own processing, actions and progress?
● Equality and equity. Do all students participate to the same degree? Is the quality of participation opportunities the same?
● Solutions. Do students reach a result? Do they consider other possibilities?
● Examining results. Can students generalise and prove their answers? Do they connect the ideas to other similar problems or to the real world?
● Mathematical learning. Did students use or learn some mathematics from the activity? Are there indications of a comprehensive curriculum? (p31).
Questions might also assess a student's understanding of a specific mathematical topic. Such focused mathematics questions can be developed according to instructional needs. Retaining unsatisfactory questions is contrary to the goal of good mathematics assessment (Kerr, 1991). This view is consistent with the NCTM Evaluation Standards proposal that 'student assessment be integral to instruction' (NCTM, 1989, p190). By thinking of instruction and assessment as simultaneous acts, educators optimise both the quantity and the quality of their assessment and their instruction, and thereby optimise the learning of their students (Webb & Romberg, 1992).

2.13 CONFIDENCE

When the National Council of Teachers of Mathematics (NCTM) published its Curriculum and evaluation standards for school mathematics in 1989, many of the recommended assessment methods were different from those routinely used in the mathematics classrooms of the 1980s. For example, one such recommended assessment method was having students write essays about their understanding of mathematical ideas, and using classroom observations and individual student interviews as methods of assessment. Evaluation Standard 10 – Mathematical Disposition (NCTM, 1989) maintains that it is also important to assess students' confidence, interest, curiosity and inventiveness in working with mathematical ideas.
Corcoran and Gibb (1961) and other writers in the 1950s and 1960s argued similar points (as cited in the National Council of Teachers of Mathematics Yearbook, 1961): One of the best indications of the mastery of a subject possessed by a pupil is his ability to make significant comments or to ask intelligent questions about the subject… Another indication of achievement in a field is interest in that field… Still another indication of achievement is the degree of confidence displayed when work is assigned or undertaken (Spitzer, pp193-194). Appraisal ideally includes many aspects of learning in addition to the acquisition of facts and skills. It includes the student's attitude toward the work; the nature of his curiosity about and ingenuity with mathematics; his work habits and his methods of recording steps toward a conclusion; his ability to think, to exclude extraneous data, and to formulate a tentative procedure; his techniques and operations; and finally, his feeling of security with his answer or conclusion (Sueltz, pp15-16). Using only the results of multiple choice tests can lead to incorrect conclusions about what a student does or does not know (Webb, 1989). As Johnson (1989) indicated, if students can write clearly about mathematical concepts, then they demonstrate that they understand them. In a study conducted by Gay and Thomas (1993) with 199 seventh- and eighth-grade students, which focused on students' understanding of percentage, about one fourth of the students gave no explanation to support their correct choice to the multiple choice question. It is possible that this lack of response gives some indication of the number of students who simply guessed correctly. It is also possible that these students lacked confidence in their reasoning and chose not to give any explanation (Gay & Thomas, 1993). Students need to have reasons for making decisions and solving problems in mathematics, and the confidence to share that reasoning with others (Webb, 1994). It is well documented that mathematical attitude is one of the strongest predictors of success in the mathematical sciences (McFate & Olmsted, 1999; Wagner, Sasser & DiBiase, 2002). There are, however, a number of non-cognitive factors, such as study habits (consistent work), motivation (interest in and the desire to understand the presented material) and self-confidence, that may be equally or more important in the prediction of student success (Angel & LaLonde, 1998). The extent of students' awareness of their strengths and weaknesses is known to be associated with their success, or lack of success, in some areas of mathematical performance. For example, in the literature on mathematical problem solving (Campione, Brown & Connell, 1988; Krutetskii, 1976; Schoenfeld, 1987), successful problem solvers are described as those students who have a collection of powerful strategies available to them and who can reflect on their problem-solving activities effectively and efficiently. In contrast, descriptions of unsuccessful problem solvers tend to portray them as students who have command of fewer strategies and who do not function in a self-reflective or self-evaluative manner (Kenney & Silver, 1993). Students' ability to monitor their learning is one of the key building blocks in self-regulated learning, which, in turn, is an essential requirement for success at tertiary level (Isaacson & Fujita, 2006).
Students who are skilful at academic self-regulation understand their strengths and weaknesses as learners, as well as the demands of specific tasks. Students who are expert learners know when they have mastered, or not mastered, the required academic tasks and can adjust their learning accordingly (Isaacson & Fujita, 2006). Such students are said to have high metacognitive ability. The inability to do so is especially harmful in the case of poor performers, who become victims of an assessment regime that they do not understand and which they perceive themselves to be unable to control. Isaacson and Fujita (2006) have shown that low achieving students have lower metacognitive knowledge-monitoring abilities: they are less able to predict their performance after writing a test; they rely more on the time spent studying than on mastery of concepts to decide their confidence of success; they are less likely to adjust their self-efficacy in response to feedback received from taking a test; and they show the largest discrepancy between their actual performance and their expected performance, satisfaction goals and pride goals. Tobias and Everson (2002) have found that the ability to differentiate between what is known (learned) and unknown (unlearned) is an important ingredient for success in all academic settings. Metacognition has two components: knowledge about cognition, and regulation of one's own cognitive processes (Baker & Brown, 1984). The ability to know how well one is performing, through monitoring and checking the outcomes of learning (self-assessment), is an essential requirement for the planning and control of appropriate behaviour to ensure mastery of subject content. Self-reflection on, and self-assessment of, a student's confidence in answering a test item, whether PRQ or CRQ, encourages sense-making and autonomy. A number of studies have been reported in which the metacognitive ability of students was assessed and correlated with test performance by means of confidence judgements indicating the likelihood that the answer provided to each multiple choice question was correct (Carvalho, 2007; Sinkavich, 1995). Carvalho (2007) investigated the effects of test type (free-response/short-answer and multiple choice tests) on students' performance, confidence judgements and the accuracy of those judgements. The results showed that the difference between performance and judgement accuracy was significantly larger for multiple choice than for short-answer tests in undergraduate psychology. Students were significantly more confident in multiple choice than in short-answer tests, but their judgements were significantly more accurate in the short-answer than in the multiple choice tests. In addition, upon repeated exposure to a short-answer test format, both the performance and the confidence of students increased, whereas this was not the case for multiple choice testing. Carvalho suggested that a possible explanation for this observation is that multiple choice tests may require tasks of lower cognitive demand, such as recognition, compared with the higher demand of recall and self-construction of responses. This may tempt students into reduced metacognitive activity: they do not need to engage as deeply with the content and their mastery of the material in order to make an accurate judgement (Pressley, Ghatala, Woloshyn, & Pirie, 1990).
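The comparison of confidence judgements with actual performance in these studies amounts to a simple calibration calculation. The sketch below shows one elementary way of doing it (mean reported confidence versus proportion correct). It is illustrative only, with hypothetical data, and is not the Confidence Index (CI) used later in this thesis, which is defined in Section 5.2.2.

```python
def calibration(confidences, correct):
    """Compare self-reported confidence (0-1) with actual outcomes (0/1).

    Returns (mean confidence, proportion correct, bias); a positive bias
    indicates overconfidence, a negative bias underconfidence.
    """
    n = len(confidences)
    mean_conf = sum(confidences) / n
    prop_correct = sum(correct) / n
    return mean_conf, prop_correct, mean_conf - prop_correct

# Hypothetical confidence ratings and outcomes for one student on six MCQ items.
conf = [0.9, 0.8, 0.9, 0.7, 0.95, 0.85]
outcome = [1, 0, 1, 0, 1, 0]
print(calibration(conf, outcome))   # roughly (0.85, 0.5, 0.35): markedly overconfident
```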
Carvalho (2007) suggested that the continuous pairing of high confidence and low accuracy levels observed for multiple choice assessment could negatively affect students' self-regulation of learning. If they do not understand the reasons why their judgements are consistently inaccurate despite their feeling of confidence, they may start to feel that they have no control over their learning and its relationship to the outcomes of assessment. When students are asked to express their confidence in the correctness of the answers they provide during assessment, they are required to engage in the metacognitive activity of judging their conceptual understanding and/or their mastery of skills and its proper application to the task at hand. Assessment in mathematics must build learners' confidence and competence (Anderson, 1995). As we look for increased achievement and motivation in our mathematics classrooms, we must acknowledge and develop self-assessment of confidence as one of the many ways to include authentic assessment as a key element in the learning process. The confidence index (CI), which is an indication of confidence, is discussed in Section 5.2.2.

CHAPTER 3: RESEARCH DESIGN AND METHODOLOGY

INTRODUCTION

In this chapter, I describe how I went about investigating my research questions (posed in section 3.2). I explain how I moved from an informal position, based on my observations and interpretations over many years as a mathematics lecturer of undergraduate students, to a formal research-oriented position. By speaking of 'how' I moved, I am referring to my methods of doing formal research and collecting 'relevant' data, and to my justification for the appropriateness of these methods. These methods, together with their motivations and characterisations, constitute the methodology of my research. Initially, in section 3.1, the research design is described. This is followed by my research questions, formulated in section 3.2. Section 3.3 outlines the qualitative research methodology of the study, in which the interviews with the sample of undergraduate students are described. In section 3.4, the quantitative research methodology is discussed; in this section the Rasch model, the particular statistical method employed, is described. Lastly, issues related to reliability, validity, bias and ethics are discussed in section 3.5.

3.1 RESEARCH DESIGN

According to Burns and Grove (2003), the purpose of research design is to achieve greater control of the study and to improve the validity of the study by examining the research problem. In deciding which research design to use, the researcher has to consider a number of factors. These include the focus of the research (orientation of action), the unit of analysis (the person or object of data collection) and the time dimension (Bless & Higson-Smith, 1995). Research designs can be classified as either non-experimental or experimental. In non-experimental designs the researcher studies phenomena as they exist; in contrast, the various experimental designs all involve researcher intervention (Gall, Gall & Borg, 2003). This research study is non-experimental in design and, as the purpose of this study is prediction, a correlational research design is used. Correlational research refers to studies in which the purpose is to discover relationships between variables through the use of correlational statistics.
The basic design in correlational research is very simple, involving collecting data on two or more variables for each individual in a sample and computing a correlation coefficient. Many studies in education have been done with this design. As in most research, the quality of correlational studies is determined not by the complexity of the design or the sophistication of the analytical techniques, but by the depth of the rationale and the theoretical constructs that guide the research design. The likelihood of obtaining an important research finding is greater if the researcher uses theory and the results of previous research to select the variables to be correlated with one another (Gall, Gall & Borg, 2003). Correlational research designs are highly useful for studying problems in education and in the other social sciences. Their principal advantage over causal-comparative or experimental designs is that they enable researchers to analyse the relationships among a large number of variables in a single study. In education and the social sciences, we frequently confront situations in which several variables influence a particular pattern of behaviour. Correlational designs allow us to analyse how these variables, either singly or in combination, affect the pattern of behaviour. In this study, first year Mathematics Major students from the University of the Witwatersrand were selected from the MATH109 course, and their performance on assessment in the PRQ format was compared to their performance on assessment in the CRQ format. In addition, students were asked to indicate a confidence of response corresponding to each test item, in both the CRQ and PRQ assessment formats. Further data was collected from experts, who indicated their opinions of the difficulty of the test items, both PRQs and CRQs, independently of the students' performance in each question. Further discussion of the research methodology is presented in section 3.4.

3.2 RESEARCH QUESTIONS

The objective of this research study is to design a model to measure how good a mathematics question is, and to use the proposed model to determine which of the mathematics assessment components can be successfully assessed in the PRQ format and which can be successfully assessed in the CRQ format. To meet this objective, the study is designed according to the following steps:
[1] Three measuring criteria are used to develop a model for determining the quality of a mathematics question (the QI model).
[2] The quality of all PRQs and CRQs is determined by means of the QI model.
[3] A comparison is made within each assessment component between PRQ and CRQ assessment.
Based on these design steps, and having defined the concept of a good mathematics question, the research question is formulated as follows:
Research question: Can we successfully use PRQs as an assessment format in undergraduate mathematics?
In order to answer the research question, the following subquestions are formulated:
Subquestion 1: How do we measure the quality of a good mathematics question?
Subquestion 2: Which of the mathematics assessment components can be successfully assessed using the PRQ assessment format, and which can be successfully assessed using the CRQ assessment format?
Subquestion 3: What are student preferences regarding different assessment formats?
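Before turning to the methodology sections, the correlational computation at the heart of the design described in section 3.1 can be made concrete. The sketch below computes a Pearson correlation coefficient between each student's PRQ and CRQ marks; the marks are hypothetical and the calculation is shown only to make the design explicit, since the actual analyses in this study use the Rasch model and the QI model described later.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical percentage marks for five students in the two assessment formats.
prq_marks = [65, 72, 48, 90, 55]
crq_marks = [60, 75, 40, 85, 58]
print(round(pearson_r(prq_marks, crq_marks), 2))   # 0.96: a strong positive correlation
```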
3.3 QUALITATIVE RESEARCH METHODOLOGY Qualitative research in education has roots in many academic disciplines (Cresswell, 2002). Some qualitative researchers also have been influenced by the postmodern approach to inquiry that has emerged in recent years (Angrosino & Mays de Pérez, 2000; Merriam, 1998). Cresswell (1998, p150) lists the advantages of using qualitative research methodology as follows: ● Qualitative research is value laden ● The researcher has firsthand experience of the participant during observation ● Unusual aspects can be noted during observation ● Information can be recorded as it occurs during observation ● It saves the researcher transcription time ● The researcher can control the line of questioning in an interview ● The participants can provide historical information. 85 3.3.1 Qualitative data collection Purpose of the interviews The purpose of the interviews was to probe MATH109 students’ beliefs, attitudes and inner experiences about the different assessment formats they had been exposed to in their tests and examinations. The task in the interviews was designed with a research purpose; my responses (as interviewer) were more geared to finding out what the student was thinking (the research role) rather than assisting (the teacher role). The very fact that I was present at the interviews must also have affected the thinking and responses of the students that were being interviewed. The qualitative data will be used to address the third research subquestion of what student preferences are regarding different assessments formats. Interviews The interviews were structured along certain dimensions, and semi-structured along others. It was structured in that all students were asked exactly the same set of predetermined questions (see page 88 for the questions); it was semistructured in that my responses and prompts, as interviewer, depended to a large extent on the responses of the interviewee and on my relationship with that particular student. As the interviewer, I strove for consistency on certain dimensions in all interviews. Each interview was framed by the same set of questions and timeframe which provided a type of structure to the interview. Despite these commitments to a measure of consistency, the clinical interviews in this study (as in other educational research type studies) are necessarily not neutral. This is because clinical interviews, just like any other learner-teacher engagement, are social productions. In this regard, Minick, Stone and Forman (1993) assert: 86 Educationally significant human interactions do not involve abstract bearers of cognitive structures but real people who develop a variety of interpersonal relationships with one another in the course of their shared activity in a given institutional context. … For example, appropriating the speech or actions of another person requires a degree of identification with that person and cultural community he or she represents (p6). I was able to engage far more effectively with some students rather than others in the interview situations (in the sense of being able to generate more penetrative probes). For example, with certain students whose home language is not English, much of my time was spent on interpreting what they said. Format of the interviews Nine MATH109 students with various gradings (weak/average/good) based on their June class record marks, from different racial backgrounds and different gender classes were interviewed, one at a time over a period of about two weeks in October 2004. 
Each interview took place in my office and was tape recorded and later transcribed. The maximum duration of each interview was 30 minutes. Table 3.1 lists the MATH109 student interviewees and their academic backgrounds.

Table 3.1: MATH109 student interviewees and their academic backgrounds.
[A: ≥75%; B: 70-74%; C: 60-69%; D: 50-59%; Fail: <50%]

Interviewee   Class record October [%]   Exam (%)   Final (%)   Symbol
[1]           70.05                      32.77      51.41       D
[2]           80.67                      85         82.84       A
[3]           81.26                      81         81          A
[4]           58.11                      29.16      43.64       Fail
[5]           59.43                      53.33      56.38       D
[6]           42.92                      26.28      34.65       Fail
[7]           68.28                      44.44      56.36       D
[8]           74.48                      82.22      78.35       A
[9]           36.57                      31.11      33.84       Fail

At the commencement of the interview, I reminded each student that I was doing research to probe their beliefs, attitudes and inner experiences about the different assessment formats they had been exposed to in their tests and examinations. My opening questions were to find out about the background of each student, i.e. why they registered for Mathematics I Major, career choice, etc. This seemed to put the student at ease and they found the situation less threatening. I then moved on to the ten interview questions.

Interview questions:
[1] I'm interested in your feelings about the different ways in which we asked questions in your maths tests, a percentage being multiple choice provided response questions and the other the more traditional open-ended constructed response questions. Do you like the different formats of assessment?
[2] Why / Why not?
[3] Which type of question do you prefer in maths?
[4] Why do you prefer type A to type B?
[5] Which type of questions did you perform better in? Why?
[6] Do you feel that the mark you got for the MCQ sections is representative of your knowledge? What about the mark you got for the traditional long questions? Do you feel this is representative of your knowledge?
[7] Do you have confidence in answering questions in maths tests which are different to the traditional types of questions? Elaborate.
[8] What percentage of the maths tests do you recommend should be multiple-choice questions, and what percentage should be open-ended long questions?
[9] How would you ask questions in maths tests if you were responsible for the course?
[10] Is there opportunity for cheating in these different formats of assessment? Please tell me about them.

After asking these ten questions, I concluded the interview by asking each student if they had anything else to add or if they had any questions for me. Examples of responses will be given and discussed in greater detail in the qualitative data analysis presented in section 4.1.

3.4 QUANTITATIVE RESEARCH METHODOLOGY

According to McMillan and Schumacher (2001), quantitative research involves the following:
● Explicit description of data collection and analysis procedures
● Scientific measurement and statistics used
● Deductive reasoning applied to numerical data
● Statements of statistical relevance and probability.

The Rasch model was used as the quantitative research methodology in this study. It is a probabilistic model that estimates person ability and item difficulty (Rasch, 1960). Although it is common practice in the South African educational setting to use raw scores in tests and examinations as a measure of a student's ability, research has shown that misleading and even incorrect results can stem from an erroneous assumption that raw scores are in fact linear measures (Planinic, Boone, Krsnik & Beilfuss, 2006).
Linear measures, as used in the Rasch model, on the other hand, are on an interval scale, where arithmetic and statistical techniques can be applied and useful inferences can be made about the results (Rasch, 1980). 3.4.1 The Rasch model In the following poem written by Tang (1996), each verse highlights a different characteristic of the Rasch model: A model of probability; uniformity; sufficiency; invariance property; diagnosticity and ubiquity. 89 Poem: What is Rasch? Rasch is a model of probability that estimates person ability, that estimates item difficulty, that predicts response probability nothing but a function of ability and difficulty. Rasch is a model of uniformity that places the values of person ability and the values of item difficulty on the same scale with no diversity. Rasch is a model of sufficiency that uses number right for estimating person ability and count of correct responses for item difficulty; that relates raw score to person ability and response distribution to item difficulty -- with no ambiguity. Rasch is a model with invariance property that fosters person-free estimation of item difficulty and test-free estimation of person ability; that frees difficulty estimates from sample peculiarity and ability estimates from difference in test difficulty. Rasch is a model with diagnosticity that flags item away from unidimensionality, or items with local dependency; that identifies persons with response inconsistency, or person or groups measured with inappropriacy; that maintains construct fidelity and enhances test validity. Rash is a model of ubiquity; from educational assessment to sociology, from medical research to psychology, from item analysis to item banking technology, from test construction to test equity…. -- nothing beats its utility and popularity. (Huixing Tang, 1996, p507) 90 3.4.1.1 Historical background The Rasch model was developed during the years 1952 to 1960 by the Danish mathematician and statistician Georg Rasch (1901-1980). The development of the Rasch model took its beginning with the analysis of slow readers in 1952. The data in question were from children who had trouble reading during their time in school and for that reason were given supplementary education. There were several problems in the analysis of the slow readers. One was that the data had not been systematically collected. The children had for example not been tested with the same reading tests, and no effort had been made to standardise the difficulty of the tests. Another problem was that World War II had taken place between the two testings. This made it almost impossible to reconstruct the circumstances of the tests. It was therefore not possible to evaluate the slow readers by standardisation as was the usual method at the time (Andersen & Olsen, 1982). Accordingly, it was necessary for Rasch to develop a new method where the individual could be measured independent of which particular reading test had been used for testing the child. The method was as follows: two of the tests that had been used to test the slow readers were given to a sample of school children in January 1952. Rasch graphically compared the number of misreadings in the two tests by plotting the number of misreadings in test 1 against the number of misreadings in test 2 for all persons. This is illustrated in Figure 3.1. 91 Figure 3.1: Number of misreadings of nine subjects in two tests. 
[Plot of the number of misreadings α_v1 in test 1 against the number of misreadings α_v2 in test 2 for each of the nine subjects.] (Source: Rasch, 1980)

The graphical analysis showed that, apart from random variations, the number of misreadings in the two tests was proportional for all persons. Further, this relationship held, no matter which pair of reading tests he considered. To describe the random variation Rasch chose a Poisson model. The probability that person number v had misread α_vi words in test number i he accordingly modelled as

P(\alpha_{vi}) = \frac{e^{-\lambda_{vi}} (\lambda_{vi})^{\alpha_{vi}}}{\alpha_{vi}!}    (1.1)

where λ_vi is the expected number of misread words. Rasch then interpreted the proportional relationship between the number of misreadings in the two tests as a corresponding relationship between the parameters of the model, i.e.

\frac{\lambda_{v1}}{\lambda_{vi}} = \frac{\lambda_{01}}{\lambda_{0i}} \iff \lambda_{vi} = \frac{\lambda_{v1}}{\lambda_{01}} \lambda_{0i} = \theta_v \delta_i    (1.2)

Thus the parameter of the model factorised into a product of two parameters, a person parameter θ_v and an item parameter δ_i. Inserting factorisation (1.2) in model (1.1), Rasch obtained the multiplicative Poisson model

P(\alpha_{vi}) = \frac{e^{-\theta_v \delta_i} (\theta_v \delta_i)^{\alpha_{vi}}}{\alpha_{vi}!}    (1.3)

The way Rasch arrived at the multiplicative Poisson model was characteristic for his methods. He used graphical methods to understand the nature of a data set and then transferred his findings to a mathematical and a statistical formulation of the model. The graphical analysis, however, was not Rasch's only reason to choose the multiplicative Poisson model. Rasch (1977) wrote:

Obviously it is not a small step from Figure 1 [our Figure 3.1] to the Poisson distribution (1.1) with the parameter decomposition (1.2). I readily admit that I introduced this model with some mathematical hindsight: I realized that if the model thus defined was proven adequate, the statistical analysis of the experimental data and thus the assessment of the reading progress of the weak readers, would rest on a solid – and furthermore mathematically rather elegant – foundation. Fortunately the experimental result turned out to correspond satisfactorily to the model which became known as the multiplicative Poisson model (p63).

Rasch later developed the "elegant foundation" of the multiplicative Poisson model into a concept, though in the beginning of the 1950s Rasch merely used it as a tool to estimate the ability of the slow readers by a method he called bridge-building. The point in using bridge-building is that one can estimate the attainment of the individual regardless of which particular item the individual has been tested with. Bridge-building can be exemplified by the multiplicative Poisson model as follows: Rasch writes that the main point of bridge-building is that it should be possible to assign to each item a degree of difficulty that is independent of the persons the item has been applied to (Rasch, 1960, pp20-22). This is possible in the multiplicative Poisson model, because the distribution of a person's responses to two different items, conditioning on the sum of his responses, only depends on the item parameters:

P(\alpha_{vi}, \alpha_{vj} \mid \alpha_{vi} + \alpha_{vj}; \theta_v, \delta_i, \delta_j) = g(\delta_i, \delta_j)

The person parameter, θ_v, is thus eliminated. Having estimated the item parameters in a distribution only depending on the item parameters, this estimate, \hat{\delta}_i, may be inserted in the distribution (1.3), giving

P(\alpha_{vi}) = \frac{e^{-\theta_v \hat{\delta}_i} (\theta_v \hat{\delta}_i)^{\alpha_{vi}}}{\alpha_{vi}!}    (1.4)

which only depends on the person parameter. Hence it is possible to estimate the parameter θ_v of the individual person even if only one item has been responded to.
This is done by using a person’s frequency of misreadings as an estimate of i and solving the equation (1.4) with regard to θv . The way Rasch solved the problem of parameter separation for the slow readers was not the method he used later. But it represents the first trace of the idea of separating the estimation of item parameters from the estimation of person parameters. In comparison to traditional analysis techniques, the Rasch model can be used (i) to analyse and improve a test instrument; and (ii) to generate linear (interval strength) learner scores, thus meeting the assumptions of parametric statistical tests such as t-tests and ANOVA (Birnbaum, 1968). Rasch analysis has been the method of choice for moderate size data sets since 1965. Now the theoretical advantages and directly meaningful results of Rasch analysis can be easily obtained for large data sets, as follows: ● Scores and analyses dichotomous items, or sets of items with the same or different rating scale, partial credit, rank or count structures for up to 254 ordered categories per structure, with useful estimation of perfect scores. 94 ● Missing responses or non-administered items are no problem. ● Analyse several partially linked forms in one analysis. ● Analyse responses from computer-adaptive tests. ● Item reports and graphical output include calibrations, standard errors, fit statistics, detailed reports of the particular improbable person responses which cause item misfit, distracter counts, and complete DOS files for additional analysis of item statistics. ● Person reports and graphical output include measures, standard errors, fit statistics, detailed reports of the particular improbable item responses which cause person misfit, a table of measures for all possible complete scores, and complete DOS files for additional analysis of person statistics ● Rating scale, partial credit, rank and count structures reported numerically and graphically. ● Complete output files of observations, residuals and their errors for additional analyses of differential item function and other residual analyses. ● Observations listed in conjoint estimate order to display extent of stochastic Guttman order. The Guttman scale (also called ‘scalogram’) is a data matrix where the items are ranked from easy to difficult and the persons likewise are ranked from lowest achiever on the test to highest achiever on the test. ● Option to pre-set and/or delete some or all person measures and/or item calibrations for anchoring, equating and banking, and also to pre-set rating scale step calibrations (Rasch, 1980). The advantages of the Rasch model above other statistical procedures, used as the quantitative research methodology in this study, will be clarified further in section 3.4.1.4. 95 3.4.1.2 Latent trait One of the basic assumptions of the Rasch model is that a relatively stable latent trait underlies test results (Boone & Rogan, 2005). For this reason, the model is also sometimes called the ‘latent trait model’. Latent trait models focus on the interaction of a person with an item, rather than upon total test score (Wright & Stone, 1979). They use total test scores, but the mathematical model commences with a modelling of a person’s response to an item. They are concerned with how likely a person v of an ability βv on the ‘latent trait’ is to answer correctly, or partially correctly, an item i of difficulty δ i . 
The latent trait or theoretical construct of concern to the tester is an underlying, unobservable characteristic of an individual which cannot be directly measured, but which will explain scores attained on a specific test pertaining to that attribute (Andrich & Marais, 2006). For instance, in this study, the latent trait is the mathematical performance of first year tertiary students. When items are conceived of as located, according to difficulty level, along a latent trait, the number of items a person answers correctly can vary according to the difficulties of the particular items included in the test. The relationship between person ability and total score is not linear. The non-linearity in this relationship means that test scores are not on an interval scale unless the items are evenly spaced in terms of difficulty. With a test designed according to the strategy of traditional test theory this would be unlikely to be the case, because of the tendency to pick items clustered in the middle difficulty with only a few out towards the 0.8 and 0.2 levels of difficulty. In latent trait models, the construct or latent trait is conceived as a single dimension along which items can be located in terms of their difficulty (δ_i) and persons can be located in terms of their ability (β_v).

If the person's ability β_v is above the item's difficulty δ_i, we would expect the probability of the person responding correctly to item i to be greater than 0.5, i.e.

if (β_v − δ_i) > 0, then P{χ_vi = 1} > 0.5

If the person's ability is below the item's difficulty, we would expect the probability of a correct response to be less than 0.5, i.e.

if (β_v − δ_i) < 0, then P{χ_vi = 1} < 0.5

In the intermediate case where the person's ability and the item's difficulty are at the same point on the scale, the probability of a successful response would be 0.5, i.e.

if (β_v − δ_i) = 0, then P{χ_vi = 1} = 0.5

Figure 3.2 illustrates how differences between person ability and item difficulty ought to affect the probability of a correct response.

Figure 3.2: How differences between person ability and item difficulty ought to affect the probability of a correct response.
[The figure depicts three cases on the latent trait: (1) when β_v > δ_i, (β_v − δ_i) > 0 and P{χ_vi = 1} > 1/2; (2) when β_v < δ_i, (β_v − δ_i) < 0 and P{χ_vi = 1} < 1/2; (3) when β_v = δ_i, (β_v − δ_i) = 0 and P{χ_vi = 1} = 1/2.] (Source: Andrich & Marais (2006), Lecture 5, p60).

The curve in Figure 3.3 summarises the implications of Figure 3.2 for all reasonable relationships between probabilities of correct responses and differences between person ability and item difficulty. This curve specifies the conditions a response model must fulfill. The difference (β_v − δ_i) could arise in 2 ways. It could arise from a variety of person abilities reacting to a single item, or it could arise from a variety of item difficulties testing the ability of one person.

When the curve is drawn with ability β as its variable so that it describes an item i, it is called an item characteristic curve, because it shows the way the item elicits responses from persons of every ability.

Figure 3.3: The item characteristic curve.
[The ogive plots the probability of a correct response, P{χ_vi = 1 | β_v, δ_i} = f(β_v − δ_i), from 0.0 to 1.0 against the relative position of β_v and δ_i, (β_v − δ_i), passing through 0.5 where β_v = δ_i.] (Source: Andrich & Marais (2006), Lecture 5, p65).
In Figure 3.3, if we thought of the horizontal axis as the latent trait, the item characteristic curve would show the probability of persons of varying abilities responding correctly to a particular item. The point on the latent trait at which this probability is 0.50 would be the point at which the item should be located. In order to construct a workable mathematical formula for the item characteristic curve in Figure 3.3, we begin by combining the parameters, β_v for person ability and δ_i for item difficulty, through their difference (β_v − δ_i). We want this difference to govern the probability of what is supposed to happen when person v uses their ability β_v against the difficulty δ_i of item i. But the difference (β_v − δ_i) can vary from minus infinity to plus infinity, while the probability of a successful response must remain between zero and one. That is

0 ≤ P{χ_vi = 1} ≤ 1    (1)

−∞ ≤ β_v − δ_i ≤ +∞    (2)

If we use the difference between ability and difficulty as an exponent of the base e, the expression will have the limits of zero and infinity. That is

0 ≤ e^(β_v − δ_i) ≤ +∞    (3)

With a further adjustment we can obtain an expression which has the limits zero and one and therefore could perhaps be a formula for the probability of a correct response. The expression and its limits are:

0 \leq \frac{e^{(\beta_v - \delta_i)}}{1 + e^{(\beta_v - \delta_i)}} \leq 1    (4)

If we take this formula to be an estimate of the probability of a correct response for person v on item i, the relationship can be written as:

P\{\chi_{vi} = 1 \mid \beta_v, \delta_i\} = \frac{e^{(\beta_v - \delta_i)}}{1 + e^{(\beta_v - \delta_i)}}    (5)

The left hand side of (5) represents the probability of person v being correct on item i (or of the response of person v to item i being scored 1), given the person's ability β_v and the item's difficulty δ_i. The function (5) which gives us the probability of a correct response is a simple logistic function. It provides a simple, useful response model that makes both linearity of scale and generality of measure possible. It is the formula Rasch chose when he developed the latent trait test theory. Rasch calls the special characteristic of the simple logistic function which makes generality in measurement possible specific objectivity (Rasch, 1960). He and others have shown that there is no alternative mathematical formula for the ogive curve in Figure 3.3 that allows estimation of the person measures β_v and the item calibrations δ_i independently of one another (Andersen, 1973, 1977; Birnbaum, 1968; Rasch, 1960, 1980).
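Although the derivation above does not spell this out, it may help later to note that (5) can be rearranged into log-odds form, which is the form in which the polytomous models of section 3.4.1.3 are stated. Since 1 − P{χ_vi = 1} = 1/(1 + e^(β_v − δ_i)), dividing and taking logarithms gives

\ln\left(\frac{P\{\chi_{vi} = 1\}}{1 - P\{\chi_{vi} = 1\}}\right) = \beta_v - \delta_i

so that the difference between a person's ability and an item's difficulty is exactly the log-odds of a correct response.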
3.4.1.3 Family of Rasch models

The responses of individual persons to individual items provide the raw data. Through the application of the Rasch model, raw scores undergo logarithmic transformations that render an interval scale where the intervals are equal, expressed in log-odds units, or logits (Linacre, 1994). The Rasch model takes the raw data and makes from them item calibrations and person measures resulting in the following:
● valid items which can be demonstrated to define a variable
● valid response patterns which can be used to locate persons on the variable
● test-free measures that can be used to characterise persons in a general way
● linear measures that can be used to study growth and to compare groups (Bond & Fox, 2007).

Through the years the Rasch model has been developed to include a family of models, not only addressing dichotomies, but also inter alia rating scale and partial credit models.

1. Dichotomous Rasch model

The dichotomous Rasch model applies to items where a correct response is awarded a score of 1 and an incorrect response a score of 0. An example would be in the case of a multiple choice item (PRQ), where a person v provides an answer to an item i and attains a score of χ_vi, with the person's ability β_v and the item difficulty level of δ_i. Formula (5) in a simpler form is used for the dichotomous Rasch model:

P_{vi} = \frac{e^{(\beta_v - \delta_i)}}{1 + e^{(\beta_v - \delta_i)}}

As discussed before, this formula is a simple logistic function and the units are called 'logits'. For example, if a person v with an ability of β_v = 5 interacts with an item i of difficulty δ_i = 2, the probability of the person answering the item correctly will be:

P\{\chi_{vi} = 1 \mid \beta_v, \delta_i\} = \frac{e^{(5-2)}}{1 + e^{(5-2)}} = \frac{e^{3}}{1 + e^{3}} = \frac{20.086}{21.086} = 0.95

Table 3.2 gives more examples of the probabilities generated from differences between ability and difficulty.

Table 3.2: Probabilities of correct responses for persons on items of different relative difficulties.

β_v − δ_i    Probability
    3          0.95
    2          0.88
    1          0.73
    0          0.50
   −1          0.27
   −2          0.12
   −3          0.05

The explanation of the dichotomous Rasch model is based on Andrich and Marais (2006). One can generate many more probabilities from such differences and then represent the resulting function graphically. This graph is also known as the item characteristic curve. Figure 3.4 displays the function of the dichotomous Rasch model graphically.

Figure 3.4: Item characteristic curve of the dichotomous Rasch model.
[The curve plots the conditional probability of a correct response (0.0 to 1.0) against ability relative to item difficulty (−5.0 to 5.0), rising from the region where β_v < δ_i, through 0.5 at β_v = δ_i, towards 1 where β_v > δ_i.]

The item characteristic curve provides the opportunity to directly establish the probability of a person of ability β_v answering an item of difficulty δ_i correctly. For example, if in Figure 3.4 a person with ability β_v = 0.0 interacts with an item of difficulty δ_i = 0.0, the probability is 50% that the answer will be correct (see dotted line on graph).
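As an illustration (not part of the original analysis), the worked example and the probabilities in Table 3.2 can be reproduced directly from the dichotomous model:

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return math.exp(ability - difficulty) / (1 + math.exp(ability - difficulty))

# Worked example above: beta_v = 5, delta_i = 2 gives approximately 0.95.
print(round(rasch_probability(5, 2), 2))  # 0.95

# Reproduce Table 3.2: probabilities for differences from 3 down to -3 logits.
for diff in range(3, -4, -1):
    print(f"{diff:>3}  {rasch_probability(diff, 0):.2f}")
```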
2. Polytomous Rasch models

The Greek meaning of the word 'polytomous' is literally 'many cuts', and the term is used to indicate the rating scale and partial credit models in Rasch.

Rasch-Andrich rating scale model

Andrich (as cited in Linacre, 2007, p7), in a conceptual breakthrough, comprehended that a rating scale, for example a Likert-type scale, could be considered as a series of Rasch dichotomies. Linacre (2007) makes the point that, similar to the original dichotomous Rasch model, a person's ability or attitude is represented by β_v, whereas δ_i is the item difficulty or the 'difficulty to endorse'. The difficulty or endorsability value is the 'balance point' of the item according to Bond and Fox (2007, p8), and is situated at the point where the probability of observing the highest category is equal to the probability of observing the lowest category (Linacre, 2007). In the Rasch-Andrich rating scale, a Rasch-Andrich threshold, F_x, is also located on the latent variable. This 'threshold' or 'step' is, according to Linacre (2005), the point on the latent variable (relative to the item difficulty) where the probability of being observed in category x equals the probability of being observed in the previous category x − 1. A threshold, in other words, is the transition between two categories. Wright and Mok (in Smith & Smith, 2004) are of the opinion that if Likert scale items have the same response categories, it is quite reasonable to assume that the thresholds would be the same for all items.

According to Linacre (2005), the Rasch-Andrich rating scale model specifies the probability, P_vix, that person v of ability β_v is observed in category x of a rating scale applied to item i with difficulty level δ_i, as opposed to the probability P_vi(x−1) of being observed in category x − 1. In a Likert scale, x could represent 'Strongly Agree' and x − 1 would then be the previous category 'Agree'. Mathematically the function is depicted as follows:

\ln\left(\frac{P_{vix}}{P_{vi(x-1)}}\right) = \beta_v - \delta_i - F_x

In this research study, the categories for the Rasch-Andrich rating scale were:
1: Complete guess
2: Partial guess
3: Almost certain
4: Certain

A high raw score on an item would indicate a lot of confidence. When this figure is transformed to a log odds or logit, as is done in the Rasch model, a low Rasch measure of endorsability is obtained. According to Planinic and Boone (2006), it is better to invert the scale for easier interpretation, since a high logit would then correspond to high confidence. This is the strategy adopted in this study.

Partial credit model

The partial credit model applies for instance to achievement items where marks are allocated for partially correct answers or where a sequence of tasks has to be completed. Essentially, the partial credit model is the same as the rating scale model, with the only difference being that in the partial credit model each item has its own threshold parameters. The threshold parameter, F_x, in the partial credit model becomes F_ix and mathematically the Rasch-Andrich rating scale model changes to:

\ln\left(\frac{P_{vix}}{P_{vi(x-1)}}\right) = \beta_v - \delta_i - F_{ix}

These models will be re-visited in Chapter 6 in the data analysis methodology, to show how they were applied in this study.

3.4.1.4 Traditional test theory versus Rasch latent trait theory

In both traditional test theory and in the Rasch latent trait theory, total scores play a special role. In traditional test theory, test scores are test-bound and test scores do not mark locations on their variable in a linear way. In traditional test theory, the observed measure used for a person's performance would be the total score on the test. A higher total score on the test would be taken to reflect a higher level of understanding than would a lower total score on the test. The advice about item difficulties which develops from a traditional theory framework is that all items should be at a difficulty level of 0.5. Just how difficult an item needs to be for it to have a difficulty of 0.5 depends on how able the persons are who will take it. How able the persons are is, in turn, judged from their performance on a set of items. There is no way within traditional test theory of breaking out of this reciprocal relationship other than through the performance of some carefully sampled normative reference group. The performance of individuals on subsequent uses of the test can be judged against the spread of performances in the normative group. The Rasch model focuses on the interaction of a person with an item rather than upon the total test score. Total test scores are used, but the model commences with a modelling of a person's response to an item. The total score emerges as the key statistic with information about the ability β_v. A feature of traditional test theory is that its various properties depend on the distribution of the abilities of the persons. Many of the statistics depend on the assumption that the true scores of people are normally distributed (Andrich, 1988).
An important advantage of the Rasch latent trait model is that no assumptions need to be made about this distribution, and indeed, the distribution of abilities may be studied empirically. It was for this reason that the Rasch model was chosen above other traditional statistical procedures for the quantitative research methodology of this study. If we intend to use test results to study growth and to compare groups, then we must make use of the Rasch model for making measures from test scores that marks locations along the variable in an equal interval or linear way. A variable on an ordinal measurement scale would have the characteristics of classification into different distinct and ordered categories in terms of a certain 106 attribute on the one hand. On the other hand these categories can possess more of that attribute in an ascending fashion (Huysamen, 1983). Although scores on such a variable could be added and subtracted, careful consideration must be given to the meaning of the total scores. If careful thought is given to raw scores, it becomes evident that they also only act as a device to order persons in ascending or descending order, because there is no evidence that the difference (or distance) between two points, for instance on the lower part of the scale would be exactly the same as the difference between two points higher up on the scale. In other words, a person scoring 60 on a test has double the marks that a person scoring only 30 on the same test has, but it does not necessarily mean that the one has double the attribute that the other person has. The question arises if raw scores per se can be realistically viewed as measures. Wright and Linacre (1989, p56) state ‘a measure is a number with which arithmetic (and linear statistics) can be done, …yet with results that maintain their numerical meaning’. Measurement on an interval scale on the other hand, would be able to provide a distinction between more or less of an attribute, but also provide for equal distances or differences between two points on the scale. A zero point on this scale does not indicate a total absence of an attribute (Glass & Stanley, 1970). Bond and Fox (2007) argue strongly for the same rigour in measurement in the physical sciences to be applied in the field of psychology. This proposed rigour in measurement should be extended also to the field of education in South Africa. The Rasch model provides an avenue to attain this goal. 3.4.1.5 Reliability and validity Reliability and validity are approached differently in traditional test theory from the way they are approached in latent trait theory. The process of mapping the amount of a trait on a line necessarily involves numbers. The use of numbers in this way gives precision to certain kinds of work. However, there is always a 107 trade-off in the use of such numbers – in particular, they can be readily over interpreted because they appear to be so precise, hence affecting the reliability of the data. In addition, the instrument may not measure what we really want to measure and this affects the validity of the research. In the latent trait model, the use of a total score from a set of items implies an assumption of a single, unidimensional underlying trait which the items, and therefore the test, measure. Those reliability indices which reflect internal consistency provide a direct indication of whether a clear single dimension is present. 
If the reliability is low, there may be only a single dimension but one measured by items with considerable error. Alternatively, there may be other dimensions which the items tap to varying degrees. The calculation of a reliability index is not very common in latent trait theory. However, it is possible to calculate such an index, and in a simple way, once the ability estimates and the standard errors of the persons are known. Instead of using the raw scores for the reliability index formula, the ability estimates are used, where the ability estimate \hat{\beta}_v for each person v can be expressed as the sum of the true latent ability and the error ε_v, i.e.

\hat{\beta}_v = \beta_v + \varepsilon_v

The key feature of reliability in traditional test theory is that it indicates the degree to which there is systematic variance among the persons relative to the error variance, i.e. it is the ratio of the estimated true variance relative to the true variance plus the error variance. In traditional test theory, the reliability index gives the impression that it is a property of the test, when it is actually a property of the persons as identified by the test. The same test administered to people of the same class or population but with a smaller true variance would be shown to have a lower reliability.

Having the facility to capture the most well known and commonly used discrimination index of traditional test theory, to provide evidence of the degree of conformity of a set of responses to a Guttman or 'scalogram' scale in a probabilistic sense, and to provide these from a latent trait formulation, indicates that Rasch's simple logistic model provides an extremely economical and reliable perspective from which to evaluate test data (Andrich, 1982).
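To make the preceding point concrete, the following sketch shows one common way such an index can be computed from person ability estimates and their standard errors, treating the observed variance of the estimates, less the mean error variance, as the estimated true variance. The numbers are hypothetical and the function is illustrative rather than the exact procedure used in this study.

```python
from statistics import pvariance, mean

def person_reliability(ability_estimates, standard_errors):
    """Reliability as estimated true variance over observed variance.

    The observed variance of the ability estimates is treated as true
    variance plus error variance, so the true variance is estimated by
    subtracting the mean squared standard error.
    """
    observed_var = pvariance(ability_estimates)
    error_var = mean(se ** 2 for se in standard_errors)
    return max(observed_var - error_var, 0.0) / observed_var

# Hypothetical logit measures and standard errors for five persons.
betas = [-1.2, -0.3, 0.4, 1.1, 2.0]
ses = [0.45, 0.40, 0.38, 0.41, 0.50]
print(round(person_reliability(betas, ses), 2))
```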
3.4.2 Quantitative data collection

As discussed in Chapter 1, this study is set within the context of the Mathematics 1 Major course at the University of the Witwatersrand. In Chapter 1, I indicated that the course has a mixed and heterogeneous student population, with students coming from both the economically and culturally advanced sector of the population (for example, both parents may be university graduates) as well as from the economically and culturally disadvantaged sector (for example, one or more parents may be illiterate or innumerate). In the years of this study, July 2004 to July 2006, student numbers registering for MATH109 were high, with 483 in 2004, 414 in 2005 and 376 in 2006. The reduction in numbers in 2006 coincided with the increase in the entrance requirements of the Faculty of Science at the University of the Witwatersrand. In each of these years, the students were allocated, subject to timetable constraints, to one of two parallel courses presented by different lecturers. The lectures took place six times a week (45 minutes per lecture) in a large lecture theatre. MATH109 consists of a Calculus and an Algebra component. In Semester 1, Algebra constituted one-third and Calculus two-thirds of each assessment task, corresponding to the same ratio of lectures. In Semester 2, Algebra and Calculus were weighted equally, with students receiving 3 lectures of Algebra and 3 lectures of Calculus per week. I lectured one set of Calculus and one set of Algebra classes while my colleagues lectured the other parallel courses. All the students from the MATH109 classes constituted the group from which data was collected for this study. As course co-ordinator for the duration of the study, I had more contact with these students than my colleagues. I was personally involved, either as examiner or as moderator, in all the tests and projects which contributed to the assessment programme. I was also directly responsible for the invigilation duties of this group and hence administered all the tests at which the data was collected. The collection of data for this study was directly related to the Mathematics I Major assessment programme as illustrated in Figure 3.5.

Figure 3.5: Mathematics 1 Major (MATH109) assessment programme.
[The figure contrasts the two strands of the assessment programme.
Diagnostic and Formative (Continuous): to get more information about the progress of learning and teaching; from known to unknown; from synthesis to consolidation; from corrective feedback to reinforcement. Method of assessment: student's portfolio (2 MCQ tutorial tests; poster; groupwork tutorial tasks; 2 semester assignments: Calculus/Algebra; self-study tasks; 3 class tests (1 hr) in March/May/August; 1 mid-year test (1.5 hrs) in June), contributing 50% - 60% of the overall grade.
Summative: aimed at the results of the whole teaching process. Method of assessment: final exam (3 hrs) in November, contributing 40% - 50% of the overall grade.]

Test instruments

Data was collected from the 2 MCQ tutorial tests, the 3 class tests (CRQs and PRQs) (1 hour) in March/May/August, the mid-year test (CRQs and PRQs) (1.5 hrs) in June and the final examination (CRQs and PRQs) (3 hrs) in November, in each of the years 2004, 2005 and 2006 respectively.

Tutorial tests

Two tutorial MCQ tests were written during the course of the year, in March and August respectively. Each test, of duration 20 minutes, consisted of 8 multiple-choice questions (total = 16 marks), 4 MCQs on Algebra content and 4 MCQs on Calculus content. Each of these MCQs was followed by a confidence of response question in which a student was asked to indicate their confidence about the correctness of their answer, where A implies no knowledge (complete guess), B a partial guess, C almost certain and D indicates complete confidence or certainty in the knowledge of the principles and laws required to arrive at the selected answer. Each of the MCQs had 3 distracters and 1 key, indicated by the letters A, B, C, or D.

Sample MCQ calculus question

If f is continuous and \int_0^4 f(x)\,dx = 10, find \int_0^2 f(2x)\,dx.
A. 5
B. 10
C. 15
D. 20

A COMPLETE GUESS    B PARTIAL GUESS    C ALMOST CERTAIN    D CERTAIN
(Adapted from MATH109 Tutorial Test, August 2005)

Tutorial tests were written during the last 20 minutes of one of the 45-minute compulsory tutorial periods, in the first semester and the second semester. The tests were administered by the tutor, who handed out the question papers together with a blank computer card. The instruction to each student was to shade the correct answers on the computer card to Questions 1-8 in the first column. In these questions there was only one possible answer. There was no negative marking. In addition, the students had to shade their confidence of response answers on the computer card corresponding to Questions 1-8 in the second column, i.e. Questions [26] – [33]. Students were reminded that there is no correct answer in the confidence of responses. Students were also informed that marks were not awarded for the confidence of response answers, as these were purely for educational research purposes. Once the tests had been written, the tutor collected both the question paper and the computer cards. The question papers were kept for reference only, should any queries arise, and were not returned to the students.
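As an illustration of how the shaded confidence of response answers feed into the analysis, the following sketch recodes the A-D letters to the 1-4 categories of the Rasch-Andrich rating scale described in section 3.4.1.3 and computes a mean confidence per question. The response data below are hypothetical.

```python
# Recode A-D confidence letters to the 1-4 rating scale categories
# (1: complete guess, 2: partial guess, 3: almost certain, 4: certain).
CONFIDENCE_CODES = {"A": 1, "B": 2, "C": 3, "D": 4}

# One row per student: the confidence letter shaded for each of the 8 MCQs.
responses = [
    ["D", "C", "B", "A", "C", "D", "B", "C"],
    ["C", "C", "A", "B", "D", "D", "A", "B"],
    ["B", "D", "C", "C", "C", "B", "A", "D"],
]

for item in range(8):
    codes = [CONFIDENCE_CODES[row[item]] for row in responses]
    print(f"Question {item + 1}: mean confidence = {sum(codes) / len(codes):.2f}")
```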
The computer cards were marked by the Computer and Networking Services (CNS) division of the University of the Witwatersrand. On completion, CNS provided a print out of the quantitative statistical analysis of data, including the performance index, discrimination index and easiness factor per question. CNS also captured the students' confidence of responses.

Class tests and examinations

Three 1-hour class tests were written during the year in March, May and August. A 1.5 hour mid-year test was written in June and the final 3-hour examination took place in November. The final examination constituted 40% - 50% of the overall assessment grade. Each of these tests and exams followed the same format, with Section A following the PRQ format, in particular MCQs; Sections B and C followed the CRQ format, with Section B testing the Algebra component of the course and Section C testing the Calculus component of the course. In 2005, confidence of response questions were not included in Section B and Section C. This data was only collected for the MCQs in Section A. From 2006 onwards, the confidence of response questions were included in all 3 sections, for both the CRQ and PRQ formats. In the CRQ sections, a confidence of response question followed each subquestion of the main question.

Sample CRQ question:

Question 4.
a. Give the condition that is required to ensure continuity of a function f(x) at the point x = α.
   A COMPLETE GUESS    B PARTIAL GUESS    C ALMOST CERTAIN    D CERTAIN
b. Let ⌊x⌋ be the greatest integer less than or equal to x.
   (i) Show that lim_{x→2} f(x) exists if f(x) = ⌊x⌋ + ⌊−x⌋.
   A COMPLETE GUESS    B PARTIAL GUESS    C ALMOST CERTAIN    D CERTAIN
   (ii) Is f(x) = ⌊x⌋ + ⌊−x⌋ continuous at x = 2? Give reasons.
   A COMPLETE GUESS    B PARTIAL GUESS    C ALMOST CERTAIN    D CERTAIN
(Adapted from MATH109, Calculus, March 2006, Section C)

For Section A, students were provided with blank computer cards to indicate their choice of answers and the corresponding confidence of responses. As in the tutorial tests, students were informed that no marks were awarded for the confidence of responses. In Sections B and C, students were provided with space on the question papers to complete their solutions. The computer cards were used only to indicate the corresponding confidence of responses. On completion of the tests, all three sections, together with the filled in computer card, were collected. CNS provided a print out of all the results for Section A, together with confidence of responses for Sections A, B and C.

Expert opinions

In this study, the term expert refers to content experts. In this case the content experts were my colleagues who taught the MATH109 course, either Algebra or Calculus or both, as well as my supervisors from the University of Pretoria who were familiar with the content. In total, the opinions of eight experts on the level of difficulty of the questions were obtained, independent of each other. Five of the experts gave their opinions on Calculus, and six of the experts gave their opinions on Algebra.
Each expert was given a full set of the following tests: MATH109 August Tutorial Test (2005); March Tutorial Test 1A (2006); March Tutorial Test 1B (2006); March Section A (2005); May Section A (2005); June Section A (2005); August Section A (2005); November Section A (2005); March Section A (2006); May Section A (2006); June Section A (2006); March Sections B & C (2005); May Sections B & C (2005); June Sections B & C (2005); August Sections B & C (2005); November Sections B & C (2005); March Sections B & C (2006); May Sections B & C (2006) and June Sections B & C (2006). The reader is to note that the August Tutorial Test was the same in both 2005 and 2006. Also the March Tutorial Test 1A which was written during a tutorial period on a Tuesday and March Tutorial Test 1B written during a tutorial period on the Wednesday of the same week, although testing the same content, were different. These tests were the same for 2005 and 2006. The experts chose to give their opinions on either the Calculus or Algebra questions, depending on which courses they taught. Hence for Calculus, Section C was appropriate and for Algebra, Section B was appropriate. In the MCQ Section A, there was a mixture of both Calculus and Algebra questions. Experts were asked for their opinions on the level of difficulty of both the PRQs and CRQs, and were asked to indicate their opinions as follows: ● Use a 1 if your opinion is that the students should find the question easy ● Use a 2 if your opinion is that the question is of average difficulty ● Use a 3 if your opinion is that the students would find the question difficult or challenging. Experts were informed that their opinions were completely independent of how the students performed in the questions. Experts worked independently and did 114 not collaborate with other experts. In the study, the students’ performance is referred to as novice performance. Once all the expert opinions were collected, the data was captured separately for Calculus and Algebra on spreadsheets. An expert opinion on the level of difficulty of each question (PRQs and CRQs) was calculated as the average of the eight expert opinions per question. 3.5 RELIABILITY, VALIDITY, BIAS AND RESEARCH ETHICS 3.5.1 Reliability of the study Reliability is the extent to which independent researchers could discover the same phenomena and to which there is agreement on the description of the phenomena between the researcher and participants (Schumacher & McMillan, 1993). As this study consisted of both a qualitative and quantitative component, it is necessary to examine both the constraints on qualitative and quantitative reliability. According to Schumacher and McMillan (1993), reliability in quantitative research refers to the consistency of the test instrument and test administration in the study. Reliability in qualitative research refers to the consistency of the researcher’s interactive style, data recording, data analysis and interpretation of participant meanings from the data. Schumacher and McMillan (1993) have suggested the following reliability threats to research. These are: ● the researcher’s role ● the informant selection of the sample ● the social context in which data is collected ● the data collection strategies ● the data analysis strategies ● the analytical premises i.e. the initial theoretical framework of the study. 
115 In this study reliability was enhanced by means of the following: ● The importance of my social relationship with the students in my role as the co-ordinator and lecturer of the Mathematics 1 Major Course was carefully described. ● The selection of the population sample of this study and the decision process used in their selection was described in detail. ● The social context influencing the data collection was described physically, socially, interpersonally and functionally. Physical descriptions of the students, the time and the place of the assessment tasks, as well as of the interviews, assisted in data analysis. ● All data collection techniques were described. The interview method, how data was recorded and under what circumstances was noted. ● Data analysis strategies were identified. ● The theoretical framework which informs this study and from which findings from prior research could be integrated was made explicit. ● Stability was achieved by administering the same tutorial tests in March and August over the period 2004-2006. ● Equivalence was achieved over the period of study, by administering different tests to the same group of students. ● Internal consistency was achieved by correlating the items in each test to each other. ● A large number of data items were collected over the period of 2 years, and were all used in the data analysis. 3.5.2 Validity of the study In the context of research design, the term validity means the degree to which scientific explanations of phenomena match the realities of the world (Schumacher & McMillan, 1993). Test validity is the extent to which inferences made on the basis of numerical scores are appropriate, meaningful and useful. Validity, in other words, is a situation-specific concept. Validity is assessed 116 depending on the purpose, population and environmental characteristics in which measurement take place. In quantitative research there are two type of design validity. Internal validity expresses the extent to which extraneous variables have been controlled or accounted for. External validity refers to the generalisability of the results i.e. the extent to which the results and conclusion can be generalised to other people and settings. In this study, internal validity was addressed as the population sample of first year mainstream mathematics students were always fully informed and aware that their confidence of responses, in both the CRQs and PRQs, were not for assessment purposes, but used purely for this research study. All students wrote the same test on the same day in a single venue. All the data collected was used, irrespective of whether the students completed all of the confidence of responses, or not. According to Messick (1989), validity is articulated in terms of the following four ideas: content validity, concurrent validity, predictive validity and construct validity. ● Content validity would be established by experts judging whether the content was relevant ● Concurrent validity would be established by showing that the results on a particular test were related in the expected way with results on other relevant tests ● Predictive validity would be established by relating the results of a test with performance in the future on the same trait ● Construct validity would be established by demonstrating that the test was related to performances on other tests that were theoretically related. 
Andrich and Marais (2006) point out that it is now considered standard that construct validity is the overarching concept, and that the other three so called forms of validity are pieces of evidence for construct validity. Construct validation is addressed to the identification of the dimension in a substantive 117 sense. The test developer must have a clear idea of what the dimension is when the items are written. In order to enhance the validity of this study, the following steps were taken: ● The literature was examined in order to identify and develop the seven mathematical assessment components. ● The test instrument was validated after implementation by a panel consisting of my 2 supervisors at the University of Pretoria and 6 mathematics lecturers from the University of the Witwatersrand. ● The questions used for data collection were all moderated by colleagues and were in line with the theoretical framework. Minor adjustments were made to a number of test items to avoid ambiguity and to strengthen weak distracters. ● Expert opinions obtained from colleagues were completely independent of student performance (novice performance). ● Three measuring criteria were identified in order to develop a model for addressing the research questions. These criteria were modified and adapted in collaboration with my supervisors to address the issue of what constitutes a good mathematical question and how to measure how good a mathematics question is. ● All marking of PRQs was done by computers using the Augmented marking scheme. This programme accommodates the fact that not all questions are equally weighted. There was no negative marking. ● Marking of CRQs was done by the MATH109 team of lecturers, using a detailed marking memorandum which had been discussed prior to each marking session. In addition, all marking was moderated by the researcher, except for the examinations which were moderated by an external examiner. 3.5.3 Bias of the study Bias is defined by Gall, Gall and Borg (2003) as a set to perceive events in such a way that certain types of facts are habitually overlooked, distorted or falsified. 118 In this study, an attempt was made to decrease bias by the following: ● A representative sample of undergraduate students studying tertiary mathematics ● A comprehensive literature review ● Verified statistical methods and findings. 3.5.4 Ethics Ethics generally are considered to deal with beliefs about what is right or wrong, proper or improper, good or bad (Schumacher & McMillan, 1993). Most relevant for educational research is the set of ethical principles published by the American Psychological Association in 1963. The principles of most concern to educators are as follows: ● The primary investigator of a study is responsible for the ethical standards to which the study adheres. ● The investigator should inform the subjects of all aspects of the research that might influence willingness to participate. ● The investigator should be as open and honest with the subjects as possible. ● Subjects must be protected from physical and mental discomfort, harm and danger. ● The investigator should secure informed consent from the subjects before they participate in the research. In view of these principles, I took the following steps: ● Permission to conduct research in the first year Mathematics I Major course was sought and granted by the Registrar of the University of the Witwatersrand. 
Permission was granted on the understanding that information furnished to me by the University of the Witwatersrand may not be used in a manner that would bring the University in disrepute. I further agreed that my research may be used by the University if it is so desired (Declaration letter can be found in the Appendix A1, p265). 119 ● In the interview, all respondents were assured of confidentiality. Respondents were informed that they had been randomly selected, based on their June class record marks. Permission was obtained from each candidate to tape-record the interviews. Candidates were informed that they were free to withdraw from the interview or not to answer any question, if they wished. Candidates were assured of the confidentiality and anonymity of their responses and, in particular, that the information they provided for the research would not be divulged to the University or their lecturers at any time. ● The researcher assured all participants that all data collected from the confidence of responses would not affect their overall marks. No person, except the researcher, supervisors and the data analyst, would be able to access the raw data. All raw data was used, irrespective of whether the student indicated a confidence of response or not. ● The research report will be made available to the University of the Witwatersrand and to the University of Pretoria, should they so desire it. ● Informed consent was achieved by providing the subjects with an explanation of the research and an opportunity to terminate their participation at any time with no penalty. Since test data was collected over the research period to chart performance trends, the research was quite unobtrusive and had no risks to the subjects. The students were at no times inconvenienced in the data collection process, as all data was collected during the test times as set out in the assessment schedule for MATH109. ● In the data analysis, student names and student numbers were not used. Thus, confidentiality was ensured by making certain that the data cannot be linked to individual subjects by name. This was achieved by using the Rasch model. ● In my role as researcher, I will make every effort to communicate the results of my study so that misunderstanding and misuses of the research is minimised. ● To maximise both internal and external validity, research has shown it seems best if the subjects are unaware that they are being studied 120 (Schumacher & McMillan, 1993). In this regard, the research methodology was designed in order to collect data from the students during their normal tutorial times or formal test times. As a result, students did not feel threatened in any way and the resulting data was sufficiently objective. ● The methodology section of my study shows how the data was collected in sufficient detail to allow other researchers to extend the study. ● In my roles as co-ordinator, lecturer and researcher, I was very aware of ethical responsibilities that accompanied the gathering and reporting of data. The aims, objectives and methods of my research were described to all participants in this research study. 121 CHAPTER 4: QUALITATIVE INVESTIGATION In this chapter I address the third research subquestion: What are student preferences regarding different assessment formats? 
4.1 QUALITATIVE DATA ANALYSIS According to Schumacher and McMillan (1993), qualitative data analysis is primarily an inductive process of organising the data into categories and identifying patterns (relationships) among the categories. Unlike quantitative procedures, most categories and patterns emerge from the data, rather than being imposed on the data prior to data collection. 4.2 QUALITATIVE INVESTIGATION In the qualitative component of my research study, I relied upon the qualitative method of interviewing. The format of the interview was described in section 3.3.1. In qualitative research, the role of the researcher in the study should be identified and the researcher should provide clear explanations to the participants. As researcher and interviewer, I investigated what the interviewees experienced being exposed to alternative assessment formats in their undergraduate studies and how they interpreted these experiences. The interview questions were presented in section 3.3.1. In this section, I present the data that was gathered, in the form of interviews and an analysis of the data. The qualitative data findings are presented as a narration of the interviewees’ responses. The data is used to illustrate and substantiate the third research subquestion of this research study related to student preferences i.e. What are student preferences regarding different assessment formats? Analysis is often intermixed with presentation of the data, which are usually quotes by the interviewees. 122 The issues discussed in this section focus on how a group of first year tertiary students, registered for the Mathematics I Major course at the University of the Witwatersrand, view the different assessment formats, both PRQ and CRQ, that they have been exposed to in their assessment programme. Relevant quotes from each interview were selected and will be discussed to highlight the most important beliefs, attitudes and inner experiences that this group of students had concerning the different assessment formats in their assessment programme. ● In favour of alternate assessment formats The interviewee was a Chinese female student with an October class record of 70%. The following extract from her interview illustrates that this student enjoyed both the PRQ and CRQ formats of assessment. Interviewer: You saw that a percentage of your tests was multiple choice and a percentage was always long questions and your tutorial tests were only multiple choice. Did you like those different formats? Candidate: Ja, I did, ‘cos multiple choice gives you an option of , y’know, the right answer’s there somewhere so it kind of relieves you a bit and then you balance it off with a nice, um, long question so it’s not... you aren’t just depending on your luck but you’re also applying your knowledge and I think that’s.. that’s cool. This candidate was an average to high achieving student with a good work ethic. She attended all her classes and tutorials and often came for additional assistance. She had a positive attitude towards the different assessment formats, explaining that she liked both PRQs and CRQs as ‘they balanced each other off’. She felt secure with both formats since in the MCQs she knew that one of the options provided was the correct answer, and the CRQs provided the opportunity to apply her knowledge which she felt very comfortable with. ● MCQs test a higher conceptual level The interviewee was a black male student with an October class record of 81%. 
The following extract from his interview illustrates the student’s perceptions of the different learning approaches he believed to have used for PRQs and CRQs. 123 Interviewer: Do you feel that the mark you got for the MCQ section is representative of your knowledge? Candidate: (Laughs) Well, it depends, I mean, if I got a low mark then it means that I don’t understand anything and it’s not exactly like that. So, I wouldn’t say it represents my knowledge or anything like that. Interviewer: So what does it represent? Candidate: (Laughs) Well, it simply means that maybe I didn’t understand all the concepts very very well. I’m not digging deep into the concept, I’m just doing it on the surface, that’s all. Interviewer: I see and is that what multiple choice probes? Candidate: I think so. Interviewer: Deeper? Candidate: Ja, ja. It requires a lot of knowledge because some questions are very short and we take the long way trying to do it and we run out of time. So you really need to understand what you are doing in multiple choice. This candidate was a high achieving student who performed consistently well throughout the MATH109 course. He was of the opinion that MCQs are not fully representative of his mathematical knowledge as he approaches MCQs on the surface, rather than adopting a deeper learning approach towards MCQs. However, he does admit that some MCQs do test a higher conceptual level of understanding and for such MCQs, one requires a good mathematical knowledge. He also mentions the problem that MCQs testing higher cognitive skills are time consuming, and if you do not have a good understanding of the concept you could ‘run out of time’. ● CRQs provide for partial credit The interviewee was a coloured female student with an October class record of 81%. The following extract from her interview illustrates that this student prefers CRQs to PRQs because of the factor of partial credit. Interviewer: Which type of question do you prefer? 124 Candidate: Um.. overall, I have to say traditional because in a way if you are doing an MCQ question and you get an answer and it doesn’t appear there, you like sort of... your heart sinks, you know, it’s like oh my word, what have I done wrong? But um... you know, also in traditional… ja, you can’t be right… you don’t know if you’re completely wrong or if you’re right and you know that at least you’ll get some marks along the way for doing what you could. So… but, overall, I do prefer the traditional questions because, ja, you can freestyle. (Laughs). This candidate was a high achiever and an independent student. Earlier on in the interview she had stated that she liked both assessment formats because: it’s good that we get asked different ways because it shows that we really understand and we know how to apply. It’s not just doing it like out of routine. When I probed her about the assessment format she preferred, she chose the CRQ format for the reason that if your answer to an MCQ was incorrect no marks were awarded, but even if your answer to a CRQ was incorrect, you could get partial marks for method. She also mentioned that since there was no negative marking in the MCQs, she always felt encouraged to answer these, even if at her first attempt her answer did not correspond to any of the provided options. ● Confidence plays an important role in assessment The interviewee was a white female student with an October class record of 58%. 
The following extract from her interview illustrates that this student had little confidence in her performance in the mathematics tests and examinations, both PRQ and CRQ. Interviewer: Do you have confidence in answering questions in maths tests which are different to the traditional types of questions? Candidate: Fluctuated. Bit of a roller coaster. Interviewer: Can you explain what you mean? 125 Candidate: It’s got a lot to do with mental blocks as well. I prepared a lot more for the June test and my head was more around it. Mark really helped me. I was sort of in the Resource Centre lots and he really helped me get my head around it. This candidate was an average ability student, struggling to cope with the pressures of her first year studies, as well as getting used to residence life away from her family. This candidate’s performance in the two types of assessment was very erratic. In the April test, she scored poorly in the MCQs, in the June test she scored higher in the MCQs than in the CRQs and in the September test she again scored poorly in MCQs. She justified this fluctuation due to her having ‘mental blocks’ about the MCQs which she appeared to have little confidence in. She did admit that her performance was also strongly linked to the amount of preparation before each test. For the June test, she received a lot of extra assistance from the tutor in the Mathematics Resource Centre which not only helped her to gain a greater understanding of the content material, but also improved her confidence. It was pointed out that none of the students had been exposed to the PRQ format in their secondary school education, and so this assessment format was totally unfamiliar to them. The students thus lacked the confidence which they had gained with the CRQ assessment format in their secondary education, in which the predominant assessment format in the mathematics tests and examinations was the traditional, long open-ended question. The candidate was of the opinion that she would have performed better in the MCQs if she had had more exposure to this format, thereby increasing her confidence in this assessment format. Another interesting quote from the candidate, linked to confidence, was the fact that she regarded the MCQs as more challenging than the CRQs. Interviewer: In your school background were you exposed to different types of questions in Mathematics? Candidate: We were, um, not as like... not such a broad spectrum but we were. We didn’t really do MCQ as such in Maths but um... I 126 think it… ja… the MCQs are definitely challenging because, I don’t know, in most subjects they are, you know, like… Interviewer: What makes them challenging? Candidate: I actually… it’s weird because whenever you write a test and then people are like “Is it MCQ or long questions?” If you say it’s long questions people are like phew… you know... Interviewer: Okay. Candidate: With MCQ it’s like, “Oh my word!” because I think also, besides the fact that you’re limited to one choice out of four, five, um… in long questions you can express yourself more because it’s not like this or that, you know, there is some inbetween. ● MCQs require good reading and comprehension skills The interviewee was a coloured male student with an October class record of 59%. The following extract from his interview illustrates his opinion on the importance of visual (graphical) PRQs and CRQs. Interviewer: How would you ask questions in Maths tests if you were responsible for the course? 
Candidate: Well, the way it’s been done is great, I think, um, because it’s not… it’s not the old boring do the sum, do that sum, there’s a whole lot of variations within the course which is great and it shouldn’t be boring… Interviewer: Okay. Candidate: …but it… I think this is good. Interviewer: Are there any other types of questions you could recommend that could be incorporated into Maths? Candidate: Um, no. Well, maybe reading of graphs. Interviewer: Okay. Candidate: And finding the intercepts and the… say if this is increasing or decreasing and… Interviewer: More graph interpretation questions? Candidate: Yes. 127 This candidate was an average performing student who showed a very positive attitude towards the variety of assessment formats in the mathematics course. Earlier on in the interview he expressed his beliefs why he did not seem to perform well in the MCQ assessment format. He felt that it was due to the phrasing of the questions. So this student linked his poor performance to his reading and comprehension inabilities. He recommended that more visual (graphical) items should be included in the different assessment formats. He was of the opinion that such types of questions did not rely on reading and comprehension skills as much as the more theoretical questions. Interviewer: When you looked at the multiple choice questions, what was it about them that you think made you perform badly? Candidate: I think it was just the phrasing in different ways ‘cos you phrased the question differently to what we expected. You didn’t expect to… to see that type of question, but it was tricky. ● PRQ format lends itself to guessing and cheating The interviewee was a black male student with an October class record of 43%. The following extract from his interview illustrates the student’s opinion about the guessing factor involved in MCQs. Interviewer: Which types of questions do you prefer in Maths? Candidate: Uh, I like long questions. Ja, I like long questions very much. I don’t like MCQs. Interviewer: Why? Candidate: Uh, MCQs… what can I say about them? Ja, sometimes they are like deceiving ‘cos maybe when you want to work out… work out the solution then you say, “Ah, I can’t do this thing,” you just maybe choose an answer randomly, but on long questions you… you are trying to make sure that, at least, you get a solution, you see, so that’s why I don’t like MCQs ‘cos somewhere we are not working as students. You just say, “Oh, I don’t get it,” then I tick A, but on long questions you are trying by all means to get that six marks or five marks. Interviewer: Oh, so it’s guessing? 128 Candidate: Ja! Ja, guessing, guessing. This candidate was a low achieving student who was not in favour of the alternate assessment formats. He believed that his poor performance was linked to the inclusion of the PRQ format in the mathematics tests and examinations. He went on to explain that he preferred the traditional long CRQs to the MCQs as he considered MCQs as questions that promote guessing. He believed that if you did not have any options to choose from, you would be more careful in your working out of the solution. He expressed the opinion that ‘we are not working as students’ with MCQs, because if he cannot arrive at one of the solutions in the options, he simply guesses the answer, whereas with the CRQs, he would try to achieve the allocated marks by ‘trying all means’ at finding the solution. He did not consider guessing as a fair method of arriving at a solution. 
In fact, later in the interview, he hinted that he thought CRQs were more reliable, as it was more difficult to cheat with CRQs than with MCQs.

Candidate: …another point because MCQs, there's.. there's a great possibility of cheating.
Interviewer: Okay.
Candidate: 'Cos if you can't get something you just look to the person next to you. Oh, you just copy.

● Alternate formats add depth to assessment

The interviewee was an Indian female student with an October class record of 68%. The following extract from her interview illustrates the student's opinion about the proportion of PRQs and CRQs that should be included in mathematics tests and examinations.

Interviewer: What percentage of questions should be MCQ and what percentage should be long questions?
Candidate: I think about seventy percent should be MCQ and the rest should be long questions because it's... sometimes it's harder to understand than MCQ questioning despite understanding the knowledge, you know, understanding the maths and the theory that you get 'cos it's very tricky sometimes. But I think it separates like your A's from your B's, you know, your like seventy-fives from your sixties. It's a good way to see what type of student you are.

This candidate was an average performing student who confessed that in mathematics the MCQ format had actually raised her marks. She explained that with MCQs, 'there's a whole technique to be learnt', and she felt confident that she had mastered this technique. She expressed the opinion that a greater percentage of MCQs should be included in mathematics tests and examinations as she believed that this type of assessment format separated the distinction 'A' candidates from the good 'B' candidates. So in her opinion, the performance of the students in the MCQs was a good measuring stick of their overall mathematical ability.

● Diagnostic purpose

The interviewee was an Indian male student with an October class record of 75%. The extract from his interview illustrates this candidate's opinion on how MCQs could be used for diagnostic purposes.

Interviewer: Do you like the different formats of assessment in your maths tests?
Candidate: Um, no, it's okay, but… Ja I think that… no, the papers have been up to standard so far. I don't think there really is a problem, especially like, um, the MCQs I felt really like gives you… it really tests your understanding of how to, you know, of all your calculations and stuff. I don't really think there's a problem with the way we've been tested so far.
Interviewer: Which type of questions do you prefer, MCQs or traditional long questions?
Candidate: Well, personally, I don't like the MCQs because sometimes you think you've got the right answer but, you know, you might have made a mistake somewhere in your calculations. You saw it or your right answer there then… but I think that the MCQs are probably designed that way. Like you would have probably picked up what kind of mistakes we would have made so… so I think, ja, there should be a variety of different questions.

This candidate was amongst the top achieving students in the class. He liked the challenging questions and expressed the opinion that these could be of the PRQ or CRQ format. For this candidate it was not about the format of the question, but rather the cognitive level of skills required to answer the question. He felt that the MCQs had the diagnostic purpose of really testing understanding of knowledge and of methods of solving.
With MCQs, an incorrect distracter chosen by the student is often a good indicator of the ‘kind of mistakes we would have made’ in the CRQs, thus identifying any misconceptions that the student might have. This candidate felt that a variety of different questions was necessary to diagnose common errors. ● Distracters can cause confusion The candidate was a white male student, with an October class record of 37%. In the extract, the student expresses the frustrations he experienced with MCQs if two of the distracters were very similar to each other. Interviewer: Which type of questions do you prefer in Maths? Candidate: I feel more confident with the long questions than short questions, ja, than multiple choice ‘cos multiple choice… two answers can be really close and you think about what you could have done wrong or what could be…if it is actually right then keep on going over it and over it and then you end up choosing one and end up being wrong. This candidate was a poorly performing student, who admitted earlier in the interview that he had not been taking his studies seriously. He had not been attending classes regularly and had not studied for his tests. He did not have any preference for the type of assessment format, although he did feel more confident with the CRQ format. His lack of confidence in the MCQs was linked to the fact that often the distracters were very similar to each other and he found 131 it difficult to make the correct choice. He did not have enough confidence to trust his calculation of the correct answer, and when faced with the situation of two answers very close in value or nature to each other, he doubted his calculation. This lack of confidence was also evident in his performance in the CRQ format. In summary, a qualitative analysis of these interviews appears to indicate that there were two distinct camps; those in favour of PRQs and those in favour of CRQs. Those in favour of PRQs expressed their opinion that this assessment format did promote a higher conceptual level of understanding; greater accuracy; required good reading and comprehension skills and was very successful for diagnostic purposes. Those against PRQs were of the opinion that they encouraged guessing; gave no credit for incorrect responses; that students lacked confidence in this format linked to the choice of distracters and that PRQs promoted a surface learning approach. Those in favour of CRQs were of the opinion that this assessment format promoted a deeper learning approach to mathematics; required good reading and comprehension skills; partial marks could be awarded for method and students felt more confident with this more traditional approach. Those against CRQs generally felt that they were time consuming; did not provide any choice of distracters as a guide to a method of solution and that their poor performance in this assessment format was linked to their reading, comprehension and problem-solving inabilities. From the students’ responses, it seems as if the weaker students prefer CRQs. These students expressed a lack of confidence in PRQs, with one of the interviewees justifying her lack of confidence in this assessment format as a ‘mental block’. The weaker students seemed to perform better in CRQ assessment format, thus resulting in a greater confidence in this format. The attitudes of weaker students to the PRQ format illustrate the important role that confidence plays in assessment. 
Weaker ability students also felt threatened by the fact that if their answer to an MCQ was incorrect, no marks were awarded, 132 whereas with CRQs, partial marks were awarded even if the answer was incorrect. Weaker students often lack the necessary reading and comprehension skills required to answer MCQs successfully. One of the weaker students opposing MCQs felt that the PRQ format lends itself to ‘guessing and cheating’. The weaker ability students also expressed their frustration with MCQs if two or more of the distracters were very similar to each other. They felt that distracters can cause confusion, and this in turn would affect their performance. The results from the qualitative investigation highlighted the most important beliefs, attitudes and inner experiences that this group of students of various mathematical abilities had concerning the PRQ and CRQ assessment formats in their mathematics assessment programme. These results address the research subquestion regarding the student preferences with respect to the different assessment formats. 133 CHAPTER 5: THEORETICAL FRAMEWORK In this chapter, I identify an assessment taxonomy consisting of seven mathematics assessment components, based on the literature. I attempt to develop a theoretical framework with respect to the mathematics assessment components and with respect to three measuring criteria: discrimination index, confidence index and expert opinion. The theoretical framework forms the foundation against which I construct the proposed model for measuring how good a mathematics question is. In this way, the first two research subquestions are addressed: ● How do we measure the quality of a good mathematics question? and ; ● Which of the mathematics assessment components can be successfully assessed using the PRQ assessment format and which of the mathematics assessment components can be successfully assessed using the CRQ assessment format? I also elaborate on the parameters used in my research study for judging a test item. Finally, I describe the model developed for my research for measuring a good question. In Section 5.1, I wish to elaborate on the proposed mathematics assessment components which were originally identified in this study from the literature. I also identify and discuss question examples, both PRQs and CRQs, within each mathematics assessment component. In Section 5.2, I elaborate on the parameters I have identified for judging a test item. In Section 5.3, I develop a model for measuring how good a mathematics question is that will be used both to quantify and visualise the quality of a mathematics question. 134 5.1 MATHEMATICS ASSESSMENT COMPONENTS Based on the literature reviewed on assessment taxonomies in Section 2.4 and adapting Niss’s assessment model for mathematics (Niss, 1993) reviewed in Section 2.3, I propose an assessment taxonomy pertinent to mathematics. This taxonomy consists of a set of seven items, hereafter referred to as the mathematics assessment components. In this research study, I investigated which of the assessment components can be successfully assessed in the PRQ format, and which can be better assessed in the CRQ format. To assist with this process, I used the proposed hierarchical taxonomy of seven mathematics assessment components, ordered by the cognitive level, as well as the nature of the mathematical tasks associated with each component. This mathematics assessment component taxonomy is particularly useful for structuring assessment tasks in the mathematical context. 
The proposed set of seven mathematics assessment components are summarised below: (1) Technical (2) Disciplinary (3) Conceptual (4) Logical (5) Modelling (6) Problem solving (7) Consolidation Corresponding to Niss’s assessment model (Niss, 1993) reviewed in Section 2.3, in this proposed set of seven mathematics assessment components, questions involving manipulation and calculation would be regarded as technical. Those that rely on memory and recall of knowledge and facts would fall under the disciplinary component. Assessment components (1) and (2) include questions based on mathematical facts and standard methods and techniques. The conceptual component (3) involves comprehension skills with algebraic, verbal, numerical and visual (graphical) questions linked to standard applications. The assessment components (4), (5) and (6) correspond to the 135 logical ordering of proofs, modelling with translating words into mathematical symbols and problem solving involving word problems and finding mathematical methods to come to the solution. Assessment component (7), consolidation, includes the processes of synthesis (bringing together of different topics in a single question), analysis (breaking up of a question into different topics) and evaluation requiring exploration and the generation of hypothesis. Comparing with Bloom’s taxonomy (Bloom, 1956), reviewed in Section 2.4, components (1) and (2) would correspond to Bloom’s level 1: Knowledge. This lower-order cognitive level involves knowledge questions, requiring recall of facts, observations or definitions. In assessment tasks at this level, students are required to demonstrate that they know particular information. Components (3) and (4) correspond to Bloom’s level 2: Comprehension and level 3: Application. These middle-order cognitive levels involve comprehension and application type questions which call on the learner to demonstrate that she/he comprehends and can apply existing knowledge to a new context or to show that she/he understands relationships between various ideas. Mathematics assessment components (5), (6) and (7) all correspond to Bloom’s highest cognitive levels: level 4: Analysis; level 5: Synthesis and level 6: Evaluation. These levels involve tasks requiring higher-order skills such as analysing, synthesising and evaluating. At this cognitive level, the learner is required to go beyond what she/he knows, predict events and create or attach values to ideas. Problem solving might be required here where the learner is required to make use of principles, skills or his/her own creativity to generate ideas. A modification of Bloom’s taxonomy, adapted for assessment, called the MATH taxonomy (Smith et al., 1996) was discussed in Section 2.4 in the literature review. The MATH taxonomy has eight categories, falling into three main groups. Group A tasks include those tasks which require the skills of factual knowledge, comprehension and routine use of procedures. In the proposed mathematics assessment component taxonomy, assessment components (1) and (2) -Technical and Disciplinary, would correspond to these Group A tasks. In the MATH taxonomy Group B tasks, students are required to apply their 136 learning to new situations, or to present information in a new or different way. Such tasks require the skills of information transfer and applications in new situations, and would correspond to assessment components (3) - Conceptual and (4) - Logical. 
The third group in the MATH taxonomy, Group C, encompasses the skills of justification, interpretation and evaluation. Such skills would relate to the mathematics assessment components (5) Modelling, (6) Problem solving and (7) Consolidation. One of the main differences between Bloom's taxonomy and the MATH taxonomy is that the MATH taxonomy is context specific and is used to classify tasks ordered by the nature of the activity required to complete each task successfully, rather than in terms of difficulty. Using Bloom's taxonomy and the MATH taxonomy, the proposed mathematics assessment components can be classified according to the cognitive level of difficulty of the tasks as shown in Table 5.1.

Table 5.1: Mathematics assessment component taxonomy and cognitive level of difficulty.

Mathematics assessment components     Cognitive level of difficulty
1. Technical                          Lower order / Group A
2. Disciplinary                       Lower order / Group A
3. Conceptual                         Middle order / Group B
4. Logical                            Middle order / Group B
5. Modelling                          Higher order / Group C
6. Problem solving                    Higher order / Group C
7. Consolidation                      Higher order / Group C

Table 5.2 summarises the proposed mathematics assessment components and the corresponding cognitive skills required within each component. These skills were identified by the researcher, based on the literature review, as being the necessary cognitive skills required by students to complete the mathematical tasks within each mathematics assessment component.

Table 5.2: Mathematics assessment component taxonomy and cognitive skills.

Mathematics assessment components     Cognitive skills
1. Technical                          ● Manipulation ● Calculation
2. Disciplinary                       ● Recall (memory) ● Knowledge (facts)
3. Conceptual                         Comprehension: ● algebraic ● verbal ● numerical ● visual (graphical)
4. Logical                            ● Ordering ● Proofs
5. Modelling                          Translating words into mathematical symbols
6. Problem solving                    Identifying and applying a mathematical method to arrive at a solution
7. Consolidation                      ● Analysis ● Synthesis ● Evaluation

5.1.1 Question examples in assessment components

In the following discussion, one question within each mathematics assessment component has been identified according to Table 5.2, from the MATH109 tests and examinations. The classification of the question according to one of the assessment components was validated by a team of lecturers (experts) involved in teaching the first year Mathematics Major course at the University of the Witwatersrand. In addition, the examiner of each test or examination was asked to analyse the question paper by indicating which assessment component best represented each question. In this way, the examiner could also verify that there was a sufficient spread of questions across assessment components, and in particular, that there was not an over-emphasis on questions in the technical and disciplinary components. This exercise of indicating the assessment component next to each question also assisted the moderator and external examiner to check that the range of questions included all seven mathematics assessment components, from those tasks requiring lower-order cognitive skills to those requiring higher-order cognitive skills.

Assessment Component 1: Technical

If z = 3 + 2i and w = 1 − 4i, then, in real-imaginary form, z/w equals:
A. −5/17 + 14i/17
B. 5/15 − 14i/15
C. 3 − 4i
D. 11/17 + 14i/17

MATH109 August 2005, Tutorial Test, Question 5.
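For illustration only (this working is an editorial addition and does not form part of the original test or its marking memorandum), the quotient can be simplified by multiplying the numerator and denominator by the complex conjugate of w:

\[
\frac{z}{w} = \frac{3+2i}{1-4i} = \frac{(3+2i)(1+4i)}{(1-4i)(1+4i)} = \frac{3+12i+2i+8i^{2}}{1+16} = \frac{-5+14i}{17} = -\frac{5}{17} + \frac{14}{17}i,
\]

which corresponds to option A.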
In this technical question, students are required to manipulate the quotient of the complex numbers z and w, by multiplying the numerator and denominator by the complex conjugate of w, and then to calculate and simplify the resulting quotient by rewriting it in the real-imaginary form a + bi.

Assessment Component 2: Disciplinary

If f(x) = (sin x)/x, x ≠ 0, which of the following is true?
A. f is not a function.
B. f is an even function.
C. f is a one-to-one function.
D. f is an odd function.

MATH109 March 2005, Tutorial Test A, Question 1.

In this disciplinary question, students have to recall the definitions and properties of a function, an even function, a one-to-one function and an odd function, in order to decide which one of the given statements correctly describes the given function f(x). Such a question requires the cognitive skill of memorising facts and then remembering this knowledge when choosing the best option.

In the following discussion, three question examples have been chosen to illustrate three of the comprehension type cognitive skills: verbal, numerical and visual (graphical), that are required by students to complete the tasks within the conceptual mathematics assessment component.

Assessment Component 3: Conceptual

State why the Mean Value Theorem does not apply to the function f(x) = 2/(x + 1)² on the interval [−3, 0].
A. f(−3) ≠ f(0)
B. f is not continuous
C. f is not continuous at x = −3 and x = 0
D. Both A and B
E. None of the above

MATH109 June 2006, Section A: MCQ, Question 7.

In the above conceptual question, the student is required to apply his/her knowledge of the Mean Value theorem to a new, unfamiliar situation which requires that the student selects the best verbal reason why the Mean Value theorem does not apply to the function f(x) and the interval given in the question. This question requires a comprehension of all the hypotheses of the Mean Value theorem and tests the students' understanding of a situation where one of the hypotheses of the theorem fails.

Assessment Component 3: Conceptual

lim_{x→∞} (1 + 2/x)^x =
A. 2
B. e²
C. ∞
D. 1
E. Does not exist

MATH109 November 2005, Section A: MCQ, Question 2.

In the conceptual question above, the student is required to apply his/her knowledge of the definition of Euler's number e, which is defined in lectures as:

lim_{x→∞} (1 + 1/x)^x = e

They need to make a conjecture and extrapolate from this definition to choose the best numerical option for lim_{x→∞} (1 + 2/x)^x. This result had not been discussed in class, and hence is not a familiar result to the students.

Assessment Component 3: Conceptual

Determine from the graph of y = f(x) whether f possesses extrema on the interval [a, b].

[Graph of a function f on the interval [a, b]]

A. Maximum at x = a; minimum at x = b.
B. Maximum at x = b; minimum at x = a.
C. No extrema.
D. No maximum; minimum at x = a.

MATH109 May 2006, Section A: MCQ, Question 1.

In this graphical conceptual question, students are required to apply their knowledge of the Extreme Value theorem and the definition of relative extrema on an interval I. There is no algebraic calculation necessary of the values of the extrema on the closed interval [a, b]. The Extreme Value theorem is an existence theorem because it tells of the existence of minimum and maximum values, but does not show how to find these values. Students need to examine the graph of the given function f and consider how f behaves at the end points as well as how the continuity (or lack of it) has affected the existence of extrema on the given interval.
The choice of the correct option is assisted by having a visual figure when the decision is made.

Assessment Component 4: Logical (PRQ)

Decide whether Rolle's theorem can be applied to f(x) = x² + 3x on the interval [0, 2]. If Rolle's theorem can be applied, find the value(s) of c in the interval such that f′(c) = 0. If Rolle's theorem cannot be applied, state why.
A. Rolle's theorem can be applied; c = −3/2
B. Rolle's theorem can be applied; c = 0, c = 3
C. Rolle's theorem does not apply because f(0) ≠ f(2)
D. Rolle's theorem does not apply because f(x) is not continuous on [0, 2]

MATH109 May 2006, Section A: MCQ, Question 5.

This logical PRQ firstly requires the student to recall the conditions of Rolle's theorem to decide whether Rolle's theorem can be applied to the given function. Such a decision requires the conceptual skill of ordering the conditions stated in the proof of Rolle's theorem, and checking that the three conditions of (i) continuity on [0, 2], (ii) differentiability on (0, 2) and (iii) f(0) = f(2) are met. Once the decision is made, the student can proceed to the second part of the question, which requires the student to find the value(s) of c in (0, 2) such that f′(c) = 0. The logical ordering of the conditions of Rolle's theorem leads to the student realising that since the last condition is not met, i.e. f(0) ≠ f(2), Rolle's theorem does not apply.

A further example within the logical assessment component has been provided below, this example being a constructed response question appearing in MATH109 June 2006, Section C: Calculus.

Assessment Component 4: Logical (CRQ)

(a) In the proof of the following theorem, the order of the statements is incorrect. Give a correct proof of the theorem by reordering the statements. You need only list the statement numbers in their correct order.

Theorem: If a function f is continuous on the closed interval [a, b] and F is an antiderivative of f on the interval [a, b], then ∫_a^b f(x)dx = F(b) − F(a).

(1) Since F is the antiderivative of f, F′(c_i) = f(c_i)
(2) ∴ f(c_i) = [F(x_i) − F(x_{i−1})] / Δx_i
(3) ∴ Σ_{i=1}^{n} f(c_i)Δx_i = Σ_{i=1}^{n} [F(x_i) − F(x_{i−1})] = F(b) − F(a)
(4) By the Mean Value theorem, there exists c_i ∈ (x_{i−1}, x_i) such that F′(c_i) = [F(x_i) − F(x_{i−1})] / (x_i − x_{i−1})
(5) Divide the closed interval [a, b] into n subintervals by the points a = x_0 < x_1 < x_2 < ... < x_{i−1} < x_i < ... < x_{n−1} < x_n = b
(6) Taking the limit as n → ∞, F(b) − F(a) = lim_{n→∞} Σ_{i=1}^{n} f(c_i)Δx_i = ∫_a^b f(x)dx
(7) F(b) − F(a) = Σ_{i=1}^{n} [F(x_i) − F(x_{i−1})]
(8) ∴ f(c_i)Δx_i = F(x_i) − F(x_{i−1})

Correct order: (Only list the statement numbers.)

(b) What is the theorem called?
(c) What kind of series is the series on the right hand side of statement (7)?

MATH109 June 2006, Section C: Calculus, Question 4.

This logical CRQ requires the students to recall the proof of the Fundamental Theorem of Calculus. Although the proof is given, the statements appear in the incorrect order. The students are required to reorder the given statements to correct the proof. Such a reordering process involves the cognitive skill of logical ordering.

Assessment Component 5: Modelling (CRQ)

Following the record number in attendance during the opening day of the Rand Easter show this year, organisers are planning a special event for the opening eve in 2007.
Murula.com will sponsor a ten-seater jumbo jet, carrying all eight members of the organisation committee, to fly in a western direction at 5000 m/minute, at an altitude of 4000 m, over the show grounds that evening. In order to ensure that all people participating in this event will be able to follow the jet from the surface at the show grounds, a special 10 000 W searchlight will be installed at the main entrance gate to keep track of the plane. The searchlight is due to be kept shining on the plane at all times.

[Sketch: compass directions (N, W, E, S); the plane flying at an altitude of 4000 m; the searchlight at ground level, its beam making an angle θ, with the horizontal distance between the searchlight and the plane labelled x m.]

What will be the rate of change of the angle of the searchlight when the jet is due east of the light at a horizontal distance of 2000 m?

MATH109 May 2006, Section C: Calculus, Question 2.

In this modelling CRQ, students are required to translate the words into mathematical symbols and to use related rates to solve the real-life problem. To solve the related-rate problem, students firstly have to identify all the given quantities as well as the quantities to be determined. A sketch has been provided which can assist students to identify and label all these quantities. Secondly, students have to write an equation involving the variables whose rates of change either are given or are to be determined. Thirdly, using the Chain Rule, both sides of the equation must be implicitly differentiated with respect to time. Finally, all known values for the variables and their rates of change must be substituted into the resulting equation, so that the required rate of change can be solved for. In modelling type questions, students have to develop a mathematical model to represent actual data. Such a procedure requires two conceptual skills: accuracy and simplicity. This means that the student's goal should be to develop a model that is simple enough to be workable, yet accurate enough to produce meaningful results.

Assessment Component 6: Problem solving (PRQ)

Which of the following is an antiderivative for f(x) = x cos x?
A. F(x) = ½x² cos x + 4
B. F(x) = ½x² sin x + 5
C. F(x) = x sin x + cos x − 1
D. F(x) = x cos x + sin x − 2
E. None of the above.

MATH109 June 2006, Section A: MCQ, Question 5.

In this problem solving MCQ, the student is required to find his or her own method to arrive at the solution. Firstly, the student has to know what the antiderivative of a function is in order to decide on a method. The solution can be arrived at either by integrating f(x) using the technique of integration by parts, since f(x) is a product of two differentiable functions, or by differentiating each function F(x) provided in the distracters, using the Product Rule, until the original function f(x) is obtained.

Assessment Component 6: Problem solving (CRQ)

This question deals with the statement P(n): n³ + (n + 1)³ + (n + 2)³ is divisible by 9, for all n ∈ ℕ, n ≥ 2.
(1.1) Show that the statement is true for n = 2.
(1.2) Use Pascal's triangle to expand and then simplify (k + 3)³.
(1.3) Hence, assuming that P(k) is true for k > 2 with k ∈ ℕ, prove that P(k + 1) is true.
(1.4) Based on the above results, justify what you can conclude about the statement P(n).

MATH109 June 2006, Section B: Algebra, Question 1.

In the problem solving CRQ, the students are required to use the principle of Mathematical Induction to prove that the statement P(n) is true for all natural numbers n ≥ 2.
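As an illustration only (this is one possible route, sketched editorially, and not the official marking memorandum), the expansion asked for in (1.2) can be used to complete the inductive step in (1.3): assuming P(k) holds,

\[
(k+1)^3 + (k+2)^3 + (k+3)^3 = \left[k^3 + (k+1)^3 + (k+2)^3\right] + \left[(k+3)^3 - k^3\right] = \left[k^3 + (k+1)^3 + (k+2)^3\right] + 9(k^2 + 3k + 3),
\]

and since both bracketed terms on the right are divisible by 9, P(k + 1) follows.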
The CRQ has been subdivided into smaller subquestions involving different cognitive skills to assist the student with the method of solving using mathematical induction. In subquestion (1.1), the students need to establish truth for n = 2 by actually testing whether the statement P(n) is true for n = 2. Hence (1.1) assesses within the technical mathematics assessment component. Subquestion (1.2) involves a numerical calculation, the result of which will be used in the proof by induction. Hence (1.2) also assesses within the technical assessment component. In subquestion (1.3), students are required to complete the proof by induction, by assuming the inductive hypothesis that P(k) is true for k > 2, k ∈ ℕ, and proving that P(k + 1) is true. Since subquestion (1.3) requires the cognitive skills of identifying and applying the principle of Mathematical Induction to arrive at a solution, (1.3) assesses within the problem solving mathematics assessment component. Subquestion (1.4) concludes the proof by requiring the students to justify that both of the conditions of the principle hold, and therefore, by the principle of induction, P(n) is true for every n ≥ 2, n ∈ ℕ. Hence (1.4), requiring no more than a simple manipulation, assesses within the technical assessment component. This problem solving CRQ illustrates that questions involving higher order cognitive skills often subsume the lower order cognitive skills.

Assessment Component 7: Consolidation (PRQ)

Let y = f(x) = cos(arcsin x). Then the range of f is
A. {y | 0 ≤ y ≤ 1}
B. {y | −1 ≤ y ≤ 1}
C. {y | −π/2 < y < π/2}
D. {y | −π/2 ≤ y ≤ π/2}
E. None of the above.

MATH109 May 2006, Section A: MCQ, Question 1.

In the assessment component of consolidation, questions require the conceptual skills of analysis and synthesis and, in certain cases, evaluation. In the MCQ under discussion, students are required to analyse the nature of the function f, being a composition of both the functions cos x and arcsin x. Within this analysis, consideration of the domain and range of each separate function has to be made. Once all the individual functions have been analysed with the restrictions on their domain and range, all this information has to be synthesised in order to make a conclusion about the resulting composite function, and the restrictions on the domain and range of the composite function. An evaluation is finally required of the correct option which best describes the restriction on the range of the composite function.

Assessment Component 7: Consolidation (CRQ)

Let ⌊x⌋ be the greatest integer less than or equal to x.
(i) Show that lim_{x→2} f(x) exists if f(x) = ⌊x⌋ + ⌊−x⌋.
(ii) Is f(x) = ⌊x⌋ + ⌊−x⌋ continuous at x = 2? Give reasons.

MATH109 March 2006, Section C: Calculus, Question 4.

In the consolidation CRQ provided, students are expected to go beyond what they know about the greatest integer function ⌊x⌋. Part (i) requires an analysis of the behaviour of the function f(x), being the sum of two greatest integer functions, as x approaches 2. In this analysis, the limit of each individual greatest integer function, ⌊x⌋ and ⌊−x⌋, needs to be investigated as x approaches 2. Synthesis is then required to complete the question, by summing the individual limits, if they exist. In part (ii), the student is required to make an evaluation, based on the results from part (i). A further condition of continuity needs to be checked, i.e.
the value of f(2), and together with the result obtained in part (i), the student can make a judgement decision about the continuity of the function at x = 2. In this question, a consolidation of the results from parts (i) and (ii) assists the student to make the overall evaluation. Such techniques of justifying, interpreting and evaluating are considered to be integral to the consolidation assessment component.

5.2 DEFINING THE PARAMETERS

In this research study, in order to define the parameters for developing a model to measure how good a mathematics question is, a few assumptions are made about mathematical questions. Firstly, we assume that the question is clear, well-written and checked for accuracy. We also assume that the question tests what it sets out to test. Issues such as ambiguity are not considered: in these respects a question is simply right or wrong, and we assume correctness. For developing a model for measuring a good question (described in section 5.3), we depart from the following four premises:
● A good question should discriminate well. In other words, high performing students should score well on this question and poor performing students are not expected to do well.
● Students' confidence when dealing with the question should correspond to the level of difficulty of the question. There is a problem with a question when it is experienced as misleadingly simple by students and subsequently leads to an incorrect response. In this case, students are overconfident and do not judge the level of difficulty of the question correctly. Similarly, there is a problem if a simple question is experienced as misleadingly difficult and students have no confidence in doing it.
● The level of difficulty of the question should be judged correctly by the lecturer. When setting a question, the lecturer judges the level of difficulty intuitively. There is a problem with the question when the lecturer over- or underestimates the level of difficulty as experienced by students.
● The level of difficulty of a question does not make it a good or poor question. Difficult questions can be good or poor, just as easy questions can be.

With these premises as background, three parameters were identified:
(i) Discrimination index
(ii) Confidence index
(iii) Expert opinion

Although only these three parameters were used to develop a model to quantify the quality of a question, a fourth parameter was used to contribute qualitatively to the characteristics of a question:
(iv) Level of difficulty

How these parameters were amalgamated to develop the model will be discussed in section 5.3. In this section we only clarify the parameters.

5.2.1 Discrimination index

The extent to which test items discriminate among students is one of the basic measures of item quality. It is useful to define an index of discrimination to measure this quality. The discrimination index (DI) is computed from equal-sized high and low scoring groups on the test (say the top and bottom 27%) as follows:

DI = (CH − CL)/N

where CH = number of students in the high group that responded correctly; CL = number of students in the low group that responded correctly; N = number of students in each of the two equal-sized groups. Using this definition, the discrimination index can vary from −1 to +1. Ideally, the DI should be close to 1. If equal numbers of 'high' and 'low' students answer correctly, the item is unsuccessful as a discriminator (DI = 0).
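To make the arithmetic of the index concrete, a small computational sketch is given below. The data, the function name and the 27% cut-off used here are purely illustrative and are not taken from the study; the divisor is taken as the size of one of the two equal-sized groups, which is what gives the index its stated range of −1 to +1.

# Illustrative sketch of DI = (CH - CL) / N for a single item,
# using the top and bottom 27% of students ranked by total test score.
def discrimination_index(item_correct, total_scores, group_fraction=0.27):
    # Rank students (by index) from highest to lowest total test score.
    ranked = sorted(range(len(total_scores)), key=lambda s: total_scores[s], reverse=True)
    n = max(1, round(group_fraction * len(ranked)))   # students in each group
    ch = sum(item_correct[s] for s in ranked[:n])     # correct responses, high group
    cl = sum(item_correct[s] for s in ranked[-n:])    # correct responses, low group
    return (ch - cl) / n

# Invented data for ten students: 1 = item answered correctly, 0 = not.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [92, 85, 80, 74, 70, 61, 55, 48, 40, 33]
print(discrimination_index(item, totals))   # approximately 0.67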
If more 'low' than 'high' students get an item correct, the DI is negative, a signal for the examiner to improve the question. For purposes of building up a test bank, a DI value of 0.3 is an acceptable lower limit. Using the 27% sample group size, values of 0.4 and above are regarded as high and less than 0.2 as low (Ebel, 1972).

The proportion of students answering an item correctly also affects its discrimination. Items answered correctly (or incorrectly) by a large proportion of students (more than 85%) have markedly reduced power to discriminate. On a good test, most items will be answered correctly by 30% to 80% of the students. A few basic rules for improving the ability of test items to discriminate follow:

1. Items that correlate less than 0.2 with the total test score should probably be restructured. Such items do not measure the same skill or ability as the test as a whole, or are confusing or misleading to students. Generally, a test is better (i.e. more reliable) the more homogeneous the items. It is generally acknowledged that well constructed mathematics tests are more homogeneous than well constructed tests in social science (Kehoe, 1995). Homogeneous tests are those intended to measure the unified content area of mathematics. A second issue involving test homogeneity is that of the precision of a student's obtained test score as an estimate of that student's "true" score on the skill tested. Precision (reliability) increases as the average item-test correlation increases.

2. Distracters for PRQs that are not chosen by any students should be replaced or eliminated. They are not contributing to the test's ability to discriminate the good students from the poor students. One should be suspicious about the correctness of any item in which a single distracter is chosen more often than all other options, including the answer, and especially so if the distracter's correlation with the total score is positive.

3. Items that virtually everyone gets right are unsuccessful for discriminating among students and should be replaced by more difficult items (Ebel, 1965).

The Rasch model specifies that item discrimination, also called the item slope, be uniform across items. Empirically, however, item discriminations vary. The software package Winsteps estimates what the item discrimination parameter would have been if it had been parameterised. During the estimation phase of Winsteps, all item discriminations are asserted to be equal, of value 1.0, and to fit the Rasch model. As empirical item discriminations are never exactly equal, Winsteps can report an estimate of those discriminations post hoc (as a type of fit statistic). The empirical discrimination is computed after first computing and anchoring the Rasch measures. In a post-hoc analysis, a discrimination parameter, a_i, is estimated for each item. The estimation model is of the form:

ln(P_vix / P_vi(x−1)) = a_i(β_v − δ_i − F_x)

where P_vix is the probability that person v of ability β_v is observed in category x of a rating scale applied to item i with difficulty level δ_i, and F_x is the Rasch-Andrich threshold.

In Winsteps, item discrimination is not a parameter; it is merely a descriptive statistic. The Winsteps reported values of item discrimination are a first approximation to the precise value of a_i. The possible range of a_i is −∞ to +∞, where +∞ corresponds to a Guttman data pattern (perfect discrimination) and −∞ to a reversed Guttman pattern.
The Guttman scale (also called a 'scalogram') is a data matrix in which the items are ranked from easy to difficult and the persons likewise are ranked from lowest achiever on the test to highest achiever on the test. Rasch estimation usually forces the average item discrimination to be near 1.0. An estimated discrimination of 1.0 accords with Rasch model expectations. Values greater than 1.0 indicate over-discrimination, and values less than 1.0 indicate under-discrimination. Over-discrimination is thought to be beneficial under classical (raw-score) test theory conventions (Linacre, 2005). In classical test theory, the ideal item acts like a switch, i.e. high performers pass, low performers fail. This is perfect discrimination, and is ideal for sample stratification. Such an item, however, provides no information about the relative performance of low performers, or the relative performance of high performers. Rasch analysis, on the other hand, requires items that provide an indication of relative performance along the latent variable, as discussed in section 3.4. It is this information which is used to construct measures. From a Rasch perspective, over-discriminating items tend to act like switches, not measuring devices. Under-discriminating items tend neither to stratify nor to provide information about the relative performance of students on those items.

A second important characteristic of a good item is that the best achieving students are more likely to get it right than are the worst achieving students. Item discrimination indicates the extent to which success on an item corresponds to success on the whole test. Since all items in a test are intended to cooperate to generate an overall test score, any item with negative or zero discrimination undermines the test. Positive item discrimination is generally productive, unless it is so high that the item merely repeats the information provided by other items on the test.

5.2.2 Confidence index

The confidence index (CI) has its origins in the social sciences, where it is used particularly in surveys and where a respondent is requested to indicate the degree of confidence he has in his own ability to select and utilise well-established knowledge, concepts or laws to arrive at an answer. In the science education literature, as well as the measurement literature (as discussed in section 2.14), a range of studies has considered aspects of student confidence and how such confidence may affect students' test performance. Students' self-reported confidence levels have also been studied in the field of educational measurement to assess over- and underconfidence bias in students' test-taking practices (Pallier, Wilkinson, Danthiir, Kleitman, Knezevic, Stankov & Roberts, 2002). In physics education research, Hasan et al. (1999) used a confidence index in conjunction with the correctness or not of a response, to distinguish between students' embedded misconceptions (wrong answer and high confidence) and lack of knowledge (wrong answer and low confidence) and to restrict guessing (Table 5.3). The CI is usually based on some scale. For example, in Hasan's (1999) study, a six-point scale (0 – 5) was used in which 0 implies no knowledge (total guess) of the methods or laws required for answering a particular question, while 5 indicates complete confidence in the knowledge of the principles and laws required to arrive at the selected answer.
When a student is asked to provide an indication of confidence along with each answer, we are in effect requesting him to provide his own assessment of the certainty he has in his selection of the laws and methods utilised to get to the answer (Webb, 1994). The decision matrix in Table 5.3 is used for identifying misconceptions in a group of students.

Table 5.3: Decision matrix for an individual student and for a given question, based on combinations of correct or wrong answers and of low or high average CI.

                  Low CI                High CI
Correct answer    Lucky guess           Sufficient knowledge (understanding of concepts)
Wrong answer      Lack of knowledge     Misconception

(Adapted from Hasan et al., 1999, p 296.)

If the degree of certainty is low, i.e. a low CI, then it suggests that guesswork played a significant part in the determination of the answer. Irrespective of whether the answer was correct or wrong, a low CI value indicates guessing, which, in turn, implies a lack of knowledge. If the CI is high, then the student has a high degree of confidence in his choice of the laws and methods used to arrive at the answer. In this situation, if the student arrived at the correct answer, it would indicate that the high degree of certainty was justified. Such a student is classified as having adequate knowledge and understanding of the concept. However, if the answer was wrong, the high certainty would indicate a misplaced confidence in his/her knowledge of the subject matter. This misplaced certainty in the applicability of certain laws and methods to a specific question is an indication of the existence of misconceptions.

Hasan et al. (1999) recommend that if the answers and related CI values indicate the presence of misconceptions, then feedback to students can be modified with the explicit intent of removing the misconceptions. Furthermore, the information obtained by utilising the CI can also be used to address other areas of instruction. In particular, it can be used:
● as a means of assessing the suitability of the emphasis placed on different sections of a course
● as a diagnostic tool, enabling the teacher to modify feedback
● as a tool for assessing progress or teaching effectiveness when both pre- and post-tests are administered
● as a tool for comparing the effectiveness of different teaching approaches, including technology-integrated approaches, in promoting understanding and problem-solving proficiency.

In a study conducted by Potgieter, Rogan and Howie (2005) on the chemical concepts inventory of Grade 12 learners and University of Pretoria Foundation Year students, the CI indicated general overconfidence of learners about the correctness of the answers provided. It also showed that the guessing factor was a less serious complication than anticipated in the analysis of multiple choice items for the prevalence of specific misconceptions. Engelbrecht, Harding and Potgieter (2005) reported that first year tertiary students are also more confident of their ability to handle conceptual problems than to handle procedural problems in mathematics. They argue that the CI cannot always be used to distinguish between a lack of knowledge (wrong answer, low CI) and a misconception (wrong answer, high CI), since students could just be overconfident, or, in procedural problems, students with high confidence may make numerical errors. The literature is divided about whether self-evaluation bias facilitates subsequent performance.
In some studies overconfidence appears to be associated with better performance (Blanton, Buunk, Gibbons & Kuyper, 1999), whereas other studies showed no long term performance advantage of overconfidence (Robins & Beer, 2001). Pressley et al. (1990) argue that the relationship between self-evaluation bias and subsequent performance depends on the motivational factors contributing to the exaggeration of confidence. 155 Exaggerated self-reports that are motivated by avoidance of self-protection are associated with poor subsequent performance, whereas exaggeration motivated by a strong achievement motivation is associated with improved future performance. Ochse (2003) differentiated between overestimators, realists and underestimators based on the projection that students in third-year psychology made of their expected subsequent performance. Ochse found that, on average, overestimators (38% of sample) expected significantly higher marks than both realists and underestimators, were significantly more confident about the accuracy of their estimations, perceived themselves to have significantly higher ability than their peers, but achieved the lowest marks of the three groups (11.5% below class average, 20.6% lower than predicted). Underestimators, on the other hand (17% of sample), achieved the highest marks of the three groups (17.5% above class average, 14.3% above prediction) despite their unfavourable perceptions of their own ability and low confidence in their projected achievements. Ochse suggested that overoptimism may reflect ignorance of required standards and may result in complacency, inappropriate preparation or carelessness. The result of such ignorance is disappointment, frustration and anger when actual performance falls far short of expectations. It should be noted that research on self-efficacy indicates a strong relationship between self-assessment and subsequent performance. Ehrlinger (2008) has pointed out that this relationship depends on the ability of respondents to control or regulate their actions in order to achieve the desired outcome. The close correlation between prediction of performance and self-efficacy also requires an accurate specification of a specific task. In this research study, the CI values per item were calculated according to a 4point Likert scale in which 1 implied a ‘complete guess’, 2 implied a ‘partial guess’, 3 for ‘almost certain’, while 4 indicated ‘certain’. In terms of the Rasch model, a Likert scale is a format for observing responses wherein the categories increase in the level of the variable they define, and this increase is uniform for 156 all agents of measurement. The polytomous Rasch-Andrich rating scale model, discussed in section 3.4.1.3, was used in the Winsteps calculation of the CI. 5.2.3 Expert opinion For purposes of this study, subject specialists were referred to as experts in terms of their mathematical knowledge of the content, as well as their experience in the methodological and pedagogical issues involved in teaching the content. Experts were asked to review test and examination items in the first-year mathematics major course and to express their opinions on the level of difficulty of these questions. The aim of this exercise was to encourage the experts to look more critically at the questions, both PRQs and CRQs, and to express their opinions on the level of difficulty of each test item, independent of the students’ performance in these items i.e. the predicted level of difficulty. 
The opinions were categorised into three main types using the following scale: 1: student should find the question easy 2: student should find the question of average difficulty but fair 3: student should find the question difficult or challenging. For the purpose of this study we consider the term expert opinion equivalent to predicted performance. While giving their opinions, experts could reflect on the learning outcomes of the course, and on the assessment components corresponding to each test item. Such reflection would assist experts to write questions that guide students towards the kinds of intellectual activities they wish to foster, and raise their awareness of the effects of the kinds of questions they ask on their students’ learning. In this context, Hubbard (2001) refers to Ausubel’s meaningful learning, Skemp’s description of relational understanding, Tall’s definition of different types of generalisation and abstraction and Dubinsky and Lewin’s reflective abstraction as all investigating in different ways, the kinds of intellectual activities which we desire our students to engage in. The experts involved in giving their opinions were not asked to familiarise themselves with 157 any of the above research papers. However, it was hoped that because they were successful mathematics thinkers themselves, the task of giving their opinions would enable them to recognise the intellectual activities required to solve different types of questions, in both the PRQ and CRQ formats. All questions for which the experts expressed their opinion, involved subject matter which was familiar and covered a wide range of teaching and learning purposes. No model examples were given to the experts so that they would not be influenced by the researcher’s views. The researcher did explain to the team of experts that their individual opinions would in no way classify questions as good or bad. This was not the intention of the task. To anticipate the problem that experts might have when trying to express their opinions on questions as being easy, average difficulty or challenging, not knowing exactly what information had been provided to students in lectures and tutorials, those involved in teaching the calculus course were asked for their expert opinions on the calculus PRQs and CRQs only, and those involved in teaching the algebra course were asked for their opinions on the algebra PRQs and CRQs only. In this way, the experts were completely familiar with the content, in particular knowing whether a question was identical or similar to one for which a specific model solution had been provided in lectures or tutorials, or whether this was not the case. The mathematical content is important because learning objectives that are not subject specific are more difficult for subject specialists to apply. One of the difficulties experienced by the experts in giving their opinions on how students experience the difficulty level of the test items, is that most experts are accustomed to thinking exclusively about the subject matter of the test item and their own view of mathematics, rather than about what might be going on in the minds of their students as they tried to answer the questions. By giving their opinions, there is an expectation that when experts set assessment tasks in the future, they will be influenced by their experiences and reflect on the purpose of their questions. The wording of the questions needs to reflect what kind of intellectual activity they intend for their students to engage in. 
In this study, a panel of eight experts was asked for their opinions. As this number was too low to apply any Rasch model, the expert opinion per item was calculated as the average of the individual expert opinions given per item. Winsteps will operate with a minimum of two observations per item or person, but for statistically stable measures to be estimated, at least 30 observations per element are needed. The sample size needed to have 99% confidence that no item calibration is more than 1 logit away from its stable value is in the range 27 < N < 61. Thus, a sample of 50 well-targeted examinees is conservative for obtaining useful, stable estimates, while 30 examinees/observations are enough for well-designed pilot studies. Hence the Rasch model was not used in the calculation of the expert opinion per item.

5.2.4 Level of difficulty

Student performance was used as an estimate of the level of difficulty of an item, which is a common practice. The level of difficulty, although not a direct indication of the quality of the question, is a useful parameter when selecting questions to assemble a well-balanced set of questions. In traditional test theory, difficulty level is defined as:

Difficulty level = number of correct responses / total number of responses.

An item that everyone gets wrong (difficulty level = 0.0) is unsuccessful. Equally unsuccessful is an item that everyone gets right (difficulty level = 1.0). In the Rasch logit-linear models, as discussed in Chapter 3, Rasch analysis produces a single difficulty estimate for each item and an ability estimate for each student. Through the application of this model, raw scores undergo logarithmic transformations that render an interval scale with equal intervals, expressed in log odds units, or logits (Linacre, 1994). A logit is the unit of measure used by Rasch for calibrating items and measuring persons. The difficulty scale runs from easy items (negative logits) to more difficult ones (positive logits).

5.3 MODEL FOR MEASURING A GOOD QUESTION

In this section a model is developed for measuring how good a mathematics question is; the model is used both to quantify and to visualise the quality of a mathematics question.

5.3.1 Measuring criteria

To address the research questions of this study, three measuring criteria, based on the parameters discussed in section 5.2, were identified. These criteria form the foundation of the theoretical framework developed for the purpose of this study, and were used to diagnose the quality of a test item:
(1) Point measure as a discrimination index.
(2) Confidence deviation: the deviation between the expected students' confidence level and the actual student confidence for the particular item.
(3) Expert opinion deviation: the deviation between the expected student performance according to experts and the actual student performance.

(1) Point measure as a discrimination index

According to the literature (Wright, 1992), there are numerous ways of conceptualising and mathematically reporting discrimination. The point measure and the Rasch discrimination index are two of them. In classical test theory, the point biserial correlation is the Pearson correlation between responses to a particular item and scores on the total test.
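Before turning to the Rasch point measure, the two classical quantities just mentioned can be illustrated directly. The sketch below computes the traditional difficulty level and the point biserial correlation for one item from a 0/1 response matrix. It is a minimal sketch using the standard definitions, with illustrative data and function names; in the study itself these statistics were taken from the CNS printouts and from Winsteps rather than computed this way.

```python
import numpy as np

def difficulty_level(item_scores: np.ndarray) -> float:
    """Traditional difficulty level: proportion of correct responses (0.0 to 1.0)."""
    return float(item_scores.mean())

def point_biserial(item_scores: np.ndarray, total_scores: np.ndarray) -> float:
    """Pearson correlation between the 0/1 item scores and the total test scores."""
    return float(np.corrcoef(item_scores, total_scores)[0, 1])

# Toy data: rows are students, columns are items (1 = correct, 0 = wrong).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])
totals = responses.sum(axis=1)

print(difficulty_level(responses[:, 0]))          # 0.8 -> an easy item
print(point_biserial(responses[:, 0], totals))    # positive -> item agrees with the test
```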
In the Rasch model, the point measure correlation is a more general indication of the relationship between the performance on a specific item and the total test score, and is computed in the same way as the point biserial, except that Rasch measures replace total scores. It was therefore decided to use the point measure as the measure of discrimination, rather than the Rasch discrimination index. The point measure (r_pm) is a number between 0 and 1.

In order to assign the same measuring scale to all three criteria, the discrimination was adapted by subtracting the point measure value (r_pm) from 1 (the perfect correlation):

Adapted discrimination = 1 − r_pm, where 0 ≤ r_pm ≤ 1.

The discrimination was adapted in this way so that the amount of departure of the point measure values from the perfect correlation value of 1 could be investigated. Thus, in this model, the closer the adapted discrimination is to 0, the better the correlation.

(2) Confidence deviation

In this study, the CI values per item were calculated according to a 4-point Likert scale as discussed in section 5.2.2:
1 : complete guess
2 : partial guess
3 : almost certain
4 : certain

To measure the confidence deviation, the confidence measure (averaged over the students) for each item was plotted against the corresponding item difficulty. A best fit regression line was fitted to the points, as shown in Figure 5.1.

Figure 5.1: Illustration of confidence deviation from the best fit line between item difficulty and confidence. [The figure plots confidence (y-axis) against item difficulty (x-axis); for item i at (x_i, y_i), the deviation Δ = y_i − ŷ_i is the vertical distance to the best fit line y = f(x).]

For any given item difficulty, the amount of deviation between the actual confidence measure and the confidence value predicted by the best fit line is measured by the vertical distance y_i − ŷ_i, where y_i is the observed confidence value and ŷ_i is the predicted confidence value from the best fit line for item i. Small confidence deviation measures (close to 0) represent a small deviation of the confidence index from the value expected for that item difficulty. Ideally an item should lie on this regression line and should have a confidence deviation of 0. An item that lies far away from the line indicates that students were either over confident or under confident for an item of that particular level of difficulty.

(3) Expert opinion deviation

In this study, eight experts were asked to give their opinions on the difficulty values per item according to a scale as discussed in section 5.2.3:
1: student should find the question easy
2: student should find the question of average difficulty, but fair
3: student should find the question difficult or challenging.

The expert opinion deviation from the item difficulty was measured by the amount of deviation of the expert opinion (the average of the eight expert opinions) from the best fit line fitted to the regression between the item difficulties and the expert opinion measures over all the items. As with the confidence deviation, the amount of deviation between the observed expert opinion measure (y_i) and the expected expert opinion value (ŷ_i), which we will refer to as the expected performance based on the students' actual performance in that item, is represented by the vertical distance from the best fit line for each item, as shown in Figure 5.2.
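The confidence deviation and the expert opinion deviation can both be obtained in the same way: fit a least-squares line of the per-item measures against the item difficulties and take each item's vertical distance from that line. The sketch below shows one way to do this, assuming the per-item difficulties and averaged confidence (or expert opinion) values are already available as arrays; the study's actual measures come from the Rasch/Winsteps analysis, so the data and function name here are illustrative only.

```python
import numpy as np

def deviations_from_fit(item_difficulty: np.ndarray, item_measure: np.ndarray) -> np.ndarray:
    """Absolute vertical deviations |y_i - y_hat_i| from the least-squares line
    fitted between item difficulty (x) and a per-item measure (y), e.g. the
    average confidence or the average expert opinion."""
    slope, intercept = np.polyfit(item_difficulty, item_measure, deg=1)
    predicted = slope * item_difficulty + intercept
    return np.abs(item_measure - predicted)

# Toy example: five items with Rasch difficulties (logits) and average confidence (1-4 scale).
difficulty = np.array([-2.1, -0.8, 0.3, 1.5, 2.6])
confidence = np.array([3.6, 3.3, 3.1, 2.1, 2.4])

print(deviations_from_fit(difficulty, confidence).round(3))
# Items with large deviations were answered with more (or less) confidence
# than their difficulty would predict.
```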
Thus, for a point (x_i, y_i) which lies far from the best fit line, the actual expert opinion on the difficulty level differs greatly from the expected difficulty level, which means that for this item i the experts as a group misjudged the difficulty of the question as reflected by student performance.

Figure 5.2: Illustration of expert opinion deviation from the best fit line between item difficulty and expert opinion. [The figure plots expert opinion (y-axis) against item difficulty (x-axis); for item i at (x_i, y_i), the deviation Δ = y_i − ŷ_i is the vertical distance to the best fit line ŷ_i = f(x_i).]

Figures 5.1 and 5.2 show that the larger the deviation of the predicted value from the observed value, the further the observed value is from the regression line and the worse the situation is in terms of an indication of quality.

5.3.2 Defining the Quality Index (QI)

The three measuring criteria discussed in section 5.3 were considered together as an indication of the quality of an item. In what follows, this is referred to as the Quality Index (QI). In this study, we do not enter into a debate about which of the three measuring criteria is more important. In the proposed QI model, all three criteria are considered to be equally important in their contribution to the overall quality of a question.

In order to represent the qualities of a question graphically, 3-axes radar plots were constructed, where each of the three measuring criteria is represented as one of the three arms of the radar plot. In order to compare and plot all three criteria, the measurement direction for the three axes was standardised between 0 and 1. This was done using the transformation formula

y = (x − a) / (b − a),

where the original scale interval [a, b] is transformed into the required scale [0, 1] on each axis, with a the minimum value and b the maximum value for each of the respective three criteria. In order to spread out the values between 0 and 1 on each axis, a further normalisation of the data on the interval [0, 1] was done.

In Figure 5.3, a visual representation of the three axes of the QI is given. The axes were assigned on an ad hoc basis, with adapted discrimination on the first axis, adapted confidence deviation on the second axis and adapted expert opinion deviation on the third axis. On each axis, the value of 0.5 is indicated as a cut-off point between weak and strong and between small and large. The closer the values are to 0, the more successful the criteria are considered to be in their contribution to the quality of a question.

Figure 5.3: Visual representation of the three axes of the QI. [Each axis runs from 0 to 1 with 0.5 marked as the cut-off: weak/strong on the adapted discrimination axis, and small/large on the adapted confidence deviation and adapted expert opinion deviation axes.]

Figure 5.4 depicts an example of a radar plot.

Figure 5.4: Quality Index for PRQ C65M08. [Adapted discrimination 0.749, adapted confidence deviation 0.437, adapted expert opinion deviation 0.674; QI = 0.488.]

The Quality Index (QI) is defined to be the area of the radar plot. The area formula is

QI = (√3 / 4) [(Discr × Conf dev) + (Conf dev × EO dev) + (EO dev × Discr)],

where Discr = adapted discrimination, Conf dev = adapted confidence deviation and EO dev = adapted expert opinion deviation. The QI combines all three measuring criteria and can now be used to compare the quality of the PRQs with the CRQs within each assessment component. For the proposed model, the smaller the area of the radar plot, i.e. the closer the QI value is to zero, the better the quality of the question.
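A small sketch of the QI calculation follows, reproducing the radar-plot area formula above. The rescale function mirrors the min-max transformation y = (x − a)/(b − a); the function names are illustrative. The final line checks the formula against the values quoted for PRQ C65M08 in Figure 5.4.

```python
import math

def rescale(x: float, a: float, b: float) -> float:
    """Min-max transformation of x from the interval [a, b] onto [0, 1]."""
    return (x - a) / (b - a)

def quality_index(discr: float, conf_dev: float, eo_dev: float) -> float:
    """Quality Index: area of the 3-axis radar plot spanned by the adapted
    discrimination, adapted confidence deviation and adapted expert opinion
    deviation (each already scaled to [0, 1])."""
    return (math.sqrt(3) / 4) * (discr * conf_dev + conf_dev * eo_dev + eo_dev * discr)

# Values quoted for PRQ C65M08 (Figure 5.4): the QI should come out near 0.488.
print(round(quality_index(0.749, 0.437, 0.674), 3))
```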
A sample group of 207 test items was used, of which 94 were PRQs and 113 were CRQs. The median QI value for all the test items was calculated, and this value of 0.282 was used as a cut-off value to define the quality of an item as follows:

Good quality : QI < 0.282
Poor quality : QI ≥ 0.282

If the QI of an item is close to 0.282, the item quality is considered to be moderately good/poor. In Figures 5.5 and 5.6, an example of a small QI, which constitutes a good quality item, is presented alongside an example of a large QI, which constitutes an item of lower quality, for comparison purposes.

Figure 5.5: A good quality item.

Show that Σ_{r=0}^{n} (−1)^r (n choose r) = 0.

CRQ, Algebra, June 2005, Q1b. A651b (Good quality). [Radar plot with adapted discrimination, adapted confidence deviation and adapted expert opinion deviation values of 0.213, 0.291 and 0.240; QI = 0.079.]

Figure 5.6: A poor quality item.

Consider the following theorem:
Theorem: If a function f is continuous on the closed interval [a, b] and F is an antiderivative of f on [a, b], then ∫_a^b f(x) dx = F(b) − F(a).
Consider the proof of this theorem:
Proof: Divide the interval [a, b] into n sub-intervals by the points a = x_0 < x_1 < ... < x_{n−1} < x_n = b.
Show that F(b) − F(a) = Σ_{i=1}^{n} [F(x_i) − F(x_{i−1})].

CRQ, Calculus, September 2005, Q3b. C953b (Poor quality). [Radar plot with adapted discrimination 0.831, adapted confidence deviation 0.839 and adapted expert opinion deviation 0.865; QI = 0.927.]

5.3.3 Visualising the difficulty level

Difficulty level is an important parameter, but does not contribute to classifying a question as good or not. Both easy questions and difficult questions can be classified as good. In this study, the range of difficulty levels over the 207 test items was calculated to be 10.12, using the maximum difficulty value of 4.56 and the minimum difficulty value of −5.56. The standard deviation was calculated to be 1.59. Using these parameters, the distribution of the difficulty levels was investigated by creating a histogram with six intervals of difficulty of 1.5 logits each, as indicated in Figure 5.7.

Figure 5.7: Distribution of six difficulty levels.

For each of the six intervals, a corresponding shading of the radar chart was chosen to represent the six difficulty levels: very easy; easy; moderately easy; moderately difficult; difficult; very difficult. Table 5.4 represents the classification and shading of the difficulty intervals. The greater the level of difficulty, the darker the shading of the radar plot, i.e. the intensity of the shading increases from white for the very easy items, through increasing shades of grey, to black for the very difficult items. For example, in Figures 5.5 and 5.6 the dark grey shading of the radar plots represents a difficult item. So Figure 5.5 visually represents a difficult, good quality item and Figure 5.6 represents a difficult, poor quality item.

Table 5.4: Classification of difficulty intervals (shading runs from white for very easy to black for very difficult).

Interval        Degree of difficulty
(−6; −3]        Very easy
(−3; −1.5]      Easy
(−1.5; 0]       Moderately easy
(0; 1.5]        Moderately difficult
(1.5; 3]        Difficult
(3; 6]          Very difficult

In Chapter 6, in the research findings, a quantitative data analysis is presented.
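As a compact summary of the two classifications introduced in this chapter, the sketch below combines the median-based quality cut-off of section 5.3.2 with the six 1.5-logit difficulty bands of Table 5.4. The 0.282 cut-off and the interval boundaries are taken directly from the text; the function names are illustrative.

```python
QI_CUTOFF = 0.282   # median QI over the 207 items

def quality_label(qi: float) -> str:
    """Good quality if QI < 0.282, poor quality otherwise."""
    return "good quality" if qi < QI_CUTOFF else "poor quality"

# Six difficulty bands of 1.5 logits each (Table 5.4), from very easy to very difficult.
DIFFICULTY_BANDS = [
    (-6.0, -3.0, "very easy"),
    (-3.0, -1.5, "easy"),
    (-1.5,  0.0, "moderately easy"),
    ( 0.0,  1.5, "moderately difficult"),
    ( 1.5,  3.0, "difficult"),
    ( 3.0,  6.0, "very difficult"),
]

def difficulty_label(logit: float) -> str:
    """Map a Rasch item difficulty (in logits) onto one of the six bands."""
    for lower, upper, label in DIFFICULTY_BANDS:
        if lower < logit <= upper:
            return label
    raise ValueError("difficulty outside the (-6, 6] range used in this study")

# Example: the item in Figure 5.6 (QI = 0.927, difficulty 2.4 logits).
print(quality_label(0.927), "/", difficulty_label(2.4))   # poor quality / difficult
```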
There I report on and compare good quality items and poor quality items, both PRQs and CRQs, within each of the seven mathematics assessment components, in terms of the Quality Index developed in section 5.3.2.

CHAPTER 6: RESEARCH FINDINGS

6.1 QUANTITATIVE DATA ANALYSIS

In this chapter on the research findings, an analysis of good quality items and poor quality items, both PRQs and CRQs, in terms of the Quality Index developed in section 5.3.2, within each of the seven mathematics assessment components, will be presented.

6.1.1 Methodology

Stage 1

The traditional statistical analysis of the data, supplied by the Computer Network Services (CNS) Division of the University of the Witwatersrand, includes the Performance Index, Discrimination Index and Easiness/Difficulty factor per question for all tests (PRQ and CRQ) during the period of study, July 2004 to July 2006. Raw data, including students' responses to test items and confidence of responses, was obtained from the Computer Network Services (CNS) Division of the University of the Witwatersrand. Spreadsheets were constructed using a 'Mathematica' programme developed by a statistician from the School of Statistics at the University of the Witwatersrand. The following information was captured on every spreadsheet per test:
● students' responses to all test items, both PRQ and CRQ
● students' confidence of responses per test item, both PRQ and CRQ.

The correct answers and mathematics assessment components per test item were also recorded for reference purposes. Student numbers were not recorded on every spreadsheet. In constructing these spreadsheets, records were excluded if:
(i) the student had failed to provide an answer; or
(ii) the student had failed to provide a confidence of response; or
(iii) the student had filled in the MCQ card incorrectly.

It should be noted that in most cases the excluded records were due to (ii) above. The proportion of all the records excluded in this manner ranged between 7.2% and 8.9% across the tests. All subsequent calculations were performed on this filtered data. For PRQs and CRQs, the Performance Index (PI) per question was equal to the proportion of (filtered) respondents who obtained the correct answer. It should be noted that the "easiness/difficulty" statistic provided on the CNS printouts is equal to the Performance Index, i.e. Performance Index = Difficulty Index. An overall Confidence Index (per assessment component) was calculated by averaging the CIs per question for all questions in that assessment component. An overall Performance Index or Difficulty Index (per assessment component) was calculated in a similar manner by averaging the PIs per question for all questions in that assessment component.

Stability (test-retest) was achieved by administering the same tutorial tests in March and August over the period 2004-2006. Equivalence was achieved over the period of study by administering different tests to the same cohort of students (Mathematics I Major) in each of the three years, 2004, 2005 and 2006 respectively. Internal consistency was achieved by correlating and equating the items in each test to each other, as described under test item calibration in section 6.2.1.

Stage 2

The Rasch model (Rasch, 1960), as discussed in section 3.4.1, was used to evaluate both the attitudinal data (confidence levels) and the test data.
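Before the Rasch analysis of Stage 2 is taken up, the Stage 1 filtering and Performance Index calculation described above can be illustrated as a simple pass over the response records. The sketch below is only indicative: it assumes each record carries an answer, a confidence rating and a flag for a correctly filled-in MCQ card, and the field and function names are hypothetical rather than those of the actual CNS spreadsheets.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    answer: Optional[str]        # None if no answer was provided
    confidence: Optional[int]    # None if no confidence rating was provided
    card_valid: bool             # False if the MCQ card was filled in incorrectly
    correct: bool                # whether the given answer was correct

def filter_records(records: list[Record]) -> list[Record]:
    """Exclude records with a missing answer, missing confidence, or invalid card."""
    return [r for r in records
            if r.answer is not None and r.confidence is not None and r.card_valid]

def performance_index(records: list[Record]) -> float:
    """Proportion of (filtered) respondents who obtained the correct answer."""
    kept = filter_records(records)
    return sum(r.correct for r in kept) / len(kept)

# Toy example: four usable records, one excluded for a missing confidence rating.
records = [
    Record("A", 4, True, True),
    Record("B", None, True, False),   # excluded under rule (ii)
    Record("C", 2, True, False),
    Record("A", 3, True, True),
    Record("D", 1, True, False),
]
print(performance_index(records))     # 2 correct out of 4 kept -> 0.5
```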
The Winsteps (Linacre & Wright, 1999) Rasch analysis programme was utilised by a data analyst from the University of Pretoria for the quantitative data analysis in this research study. In particular the WINSTEPS® Version 3.55.0 was used to 173 analyse the data in this study. SAS Version 9 and Microsoft EXCEL 2003 were also used in calculating totals and means. The Winsteps software, developed by John M Linacre in 2005, constructs Rasch measures from simple rectangular data sets, usually of persons and items. Item types that can be combined in one analysis include dichotomous, multiple choice and partial credit items. Paired comparisons and rank-order data can also be analysed. Missing data is no problem. Winsteps is designed as a tool that facilitates exploration and communication. The structure of the items and persons can be examined in depth. Unexpected data points are identified and reported in numerous ways. Powerful diagnosis of multidimensionality through principal components analysis of residuals detects and quantified substructures in the data. The working of rating scales can be examined thoroughly, and rating scales can be recoded and items regrouped to share rating scales as desired (Linacre, 2002). Measures can be fixed (anchored) at pre-set values (Linacre, 2005). In order to prepare the data in an ASCII format to import into Winsteps, SAS was used to create ASCII files with a specific layout. Control files were prepared in Winsteps for each part of each test, i.e. the PRQ part, the CRQ part as well as the confidence index part. This was done as the different Rasch models, discussed in section 3.4.1.3, were applicable to the different types of data. These parts of the tests were first analysed separately to check for model “fit”. Such “fit” statistics help detect possible idiosyncratic behaviour on the part of respondents and test items. Those respondents who exhibited “misfit” were first investigated for coding errors, and then their raw hard-copy responses were reviewed for evidence of non-attention to the test. Such individuals might be ones who are haphazardly circling responses or those who are guessing and/or miscoding. Winsteps provides ways of diagnosing problems in the analysis. In the first place the point measure values were considered. Where items exhibited negative point measure values, these items were scrutinised for errors such as an 174 incorrect key and corrected. If the point measure stayed negative, the item was removed from the analysis. Subsequently, the output tables for person ability and item difficulty were checked for misfitting entries. Person ability tables were considered first. Misfit Some explanation in terms of misfitting items or students is in order. One would expect that a student of medium mathematical ability would be able to respond correctly to easier items in the test and incorrectly to the difficult items in a test. Where the item difficulty matches the ability of the student, one would expect the student to answer some of these items correct and some incorrectly. If an item’s difficulty corresponds exactly to the student’s ability, the probability of success of the student on that item is 0.5, in other words, success or failure is expected equally. The Rasch model assumes this pattern of responses, and the Infit and Outfit mean-square statistics are 1.0. 
If, for example, a student were to guess the answer to a difficult item correctly (one that the student would be expected to get wrong), the Outfit statistic would be much larger than 1.0, because it is sensitive to outliers. The approach used in the analyses of this study's data was that items and persons were accepted as not misfitting when the Infit mean-square statistic was between 0.5 and 1.5. Where the values were less than 0.5, too much predictability (overfit) was experienced, and where the value exceeded 1.5, too much noise was present in the data and a situation of underfit existed. The Infit statistics were considered first, and then the Outfit statistics. Mean-square statistics indicate the size of the misfit, but the "significance" of the improbability of the misfit is important. Misfitting persons were deleted, and the analysis was repeated. Another round of misfitting persons was removed from the analysis. Only then were the fit statistics of the items considered. If an item proved to be problematic in terms of the fit statistics, the item was also removed from any subsequent analysis. The same procedure was followed to explore the misfitting persons and items in terms of the CRQs and the confidence index.

For the PRQs, the dichotomous Rasch model applies:

P_vi = e^(β_v − δ_i) / (1 + e^(β_v − δ_i))

For the confidence index, the same categories were available throughout and the data were thus analysed according to the Rasch-Andrich rating scale model:

ln(P_vix / P_vi(x−1)) = β_v − δ_i − F_x

CRQs were analysed through the application of the Partial Credit model:

ln(P_vix / P_vi(x−1)) = β_v − δ_i − F_ix

These various Rasch models have already been discussed in more detail in section 3.4.1.

Test item calibration

Through the application of the Rasch family of models it is also possible to put the measures of different tests onto the same scale if certain assumptions are made. The tests can be linked either through common items on the tests or through common students writing the tests. The researcher faced a challenge in terms of the data. Although, as mentioned previously, it was known that the same cohort of students wrote the same tests in a calendar year, the student identification numbers were not available on all the data sets and therefore no linking could take place on a one-to-one basis. The strong assumption was then made that the subject matter of the different tests was distinct and that the tests could therefore be regarded as independent. In other words, it was assumed that because the subject matter was distinct, students' ability did not improve progressively throughout the year. This assumption led to the decision that all the data could be calibrated together, anchoring the items that were common over the three years. In this way, the item difficulties and the student measures were on the same scale and were deemed directly comparable. Fit statistics were again considered and, if in the combined calibration of items any misfitting items were identified, they were excluded from the analysis. A small number of items misfitted, which is not unexpected in such a large data set. The same procedure was followed in terms of the CRQs. In order to place the measures of the PRQs and the CRQs on the same scale, a combined calibration of these items was also executed. Another challenge presented itself: at first, when the PRQs and the CRQs were calibrated together, the whole set of CRQs misfitted.
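Since the PRQs, and (after the recoding described next) eventually also the CRQs, are analysed with the dichotomous Rasch model given above, a short sketch of that model's response probability may be useful here: the probability of a correct response depends only on the gap between person ability β_v and item difficulty δ_i. This illustrates the model formula only, not the Winsteps estimation procedure used in the study.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: P(correct) = exp(b - d) / (1 + exp(b - d)),
    where b is person ability and d is item difficulty, both in logits."""
    gap = ability - difficulty
    return math.exp(gap) / (1 + math.exp(gap))

# When ability equals difficulty the probability of success is exactly 0.5;
# a 1-logit advantage raises it to about 0.73, a 1-logit deficit lowers it to about 0.27.
for gap in (-2, -1, 0, 1, 2):
    print(gap, round(rasch_probability(gap, 0.0), 2))
```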
It was then decided to recode the partial credit items into dichotomous items in the following way: if a student scored less than half the marks, the student was awarded a 0 for that specific item; if the student scored half or more of the marks on an item, the student was awarded a 1 for the item. The CRQs were therefore eventually analysed through the same model as the PRQs, i.e. the dichotomous Rasch model, and the combined calibration of items then produced a set of items that mostly fitted the Rasch model.

Confidence level item calibration

A similar process was followed to determine the item difficulties of the confidence levels. The item difficulty for a rating scale is defined as the point where the top and bottom categories are equally probable (Linacre, 2005).

6.2 DATA DESCRIPTION

Response data from 14 different mathematics tests written between August 2004 and June 2006 were available. Table 6.1 is a representation of the tests written, the number of provided response items (PRQs) per test, the number of constructed response items (CRQs) and the number of students per test. The same cohort of students (Mathematics I Major) wrote the tests in each of the three years, 2004, 2005 and 2006 respectively.

Table 6.1: Characteristics of tests written.

Year   Month              Number of PRQs   Number of CRQs   Number of students
2004   August             10               0                457
2005   March              8                0                410
2005   April Tutorial A   8                0                263
2005   April Tutorial B   8                0                126
2005   May                8                0                403
2005   June               12               17               414
2005   August             10               0                389
2005   September          8                17               387
2005   November           15               18               385
2006   March              8                15               352
2006   April Tutorial A   8                0                245
2006   April Tutorial B   8                0                105
2006   May                8                14               359
2006   June               12               24               348

Out of a total of 221 PRQ and CRQ items, seven items were discarded because their fit statistics indicated that they did not fit the model. Table 6.2, included in Appendix A5, presents these items with their fit statistics. Another seven items (I115M09 – I115M15) were discarded because the actual items were not available. Finally, 207 items were included in the analyses. The Rasch statistics for all 207 test items analysed are included in Appendix A6. The Rasch statistics for the confidence level items are included in Appendix A7.

6.3 COMPONENT ANALYSIS

Examples of questions in the different mathematics assessment components are now presented. Within each of the seven assessment components, both PRQs and CRQs, ranging from easy to difficult and of good and poor quality, are presented. For each item, the question is followed by a radar plot and a table summarising the quality parameters of the test item, i.e. item difficulty, discrimination, confidence index, expert opinion and the final quality index, as discussed in the theoretical framework in Chapter 5. Each of the axes of the radar plots is labelled with the corresponding values for discrimination, confidence index and expert opinion. The Quality Index (QI) is displayed alongside the radar plot. The shading of the radar plot corresponds to one of the six item difficulty levels as classified in Table 5.4. The comments briefly summarise the difficulty level, the three measuring criteria as developed in the theoretical framework and the overall quality of the item.

1.
Technical component A651(a) 12 2 1 Find the constant term in − x + x CRQ, Algebra, June 2005, Q1a A651a Assessment Component Comment Technical PRQ/CRQ CRQ Item Difficulty 1.10 Moderately difficult Discrimination 0.295 Discriminates well Confidence Index 0.385 Small deviation from expected confidence level Expert Opinion 0.236 Small deviation from expected performance Quality Index 0.119 Good quality CRQ (excellent) 180 A652(a) Write −2 cos x + 2 3 sin x in the form R cos( x − θ ) CRQ, Algebra, June 2005, Q2a A652a Assessment Component Comment Technical PRQ/CRQ CRQ Item Difficulty -0.33 Moderately easy Discrimination 0.501 Discriminates fairly well Confidence Index 0.318 Small deviation from expected confidence level Expert Opinion 0.574 Large deviation from expected performance Quality Index 0.273 Good quality CRQ (moderate) 181 C115MO7 The limit of the sequence A. −5 B. 1 C. 0 1 n ( −5 + (−1) ) is n! D. the sequence diverges PRQ, Calculus, November 2005, Q7 C115M07 Assessment Component Comment Technical PRQ/CRQ PRQ Item Difficulty -1.12 Moderately easy Discrimination 0.666 Does not discriminate well Confidence Index 0.343 Small deviation from expected confidence level Expert Opinion 0.416 Small deviation from expected performance Quality Index 0.281 Good quality PRQ (moderate) 182 A1155bii 1 1 1 bc ca Let A = ab a +b b +c c + a For what value(s) of a, b, c does A−1 exist? CRQ, Algebra, November 2005, Q5bii A1155bii Assessment Component Comment Technical PRQ/CRQ CRQ Item Difficulty 2.23 Difficult Discrimination 0.522 Discriminates poorly Confidence Index 0.347 Small deviation from expected confidence level Expert Opinion 0.736 Large deviation from expected performance Quality Index 0.356 Poor quality CRQ 183 A661.1 P(n) = n3 + (n + 1)3 + (n + 2)3 is divisible by 9 Show that the statement is true for n=2 CRQ, Algebra, June 2006, Q1.1 A661.1 Assessment Component Comment Technical PRQ/CRQ CRQ Item Difficulty -2.35 Easy Discrimination 0.975 Discriminates weakly Confidence Index 0.324 Small deviation from expected confidence level Expert Opinion 0.410 Small deviation from expected performance Quality Index 0.367 Poor quality CRQ 184 A56MO2 The exact value of A. 5π B. −5π C. −π D. π E. 2π arctan(tan(5π )) is 3 3 3 3 3 3 PRQ, Algebra, May 2006, Q2 A56M02 Assessment Component Comment Technical PRQ/CRQ PRQ Item Difficulty 0.77 Moderately difficult Discrimination 0.563 Weak discrimination Confidence Index 0.643 Large deviation from expected confidence level Expert Opinion 0.453 Small deviation from expected performance Quality Index 0.393 Poor quality PRQ 185 C65M08 If 5 5 5 3 3 3 ∫ g ( x)dx = 5 and ∫ h( x)dx = −1, then ∫ ( 2 g ( x) − 5h( x ) ) dx = A. 5 B. 15 C. 7 D. 0 E. -27 PRQ, Calculus, June 2005, Q8 C65M08 Assessment Component Comment Technical PRQ/CRQ PRQ Item Difficulty -1.04 Moderately easy Discrimination 0.749 Weak discrimination Confidence Index 0.437 Small deviation from expected confidence level Expert Opinion 0.674 Large deviation from expected performance Quality Index 0.488 Poor quality PRQ 186 2. Disciplinary component A35M08 Let a, b and c be real numbers. Which of the following is the correct statement? A. a < b ⇒ a + b > b + c. B. a > b ⇒ ac > bc. C. D. E. x > a ⇔ −a < x < a. c 2 = c. 0<a<b⇒ 1 1 < . 
b a PRQ, Algebra, March 2005, Q8 A35M08 Assessment Component Comment Disciplinary PRQ/CRQ PRQ Item Difficulty 2.25 Difficult Discrimination 0.069 Discriminates very well Confidence Index 0.842 Large deviation from expected confidence level Expert Opinion 0.355 Small deviation from expected performance Quality Index 0.165 Good quality PRQ 187 C363b Prove, using the Intermediate Value Theorem, that there is a number exactly 1 more than its cube. CRQ, Calculus, March 2006, Q3b C363b Assessment Component Comment Disciplinary PRQ/CRQ CRQ Item Difficulty 3.94 Very difficult Discrimination 0.295 Discriminates well Confidence Index 0.274 Small deviation from expected confidence level Expert Opinion 0.574 Large deviation from expected performance Quality Index 0.177 Good quality CRQ 188 C561a(i) A bacterial colony is estimated to have a population of P (t ) = 24t + 10 t2 +1 million, t hours after the introduction of a toxin. At what rate is the population changing 1 hour after the toxin is introduced? CRQ, Calculus, May 2006, Q1a(i) C561ai Assessment Component Comment Disciplinary PRQ/CRQ CRQ Item Difficulty -2.63 Easy Discrimination 0.543 Discriminates fairly well Confidence Index 0.460 Small deviation from expected confidence level Expert Opinion 0.262 Small deviation from expected performance Quality Index 0.222 Good quality CRQ 189 A55M07 The Cartesian coordinates A. (− 6, − 6) B. (− 6, 6) C. ( 6, − 6) D. (−3, 2) ( x, y ) of the point (r , θ ) = (2 3, 3π ) are: 4 PRQ, Algebra, May 2005, Q7 A55M07 Assessment Component Comment Disciplinary PRQ/CRQ PRQ Item Difficulty -0.76 Moderately easy Discrimination 0.790 Does not discriminate well Confidence Index 0.294 Small deviation from expected confidence level Expert Opinion 0.290 Small deviation from expected performance Quality Index 0.236 Good quality PRQ (moderate) 190 C364b(i) Let ! x" be the greatest integer less than or equal to x. Show that lim f ( x) exists if f ( x) = ! x" + ! − x" . x→2 CRQ, Calculus, March 2006, Q4b(i) C364bi Assessment Component Comment Disciplinary PRQ/CRQ CRQ Item Difficulty 4.19 Very difficult Discrimination 0.501 Discriminates fairly well Confidence Index 0.501 Expert Opinion 0.547 Large deviation from expected performance Quality Index 0.346 Poor quality CRQ Average deviation from expected confidence level 191 C563a(i) Consider the following theorem: Let f be a function that satisfies the following three conditions: (1) f is continuous on the closed interval [a, b]. (2) f is differentiable on the open interval (a, b). (3) f (a ) = f (b). Then there exists a number c ∈ (a, b) such that f ′(c) = 0. What is this theorem called? CRQ, Calculus, May 2006, Q3a(i) C563ai Assessment Component Comment Disciplinary PRQ/CRQ CRQ Item Difficulty -4.74 Very easy Discrimination 0.831 Discriminates poorly Confidence Index 0.545 Large deviation from expected confidence level Expert Opinion 0.273 Small deviation from expected performance Quality Index 0.359 Poor quality CRQ 192 C45MB5 If lim f ( x) A. f (2) is undefined B. f (2) = 3 C. f (2) = 2 D. f (2) is unknown x→2 exists, then PRQ, Calculus, March 2005, Tut Test 1B, Q5 C45MB5 Assessment Component Comment Disciplinary PRQ/CRQ PRQ Item Difficulty 1.91 Difficult Discrimination 0.749 Discriminates poorly Confidence Index 0.521 Large deviation from expected confidence level Expert Opinion 0.409 Small deviation from expected performance Quality Index 0.394 Poor quality PRQ 193 C36M02 Find the following limit: lim x→2 x2 − 4 x−2 A. does not exist B. −2 C. 4 D. 2 E. 
1 PRQ, Calculus, March 2006, Q2 C36M02 Assessment Component Comment Disciplinary PRQ/CRQ PRQ Item Difficulty -5.05 Very easy Discrimination 0.872 Discriminates very poorly Confidence Index 0.822 Expert Opinion 0.239 Small deviation from expected performance Quality Index 0.486 Poor quality PRQ Very large deviation from expected confidence level 194 3. Conceptual component C65M09 Choose the correct statement, given that ∫ 2 A. ∫ 0 B. ∫ 2 C. ∫ 2 D. 0 2 5 0 ∫ 5 0 5 f ( x)dx = 9 and ∫ f ( x)dx = −1. 2 f ( x)dx = 10 f ( x)dx = 10 f ( x)dx = −1 f ( x)dx = 8 E. None of the above PRQ, Calculus, June 2005, Q9 C65M09 Assessment Component Comment Conceptual PRQ/CRQ PRQ Item Difficulty 1.72 Difficult Discrimination 0.110 Discriminates well Confidence Index 0.351 Small deviation from expected confidence level Expert Opinion 0.608 Large deviation from expected performance Quality Index 0.138 Good quality PRQ 195 A1152b Find the equation of the plane which passes through the point A(2,3, −5) and which contains l : (−1,3, −2) + t (−2,1,5) the line CRQ, Algebra, November 2005, Q2b A1152b Assessment Component Comment Conceptual PRQ/CRQ CRQ Item Difficulty 2.93 Difficult Discrimination 0.357 Discriminates well Confidence Index 0.255 Small deviation from expected confidence level Expert Opinion 0.373 Small deviation from expected performance Quality Index 0.138 Good quality CRQ (excellent) 196 C1157a Find ∫ x cos xdx CRQ, Calculus, November 2005, Q7a C1157a Assessment Component Comment Conceptual PRQ/CRQ CRQ Item Difficulty -1.45 Moderately easy Discrimination 0.522 Average discrimination Confidence Index 0.249 Small deviation from expected confidence level Expert Opinion 0.483 Small deviation from expected performance Quality Index 0.218 Good quality CRQ 197 C45MB8 3 f ( x) − ( g ( x))2 = If lim f ( x) = 2 and lim g ( x ) = 3 then lim x →a x →a x →a g ( x) A. 13 3 B. −1 C. − D. 1 3 2 PRQ, Calculus, March 2005, Tut Test 1B, Q8 C45MB8 Assessment Component Comment Conceptual PRQ/CRQ PRQ Item Difficulty -1.94 Easy Discrimination 0.604 Discriminates poorly Confidence Index 0.410 Small deviation from expected confidence level Expert Opinion 0.284 Small deviation from expected performance Quality Index 0.232 Good quality CRQ (moderate) 198 A95M02 ˆ equals PQR is a triangle with vertices P(3,1), Q(5, 2) and R(4,3). PQR A. arccos B. arccos C. 4 5 1 10 π − arccos D.arccos 4 1 − arccos 5 10 −1 10 PRQ, Algebra, August 2005, Tut Test, Q2 A95M02 Assessment Component Comment Conceptual PRQ/CRQ PRQ Item Difficulty -3.22 Very easy Discrimination 0.769 Discriminates poorly Confidence Index 0.406 Expert Opinion 0.333 Small deviation from expected performance Quality Index 0.305 Poor quality PRQ (moderate) Fairly small deviation from expected confidence level 199 C55M04 The graph below is of the derivative of a function g ( x) , i.e. the graph of y = g ′ ( x ) . y 2 y = g′(x) 1 -4 -3 -2 -1 1 2 3 4 x -1 -2 The critical numbers of g ( x ) are A. −2, 2 C. −2, 2, −3,3 B. −3,3 D. −2, −3,3 PRQ, Calculus, May 2005, 04 C55M04 Assessment Component Comment Conceptual PRQ/CRQ PRQ Item Difficulty 1.50 Moderately difficult Discrimination 0.336 Discriminates well Confidence Index 0.723 Large deviation from expected confidence level Expert Opinion 0.546 Large deviation from expected performance Quality Index 0.356 Poor quality PRQ 200 C953a Consider the following theorem: Theorem: If a function of f on [a, b], then ∫ b a f is continuous on the closed interval [a, b] and F is an antiderivative f ( x)dx = F (b) − F (a ) . 
What is this theorem called? CRQ, Calculus, August 2005, Q3a C953a Assessment Component Comment Conceptual PRQ/CRQ CRQ Item Difficulty -5.56 Very easy Discrimination 1.000 Discriminates very poorly Confidence Index 0.497 Large deviation from expected confidence level Expert Opinion 0.434 Fairly small deviation from expected performance Quality Index 0.562 Poor quality CRQ 201 C953b Consider the following theorem: Theorem: If a function of f on [a, b], then ∫ b a f is continuous on the closed interval [a, b] and F is an antiderivative f ( x)dx = F (b) − F (a ) . Consider the proof of this theorem: Proof: Divide the interval [a, b] into n sub-intervals by the points a = x0 < x1 < ... < xn −1 < xn = b . n Show that F (b) − F (a ) = ∑ [ F ( xi ) − F ( xi −1 )]. i =1 CRQ, Calculus, August 2005, Q3b C953b Assessment Component PRQ/CRQ Comment Conceptual CRQ Item Difficulty 2.4 Discrimination 0.831 Discriminates poorly Confidence Index 0.839 Large deviation from expected confidence level Expert Opinion 0.865 Large deviation from expected performance Quality Index 0.927 Poor quality CRQ Difficult 202 4. Logical component A662.2 Use properties of sigma notation and the fact that n ∑r = r =1 n(n + 1) to prove that 2 n ∑r r =1 2 = n(n + 1)(2n + 1) . 6 CRQ, Algebra, June 2006, Q2.2 A662.2 Assessment Component Comment Logical PRQ/CRQ CRQ Item Difficulty 1.52 Difficult Discrimination 0.048 Discriminates well Confidence Index 0.495 Expert Opinion 0.251 Small deviation from expected performance Quality Index 0.069 Good quality CRQ (excellent) Average deviation from expected confidence level 203 A55M08 You are given the sector OAB of a circle of radius 2 with AC = p. C p B Arc length AB equals: A. 2 B. arcsin 2/ p C. arctan p/2 D. A O 2 arctan ( p / 2) PRQ, Algebra, May 2005, Q8 A55M08 Assessment Component Comment Logical PRQ/CRQ PRQ Item Difficulty 0.15 Moderately difficult Discrimination 0.378 Discriminates well Confidence Index 0.479 Small deviation from expected confidence level Expert Opinion 0.504 Average deviation from expected performance Quality Index 0.265 Good quality PRQ (moderate) 204 A562a A polar graph is defined by the equation Is the graph symmetric about the r (θ ) = 5cos 3θ for θ ∈ [0, 2π ] x − axis, the y − axis, both or neither? Motivate your answer. CRQ, Algebra, May 2006, Q2a A562a Assessment Component Comment Logical PRQ/CRQ CRQ Item Difficulty -1.62 Easy Discrimination 0.295 Discriminates well Confidence Index 0.620 Large deviation from expected confidence level Expert Opinion 0.487 Small deviation from expected performance Quality Index 0.272 Good quality CRQ (moderate) 205 A85M05 If z = 3 + 2i and w = 1 − 4i, then in real-imaginary form z equals: w 5 14 + i 17 17 A. − B. 5 14 − i 15 15 C. 3 − 4i D. 11 14 + i 17 17 PRQ, Algebra, August 2005, Tut Test Q5 A85M05 Assessment Component Comment Logical PRQ/CRQ PRQ Item Difficulty -2.31 Easy Discrimination 0.687 Discriminates poorly Confidence Index 0.652 Large deviation from expected confidence level Expert Opinion 0.249 Small deviation from expected performance Quality Index 0.338 Poor quality PRQ 206 C46MA5 If lim[ f ( x) + g ( x)] exists, then A. lim f ( x) = lim g ( x). x →a x →a x→a B. neither C. both lim f ( x) nor lim g ( x) exists. x →a x →a lim f ( x) and lim g ( x) exist. x →a D. we cannot tell if x →a lim f ( x) or lim g ( x) exists. 
x →a x →a PRQ, Calculus, March 2006, Tut Test A,Q5 C46MA5 Assessment Component Comment Logical PRQ/CRQ PRQ Item Difficulty 2.47 Difficult Discrimination 0.481 Average discrimination Confidence Index 0.700 Large deviation from expected confidence level Expert Opinion 0.470 Small deviation from expected performance Quality Index 0.386 Poor quality PRQ 207 A562d A polar graph is defined by the equation r (θ ) = 5cos 3θ for θ ∈ [0, 2π ]. What is the name of this polar graph? CRQ, Algebra, May 2006, Q2d A562d Assessment Component Comment Logical PRQ/CRQ CRQ Item Difficulty -1.42 Moderately easy Discrimination 0.625 Discriminates poorly Confidence Index 0.743 Large deviation from expected confidence level Expert Opinion 0.424 Small deviation from expected performance Quality Index 0.452 Poor quality CRQ 208 C563aii Consider the following theorem: Let f be a function that satisfies the following three conditions: (1) f is continuous on the closed interval [a, b]. (2) f is differentiable on the open interval (a, b). (3) f (a ) = f (b). Then there exists a number Let c ∈ (a, b) such that f ′(c) = 0. f ( x) > f (a ) for some x ∈ (a, b). Give a complete proof of the theorem in this case. CRQ, Calculus, May 2006, Q3aii C563aii Assessment Component Comment Logical PRQ/CRQ CRQ Item Difficulty -0.46 Moderately easy Discrimination 0.481 Average discrimination Confidence Index 0.688 Large deviation from expected confidence level Expert Opinion 0.466 Small deviation from expected performance Quality Index 0.379 Poor quality CRQ 209 5. Modelling component A652b Solve −2 cos x + 2 3 sin x = 4 cos 2 x − 4sin 2 x CRQ, Algebra, June 2005, Q2b A652b Assessment Component Comment Modelling PRQ/CRQ CRQ Item Difficulty 2.81 Difficult Discrimination 0.295 Discriminates well Confidence Index 0.465 Small deviation from expected confidence level Expert Opinion 0.360 Small deviation from expected performance Quality Index 0.178 Good quality CRQ (excellent) 210 A95M03 If # # # $# # $# # # # $# a = (1, 2), b = (−1, 3), c = (4, −2) and d = (3, −3), then (a ⋅ d )b − (b ⋅ c)d equals A. (−54,12) B. −4 C. 3(11, −13) D. not possible PRQ, Algebra, August 2005, Tut Test, Q3 A95M03 Assessment Component Comment Modelling PRQ/CRQ PRQ Item Difficulty 0.84 Moderately difficult Discrimination 0.357 Discriminates well Confidence Index 0.443 Small deviation from expected confidence level Expert Opinion 0.460 Small deviation from expected performance Quality Index 0.228 Good quality PRQ 211 C35M01 lim h →0 A. 9+ h −3 is equal to h lim h →0 1 9+h +3 B. The slope of the tangent line to y = x at the point P (9,3) C. The slope of the tangent line to y = x at the point P (9, −3) D. Both ( A) and ( B ) E. All of ( A), ( B ) and (C ) PRQ, Calculus, March 2005, Q1 C35M01 Assessment Component Comment Modelling PRQ/CRQ PRQ Item Difficulty -0.36 Moderately easy Discrimination 0.460 Discriminates well Confidence Index 0.587 Large deviation from expected confidence level Expert Opinion 0.309 Small deviation from expected performance Quality Index 0.257 Good quality PRQ (moderate) 212 C1156a Match each of the differential equations given in Column A with the type listed in Column B. A. Differential Equation B. Type a. dy y − = ln x dx x 1. Variable separable b. dy e x = ey dx 2. Homogeneous c. ( x 2 + y 2 )dx + 2 xydy = 0 3. Exact d. 2 x + y 3 + (3 xy 2 + ye 2 y ) dy =0 dx 4. 
Linear CRQ, Calculus, November 2005, Q6a C1156a Assessment Component Comment Modelling PRQ/CRQ CRQ Item Difficulty -0.22 Moderately easy Discrimination 0.295 Discriminates well Confidence Index 0.472 Small deviation from expected confidence level Expert Opinion 0.617 Large deviation from expected performance Quality Index 0.265 Good quality CRQ (moderate) 213 C66M06 Let f ( x) be a function such that f (4) = −1 and f ′(4) = 2. If x < 4, then f ′′( x) < 0 and if x > 4, then f ′′( x) > 0. The point (4, −1) is a of the graph of f. A. Relative maximum B. Relative minimum C. Critical point D. Point of inflection E. None of the above PRQ, Calculus, June 2006, Q6 C66M06 Assessment Component Comment Modelling PRQ/CRQ PRQ Item Difficulty -1.00 Moderately easy Discrimination 0.687 Discriminates poorly Confidence Index 0.452 Small deviation from expected confidence level Expert Opinion 0.496 Average deviation from expected performance Quality Index 0.379 Poor quality PRQ 214 C561aii A bacterial colony is estimated to have a population of P(t ) = 24t + 10 t2 +1 million, t hours after the introduction of a toxin. Is the population increasing or decreasing at this time? CRQ, Calculus, May 2006, Q1aii C561aii Assessment Component Comment Modelling PRQ/CRQ CRQ Item Difficulty -4.51 Very easy Discrimination 0.810 Discriminates poorly Confidence Index 0.549 Large deviation from expected confidence level Expert Opinion 0.613 Large deviation from expected performance Quality Index 0.553 Poor quality CRQ 215 6. Problem solving component C1152a Split 3 into partial fractions. ( x − 1)( x 2 + x + 1) CRQ, Calculus, November 2005, Q2a C1152a Assessment Component Comment Problem solving PRQ/CRQ CRQ Item Difficulty -1.37 Moderately easy Discrimination 0.439 Discriminates well Confidence Index 0.352 Small deviation from expected confidence level Expert Opinion 0.272 Small deviation from expected performance Quality Index 0.160 Good quality CRQ (moderate) 216 C65M10 The points of inflection for the function A. (π ,8π ) and (2π ,16π + 2) B. (π , 2) and (2π ,16π + 2) C. (π ,8π ) and (2π ,16π ) D. (π ,8π + 2) and (2π ,16π + 2) E. (π ,8π + 2) and (2π ,16π ) f ( x) = 8 x + 2 − sin x for 0 < x < 3π , are PRQ, Calculus, June 2005, Q10 C65M10 Assessment Component Comment Problem solving PRQ/CRQ PRQ Item Difficulty 1.73 Difficult Discrimination 0.213 Discriminates well Confidence Index 0.352 Small deviation from expected confidence level Expert Opinion 0.609 Large deviation from expected performance Quality Index 0.181 Good quality PRQ 217 A65M04 If 1 π arccos 2 x = , then x equals 2 2 A. 0 B. −1 C. 1 2 D. − 1 2 PRQ, Algebra, June 2005, Q4 A65M04 Assessment Component Comment Problem solving PRQ/CRQ PRQ Item Difficulty 0.14 Moderately difficult Discrimination 0.522 Average discrimination Confidence Index 0.358 Small deviation from expected confidence level Expert Opinion 0.280 Small deviation from expected performance Quality Index 0.188 Good quality PRQ 218 A951 100 Evaluate ∑ [(r + 1) r +1 − r r ]. r =1 CRQ, Algebra, August 2005, Q1 A951 Assessment Component Comment Problem solving PRQ/CRQ CRQ Item Difficulty 0.67 Moderately difficult Discrimination 0.439 Discriminates well Confidence Index 0.480 Small deviation from expected confidence level Expert Opinion 0.372 Small deviation from expected performance Quality Index 0.239 Good quality CRQ (moderate) 219 A65M02 k ∑π = i = r +1 A. π (r + 1 − k ) B. k (r − π + 1) C. π (k − r + 2) D. 
π (k − r ) PRQ, Algebra, June 2005, Q2 A65M02 Assessment Component Comment Problem solving PRQ/CRQ PRQ Item Difficulty 0.98 Moderately difficult Discrimination 0.357 Discriminates well Confidence Index 0.598 Large deviation from expected confidence level Expert Opinion 0.475 Small deviation from expected performance Quality Index 0.289 Poor quality PRQ (moderate) 220 C55M01 Determine from the graph of y = f ( x) whether f possesses extrema on the interval [a, b]. y f a x b A. Maximum at x = a; minimum at x = b. B. Maximum at x = b; minimum at x = a. C. No extrema. D. No maximum; minimum at x = a. PRQ, Calculus, May 2005, Q1 C55M01 Assessment Component Comment Problem solving PRQ/CRQ PRQ Item Difficulty -0.50 Moderately easy Discrimination 0.728 Discriminates poorly Confidence Index 0.288 Small deviation from expected confidence level Expert Opinion 0.587 Large deviation from expected performance Quality Index 0.349 Poor quality PRQ 221 C663c In a given semi-circle of radius 2, a rectangle is inscribed as shown in the figure below. 2 x Find the value of for θ θ θ y x corresponding to the maximum area, and test whether this value gives a maximum. CRQ, Calculus, June 2006, Q3c C663c Assessment Component Comment Problem solving PRQ/CRQ CRQ Item Difficulty -0.13 Moderately easy Discrimination 0.604 Discriminates poorly Confidence Index 0.411 Small deviation from expected confidence level Expert Opinion 0.577 Large deviation from expected performance Quality Index 0.361 Poor quality CRQ 222 A1154bii −3 : 3 1 −2 5 : M = −1 3 −4 4 −5 k 2 − 15 : k + 12 Suppose the system given by M represents three planes, P1 , P2 , P3 . That is, we have: = 3 P1 : x − 2 y − 3z = −4 P2 : − x + 3 y + 5z 2 P3 : 4 x − 5 y + (k − 15) z = k + 12 Find the value(s) of k such that the three planes intersect in a single point. Do not calculate the co-ordinates of that point. CRQ, Algebra, November 2005, Q4biii A1154biii Assessment Component Comment Problem solving PRQ/CRQ CRQ Item Difficulty 0.35 Moderately difficult Discrimination 0.316 Discriminates well Confidence Index 0.717 Large deviation from expected confidence level Expert Opinion 0.964 Large deviation from expected performance Quality Index 0.529 Poor quality CRQ 223 7. Consolidation component C951 Rewrite the following integral as the sum of integrals such that there are no absolute values. DO NOT solve the integral. Give full reasons for your answer. ∫ 5 −2 4x − x 2 dx CRQ, Calculus, August 2005, Q1 C951 Assessment Component Comment Consolidation PRQ/CRQ CRQ Item Difficulty 0.86 Moderately difficult Discrimination 0.419 Discriminates well Confidence Index 0.392 Small deviation from expected confidence level Expert Opinion 0.323 Small deviation from expected performance Quality Index 0.185 Good quality CRQ 224 A45MA4 If f is an odd function and g is an even function then A. f % g is an even function B. f % g is an odd function C. f is a one-to-one function D. g is a one-to-one function PRQ, Algebra, March 2005, Tut Test A, Q4 A45MA4 Assessment Component Comment Consolidation PRQ/CRQ PRQ Item Difficulty 1.11 Moderately difficult Discrimination 0.275 Discriminates well Confidence Index 0.698 Large deviation from expected confidence level Expert Opinion 0.296 Small deviation from expected performance Quality Index 0.207 Good quality PRQ 225 A661.2 This question deals with the statement P ( n) : n 3 + (n + 1)3 + (n + 2)3 is divisible by 9. Use Pascal’s triangle to expand and then simplify ( k + 3) 3 . 
CRQ, Algebra, June 2006, Q1.2 (A661.2)
Assessment component: Consolidation (CRQ). Item difficulty: 0.02 (moderately difficult). Discrimination: 0.666 (discriminates poorly). Confidence index: 0.379 (small deviation from expected confidence level). Expert opinion: 0.301 (small deviation from expected performance). Quality index: 0.246 (good quality CRQ, moderate).

On which interval is the function $f(x) = e^{3x} - e^{x}$ increasing?
A. $(\ln 9, \infty)$   B. $(0, \infty)$   C. $(-\infty, \infty)$   D. $\left(-\tfrac{1}{2}\ln 3, \infty\right)$   E. None of the above
PRQ, Calculus, August 2005, Q7 (C85M07)
Assessment component: Consolidation (PRQ). Item difficulty: -1.17 (moderately easy). Discrimination: 0.687 (discriminates poorly). Confidence index: 0.230 (small deviation from expected confidence level). Expert opinion: 0.514 (average deviation from expected performance). Quality index: 0.272 (good quality PRQ, moderate).

State the Fundamental Theorem of Calculus.
CRQ, Calculus, June 2005, Q4 (C654)
Assessment component: Consolidation (CRQ). Item difficulty: 0.29 (moderately difficult). Discrimination: 0.481 (average discrimination). Confidence index: 0.248 (small deviation from expected confidence level). Expert opinion: 0.819 (large deviation from expected performance). Quality index: 0.310 (poor quality CRQ, moderate).

Let $y = f(x) = \cos(\arcsin x)$. Then the range of $f$ is
A. $\{y \mid 0 \le y \le 1\}$   B. $\{y \mid -1 \le y \le 1\}$   C. $\{y \mid -\tfrac{\pi}{2} < y < \tfrac{\pi}{2}\}$   D. $\{y \mid -\tfrac{\pi}{2} \le y \le \tfrac{\pi}{2}\}$   E. None of the above
PRQ, Algebra, May 2006, Q1 (A56M01)
Assessment component: Consolidation (PRQ). Item difficulty: 3.07 (very difficult). Discrimination: 0.460 (discriminates fairly well). Confidence index: 0.655 (large deviation from expected confidence level). Expert opinion: 0.389 (small deviation from expected performance). Quality index: 0.318 (poor quality PRQ, moderate).

Let $f(x) = \dfrac{x^2}{(x - 2)^2}$. You may assume that $f'(x) = \dfrac{-4x}{(x - 2)^3}$ and $f''(x) = \dfrac{8x + 8}{(x - 2)^4}$. Find the points of inflection of $f$ (if any).
CRQ, Calculus, June 2006, Q2f (C662f)
Assessment component: Consolidation (CRQ). Item difficulty: 3.75 (very difficult). Discrimination: 0.646 (discriminates poorly). Confidence index: 0.783 (large deviation from expected confidence level). Expert opinion: 0.609 (large deviation from expected performance). Quality index: 0.595 (poor quality CRQ).

$\lim_{x \to -1} \dfrac{x^2 + 4x + 3}{x^2 - 1} =$
A. $-1$   B. $0$   C. undefined   D. $4$
PRQ, Calculus, March 2006, Tut Test B, Q6 (C46MB6)
Assessment component: Consolidation (PRQ). Item difficulty: -2.24 (easy). Discrimination: 0.996 (discriminates poorly). Confidence index: 1.000 (large deviation from expected confidence level). Expert opinion: 0.544 (large deviation from expected performance). Quality index: 0.933 (poor quality PRQ).

6.4 RESULTS

6.4.1 Comparison of PRQs and CRQs within each assessment component

Table 6.3 summarises the quality of the items, both PRQs and CRQs, within each assessment component. Within each component the number of good and poor quality items is given, both for the PRQ and the CRQ format. The numbers are also given as percentages: the good and poor item totals as percentages of all items in the component, and the good and poor PRQs (CRQs) as percentages of the PRQs (CRQs) in that component.

Table 6.3: Component analysis – trends.

Component            PRQs   CRQs   Total   Good items   Poor items   Good PRQs   Good CRQs   Poor PRQs   Poor CRQs
1. Technical          11     22     33     17 [52%]     16 [48%]      8 [73%]     9 [41%]     3 [27%]    13 [59%]
2. Disciplinary       24     34     58     28 [48%]     30 [52%]     12 [50%]    16 [47%]    12 [50%]    18 [53%]
3. Conceptual         26     30     56     28 [50%]     28 [50%]     14 [54%]    14 [47%]    12 [46%]    16 [53%]
4. Logical             7      6     13      5 [39%]      8 [61%]      1 [14%]     4 [67%]     6 [86%]     2 [33%]
5. Modelling           3     10     13      8 [62%]      5 [38%]      2 [67%]     6 [60%]     1 [33%]     4 [40%]
6. Problem solving     7      4     11      6 [55%]      5 [45%]      4 [57%]     2 [50%]     3 [43%]     2 [50%]
7. Consolidation      16      7     23     12 [52%]     11 [48%]      7 [44%]     5 [71%]     9 [56%]     2 [29%]
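The bracketed percentages in Table 6.3 follow directly from the counts: each good or poor PRQ (CRQ) figure is expressed as a percentage of the PRQs (CRQs) in that component. A minimal sketch of this calculation is given below (Python is used purely for illustration; the variable names are not part of the study).

```python
# Recompute the bracketed percentages of Table 6.3 from the raw counts.
# The dictionary keys and variable names are illustrative only.

table_6_3 = {
    # component: (good PRQs, poor PRQs, good CRQs, poor CRQs)
    "Technical":       (8, 3, 9, 13),
    "Disciplinary":    (12, 12, 16, 18),
    "Conceptual":      (14, 12, 14, 16),
    "Logical":         (1, 6, 4, 2),
    "Modelling":       (2, 1, 6, 4),
    "Problem solving": (4, 3, 2, 2),
    "Consolidation":   (7, 9, 5, 2),
}

for component, (good_prq, poor_prq, good_crq, poor_crq) in table_6_3.items():
    n_prq = good_prq + poor_prq          # PRQ items in this component
    n_crq = good_crq + poor_crq          # CRQ items in this component
    print(f"{component}: good PRQs {good_prq / n_prq:.0%}, "
          f"good CRQs {good_crq / n_crq:.0%}")
```

For the technical component, for example, this gives 8/11 = 73% good PRQs and 9/22 = 41% good CRQs, the figures used in the component-by-component discussion that follows.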
1. Technical

In the technical assessment component, there is a higher percentage of good PRQs (73%) than of good CRQs (41%). This shows that PRQs are more successful than CRQs as an assessment format in the technical component. There is also a much higher percentage of good PRQs (73%) than of poor PRQs (27%). CRQs, however, are not that successful in this component, with the results showing 59% poor CRQs compared to 41% good CRQs. The conclusion is that the technical assessment component lends itself better to PRQs than to CRQs.

2. Disciplinary

In this study, the disciplinary component is the assessment component with the most items (58), of which 34 were CRQs and 24 were PRQs. In this component it is interesting to note that the percentages of good PRQs (50%) and good CRQs (47%) are almost equal. In addition, there is no difference between the good PRQs (50%) and the poor PRQs (50%), and very little difference between the good CRQs (47%) and the poor CRQs (53%). PRQs and CRQs can therefore be considered equally successful assessment formats in the disciplinary component.

3. Conceptual

The conceptual component also contained many items (56), with an almost equal number of PRQs and CRQs (26 PRQs versus 30 CRQs). 50% of the items are of good quality and 50% are of poor quality. In this component, there is no clear trend that PRQs are better than CRQs or vice versa, although there is a slight leaning towards good PRQ assessment (54% good PRQs compared to 47% good CRQs). Therefore, in the conceptual assessment component, PRQs could be used as successfully as CRQs as a format of assessment.

4. Logical

In this study, it is interesting to note that the majority of questions within the logical component were of a poor quality, mainly due to the large percentage of poor PRQs. There are noticeably more good quality CRQs (67%) than good quality PRQs (14%), and noticeably more poor quality PRQs (86%) than poor quality CRQs (33%). A very high percentage of the PRQs (86%) in the logical component were of a poor quality. The conclusion is that the logical assessment component lends itself better to CRQs than to PRQs.

5. Modelling

In the modelling component, very few PRQs were used as assessment items in comparison to CRQs (3 PRQs versus 10 CRQs), probably because it is difficult to set PRQs in this component. Despite the small number of PRQs, it was encouraging to note that the good PRQs (67%) far outweighed the poor PRQs (33%). So, in terms of quality, the PRQs were highly successful in the modelling component. There are also more good CRQs (60%) than poor CRQs (40%). It appears that, although more difficult to set, PRQs could be used as successfully as CRQs in the modelling assessment component.

6. Problem solving

Although the problem solving component had the fewest items (11), it is interesting to note that there are more PRQs (7) than CRQs (4). There is a slightly higher percentage of good PRQs (57%) than of good CRQs (50%).
Although the sample is too small to draw definite conclusions, there is no reason to disregard the use of PRQs in this assessment component. In fact, PRQs seem to be slightly more successful than CRQs, and the conclusion is that the PRQ assessment format can add value to the assessment of the problem solving component.

7. Consolidation

It was somewhat surprising to note that, at this highest level of conceptual difficulty, the consolidation component displayed an unusually high proportion of PRQs (16) to CRQs (7). This supports the earlier claim that PRQs are not only appropriate for testing lower level cognitive skills (Adkins, 1974; Aiken, 1987; Haladyna, 1999; Isaacs, 1994; Johnson, 1989; Oosterhof, 1994; Thorndike, 1997; Williams, 2006). In the consolidation component there is a significantly higher percentage of good CRQs (71%) than of good PRQs (44%). In addition, there is a higher percentage of poor PRQs (56%) than of good PRQs (44%). The high percentage of good CRQs (71%) in comparison to poor CRQs (29%) indicates that the consolidation assessment component lends itself better to CRQs than to PRQs.

CHAPTER 7: DISCUSSION AND CONCLUSIONS

In Chapter 7, I discuss my research results. The discussion includes the interpretation of the results and the implications for future research, and I consider how the results could have implications for assessment practices in undergraduate mathematics. Using the Quality Index model, as developed in section 5.3, I illustrate which items can be classified as good or poor quality mathematics questions, and I compare good and poor quality mathematics questions in each of the two assessment formats, PRQ and CRQ. Furthermore, I draw conclusions from my research about which of the mathematics assessment components, as defined in section 5.1, can be successfully assessed with respect to each of the two assessment formats. In this way, I endeavour to probe and clarify the first two research subquestions as stated in section 3.2: How do we measure the quality of a good mathematics question? and Which of the mathematics assessment components can be successfully assessed using the PRQ assessment format and which using the CRQ assessment format?

7.1 GOOD AND POOR QUALITY MATHEMATICS QUESTIONS

For the sake of completeness, this section summarises the development and features of the QI model. In section 5.3, the Quality Index (QI) was defined in terms of the three measuring criteria: discrimination, confidence deviation and expert opinion deviation. Each of these three criteria represents one arm of a radar plot. In the proposed QI model, all three criteria were considered to be equally important in their contribution to the overall quality of a question.

The QI model can be used both to quantify and to visualise how good or how poor the quality of a mathematics question is. The following three features of the radar plots assist us in visualising the quality and the difficulty of an item: (1) the shape of the radar plot; (2) the area of the radar plot; (3) the shading of the radar plot.

1. Shape of the radar plot

When comparing the radar plots of the good quality items with those of the poor quality items, it is evident that the shapes of these radar plots are very different. For the good mathematics questions, the shape seems to resemble a small equilateral triangle.
This ideal shape is achieved when all three arms of the radar plot are shorter than the average length of 0.5 on each axis (i.e. are all very close to 0), and when all three arms are almost equal in magnitude. Such a situation would be ideal for a mathematics question of good quality, since all three measuring criteria would be close to zero, indicating a small deviation from the expected confidence level, a small deviation from the expected student performance, and an item that discriminates well. In contrast, the radar plots corresponding to items of a poor quality do not display this small equilateral triangular shape. One notices that these radar plots are skewed in the direction of one or more of the three axes. This skewness in the shape of the radar plot reflects that the three measuring criteria do not balance each other out, and the axis towards which the shape is skewed reflects which of the criteria contribute to the overall poor quality of the question. There are, however, also poor quality items whose radar plots resemble a large equilateral triangle: although the plot has three arms equal in magnitude, all three arms are longer than the average length of 0.5 and are in fact very close to 1 (i.e. very far from 0).

2. Area of the radar plot

Another visual feature of the radar plot is its area. In this study, the area of the radar plot represents the Quality Index (QI) of the item. By defining the QI as the area, a balance is obtained between the three measuring criteria. If the QI value is less than 0.282 (the median QI), then the question is classified as a good quality mathematics question. If the QI value is greater than or equal to 0.282, the question is considered to be of a poor quality. When investigating the area of the good quality items, it is evident that such items have a small area, i.e. a QI value close to zero. In such radar plots, the three arms are all shorter than the average length of 0.5 on each axis, and are all close to 0. For the poor quality items, the corresponding radar plot has a large area, with QI values far from 0 (i.e. close to 1). In such radar plots, the three arms are generally longer than the average length of 0.5 on each axis, and are all far away from 0. The closer the QI value is to 0, the better the quality of the question. We can conclude that both the area and the shape of the radar plot assist us in forming an opinion on the quality of a question.

In Figure 7.1, both the shape and the area of the radar plot indicate a good quality assessment item: the shape resembles an equilateral triangle and the area is small. Figure 7.2 visually illustrates an assessment item of poor quality. The shape is skewed in the direction of both the discrimination and confidence axes and the radar plot has a large area. The poor performance of all three measuring criteria contributes to this being a poor quality item: the item does not discriminate well, and both students and experts misjudged the difficulty of the question. The large, skewed shape of the radar plot indicates an item of poor quality.

Figure 7.1: A good quality item.   Figure 7.2: A poor quality item.
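To make the area calculation concrete: with the three axes spaced 120° apart (the usual convention for a three-armed radar plot, and consistent with the QI values tabulated in Chapter 6, e.g. item C1156a), the area enclosed by arms $d$, $c$ and $e$ is $\tfrac{\sqrt{3}}{4}(dc + ce + ed)$. The minimal sketch below (Python, for illustration only; the formal definition of the QI is given in section 5.3.2) reproduces the QI of item C1156a.

```python
from math import sin, radians

def quality_index(discrimination: float, confidence: float, expert: float) -> float:
    """Area of the three-armed radar plot (axes assumed 120 degrees apart).

    Each argument is the corresponding arm length as tabulated in Chapter 6
    (a value between 0 and 1; shorter arms indicate a better item)."""
    d, c, e = discrimination, confidence, expert
    return 0.5 * sin(radians(120)) * (d * c + c * e + e * d)

def classify(qi: float, median_qi: float = 0.282) -> str:
    """Good quality if the QI falls below the median QI of 0.282."""
    return "good quality" if qi < median_qi else "poor quality"

# Item C1156a: arms 0.295, 0.472 and 0.617 give QI = 0.265, a good quality item.
qi = quality_index(0.295, 0.472, 0.617)
print(round(qi, 3), classify(qi))
```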
3. Shading of the radar plot

In this study, the shading of the radar plot helped us to visualise the difficulty level of the question. Six shades of grey, ranging from white through to black (as shown in Table 5.4), represented the six corresponding difficulty levels chosen in this study, ranging from very easy through to very difficult. Difficulty level is an important parameter, but it does not contribute to classifying a question as good or not: both easy questions and difficult questions can be classified as good or poor. Not all difficult questions are of a good quality, and not all easy questions are of a poor quality. For example, in Figure 7.3, the dark grey shading of the radar plot represents a difficult item, while the large area and skewed shape of the plot represent a poor quality item. So Figure 7.3 visually represents a difficult, poor quality item. In Figure 7.4, the very light shading of the radar plot represents an easy item, while the small area and regular shape of the radar plot represent a good quality item. So Figure 7.4 visually represents an easy, good quality item.

Figure 7.3: A difficult, poor quality item.   Figure 7.4: An easy, good quality item.
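Radar plots of this kind are straightforward to generate. The sketch below (using Python and matplotlib, which were not part of the study; the grey level is passed in directly rather than derived from the six difficulty intervals of Table 5.4) draws a three-armed radar plot of the kind shown in Figures 7.1–7.4.

```python
import numpy as np
import matplotlib.pyplot as plt

def radar_plot(discrimination, confidence, expert, difficulty_shade="0.8", title=""):
    """Draw the three-armed radar plot; the fill colour encodes item difficulty.

    difficulty_shade is a matplotlib grey level between "1.0" (white, very easy)
    and "0.0" (black, very difficult)."""
    arms = [discrimination, confidence, expert]
    labels = ["Discrimination", "Confidence", "Expert opinion"]
    angles = np.linspace(0, 2 * np.pi, len(arms), endpoint=False)
    # close the polygon by repeating the first point
    values = np.concatenate([arms, arms[:1]])
    theta = np.concatenate([angles, angles[:1]])

    ax = plt.subplot(111, projection="polar")
    ax.plot(theta, values, color="black")
    ax.fill(theta, values, color=difficulty_shade)
    ax.set_xticks(angles)
    ax.set_xticklabels(labels)
    ax.set_ylim(0, 1)
    ax.set_title(title)
    plt.show()

# A moderately easy, good quality item (cf. Figure 7.1): short, nearly equal arms.
radar_plot(0.295, 0.472, 0.617, difficulty_shade="0.85", title="Good quality item")
```

A small, nearly equilateral polygon with a light fill would thus correspond to an easy, good quality item such as that in Figure 7.4.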
7.2 A COMPARISON OF PRQs AND CRQs IN THE MATHEMATICS ASSESSMENT COMPONENTS

In section 6.4, Table 6.3 summarised the quality of both PRQs and CRQs within each assessment component. It was noted that certain assessment components lend themselves better to PRQs than to CRQs. For example, in the technical assessment component, the percentage of good quality PRQs (73%) was almost double that of good quality CRQs (41%). For the assessor, this means that the PRQ assessment format can be successfully used to assess mathematics content which requires students to adopt a routine, surface learning approach. In this component, PRQs can successfully assess content which students will have been given in lectures or will have practised extensively in tutorials. In addition, the percentage of poor quality CRQs (59%) was more than double that of poor quality PRQs (27%). The conclusion is that the PRQ format successfully assesses cognitive skills such as manipulation and calculation, associated with the technical assessment component.

Another component in which PRQs can be used successfully is the disciplinary assessment component. In this component, there was no difference between the good quality PRQs and the poor quality PRQs, and very little difference between the good quality CRQs and the poor quality CRQs. The PRQ format can be used to assess cognitive skills involving recall (memory) and knowledge (facts) as successfully as the CRQ format. Thus, in the disciplinary assessment component, the results show that it is easy to set PRQs of a good quality, saving time in both the setting and the marking of questions involving knowledge and recall.

As we proceed to the higher order conceptual assessment component, it is once again encouraging that the results indicate that PRQs can more than hold their own against CRQs. PRQs could be used successfully as a format of assessment for tasks involving comprehension skills, whereby students are required to apply their learning to new situations or to present information in a new or different way. The results challenge the viewpoint of Berg and Smith (1994) that PRQs cannot successfully assess graphing abilities. The shift away from a surface approach to learning towards a deeper approach, as mentioned by Smith et al. (1996), can be just as successfully assessed with PRQs as with the more traditional open-ended CRQs. The conclusion is that the PRQ assessment format can be successfully used in the conceptual assessment component.

The modelling assessment component tasks, requiring the higher order cognitive skill of translating words into mathematical symbols, have traditionally been assessed using the CRQ format. The results from this study show that although there are few PRQs corresponding to this component, probably because it is more difficult to set PRQs than CRQs of a modelling nature, the PRQs were highly successful. The perhaps somewhat surprising conclusion is that PRQs can be used very successfully in the modelling component. This result challenges the claim made by Gibbs (1992) that one of the main disadvantages of PRQs is that they do not measure the depth of student thinking. It also counters the concern expressed by Black (1998) and Resnick & Resnick (1992) that the PRQ assessment format encourages students to adopt a surface learning approach. Although PRQs are more difficult and time consuming to set in the modelling assessment component (Andresen et al., 1993), these results encourage assessors to think more about their attempts at constructing PRQs which require words to be translated into mathematical symbols. The results show that there is no reason why PRQs cannot be authentic and characteristic of the real world, countering the very objections made by Bork (1984) and Fuhrman (1996) against the whole principle of the PRQ assessment format.

Another very encouraging result was the high percentage of good quality PRQs as opposed to poor quality PRQs in the problem solving assessment component. This component encompasses tasks requiring the identification and application of a mathematical method to arrive at a solution. It appears that PRQs are slightly more successful than CRQs in this assessment component, which encourages a deep approach to learning. Greater care is required when setting problem-solving questions, whether PRQs or CRQs, but the results show that PRQ assessment can add value to the assessment of the problem solving component. Once again, this result shows that PRQs do not have to be restricted to the lower order cognitive skills so typical of a surface approach to learning (Wood & Smith, 2002).

The results indicate that PRQs were not as successful in the logical and consolidation assessment components. In the logical assessment component, there were noticeably more poor quality PRQs than poor quality CRQs, and very few good PRQs. The nature of the tasks, involving ordering and proofs, lends itself better to the CRQ assessment format. The high percentage of poor quality PRQs in the logical assessment component leads to the conclusion that this component lends itself better to CRQs than to PRQs. In the consolidation assessment component, involving the cognitive skills of analysis, synthesis and evaluation, there were noticeably more good quality CRQs than good quality PRQs. This trend towards more successful CRQs than PRQs indicates that CRQs add more value to the assessment of this component. This is not an unexpected result, as at this highest level of conceptual difficulty, assessment tasks require students to display skills such as justification, interpretation and evaluation. Such skills would be more difficult to assess using the PRQ format. However, as shown by many authors (Gronlund, 1988; Johnson, 1989; Tamir, 1990), the 'best answer' variety of PRQ, in contrast to the 'correct answer' variety, does cater for a wide range of cognitive abilities.
In these alternative types of PRQs the student is faced with the task of carefully analysing the various options and of making a judgement in order to select the answer which best fits the context and the data given. The conclusion is that the consolidation assessment component encourages the educator or assessor to think more carefully about constructing suitable assessment tasks. According to Wood and Smith (2002), assessment tasks corresponding to a high level of conceptual difficulty provide a useful check on whether we have tested all the skills, knowledge and abilities that we wish our students to demonstrate. As the results have shown, PRQs can be used as successfully as CRQs as an assessment method for those mathematics assessment components which require a deeper learning approach for their successful completion.

7.3 CONCLUSIONS

The mathematics assessment component taxonomy, proposed by the author in section 5.1, is hierarchical in nature, with cognitive skills that need a surface approach to learning at one end and those requiring a deeper approach at the other end of the taxonomy. The results of this research study have shown that it is not necessary to restrict the PRQ assessment format to the lower cognitive tasks requiring a surface approach. The PRQ assessment format can, and does, add value to the assessment of those components involving higher cognitive skills requiring a deeper approach to learning. According to Smith et al. (1996), many students enter tertiary institutions with a surface approach to learning mathematics, and this affects their results at university. The results of this research study have addressed the research question of whether we can successfully use PRQs as an assessment format in undergraduate mathematics, and the mathematics assessment component taxonomy was proposed to encourage a deep approach to learning. In certain assessment components, PRQs are more difficult to set than CRQs, but this should not deter the assessor from including the PRQ assessment format within these components. As the discussion of the results has shown, good quality PRQs can be set within most of the assessment components in the taxonomy, including those which promote a deeper approach to learning.

In the Niss (1993) model, discussed in section 2.3, the first three content objects require knowledge of facts, mastery of standard methods and techniques, and performance of standard applications of mathematics, all in typical, familiar situations. The results of this study have shown that PRQs are highly successful as an assessment format for Niss's first three content objects. As we proceed towards the content objects in the higher levels of Niss's assessment model, students are assessed according to their abilities to activate or even create methods of proof; to solve open-ended, complex problems; to perform mathematical modelling of open-ended real situations; and to explore situations and generate hypotheses. The results of this study again show that even though PRQs are more difficult to set at these higher cognitive levels, they can add value to the assessment at these levels. The results also show that it is the logical and consolidation assessment components that remain better suited to CRQs. Traditional assessment formats such as the CRQ format have in many cases been responsible for hindering or slowing down curriculum reform (Webb & Romberg, 1992).
The PRQ assessment format can successfully assess, in a valid and reliable way, the knowledge, insights, abilities and skills related to the understanding and mastering of mathematics in its essential aspects. As shown by the qualitative results, PRQs can assist the learner in monitoring and improving their acquisition of mathematical insight and power, while also improving their confidence levels. Furthermore, PRQs can assist the educator to improve his or her teaching, guidance, supervision and counselling, while also saving time. The PRQ assessment format can reduce marking loads for mathematics educators, without compromising the value of instruction in any way. Inclusion of the PRQ assessment format at the higher cognitive levels would bring new dimensions of validity into the assessment of mathematics. Table 7.1 presents a comparison of the success of PRQs and CRQs in the mathematics assessment components.

Table 7.1: A comparison of the success of PRQs and CRQs in the mathematics assessment components.

Mathematics assessment component     Comparison of success
1. Technical                         PRQs can be used successfully
2. Disciplinary                      No difference
3. Conceptual                        PRQs can be used successfully
4. Logical                           CRQs more successful
5. Modelling                         PRQs can be used successfully
6. Problem solving                   PRQs can be used successfully
7. Consolidation                     CRQs more successful

As Table 7.1 illustrates, the enlightening conclusion is that there are only two components in which CRQs outperform PRQs, namely the logical and consolidation assessment components. In two other components, the conceptual and problem solving assessment components, PRQs are observed to slightly outperform CRQs. The PRQs outperform the CRQs substantially in the technical and modelling assessment components. In one component, the disciplinary assessment component, there is no observable difference.

7.4 ADDRESSING THE RESEARCH QUESTIONS

In this study, a model has been developed to measure the quality of a mathematics question. This model, referred to as the Quality Index (QI) model, was used to address the research question and subquestions as follows:

Research question: Can we successfully use PRQs as an assessment format in undergraduate mathematics?
Subquestion 1: How do we measure the quality of a good mathematics question?
Subquestion 2: Which of the mathematics assessment components can be successfully assessed using the PRQ assessment format and which of the mathematics assessment components can be successfully assessed using the CRQ assessment format?
Subquestion 3: What are student preferences regarding different assessment formats?

● Addressing the first subquestion:
There is no single way of measuring the quality of a good question. I, as author of this thesis, have proposed one model as a measure of the quality of a question, have illustrated its use, and have found it to be an effective and quantifiable measure. The QI model can assist mathematics educators and assessors to judge the quality of the mathematics questions in their assessment programmes, thereby deciding which of their questions are good and which are poor. Retaining unsatisfactory questions is contrary to the goal of good mathematics assessment (Kerr, 1991). Mathematics educators should optimise both the quantity and the quality of their assessment, and thereby optimise the learning of their students (Romberg, 1992).

The QI model for judging how good a mathematics question is has a number of apparent benefits.
The model is visually satisfying: whether a question is of good or poor quality can be seen at a single glance. Visualising the difficulty level in terms of shades of grey adds convenience to the model. Another visual advantage of the model is that shortcomings in different aspects of an item, such as experts completely underestimating the expected level of student performance on a particular item, can also be instantly visualised. In addition, the model provides a quantifiable measure of the quality of a question, an aspect that makes it useful for comparison purposes. The fact that the model can be applied to judge the level of difficulty of both PRQs and CRQs makes it useful both for traditional "long question" environments and for the increasingly popular online, computer-centred environments.

● Addressing the second subquestion:
In terms of the mathematics assessment components, it was noted that certain assessment components lend themselves better to PRQs than to CRQs. In particular, the PRQ format proved to be more successful in the technical, conceptual, modelling and problem solving assessment components, with very little difference in the disciplinary component, thus representing a range of assessment levels from the lower cognitive levels to the higher cognitive levels. Although CRQs proved to be more successful than PRQs in the logical and consolidation assessment components, PRQs can add value to the assessment of these higher cognitive component levels; greater care is simply needed when setting PRQs in the logical and consolidation assessment components. The inclusion of the PRQ format in all seven assessment components can reduce marking loads for mathematics educators, without compromising the validity of the assessment, and the PRQ assessment format can assess in a valid and reliable way. The results have shown, both quantitatively and qualitatively, that PRQs can improve students' acquisition of mathematical insight and knowledge, while also improving their confidence levels. The PRQ assessment format can be used as successfully as the CRQ format to encourage students to adopt a deeper approach to the learning of mathematics.

● Addressing the third subquestion:
With respect to student preferences regarding different mathematics assessment formats, the results from the qualitative investigation seemed to indicate that there were two distinct camps: those in favour of PRQs and those in favour of CRQs. Those in favour of PRQs expressed the opinion that this assessment format promoted a higher conceptual level of understanding and greater accuracy, required good reading and comprehension skills, and was very successful for diagnostic purposes. Those in favour of CRQs were of the opinion that this assessment format promoted a deeper learning approach to mathematics, required good reading and comprehension skills, and allowed partial marks to be awarded for method, and that students felt more confident with this more traditional approach. Furthermore, from the students' responses, it also seemed as if the weaker students preferred the CRQ assessment format over the PRQ assessment format. The reasons for this preference were varied: CRQs provide for partial credit; there was greater confidence with CRQs than with PRQs; PRQs require good reading and comprehension skills; and PRQs encourage guessing, while the distracters cause confusion.
● Addressing the main research question:
As this study aimed to show, PRQs can be constructed to evaluate higher order levels of thinking and learning, such as integrating material from several sources, critically evaluating data, and contrasting and comparing information. The conclusion is that PRQs can be successfully used as an assessment format in undergraduate mathematics, more so in some assessment components than in others.

7.5 LIMITATIONS OF STUDY

The tests used in this study were conducted with tertiary students in their first year of study at the University of the Witwatersrand, Johannesburg, enrolled for the mainstream Mathematics I Major course. The study could be extended to other tertiary institutions and to mathematics courses beyond the first year level.

The judgement of how good or poor a mathematics question is is made relative to the QI model developed in this study. In the proposed QI model, I assumed that the three arms of the radar plot contribute equally to the overall quality of the mathematics question. This assumption needs to be investigated.

The qualitative component of this study was not the most important part of the research. The small sample of students interviewed was carefully selected to include students of differing mathematical ability, racial background and gender. Consequently, I regarded their responses as being indicative of the opinions of the Mathematics I Major cohort of students. The third research subquestion, dealing with student preferences regarding the different assessment formats, was included as a small subsection and was not the main focus of the research. The qualitative component could be expanded in future by increasing the sample size of interviewees and by using questionnaires in which all the students in the first year mathematics major course could be asked to express their feelings and opinions regarding different mathematics assessment formats.

7.6 IMPLICATIONS FOR FURTHER RESEARCH

Collection of confidence-level data in conceptual mathematics tests provides valuable information about the quality of a mathematics question. The analysis suggests that confidence of responses should be collected, but also that it is critical to consider not only students' overall confidence but their confidence in correct and in incorrect answers separately. The prevalence of overconfidence in the calibration of performance presents a paradox of educational practice. On the one hand, we want students to have a healthy sense of academic self-concept and to persist in their educational endeavours. On the other hand, we hope that a more realistic understanding of their limitations will be the impetus for educational development. The challenge for educators is to implement constructive interventions that lead to improved calibration and performance without destroying students' self-esteem and confidence (Bol & Hacker, 2008, p. 2).

In this study, three parameters were identified to measure the quality of a mathematics question: the discrimination index, the confidence index and expert opinion. Further work needs to be carried out to investigate whether more contributing measuring criteria can be identified to measure the overall quality of a good mathematics question, and how this would affect the calculation of the Quality Index (QI) as discussed in section 5.3.2. As the assumption was made that the three parameters contribute equally to the quality of a mathematics question, the QI was defined as the area of the radar plot. The QI model could be adjusted or refined using other formulae.
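One possible direction is sketched below: if the QI is kept as the area of the radar plot, additional criteria simply become additional, equally spaced axes, and unequal weights could be introduced by scaling the arms before the area is computed. This weighting scheme is purely hypothetical and is not part of the model developed in this study.

```python
from math import pi, sin

def generalised_qi(arms, weights=None):
    """Area of a radar plot with one equally spaced axis per measuring criterion.

    With three criteria and unit weights this reduces to the QI of section 5.3.2;
    the optional weights are a hypothetical refinement, not part of the model."""
    if weights is None:
        weights = [1.0] * len(arms)
    scaled = [a * w for a, w in zip(arms, weights)]
    angle = 2 * pi / len(scaled)  # angle between neighbouring axes
    return sum(0.5 * scaled[i] * scaled[(i + 1) % len(scaled)] * sin(angle)
               for i in range(len(scaled)))

# Three equally weighted criteria reproduce item C1156a's QI of 0.265 ...
print(round(generalised_qi([0.295, 0.472, 0.617]), 3))
# ... while a fourth criterion, or unequal weights, changes the calculation.
print(round(generalised_qi([0.295, 0.472, 0.617, 0.40]), 3))
```

Whether such a weighting would improve the model is an open question; the equal-weight assumption noted under the limitations above would first need to be investigated.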
It is common practice in the South African educational setting to use raw scores in tests and examinations as a measure of a student's ability in a subject. According to Planinic et al. (2006), misleading and even incorrect results can stem from the erroneous assumption that raw scores are in fact linear measures. Rasch analysis, the statistical method used in this research, is a technique that enables researchers to look objectively at data, and the Rasch model (Rasch, 1960) can provide linear measures of item difficulties and of students' confidence levels. Analysis of raw test score data or attitudinal data is often carried out, but such raw scores cannot always be assumed to be linear measures, and linear measures facilitate objective comparison of students and items (Planinic et al., 2006). According to Wright and Stone (1979), the Rasch model is a more precise and moral technique for commenting on a person's ability, and its introduction is long overdue. The Rasch method of data analysis could therefore be valuable for other researchers in the fields of mathematics and science education.

It might be important for mathematics educators and researchers to explore the QI model further, with questions not limited to the calculus and linear algebra topics of many traditional first year tertiary mathematics courses. In doing so, mathematics educators and assessors can be provided with an important model to improve the overall quality of their assessment programmes and to enhance student learning in mathematics. This research study could also be expanded to other universities. Tertiary mathematics educators need to use models of the type developed in this study to quantify the quality of the mathematics questions in their undergraduate mathematics assessment programmes. The QI model can also be used by tertiary mathematics educators to design different formats of assessment tasks which will be significant learning experiences in themselves and will provide the kind of feedback that leads to success for the individual student, thus reinforcing positive attitudes and confidence levels in the students' performance in undergraduate mathematics.

The way students are assessed influences what and how they learn more than any other teaching practice (Nightingale et al., 1996, p. 7). Good quality assessment of students' knowledge, skills and abilities is crucial to the process of learning. In this research study, I have shown that the more traditional CRQ format is not the only, nor always the best, way to assess our students in undergraduate mathematics. PRQs can be constructed to evaluate higher order levels of thinking and learning. The research study conclusively shows that the PRQ format can be successfully used as an assessment format in undergraduate mathematics. As mathematics educators and assessors, we need to radically review our assessment strategies to cope with the changing conditions we face in South African higher education. The possibility that innovative assessment encourages students to take a deep approach to their learning and fosters intrinsic interest in their studies is widely welcomed (Brown & Knight, 1994, p. 24).

REFERENCES

Adkins, D.C. (1974).
Test construction: development and interpretation of achievement tests (2nd ed.). Columbus, Otl: Charles Merrill Publishing. Adler, J. (2001). Teaching mathematics in multilingual classrooms. Dordrecht: Kluwer Academic Publishers. Aiken, L.R. (1987). Testing with multiple-choice items. Journal of Research and Development in Education, 20, 44-58. American Psychological Association (1963). Ethical standards of psychologists. American Psychologist, 23, 357-361. Andersen, E.B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123-140. Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81. Andersen, E.B. & Olsen, L.W. (1982). The life of Georg Rasch as a mathematician and as a statistician. In A. Boomsma, M.A.J. van Duijn & T.A.B. Sniders (Eds.), Essays in item response theory. New York: Springer. Anderson, J.R. (1995). Cognitive psychology and its implications (4th ed.). W.H. Freeman Publishers. Andresen, L., Nightingale, P., Boud, D. & Magin, D. (1993). Strategies for assessing students. Birmingham: SCED. Andrich, D. (1982). An index of person separation in latent trait theory the traditional KR.20 index, and the Guttman scale response pattern. Educational Research and Perspectives, UWA, 9(1), 95-104. Andrich, D. (1988). Rasch models for measurements. USA: Sage Publications, Inc. Andrich, D. & Marais, I. (2006). EDU435/635. Instrument Design with Rasch IRT and Data Analysis 1, Unit Materials - Semester 2. Perth, Western Australia: Murdoch University. Angel, S.A. & LaLonde, D.E. (1998). Science success strategies: An interdisciplinary course for improving science and mathematics education. Journal of Chemical Education, 75(11), 1437-41. Angrosino, M.V. & Mays de Pérez, K.A. (2000). Rethinking observation: From method to context. In N.K. Denzin & Y.S. Lincoln (Eds.), Handbook of qualitative research (2nd ed.) (pp. 673-702). Thousand Oaks, CA: Sage. Anguelov, R., Engelbrecht J. & Harding, A. (2001). Use of technology in undergraduate mathematics teaching in South African universities. Quaestiones Mathematicae, Suppl. 1, 183-191. Astin, A.W. (1991). Assessment for excellence. New York: Macmillan. 251 Aubrecht II, G.J. & Aubrecht, J.D. (1983). Constructing objective tests. Am. J. Phys., 51(7), 613-620. Baker, L. & Brown, A. (1984). Metacognitive skills and reading. In P.D. Pearson, M. Kamil, R. Barr & P. Rosenthal (Eds.), Handbook of reading research (pp. 353-394). New York: Longman. Ball, G., Stephenson, B., Smith, G.H., Wood, L.N., Coupland, M. & Crawford, K. (1998). Creating a diversity of experiences for tertiary students. Int. J. Math. Educ. Sci. Technol., 29(6), 827-841. Baron, M.A. & Boschee, F. (1995). Outcome-based education: Providing direction for performance-based objectives. Educational Planning, 10(2), 25-36. Barak, M. & Rafaeli, S. (2004). On-line question-posing and peer-assessment as means for web-based knowledge sharing in learning. Int. J. Human – Computer Studies, 61, 84-103. Begle, E.G. & Wilson, J.W. (1970). Evaluation of mathematics programs. In E.G. Begle (Ed.), Mathematics Education (69th Yearbook of the National Society for the study of Education, Part I, 376-404). Chicago: University of Chicago Press. Beichner, R. (1994). Testing student interpretation of kinematics graphs. American Journal of Physics, 62, 750-762. Berg, C.A. & Smith, P. (1994). Assessing students’ abilities to construct and interpret line graphs: Disparities between multiple-choice and free-response instruments. 
Science Education, 78, 527-554. Biggs, J. & Collis, N.F. (1982). Mathematics Profile Series Operations Test. In J.B. Biggs (Ed.), Evaluating the quality of learning: the SOLO Taxonomy (pp. 82-89). New York: Academic Press. Biggs, J. (1991). Student learning in the context of school. In J. Biggs (Ed.), Teaching for learning: the view from cognitive psychology (pp. 7-20). Hawthorn, Victoria: Australian Council for Educational Research. Biggs, J. (1994). Learning outcomes: competence or expertise? Australian and New Zealand Research, 2(1), 1-18. Biggs, J. (2000). Teaching for quality learning at university. Buckingham: Open University Press. Birenbaum, M. & Dochy, F. (1996). Alternatives in assessment of achievements, learning processes and prior knowledge. Boston: Kluwer Academic Publishers. Birnbaum, A. (1968). Some latent trait models and their uses in inferring an examinee’s ability. In F.M. Lord & M.R. Novick (Eds.), Statistical theories of mental test scores (pp. 395-479). Reading, MA: Addison-Wesley. Black, P. (1998). Testing: friend or foe? Theory and practice of assessment and testing. London: Falmer Press. 252 Blanton, H., Buunk, B.P., Gibbons, F.X. & Kuyper, H. (1999). When better-thanothers compare upward: Choice of comparison and comparative evaluation as independent predictors of academic performance. Journal of Personality and social Psychology 76, 420-430. Bless, C. & Higson-Smith, C. (1995). Fundamentals of social research methods: An African perspective. Boston: Allan & Bacon. Bloom, B.S. (Ed.) (1956). Taxonomy of educational objectives. The classification of educational goals. Handbook 1: The cognitive domain. New York: David McKay. Bloom, B.S., Hastings, J.T., & Madaus, G.F. (1971). Handbook on formative and summative evaluation of student learning. New York: McGraw-Hill. Bol, L. & Hacker, D.J. (2008). Focus on research: Understanding and improving calibration accuracy. Retrieved on 1 March, 2007 from http://uhaweb.hartford.edu/ssrl/research.htm Bond, T.G. & Fox, C.M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah N J: Erlbaum Assoc. Boone, W. & Rogan, J. (2005). Rigour in quantitative analysis: “The promise of Rasch analysis techniques”. African Journal of research in SMT Education, 9(1), 25-38. Bork, A. (1984). “Letter to the Editor”. Am. J. Phys., 52, 873-874. Boud, D. (1990). Assessment and the promotion of academic values. Studies in higher education, 15(11), 101-111. Boud, D. (1995). Enhancing learning through self-assessment. London: Kogan Page. Braswell, J.S. & Jackson, C.A. (1995). An introduction of a new free-response item type in mathematics. Paper presented at the Annual meeting of the National Council on Measurement in Education. San Francisco: CA. Bridgeman, B. (1992). A comparison of quantitative questions in open-ended and multiple-choice format. Journal of Educational Measurement, 29, 253-271. Brown, G., Bull, J. & Pendlebury, M. (1997). Assessing student learning in higher education. New York: Routledge. Brown, S. & Knight, P. (1994). Assessing learners in higher education. London: Kogan Page. Brown, S. (1999). Institutional strategies for assessment. In S. Brown & A. Glasner (Eds.), Assessment matter in higher education. Choosing and using diverse approaches (pp. 3-13). Buckingham: Open University Press. Burns, N. & Grove, S.K. (2003). Understanding nursing research (3rd ed.). Philadelphia: W.B. Saunders Company. 253 California Mathematics Council (CMC) and EQUALS. (1989). 
Assessment alternatives in mathematics: An overview of assessment techniques that promote learning. University of California, Berkeley: CMC and EQUALS. Campione, J.C., Brown, A.L. & Connell, M.L. (1988). Metacognition: On the importance of understanding what you are doing. In R.I. Charles & E.A. Silver (Eds.), The teaching and assessing of mathematical problem solving (pp. 93-114). Hillsdale, NJ: Lawrence Erlbaum Associates. Carvalho, M.K. (2007). Confidence judgments in real classroom settings: Monitoring performance in different types of tests. International Journal of Pyschology, 1-16. Case, S.M. & Swanson, D.B. (1989). Strategies for student assessment. In Boud, D. & Feletti, G. (Eds.), The challenge of problem-based learning (pp 269-283). London: Kogan Page. Collis, K.F. (1987). Levels of reasoning and the assessment of mathematical performance. In T.A. Romberg & D.M. Stewart (Eds.), The monitoring of school mathematics: Background papers. Madison: Wisconsin Center for Education Research. Corcoran, M. & Gibb, E.G. (1961). Appraising attitudes in the learning of mathematics. In Yearbook (1961) – National Council of Teachers of Mathematics. Reston, VA: NCTM. Cresswell, J.W. (1998). Qualitative inquiry and research design: Choosing among five traditions. Thousand Oaks, CA: Sage. Cresswell, J.W. (2002). Educational Research: Planning, conducting and evaluating quantitative and qualitative research. Upper Saddle River, New Jersey: Pearson Education, Inc. Cretchley, P.C. (1999). An argument for more diversity in early undergraduate mathematics assessment. Delta: 1999. The Challenge of Diversity, 17-80. Cretchley, P.C. & Harman, C.J. (2001). Balancing the scales of confidence – computers in early undergraduate mathematics learning. Quaestiones Mathematicae, Suppl. 1, 17-25. Crooks, T.J. (1988). The impact of classroom evaluation practices on students. Review of Educational Research, 58(4), 43-81. Cumming, J.J. & Maxwell, G.S. (1999). Contextualising authentic assessment. Assessment in Education, 6(2), 177-194. Dahlgren, L. (1984). Outcomes of learning. In F. Marton, D. Hounsell & N. Entwistle (Eds.), The experience of learning. Edinburgh: Scottish Academic Press. De Lange, J. (1994). Assessment: No change without problems. In T.A Romberg (Ed.), Reform in School Mathematics and authentic assessment (pp. 87-172). Albany NY: SUNY Press. 254 Dison, L. & Pinto, D. (2000). Example of curriculum development under the South African National Qualifications Framework. In S. Makoni (Ed.), Improving teaching and learning in higher education. A handbook for Southern Africa (pp. 201-202). Johannesburg, South Africa: Wits University Press. Ebel, R. (1965). Confidence weighting and test reliability. Journal of Educational Mesurement, 2, 49-57. Ebel, R. (1972). Essentials of educational measurement. New York: Prentice Hall. Ebel, R. & Frisbie, D.A. (1986). Essentials of educational measurement. Englewood Cliffs, NJ: Prentice Hall. Ehrlinger, J. (2008). Skill level, self-views and self-theories as sources of error in self-assessment. Social and Personality Psychology Compass, 2(1), 382-398. Eisenberg, T. (1975). Behaviorism: The bane of school mathematics. Journal of Mathematical Education, Science and Technology, 6(2), 163-171. Elton, L. (1987). Teaching in higher education: Appraisal and training. London: Kogan Page. Engelbrecht, J. & Harding, A. (2002). Is mathematics running out of numbers? South African Journal of Science, 99(1/2), 17-20. Engelbrecht, J. & Harding, A. (2003). 
Online assessment in mathematics: multiple assessment formats. New Zealand Journal of Mathematics, 32 (Supp.), 57-66. Engelbrecht, J. & Harding, A. (2004). Combining online and paper assessment in a web-based course in undergraduate mathematics. Journal of computers in Mathematics and Science Teaching, 23(3), 217-231. Engelbrecht, J., Harding, A. & Potgieter, M. (2005). Undergraduate students’ performance and confidence in procedural and conceptual mathematics. Int. J. Math. Educ. Sci. Technol., 36(7), 701-712. Engelbrecht, J. & Harding, A. (2006). Impact of web-based undergraduate mathematics teaching on developing academic maturity: A qualitative investigation. Proceedings of the 8th Annual Conference on WWW Applications. Bloemfontein, South Africa. Entwistle, N. (1992). The impact of teaching on learning outcomes in higher education: A literature review. Sheffield: Committee of Vice-Chancellors and Principals of the Universities of the United Kingdom, Universities’ Staff Development Unit. Erwin, T.D. (1991). Assessing student learning and development: A guide to the principles, goals and methods of determining college outcomes. San Francisco: Jossey-Bass. Freeman, J. & Byrne, P. (1976). The assessment of postgraduate training in general practice (2nd ed.). Surrey: SRHE. 255 Freeman, R. & Lewis, R. (1998). Planning and implementing assessment. London: Kogan Page. Friel, S. & Johnstone, A.H. (1978). Scoring systems which allow for partial knowledge. Journal of Chemical Education, 55, 717-719. Fuhrman, M. (1996). Developing good multiple-choice tests and test questions. Journal of Geoscience Education, 44, 379-384. Gall, M.D., Gall, J.P. & Borg, W.R. (2003). Educational Research: an introduction (7th ed.). USA: Pearson Education Inc. Gay, S. & Thomas, M. (1993). Just because they got it right, does not mean they know it? In N.L. Webb and A.F. Coxford (Eds.), Assessment in the mathematics classroom. Reston, VA: NCTM. Geyser, H. (2004). Learning from assessment. In S. Gravett & H. Geyser. (Eds.), Teaching and learning in higher education (pp. 90-110). Pretoria, South Africa: Van Schaik. Gibbs, G. (1992). Assessing more students. Oxford: The Oxford Centre for Staff Development. Gibbs, G., Habeshaw, S. & Habeshaw, T. (1988). 53 interesting ways to assess your students (2nd ed.). Bristol: Technical and Educational Services Ltd. Gifford, B.R. & O’Connor, M.C. (1992). Changing assessments: Alternative views of aptitude, achievement and instruction. Boston and Dordrecht: Kluwer. Glaser, R. (1988). Cognitive and environmental perspectives on assessing achievement. In E. Freeman (Ed.), Assessment in the service of learning: Proceedings of the 1987 ETS Invitational Conference (pp. 40-42). Princeton, N.J.: Educational Testing Service. Glass, G.V. & Stanley, J.C. (1970). Measurement, scales and statistics. Statistical methods in education and psychology, (pp. 7-25). New Jersey: Prentice Hall. Greenwood, L., McBride, F., Morrison, H., Cowan, P. & Lee, M. (2000). Can the same results be obtained using computer-mediated tests as for paper-based tests for National Curriculum assessment? Proceedings of the International Conference in Mathematics/Science Education and Technology, 2000(1), 179-184. Groen, L. (2006) Enhancing learning and measuring learning outcomes in mathematics using online assessment. UniServe Science Assessment Symposium Proceedings, 56-61. Gronlund, N.E. (1976). Measurement and evaluation in teaching (3rd ed.). New York: Macmillan. Gronlund, N.E. (1988). 
Appendix A1

Declaration letter

Academic Information Systems Unit
Private Bag 3, WITS 2050, South Africa
Tel +27 11 717 1211/2/4 or 1061
Fax +27 11 717 1229

29 January 2007

I, Belinda Huntley, Staff Number 08901381, hereby declare that I will not use the information furnished to me by the University of the Witwatersrand in a manner that will bring the University into disrepute or in a way that could be traced back to the University. I further agree that my research may be used by the University if it so desires. The Registrar has approved the use of this e-mail contact because of the importance the University attaches to the survey. Permission was granted on the understanding that you are not obliged to respond and that you may curtail your involvement at any time in the process.

Signature: B. Huntley          Date: 2007/01/28

Appendix A2

Table 1.2: Exit level outcomes (ELOs) of the undergraduate curriculum*

The qualifying learner:
1. generates, explores and considers options and makes decisions about ways of seeing systems and situations, and considers different ways of applying and integrating scientific knowledge to solve theoretical, applied or real life problems, specifically through research and the production of a research project
2. demonstrates an advanced understanding of key aspects of specified scientific systems and situations
3. demonstrates an advanced understanding of specified bodies of content and their interconnectedness in chosen disciplines
4. demonstrates an advanced understanding of the boundaries, inter-connections, value and knowledge creation systems of chosen disciplines within the sciences
5. reflects on possible implications for self and system of different ways of seeing and intervening in systems and situations
6. demonstrates an ability to reflect with self and others, critical of own and other people's thoughts and actions, and capable of self-organisation and working in groups in the face of continual challenge from the environment
7. demonstrates consciousness of, and engagement with, own learning processes and the nature of knowledge, and how new knowledge can be acquired
8. demonstrates an ability to conduct oneself as an independent learner and practitioner
9. demonstrates an ability to reflect on the importance of scientific paradigms and methods in understanding scientific concepts and their changing nature

(Source: Executive Information System, School of Mathematics, Academic Review 2000-2004, University of the Witwatersrand)
*Italicised text refers to the BScHons degree only; other text is common to the BSc and BScHons degrees.

Appendix A3

Table 1.3: Associated assessment criteria (AAC)*
A. The learner should demonstrate an ability to consider a range of options and make decisions about:
A.1 ways of seeing systems and situations, and to consider different ways of applying and integrating scientific knowledge to solve theoretical, applied or real life problems
A.2 methods for integrating information to solve complex problems
A.3 appropriate methods to carry out investigations to solve problems
A.4 appropriate use of quantitative techniques in the chosen discipline
A.5 selecting an appropriate method for communicating a set of data
A.6 the most appropriate personal learning strategies and organisation of work
A.7 awareness of quality control, scientific standards and ethical norms as they pertain to the application of their chosen discipline in scientific investigations and the workplace
A.8 awareness of the career path and professional responsibilities that accompany their chosen discipline

B. The learner should demonstrate an understanding of:
B.1 the use of critical thinking and logic in analysing situations
B.2 information storage and retrieval systems
B.3 basic computing skills; effective communication and competent application of the relevant techniques, including numerical and computer skills
B.4 how to prepare a written scientific document; how to design, execute and present scientific investigations, such as through a small-scale scientific report/research project
B.5 modes of communicating, interpreting and translating data
B.6 relevant uses of quantitative methods to analyse and check for the plausibility of data
B.7 how to design and carry out scientific investigations
B.8 fundamental/advanced techniques in the discipline

C. The learner should demonstrate an ability to reflect on and critically evaluate:
C.1 the use of advanced investigative techniques and their strengths and weaknesses
C.2 the appropriateness of own interventions, including strengths and weaknesses and possible future improvement of these
C.3 the relative merits of issues raised by science and technology and the relevance of science to everyday life and global issues
C.4 successes, strengths and weaknesses and possible improvement of personal learning strategies
C.5 own and other people's participation in culturally and racially diverse learning situations and society
C.6 scientific paradigms and methods in understanding scientific concepts and their changing nature
C.7 the practice and application of knowledge and understanding they have acquired of their chosen discipline in the workplace

(Source: Executive Information System, School of Mathematics, Academic Review 2000-2004, University of the Witwatersrand)
*Italicised text refers to the BScHons degree only; underlined text refers to the BSc degree only; other text is common to the BSc and BScHons degrees.

Appendix A4

Table 1.4: Critical cross-field outcomes (CCFOs)

CCFO (a) identifying and solving problems in which responses display that responsible decisions using critical and creative thinking have been made.
CCFO (b) working with others as a member of a team, group, organisation, community.
CCFO (c) organising and managing oneself and one's activities responsibly and effectively.
CCFO (d) collecting, analysing, organising and critically evaluating information.
CCFO (e) communicating effectively using visual, mathematical and/or language skills in the modes of oral and/or written persuasion.
CCFO (f) using science and technology effectively and critically, showing responsibility towards the environment and health of others.
CCFO (g) demonstrating an understanding of the world as a set of related systems by recognising that problem-solving contexts do not exist in isolation. CCFO (h) contributing to the full personal development of each learner and the social and economic development of society at large, by making it the underlying intention of any programme of learning to make an individual aware of the importance of: 1. reflecting on and exploring a variety of strategies to learn more effectively; 2. participating as responsible citizens in the life of local, national and global communities; 3. being culturally and aesthetically sensitive across a range of social contexts; 4. exploring education and career opportunities; 5. developing entrepreneurial opportunities. (Source: Executive Information System, School of Mathematics, Academic Review 2000-2004, University of the Witwatersrand) 268 Appendix A5 Table 6.2: Misfitting and discarded test items INFIT Item difficulty Model SE C45MB7 -3.94 C561B Item OUTFIT PTMEA CORR MnSQ ZSTD MnSQ ZSTD 0.47 0.83 -0.3 0.25 -1.5 0.26 -3.47 0.62 0.74 -0.4 0.29 -1.2 0.44 C46MA6 1.72 0.23 1.21 2.0 1.67 3.0 0.33 I036M04 -2.71 0.22 0.91 -0.6 0.45 -2.3 0.50 C361B -3.31 0.36 0.86 -0.4 0.49 -1.4 0.32 C35M02 -3.61 0.47 1.11 0.4 1.61 1.1 0.08 C45MB6 -2.1 0.17 1.19 2.0 1.64 2.8 0.36 269 Appendix A6 Test items Rasch statistics ITEM C35M01 C35M02 C35M03 C35M04 C35M05 A35M06 A35M07 A35M08 A45MA146 A45MA246 A45MA346 A45MA4 C45MA5 C45MA6 C45MA7 C45MA8 A45MB146 A45MB246 A45MB346 A45MB4 C45MB5 C45MB6 C45MB7 C45MB8 C55M01 C55M02 C55M03 C55M04 C55M05 A55M06 A55M07 A55M08 I65M0166 I65M0266 I65M0366 I65M0466 I65M0566 I65M06 I65M0766 I65M08 I65M09 I65M10 I65M1166 I65M1266 A651A663 A651B A652A A652B561B A653 C651A662A C651B662B C651C C651D662E C651E662G C652A C652B C652C C652D RAW SCORE 216 174 242 276 214 185 238 73 253 300 323 80 148 189 119 118 115 118 171 43 36 46 37 88 257 240 179 145 227 21 226 223 396 303 516 416 342 279 546 271 127 125 395 218 394 87 283 95 274 749 512 250 506 430 273 254 260 95 COUNT 295 179 297 298 295 296 297 278 418 415 417 197 200 200 199 127 215 215 216 116 117 49 108 100 327 328 322 328 328 251 284 324 664 652 638 669 662 324 675 328 349 343 644 631 686 353 369 353 369 957 652 369 686 686 335 369 369 353 MEASURE -0.36 -3.94 -0.97 -2.27 -0.32 0.26 -0.89 2.25 0.2 -0.5 -0.85 1.11 -0.7 -2.84 0.13 -2.98 0.34 0.25 -1.18 1.56 1.91 -3.47 1.72 -1.94 -0.5 -0.13 0.9 1.5 0.12 4.56 -0.76 0.15 0.27 0.98 -1.1 0.14 0.7 -1.36 -1.04 -1.04 1.72 1.73 0.18 1.62 1.1 2.97 -0.33 2.81 -0.15 -0.9 -0.33 0.27 0.1 0.8 -0.84 0.2 0.1 2.81 MODEL S.E. 
0.15 0.47 0.17 0.24 0.15 0.14 0.16 0.15 0.11 0.12 0.13 0.16 0.18 0.33 0.16 0.36 0.16 0.16 0.19 0.22 0.23 0.62 0.23 0.34 0.15 0.14 0.13 0.13 0.14 0.24 0.16 0.14 0.09 0.09 0.11 0.09 0.09 0.17 0.11 0.16 0.12 0.13 0.09 0.09 0.09 0.14 0.14 0.14 0.14 0.09 0.11 0.13 0.1 0.09 0.16 0.13 0.13 0.14 INFIT MNSQ ZSTD 1.02 0.3 0.83 -0.3 0.99 0 1.02 0.2 1.19 2.5 0.87 -2.2 0.95 -0.5 1.03 0.5 1.01 0.2 0.95 -0.8 0.96 -0.5 1.04 0.6 1 0.1 0.98 0 0.93 -1 1.14 0.6 0.88 -1.9 0.91 -1.5 1.05 0.5 1.02 0.2 1.18 1.6 0.74 -0.4 1.21 2 0.94 -0.2 1.1 1.3 0.95 -0.7 1.16 2.8 1.02 0.3 0.91 -1.5 0.91 -0.5 1.05 0.6 0.86 -2.2 1.2 4.9 0.99 -0.1 0.95 -0.9 1.04 1.1 1.03 0.9 0.99 -0.1 0.93 -1.1 0.98 -0.2 0.81 -3.7 0.91 -1.7 0.99 -0.2 1.13 2.9 0.98 -0.6 1.01 0.1 1 0 1.09 1.2 1.09 1.3 0.87 -2.7 0.98 -0.3 0.99 -0.2 1.01 0.2 1 -0.1 1.07 0.8 0.99 -0.1 1.01 0.2 1.03 0.4 OUTFIT MNSQ ZSTD 1.02 0.2 0.25 -1.5 1.06 0.4 0.75 -0.7 1.25 2.1 0.82 -2.3 0.95 -0.2 1.02 0.2 0.98 -0.2 0.91 -0.8 0.87 -1 1.1 1 1.03 0.3 0.69 -0.6 0.93 -0.8 1.2 0.6 0.8 -2.1 0.83 -1.8 0.88 -0.6 1.2 1.2 1.24 1.2 0.29 -1.2 1.67 3 0.67 -0.8 1.06 0.4 1.06 0.5 1.28 2.8 1.03 0.4 0.85 -1.1 0.66 -1.1 1.13 0.9 0.74 -2.2 1.34 5.2 0.98 -0.4 0.88 -1 1.04 0.7 1.01 0.3 1.1 0.6 1.01 0.1 0.95 -0.3 0.77 -2.9 0.9 -1.2 0.93 -1.1 1.23 3 0.87 -1.8 0.93 -0.5 1.05 0.3 1.16 1.2 1.15 0.9 0.75 -2 1.06 0.5 0.91 -0.7 0.97 -0.2 1.03 0.3 0.96 -0.2 0.8 -1.5 0.83 -1.2 0.92 -0.6 270 PTMEA CORR. 0.49 0.26 0.44 0.33 0.41 0.62 0.48 0.68 0.54 0.53 0.5 0.58 0.48 0.3 0.58 0.2 0.58 0.56 0.39 0.46 0.35 0.44 0.33 0.42 0.36 0.46 0.44 0.55 0.51 0.73 0.33 0.53 0.37 0.54 0.41 0.46 0.5 0.32 0.41 0.35 0.66 0.61 0.5 0.49 0.57 0.61 0.47 0.57 0.45 0.54 0.45 0.53 0.48 0.53 0.41 0.53 0.51 0.6 ITEM C653A C653B C654 A85M0184 A85M0284 A85M0384 A85M0484 A85M0584 C85M0684 C85M0784 C85M0884 C85M0984 C85M1084 I95M01 I95M02 I95M03 I95M04 I95M05 I95M06 I95M07 I95M08 A951 A952A A952B A952C A952D A953A A953B A953C C951 C952 C953A C953B C953CI C953CII C953D C954 C955 I115M01 I115M02 I115M03 I115M04 I115M05 I115M06 I115M07 I115M08 I115M09 I115M10 I115M11 I115M12 I115M13 I115M14 I115M15 A1151I A1151II A1152A A1152B A1152C A1153A A1153B A1154A A1154BI A1154BII RAW SCORE 229 282 249 279 427 472 400 572 182 565 301 472 382 225 197 133 208 104 197 94 92 185 188 270 189 112 265 273 101 172 183 28 80 273 224 221 272 251 162 142 140 133 205 142 270 220 168 134 263 87 188 178 116 182 222 233 55 29 211 188 235 225 65 COUNT 256 335 369 771 773 771 772 640 754 724 775 770 772 352 220 350 355 346 351 348 346 363 363 341 363 355 341 341 355 359 363 29 345 318 363 363 341 288 359 368 360 356 361 370 350 359 367 364 346 356 362 364 355 205 265 339 325 289 348 344 317 339 330 MEASURE -1.93 -1.07 0.29 1.22 0.24 -0.08 0.41 -2.31 1.96 -1.17 1.08 -0.08 0.53 -0.61 -3.22 0.84 -0.3 1.3 -0.16 1.49 1.52 0.67 0.63 -1.15 0.61 1.8 -1.04 -1.22 2 0.86 0.7 -5.56 2.4 -1.83 0.08 0.13 -1.2 -2.09 0.67 1 0.98 1.07 0.03 1.01 -1.12 -0.19 0.63 1.1 -1.03 1.85 0.34 0.5 1.33 -2.92 -2.08 -0.58 2.93 3.83 -0.03 0.34 -1.05 -0.43 2.66 MODEL S.E. 
0.22 0.16 0.13 0.08 0.08 0.08 0.08 0.14 0.09 0.1 0.08 0.08 0.08 0.13 0.24 0.13 0.13 0.13 0.13 0.14 0.14 0.12 0.12 0.15 0.12 0.13 0.15 0.15 0.13 0.12 0.12 1.03 0.14 0.18 0.13 0.12 0.15 0.19 0.12 0.12 0.12 0.12 0.12 0.12 0.14 0.12 0.12 0.12 0.14 0.14 0.12 0.12 0.13 0.25 0.19 0.14 0.17 0.21 0.13 0.13 0.15 0.13 0.16 INFIT MNSQ ZSTD 1.06 0.4 1.02 0.3 1.08 1.3 0.97 -0.8 1.17 5 0.91 -2.6 0.92 -2.6 0.93 -0.7 1.15 2.9 1 0.1 0.93 -2.1 1.04 1.1 0.98 -0.7 0.97 -0.5 0.95 -0.2 0.99 -0.2 1.1 1.7 1 -0.1 1 0 1.07 1 0.86 -2.1 1.02 0.5 0.99 -0.2 1.23 2.6 0.96 -0.8 1.07 1.2 1.02 0.3 0.86 -1.7 0.89 -1.7 1 0 1.03 0.5 0.94 0.2 1.31 3.6 0.91 -0.8 0.93 -1.2 0.92 -1.6 0.93 -0.8 1.06 0.5 0.96 -0.8 0.86 -3 1.01 0.1 1.07 1.4 1.03 0.6 1.04 0.8 0.96 -0.5 0.97 -0.6 0.95 -1.1 0.88 -2.4 1.07 1 0.99 -0.2 1.07 1.6 0.97 -0.8 1.19 3.2 1.04 0.3 1.1 0.9 1.02 0.3 0.9 -1 1.06 0.4 1.16 2.7 1.15 2.7 1.04 0.5 0.89 -1.8 0.85 -1.6 OUTFIT MNSQ ZSTD 1.14 0.6 1.15 0.8 1.22 1.6 0.92 -1.3 1.19 3.7 0.86 -2.6 0.88 -2.6 0.73 -2 1.32 3.4 1.03 0.3 0.98 -0.3 1.05 0.9 0.98 -0.4 0.89 -1 0.75 -0.9 0.99 -0.1 1.27 2.7 1.09 0.7 1.08 0.9 1.17 1.1 0.74 -1.7 1.02 0.2 0.92 -0.8 1.22 1.3 0.97 -0.2 1.08 0.7 1.1 0.7 0.68 -2.2 0.83 -1.4 0.96 -0.4 1.01 0.2 0.41 -0.3 1.36 2.3 0.84 -0.7 0.84 -1.5 0.85 -1.5 0.95 -0.3 0.94 -0.2 0.96 -0.6 0.83 -2.3 1 0 1.13 1.6 1.05 0.8 1.03 0.5 0.93 -0.6 0.96 -0.5 0.95 -0.8 0.84 -2.1 1.09 0.8 0.98 -0.2 1.07 1 0.96 -0.6 1.27 2.8 1.17 0.7 1.1 0.5 0.94 -0.5 0.78 -1.1 1.09 0.4 1.38 3.1 1.22 2.2 0.98 -0.1 0.73 -2.5 0.66 -2.1 271 PTMEA CORR. 0.31 0.39 0.48 0.52 0.36 0.52 0.53 0.38 0.38 0.38 0.53 0.44 0.49 0.54 0.34 0.54 0.46 0.52 0.52 0.48 0.6 0.5 0.52 0.3 0.53 0.46 0.42 0.53 0.57 0.51 0.5 0.15 0.31 0.44 0.54 0.55 0.46 0.34 0.48 0.56 0.46 0.41 0.39 0.43 0.39 0.43 0.49 0.55 0.3 0.47 0.4 0.47 0.33 0.38 0.4 0.5 0.54 0.43 0.42 0.43 0.47 0.57 0.57 ITEM A1154BIII A1155AI A1155AII A1155BI A1155BII A1155BIII A1156A A1156B C1151A C1151B C1152A C1152B C1153A C1153B C1154A C1154B C1154CI C1154CII C1155 C1156A C1156B C1157A C1157B I036M01 I036M02 I036M03 I036M04 I036M05 I036M06 I036M07 I036M08 A36A A36B A36C A36D A36E C361A C361B C361C C362A C362B C363A C363B C364A C364BI C364BII A46MA4 C46MA5 C46MA6 C46MA7 C46MA8 A46MB4 C46MB5 C46MB6 C46MB7 C46MB8 I56M01 I56M02 I56M03 I56M04 I56M05 I56M06 I56M07 RAW SCORE 187 218 199 215 84 179 139 188 217 164 238 66 166 107 185 157 190 129 240 213 125 241 192 74 73 196 246 196 205 109 121 239 243 207 153 100 239 138 252 168 210 226 38 207 32 196 89 50 94 152 150 43 60 72 37 77 42 163 241 263 251 158 80 COUNT 344 339 348 339 342 349 349 349 348 349 306 330 349 347 344 349 344 345 306 339 347 306 348 285 77 316 277 321 313 313 313 275 310 310 323 316 276 147 310 323 237 310 264 310 263 323 217 193 99 218 158 98 97 83 96 83 328 336 322 323 322 327 330 MEASURE 0.35 -0.3 0.16 -0.25 2.23 0.56 1.2 0.42 -0.14 0.8 -1.37 2.64 0.76 1.78 0.39 0.91 0.31 1.36 -1.42 -0.22 1.46 -1.45 0.28 1.85 -5.05 -0.38 -2.71 -0.31 -0.57 1.19 0.98 -1.7 -0.79 0.02 1.27 2.28 -1.68 -3.31 -1.02 0.99 -2.09 -0.39 3.94 0.02 4.19 0.48 1.41 2.47 -3.62 -0.23 -3.18 0.45 -0.41 -2.24 0.73 -2.96 3.07 0.77 -0.71 -1.2 -0.94 0.79 2.13 MODEL S.E. 
0.13 0.13 0.13 0.13 0.15 0.13 0.13 0.13 0.13 0.13 0.16 0.16 0.13 0.14 0.13 0.13 0.13 0.13 0.16 0.13 0.13 0.16 0.13 0.15 0.54 0.14 0.22 0.13 0.14 0.14 0.13 0.2 0.16 0.14 0.14 0.14 0.19 0.36 0.17 0.13 0.22 0.15 0.2 0.14 0.21 0.14 0.16 0.18 0.47 0.17 0.38 0.23 0.23 0.34 0.23 0.44 0.18 0.12 0.14 0.16 0.15 0.12 0.14 INFIT MNSQ ZSTD 0.91 -1.7 0.93 -1.2 0.92 -1.4 1.13 2.1 1.09 1.2 1.2 3.6 0.98 -0.4 1.09 1.6 0.92 -1.4 0.97 -0.6 0.96 -0.4 0.92 -0.8 0.92 -1.5 1.01 0.1 0.94 -1.2 1.05 1 0.92 -1.5 1.18 3 1.16 1.6 0.88 -2 0.93 -1.1 1 0 0.89 -2.2 1.1 1.3 0.96 0 1.1 1.6 0.91 -0.6 1 0.1 0.92 -1.1 1.04 0.6 0.95 -0.8 1 0.1 0.98 -0.2 0.8 -3.1 1.06 0.9 0.95 -0.7 0.98 -0.1 0.86 -0.4 0.89 -1.2 1.07 1.2 1.04 0.3 1.05 0.7 0.89 -0.9 0.95 -0.7 1.05 0.4 1.32 4.6 1.1 1.5 1.05 0.6 1.11 0.4 0.97 -0.3 1 0.1 0.99 -0.1 1.03 0.3 1.01 0.1 1.09 0.9 1.04 0.2 0.86 -1.2 1.03 0.7 1.08 1.1 1 0 0.99 -0.1 0.96 -0.8 1.13 1.7 OUTFIT MNSQ ZSTD 0.85 -1.6 0.89 -1 0.87 -1.2 1.19 1.7 1.06 0.4 1.27 2.4 0.93 -0.6 1.07 0.7 0.92 -0.7 0.98 -0.1 0.95 -0.3 0.75 -1.4 0.82 -1.8 0.91 -0.6 0.88 -1.2 1.05 0.5 0.82 -1.9 1.34 2.8 1.36 2.1 0.8 -1.9 0.83 -1.5 1.15 1 0.84 -1.6 1.14 1.1 0.95 0.1 1.1 0.9 0.45 -2.3 0.95 -0.4 0.87 -1.1 1.03 0.3 1.03 0.3 0.95 -0.1 0.75 -1.2 0.66 -2.8 1.1 0.9 0.83 -1.2 1.15 0.6 0.49 -1.4 0.83 -0.7 1.23 1.9 1.29 1.1 0.98 -0.1 0.64 -1.6 0.96 -0.3 0.92 -0.2 1.32 2.2 1.23 1.7 1.03 0.3 1.61 1.1 0.9 -0.7 0.81 -0.2 0.97 -0.2 1.04 0.4 1.09 0.4 1.05 0.4 0.78 -0.3 0.65 -1.8 1.07 0.9 1.09 0.8 1.05 0.4 1.01 0.1 0.96 -0.5 1.21 1.5 272 PTMEA CORR. 0.56 0.54 0.55 0.44 0.46 0.41 0.53 0.47 0.55 0.53 0.5 0.54 0.56 0.51 0.54 0.49 0.55 0.4 0.38 0.57 0.55 0.46 0.57 0.43 0.29 0.49 0.5 0.54 0.58 0.51 0.55 0.38 0.48 0.62 0.53 0.59 0.37 0.32 0.49 0.51 0.27 0.46 0.57 0.53 0.47 0.39 0.47 0.48 0.08 0.53 0.23 0.48 0.4 0.23 0.42 0.2 0.49 0.44 0.36 0.39 0.42 0.49 0.33 ITEM I56M08 A561A A562A A562B A562C A562D C561AI C561AII C561AIII C561B C562 C563AI C563AII C563C I66M06 I66M08 I66M09 I66M10 A6611 A6612 A6613 A6614 A6621 A6622 C661A C661B C662C C662D C662F C663A C663B C663C C663D C664A C664B C664C C665 RAW SCORE 189 222 227 166 183 218 263 149 116 246 161 120 169 213 242 243 194 132 161 249 182 175 243 173 205 246 234 181 60 209 250 255 225 212 204 201 227 COUNT 329 304 305 298 304 304 305 159 295 305 298 128 298 304 315 278 309 284 171 317 317 317 317 317 317 317 283 317 277 317 317 317 317 317 317 221 283 MEASURE 0.33 -1.51 -1.62 -0.41 -0.72 -1.42 -2.63 -4.51 0.5 -2.1 -0.31 -4.74 -0.46 -1.31 -1 -2.02 -0.14 0.73 -2.35 0.02 1.36 1.49 0.16 1.52 0.94 0.09 -0.47 1.38 3.75 0.86 0 -0.13 0.55 0.81 0.96 -1.61 -0.27 MODEL S.E. 0.13 0.15 0.15 0.14 0.14 0.15 0.19 0.36 0.14 0.17 0.13 0.4 0.14 0.15 0.15 0.19 0.13 0.14 0.33 0.16 0.13 0.13 0.15 0.13 0.14 0.15 0.17 0.13 0.16 0.14 0.16 0.16 0.14 0.14 0.14 0.25 0.17 INFIT MNSQ ZSTD 0.92 -1.6 0.84 -2.2 0.86 -1.9 0.91 -1.5 0.92 -1.3 1.19 2.5 0.96 -0.3 0.9 -0.3 1.14 2.2 1.19 2 1.08 1.4 0.86 -0.4 1.16 2.6 0.97 -0.3 1.03 0.4 0.95 -0.4 0.84 -3.1 0.88 -2 0.93 -0.2 1.06 0.8 1.07 1.1 1.08 1.4 0.8 -2.8 0.72 -5.3 0.87 -2.2 1 0 0.78 -2.4 1.04 0.8 1.3 3.2 0.99 -0.1 1.22 2.5 1.02 0.2 0.97 -0.4 1.07 1.1 1 0 1.03 0.2 0.96 -0.4 OUTFIT MNSQ ZSTD 0.89 -1.5 0.65 -2.7 0.76 -1.6 0.95 -0.5 0.85 -1.5 1.44 2.8 0.78 -0.8 0.53 -1.1 1.21 2.1 1.64 2.8 1.09 1.1 0.59 -0.8 1.17 2 0.9 -0.7 1.04 0.3 0.74 -1.2 0.73 -3 0.86 -1.7 0.49 -1.5 1.19 1 1.02 0.2 1.04 0.5 0.63 -2.3 0.59 -4.9 0.88 -1 1.07 0.4 0.57 -2.4 1.02 0.3 1.44 2.4 0.97 -0.2 1.16 0.8 0.86 -0.6 0.89 -0.8 1 0 0.97 -0.2 1.23 0.8 1.07 0.5 273 PTMEA CORR. 
0.52 0.59 0.57 0.6 0.6 0.41 0.45 0.32 0.52 0.36 0.52 0.31 0.48 0.54 0.38 0.34 0.58 0.6 0.24 0.39 0.51 0.51 0.56 0.69 0.58 0.44 0.5 0.52 0.4 0.51 0.33 0.42 0.51 0.48 0.52 0.2 0.41 Appendix A7 Confidence level items Rasch statistics TEM CC35M01 CC35M02 CC35M03 CC35M04 CC35M05 CA35M06 CA35M07 CA35M08 CA45MA146 CA45MA246 CA45MA346 CA45MA4 CC45MA5 CC45MA6 CC45MA7 CC45MA8 CA45MB146 CA45MB246 CA45MB346 CA45MB4 CC45MB5 CC45MB6 CC45MB7 CC45MB8 CC55M01 CC55M02 CC55M03 CC55M04 CC55M05 CA55M06 CA55M07 CA55M08 CI65M0166 CI65M0266 CI65M0366 CI65M0466 CI65M0566 CI65M06 CI65M0766 CI65M08 CI65M09 CI65M10 CI65M1166 CI65M1266 CA651A663 CA651B CA652A CA652B561B CA653 RAW SCORE 412 168 301 299 440 538 431 748 829 748 556 520 409 209 357 358 327 321 250 187 153 163 165 141 464 393 536 445 386 571 467 524 768 773 502 578 654 280 518 324 433 396 649 746 350 267 230 465 235 COUNT 264 130 221 220 257 294 259 288 392 387 357 214 215 158 212 216 154 155 153 81 80 82 74 80 262 244 253 259 237 254 255 251 338 334 320 320 329 187 321 194 193 192 312 302 186 118 128 224 131 MEASURE 0.59 1.99 1.33 1.35 0.25 -0.13 0.34 -1.41 -0.49 -0.18 0.73 -0.93 -0.04 1.77 0.38 0.47 -0.35 -0.26 0.6 -0.73 -0.06 -0.2 -0.67 0.22 0.21 0.67 -0.43 0.32 0.62 -0.69 0.09 -0.39 -0.7 -0.76 0.76 0.15 -0.2 1.03 0.62 0.55 -0.6 -0.25 -0.34 -1.03 -0.05 -0.64 0.21 -0.36 0.21 MODEL S.E. 0.1 0.18 0.12 0.13 0.09 0.08 0.09 0.07 0.07 0.07 0.08 0.09 0.09 0.16 0.1 0.1 0.1 0.11 0.12 0.14 0.15 0.15 0.15 0.16 0.09 0.1 0.08 0.09 0.1 0.08 0.09 0.08 0.07 0.07 0.09 0.08 0.07 0.12 0.09 0.11 0.09 0.1 0.07 0.07 0.1 0.12 0.13 0.09 0.12 INFIT MnSQ ZSTD 1.05 0.5 0.98 0 1.08 0.7 0.91 -0.7 0.76 -2.8 0.79 -2.6 0.87 -1.4 0.83 -2.4 0.86 -2.1 1.22 2.9 0.84 -2 0.82 -2.2 0.93 -0.7 0.92 -0.4 0.85 -1.5 0.93 -0.6 0.84 -1.5 1.42 3.5 0.74 -2.2 0.68 -2.5 0.7 -2.1 0.94 -0.4 1.18 1.2 0.83 -1.1 0.88 -1.4 0.79 -2.2 1.25 2.8 0.8 -2.3 0.95 -0.4 0.93 -0.9 1.03 0.4 1.24 2.6 1.05 0.7 1.11 1.6 1.54 5.2 0.97 -0.3 1.06 0.8 0.95 -0.4 0.76 -3 0.9 -0.9 1.08 0.9 1.06 0.6 1.24 3 1.34 4.2 1.09 0.9 1.34 2.6 1.1 0.8 0.91 -1.1 0.92 -0.6 OUTFIT PTMEA CORR. 
MnSQ ZSTD 1.05 0.5 0.55 0.81 -1 0.53 0.89 -0.8 0.53 0.81 -1.4 0.55 0.76 -2.6 0.65 0.81 -2.3 0.68 0.87 -1.3 0.62 0.82 -2.4 0.75 0.91 -1.3 0.67 1.18 2.2 0.61 0.78 -2.4 0.6 0.81 -2.1 0.7 0.97 -0.3 0.61 0.86 -0.8 0.49 0.79 -1.8 0.61 0.93 -0.5 0.57 0.9 -0.8 0.67 1.41 3.1 0.55 0.69 -2.2 0.66 0.72 -2 0.72 0.69 -1.9 0.69 0.96 -0.2 0.64 1.12 0.8 0.64 0.83 -0.9 0.66 0.96 -0.3 0.64 0.82 -1.6 0.63 1.21 2.2 0.65 0.76 -2.4 0.68 0.92 -0.6 0.62 0.94 -0.7 0.7 0.91 -0.8 0.67 1.26 2.6 0.64 1.16 1.9 0.64 1.16 1.9 0.65 1.45 3.8 0.51 1.07 0.8 0.59 1.04 0.5 0.63 0.85 -1 0.59 0.76 -2.5 0.64 0.9 -0.8 0.62 1.07 0.7 0.64 1.12 1.1 0.62 1.14 1.6 0.64 1.3 3.4 0.66 1.1 0.9 0.59 1.28 2 0.59 1.1 0.7 0.56 0.85 -1.5 0.65 1.01 0.1 0.57 274 TEM CC651A662A CC651B662B CC651C CC651D662E CC651E662G CC652A CC652B CC652C CC652D CC653A CC653B CC654 CA85M0184 CA85M0284 CA85M0384 CA85M0484 CA85M0584 CC85M0684 CC85M0784 CC85M0884 CC85M0984 CC85M1084 CI95M01 CI95M02 CI95M03 CI95M04 CI95M05 CI95M06 CI95M07 CI95M08 CA951 CA952A CA952B CA952C CA952D CA953A CA953B CA953C CC951 CC952 CC953A CC953B CC953CI CC953CII CC953D CC954 CC955 CI115M01 CI115M02 CI115M03 CI115M04 CI115M05 CI115M06 RAW SCORE 334 331 233 337 345 196 216 214 249 175 230 208 1373 1344 1256 1119 807 1409 1043 1196 1037 1355 420 353 469 385 511 469 510 489 327 359 364 354 344 279 270 307 298 321 230 270 243 268 267 278 204 346 320 358 431 350 401 COUNT 205 189 127 181 176 119 122 120 107 115 118 107 572 570 564 568 546 567 567 568 562 562 205 206 206 205 196 203 203 199 145 157 156 142 137 148 147 138 152 154 151 146 148 134 139 152 134 174 172 169 163 172 175 MEASURE 0.53 0.26 0.12 0.02 -0.18 0.57 0.31 0.28 -0.7 0.94 -0.04 -0.04 -0.71 -0.65 -0.43 0.01 1.16 -0.83 0.28 -0.22 0.25 -0.73 -0.11 0.54 -0.51 0.19 -1.02 -0.56 -0.87 -0.79 -0.52 -0.6 -0.65 -0.87 -0.9 0.13 0.24 -0.46 0.02 -0.21 0.99 0.26 0.68 -0.08 0.09 0.24 0.97 0.01 0.25 -0.21 -1.02 -0.09 -0.52 MODEL S.E. 0.11 0.11 0.12 0.1 0.1 0.14 0.13 0.13 0.13 0.15 0.13 0.13 0.05 0.05 0.05 0.06 0.07 0.05 0.06 0.06 0.06 0.05 0.09 0.1 0.09 0.1 0.09 0.09 0.09 0.09 0.11 0.1 0.1 0.11 0.11 0.11 0.12 0.11 0.11 0.11 0.13 0.12 0.13 0.12 0.12 0.11 0.14 0.1 0.1 0.1 0.1 0.1 0.09 INFIT MnSQ ZSTD 0.7 -3.1 0.7 -3.1 0.68 -2.8 0.81 -1.9 0.68 -3.5 0.68 -2.5 0.75 -2 0.72 -2.3 0.85 -1.2 0.87 -0.8 1.02 0.2 1.28 1.9 1.09 1.7 1.12 2.1 1.2 3.5 1.11 1.9 1.44 5.3 1.22 3.9 1.01 0.2 1.06 1.1 1.08 1.2 1.2 3.5 1.6 5.5 1.19 1.8 0.8 -2.4 1.09 0.9 1.34 3.4 1.27 2.8 1 0 1.22 2.4 1.06 0.6 0.8 -2.1 0.86 -1.4 0.92 -0.7 1.05 0.5 1.01 0.2 0.81 -1.7 0.9 -0.9 0.74 -2.5 0.68 -3.3 1.11 0.8 1.01 0.2 1.02 0.2 0.97 -0.2 0.98 -0.2 0.85 -1.3 1.16 1.1 1.38 3.3 0.99 0 1.3 2.7 1.36 3.3 1 0 1.05 0.6 OUTFIT PTMEA CORR. 
MnSQ ZSTD 0.76 -2.1 0.6 0.69 -2.8 0.61 0.65 -2.6 0.65 0.79 -1.9 0.62 0.69 -2.9 0.67 0.65 -2.4 0.62 0.71 -2 0.63 0.7 -2.1 0.63 0.86 -1 0.68 0.76 -1.4 0.57 1.11 0.8 0.59 1.26 1.6 0.54 1.2 3.2 0.62 1.08 1.3 0.66 1.14 2.2 0.66 1.07 1 0.63 1.13 1.4 0.56 1.32 4.9 0.58 0.97 -0.4 0.64 1.07 1 0.64 1.03 0.4 0.63 1.14 2.3 0.67 1.55 4.6 0.54 1.08 0.7 0.58 0.86 -1.5 0.67 1.01 0.2 0.61 1.36 3.3 0.6 1.25 2.3 0.6 1.02 0.3 0.64 1.21 2.1 0.61 1.13 1.1 0.64 0.78 -2 0.67 0.92 -0.7 0.65 0.91 -0.7 0.65 1.05 0.5 0.64 0.93 -0.5 0.64 0.74 -2.1 0.68 0.86 -1.1 0.67 0.89 -0.8 0.67 0.66 -3.1 0.7 1.02 0.2 0.61 0.92 -0.5 0.66 0.91 -0.5 0.64 0.92 -0.6 0.65 0.91 -0.6 0.66 0.79 -1.6 0.67 0.94 -0.3 0.63 1.28 2.3 0.52 1.17 1.4 0.52 1.28 2.4 0.51 1.37 3.1 0.55 0.96 -0.3 0.59 1.17 1.6 0.52 275 TEM CI115M07 CI115M08 CI115M09 CI115M10 CI115M11 CI115M12 CI115M13 CI115M14 CI115M15 CA1151I CA1151II CA1152A CA1152B CA1152C CA1153A CA1153B CA1154A CA1154BI CA1154BII CA1154BIII CA1155AI CA1155AII CA1155BI CA1155BII CA1155BIII CA1156A CA1156B CC1151A CC1151B CC1152A CC1152B CC1153A CC1153B CC1154A CC1154B CC1154CI CC1154CII CC1155 CC1156A CC1156B CC1157A CC1157B CI036M01 CI036M02 CI036M03 CI036M04 CI036M05 CI036M06 CI036M07 CI036M08 CA36A CA36B CA36C RAW SCORE 335 345 386 352 327 380 308 342 425 231 248 241 271 277 237 245 236 240 237 242 227 188 213 235 208 245 210 227 243 226 267 233 255 229 230 263 244 228 227 232 181 196 382 165 373 240 363 461 510 393 192 275 280 COUNT 175 172 171 166 171 166 163 162 161 131 131 122 115 114 116 112 119 107 101 100 111 98 103 99 97 103 100 116 118 120 114 110 102 108 109 113 105 113 108 100 104 92 220 130 218 180 221 228 233 224 128 140 124 MEASURE 0.11 -0.05 -0.47 -0.22 0.14 -0.51 0.19 -0.21 -1.04 0.38 0.12 -0.01 -0.68 -0.84 -0.16 -0.41 -0.05 -0.5 -0.65 -0.77 -0.17 0.07 -0.21 -0.72 -0.35 -0.69 -0.26 0.06 -0.14 0.16 -0.62 -0.21 -0.78 -0.26 -0.26 -0.6 -0.61 -0.1 -0.29 -0.61 0.39 -0.31 0.26 2.07 0.31 1.57 0.46 -0.34 -0.65 0.2 0.89 -0.27 -0.84 MODEL S.E. 0.1 0.1 0.1 0.1 0.1 0.1 0.11 0.1 0.1 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.13 0.12 0.14 0.13 0.13 0.13 0.12 0.13 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.13 0.14 0.13 0.1 0.18 0.1 0.14 0.1 0.09 0.08 0.1 0.14 0.11 0.12 INFIT MnSQ ZSTD 1.02 0.2 1.18 1.7 1.14 1.4 1.04 0.4 1.35 3 1.3 2.9 1.26 2.2 1.17 1.6 1.22 2.1 1.15 1.1 0.76 -2.1 1.2 1.6 1.11 1 0.78 -2 0.91 -0.7 0.81 -1.6 0.89 -0.9 0.73 -2.4 1.01 0.1 0.98 -0.1 1.45 3.2 1.25 1.7 0.87 -1 0.79 -1.7 0.99 0 1.02 0.2 0.69 -2.6 0.9 -0.8 0.87 -1 0.88 -0.9 0.99 0 0.91 -0.7 1.09 0.8 0.97 -0.2 0.95 -0.3 0.72 -2.5 0.99 0 0.91 -0.7 0.71 -2.5 1.12 1 1.06 0.5 0.93 -0.5 1.03 0.3 1.06 0.4 0.85 -1.6 0.9 -0.7 0.71 -3.1 1.21 2.2 0.92 -0.9 1.03 0.3 1.28 1.8 0.89 -0.9 0.67 -3.2 OUTFIT PTMEA CORR. 
MnSQ ZSTD 1.02 0.2 0.56 1.2 1.7 0.53 1.08 0.8 0.57 1.01 0.1 0.58 1.47 3.6 0.5 1.24 2.2 0.53 1.15 1.2 0.55 1.13 1.1 0.54 1.24 2.2 0.54 1.14 1 0.55 0.77 -1.8 0.63 1.14 1 0.57 1.13 1 0.59 0.79 -1.8 0.65 0.91 -0.7 0.6 0.82 -1.4 0.63 0.87 -0.9 0.61 0.75 -2 0.66 1 0.1 0.62 0.92 -0.6 0.64 1.38 2.5 0.54 1.24 1.5 0.59 0.82 -1.3 0.64 0.76 -1.9 0.68 0.88 -0.8 0.64 0.97 -0.1 0.63 0.66 -2.5 0.68 0.96 -0.3 0.61 1.08 0.6 0.59 0.86 -1 0.63 0.97 -0.2 0.6 0.9 -0.7 0.62 1.19 1.4 0.58 0.89 -0.7 0.62 0.93 -0.5 0.63 0.75 -2.1 0.66 1.04 0.3 0.59 1.06 0.5 0.6 0.76 -1.8 0.66 1.09 0.7 0.59 0.92 -0.4 0.62 0.89 -0.7 0.64 1.14 1.2 0.51 0.98 0 0.33 0.84 -1.4 0.58 0.78 -1.4 0.47 0.72 -2.6 0.61 1.27 2.5 0.56 0.96 -0.4 0.66 0.95 -0.4 0.54 1.08 0.5 0.41 0.86 -1.1 0.61 0.68 -2.8 0.73 276 TEM CA36D CA36E CC361A CC361B CC361C CC362A CC362B CC363A CC363B CC364A CC364BI CC364BII CA46MA4 CC46MA5 CC46MA6 CC46MA7 CC46MA8 CA46MB4 CC46MB5 CC46MB6 CC46MB7 CC46MB8 CI56M01 CI56M02 CI56M03 CI56M04 CI56M05 CI56M06 CI56M07 CI56M08 CA561A CA562A CA562B CA562C CA562D CC561AI CC561AII CC561AIII CC561B CC562 CC563AI CC563AII CC563C CI66M06 CI66M08 CI66M09 CI66M10 CA6611 CA6612 CA6613 CA6614 CA6621 CA6622 RAW SCORE 272 239 220 97 227 260 260 226 281 202 308 252 402 299 303 275 228 182 152 87 146 121 340 290 288 296 261 357 309 279 198 209 192 202 181 187 164 190 172 203 120 195 173 234 215 256 284 114 117 124 97 97 89 COUNT 115 105 150 79 144 143 150 142 120 131 141 124 171 170 171 173 148 73 71 65 73 72 171 168 165 167 163 163 166 168 98 106 96 94 89 107 103 93 102 93 89 91 86 125 121 129 116 69 61 61 56 60 51 MEASURE -1 -0.87 0.93 2.28 0.66 0.01 0.22 0.56 -1.04 0.7 -0.65 -0.41 -0.98 0.05 0.03 0.43 0.7 -0.89 -0.31 1.69 -0.05 0.6 -0.16 0.39 0.33 0.27 0.71 -0.54 0.07 0.55 -0.27 -0.15 -0.25 -0.47 -0.37 0.32 0.66 -0.28 0.43 -0.53 1.61 -0.53 -0.33 -0.07 0.16 -0.36 -1.15 0.44 -0.22 -0.52 0.13 0.52 0 MODEL S.E. 0.12 0.13 0.14 0.25 0.13 0.12 0.12 0.13 0.12 0.14 0.11 0.12 0.1 0.11 0.11 0.12 0.13 0.15 0.16 0.24 0.16 0.18 0.1 0.12 0.11 0.11 0.12 0.1 0.11 0.12 0.13 0.13 0.14 0.13 0.14 0.14 0.15 0.14 0.15 0.13 0.21 0.14 0.14 0.12 0.13 0.12 0.12 0.18 0.18 0.17 0.2 0.2 0.21 INFIT MnSQ ZSTD 0.87 -1.1 0.64 -3.2 0.9 -0.7 0.98 0 1.11 0.8 1.18 1.5 0.89 -0.9 1.02 0.2 0.93 -0.6 0.88 -0.8 0.8 -1.9 0.91 -0.7 0.88 -1.2 0.77 -2.2 0.88 -1.1 0.85 -1.3 0.81 -1.5 0.77 -1.6 0.98 -0.1 1.16 0.7 0.87 -0.8 0.9 -0.5 0.99 0 0.97 -0.2 1.19 1.6 0.95 -0.4 1 0.1 1.25 2.2 0.85 -1.4 0.89 -0.9 0.93 -0.5 0.88 -0.9 0.74 -2.1 0.87 -1 1.35 2.3 0.71 -2.2 1.03 0.3 1.1 0.8 0.83 -1.1 0.92 -0.6 1.22 1.1 0.94 -0.4 0.92 -0.5 0.87 -1 1.15 1.1 0.79 -1.8 1.4 3 1.15 0.8 1.04 0.3 1.09 0.6 0.89 -0.5 0.83 -0.8 0.92 -0.3 OUTFIT PTMEA CORR. 
MnSQ ZSTD 0.82 -1.4 0.72 0.62 -3 0.75 0.9 -0.6 0.39 0.78 -0.8 0.3 1.07 0.5 0.46 1.25 1.8 0.5 1 0.1 0.49 1.05 0.4 0.41 0.92 -0.6 0.67 0.85 -1 0.45 0.78 -1.9 0.67 0.94 -0.4 0.62 0.91 -0.9 0.72 0.79 -1.7 0.66 0.88 -1 0.64 0.83 -1.2 0.62 0.8 -1.4 0.58 0.72 -1.9 0.77 1.2 1.1 0.64 0.77 -0.8 0.46 0.78 -1.3 0.68 0.81 -0.8 0.61 1.15 1.2 0.67 0.91 -0.6 0.65 1.04 0.4 0.63 0.99 0 0.65 0.92 -0.5 0.64 1.38 3 0.65 0.83 -1.3 0.7 0.87 -0.9 0.66 0.88 -0.7 0.66 0.94 -0.4 0.63 0.71 -2 0.67 0.84 -1.1 0.67 1.28 1.7 0.59 0.72 -1.9 0.61 0.98 0 0.52 1.04 0.3 0.59 0.75 -1.5 0.6 0.93 -0.4 0.67 1.22 1 0.46 1.14 0.9 0.61 1.15 0.9 0.59 1.33 2.1 0.59 0.97 -0.1 0.59 0.76 -1.8 0.69 1.39 2.6 0.67 0.98 0 0.58 1.09 0.5 0.55 1.01 0.1 0.62 0.77 -1 0.64 0.87 -0.5 0.67 0.96 -0.1 0.59 277 TEM CC661A CC661B CC662C CC662D CC662F CC663A CC663B CC663C CC663D CC664A CC664B CC664C CC665 RAW SCORE 101 95 114 110 105 85 80 83 94 103 73 79 61 COUNT 65 62 59 57 56 51 51 50 53 58 53 55 47 MEASURE 0.62 0.75 -0.2 -0.2 -0.15 0.51 0.71 0.29 0.08 0.24 1.39 1.16 1.79 MODEL S.E. 0.2 0.21 0.18 0.18 0.19 0.21 0.23 0.22 0.2 0.19 0.26 0.24 0.3 INFIT MnSQ ZSTD 0.77 -1.2 1 0.1 0.69 -1.8 0.59 -2.6 0.77 -1.2 1.11 0.6 0.98 0 0.8 -0.9 0.57 -2.4 0.88 -0.6 0.9 -0.3 1.1 0.5 1.24 0.9 OUTFIT PTMEA CORR. MnSQ ZSTD 0.79 -0.9 0.61 0.87 -0.4 0.59 0.64 -2 0.67 0.59 -2.3 0.68 0.7 -1.5 0.63 0.9 -0.3 0.59 0.8 -0.7 0.61 0.84 -0.6 0.63 0.53 -2.3 0.66 0.87 -0.5 0.62 0.9 -0.2 0.54 1.19 0.7 0.51 1.07 0.3 0.51 278 Appendix A8 Item analysis data Item A6622 A35M06 A651B C1151A A55M06 A651A C1157B C85M0884 C1152B C1151B I65M09 A1152B A45MB146 A36E C651C A953C C1152A A95M01 A35M08 C662D C363B A652B A36M06 I65M10 C95M08 C951 I65M0466 C36M03 A562B A1154BII C115M02 C652D C36M05 A6613 A45MA146 C115M01 C66M09 A45MB246 A36M07 C45MA7 C953D A953A C45MA5 Diff 1.52 0.26 2.97 -0.14 4.56 1.1 0.28 1.08 2.64 0.8 1.72 2.93 0.34 2.28 0.27 2 -1.37 -0.61 2.25 1.38 3.94 2.81 -0.57 1.73 1.52 0.86 0.14 -0.38 -0.41 2.66 1 2.81 -0.31 1.36 0.2 0.67 -0.14 0.25 1.19 0.13 0.13 -1.04 -0.7 Adapted discrimination 0.048 0.192 0.213 0.336 -0.035 0.295 0.295 0.378 0.357 0.378 0.110 0.357 0.275 0.254 0.378 0.295 0.439 0.357 0.069 0.398 0.295 0.295 0.275 0.213 0.233 0.419 0.522 0.460 0.233 0.295 0.316 0.233 0.357 0.419 0.357 0.481 0.275 0.316 0.419 0.275 0.336 0.604 0.481 Adapted confidence deviation 0.495 0.271 0.291 0.244 0.537 0.385 0.398 0.258 0.247 0.266 0.351 0.255 0.416 0.447 0.360 0.249 0.352 0.412 0.842 0.326 0.274 0.465 0.570 0.352 0.524 0.392 0.358 0.381 0.477 0.229 0.583 0.230 0.502 0.357 0.542 0.352 0.508 0.367 0.481 0.523 0.313 0.315 0.377 Adapted expert opinion deviation 0.251 0.267 0.240 0.285 0.550 0.236 0.239 0.299 0.342 0.329 0.608 0.373 0.301 0.303 0.268 0.492 0.272 0.303 0.355 0.351 0.574 0.360 0.307 0.609 0.398 0.323 0.280 0.311 0.461 0.713 0.286 0.843 0.314 0.390 0.290 0.343 0.406 0.501 0.289 0.402 0.557 0.308 0.346 QI_3 0.069 0.076 0.079 0.107 0.112 0.119 0.123 0.125 0.128 0.135 0.138 0.138 0.140 0.141 0.144 0.148 0.160 0.164 0.165 0.166 0.177 0.178 0.180 0.181 0.183 0.185 0.188 0.189 0.190 0.191 0.191 0.192 0.195 0.196 0.197 0.197 0.198 0.198 0.200 0.201 0.202 0.205 0.207 Component 4 1 1 3 1 1 3 3 2 2 3 3 2 2 1 2 6 3 2 3 2 5 2 6 3 7 6 3 3 1 3 2 7 1 2 1 7 3 2 3 7 2 2 Good/poor) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 279 A45MA4 C651D662E C561AIII C954 A45MB4 C1154A C1157A C55M03 A36M08 C561AI C35M05 C56M06 A953B C651B662B C1153B A95M03 A95M04 C45MB8 C1154B A85M0484 C651E662G A55M07 C362A A45MA346 A35M07 A951 C664A A952D C652C C1154CI C95M06 
A6612 A85M0184 C46MA7 I65M0566 A1156A A653 C661A A952C C1153A C115M05 C953CII C35M01 A45MA246 C651A662A C663D C115M08 A1153A C115M03 A1152A A55M08 1.11 0.1 0.5 -1.2 1.56 0.39 -1.45 0.9 0.98 -2.63 -0.32 0.79 -1.22 -0.33 1.78 0.84 -0.3 -1.94 0.91 0.41 0.8 -0.76 0.99 -0.85 -0.89 0.67 0.81 1.8 0.1 0.31 -0.16 0.02 1.22 -0.23 0.7 1.2 -0.15 0.94 0.61 0.76 0.03 0.08 -0.36 -0.5 -0.9 0.55 -0.19 -0.03 0.98 -0.58 0.15 0.275 0.481 0.398 0.522 0.522 0.357 0.522 0.563 0.336 0.543 0.625 0.460 0.378 0.543 0.419 0.357 0.522 0.604 0.460 0.378 0.378 0.790 0.419 0.439 0.481 0.439 0.481 0.522 0.419 0.336 0.398 0.666 0.398 0.378 0.439 0.378 0.543 0.275 0.378 0.316 0.666 0.357 0.460 0.378 0.357 0.419 0.584 0.604 0.522 0.439 0.378 0.698 0.257 0.337 0.264 0.473 0.342 0.249 0.374 0.544 0.460 0.349 0.473 0.267 0.354 0.470 0.443 0.309 0.410 0.250 0.305 0.238 0.294 0.408 0.601 0.312 0.480 0.542 0.553 0.445 0.602 0.656 0.379 0.519 0.495 0.244 0.509 0.350 0.840 0.743 0.240 0.283 0.267 0.587 0.443 0.448 0.381 0.294 0.345 0.248 0.334 0.479 0.296 0.487 0.476 0.449 0.247 0.537 0.483 0.318 0.371 0.262 0.304 0.324 0.655 0.371 0.369 0.460 0.449 0.284 0.593 0.623 0.736 0.290 0.455 0.277 0.508 0.372 0.290 0.251 0.432 0.382 0.287 0.301 0.397 0.441 0.680 0.430 0.432 0.314 0.268 0.919 0.424 0.796 0.309 0.524 0.543 0.554 0.492 0.418 0.623 0.601 0.504 0.207 0.209 0.210 0.212 0.213 0.215 0.218 0.221 0.221 0.222 0.223 0.225 0.227 0.227 0.228 0.228 0.231 0.232 0.232 0.234 0.235 0.236 0.237 0.239 0.239 0.239 0.241 0.242 0.242 0.243 0.244 0.246 0.247 0.247 0.248 0.248 0.249 0.251 0.252 0.254 0.256 0.256 0.257 0.258 0.259 0.261 0.261 0.262 0.264 0.265 0.265 7 2 2 3 1 3 3 2 1 2 2 7 2 3 2 5 2 3 3 6 3 2 4 3 1 6 5 1 3 3 7 7 1 3 6 4 2 5 2 3 2 7 5 2 5 7 3 1 3 2 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 280 C1156A A36B A1155AII C85M0784 A562A A652A I65M0766 C952 C115M07 -0.22 -0.79 0.16 -1.17 -1.62 -0.33 -1.04 0.7 -1.12 0.295 0.481 0.336 0.687 0.295 0.501 0.625 0.439 0.666 0.472 0.559 0.304 0.230 0.620 0.318 0.488 0.251 0.343 0.617 0.336 0.804 0.514 0.487 0.574 0.308 0.779 0.416 A46MA4 C115M06 C663A A561A A1153B I65M0266 A952A C652B C653B C46MA8 C46MB5 A95M02 A36C C652A A1155AI C654 A1156B A6621 C1156B A56M01 C56M05 C46MB8 C662C A36A A85M0384 C56M04 C85M0984 C36M01 C46MB7 A56M03 A85M05 C953CI A1152C C66M10 C364BI A45MB346 C55M01 A562C A1155BII C55M04 C95M07 C563AI 1.41 1.01 0.86 -1.51 0.34 0.98 0.63 0.2 -1.07 -3.18 -0.41 -3.22 0.02 -0.84 -0.3 0.29 0.42 0.16 1.46 3.07 -0.94 -2.96 -0.47 -1.7 -0.08 -1.2 -0.08 1.85 0.73 -0.71 -2.31 -1.83 3.83 0.73 4.19 -1.18 -0.5 -0.72 2.23 1.5 1.49 -4.74 0.501 0.584 0.419 0.254 0.584 0.357 0.398 0.378 0.666 0.996 0.646 0.769 0.192 0.625 0.357 0.481 0.501 0.316 0.336 0.460 0.604 1.058 0.439 0.687 0.398 0.666 0.563 0.584 0.604 0.728 0.687 0.563 0.584 0.233 0.501 0.666 0.728 0.233 0.522 0.336 0.481 0.831 0.680 0.420 0.746 0.687 0.459 0.598 0.545 0.484 0.443 0.284 0.520 0.406 0.826 0.487 0.400 0.248 0.337 0.629 0.405 0.655 0.571 0.317 0.452 0.565 0.548 0.242 0.391 0.742 0.319 0.337 0.652 0.391 0.300 0.924 0.501 0.449 0.288 0.691 0.347 0.723 0.587 0.545 0.263 0.409 0.295 0.519 0.379 0.475 0.490 0.577 0.349 0.322 0.314 0.333 0.536 0.361 0.755 0.819 0.663 0.561 0.799 0.389 0.335 0.298 0.613 0.287 0.569 0.657 0.571 0.256 0.630 0.501 0.249 0.589 0.691 0.500 0.547 0.450 0.587 0.703 0.736 0.546 0.510 0.273 Median QI 0.265 0.267 0.267 0.272 0.272 0.273 0.281 0.281 0.281 5 2 1 7 4 1 7 5 1 1 1 1 1 1 1 1 1 1 0.282 0.284 0.284 0.287 0.287 0.289 0.294 0.295 0.295 0.301 
0.304 0.305 0.305 0.306 0.309 0.310 0.314 0.315 0.315 0.318 0.320 0.323 0.323 0.324 0.328 0.328 0.332 0.334 0.335 0.337 0.338 0.338 0.340 0.344 0.346 0.347 0.349 0.351 0.356 0.356 0.358 0.359 3 7 3 5 1 6 3 2 3 2 3 3 2 2 1 7 2 1 2 7 3 2 1 2 3 2 7 2 6 4 4 3 2 3 2 2 6 3 1 3 4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 281 C663C A46MB4 A36D A6611 A1151I C1154CII C66M06 C563AII C55M02 C45MA8 C46MA5 C361C C1155 A56M02 C45MB5 I65M0366 A1151II C364BII A85M0284 I65M1166 A562D C562 C115M04 C66M08 C55M05 C653A A1154A A6614 A1155BIII C363A A1155BI C663B C36M02 C65M08 C364A A1154BI C361A A1154BIII C35M04 C362B C955 C561AII I65M06 C953A I65M0166 C56M07 C662F C35M03 A952B C85M0684 I65M1266 -0.13 0.45 1.27 -2.35 -2.92 1.36 -1 -0.46 -0.13 -2.98 2.47 -1.02 -1.42 0.77 1.91 -1.1 -2.08 0.48 0.24 0.18 -1.42 -0.31 1.07 -2.02 0.12 -1.93 -1.05 1.49 0.56 -0.39 -0.25 0 -5.05 -1.04 0.02 -0.43 -1.68 0.35 -2.27 -2.09 -2.09 -4.51 -1.36 -5.56 0.27 2.13 3.75 -0.97 -1.15 1.96 1.62 0.604 0.481 0.378 0.975 0.687 0.646 0.687 0.481 0.522 1.058 0.481 0.460 0.687 0.563 0.749 0.625 0.646 0.666 0.728 0.439 0.625 0.398 0.625 0.769 0.419 0.831 0.501 0.419 0.625 0.522 0.563 0.790 0.872 0.749 0.378 0.295 0.707 0.316 0.790 0.913 0.769 0.810 0.810 1.161 0.707 0.790 0.646 0.563 0.852 0.687 0.460 0.411 0.786 0.720 0.324 0.468 0.422 0.452 0.688 0.686 0.414 0.700 0.520 0.548 0.643 0.521 0.578 0.507 0.434 0.650 0.437 0.743 0.661 0.770 0.467 0.694 0.561 0.446 0.584 0.377 0.560 0.420 0.738 0.822 0.437 0.734 0.661 0.598 0.717 0.796 0.436 0.554 0.549 0.727 0.497 0.681 0.654 0.783 1.013 0.897 0.475 0.679 0.577 0.367 0.522 0.410 0.470 0.561 0.496 0.466 0.430 0.300 0.470 0.673 0.424 0.453 0.409 0.459 0.561 0.657 0.395 0.945 0.424 0.742 0.415 0.568 0.696 0.431 0.896 0.827 0.848 0.745 0.882 0.348 0.239 0.674 0.789 1.048 0.605 0.964 0.394 0.643 0.643 0.613 0.457 0.434 0.596 0.551 0.609 0.521 0.370 0.935 0.972 0.361 0.365 0.366 0.367 0.374 0.378 0.379 0.379 0.380 0.381 0.386 0.389 0.390 0.393 0.394 0.395 0.422 0.438 0.441 0.442 0.452 0.455 0.459 0.460 0.461 0.462 0.465 0.465 0.470 0.475 0.478 0.482 0.486 0.488 0.500 0.518 0.525 0.529 0.543 0.548 0.553 0.553 0.559 0.562 0.567 0.568 0.595 0.603 0.611 0.611 0.615 6 3 2 1 2 1 5 4 2 2 4 2 3 1 2 3 1 1 1 7 4 2 2 2 7 1 1 3 3 2 3 3 2 1 5 2 1 6 2 1 3 5 7 3 7 4 7 3 3 4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 282 C95M05 C563C C45MA6 C56M08 C661B C85M1084 C664C C953B C46MB6 C664B C665 Average diff Median diff 1.3 -1.31 -2.84 0.33 0.09 0.53 -1.61 2.4 -2.24 0.96 -0.27 0.0617 0.13 0.398 0.357 0.852 0.398 0.563 0.460 1.058 0.831 0.996 0.398 0.625 0.729 0.695 0.998 0.681 0.782 0.656 0.776 0.839 1.047 1.399 1.469 1.007 1.144 0.333 1.112 0.797 1.090 0.612 0.865 0.544 0.891 0.758 0.617 0.628 0.634 0.637 0.655 0.658 0.842 0.927 0.933 0.935 1.085 Median QI 3 2 3 3 3 7 2 3 7 5 3 0 0 0 0 0 0 0 0 0 0 0 0.282 283
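Note on the Rasch statistics reported in Appendices A5 to A7. The item measures, model standard errors, INFIT/OUTFIT mean squares and point-measure correlations tabulated above are estimates under the dichotomous Rasch model. The following is only a brief sketch in standard textbook form, not the exact computation performed by the Rasch analysis software; the symbols \(\beta_n\) (person ability), \(\delta_i\) (item difficulty), \(E_{ni}\) (expected score), \(W_{ni}\) (model variance) and \(z_{ni}\) (standardised residual) are introduced here purely for illustration:

\[
P(X_{ni}=1 \mid \beta_n, \delta_i) = \frac{e^{\beta_n - \delta_i}}{1 + e^{\beta_n - \delta_i}},
\qquad
z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}},
\quad E_{ni} = P(X_{ni}=1), \; W_{ni} = E_{ni}(1 - E_{ni}),
\]

\[
\mathrm{OUTFIT}_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^{2},
\qquad
\mathrm{INFIT}_i = \frac{\sum_{n=1}^{N} W_{ni}\, z_{ni}^{2}}{\sum_{n=1}^{N} W_{ni}}.
\]

Both mean squares have an expected value of 1; values well above 1 indicate noisy, underfitting responses, while values well below 1 indicate overly predictable, overfitting responses. This is the usual basis on which items such as those listed in Table 6.2 (Appendix A5) are flagged as misfitting.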