Measurement is the process of identifying a need, deciding what information matters most for the purpose at hand, selecting a well-balanced set of factors to measure, and finally, collecting and analyzing the data in an unbiased manner. Unfortunately, measurement tools are often selected more for their ease of use than for their added value. When this happens, people can become cynical about the benefits of collecting data to inform decision-making. The measurer, therefore, must choose an appropriate tool that measures not only what is important but also what will be most useful. Once high-quality data have been gathered, the techniques used to analyze the data must also be sound.
Validity refers to how well a tool actually measures what it claims to measure. For example, a tool used for assessing student writing should not include an assessment of reading comprehension. In general, before the validity of a measure can be researched, its reliability must be established. When multiple assessors or evaluators apply a measurement tool that involves observation, the level of consistency in their judgments is termed “inter-rater reliability.” A tool demonstrates reliability when multiple raters use it and produce the same results. When the criteria used to judge a performance are clear and raters are able to apply the criteria in the same manner, the reliability of an instrument will be high.
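The consistency between two raters described above can be quantified. The following is a minimal sketch, assuming two raters have scored the same set of performances on the same scale. It computes simple percent agreement and Cohen's kappa, a commonly used statistic that corrects agreement for chance. The function names and the sample ratings are hypothetical illustrations, not taken from the source.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of performances on which two raters assigned the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same category independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category for every performance
    return (observed - expected) / (1 - expected)


# Hypothetical rubric levels (1 to 4) assigned by two raters to eight student papers.
rater_a = [3, 2, 4, 3, 1, 2, 3, 4]
rater_b = [3, 2, 3, 3, 1, 2, 4, 4]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"percent agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

Kappa is lower than raw agreement because some matching ratings would occur even if the raters assigned levels at random. What counts as an acceptable value depends on the stakes of the measurement.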
Problems with the reliability of a measurement tool can often be detected using clues supplied by outlier data; therefore such data should be investigated to determine their significance. Outlier data may arise from the imprecision of the instrument or from sampling biases. They may not be representative for the purpose of the measurement. In order to provide credible findings, the instrument(s) used for measuring a performance should have high inter-rater reliability. Quality in assessment, evaluation, and research will not be advanced when measurers consistently hit the wrong target. Imprecision in measurement can also be caused by the performance of the measurer. When inter-rater reliability is low due to the performance of the measurer(s), it is often necessary to provide them with training that introduces explicit decision-making rules for assigning various quality ratings to different performances and to give them opportunities to practice measuring examples of student work that illustrate different levels of performance.
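As one way to surface outlier data for investigation, the sketch below flags scores that fall far outside the middle half of a data set. It assumes the common 1.5 × interquartile range convention, which is an illustrative choice rather than a rule given in the source. The function name and the scores are hypothetical.

```python
import statistics


def flag_outliers(scores, k=1.5):
    """Return (index, score) pairs lying more than k * IQR outside the quartiles."""
    if len(scores) < 4:
        raise ValueError("need at least four scores to estimate quartiles")
    q1, _, q3 = statistics.quantiles(scores, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [(i, s) for i, s in enumerate(scores) if s < low or s > high]


# Hypothetical scores on one performance task; the last one looks suspicious.
scores = [72, 75, 78, 80, 81, 83, 85, 20]
flagged = flag_outliers(scores)
print(flagged)
```

A flagged score is a prompt for investigation, not a verdict: it may reflect an imprecise instrument, a sampling bias, a rater who applied the criteria differently, or a genuinely unusual performance.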
Of course measurers must always balance the cost and quality of the data with their validity and reliability, but they should strive for the best measurement tools and techniques possible within those constraints. The eight principles of measurement given in Table 1 summarize the most important considerations involved in a quality measurement process.
Table 2 Comparison of Measurement, Assessment, Evaluation, and Research
| Categories | Measurement | Assessment | Evaluation | Research |
|---|---|---|---|---|
| Purpose | To obtain observations and data | To continuously improve | To determine whether standards are met | To produce new knowledge |
| Nature | Objective/unbiased | Non-judgmental (collaborative) | Judgmental (not collaborative) | Inquiry-based (collaborative) |
| Performer | Measurer | Assessor | Evaluator | Researcher |
| Beneficiary | Stakeholders in use of data | Assessee | External decision-makers | Community of scholars |
| Results | Documented data/observations | Action plan | Documented level of final performance; part of a permanent record; brings closure | Contribution to existing knowledge |
| Important Characteristics | Calibrated; reliable, appropriate scale (with range and units) | Criteria based; assessee-centered | Unbiased; criteria based | Theory driven; designed to control bias; can be tested; involves a high level of expertise; uses accepted methods |
Assessment, evaluation, and research are three important processes in higher education. Although each is different, all three of them involve measurement. Table 2 shows the significant differences among measurement, assessment, evaluation, and research. All three of these processes benefit from conscious attention to the principles of measurement articulated in Table 1.
Measurement targets should be meaningful to three different audiences: students, practitioners in the field, and researchers. Students respond best to explicit learning targets that involve authentic challenges connected with knowledge mastery, reasoning proficiency, product realization, and professional expectations (Stiggins, 1996). Practitioners expect to see course outcomes that support the diverse roles within the discipline or profession and in the workplace. Researchers depend on a clearly conceptualized cognitive model that reflects the latest understanding of how learners represent knowledge and develop expertise in the domain (Pellegrino, Chudowsky, & Glaser, 2001). Researchers also expect alignment among the cognitive model, the methods used to observe performance, and the protocol for interpreting results. Educators vary both in their motivation for collecting data and in their skill in interpreting and reporting them. It is important to address the challenge of serving all three audiences with learning and growth that can be validly measured. The following sections explore the varying uses of measurement.
Assessment is a process of measuring and analyzing a performance, a work product, or a learning skill to provide high-quality, timely feedback that gives assessees clear and meaningful directives and insights to help them improve their future performance (Overview of Assessment and Assessment Methodology). Before a performance can be measured for assessment purposes, the criteria must be clearly defined and expectations or measures of each criterion must be set. The measurer will find it easier to provide specific feedback that will be effective for strengthening future performance if he or she narrows the focus to three to five performance expectations. If the goal is to “grow” a performance, a work product, or a learning skill, assessment must occur early (and often) to allow students ample time to refine and improve. For example, if a central course outcome is to improve student writing, it will be important for instructors to conduct multiple “formative” measurements of performance on steps in the process of preparing a research paper. In this case, an instructor might use an analytic writing rubric for a research paper as the assessment tool to measure and collect data that provides feedback to the student. At intermediate times throughout the semester, instructors can measure specific performance expectations, providing both student and instructor with assessment data that can strengthen writing quality.
Evaluation is the process of measuring the quality of a performance (e.g., a work product or the use of a process) to make a judgment or to determine whether, or to what level, standards have been met (Overview of Evaluation and Evaluation Methodology). Evaluation is used in many academic arenas, such as graded assignments and exams, grade point average (GPA), promotion and tenure, or grant acquisition. Measurements that are used to make judgments are often based on external standards (e.g., accrediting standards, agency policies, accountability for funding). Before any performance can be measured for evaluation purposes, the performance expectations (standards based on the measure) must be clear for each criterion of quality. Furthermore, the evaluation should be unbiased and documented in a permanent record (e.g., transcript, personnel file, grant record). In the case of a research paper, the final grade may be assigned using information from a score sheet associated with a writing rubric. The more a measurement tool requires an evaluator to explain his or her judgments about whether standards have been met, the less effective that measurement tool is for evaluation.
The purpose of measurement in research is to validate new knowledge within or across disciplines (Research Methodology). Researchers begin with questions about a void in the existing body of knowledge and form hypotheses regarding relationships of measurable variables. Theory should be used to frame research questions and to guide methods for collecting reliable and valid data. In research, measurement falls into two categories: descriptive and experimental. If the researcher is attempting to answer a question descriptively, the appropriate tools include surveys, interviews or focus groups, conversational analysis, observation, ethnographies, or meta-analysis (Olds, Moskal, & Miller, 2005). If the researcher’s study is experimental in nature, the proper methods include randomized controlled trials, matched groups, baseline data, post-testing, and longitudinal designs. Each of these research designs or techniques requires certain kinds of measures that will result in data that can be appropriately analyzed to provide a basis for interpretation (National Research Council, 2002). Inferences drawn from the measurement should directly relate the evidence obtained to the hypothesis being investigated. The quality of a measure is very important because limitations, biases, and alternative interpretations will affect validity. Researchers want to know whether their findings can be generalized to a broader population or to multiple settings. The consistency of the measurement and the validity of the data are evidenced by the ability of other researchers to replicate the results.
Peer review and publication of research are essential for disseminating new knowledge to other practitioners as well as to the public.
Many educators are reluctant to apply measurement instruments and techniques to complex and integrated performance. Tasks like these are commonly referred to as “constructed-response outcomes”; they include learning portfolios, reflective journals, self-growth papers, capstone reports, project reports, and experiential narratives. Learning portfolios can include multiple performance “artifacts,” such as a sequence of art works produced during a course and accompanied by reflective journals and interpretive analyses. It is much easier to design constructed-response outcomes like portfolios than it is to create reliable and valid measures for assessing or evaluating their quality. To assess and/or to evaluate these complex outcomes, instructors often use custom-designed rubrics.
Educators’ historic reluctance to adopt complex integrated performance outcomes stems in part from their assumptions about reliability and validity in measuring them. For many, selected-response instruments, such as multiple-choice and matching, are perceived to be more reliable and valid as well as easier to use. Instructors cannot measure performances that involve critical thinking, quality teaching, or service-learning projects by counting “correct” answers (Wiggins, 1998). These require qualitative judgments. As a result, some instructors opt to take advantage of the comfort that comes from using traditional selected-response measurement instruments, and so spend most of their in-class time “covering the content” to align with “the test.” But selected-response tests are often not authentic measures of intended outcomes. For example, when one applies for a driver’s license, the simple indicators of the driving test and written test do not represent and are not intended to represent all key driving performances. Table 3 is a guide for selecting measurement tools for the five types of learning outcomes described in the Learning Outcomes module: competency, movement, accomplishment, experience, and integrated performance.
A competency is a collection of knowledge, skills, and attitudes needed to perform a specific task effectively and efficiently at a defined level. A common question about a competency outcome is: “What can the learner do at what level in a specific situation?” Movement is documented growth in a transferable process or learning skill. A common question about a movement outcome is: “What does increased performance look like?” Accomplishments are significant work products or performances that are externally valued or affirmed by an outside expert. A common question about an accomplishment outcome is: “How well does student work compare with work products of practitioners in the field?” Experiences are interactions, emotions, responsibilities, and shared memories that clarify one’s position in relation to oneself, a community, or discipline. A common question about an experience outcome is: “How has this experience changed the learner?” Integrated performance is the synthesis of prior knowledge, skills, processes, and attitudes with current learning needs to address a difficult challenge within a strict time frame and set of performance expectations. A common question about integrated performance is: “How prepared are students to respond to a real-world challenge?”
Table 3 Alignment of Learning Outcomes and Measurement Instruments
| Outcome Type | Example | Task/Instrument |
|---|---|---|
| Competency | Applying knowledge in a specific context at a specific level | Checklist or selected response exam with answer key |
| Movement | Exercising transferable skills in a continuum with no upper bound (e.g., problem solving, communication, teamwork) | Reflective essay with analytic rubric |
| Accomplishments | Creating something with external value (project work, community service, artistic creation, thesis) | Portfolio with scorecard or peer review form |
| Experiences | Response to and internalization of a situation | Personal journal with holistic rubric |
| Integrative performances | Deploying working expertise in response to an authentic challenge (e.g., internship interview, student teaching observation, final presentation, leadership situation) | Performance appraisal with rating form |
Over the last decade, rubrics have received considerable attention in education as tools for performance measurement (Arter & McTighe, 2001). Rubrics provide explicit statements that describe different levels of performance and are worded in a way that covers the essence of what to look for when conducting qualitative measurements. Rubrics should reflect the best thinking about what constitutes good performance, a work product, or a learning skill. As discussed in Fundamentals of Rubrics, rubrics can be analytic (with an extensive set of factors and multiple scales) or holistic (with just a single scale). However, rubrics are only as robust as the clarity of purpose for measurement.
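The distinction between analytic and holistic rubrics can be sketched as data. The example below is a minimal illustration, assuming a three-level scale. The criterion names, descriptors, and function names are hypothetical and are not drawn from Fundamentals of Rubrics.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One factor of an analytic rubric, with its own rating scale."""
    name: str
    levels: dict  # maps a rating (int) to its performance descriptor


# Hypothetical analytic rubric for a research paper: several factors, each with a scale.
ANALYTIC_RUBRIC = [
    Criterion("thesis", {1: "unclear or missing", 2: "stated but unfocused", 3: "clear and arguable"}),
    Criterion("evidence", {1: "unsupported claims", 2: "some relevant sources", 3: "well-chosen, cited sources"}),
    Criterion("organization", {1: "hard to follow", 2: "mostly logical", 3: "logical throughout"}),
]

# Hypothetical holistic rubric: a single scale describing the work as a whole.
HOLISTIC_RUBRIC = {1: "beginning", 2: "developing", 3: "accomplished"}


def score_analytic(rubric, ratings):
    """Pair each criterion's rating with its descriptor; reject undefined levels."""
    report = {}
    for criterion in rubric:
        rating = ratings[criterion.name]
        if rating not in criterion.levels:
            raise ValueError(f"{rating!r} is not a defined level for {criterion.name!r}")
        report[criterion.name] = (rating, criterion.levels[rating])
    return report


def score_holistic(rubric, rating):
    """Pair a single overall rating with its descriptor; reject undefined levels."""
    if rating not in rubric:
        raise ValueError(f"{rating!r} is not a defined level")
    return rating, rubric[rating]


analytic_report = score_analytic(ANALYTIC_RUBRIC, {"thesis": 3, "evidence": 2, "organization": 2})
holistic_report = score_holistic(HOLISTIC_RUBRIC, 2)
print(analytic_report)
print(holistic_report)
```

The analytic report yields one rating and descriptor per factor, which suits feedback aimed at improving specific aspects of a performance. The holistic report yields a single overall rating.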
Measurement is foundational to classroom assessment, grading, program evaluation, and educational research. In the physical sciences, quality measurement is a central event; in education, measurement involves a series of linked decisions that are more qualitative in nature. In both, the goal is to align outcomes, performance tasks, measurement methods, and data analysis. Educators in all disciplines must learn to apply their measurement skills to the multiple uses of measurement in education. Regardless of the discipline or profession, best practices include clear communication of purpose, well-selected targets for measurement, sound methods for data collection, and sampling to reduce bias and distortion. Faculty will become better teachers and researchers if they learn to seek consensus with their colleagues about what processes matter most in teaching and learning, and what tools measure learner growth most efficiently and effectively.
Arter, J., & McTighe, J. (2001). Scoring rubrics in the classroom: Using performance criteria for assessing and improving student performance. Thousand Oaks, CA: Corwin Press.
Olds, B. M., Moskal, B. M., & Miller, R. L. (2005). Assessment in engineering education: Evolution, approaches and future collaborations. Journal of Engineering Education, 94, 13-26.
Pellegrino, J., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.
Shavelson, R. J., & Towne, L. (Eds.). (2002). Scientific research in education. Washington, DC: National Academy Press.
Stiggins, R. J. (1996). Student-centered classroom assessment. Old Tappan, NJ: Prentice Hall.
Wiggins, G. (1998). Educative assessment: Designing assessment to inform and improve student performance. San Francisco: Jossey-Bass.