Post-Doc Blogpost: Machine Scoring & Student Performance

I believe one concept at the heart of judging the validity of a writing assessment scored via automated essay scoring (AES) as a measure of student performance in a graduate program is fidelity. Fidelity concerns the degree to which the task, the response mode, and what is actually scored match the requirements found in the real world (Zane, 2009, p. 87), including such elements as context, structure, and the item’s parameters. What lends validity to such an assessment is the fact that much of what a writing assessment expects mirrors the college experience. From grammar and sentence structure to critical thinking and conceptual integration, an automated writing assessment can achieve considerable fidelity because the exercise emulates the student experience in myriad ways. Where attention must be paid, however, is the validity of the objective scoring an AES system provides.

Machine scores are based on a limited set of quantifiable features in an essay, while human holistic scores are based on a broader set of features, including many, such as the logical consistency of an argument, that cannot yet be evaluated by a machine (Bridgeman, Trapani, & Attali, 2012, p. 28). The assessment itself, then, is what provides much of the fidelity to the student experience. What has yet to be developed is a holistic way of interpreting and scoring essay responses with the same depth, and the same attention to elements outside a designed algorithm, that human raters bring. Human raters can probe creativity, innovation, and integrative thinking. Human raters can identify off-topic content just as an AES system would, yet rather than reducing the score by default, as an AES system is designed to do, the human rater can form an individual judgment about the off-topic content and its relevance to the response. If a student asked about the concept of truth in an environment such as Accuplacer testing draws on the vocabulary of art, for example, the scoring mechanism might deduct for that vernacular, even though the student’s point, that art and truth alike are subjective interpretations, is on topic. What AES does well, and human raters do not, is score essay responses with an equal level of consistency across multiple raters and multiple ratings.
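To make concrete what “a limited set of quantifiable features” can look like, here is a minimal sketch of the kind of surface features an AES system might extract and weight. The feature set and the weights here are hypothetical illustrations, not those of any actual AES product:

```python
import re

def essay_features(text):
    """Extract a few simple, quantifiable features of the kind
    AES systems rely on: length, sentence complexity, vocabulary."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }

# Hypothetical weights standing in for a trained scoring model.
WEIGHTS = {"word_count": 0.01, "avg_sentence_len": 0.05, "type_token_ratio": 2.0}

def machine_score(text, weights=WEIGHTS):
    """Combine the features into a single score, as a stand-in for
    the regression models real AES systems train on scored essays."""
    feats = essay_features(text)
    return sum(weights[name] * value for name, value in feats.items())
```

Note what is absent: nothing in these features can detect whether an argument is logically consistent or the content on topic, which is precisely the gap human raters fill.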

Because AES models are often formed using more than two raters, studies that have evaluated interrater agreement have usually shown that the agreement coefficients between the computer and human raters are at least as high as, or higher than, those among human raters themselves (Shermis, Burstein, Higgins, & Zechner, 2010, p. 22). This gives pause to any doubt cast on an AES system’s ability to score work accurately and reliably. Yet this reliability does not necessarily denote validity, as previously discussed. Thus an environment where at least one AES score and one human rater score are considered in conjunction presents the most promising synthesis of both approaches for a single assessment item and score. Further understanding of rater cognition is necessary to have a more thorough understanding of what is implied by the direct calibration and evaluation of AES against human scores and what, precisely, is represented in a human score for an essay (Ramineni & Williamson, 2013, p. 37). With this in mind, it is critical that those designing such assessments remain attentive to differences in scores among human raters, differences in scores between machine and human raters, and any differences in the AES model’s ability to measure student performance against a particular rubric. Should the rubric cover concepts such as creativity, an element at which AES is disadvantaged, that emphasis on content should drive the decision whether to use AES. Ultimately, the context, the grading criteria as explained by the rubric, and the opportunity for joint assessment with a human rater will drive the decision of whether AES is a valid source of objective writing assessment.
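The agreement coefficient most often reported in these machine-versus-human comparisons is quadratic weighted kappa, which credits near-misses more than large disagreements and corrects for chance agreement. A minimal, self-contained sketch, assuming integer scores on a fixed scale:

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two sets of scores: 1.0 is perfect agreement,
    0.0 is chance-level, negative values are worse than chance."""
    n_cat = max_score - min_score + 1
    n = len(rater_a)
    # Observed joint score matrix.
    obs = [[0.0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        obs[a - min_score][b - min_score] += 1
    # Marginal totals, used to build the chance-expected matrix.
    marg_a = [sum(row) for row in obs]
    marg_b = [sum(obs[i][j] for i in range(n_cat)) for j in range(n_cat)]
    num = den = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            w = (i - j) ** 2 / (n_cat - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * marg_a[i] * marg_b[j] / n
    return 1.0 - num / den
```

Comparing this coefficient for machine-versus-human pairs against human-versus-human pairs is one concrete way to run the calibration check Ramineni and Williamson call for.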

Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27–40.

Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25–39.

Shermis, M. D., Burstein, J., Higgins, D., & Zechner, K. (2010). Automated essay scoring: Writing assessment and instruction. In P. Peterson, E. Baker, & B. McGaw (Eds.), International encyclopedia of education (3rd ed., pp. 20–26). Oxford, UK: Elsevier.

Zane, T. W. (2009). Performance assessment design principles gleaned from constructivist learning theory (Part 2). TechTrends, 53(3), 86–94.


Post-Doc Blogpost: Validity & Reliability in Performing Assessments

Designing a question for an instrument is designing a measure: an answer given to a question is of no intrinsic interest, and the answer is valuable only to the extent that it can be shown to have a predictable relationship to facts or subjective states that are of interest (Fowler, 2009, p. 87). Sweeping satisfaction questions, such as satisfaction with one’s job or degree program, are inherently of little value to an assessment in their existing form. They carry limited value because a student may pass through a number of differing subjective states over the course of a degree program. A student can find great value in the first courses, or even an entire year, yet subsequently find little value in the courses that remain. This alone indicates the item’s inability to measure these changes in a student’s state; it therefore does not exhibit a predictable relationship to the state of interest. An objective test item is defined as one for which the scoring rules are so exhaustive and specific that they do not allow scorers to make subjective inferences or judgments (Murayama, 2012, para. 1). Requiring students to infer what period in time the item refers to introduces subjectivity, and negates the item’s ability to deliver anything other than a highly subjective, highly summative point of view.

Many teachers believe that they need strong measurement skills, and report that they are confident in their ability to produce valid and reliable tests (Frey, Petersen, Edwards, Pedrotti, & Peyton, 2005, p. 2). Yet this contention remains at issue, both because standards for establishing validity remain disparate and interspersed throughout the literature on item-writing, and because the research shows that assessment training is rarely required curriculum in teaching certification programs. What is needed to determine whether items possess the required validity are standards for the validity of each assessment item. Of an identified 40 different item-writing rules, each falls into one or more of a few categories: potentially confusing wording or ambiguous requirements, guessing, rules addressing test-taking efficiency, and rules designed to control testwiseness (Frey, Petersen, Edwards, Pedrotti, & Peyton, 2005, p. 4). Each category includes a number of item-writing rules, all intended to address differing concerns for validity. Potentially confusing wording or ambiguous requirements speaks to confidence that every respondent will understand a question the same way. Guessing, in this instance, refers to excluding responses where respondents simply chose a correct answer by chance; the probability of this occurring must be reduced. Rules addressing test-taking efficiency have to do with designing items so that their structure does not impede, their form is simple, completing each is brief, and options are made clear.
Finally, rules designed to control testwiseness concern designing items so that, to the largest extent possible, items are answered using only knowledge, ability, or a combination of the two, rather than by identifying patterns or other unintended characteristics of an item that may lead respondents to accidentally identify a correct answer without knowing why it is correct. To infuse greater validity into the item discussed above, considerations for all four categories are prudent. Yet paramount among them is to alter the item so that its ambiguous requirement is corrected and an appropriate span of time is delineated within the context of the question.

Where validity deals with the relationship between each item and an area of interest, reliability deals with the consistency of results each time a measurement is taken. In discussing whether scores resulting from an item demonstrate reliability, look for whether the item’s responses are consistent across constructs, whether scores are stable over time when the instrument is administered a second time, and whether there is consistency in test administration and scoring (Creswell, 2009, p. 149). While researchers often address validity and reliability as separate considerations, I feel their interrelationship cannot be overstated. Returning to the example item on program satisfaction above: if the validity of the measurement is compromised, as in creating confusion among respondents about when, and how much of, the program it is intended to describe, this heightens the probability of inconsistent responses, which in turn directly threatens reliability. If one respondent can defensibly answer the same question multiple ways, each time regarding a different aspect of the same context, we are measuring the same condition multiple times and arriving at multiple, quite different results. This is especially problematic with a true/false or multiple-choice item, as either presents a very limited list of potential responses. Altered response patterns arising from poorly worded items leave the reliability of the instrument in question: with each subsequent administration it is entirely likely that different responses among those available are selected, and the percentage choosing each response (and therefore its description of a percentage of the population assessed) is unreliable.
It is only after the ambiguities inherent in the item’s wording are addressed, and consistent responses are collected across multiple administrations, that this item can begin to be described as valid, reliable, or both.
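The test-retest stability Creswell describes is often estimated as the correlation between scores from two administrations of the same instrument to the same respondents. A minimal sketch using the Pearson coefficient (the choice of coefficient is my assumption here; other reliability indices, such as Cronbach’s alpha, serve related purposes):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two score lists. Applied to two
    administrations of the same instrument, it estimates test-retest
    reliability: values near 1.0 indicate stable responses over time."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

An ambiguously worded item would be expected to depress this coefficient, since respondents may defensibly interpret the item differently on each administration.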

Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods approaches (3rd ed.). Thousand Oaks, CA: Sage.

Fowler, F. J. (2009). Survey research methods (4th ed.). Thousand Oaks, CA: Sage.

Frey, B. B., Petersen, S., Edwards, L. M., Pedrotti, J. T., & Peyton, V. (2005). Item-writing rules: Collective wisdom. Teaching and Teacher Education, 21(4), 357–364.

Murayama, K. (2012). Objective test items. Retrieved December 24, 2013 from http://www.education.com/reference/article/objective-test-items/.