Post-Doc Blogpost: Machine Scoring & Student Performance

I believe one concept at the heart of judging the validity of automated essay scoring (AES) as a measure of student performance in a graduate program is fidelity. Fidelity is the degree to which the task, the response mode, and what is actually scored match the requirements found in the real world (Zane, 2009, p. 87). This includes elements such as context, structure, and the item’s parameters. What lends validity to such an assessment is that much of what a writing assessment demands mirrors the college experience. From grammar and sentence structure to critical thinking and conceptual integration, a writing assessment offers high fidelity because the exercise emulates the student experience in myriad ways. Where attention must be paid, however, is the validity of the objective scoring an AES system provides.

Machine scores are based on a limited set of quantifiable features in an essay, while human holistic scores are based on a broader set of features, including many, such as the logical consistency of an argument, which cannot yet be evaluated by a machine (Bridgeman, Trapani, & Attali, 2012, p. 28). The assessment itself, then, is what provides much of the fidelity to the student experience. What has yet to be developed is a holistic way of interpreting and then scoring essay responses with the depth that human raters bring, including consideration of elements not contained within a designed algorithm. Human raters can probe creativity, innovation, and integrative thinking. Human raters can identify off-topic content just as an AES system would, yet rather than reducing the score by default, as AES is designed to do, the human rater can form an individual opinion about the off-topic content and its relevance to the response. If a test-taker is asked about the concept of truth in an environment such as Accuplacer testing, for example, the scoring mechanism might deduct for vocabulary associated with art, even though the author invokes art because art and truth alike are subjective interpretations. What AES does well and human raters do not, however, is score essay responses with the same level of consistency across multiple raters and multiple ratings.

As AES models are often formed using more than two raters, studies that have evaluated interrater agreement have usually shown that the agreement coefficients between the computer and human raters are at least as high as, or higher than, those among human raters themselves (Shermis, Burstein, Higgins, & Zechner, 2010, p. 22). This should give pause to those who doubt an AES system’s ability to score work accurately and reliably. Yet, as discussed above, this reliability does not necessarily denote validity. Thus, an environment where at least one AES score and one human rater score are considered in conjunction presents the most promising synthesis of both approaches for a single assessment item and score. Further understanding of rater cognition is necessary to have a more thorough understanding of what is implied by the direct calibration and evaluation of AES against human scores and what, precisely, is represented in a human score for an essay (Ramineni & Williamson, 2013, p. 37). With this in mind, it is critical that those designing such assessments remain attentive to differences in scores among human raters, differences in scores between machine and human raters, and any differences in the AES model’s ability to measure student performance against a particular rubric. Should the rubric cover such concepts as creativity, an element on which AES is at a disadvantage, this emphasis on content will drive the decision of whether to use AES. Ultimately, the context, the grading criteria explained by the rubric, and the opportunity for joint assessment with a human rater will determine whether AES is a valid source of objective writing assessment.
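To make the idea of an agreement coefficient concrete, here is a minimal sketch in Python of two statistics commonly reported when comparing raters: exact agreement and quadratic weighted kappa. The studies cited above do not necessarily use these exact measures, the function names are my own, and the scores are invented purely for illustration on an assumed 1 to 6 holistic scale.

```python
from collections import Counter


def exact_agreement(scores_a, scores_b):
    """Proportion of essays on which two raters gave the identical score."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def quadratic_weighted_kappa(scores_a, scores_b):
    """Chance-corrected agreement that penalizes large disagreements more.

    Returns 1.0 for perfect agreement, about 0 for chance-level agreement,
    and negative values for systematic disagreement.
    """
    n = len(scores_a)
    # Observed mean squared difference between the paired scores.
    observed = sum((a - b) ** 2 for a, b in zip(scores_a, scores_b)) / n
    # Mean squared difference expected if the two raters scored
    # independently, given each rater's own score distribution.
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    expected = sum(
        counts_a[i] * counts_b[j] * (i - j) ** 2
        for i in counts_a
        for j in counts_b
    ) / (n * n)
    if expected == 0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - observed / expected


# Hypothetical scores for ten essays (invented data, not from any study).
human_1 = [4, 3, 5, 2, 4, 3, 6, 1, 4, 5]
human_2 = [4, 4, 5, 2, 3, 3, 5, 2, 4, 4]
machine = [4, 3, 5, 3, 4, 3, 5, 2, 4, 5]

for label, a, b in [
    ("human 1 vs human 2", human_1, human_2),
    ("human 1 vs machine", human_1, machine),
    ("human 2 vs machine", human_2, machine),
]:
    print(label,
          "exact:", round(exact_agreement(a, b), 2),
          "QWK:", round(quadratic_weighted_kappa(a, b), 2))
```

Comparing the human-human row with the human-machine rows is the kind of check described above. A high coefficient in either row speaks only to consistency, not to whether the scores capture what the rubric intends to measure.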

Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27–40.

Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25–39.

Shermis, M. D., Burstein, J., Higgins, D., & Zechner, K. (2010). Automated essay scoring: Writing assessment and instruction. In P. Peterson, E. Baker, & B. McGaw (Eds.), International encyclopedia of education (3rd ed., pp. 20–26). Oxford, UK: Elsevier.

Zane, T. W. (2009). Performance assessment design principles gleaned from constructivist learning theory (Part 2). TechTrends, 53(3), 86–94.
