Understanding Item Analysis – Fighting the Invisible Threat

This is the first post in a series about item analysis – specifically why it’s important, which measures are of primary concern, and what to do with poorly performing items. Today’s post lays the groundwork before we dig deeper into the technical aspects of item analysis and revision.

Imagine a small brick wall made up of 200 bricks. It’s assumed during construction that these bricks will all perform roughly the same – there will be some variability in strength, quality, and resistance to wear and tear, but all in all, they’ll work together to support the weight of the wall and ensure that it achieves its intended purpose: to stay standing. Now, imagine that over time, after exposure to the elements, you notice that something doesn’t look quite right about the wall – it’s starting to sag or break apart in some areas. Perhaps not all of these bricks are pulling their weight. If it’s one or two, the structural integrity of the wall won’t be compromised, but if it’s 20 or so, legitimate concerns about the quality of the wall and its ability to remain standing will surface. At some point, it’s time to replace these underperforming bricks – the question is, which ones?

The oversimplified analogy above illustrates the importance of each individual item and the need to periodically assess its condition. The items on a test are the building blocks, the bricks, that hold the test together and lend legitimacy to it. As responses to the items accumulate, you’re able to gather valuable data on how the items are performing – basically, you get to see whether the items are pulling their weight. Given the professional and, potentially, societal ramifications of granting an individual a certification or licensure at least partly based on their test score, it’s absolutely critical that the “health” of the items is monitored over time. If an item is no longer doing what it should, it’s time to lay it to rest and replace it with a new item. If an item appears good on the surface, but a few anomalous data points are coming up, maybe the best solution is to patch it up, make a few structural changes, and monitor it carefully in the ensuing months.

In its simplest form, item analysis is the gathering of data about the item – how individuals of varying abilities respond to it – and determining whether it is performing “up to spec.” Just as there is a multitude of item types (from fill-in-the-blank to essay to hot-spot and drag-and-place items), there are many ways to analyze and score an item. The majority of certification and licensure exams are composed primarily of multiple-choice questions. This question type may not be the most effective at assessing an individual’s ability to actually perform a task, but it does allow many items to be constructed quickly and scored automatically – the cost of the additional validation and manual scoring required for items designed to assess an individual’s practical ability is often too great. Considering this, the rest of this article will concern itself with multiple-choice, dichotomously scored items. Dichotomously scored items are those that are scored as correct or incorrect – there are no partial points or scores beyond a simple 1 or 0.
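To make “dichotomously scored” concrete, here is a minimal Python sketch. The function name, the answer key, and the candidate’s responses are all made up for illustration; treating a blank response as incorrect is also an assumption, since a real exam program may handle omitted items differently.

```python
def score_dichotomously(responses, answer_key):
    """Score each response as 1 (matches the key) or 0 (anything else).

    Hypothetical helper for illustration only. A blank response (None)
    is scored 0 here, which is an assumption rather than a rule.
    """
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must be the same length")
    return [1 if given == correct else 0
            for given, correct in zip(responses, answer_key)]


# Made-up four-item test and one candidate's selections.
answer_key = ["B", "D", "A", "C"]
candidate = ["B", "A", "A", None]  # None = item left blank

item_scores = score_dichotomously(candidate, answer_key)
print(item_scores)       # [1, 0, 1, 0]
print(sum(item_scores))  # 2 (raw test score)
```

Every item contributes either a 1 or a 0, and the raw test score is just the sum. There is no partial credit anywhere in the calculation.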

The most commonly used data points when analyzing an item are its difficulty and its discrimination (how well it’s able to differentiate between individuals of low and high ability levels). Beyond this, a distractor analysis is typically performed, which looks at the response options for the item and how individuals respond to them (how many people are choosing the incorrect vs. correct responses, and whether people who do well on the test are picking an incorrect choice disproportionately often). For each of these measures there are two primary psychometric theories that are applied to get the values of interest – Classical Test Theory (CTT) and Item Response Theory (IRT).
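As a rough preview, the sketch below computes these three measures for a single item using common CTT conventions: difficulty as the proportion of candidates answering correctly, discrimination as the point-biserial correlation between the item score and the candidate’s score on the rest of the test, and a distractor tally split into lower- and upper-scoring halves. These are assumptions about which conventions to use, not the only options, and IRT estimates these quantities differently. All function names and data are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """CTT difficulty (p-value): proportion of candidates scoring 1."""
    return mean(item_scores)


def item_discrimination(item_scores, rest_scores):
    """Point-biserial: Pearson correlation of the 0/1 item score with the
    score on the remaining items. Returns None if either has no variance."""
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        return None
    m_item, m_rest = mean(item_scores), mean(rest_scores)
    cov = mean([(x - m_item) * (y - m_rest)
                for x, y in zip(item_scores, rest_scores)])
    return cov / (sd_item * sd_rest)


def distractor_counts(choices, rest_scores):
    """Tally chosen options for the lower and upper halves of candidates,
    ranked by rest score (the middle candidate is dropped if n is odd)."""
    order = sorted(range(len(choices)), key=lambda i: rest_scores[i])
    half = len(order) // 2
    lower = Counter(choices[i] for i in order[:half])
    upper = Counter(choices[i] for i in order[len(order) - half:])
    return lower, upper


# Made-up data: six candidates, one item whose keyed answer is "A".
choices = ["A", "B", "A", "C", "A", "A"]
rest_scores = [18, 9, 15, 7, 16, 12]  # scores on the other items
item_scores = [1 if c == "A" else 0 for c in choices]

print(item_difficulty(item_scores))                   # proportion correct
print(item_discrimination(item_scores, rest_scores))  # positive is good
lower, upper = distractor_counts(choices, rest_scores)
print(lower, upper)
```

In this toy data the stronger candidates all choose the keyed answer while the distractors draw only from the weaker half, which is the pattern a healthy item should show. A distractor that attracts the upper group disproportionately often is the kind of anomaly worth a closer look.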

Over the next few weeks I’ll be delving into item analysis, and into what exactly to do with a poorly performing item, in more detail. Next week we’ll go over the aforementioned data points that are most commonly used when assessing an item’s performance. Stay tuned for more as we work to identify the bricks that pose the greatest threat to the integrity of our pretend wall.

