Test fraud is an extremely common occurrence.  We’ve all seen articles about examinee cheating.  However, there are very few defensible tools to help detect it.  I once saw a webinar from an online testing provider that proudly touted their reports on test security… but it turned out that all they provided was a simple export of student answers that you could subjectively read and form conjectures about.  The goal of SIFT is to provide a tool that implements real statistical indices from the corpus of scientific research on statistical detection of test fraud, yet is user-friendly enough to be used by someone without a PhD in psychometrics and experience in data forensics.  SIFT still provides more collusion indices and other analyses than any other software on the planet, making it the standard in the industry from the day of its release.  The science behind SIFT is also being implemented in our world-class online testing platform, FastTest.  It is worth noting that FastTest supports computerized adaptive testing, which is known to increase test security.

Interested?  Download a free trial version of SIFT!

What is Test Fraud?

As long as tests have been around, people have been trying to cheat them.  This is only natural; anytime there is a system with some sort of stakes or incentive involved (and maybe even when not), people will try to game that system.  Note that the root culprit is the system itself, not the test; blaming the test is just shooting the messenger.  However, in most cases, the system serves a useful purpose.  In the realm of assessment, that means that K12 assessments provide useful information on curriculum and teachers, certification tests identify qualified professionals, and so on.  In such cases, we must minimize the amount of test fraud in order to preserve the integrity of the system.

When it comes to test fraud, the old cliché is true: an ounce of prevention is worth a pound of cure.  You’ll undoubtedly see that phrase at conferences and in other resources.  So I of course recommend that your organization implement reasonable preventative measures to deter test fraud.  Nevertheless, there will always be some cases.  SIFT is intended to help find those.  Some examinees might also be deterred simply by the knowledge that such analysis is being done.

How can SIFT help me with statistical detection of test fraud?

Like other psychometric software, SIFT does not interpret results for you.  For example, item analysis programs like Iteman and Xcalibre do not specifically tell you which items to retire or revise, or how to revise them, but they provide the output necessary for a practitioner to do so.  SIFT provides a wide range of output that can help you find different types of test fraud, like copying, proctor help, suspect test centers, brain dump usage, etc.  It can also help find other issues, like low examinee motivation.  But YOU have to decide what is important to you regarding statistical detection of test fraud, and look for the relevant evidence.  More information on this is provided in the manual, but here is a glimpse.

[Figure: SIFT test security data forensics output]

First, there are a number of indices you can evaluate, as you see above.  SIFT will calculate those collusion indices for each pair of students and summarize the number of flags.

[Figure: SIFT collusion index analysis]
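To make the pairwise idea concrete, here is a toy sketch (not one of SIFT's published indices) that simply counts identical responses and identical incorrect responses for every pair of examinees; the indices from the literature are far more sophisticated, but the pairwise structure is the same.

```python
# Toy illustration of a pairwise collusion screen: for every pair of examinees,
# count identical responses and identical *incorrect* responses (the latter is
# the more suspicious signal). Simplified descriptive counts only.
from itertools import combinations

def pairwise_similarity(responses, key):
    """responses: dict of examinee_id -> list of answers; key: list of correct answers."""
    flags = []
    for (id1, r1), (id2, r2) in combinations(responses.items(), 2):
        same = sum(a == b for a, b in zip(r1, r2))
        same_wrong = sum(a == b != k for a, b, k in zip(r1, r2, key))
        flags.append((id1, id2, same, same_wrong))
    return flags

key = ["A", "C", "B", "D", "A"]
responses = {"S1": ["A", "C", "D", "D", "B"],
             "S2": ["A", "C", "D", "D", "B"],   # identical, including the same errors
             "S3": ["A", "B", "B", "D", "A"]}
for row in pairwise_similarity(responses, key):
    print(row)
```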

A certification organization could use  SIFT  to look for evidence of brain dump makers and takers by evaluating similarity between examinee response vectors and answers from a brain dump site – especially if those were intentionally seeded by the organization!  We also might want to find adjacent examinees or examinees in the same location that group together in the collusion index output.  Unfortunately, these indices can differ substantially in their conclusions.

Additionally, you might want to evaluate time data.  SIFT  provides this as well.

[Figure: SIFT time analysis]
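As a rough illustration of the kind of screen this enables, the sketch below standardizes total testing time across examinees and flags unusually fast tests; this generic z-score check is my own example, not SIFT's specific time statistics.

```python
# Generic z-score screen for suspiciously fast total testing times.
import numpy as np

def flag_fast_examinees(total_times, z_cutoff=-2.0):
    times = np.asarray(total_times, dtype=float)
    z = (times - times.mean()) / times.std(ddof=1)
    return [i for i, zi in enumerate(z) if zi < z_cutoff]   # suspiciously fast

print(flag_fast_examinees([41, 38, 45, 40, 12, 39, 44]))  # flags the 12-minute test
```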

Finally, we can roll up many of these statistics to the group level.  Below is an example that provides a portion of SIFT output regarding teachers.  Note that Gutierrez has suspiciously high scores without spending much more time.  Cheating?  Possibly.  On the other hand, that is the smallest N, so perhaps the teacher just had a group of accelerated students.  Worthington also had high scores but notably shorter times – perhaps the teacher was helping?

[Figure: SIFT group-level analysis]
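The group-level roll-up boils down to aggregation, as in the small sketch below: average score, average time, and group size by teacher.  The column names and data values are invented for illustration, not SIFT's actual output format.

```python
# Hypothetical group-level roll-up: mean score, mean time, and N by teacher.
import pandas as pd

df = pd.DataFrame({
    "teacher": ["Gutierrez", "Gutierrez",
                "Worthington", "Worthington", "Worthington",
                "Smith", "Smith", "Smith"],
    "score":   [38, 36, 37, 35, 36, 28, 30, 26],
    "minutes": [52, 50, 31, 29, 33, 51, 49, 48],
})
summary = df.groupby("teacher").agg(n=("score", "size"),
                                    mean_score=("score", "mean"),
                                    mean_minutes=("minutes", "mean"))
print(summary)
```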


The Story of SIFT

I started  SIFT  in 2012.  Years ago, ASC sold a software program called  Scrutiny!  We had to stop selling it because it did not work on recent versions of Windows, but we still received inquiries for it.  So I set out to develop a program that could perform the analysis from  Scrutiny! (the Bellezza & Bellezza index) but also much more.  I quickly finished a few collusion indices.  Then unfortunately I had to spend a few years dealing with the realities of business, wasting hundreds of hours in pointless meetings and other pitfalls.  I finally set a goal to release SIFT in July 2016.

Version 1.0 of SIFT includes 10 collusion indices (5 probabilistic, 5 descriptive), response time analysis, group-level analysis, and much more to aid in the statistical detection of test fraud.  This is obviously not an exhaustive list of the analyses from the literature, but it still far surpasses the other options available to the practitioner, including writing all your own code.  Suggestions?  I’d love to hear them.

The “opt out” movement is a supposedly grass-roots movement against K-12 standardized testing, primarily focused on encouraging parents to refuse to allow their kids to take tests, i.e., opt out of testing.  The absolutely bizarre part of this is that large-scale test scores are rarely used for individual impact on the student, and that tests take up only a tiny fraction of school time throughout the year.  Randy E. Bennett of Educational Testing Service (ETS) recently released an extremely well-written paper exploring this befuddling situation.  Dr. Bennett is an internationally renowned researcher whose opinion is quite respected, and he comes to an interesting conclusion about the opt-out-of-testing topic.

Opt-Out Movement: The Background

After a brief overview, he summarizes the situation:

Despite the fact that reducing testing time is a recurring political response, the evidence described thus far suggests that the actual time devoted to testing might not provide the strongest rationale for opting out, especially in the suburban low-poverty schools in which test refusal appears to occur more frequently.

A closer look at New York, the state with the highest opt-out rates, found a less obvious but stronger relationship (page 7):

It appears to have been the confluence of a revamped teacher evaluation system with a dramatically harder, Common Core-aligned test that galvanized the opt-out movement in New York State (Fairbanks, 2015; Harris & Fessenden, 2015; PBS Newshour, 2015). For 2014, 96% of the state’s teachers had been rated as effective or highly effective, even though only 31% of students had achieved proficiency in ELA and only 36% in mathematics (NYSED, 2014; Taylor, 2015). These proficiency rates were very similar to ones achieved on the 2013 NAEP for Grades 4 and 8 (USDE, 2013a, 2013b, 2013c, 2013d). The rates were also remarkably lower than on New York’s pre-Common-Core assessments. The new rates might be taken to imply that teachers were doing a less-than-adequate job and that supervisors, perhaps unwittingly, were giving them inflated evaluations for it.

That view appears to have been behind a March 2015 initiative from New York Governor Andrew Cuomo (Harris & Fessenden, 2015; Taylor, 2015). At his request, the legislature reduced the role of the principal’s judgment, favored by teachers, and increased from 20% to 50% the role of test-score growth indicators in evaluation and tenure decisions (Rebora, 2015). As a result, the New York State United Teachers union urged parents to boycott the assessment so as to subvert the new teacher evaluations and disseminated information to guide parents specifically in that action (Gee, 2015; Karlin, 2015).

The future?

I am certainly sympathetic to the issues facing teachers today, being the son of two teachers and having a sibling who is a teacher, as well as having wanted to be a high school teacher myself until I was 18.  The lack of resources and low pay facing most educators is appalling.  However, the situation described above is simply an extension of the soccer syndrome that many in our society decry: the idea that all kids should be allowed to play and rewarded equally, merely for participation rather than performance.  With no measure of performance, there is no external impetus to perform – and we all know the role that motivation plays in performance.

It will be interesting to see the role that the Opt Out of Testing movement plays in the post-NCLB world.

Assessment is being drastically impacted by technology, as is much of education.  Just as learning is undergoing a sea change with artificial intelligence, multimedia, gamification, and much more, assessment is likewise being transformed.  This post discusses a few ways this is happening.

What is assessment technology?

Assessment technology refers to the digital tools and algorithms used to develop, deliver, score, and analyze exams – everything from item banking and automated test assembly to psychometric analytics.  Here are ten ways it can improve your exams.

10 Ways That Assessment Technology Can Improve Exams

Automated item generation

Newer assessment platforms will include functionality for automated item generation.  There are two main approaches: template-based generation and AI text generation with large language models (LLMs) such as ChatGPT.
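As a hedged illustration of the template-based approach, here is a minimal sketch that fills slots in a fixed stem from value lists; real AIG tools add distractor logic, constraints, and review workflows.

```python
# Minimal template-based item generation: a fixed stem with slots filled from
# value lists, producing many item variants with computed keys. Illustrative only.
import itertools
import random

template = "A train travels {distance} km in {hours} hours. What is its average speed in km/h?"

def generate_items(distances, hours_list, n=3):
    pool = list(itertools.product(distances, hours_list))
    items = []
    for distance, hours in random.sample(pool, n):
        stem = template.format(distance=distance, hours=hours)
        key = round(distance / hours, 1)
        items.append({"stem": stem, "key": key})
    return items

for item in generate_items([120, 180, 240, 300], [2, 3, 4]):
    print(item)
```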

Gamification

Low-stakes assessments like formative quizzes in eLearning platforms are ripe for this.  Students can earn points, not just in the sense of test scores, but perhaps something like earning coins in a video game, and gaining levels.  They might even have an avatar that can be equipped with cool gear that the student can win.

Simulations

If you want to assess how somebody performs a task, it used to be that you had to fly them in.  For example, I used to work on ophthalmic exams where they would fly candidates into a clinic once a year to do certain tasks while physicians were watching and grading.  Now, many professions offer simulations of performance tests.

Workflow management

Items are the basic building blocks of the assessment.  If they are not high quality, everything else is a moot point.  There need to be formal processes in place to develop and review test questions, and you should be using item banking software that helps you manage this process.

Linking

Linking and equating refer to the process of statistically determining comparable scores on different forms of an exam, including tracking a scale across years and completely different sets of items.  If you have multiple test forms or track performance across time, you need this.  IRT provides far superior methodologies for it.
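For a sense of what linking involves, here is a small sketch of one classical method (mean/sigma linking), which uses the IRT difficulty parameters of items common to two forms to put the new form on the old form's scale; the parameter values are invented for illustration.

```python
# Mean/sigma linking sketch: find the linear transformation that places the new
# form's IRT difficulty (b) parameters on the reference form's scale.
import numpy as np

b_reference = np.array([-1.2, -0.4, 0.3, 0.9, 1.6])   # common items, reference scale
b_new_form  = np.array([-0.9, -0.1, 0.5, 1.2, 1.8])   # same items, new form scale

A = b_reference.std(ddof=1) / b_new_form.std(ddof=1)
B = b_reference.mean() - A * b_new_form.mean()

b_linked = A * b_new_form + B   # new-form parameters on the reference scale
print(f"A = {A:.3f}, B = {B:.3f}")
```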

Automated test assembly

The assembly of test forms – selecting items to match blueprints – can be incredibly laborious.  That’s why we have algorithms to do it for you.  Check out  TestAssembler.
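To give a flavor of what such algorithms do, here is a toy greedy sketch that fills a content blueprint while preferring more discriminating items; real engines such as TestAssembler rely on formal optimization with many more constraints.

```python
# Toy greedy automated test assembly: fill each blueprint category with the
# required number of items, preferring higher-discrimination items.
def assemble_form(item_bank, blueprint):
    """item_bank: list of dicts with 'id', 'domain', 'discrimination'.
    blueprint: dict of domain -> number of items required."""
    form = []
    for domain, needed in blueprint.items():
        candidates = [i for i in item_bank if i["domain"] == domain]
        candidates.sort(key=lambda i: i["discrimination"], reverse=True)
        form.extend(candidates[:needed])
    return form

bank = [{"id": 1, "domain": "algebra", "discrimination": 0.9},
        {"id": 2, "domain": "algebra", "discrimination": 0.5},
        {"id": 3, "domain": "geometry", "discrimination": 0.7}]
print([i["id"] for i in assemble_form(bank, {"algebra": 1, "geometry": 1})])  # [1, 3]
```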

Item/Distractor analysis

If you are using items with selected responses (including multiple choice, multiple response, and Likert), a distractor/option analysis is essential to determine if those basic building blocks are indeed up to snuff.  Our reporting platform in FastTest, as well as software like Iteman and Xcalibre, is designed for this purpose.
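A simplified sketch of what a distractor analysis computes is below: the proportion of examinees choosing each option and the option-total point-biserial correlation (the keyed option should correlate positively with total score, distractors negatively).  This is illustrative only, not Iteman's output format.

```python
# Option-level statistics for one item: proportion choosing each option and the
# option-total point-biserial correlation. Illustrative sketch only.
import numpy as np

def option_stats(choices, totals, options=("A", "B", "C", "D")):
    choices, totals = np.asarray(choices), np.asarray(totals, dtype=float)
    stats = {}
    for opt in options:
        chose = (choices == opt).astype(float)
        p = chose.mean()
        rpb = np.corrcoef(chose, totals)[0, 1] if 0 < p < 1 else float("nan")
        stats[opt] = {"prop": round(float(p), 2), "rpb": round(float(rpb), 2)}
    return stats

print(option_stats(["A", "A", "B", "C", "A", "D"], [28, 31, 17, 15, 26, 12]))
```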

Item response theory (IRT)

This is the modern paradigm for developing large-scale assessments.  Most of the important exams in the world over the past 40 years have used it, across all areas of assessment: licensure, certification, K12 education, postsecondary education, language, medicine, psychology, pre-employment… the trend is clear, and for good reason.  It will improve assessment.

Automated essay scoring

This technology has become more widely available as a way to improve assessment.  If your organization scores large volumes of essays, you should probably consider it.  Learn more about it here.  There was a Kaggle competition on it in the past.

Computerized adaptive testing (CAT)

Tests should be smart.  CAT makes them so.  Why waste vast amounts of examinee time on items that don’t contribute to a reliable score, and just discourage the examinees?  There are many other advantages too.

The Partnership for Assessment of Readiness for College and Careers (PARCC) is a consortium of US states working together to develop educational assessments aligned with the Common Core State Standards.  This is a daunting task, and PARCC is doing an admirable job, especially with its focus on utilizing technology.  However, one of the new item types has a serious psychometric fault that deserves a caveat with regard to scoring.

The item type is the “Evidence-Based Selected-Response” (PARCC EBSR) item format, commonly called a Part A/B item or Two-Part item.  The goal of this format is to delve deeper into student understanding and award credit for deeper knowledge while minimizing the impact of guessing.  This is obviously an appropriate goal for assessment.  To do so, the item is presented to the student in two parts, where the first part asks a simple question and the second part asks for supporting evidence for the answer to Part A.  Students must answer Part A correctly to receive credit on Part B.  As described on the PARCC website:


In order to receive full credit for this item, students must choose two supporting facts that support the adjective chosen for Part A. Unlike tests in the past, students may not guess on Part A and receive credit; they will only receive credit for the details they’ve chosen to support Part A.


While this makes sense in theory, it leads to problems in data analysis, especially when using item response theory (IRT).  The dependence of Part B on Part A obviously violates the fundamental IRT assumption of local independence (that items must not depend on each other).  So when working with a client of mine, we decided to combine the two parts into one multi-point question, which matches the theoretical approach PARCC EBSR items are taking.  The goal was to calibrate the item with Muraki’s generalized partial credit model (GPCM), which is typically used to analyze polytomous items in K12 assessment (learn more here).  The GPCM tries to order students based on the points they earn: 0-point students tend to have the lowest ability, 1-point students moderate ability, and 2-point students the highest ability.  The polytomous category response functions (CRFs) then approximate those groups, and the model estimates thresholds: the points on the ability scale that separate 0-point from 1-point students, and 1-point from 2-point students.  These thresholds typically fall where adjacent CRFs cross.
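For readers unfamiliar with the GPCM, here is a minimal sketch of its category response functions for a 0/1/2-point item; the a and b values below are invented for illustration, and the thresholds b are the ability points where adjacent CRFs cross.

```python
# Minimal GPCM sketch: category probabilities P(X = 0..K | theta) for an item
# with discrimination a and thresholds b. Values are invented for illustration.
import numpy as np

def gpcm_probs(theta, a, b):
    theta = np.atleast_1d(theta)[:, None]                    # column of abilities
    steps = a * (theta - np.asarray(b)[None, :])             # a(theta - b_j), j = 1..K
    cum = np.concatenate([np.zeros_like(theta), np.cumsum(steps, axis=1)], axis=1)
    num = np.exp(cum)
    return num / num.sum(axis=1, keepdims=True)

theta_grid = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gpcm_probs(theta_grid, a=1.0, b=[-0.5, 0.5]).round(2))
# Each row sums to 1; the middle (1-point) category should peak between thresholds.
```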

The first thing we noticed was that some point levels had very small sample sizes.  Suppose that Part A is worth 1 point and Part B is worth 1 point (select two evidence pieces, but you must get both).  Most students will get 0 points or 2 points.  Not many will receive 1: the only way to earn 1 point is to answer Part A correctly (perhaps by guessing) but then select no correct evidence or only one correct piece.  This leads to calibration issues with the GPCM.
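To make the scoring rule concrete, here is a hypothetical sketch of the Part A/B logic under the assumption above (Part A worth 1 point; Part B worth 1 point only if both evidence pieces are selected and Part A is correct).  The function and argument names are my own, not PARCC's algorithm.

```python
# Hypothetical EBSR scoring rule: Part B credit only counts if Part A is correct.
def score_ebsr(part_a_response, part_b_responses, key_a, key_b):
    """Return 0-2 points for a Part A/B item."""
    a_correct = part_a_response == key_a
    b_correct = set(part_b_responses) == set(key_b)   # both evidence pieces required
    if not a_correct:
        return 0            # no credit for evidence without the correct Part A answer
    return 1 + int(b_correct)

# Example: Part A correct, only one of two evidence pieces correct -> 1 point
print(score_ebsr("B", ["2", "5"], key_a="B", key_b=["2", "4"]))  # 1
```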

However, even when there was sufficient N at each level, we found that the GPCM had terrible fit statistics, meaning that the item was not performing according to the model described above.  So I ran Iteman, our classical analysis software, to obtain quantile plots that approximate the polytomous IRFs without imposing the GPCM.  I found that the 0-2 point items tend to have the issue where not many students get 1 point, and moreover the line for that group is relatively flat.  The GPCM assumes that the middle category is relatively bell-shaped, so it looks for the drop-offs in that bell shape, where it crosses the adjacent CRFs – the thresholds – and they aren’t there.  The GPCM would blow up, usually not even estimating the thresholds in the correct ordering.

[Figure: PARCC EBSR graphs]
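The quantile-plot check itself requires no IRT model; a rough sketch is below, computing the proportion of examinees at each point level within total-score groups.  If the 1-point row stays low and flat across groups, the GPCM thresholds have nothing to latch onto.  Column names are hypothetical.

```python
# Empirical (model-free) category proportions by total-score group, approximating
# the quantile plots discussed above. Column names are hypothetical.
import pandas as pd

def empirical_crfs(df, item_col, total_col, n_groups=10):
    df = df.copy()
    df["score_group"] = pd.qcut(df[total_col], q=n_groups, labels=False, duplicates="drop")
    # Rows: score groups; columns: item points earned (0/1/2); cells: proportions.
    return pd.crosstab(df["score_group"], df[item_col], normalize="index")
```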

So I tried to think of this from a test development perspective.  How do students get 1 point on these PARCC EBSR items?  The only way to do so is to get Part A right but not Part B.  Given that Part B asks for the evidence behind Part A, this group consists of students who answer Part A correctly but do not know the reason – which means they are likely guessing.  It is then no surprise that the data for 1-point students form a flat line; it behaves just like the c parameter in the 3PL.  So the GPCM will have an extremely tough time estimating threshold parameters.
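For reference, the three-parameter logistic (3PL) model is shown below; the lower asymptote c is the probability that a very low-ability examinee answers correctly, typically by guessing, which is exactly the flat-line behavior described above.

```latex
P(X = 1 \mid \theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}
```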

From a psychometric perspective, point levels are supposed to represent different levels of ability.  A 1-point student should be of higher ability than a 0-point student on this item, and a 2-point student of higher ability than a 1-point student.  This seems obvious and intuitive.  But this item, by definition, violates that first statement.  The only way to get 1 point is to guess the first part – which means those students do not know the answer and are no different from the 0-point examinees.  So of course the 1-point results look funky here.

We then calibrated the items as two separate dichotomous items rather than one polytomous item, and the statistics turned out much better.  This still violates the IRT assumption of local independence, but it at least produces usable IRT parameters that can score students.  Nevertheless, I think the scoring of these items needs to be revisited so that the algorithm produces data that can actually be calibrated with IRT.  The entire goal of test items is to provide data points used to measure students; if an item is not providing usable data, then it is not worth using, no matter how good it seems in theory!