Assessment Reliability and Validity

What are we really assessing here? Fine-tuning the reliability and validity of scores.

So you’ve built an assessment.

You’ve designed this assessment to be a valuable measurement tool. Perhaps it will help you select the best and the brightest from among your pool of candidates for a job. Or maybe your assessment will determine whether a person is qualified to be credentialed in their line of work. Whether you’re measuring the budding abilities of students in a classroom or gauging the talents of professionals in your industry, can you trust your scores to tell you what you need to know about your examinees?

Turns out, if you know your scores are reliable and valid, then yes, you can. I sat down with Jordan Stoeger, Measurement Specialist at Assessment Systems, to discuss how fine-tuning reliability and validity can improve assessments. Jordan often navigates invalid and unreliable testing data. Her job is to pinpoint the weak spots in assessments and strengthen them through proven methods of psychometric analysis.

Auditing the quality of your assessment starts at the foundation.

“Items are the building blocks of good tests. If we have poor items, we automatically have a poor measurement,” Jordan explains. “Removing bad items increases the reliability and validity of test scores and improves the caliber of the measurement tool.”

So how does one identify bad items? A test and item analysis based on Classical Test Theory (CTT) provides sound psychometric evidence for spotting them, which in turn helps increase the validity and reliability of scores. The analysis focuses on two statistics: the p-value and the point biserial.

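The p-value gives an indication of how hard or easy an item is: it is the proportion of examinees who answered the item correctly (not to be confused with the p-value of significance testing). Items with high p-values are considered easy, whereas items with low p-values are considered hard.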
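As a rough illustration, the p-value can be computed by hand from a scored response matrix. The sketch below assumes responses have already been scored as 1 (correct) or 0 (incorrect); the function name and the sample data are made up for this example and are not taken from Iteman or CITAS.

```python
def item_p_values(scored_responses):
    """Return the proportion of examinees answering each item correctly.

    scored_responses: one row per examinee, one 0/1 entry per item
    (hypothetical layout chosen for this sketch).
    """
    n_examinees = len(scored_responses)
    n_items = len(scored_responses[0])
    return [
        sum(row[j] for row in scored_responses) / n_examinees
        for j in range(n_items)
    ]


# Four examinees, three items (made-up data).
scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

p_values = item_p_values(scored)
for j, p in enumerate(p_values, start=1):
    print(f"Item {j}: p = {p:.2f}")
```

In this toy data set everyone answers the first item correctly, so it is the easiest, while only one examinee in four gets the third item right, making it the hardest.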

“Regarding validity, it is often valuable to have items spanning a range of difficulties, but items that are too hard or too easy provide little to no information about examinees,” Jordan says. “These items, items with very high or low p-values, contribute very little information about examinee’s ability or individual differences to the overall test score. Removing them from the overall test score corrects for this and makes scores more valid and reliable.”
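One way to put this into practice is to flag items whose p-values fall outside a chosen band. The sketch below is only illustrative: the cutoffs of 0.20 and 0.95 are assumptions picked for the example, not values recommended by Jordan or by any particular software, and the right band depends on the purpose of the test.

```python
def flag_extreme_items(p_values, too_hard=0.20, too_easy=0.95):
    """Label items whose p-value is below too_hard or above too_easy.

    The default cutoffs are hypothetical and should be tuned per test.
    Returns a dict mapping item index to a short label.
    """
    flags = {}
    for index, p in enumerate(p_values):
        if p < too_hard:
            flags[index] = "too hard"
        elif p > too_easy:
            flags[index] = "too easy"
    return flags


# Made-up p-values for four items.
flags = flag_extreme_items([0.98, 0.62, 0.15, 0.45])
print(flags)
```

Flagged items are candidates for review rather than automatic removal; a content expert would normally look at each one before it is dropped from scoring.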

An item’s point biserial (Rpbis) is the correlation between an examinee’s propensity to endorse the correct answer (or key) and their overall score. Like all correlations, it ranges from -1 to +1, with -1 being a perfect negative correlation and +1 being a perfect positive correlation.

“An item with a Rpbis less than 0 indicates that examinees with a lower overall test score were more likely to endorse the key,” Jordan says. “This is a clear indication of a poor item.”
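To make the statistic concrete, here is a minimal sketch that computes the point biserial as the Pearson correlation between each item's 0/1 scores and examinees' total scores, then lists the items with a negative value. It assumes the total score includes the item itself; the function name and data are invented for illustration, and dedicated software may apply refinements not shown here.

```python
from math import sqrt


def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total scores.

    Returns None when either variable has no variance, since the
    correlation is undefined in that case.
    """
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Five examinees, four items (made-up data).
scored = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
totals = [sum(row) for row in scored]

rpbis = [
    point_biserial([row[j] for row in scored], totals)
    for j in range(len(scored[0]))
]

# Items where lower scorers were more likely to endorse the key.
negative_items = [j for j, r in enumerate(rpbis) if r is not None and r < 0]
print(negative_items)
```

In this toy data the last item is answered correctly only by the two lowest-scoring examinees, so its point biserial comes out negative and it is the one flagged for review.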

To compute these statistics, you’ll need to run your items through psychometric software. Iteman, a SaaS from Assessment Systems, imposes no limits on the number of items or variables you can analyze. It is also one of the easiest to use, with a graphical user interface (GUI) for a quick and efficient experience. Competitors to Iteman don’t use a GUI and require knowledge of coding, or they impose limits on the amount of data you can process.

If you’d like a swift introduction to this process, CITAS, a free Excel spreadsheet, is available for educational purposes. It limits the number of items and responses but can act as an incredible learning tool for smaller data sets.

“Running a test and item analysis using software like Iteman makes scores more reliable and valid by providing us with psychometric evidence to remove poor performing items,” Jordan says. “The goal of a test is to provide information about individual examinees. By removing these poor performing items, we update scoring to better reflect a measure of examinee’s abilities or individual differences.”


