Posts on psychometrics: The Science of Assessment


The two terms Norm-Referenced and Criterion-Referenced are commonly used to describe tests, exams, and assessments.  They are often some of the first concepts learned when studying assessment and psychometrics.

Norm-referenced means that we are referencing how your score compares to other people.  Criterion-referenced means that we are referencing how your score compares to a criterion such as a cutscore or a body of knowledge.

Do we say a test is “Norm-Referenced” vs. “Criterion-Referenced”?


Actually, that’s a slight misuse.

The terms Norm-Referenced and Criterion-Referenced refer to score interpretations.  Most tests can actually be interpreted in both ways, though they are usually designed and validated for only one or the other.  More on that later.

Hence the shorthand of saying “this is a norm-referenced test,” even though that just means norm-referencing is the primary intended interpretation.

Examples of Norm-Referenced vs. Criterion-Referenced

Suppose you received a score of 90% on a Math exam in school.  This could be interpreted in both ways.  If the cutscore was 80%, you clearly passed; that is the criterion-referenced interpretation.  If the average score was 75%, then you performed at the top of the class; this is the norm-referenced interpretation.  Same test, both interpretations are possible, and in this case both are valid.

What if the average score was 95%?  Well, that changes your norm-referenced interpretation (you are now below average) but the criterion-referenced interpretation does not change.
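The two interpretations can be sketched in a few lines of Python, using the hypothetical numbers from the example above:

```python
# Hypothetical math exam from the example above
score = 0.90       # examinee's score (90%)
cutscore = 0.80    # criterion: the fixed passing standard
class_mean = 0.75  # norm: the average score of the group

# Criterion-referenced interpretation: compare to the cutscore
passed = score >= cutscore           # True: you passed

# Norm-referenced interpretation: compare to other examinees
above_average = score > class_mean   # True: top of the class

# If the class average rises to 95%, only the norm-referenced
# interpretation changes; the pass/fail decision does not.
class_mean = 0.95
above_average = score > class_mean   # now False
passed = score >= cutscore           # still True
```

Note that changing the cutscore would change the criterion-referenced decision regardless of how the rest of the group performs, which is exactly the distinction between the two interpretations.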

Now consider a certification exam.  This is an example of a test that is specifically designed to be criterion-referenced.  It is supposed to measure that you have the knowledge and skills to practice in your profession.  It doesn’t matter whether all candidates pass or only a few candidates pass; the cutscore is the cutscore.

However, you could still interpret your score by looking at your percentile rank compared to other examinees; it just doesn’t impact the cutscore.

On the other hand, we have an IQ test.  There is no criterion-referenced cutscore for whether you are “smart” or “passed.”  Instead, the scores are placed on a normal curve scaled to mean=100 and SD=15, and all interpretations are norm-referenced.  Namely, where do you stand compared to others?  The T score and z-score scales are norm-referenced, as are percentiles.  So are many tests in the world, like the SAT with a mean of 500 and SD of 100.
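All of these norm-referenced scales are simply linear transformations of the z-score. A minimal sketch in Python (the function name is my own):

```python
def scale_score(z, mean, sd):
    """Convert a z-score to a norm-referenced scale score."""
    return mean + sd * z

z = 1.0                         # one standard deviation above average
iq = scale_score(z, 100, 15)    # IQ-style scale: 115
t = scale_score(z, 50, 10)      # T score: 60
sat = scale_score(z, 500, 100)  # SAT-style scale: 600
```

Whatever the scale, the interpretation is the same: the score tells you where you stand relative to the norm group.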

Is this impacted by item response theory (IRT)?

If you have looked at item response theory (IRT), you know that it scores examinees on what is effectively the standard normal curve (though this is shifted in the Rasch approach).  But IRT-scored exams can still be criterion-referenced: an exam can still be designed to measure a specific body of knowledge and have a cutscore that is fixed and stable over time.

Even computerized adaptive testing can be used like this.  An example is the NCLEX exam for nurses in the United States.  It is an adaptive test, but the cutscore is -0.18 (NCLEX-PN on Rasch scale) and it is most definitely criterion-referenced.
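At its core, a criterion-referenced decision on an IRT scale is just a comparison of the ability estimate against the fixed cutscore. A simplified sketch in Python, using the -0.18 value cited above and ignoring the confidence-interval stopping rules that an adaptive test actually applies:

```python
def classify(theta_hat, cutscore=-0.18):
    """Pass/fail decision against a fixed cutscore on the theta scale.

    The cutscore stays the same no matter how the candidate pool
    performs - that is what makes the interpretation criterion-referenced.
    """
    return "pass" if theta_hat >= cutscore else "fail"

classify(0.40)   # "pass"
classify(-1.20)  # "fail"
```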

Building and validating an exam

The process of developing a high-quality assessment is surprisingly difficult and time-consuming. The greater the stakes, volume, and incentives for stakeholders, the more effort that goes into developing and validating.  ASC’s expert consultants can help you navigate these rough waters.

Want to develop smarter, stronger exams?

Contact us to request a free account in our world-class platform, or talk to one of our psychometric experts.

 

point biserial discrimination

The item-total point-biserial correlation is a common psychometric index regarding the quality of a test item, namely how well it differentiates between examinees with high vs low ability.

What is item discrimination?

While the word “discrimination” has a negative connotation, it is actually a really good thing for an item to have.  It means that it is differentiating between examinees, which is entirely the reason that an assessment item exists.  If a math item on Fractions is good, then students with good knowledge of fractions will tend to get it correct, while students with poor knowledge will get it wrong.  If this isn’t the case, and the item is essentially producing random data, then it has no discrimination.  If the reverse is the case, then the discrimination will be negative.  This is a total red flag; it means that good students are getting the item wrong and poor students are getting it right, which almost always means that there is incorrect content or the item is miskeyed.

What is the point-biserial correlation?

The point-biserial coefficient is a Pearson correlation between scores on the item (usually 0=wrong and 1=correct) and the total score on the test.  As such, it is sometimes called an item-total correlation.

Consider the example below.  There are 10 examinees that got the item wrong, and 10 that got it correct.  The scores are definitely higher for the Correct group.  If you fit a regression line, it would have a positive slope.  If you calculated a correlation, it would be around 0.10.

point biserial discrimination

How do you calculate the point-biserial?

Since it is a Pearson correlation, you can easily calculate it with the CORREL function in Excel or similar software.  Of course, psychometric software like Iteman will also do it for you, and many more important things besides (e.g., the point-biserial for each of the incorrect options!).  This is an important step in item analysis.  The image below is example output from Iteman, where Rpbis is the point-biserial.  This item is very good, as it has a very high point-biserial for the correct answer and strongly negative point-biserials for the incorrect answers (which means the not-so-smart students are selecting them).

FastTest Iteman Psychometric Analysis

How do you interpret the point-biserial?

Well, most importantly consider the points above about near-zero and negative values.  Besides that, a minimal-quality item might have a point-biserial of 0.10, a good item of about 0.20, and strong items 0.30 or higher.  But, these can vary with sample size and other considerations.  Some constructs are easier to measure than others, which makes item discrimination higher.

Are there other indices?

There are two other indices commonly used in classical test theory.  There is the cousin of the point-biserial, the biserial correlation.  There is also the top/bottom coefficient, where the sample is split into a high-performing group and a low-performing group based on total score, the P value (proportion correct) is calculated for each, and the two are subtracted.  So if 85% of top examinees got the item right and 60% of bottom examinees got it right, the index would be 0.25.
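A quick sketch of the top/bottom coefficient in Python, using hypothetical groups that reproduce the 85%/60% example:

```python
def top_bottom_index(top_group, bottom_group):
    """Difference in P values (proportion correct) between the
    high-scoring and low-scoring groups on a single item."""
    p_top = sum(top_group) / len(top_group)
    p_bottom = sum(bottom_group) / len(bottom_group)
    return p_top - p_bottom

top = [1] * 17 + [0] * 3     # 85% of the top group answered correctly
bottom = [1] * 12 + [0] * 8  # 60% of the bottom group answered correctly
top_bottom_index(top, bottom)  # approximately 0.25, as in the example
```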

Of course, there is also the a parameter from item response theory.  There are a number of advantages to that approach, most notably that the classical indices try to fit a linear model on something that is patently nonlinear.  For more on IRT, I recommend a book like Embretson & Reise (2000).
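For context, here is a minimal sketch of the two-parameter logistic (2PL) model in Python (omitting the 1.7 scaling constant that some formulations include), showing how the a parameter controls the steepness of the curve:

```python
import math

def p_correct_2pl(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: probability that an
    examinee with ability theta answers the item correctly, where
    a is the discrimination (slope) and b the difficulty (location)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# A higher a parameter means a steeper curve around the difficulty b,
# i.e., the item separates nearby ability levels more sharply.
p_correct_2pl(0.5, a=2.0, b=0.0)  # vs.
p_correct_2pl(0.5, a=0.5, b=0.0)  # flatter curve, weaker discrimination
```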

The California Department of Human Resources (CalHR, calhr.ca.gov/) has selected Assessment Systems Corporation (ASC, assess.com) as its vendor for an online assessment platform. CalHR is responsible for the personnel selection and hiring of many job roles for the State, and delivers hundreds of thousands of tests per year to job applicants. CalHR seeks to migrate to a modern cloud-based platform that allows it to manage large item banks, quickly publish new test forms, and deliver large-scale assessments that align with modern psychometrics like item response theory (IRT) and computerized adaptive testing (CAT).

Assess.ai as a solution

ASC’s landmark assessment platform Assess.ai was selected as a solution for this project. ASC has been providing computerized assessment platforms with modern psychometric capabilities since the 1980s, and released Assess.ai in 2019 as a successor to its industry-leading platform FastTest. It includes modules for item authoring, item review, automated item generation, test publishing, online delivery, and automated psychometric reporting.

Read the full article here.


Online proctoring software refers to platforms that proctor educational or professional assessments (exams or tests) when the proctor is not in the same room as the examinee.  This means that it is done with a video stream or recording using a webcam and sometimes an additional device, which are monitored by a human and/or AI.  It is also referred to as remote proctoring or invigilation. Online proctoring offers a compelling alternative to in-person proctoring, somewhere in between unproctored at-home tests and tests delivered at an expensive testing center in an office building.  This makes it a perfect fit for medium-stakes exams, such as university placement, pre-employment screening, and many types of certification/licensure tests.

What are the types of online proctoring?

There are many types of online proctoring software on the market, spread across dozens of vendors, especially new ones that sought to capitalize on the pandemic and were not involved with assessment beforehand.  With so many options, how can you effectively select amongst the types of remote proctoring? There are four types of remote proctoring platforms, which can be adapted to a particular use case, sometimes varying between different tests in a single organization.  ASC supports all four types, and partners with five different vendors to help provide the best solution to our clients.  In descending order of security:

Each approach is described below, in terms of what it entails for you and what it entails for the candidate.

Live with professional proctors

  For you:
  • You register a set of examinees in FastTest, and tell us when they are to take their exams and under what rules.
  • We provide the relevant information to the proctors.
  • You send all the necessary information to your examinees.
  • This is the most secure of the types of remote proctoring.
  For the candidate:
  • Examinee goes to ascproctor.com, where they will initiate a chat with a proctor.
  • After confirmation of their identity and workspace, they are provided information on how to take the test.
  • The proctor then watches a video stream from their webcam as well as a phone on the side of the room, ensuring that the environment is secure. They do not see the screen, so your exam content is not exposed. They maintain exam invigilation continuously.
  • When the examinee is finished, they notify the proctor, and are excused.

Live, bring your own proctor (BYOP)

  For you:
  • You upload examinees into FastTest, which will generate links.
  • You send relevant instructions and the links to examinees.
  • Your staff logs into the admin portal and awaits examinees.
  • Videos with AI flagging are available for later review if needed.
  For the candidate:
  • Examinee clicks on a link, which launches the proctoring software.
  • An automated system check is performed.
  • The proctoring is launched.  Proctors ask the examinee to provide identity verification, then launch the test.
  • Examinee is watched on the webcam and screencast.  AI algorithms help to flag irregular behavior.
  • Examinee concludes the test.

Record and Review (with option for AI)

  For you:
  • You upload examinees into FastTest, which will generate links.
  • You send relevant instructions and the links to examinees.
  • After examinees take the test, your staff (or ours) logs in to review all the videos and report on any issues.  AI will automatically flag irregular behavior, making your reviews more time-efficient.
  For the candidate:
  • Examinee clicks on a link, which launches the proctoring software.
  • An automated system check is performed.
  • The proctoring is launched.  The system asks the examinee to provide identity verification, then launches the test.
  • Examinee is recorded on the webcam and screencast.  AI algorithms help to flag irregular behavior.
  • Examinee concludes the test.

AI only

  For you:
  • You upload examinees into FastTest, which will generate links.
  • You send relevant instructions and the links to examinees.
  • Videos are stored for 1 month if you need to check any.
  For the candidate:
  • Examinee clicks on a link, which launches the proctoring software.
  • An automated system check is performed.
  • The proctoring is launched.  The system asks the examinee to provide identity verification, then launches the test.
  • Examinee is recorded on the webcam and screencast.  AI algorithms help to flag irregular behavior.
  • Examinee concludes the test.

 

Some case studies for different types of exams

We’ve worked with all types of remote proctoring software, across many types of assessment:

  • ASC delivers high-stakes certification exams for a number of certification boards, in multiple countries, using live proctoring with professional proctors.  Some of these are available continuously on-demand, while others are on specific days where hundreds of candidates log in.
  • We partnered with a large university in South America, where their admissions exams were delivered using Bring Your Own Proctor, enabling them to drastically reduce costs by utilizing their own staff.
  • We partnered with a private company to provide AI-enhanced record-and-review proctoring for applicants, where ASC staff reviews the results and provides a report to the client.
  • We partner with an organization that delivers civil service exams for a country, utilizing both unproctored and AI-only proctoring, differing across a range of exam titles.

 

Finding the Best Online Proctoring Software: Two Distinct Markets

First, I would describe the online proctoring industry as actually falling into two distinct markets, so the first step is to determine which of these fits your organization:

  1. Large scale, lower cost (when large scale), lower security systems designed to be used only as a plugin to major LMS platforms like Blackboard or Canvas. These systems are therefore designed for medium-stakes exams like an Intro to Psychology midterm at a university.
  2. Lower scale, higher cost, higher security systems designed to be used with standalone assessment platforms. These are generally for higher-stakes exams like certification or workforce, or perhaps special use at universities like Admissions and Placement exams.

How to tell the difference? The first type will advertise easy integration with systems like Blackboard or Canvas as a key feature. They will also often focus on AI review of videos, rather than using real humans. Another key consideration is to look at the existing client base, which is often advertised.

Other ways that online proctoring software can differ

Screen capture:

Some online proctoring providers have an option to record/stream the screen as well as the webcam. Some also provide the option to only do this (no webcam) for lower stakes exams.

Mobile phone as the second camera:

Some newer platforms provide the option to easily integrate the examinee’s mobile phone as a second camera (a third stream, if you include screen capture), which effectively serves the role of an in-room proctor. Examinees will be instructed to use the video to show under the table, behind the monitor, etc., before starting the exam. They might then be instructed to stand the phone up 2 meters away with a clear view of the entire room while the test is being delivered.  This is in addition to the webcam.

API integrations:

Some systems require software developers to set up an API integration with your LMS or assessment platform. Others are more flexible, and you can just log in yourself, upload a list of examinees, and you are all set.

On-Demand vs. Scheduled:

Some platforms involve the examinee scheduling a time slot. Others are purely on-demand, and the examinee can show up whenever they are ready. MonitorEDU is a prime example of this: examinees show up at any time, present their ID to a live human, and are then started on the test immediately – no downloads/installs, no system checks, no API integrations, nothing.  

More security: A better test delivery software

A good testing delivery platform will also come with its own functionality to enhance test security: randomization, automated item generation, computerized adaptive testing, linear-on-the-fly testing, professional item banking, item response theory scoring, scaled scoring, psychometric analytics, equating, lockdown delivery, and more. In the context of online proctoring, perhaps the most salient is the lockdown delivery. In this case, the test will completely take over the examinee’s computer and they can’t use it for anything else until the test is done.

LMS systems rarely include any of this functionality, because it is not needed for a midterm exam in Intro to Psychology. However, most assessments in the world that have real stakes – university admissions, certifications, workforce hiring, etc. – depend heavily on such functionality. It’s not just out of habit or tradition, either. Such methods are considered essential by international standards including AERA/APA/NCME, ITC, and NCCA.

ASC’s preferred online proctoring partners

ASC’s online assessment platforms are integrated with some of the leading remote proctoring software providers.

Type Vendors
Live MonitorEDU
AI Alemira, Sumadi, ProctorFree
Record and Review Alemira, ProctorFree
Bring Your Own Proctor Alemira

 

List of Online Proctoring Software Providers

Looking to evaluate potential vendors?  Here is a great place to start.

# Name Website Country Proctor Service
1 Aiproctor https://www.aiproctor.com/ USA AI
2 Centre Based Test (CBT) https://www.conductexam.com/center-based-online-test-software India Live, Record and Review
3 Class in Pocket classinpocket.com (Website now defunct) India AI
4 Datamatics https://www.datamatics.com/industries/education-technology/proctoring India AI, Live, Record and Review
5 DigiProctor https://www.digiproctor.com India AI
6 Disamina https://disamina.in/ India AI
7 Examity https://www.examity.com/ USA Live
8 ExamMonitor https://examsoft.com/ USA Record and Review
9 ExamOnline https://examonline.in/remote-proctoring-solution-for-employee-hiring/ India AI, Live
10 Eduswitch https://eduswitch.com/  India AI
11 Examus https://examus.com Russia AI, Bring Your Own Proctor, Live
12 EasyProctor https://www.excelsoftcorp.com/products/assessment-and-proctoring-solutions/ India AI, Live, Record and Review
13 HonorLock https://honorlock.com/ USA AI, Record and Review
14 Internet Testing Systems https://www.testsys.com/ USA Bring Your Own Proctor
15 Invigulus https://www.invigulus.com/ USA AI, Live, Record and Review
16 Iris Invigilation https://www.irisinvigilation.com/ Australia AI
17 Mettl https://mettl.com/en/online-remote-proctoring/ India AI, Live, Record and Review
18 MonitorEdu https://monitoredu.com/proctoring USA Live
19 OnVUE https://home.pearsonvue.com/Test-takers/OnVUE-online-proctoring.aspx USA Live
20 Oxagile https://www.oxagile.com/competence/edtech-solutions/proctoring/ USA AI, Live, Record and Review
21 Parakh https://parakh.online/blog/remote-proctoring-ultimate-solution-for-secure-online-exam India AI, Live, Record and Review
22 ProctorFree https://www.proctorfree.com/ USA AI, Live
23 Proctor360 https://proctor360.com/ USA AI, Bring Your Own Proctor, Live, Record and Review
24 ProctorEDU https://proctoredu.com/ Russia AI, Live, Record and Review
25 ProctorExam https://proctorexam.com/ Netherlands Bring Your Own Proctor, Live, Record and Review
26 Proctorio https://proctorio.com/products/online-proctoring USA AI, Live
27 Proctortrack https://www.proctortrack.com/ USA AI, Live
28 ProctorU https://www.proctoru.com/ USA AI, Live, Record and Review
29 Proview https://www.proview.io/en USA AI, Live
30 PSI Bridge https://www.psionline.com/en-gb/platforms/psi-bridge/ USA Live, Record and Review
31 Respondus Monitor https://web.respondus.com/he/monitor/ USA AI, Live, Record and Review
32 Rosalyn https://www.rosalyn.ai/ USA AI, Live
33 SmarterProctoring https://smarterservices.com/smarterproctoring/ USA AI, Bring Your Own Proctor, Live
34 Sumadi https://sumadi.net/ Honduras AI, Live, Record and Review
35 Suyati https://suyati.com/ India AI, Live, Record and Review
36 TCS iON Remote Assessments https://www.tcsion.com/hub/remote-assessment-marking-internship/ India AI, Live
37 Think Exam https://www.thinkexam.com/remoteproctoring India AI, Live
38 uxpertise XP https://uxpertise.ca/en/uxpertise-xp/ Canada AI, Live, Record and Review
39 Proctor AI https://www.visive.ai/solutions/proctor-ai India AI, Live, Record and Review
40 Wise Proctor https://www.wiseattend.com/wise-proctor USA AI, Record and Review
41 Xobin https://xobin.com/online-remote-proctoring India AI
42 Youtestme https://www.youtestme.com/online-proctoring/ Canada AI, Live

 

How do I select a vendor?

First, determine the level of security necessary, and the trade-off with costs.  Live proctoring with professionals can cost $30 to $100 or more, while AI proctoring can be as little as a few dollars.  Then, evaluate some vendors to see which group they fall into; note that some vendors can do all of them!  Then, ask for some demos so you understand the business processes involved and the UX on the examinee side, both of which could substantially impact the soft costs for your organization.  Then, start negotiating with the vendor you want!

Want some more information?

Get in touch with us, we’d love to show you a demo or introduce you to partners!

Email solutions@assess.com.


HR assessment software is everywhere, but there is a huge range in quality, as well as a wide range in the type of assessment that it is designed for.

HR assessment platforms help companies create effective assessments, thus saving valuable resources, improving candidate experience & quality, providing more accurate and actionable information about human capital, and reducing hiring bias.  But, finding software solutions that can help you reap these benefits can be difficult, especially because of the explosion of solutions in the market.  If you are lost on which tools will help you develop and deliver your own HR assessments, this guide is for you.

 

What is HR assessment?

There are various types of assessments used in HR.  Here are four main areas, though this list is by no means exhaustive.

  1. Pre-employment tests to select candidates
  2. Post-training assessments
  3. Certificate or certification exams (can be internal or external)
  4. 360-degree assessments and other performance appraisals

 

Pre-employment tests

Finding good employees in an overcrowded market is a daunting task. In fact, according to research by CareerBuilder, 74% of employers admit to hiring the wrong employees. Bad hires are not only expensive, but can also adversely affect cultural dynamics in the workforce. This is one area where HR assessment software shows its value.

There are different types of pre-employment assessments. Each of them achieves a different goal in the hiring process. The major types of pre-employment assessments include:

Personality tests: Despite rapidly finding their way into HR, these types of pre-employment tests are widely misunderstood. Personality tests measure traits on the social and behavioral spectrum.  One of the main goals of these tests is to predict the success of candidates based on behavioral traits.

Aptitude tests: Unlike personality tests or emotional intelligence tests, which tend to lie on the social spectrum, aptitude tests measure problem-solving, critical thinking, and agility.  These types of tests are popular because they can predict job performance better than any other type, tapping into areas that cannot be found in resumes or job interviews.

Skills Testing: These kinds of tests can be considered a measure of job experience, ranging from high-end skills down to basic skills such as typing or Microsoft Excel. Skills tests can either measure specific skills, such as communication, or generalized skills, such as numeracy.

Emotional Intelligence tests: These kinds of assessments are a newer concept, but are becoming important in the HR industry. With strong Emotional Intelligence (EI) being associated with benefits such as improved workplace productivity and good leadership, many companies are investing heavily in developing these kinds of tests.  Although they can be administered to any candidate, it is recommended that they be reserved for people seeking leadership positions, or those expected to work in social contexts.

Risk tests: As the name suggests, these types of tests help companies reduce risks. Risk assessments offer assurance to employers that their workers will commit to established work ethics and not involve themselves in any activities that may cause harm to themselves or the organization.  There are different types of risk tests. Safety tests, which are popular in contexts such as construction, measure the likelihood of the candidate engaging in activities that can cause them harm. Other common types of risk tests include integrity tests.

 

Post-training assessments

This refers to assessments that are delivered after training.  It might be anything from a simple quiz after an eLearning module up to a certification exam after months of training (see next section).  Often, it is somewhere in between.  For example, you might spend an afternoon sitting through a training course, after which you take a formal test that is required to do something on the job.  When I was a high school student, I worked in a lumber yard, and did exactly this to become an OSHA-approved forklift driver.

 

Certificate or certification exams

Sometimes, the exam process can be high-stakes and formal.  It is then a certificate or certification exam, or sometimes a licensure exam.  More on that here.  This can be internal to the organization, or external.

Internal certification: The credential is awarded by the training organization, and the exam is specifically tied to a certain product or process that the organization provides in the market.  There are many such examples in the software industry.  You can get certifications in AWS, SalesForce, Microsoft, etc.  One of our clients makes MRI and other medical imaging machines; candidates are certified on how to calibrate/fix them.

External certification: The credential is awarded by an external board or government agency, and the exam is industry-wide.  An example of this is the SIE exams offered by FINRA.  A candidate might go to work at an insurance company or other financial services company, which trains them and sponsors them to take the exam, in hopes that the company will get a return when the candidate passes and then sells insurance policies as an agent.  But the company does not own the exam; FINRA does.

 

360-degree assessments and other performance appraisals

Job performance is one of the most important concepts in HR, and also one that is often difficult to measure.  John Campbell, one of my thesis advisors, was known for developing an 8-factor model of performance.  Some aspects are subjective, and some are easily measured by real-world data, such as number of widgets made or number of cars sold by a car salesperson.  Others involve survey-style assessments, such as asking customers, business partners, co-workers, supervisors, and subordinates to rate a person on a Likert scale.  HR assessment platforms are needed to develop, deliver, and score such assessments.

 

HR Assessment Software: The Benefits

Now that you have a good understanding of what pre-employment tests are, let’s discuss the benefits of integrating pre-employment assessment software into your hiring process. Here are some of the benefits:

Saves valuable resources

Unlike the lengthy and costly traditional hiring processes, pre-employment assessment software helps companies increase their ROI by eliminating HR snags such as face-to-face interactions or geographical restrictions. Pre-employment testing tools can also reduce the amount of time it takes to make good hires, while reducing the risk of facing the financial consequences of a bad hire.

Supports Data-Driven Hiring Decisions

Data runs the modern world, and hiring is no different. You are better off letting complex algorithms crunch the numbers and help you decide which talent is a fit, as opposed to hiring based on a hunch or less-accurate methods like an unstructured interview.  Pre-employment assessment software helps you analyze assessments and generate reports/visualizations to help you choose the right candidates from a large talent pool.

Improving candidate experience 

Candidate experience is an important aspect of a company’s growth, especially considering that 69% of candidates admit they would not apply for a job at a company after having a negative experience. A good candidate experience means you get access to the best talent in the world.

Elimination of Human Bias

Traditional hiring processes are based on instinct, and they are not effective since it’s easy for candidates to provide false information on their resumes and cover letters. The use of pre-employment assessment software has helped eliminate this hurdle. These tools level the playing field, so that only the best candidates are considered for a position.

 

What To Consider When Choosing HR assessment software

Now that you have a clear idea of what pre-employment tests are and the benefits of integrating pre-employment assessment software into your hiring process, let’s see how you can find the right tools.

Here are the most important things to consider when choosing the right pre-employment testing software for your organization.

Ease-of-use

The candidates should be your top priority when you are sourcing pre-employment assessment software, because ease of use directly correlates with a good candidate experience. Good software should have simple navigation and be easy to comprehend.

Here is a checklist to help you decide if pre-employment assessment software is easy to use;

  • Are the results easy to interpret?
  • What is the UI/UX like?
  • How does it automate tasks such as applicant management?
  • Does it have good documentation and an active community?

Tests Delivery (Remote proctoring)

Good online assessment software should feature good online proctoring functionality. This is because most remote jobs accept applications from all over the world. It is therefore advisable to choose pre-employment testing software that has secure remote proctoring capabilities. Here are some things you should look for in remote proctoring;

  • Does the platform support security processes such as IP-based authentication, lockdown browser, and AI-flagging?
  • What types of online proctoring does the software offer? Live real-time, AI review, or record and review?
  • Does it let you bring your own proctor?
  • Does it offer test analytics?

Test & data security, and compliance

Defensibility is the goal of test security. There are several layers of security associated with pre-employment testing. When evaluating this aspect, you should consider what the pre-employment testing software does to achieve the highest level of security; after all, data breaches are wildly expensive.

The first layer of security is the test itself. The software should support security technologies and frameworks such as lockdown browser, IP-flagging, and IP-based authentication. If you are interested in knowing how to secure your assessments, learn more about it here.

The other layer of security is on the candidate’s side. As an employer, you will have access to the candidate’s private information. How can you ensure that your candidate’s data is secure? That is reason enough to evaluate the software’s data protection and compliance guidelines.

Good pre-employment testing software should be compliant with regulations such as GDPR. The software should also be flexible enough to adapt to compliance guidelines from different parts of the world.

Questions you need to ask;

  • What mechanisms does the software employ to prevent cheating?
  • Is their remote proctoring function reliable and secure?
  • Are they compliant with security standards and regulations such as ISO or GDPR?
  • How does the software protect user data?


User experience

A good user experience is a must-have when you are sourcing any enterprise software. New-age pre-employment testing software should create user experience maps with both the candidates and the employer in mind. Some ways you can tell if a software offers a seamless user experience include;

  • User-friendly interface
  • Simple and easy to interact with
  • Easy to create and manage item banks
  • Clean dashboard with advanced analytics and visualizations

Customizing your user-experience maps to fit candidates’ expectations attracts high-quality talent.


Scalability and automation

With a single job post attracting approximately 250 candidates, scalability isn’t something you should overlook. A good pre-employment testing software should thus have the ability to handle any kind of workload, without sacrificing assessment quality.

It is also important you check the automation capabilities of the software. The hiring process has many repetitive tasks that can be automated with technologies such as Machine learning, Artificial Intelligence (AI), and robotic process automation (RPA).

Here are some questions you should consider in relation to scalability and automation;

  • Does the software offer Automated Item Generation (AIG)?
  • How many candidates can it handle?
  • Can it support candidates from different locations worldwide?

Reporting and analytics


A good pre-employment assessment software will not leave you hanging after helping you develop and deliver the tests. It will enable you to derive important insight from the assessments.

The analytics reports can then be used to make data-driven decisions on which candidate is suitable and how to improve candidate experience. Here are some queries to make on reporting and analytics;

  • Does the software have a good dashboard?
  • What format are reports generated in?
  • What are some key insights that prospects can gather from the analytics process?
  • How good are the visualizations?

Customer and Technical Support

Customer and technical support is not something you should overlook. Good pre-employment assessment software should have an omnichannel support system that is available 24/7, mainly because some situations need a fast response. Here are some of the questions you should ask when vetting customer and technical support;

  • What channels of support does the software offer?
  • How prompt is their support?
  • How good is their FAQ/resources page?
  • Do they offer multi-language support mediums?
  • Do they have dedicated managers to help you get the best out of your tests?


Conclusion

Finding the right HR assessment software is a lengthy process, yet profitable in the long run. We hope the article sheds some light on the important aspects to look for when looking for such tools. Also, don’t forget to take a pragmatic approach when implementing such tools into your hiring process.

Are you stuck on how you can use pre-employment testing tools to improve your hiring process? Feel free to contact us and we will guide you on the entire process, from concept development to implementation. Whether you need off-the-shelf tests or a comprehensive platform to build your own exams, we can provide the guidance you need.  We also offer free versions of our industry-leading software, FastTest and Assess.ai; visit our Contact Us page to get started!


AI remote proctoring has seen an incredible increase in usage during the COVID pandemic. ASC works with a very wide range of clients with an equally wide range of remote proctoring needs, and therefore we partner with a range of remote proctoring vendors that match our clients' needs; in some cases we bring on a new vendor at the request of a client.

At a broad level, we recommend live online proctoring for high-stakes exams such as medical certifications, and AI remote proctoring for medium-stakes exams such as pre-employment assessment. Suppose you have decided that you want to find a vendor for AI remote proctoring. How can you start evaluating solutions?

Well, it might be surprising, but even within the category of “AI remote proctoring” there can be substantial differences in the functionality provided, and you should therefore closely evaluate your needs and select the vendor that makes the most sense – which is the approach we at ASC have always taken.

How does AI remote proctoring work?

Because this approach is fully automated, we need to train machine learning models to look for things that we would generally consider to be a potential flag.  These have to be very specific!  Here are some examples:

  1. Two faces in the screen
  2. No faces in the screen
  3. Voices detected
  4. Black or white rectangle approximately 2-3 inches by 5 inches (phone is present)
  5. Face looking away or down
  6. White rectangle approximately 8 inches by 11 inches (notebook or extra paper is present)

These are continually monitored at each slice of time, perhaps twice per second.  The video frames or still images are evaluated with machine learning models such as support vector machines to determine the probability that each flag is happening, and then an overall cheating score or probability is calculated.  You can see this happening in real time at this very helpful video.
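To make the idea concrete, here is a minimal sketch of how frame-level flag probabilities might be aggregated into an overall session score. The flag names, weights, and aggregation rule are all illustrative assumptions, not any vendor's actual algorithm.

```python
# Hypothetical sketch: aggregating frame-level AI proctoring flags into an
# overall session score. Flag names and weights are illustrative only.

FLAG_WEIGHTS = {          # relative severity of each flag (assumed values)
    "two_faces": 1.0,
    "no_face": 0.8,
    "voices": 0.6,
    "phone_shape": 1.0,
    "gaze_away": 0.4,
    "paper_shape": 0.7,
}

def session_score(frames):
    """frames: list of dicts mapping flag name -> model probability (0-1),
    one dict per sampled video frame (e.g., two per second).
    Returns a 0-1 suspicion score: the mean of each frame's worst weighted flag."""
    if not frames:
        return 0.0
    per_frame = []
    for frame in frames:
        # weight each flag probability by its severity, keep the worst flag
        weighted = [p * FLAG_WEIGHTS.get(name, 0.5) for name, p in frame.items()]
        per_frame.append(max(weighted, default=0.0))
    return sum(per_frame) / len(per_frame)

frames = [
    {"gaze_away": 0.1, "voices": 0.0},
    {"two_faces": 0.9, "voices": 0.2},   # a second person appears
    {"gaze_away": 0.2},
]
print(round(session_score(frames), 3))   # 0.34
```

A real system would of course run the detection models themselves on the video frames; this only shows how individual flags could roll up into a single reviewable number.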

If you are a fan of the show Silicon Valley you might recall how a character built an app for recognizing food dishes with AI… and the first iteration was merely “hot dog” vs. “not hot dog.”  This was a nod towards how many applications of AI break problems down into simple chunks.

Some examples of differences in AI remote proctoring

   1. Video vs. Stills: Some vendors record full HD video of the student’s face with the webcam, while others are designed for lower-stakes exams, and only take a still shot every 10 seconds.  In some cases, the still-image approach is better, especially if you are delivering exams in low-bandwidth areas.

   2. Audio: Some vendors record audio, and flag it with AI.  Some do not record audio.

   3. Lockdown browser: Some vendors provide their own lockdown browser, such as Sumadi.  Some integrate with market leaders like Respondus or SafeExam.  Others do not provide this feature.

   4. Peripheral detection: Some vendors can detect certain hardware situations that might be an issue, such as a second monitor. Others do not.

   5. Second camera: Some vendors have the option to record a second camera; typically this is from the examinee’s phone, which is placed somewhere to see the room, since the computer webcam can usually only see their face.

   6. Screen recording: Some vendors record full video of the examinee’s screen as they take the exam. It can be argued that this increases security, but is often not required if there is lockdown browser. It can also be argued that this decreases security, because now images of all your items reside in someplace other than your assessment platform.

   7. Additional options: For example, Examus has a very powerful feature for Bring Your Own Proctors, where staff would be able to provide live online proctoring, enhanced with AI in real time. This would allow you to scale up the security for certain classifications that have high stakes but low volume, where live proctoring would be more appropriate. 

Want some help in finding a solution?

We can help you find the right solution and implement it quickly for a sound, secure digital transformation of your assessments. Contact solutions@assess.com to request a meeting with one of our assessment consultants. Or, sign up for a free account in our assessment platforms, FastTest and Assess.ai and see how our Rapid Assessment Development (RAD) can get you up and running in only a few days. Remember that the assessment platform itself also plays a large role in security: leveraging modern methods like computerized adaptive testing or linear on-the-fly testing will help a ton.

What are current topics and issues regarding AI remote proctoring?

ASC hosted a webinar in August 2022, specifically devoted to this topic.  You can view the video below.


Post-training assessment is an integral part of improving the performance and productivity of employees. To gauge the effectiveness of the training, assessments are the go-to solution for many businesses.  They ensure transfer and retention of the training knowledge, provide feedback to employees, and can be used for evaluations.  At the aggregate level, they help determine opportunities for improvement at the company.


Benefits Of Post-Training Assessments

Insight On Company Strengths and Weaknesses

Testing gives businesses and organizations insight into the positives and negatives of their training programs. For example, if an organization realizes that certain employees can't grasp certain concepts, it may decide to modify how those concepts are delivered or eliminate them completely. Employees can also work on their areas of weakness after the assessments, improving productivity.

Image: The Future Of Assessments: Tech. ed

Helps in Measuring Performance

Unlike traditional testing, where it is impossible to perform analytics on the assessments or measure performance with any fidelity, computer-based assessments can quantify initial goals such as call center skills. By measuring performance, businesses can create data-driven roadmaps for how their employees can achieve their best performance.

Advocate For Ideas and Concepts That Can Be Integrated Into The Real World

Workers learn every day, and sometimes what they learn is not used in driving the business towards attaining its objectives. This can lead to burnout and information overload in employees, which in turn lowers performance and work quality. By using post-training assessments, you can customize tests to help workers attain skills that are in alignment with your business goals. This can be done by using methods such as adaptive testing.


Other Benefits of Cloud-based testing include: 

  • The assessments can be taken from anywhere in the world
  • Saves the company a lot of time and resources
  • Improved security compared to traditional assessments
  • Improved accuracy and reliability. 
  • Scalability and flexibility
  • Increases skill and knowledge transfer


Tips To Improve Your Post-Training Assessments

1. Personalized Testing

Most businesses have an array of different training needs. Employees have different backgrounds and responsibilities, which makes it difficult to create effective generalized tests. To achieve the main objectives of your training, it is important to differentiate the assessments: sales assessments, technical assessments, and management assessments cannot all be the same. Even within the same department, there can be diversification in terms of skills and responsibilities. One way to achieve personalized testing is by using methods such as Computerized Adaptive Testing. Through the power of AI and machine learning, this method gives you the ability to create tests that are unique to each employee. Not only does personalized testing improve effectiveness in your workforce, but it is also cost-effective, secure, and in alignment with best psychometric practices in the corporate world. It is also important to keep in mind the components of effective assessments when creating personalized tests.

2. Analyzing Assessment Results

Many businesses don’t see the importance of analyzing training assessment results. How do you expect to improve your training programs and assessments if you don’t check the data?  This can tell you important things like where the students are weakest and perhaps need more instruction, or if some questions are wonky.


Example of Assessment analysis on Iteman


Analyze assessment results using psychometric analytics software such as Iteman to get important insights such as successful participants, item performance issues, and many others. This provides you with a blueprint to improve your assessments and employee training programs. 

3. Integrating Assessment Into Company Culture

Getting the best out of assessment is not about getting it right once, but getting it right over a long period of time. Integrating assessment into company culture is one great way to achieve this. This will make assessment part of the systems and employees will always look forward to improving their skills. You can also use strategies such as gamification to make sure that your employees enjoy the process. It is also critical to give the employees the freedom to provide feedback on the training programs. 

4. Diversify Your Assessment Types

One great myth about assessments is that they are limited in terms of questions and problems you can present to your employees. However, this is not true!

By using methods such as item banking, assessment systems are able to provide users with the ability to develop assessments using different question types. Some modern question types include:

  • Drag & drop 
  • Multiple correct 
  • Embedded audio or video
  • Cloze or fill in the blank
  • Number lines
  • Situational judgment test items
  • Counter or timer for performance tests

Diversification of question types improves comprehension in employees and helps them develop skills to approach problems from multiple angles. 

5. Choose Your Assessment Tools Carefully

This is among the most important considerations you should make when creating a workforce training assessment strategy, because software tools are at the core of how your assessment programs turn out.

There are many assessment tools available, but choosing one that meets your requirements can be a daunting task. Apart from the key considerations of budget, functionality, etc, there are many other factors to keep in mind before choosing online assessment tools. 

To help you choose an assessment tool that will help you in your assessment journey, here are a few things to consider:

Ease-of-use

Most people are new to assessments, and as powerful as some functionalities can be, they may be overwhelming to candidates and to the test development staff, which may cause candidates to underperform. It is, therefore, important to vet the platform and its functionalities to make sure that they are easy to use.

Functionality 

Training assessments are becoming popular, and new innovations appear every day. Does the assessment software have the latest innovations in the industry? Do you get value for your money? Does it support modern psychometrics like item response theory? These are just a few questions to ask when vetting a platform for functionality.

Assessment Reporting and Visualizations

One major advantage of online assessments over traditional ones is that they offer instant assessment reporting. You should therefore look for a platform that offers advanced reporting and visualizations on metrics such as performance, question strength, and many others.

Cheating precautions and Security

When it comes to assessments, there are two security concerns: how secure are the assessments, and how secure is the platform? For the tests themselves, the platform should provide anti-cheating precautions and technologies such as a lockdown browser. The vendor should also have measures in place to make sure that user data is secure.

Reliable Support System

This is one consideration that many businesses don’t keep in mind, and end up regretting in the long run. Which channels does the corporate training assessment platform use to provide its users with support? Do they have resources such as whitepapers and documentation in case you need them? How fast is their support?  These are questions you should ask before selecting a platform to take care of your assessment needs. 

Scalability

A good testing vendor should be able to provide you with resources should your needs go beyond expectation. This includes delivery volume – server scalability – but also being able to manage more item authors, more assessments, more examinees, and greater psychometric rigor.

Final Thoughts

Adopting effective post-training assessments can be a daunting task with a lot of forces at play, and we hope these tips will help you get the best out of your assessments.

Do you want to integrate smarter assessments into your corporate environment or any industry but feel overwhelmed by the process? Feel free to contact an experienced team of professionals to help you create an assessment strategy that helps you achieve your long-term goals and objectives.  

You can also sign up to get free access to our online assessment suite including 60 item types, IRT, adaptive testing, and so much more functionality!


We are proud to announce that ASC’s next-generation platform, Assess.ai, was recently announced as a Finalist for the 2021 EdTech Awards. This follows the successful launch of the landmark platform in 2020, where it quickly garnered attention as a finalist for the Innovation award at the 2020 conference of the Association of Test Publishers.

What is Assess.ai?

Assess.ai brings together best practices in assessment with modern approaches to automation, AI, and software design. It is built to make it easier to develop high-quality assessments with automated item review and automated item generation, and deliver with advanced psychometrics like adaptive and multistage testing.


An example of this approach is shown below; a modern UI/UX presents a visual platform for publishing MultiStage Tests based on information functions from item response theory (IRT). Users can easily publish tests with this modern AI-based approach, without having to write a single line of code.

Multistage testing


What are the EdTech Awards?

See the full list of finalists, and learn more about the awards, via the link below.

https://www.edtechdigest.com/2021-finalists-winners/

Item analysis is the statistical evaluation of test questions to ensure they are good quality, and to fix them if they are not.  This is a key step in the test development cycle; after items have been delivered to examinees (either as a pilot, or in full usage), we analyze the statistics to determine if there are issues which affect validity and reliability, such as being too difficult or biased.  This post will describe the basics of this process.  If you'd like further detail and instructions on using software, you can also check out our tutorial videos on our YouTube channel and download our free psychometric software.


Download a free copy of Iteman: Software for Item Analysis

What is Item Analysis?

Item analysis refers to the process of statistically analyzing assessment data to evaluate the quality and performance of your test items. This is an important step in the test development cycle, not only because it helps improve the quality of your test, but because it provides documentation for validity: evidence that your test performs well and score interpretations mean what you intend.  It is one of the most common applications of psychometrics, using item statistics to flag, diagnose, and fix the poorly performing items on a test.  Every item that is poorly performing is potentially hurting the examinees.

Item analysis boils down to two goals:

  1. Find the items that are not performing well (difficulty and discrimination, usually)
  2. Figure out WHY those items are not performing well, so we can determine whether to revise or retire them

There are different ways to evaluate performance, such as whether the item is too difficult/easy, too confusing (not discriminating), miskeyed, or perhaps even biased to a minority group.

Moreover, there are two completely different paradigms for this analysis: classical test theory (CTT) and item response theory (IRT). On top of that, the analyses can differ based on whether the item is dichotomous (right/wrong) or polytomous (2 or more points).

Because of these possible variations, item analysis is a complex topic. But that doesn't even get into the evaluation of test performance. In this post, we'll cover some of the basics for each theory, at the item level.

How to do Item Analysis

1. Prepare your data

Most psychometric software utilizes a person x item matrix.  That is, a data file where examinees are rows and items are columns.  Sometimes, it is a sparse matrix with a lot of missing data, as with linear on-the-fly testing.  You will also need to provide metadata to the software, such as your Item IDs, correct answers, item types, etc.  The format for this will differ by software.
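As a sketch of this format, here is how raw multiple-choice responses might be scored against a key to produce the 0/1 person x item matrix. The data layout and key are illustrative assumptions; real software will define its own input format.

```python
import numpy as np

# Illustrative sketch: scoring raw multiple-choice responses against a key
# to produce the 0/1 person x item matrix that item analysis software expects.
responses = np.array([        # rows = examinees, columns = items
    ["A", "C", "B"],
    ["A", "D", "B"],
    ["B", "C", "C"],
])
key = np.array(["A", "C", "B"])   # correct answer for each item

# Broadcasting compares every examinee row against the key at once
scored = (responses == key).astype(int)
print(scored.tolist())   # [[1, 1, 1], [1, 0, 1], [0, 1, 0]]
```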

2. Run data through item analysis software

To implement item analysis, you should utilize dedicated software designed for this purpose. If you utilize an online assessment platform, it will provide you output for item analysis, such as distractor P values and point-biserials (if not, it isn’t a real assessment platform).

CITAS output with histogram

In some cases, you might utilize standalone software. CITAS provides a simple spreadsheet-based approach to help you learn the basics, completely for free.  A screenshot of the CITAS output is here.  However, professionals will need a level above this.  Iteman and Xcalibre are two specially-designed software programs from ASC for this purpose, one for CTT and one for IRT.

3. Interpret results of item analysis

Item analysis software will produce tables of numbers.  Sometimes, these will be ugly ASCII-style tables from the 1980s.  Sometimes, they will be beautiful Word docs with graphs and explanations.  Either way, you need to interpret the statistics to determine which items have problems and how to fix them.  The rest of this article will delve into that.

Item Analysis with Classical Test Theory

Classical Test Theory provides a simple and intuitive approach to item analysis. It utilizes nothing more complicated than proportions, averages, counts, and correlations. For this reason, it is useful for small-scale exams or use with groups that do not have psychometric expertise.

Item Difficulty: Dichotomous

CTT quantifies item difficulty for dichotomous items as the proportion (P value) of examinees that correctly answer it.

It ranges from 0.0 to 1.0. A high value means that the item is easy, and a low value means that the item is difficult.  There are no hard and fast rules, because interpretation can vary widely for different situations.  For example, a test given at the beginning of the school year would be expected to have low statistics since the students have not yet been taught the material.  On the other hand, a professional certification exam, where someone cannot even sit unless they have 3 years of experience and a relevant degree, might have all items appear easy even though they cover quite advanced topics!  Here are some general guidelines:

    0.95-1.0 = Too easy (not doing much good to differentiate examinees, which is really the purpose of assessment)

    0.60-0.95 = Typical

    0.40-0.60 = Hard

    <0.40 = Too hard (consider that a 4 option multiple choice has a 25% chance of pure guessing)

With Iteman, you can set bounds to automatically flag items.  The minimum P value bound represents what you consider the cut point for an item being too difficult. For a relatively easy test, you might specify 0.50 as a minimum, which means that 50% of the examinees have answered the item correctly.

For a test where we expect examinees to perform poorly, the minimum might be lowered to 0.4 or even 0.3. The minimum should take into account the possibility of guessing; if the item is multiple-choice with four options, there is a 25% chance of randomly guessing the answer, so the minimum should probably not be 0.20.  The maximum P value represents the cut point for what you consider to be an item that is too easy. The primary consideration here is that if an item is so easy that nearly everyone gets it correct, it is not providing much information about the examinees.  In fact, items with a P of 0.95 or higher typically have very poor point-biserial correlations.

Note that because the scale is inverted (lower value means higher difficulty), this is sometimes referred to as item facility.
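The P value computation and flagging bounds described above can be sketched in a few lines. The bounds used here are the illustrative guidelines from this section, not universal rules.

```python
import numpy as np

# Sketch: classical item difficulty (P value) with flagging bounds.
# Bounds are illustrative; set them for your own test's purpose.
scored = np.array([   # rows = examinees, columns = items (1 = correct)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
])

p_values = scored.mean(axis=0)   # proportion correct per item: 0.8, 0.6, 0.2, 1.0
too_hard = p_values < 0.40       # item 3 flagged: below the guessing floor
too_easy = p_values > 0.95       # item 4 flagged: provides little information

print(p_values, too_hard, too_easy)
```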

The Item Mean (Polytomous)

This refers to an item that is scored with 2 or more point levels, like an essay scored on a 0-4 point rubric, or a Likert-type item that asks "Rate on a scale of 1 to 5":

  • 1=Strongly Disagree
  • 2=Disagree
  • 3=Neutral
  • 4=Agree
  • 5=Strongly Agree

The item mean is the average of the item responses converted to numeric values across all examinees. The range of the item mean is dependent on the number of categories and whether the item responses begin at 0. The interpretation of the item mean depends on the type of item (rating scale or partial credit). A good rating scale item will have an item mean close to ½ of the maximum, as this means that on average, examinees are not endorsing categories near the extremes of the continuum.

You will have to adjust for your own situation, but here is an example for the 5-point Likert-style item.

    1-2 is very low; people disagree fairly strongly on average

    2-3 is low to neutral; people tend to disagree on average

    3-4 is neutral to high; people tend to agree on average

    4-5 is very high; people agree fairly strongly on average

Iteman also provides flagging bounds for this statistic.  The minimum item mean bound represents what you consider the cut point for the item mean being too low.  The maximum item mean bound represents what you consider the cut point for the item mean being too high.

The number of categories for the items must be considered when setting the bounds of the minimum/maximum values. This is important as all items of a certain type (e.g., 3-category) might be flagged.
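A minimal sketch of computing item means and applying flagging bounds, assuming a 1-5 Likert scale; the bounds are illustrative and must be adjusted to your number of categories.

```python
import numpy as np

# Sketch: item mean for polytomous (e.g., 5-point Likert) items, with
# illustrative flagging bounds scaled to the number of categories.
ratings = np.array([   # rows = respondents, columns = items, values 1-5
    [4, 1, 3],
    [5, 2, 3],
    [4, 1, 4],
    [5, 1, 3],
])

item_means = ratings.mean(axis=0)        # 4.5, 1.25, 3.25
low_bound, high_bound = 1.5, 4.5         # assumed cut points for a 1-5 scale
flag_low = item_means < low_bound        # nearly everyone at the bottom
flag_high = item_means > high_bound      # nearly everyone at the top

print(item_means, flag_low, flag_high)
```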

Item Discrimination: Dichotomous

In psychometrics, discrimination is a GOOD THING, even though the word often has a negative connotation in general. The entire point of an exam is to discriminate amongst examinees; smart students should get a high score and not-so-smart students should get a low score. If everyone gets the same score, there is no discrimination and no point in the exam! Item discrimination evaluates this concept.

CTT uses the point-biserial item-total correlation (Rpbis) as its primary statistic for this.

The Pearson point-biserial correlation (Rpbis) is a measure of the discrimination, or differentiating strength, of the item. It ranges from −1.0 to 1.0 and is a correlation of item scores and total raw scores.  If you consider a scored data matrix (multiple-choice items converted to 0/1 data), this would be the correlation between the item column and a column that is the sum of all item columns for each row (a person's score).

A good item is able to differentiate between examinees of high and low ability, and will have a higher point-biserial, though rarely above 0.50. A negative point-biserial is indicative of a very poor item, because it means that the high-ability examinees are answering incorrectly while the low-ability examinees are answering correctly; this of course would be bizarre, and therefore typically indicates that the specified correct answer is actually wrong. A point-biserial of 0.0 provides no differentiation between low-scoring and high-scoring examinees, essentially random "noise."  Here are some general guidelines on interpretation.  Note that these assume a decent sample size; if you only have a small number of examinees, many item statistics will be flagged!

    0.20+ = Good item; smarter examinees tend to get the item correct

    0.10-0.20 = OK item; but probably review it

    0.0-0.10 = Marginal item quality; should probably be revised or replaced

    <0.0 = Terrible item; replace it

A major red flag is when the correct answer has a negative Rpbis and a distractor has a positive Rpbis.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. This is typically a small positive number, like 0.10 or 0.20. If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the Rpbis be as high as possible.
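Here is a sketch of the point-biserial as described above: the correlation between each item column and the column of total scores. The data and flagging bound are illustrative.

```python
import numpy as np

# Sketch: point-biserial discrimination as the correlation between each
# 0/1 item column and the total score (sum across items).
scored = np.array([   # rows = examinees, columns = items
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
])
totals = scored.sum(axis=1)   # each person's raw score

rpbis = np.array([np.corrcoef(scored[:, j], totals)[0, 1]
                  for j in range(scored.shape[1])])
flagged = rpbis < 0.10        # illustrative minimum discrimination bound
print(np.round(rpbis, 2), flagged)
```

In practice many programs report a "corrected" version that removes the item itself from the total before correlating, which avoids inflating the statistic on short tests.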

The biserial correlation is also a measure of the discrimination, or differentiating strength, of the item. It ranges from −1.0 to 1.0. The biserial correlation is computed between the item and total score as if the item were a continuous measure of the trait. Since the biserial is an estimate of Pearson's r, it will be larger in absolute magnitude than the corresponding point-biserial.

The biserial makes the stricter assumption that the score distribution is normal. The biserial correlation is not recommended for traits where the score distribution is known to be non-normal (e.g., pathology).

Item Discrimination: Polytomous

The Pearson’s r correlation is the product-moment correlation between the item responses (as numeric values) and total score. It ranges from −1.0 to 1.0. The r correlation indexes the linear relationship between item score and total score and assumes that the item responses for an item form a continuous variable. The r correlation and the Rpbis are equivalent for a 2-category item, so guidelines for interpretation remain unchanged.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. Since the typical r correlation (0.5) will be larger than the typical Rpbis (0.3) correlation, you may wish to set the lower bound higher for a test with polytomous items (0.2 to 0.3). If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the Rpbis be as high as possible.

The eta coefficient is an additional index of discrimination computed using an analysis of variance with the item response as the independent variable and total score as the dependent variable. The eta coefficient is the ratio of the between-groups sum of squares to the total sum of squares and has a range of 0 to 1. The eta coefficient does not assume that the item responses are continuous and also does not assume a linear relationship between the item response and total score.

As a result, the eta coefficient will always be equal to or greater than Pearson's r. Note that the biserial correlation will be reported if the item has only 2 categories.
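The eta coefficient described above can be sketched with a small one-way ANOVA, grouping total scores by item response category. The data are illustrative, and eta is taken here as the square root of the sum-of-squares ratio, which is the conventional definition.

```python
import numpy as np

# Sketch: eta coefficient via one-way ANOVA, grouping total scores by the
# item response category. Computed as sqrt(SS_between / SS_total).
def eta(item_responses, totals):
    totals = np.asarray(totals, dtype=float)
    grand_mean = totals.mean()
    ss_total = ((totals - grand_mean) ** 2).sum()
    ss_between = 0.0
    for cat in np.unique(item_responses):
        group = totals[item_responses == cat]
        ss_between += len(group) * (group.mean() - grand_mean) ** 2
    return np.sqrt(ss_between / ss_total)

item = np.array([1, 2, 2, 3, 3, 3])          # polytomous item responses
totals = np.array([10, 12, 11, 15, 14, 16])  # total test scores
print(round(eta(item, totals), 3))           # 0.954
```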

Key and Distractor Analysis

In the case of many item types, it pays to evaluate the answers. A distractor is an incorrect option. We want to make sure that more examinees are not selecting a distractor than the key (P value) and also that no distractor has higher discrimination. The latter would mean that smart students are selecting the wrong answer, and not-so-smart students are selecting what is supposedly correct. In some cases, the item is just bad. In others, the answer is just incorrectly recorded, perhaps by a typo. We call this a miskey of the item. In both cases, we want to flag the item and then dig into the distractor statistics to figure out what is wrong.
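A sketch of key and distractor analysis: for each option, compute the proportion selecting it and its point-biserial with total score, then look for distractors that beat the key. The responses and key are illustrative.

```python
import numpy as np

# Sketch: distractor analysis for one multiple-choice item. A distractor
# with a higher P than the key, or a positive Rpbis while the key is
# negative, suggests a bad item or a miskey.
responses = np.array(["A", "B", "A", "A", "C", "B", "A", "D", "A", "B"])
totals    = np.array([  9,   4,   8,   7,   3,   5,   9,   2,   6,   4])
key = "A"

for option in ["A", "B", "C", "D"]:
    chose = (responses == option).astype(int)   # 1 if examinee chose option
    p = chose.mean()                            # proportion selecting it
    rpbis = np.corrcoef(chose, totals)[0, 1]    # correlation with total score
    label = "KEY" if option == key else "distractor"
    print(f"{option} ({label}): P={p:.2f}, Rpbis={rpbis:+.2f}")
```

Here the key draws the high scorers (positive Rpbis) while every distractor draws low scorers (negative Rpbis), which is the healthy pattern.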

Iteman Psychometric Item Analysis

Example

Here is an example output for one item from our Iteman software, which you can download for free. You might also be interested in this video.  This is a very well-performing item.  Here are some key takeaways.

  • This is a 4-option multiple choice item
  • It was on a subscore named “Example subscore”
  • This item was seen by 736 examinees
  • 70% of students answered it correctly, so it was fairly easy, but not too easy
  • The Rpbis was 0.53 which is extremely high; the item is good quality
  • The line for the correct answer in the quantile plot has a clear positive slope, which reflects the high discrimination quality
  • The proportion of examinees selecting the wrong answers was nicely distributed, not too high, and with negative Rpbis values. This means the distractors are sufficiently incorrect and not confusing.

Item Analysis with Item Response Theory

Item Response Theory (IRT) is a very sophisticated paradigm of item analysis and tackles numerous psychometric tasks, from item analysis to equating to adaptive testing. It requires much larger sample sizes than CTT (100-1000 responses per item) and extensive expertise (typically a PhD psychometrician). It isn’t suitable for small-scale exams like classroom quizzes.

However, it is used by virtually every “real” exam you will take in your life, from K-12 benchmark exams to university admissions to professional certifications.

If you haven’t used IRT, I recommend you check out this blog post first.

Item Difficulty

IRT evaluates item difficulty for dichotomous items as a b-parameter, which is sort of like a z-score for the item on the bell curve: 0.0 is average, 2.0 is hard, and -2.0 is easy. (This can differ somewhat with the Rasch approach, which rescales everything.) In the case of polytomous items, there is a b-parameter for each threshold, or step between points.

Item Discrimination

IRT evaluates item discrimination by the slope of the item response function, which is called the a-parameter. As a rough rule of thumb, values above 0.80 indicate an effective item, while lower values indicate weaker discrimination.
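For dichotomous items, the a- and b-parameters work together in the item response function. Under the common two-parameter logistic (2PL) model, the probability of a correct response is 1 / (1 + exp(-a(theta - b))). A minimal sketch, with invented parameter values for illustration:

```python
import math

def p_correct(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The b-parameter locates the item on the ability scale: at theta == b,
# the probability of a correct response is exactly 0.50.
# The a-parameter is the slope at that point: a higher a means the item
# separates examinees near b more sharply.
```

Some formulations multiply the exponent by a scaling constant of 1.7 to approximate the normal ogive; the logic is the same either way.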

Key and Distractor Analysis


In the case of polytomous items, the multiple b-parameters provide an evaluation of the different answer levels. For dichotomous items, the IRT model does not distinguish among the incorrect answers, so we use the CTT approach for distractor analysis. This remains extremely important for diagnosing issues in multiple choice items.

Example

Here is an example of what output from an IRT analysis program (Xcalibre) looks like. You might also be interested in this video.

  • Here, we have a polytomous item, such as an essay scored from 0 to 3 points.
  • It is calibrated with the generalized partial credit model.
  • It has strong classical discrimination (0.62)
  • It has poor IRT discrimination (0.466)
  • The average raw score was 2.314 out of 3.0, so fairly easy
  • There was a sufficient distribution of responses over the four point levels
  • The boundary parameters are not in sequence; this item should be reviewed
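The last point can be checked mechanically. Here is a tiny sketch (invented function name) that flags boundary parameters that are not in increasing sequence:

```python
def thresholds_ordered(b_params):
    """True if each boundary parameter exceeds the previous one."""
    return all(b2 > b1 for b1, b2 in zip(b_params, b_params[1:]))

# e.g., boundary parameters for an item scored 0 to 3 points
thresholds_ordered([-1.2, 0.1, 0.9])   # in sequence
thresholds_ordered([-0.4, -1.1, 0.8])  # reversal: review the item
```

A reversal often means one of the middle score points is rarely the most likely response at any ability level, which is worth a content review of the scoring rubric.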

 

Summary

This article is a very broad overview and does not do justice to the complexity of psychometrics and the art of diagnosing/revising items!  I recommend that you download some of the item analysis software and start exploring your own data.

For additional reading, I recommend some of the common textbooks.  For more on how to write/revise items, check out Haladyna (2004) and subsequent works.  For item response theory, I highly recommend Embretson & Reise (2000).


Do you conduct adaptive testing research? Perhaps a thesis or dissertation? Or maybe you have developed adaptive tests and have a technical report or validity study? I encourage you to check out the Journal of Computerized Adaptive Testing as a publication outlet for your adaptive testing research. JCAT is the official journal of the International Association for Computerized Adaptive Testing (IACAT), a nonprofit organization dedicated to improving the science of assessments.

JCAT has an absolutely stellar board of editors and was founded to focus on improving the dissemination of research in adaptive testing. The IACAT website also contains a comprehensive bibliography of research in adaptive testing, across all journals and tech reports, for the past 50 years.  IACAT was founded at the 2009 conference on computerized adaptive testing and has since held conferences every other year as well as hosting the JCAT journal.

Potential research topics at the JCAT journal

Here are some of the potential research topics:


  • Item selection algorithms
  • Item exposure algorithms
  • Termination criteria
  • Cognitive diagnostic models
  • Simulation studies
  • Validation studies
  • Item response theory models
  • Multistage testing
  • Use of adaptive testing in new(er) situations, like patient-reported outcomes
  • Design of actual adaptive assessments and their release into the wild

If you are not involved in CAT research but are interested, please visit the IACAT and journal websites to read the articles; access is free.  JCAT would also appreciate it if you would share this information with colleagues so that they might consider it as a publication outlet.