a: The item response theory index of item discrimination, analogous to the point-biserial and biserial correlations in classical test theory. It reflects the slope of the item response function. Often ranging from 0.1 to 2.0 in practice, a higher value indicates a better-performing item.
Accreditation: Accreditation by an outside agency affirms that an organization has met a certain level of standards. Certification testing programs may become accredited by meeting specified standards in test development, psychometrics, bylaws, management, etc.
b: The item response theory index of item difficulty or location, analogous to the P-value (P+) of classical test theory. Typically ranging from -3.0 to 3.0 in practice, a higher value indicates a more difficult item.
Biserial Correlation: A classical index of item discrimination, highly similar to the more commonly used point-biserial. The biserial correlation assumes that the item scores and test scores reflect an underlying normal distribution, which is not always the case.
Blueprint: A test blueprint, or test specification, details how an exam is to be constructed. It includes important information, such as the total number of items, the number of items in each content area or domain, the number of items that are recall vs. reasoning, and the item formats to be utilized.
c: The item response theory pseudoguessing parameter, representing the lower asymptote of the item response function. It is theoretically near the value of 1/k, where k is the number of alternatives. For example, with the typical four-option multiple choice item, a candidate has a base chance of 25% of guessing the correct answer.
Certification: A non-mandatory testing program which certifies that candidates have achieved a minimum standard or knowledge or performance.
Classical Test Theory (CTT). A psychometric analysis and test development paradigm based on correlations, proportions, and other statistics that are relatively simple compared to IRT. It is therefore more appropriate for smaller samples, especially for fewer than 100.
Classification: The use of tests for classifying candidates into categories, such as pass/fail, nonmaster/master, or basic/proficient/advanced.
Computerized Adaptive Testing (CAT): A dynamic method of test administration where items are selected one at a time to match item difficulty and candidate ability as closely as possible. This helps prevent candidates being presented with items that are too difficult or too easy for them, which has multiple benefits. Often, the test only takes half as many items to obtain a similar level of accuracy to form-based tests. This reduces the testing time per examinee and also reduces the total number of times an item is exposed, as well as increasing security by the fact that nearly every candidate will receive a different set of items.
Computerized Classification Testing (CCT): An approach similar to CAT, but with different algorithms to reflect the fact that the purpose of the test is only to make a broad classification and not obtain a highly accurate point estimate of ability.
Cutscore. Also known as a passing score, the cutscore is the score that a candidate must achieve to obtain a certain classification, such as “pass” on a licensure or certification exam.
Criterion-Referenced: A test score (not a test) is criterion-referenced if it is interpreted with regard to a specified criterion and not compared to scores of other candidates. For instance, providing the number-correct score does not relate any information regarding a candidate’s relative standing.
Distractors: Distractors are the incorrect options of a multiple-choice item. A distractor analysis is an important part of psychometric review, as it helps determine if one is acting as a keyed response.
Equating: The process of determining comparable scores on different forms of an examination. For example, if Form A is more difficult than Form B, it might be desirable to adjust scores on Form A upward for the purposes of comparing them to scores on Form B. Usually, this is done statistically based on items that are on both forms, which are called equater, anchor, or common items. Since the groups who took the two forms are different, this is called a common items non-equivalent groups design.
Form: A specific set of items that are administered together for a test. For example, if a test included a certain set of 100 items this year, and a different set of 100 items next year, these would be two distinct forms.
Item: The basic component of a test, often colloquially referred to as a “question,” but items are not necessary phrased as a question. They can be as varied as true/false statements, rating scales, and performance task simulations, in addition to the ubiquitous multiple-choice item.
Item Bank: A repository of items for a testing program, including items at all stages, such as newly written, reviewed, pretested, active, and retired.
Item Banker: A specialized software program that facilitates the maintenance and growth of an item bank by recording item stages, statistics, notes, and other characteristics.
Item Difficulty: A statistical index of how easy/hard the item is with respect to the underlying ability/trait. That is, an item is difficult if not many people get it correct, or respond in the keyed direction. More on that here.
Item Discrimination: A statistical index of the quality of the item, assessing how well it differentiates examinees of high vs. low ability. Items with low discrimination are considered poor quality and are candidates to be revised or retired.
Item Response Theory (IRT): A comprehensive approach to psychometric analysis and test development that utilizes complex mathematical models. This provides several benefits, including the ability to design CATs, but requires larger sample sizes. A common rule of thumb is 100 candidates for the one-parameter model and 500 for the three-parameter model. Learn how to start applying it here.
Job Analysis: Also known as practice analysis or role delineation study, job analysis is a formal study used to determine the structure of a job and the KSAs important to success or competence. This is then used to establish the test blueprint for a professional testing program, a critical step in the chain of evidence for validity.
Key: The key is the correct response to an item.
KSA: KSA is an acronym for Knowledge, Skills, and Abilities. A critical step in testing for employment or professional credentials is to determine the KSAs that are important in a job. This is often done via a job analysis study.
Licensure: A testing program mandated by a government body. The test must be passed in order to perform the task in question, whether it is to work in the profession or drive a car.
Norm-Referenced: A test score (not a test) is norm-referenced if it is interpreted with regard to the performance of other candidates. Percentile rank is an example of this, because it does not provide any information regarding how many items the candidate got correct.
P-value (or P+): A classical index of item difficulty, presented as the proportion of candidates who correctly responded to the item. A value above 0.90 indicates an easy item, while a value below 0.50 indicates a relatively difficult item. Note that it is inverted; a higher value indicatesless difficulty.
Point-Biserial Correlation: A classical index of item discrimination, calculated as the Pearson correlation between the item score and the total test score. If below 0.0, low-scoring candidates are actually doing better than high-scoring candidates, and the item should be revised or retired. Low positive values are marginal, higher positive values are ideal.
Pretest (or Pilot) Item: An item that is administered to candidates simply for the purposes of obtaining data for future psychometric analysis. The results on this item are not included in the score. It is often prudent to include a small number of pretest items in a test.
Reliability: A measure of the repeatability or consistency of the measurement process. Often, this is indexed by a single number, most commonly the internal consistency index coefficient alpha or its dichotomous formulation, KR-20. Under most conditions, these range from 0.0 to 1.0, with 1.0 being perfectly reliable measurement. However, just because a test is reliable does not mean that it is valid, i.e., measures what it is supposed to measure.
Scaling: Scaling is a process of converting scores obtained on an exam to an arbitrary scale. This is done so that all the forms and exams used by a testing organization are on a common scale. For example, suppose an organization had two testing programs, one with 50 items and one with 150 items. All scores could be put on the same scale to standardize score reporting.
Standard-Setting Study: A formal study conducted by a testing organization to determine standards for a testing program, which are manifested as a cutscore. Common methods include the Angoff, Bookmark, Contrasting Groups, and Borderline Survey methods.
Subject Matter Expert (SME): An extremely vital person in the test development process. SMEs are necessary to write items, review items, participate in standard-setting studies and job analyses, and oversee the testing program to ensure its fidelity to its true intent.
Validity: Validity is the concept that test scores can be interpreted as intended. For example, a test for certification in a profession should reflect basic knowledge of that profession, and not intelligence or other constructs, and scores can therefore be interpreted as evidencing professional competence. Validity must be formally established and maintained by empirical studies as well as sound psychometric and test development practices.