The Item Analysis output consists of four parts: a summary of test statistics, a test frequency distribution, an item quintile table, and item statistics. The analysis can be processed for an entire class or, when a comparison across test forms is of interest, separately by test form. Measurement and Evaluation staff are available to help instructors interpret their item analysis data.
Part I: Summary of test statistics
Part I of the Item Analysis consists of a Summary of Test Statistics (pdf).
Part II: Test frequency distribution
Part II of the Item Analysis program displays a test frequency distribution. The raw scores are ordered from high to low with corresponding statistics (a short computational sketch appears below):
- Standard score: A linear transformation of the raw score that sets the mean equal to 500 and the standard deviation equal to 100. In normal score distributions for classes of 500 students or more, the standard score range usually falls within three standard deviations of the mean, i.e., between 200 and 800; for classes with fewer than 30 students, the range usually falls within two standard deviations of the mean, i.e., between 300 and 700.
- Percentile rank: The percentage of individuals who received a lower score plus half the percentage of individuals who received the given score. This measure indicates a person's relative position within a group.
- Percent: The percentage of individuals in the total group who received the given score.
- Frequency: In a test analysis, the number of individuals who receive a given score.
- Cumulative frequency: In a test analysis, the number of individuals who score at or below a given score value.
Download sample Test Frequency Distribution (pdf)
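To make these definitions concrete, the sketch below (Python, with small made-up scores; an illustration, not part of the MERMAC package) computes each of the statistics above for every raw score in a class:

```python
from collections import Counter
import statistics

# Hypothetical raw scores for illustration only (not MERMAC data).
raw_scores = [77, 70, 70, 64, 64, 64, 56, 56, 55, 24]

n = len(raw_scores)
mean = statistics.mean(raw_scores)
sd = statistics.pstdev(raw_scores)      # standard deviation of the class
freq = Counter(raw_scores)              # frequency of each raw score

for score in sorted(freq, reverse=True):                 # report from high to low
    count = freq[score]
    below = sum(c for s, c in freq.items() if s < score)

    standard_score = 500 + 100 * (score - mean) / sd     # mean 500, SD 100
    percentile_rank = 100 * (below + 0.5 * count) / n    # % below + half % at score
    percent = 100 * count / n                            # % of group at this score
    cumulative_freq = below + count                      # scores at or below

    print(f"score {score:3d}  std {standard_score:6.1f}  PR {percentile_rank:5.1f}  "
          f"pct {percent:5.1f}  f {count}  cf {cumulative_freq}")
```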
Part III: Item difficulty and discrimination: Quintile table
Part III of the Item Analysis output, a score quintile table, aids in the interpretation of Part IV, which compares the item responses with the total score distribution for each item. A good item discriminates between students who scored high and students who scored low on the examination as a whole. To compare different performance levels on the examination, the score distribution is divided into fifths, or quintiles. The first fifth includes students who scored between the 81st and 100th percentiles; the second fifth includes students who scored between the 61st and 80th percentiles; and so forth. When the score distribution is skewed, more than one-fifth of the students may have scores within a given quintile and, as a result, fewer than one-fifth of the students may score within another. The table indicates the sample size, the proportion of the distribution, and the score range within each fifth.
* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *
THE QUINTILE GRAPH AND MATRIX OF RESPONSES APPEARING WITH EACH ITEM ARE BASED ON THE STATISTICS INDICATED IN THE TABLE BELOW:

| QUINTILE | SAMPLE SIZE | PROPORTION | SCORE RANGE |
|----------|-------------|------------|-------------|
| 1ST      | 128         | 0.21       | 77 - 92     |
| 2ND      | 127         | 0.21       | 70 - 76     |
| 3RD      | 121         | 0.20       | 64 - 69     |
| 4TH      | 121         | 0.20       | 56 - 63     |
| 5TH      | 106         | 0.18       | 24 - 55     |
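As an illustration of how such a table can be built (a hypothetical Python sketch, not the MERMAC implementation), the class is split at the 80th, 60th, 40th, and 20th percentiles and each fifth is summarized by its sample size, proportion, and score range. Because tied scores stay in the same fifth, the proportions can deviate from exactly 0.20, as in the table above.

```python
import random

# Hypothetical score distribution for illustration (603 students, scores 24-92).
random.seed(1)
scores = [random.choice(range(24, 93)) for _ in range(603)]
n = len(scores)

ordered = sorted(scores)
# Score values at the 80th, 60th, 40th, and 20th percentile ranks; ties stay
# in the same fifth, so each fifth need not hold exactly 20% of the class.
cuts = [ordered[int(n * q)] for q in (0.8, 0.6, 0.4, 0.2)]

fifths = {"1ST": [], "2ND": [], "3RD": [], "4TH": [], "5TH": []}
for s in scores:
    if s >= cuts[0]:
        fifths["1ST"].append(s)
    elif s >= cuts[1]:
        fifths["2ND"].append(s)
    elif s >= cuts[2]:
        fifths["3RD"].append(s)
    elif s >= cuts[3]:
        fifths["4TH"].append(s)
    else:
        fifths["5TH"].append(s)

print("QUINTILE  SAMPLE SIZE  PROPORTION  SCORE RANGE")
for label, group in fifths.items():
    print(f"{label:<10}{len(group):<13d}{len(group) / n:<12.2f}"
          f"{min(group)} - {max(group)}")
```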
Part IV: Interpreting item statistics
Download MERMAC Test Analysis & Questionnaire Package (pdf)
Part IV of the Item Analysis output presents item statistics that can help determine which items are good and which need improvement or deletion from the examination. The quintile graph on the left side of the output indicates the percent of students within each fifth who answered the item correctly. A good, discriminating item is one in which students who scored well on the examination chose the correct alternative more frequently than students who did not score well. Therefore, the graph should form a line rising from the bottom left-hand corner to the top right-hand corner. Item 1 in the sample output shows an example of this type of positive linear relationship. Item 2 in the sample output also portrays a discriminating item; although few students answered the item correctly, the students in the first fifth answered it correctly more frequently than the students in the rest of the score distribution. Item 3 is a poor item: the graph shows no relationship between the fifths of the score distribution and the percentage of correct responses. It is likely, however, that the instructor miskeyed this item; note the response pattern for alternative B.
A. Evaluating Item Distractors: Matrix of Responses
On the right-hand side of the output, a matrix of responses by fifths shows the frequency of students within each fifth who chose each alternative or omitted the item. This information can help point out which distractors, or incorrect alternatives, are not successful, either because (a) they are not plausible answers and few or no students chose them (see alternatives D and E, item 2), or because (b) too many students, especially students in the top fifths of the distribution, chose the incorrect alternative instead of the correct response (see alternative B, item 3). A good item will result in students in the top fifths choosing the correct response more frequently than students in the lower fifths, and students in the lower fifths choosing the incorrect alternatives more frequently than students in the top fifths. The matrix of responses prints the correct response for the item on the right-hand side and encloses the correct response in the matrix in parentheses.
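For illustration, the sketch below (hypothetical Python with made-up data; not the MERMAC code) builds such a matrix as a cross-tabulation of fifth against chosen alternative, with the keyed answer shown in parentheses:

```python
from collections import Counter

# Hypothetical data: (fifth, alternative chosen) for each student; 'C' is
# assumed to be the keyed answer here, and 'OMIT' marks an omitted item.
responses = [
    (1, "C"), (1, "C"), (1, "B"), (2, "C"), (2, "A"),
    (3, "C"), (3, "B"), (4, "B"), (4, "OMIT"), (5, "A"),
]
key = "C"

matrix = Counter(responses)                       # (fifth, alternative) -> count
alternatives = ["A", "B", "C", "D", "E", "OMIT"]

print("FIFTH " + " ".join(f"{a:>5}" for a in alternatives))
for fifth in range(1, 6):
    cells = []
    for alt in alternatives:
        count = matrix[(fifth, alt)]
        # The keyed (correct) response is enclosed in parentheses.
        cells.append(f"({count:3d})" if alt == key else f"{count:5d}")
    print(f"{fifth:5d} " + " ".join(cells))
```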
B. Item Difficulty: The PROP Statistic
The proportion (PROP) of students who chose each alternative or omitted the item is printed in the first row below the matrix. The item difficulty is the proportion of examinees in a sample who answer the item correctly. In order to obtain maximum spread of student scores, it is best to use items of moderate difficulty. Moderate difficulty can be defined as the point halfway between a chance score and a perfect score. For a five-choice item, the moderate difficulty level is .60, or a range of about .50 to .70 (because 100% correct is a perfect score and about 20% of the group would be expected to answer the item correctly by blind guessing).
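As a quick check on that arithmetic (an illustrative sketch, not part of the output), the moderate difficulty level for a k-choice item is the midpoint between the chance score 1/k and a perfect score of 1.0:

```python
def moderate_difficulty(num_choices: int) -> float:
    """Midpoint between the chance score (1/k) and a perfect score (1.0)."""
    chance = 1 / num_choices
    return (chance + 1.0) / 2

print(moderate_difficulty(5))   # 0.6   five-choice item, as in the text
print(moderate_difficulty(4))   # 0.625 four-choice item
print(moderate_difficulty(2))   # 0.75  true/false item
```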
Evaluating item difficulty. For the most part, items that are too easy or too difficult cannot discriminate adequately between student performance levels. Item 2 in the sample output is an exception; although the item difficulty is .23, the item is a good, discriminating one. In item 4, every student answered the item correctly, so the item difficulty is 1.00. Such an item does not discriminate at all between students and therefore does not contribute statistically to the effectiveness of the examination. However, if one of the instructor's goals is to verify that all students grasp certain basic concepts, and if the examination is long enough to contain a sufficient number of discriminating items, then such an item may remain on the examination.
C. Item Discrimination: Point Biserial Correlation (RPBI)
Interpreting the RPBI statistic. The point biserial correlation (RPBI) for each alternative and for omits is printed below the PROP row. It indicates the relationship between the item response and the total test score within the group tested, i.e., it measures the discriminating power of an item, and it is interpreted similarly to other correlation coefficients. Assuming that the total test score accurately discriminates among individuals in the group tested, high positive RPBIs for the correct responses represent the most discriminating items. That is, students who chose the correct response scored well on the examination, whereas students who did not choose the correct response did not score well. It is also informative to check the RPBIs for the item distractors, or incorrect alternatives. The opposite correlation between total score and choice of alternative is expected for an incorrect versus a correct alternative: where a high positive RPBI is desired for a correct alternative, a high negative RPBI is good for a distractor, i.e., students who chose an incorrect alternative did not score well on the total examination. Because of restrictions incurred when correlating a continuous variable (total examination score) with a dichotomous variable (choosing versus not choosing an alternative), the highest possible RPBI is .80 rather than the usual maximum value of 1.00 for a correlation. This maximum RPBI is directly influenced by the item difficulty level. The maximum value of .80 occurs with items of moderate difficulty; the further the difficulty level deviates from moderate in either direction, the lower the ceiling on the RPBI. For example, the maximum RPBI is about .58 for difficulty levels of .10 or .90. Therefore, to maximize item discrimination, items of moderate difficulty are preferred, although easy and difficult items can still be discriminating (see item 2 in the sample output).
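For reference, below is a minimal sketch of the standard point biserial computation (illustrative Python; not necessarily the exact routine MERMAC uses). Each alternative is scored 1 for students who chose it and 0 otherwise, and that dichotomous variable is correlated with total score.

```python
import statistics

def point_biserial(total_scores, chose_alternative):
    """RPBI between choosing an alternative (1/0) and the total test score."""
    n = len(total_scores)
    chose = [x for x, c in zip(total_scores, chose_alternative) if c]
    did_not = [x for x, c in zip(total_scores, chose_alternative) if not c]
    p = len(chose) / n                       # proportion choosing the alternative
    q = 1 - p
    sd = statistics.pstdev(total_scores)     # SD of the total scores
    return (statistics.mean(chose) - statistics.mean(did_not)) / sd * (p * q) ** 0.5

# Hypothetical data: higher scorers tend to choose the keyed answer,
# so the RPBI for that alternative comes out strongly positive.
totals = [92, 88, 80, 76, 70, 65, 60, 55, 40, 24]
chose_key = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
print(round(point_biserial(totals, chose_key), 2))   # prints 0.77, a discriminating item
```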
Evaluating item discrimination. When an instructor examines the item analysis data, the RPBI is an important indicator in deciding which items are discriminating and should be retained, and which items are not discriminating and should be revised or replaced by better items (other content considerations aside). The quintile graph illustrates this same relationship between item response and total score, but the RPBI is a more precise representation of it. An item with an RPBI of .25 or below should be examined critically for revision or deletion; items with RPBIs of .40 and above are good discriminators. Note that all items, not only those with RPBIs below .25, can be improved. An examination of the matrix of responses by fifths for all items may point out weaknesses, such as implausible distractors, that can be reduced by modifying the item.
It is important to keep in mind that the statistical functioning of an item should not be the sole basis for deleting or retaining an item. The most important quality of a classroom test is its validity, the extent to which items measure relevant tasks. Items that perform poorly statistically might be retained (and perhaps revised) if they correspond to specific instructional objectives in the course. Items that perform well statistically but are not related to specific instructional objectives should be reviewed carefully before being reused.