Reliability and Validity: What makes a test a “good” test?

When someone develops a test of some feature of personality, or a test of intelligence, or a test of any other human characteristic that can differ between people, how can one decide if the test is a good test of that characteristic? Another way of asking the same question is to ask: What makes a test a good test?

A good test of any kind (including the tests you take in school) has two basic characteristics: the test is reliable, and the test is valid.

If a test is RELIABLE it means that the test measures or assesses something IN A CONSISTENT MANNER. For example, the kind of device that is used for measuring height in a doctor’s office is highly reliable. If you have your height measured in your bare feet on two different occasions, the device will produce almost exactly the same measurement of your height both times. Similarly, if two people who really are the same height have their height measured in their bare feet using this device, the device will indicate that they are the same height.

On the other hand, if you were to use a long rubber band with markings on it as your tool for measuring height, there is a good chance that you would stretch the rubber band differing amounts on the two measurement occasions and would end up with two different (perhaps very different) measures of how tall you are. The rubber band would be an example of a measuring device that is not reliable.
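One common way to put a number on this kind of consistency is to measure the same people twice and compute the correlation between the two sets of measurements (often called test-retest reliability). The short Python sketch below does this for the two devices. All of the heights are made-up numbers chosen only for illustration, and the helper name pearson is my own; nothing here comes from real data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical heights (cm) for the same six people on two occasions.
# The doctor's office device barely changes from one occasion to the next.
office_first = [150.2, 159.8, 165.1, 170.0, 179.9, 190.3]
office_second = [149.9, 160.1, 164.8, 170.2, 180.1, 189.8]

# The rubber band is stretched by a different amount every time it is used.
band_first = [170, 145, 175, 150, 185, 180]
band_second = [140, 180, 150, 185, 160, 195]

office_reliability = pearson(office_first, office_second)
band_reliability = pearson(band_first, band_second)

print(f"doctor's office device: {office_reliability:.2f}")
print(f"rubber band:            {band_reliability:.2f}")
```

A correlation close to 1 means that the two occasions agree about who is taller and who is shorter. With these illustrative numbers the doctor's office device comes out very close to 1, while the rubber band falls far short of it.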

If a test is a good test, it must be reliable. Whatever it is that the test measures, it must measure it consistently; otherwise the test is not reliable.

If a test is VALID it means that the test reliably measures WHAT IT WAS DESIGNED TO MEASURE. For example, if a test of intelligence is valid, then the test reliably measures intelligence and not something else. If a test is supposed to measure the amount that you have learned in a course in school, then if the test is valid it means that the test actually does measure the amount that you have learned in that course.

How would it be possible to know whether a test is valid (in addition to being reliable)? How would it be possible to know whether a test that someone claims is a measure of intelligence really does measure intelligence? Or how would it be possible to know whether a measure that has been developed to assess a personality trait like sociability really measures sociability?

There are two primary methods of assessing a test’s validity (there are others, but these two are the most important).

First, if the test is valid, then its development was based upon an accepted definition of that which it is supposed to measure. For example, most people would agree that having a good memory and being able to reason through difficult problems are part of what it means to be intelligent. Accordingly, it would be appropriate to have measures of memory and measures of reasoning on an intelligence test. On the other hand, the definition of intelligence does not include the idea that being tall is part of being intelligent. As a result, you would not include a measure of height as part of your intelligence test. Even a very reliable measure of, for example, driving ability, would not be a valid measure of intelligence.

The second basis for judging that a test is valid involves examining whether scores on the test predict things that the scores should predict if the test is valid. For example, it is part of our definition of intelligence that people with higher intelligence would be expected, on average, to perform better in school. Therefore, for an intelligence test to be valid, scores on the test should do a good job of predicting how well someone would be able to do in school. This is, by the way, one of the primary sources of evidence that valid intelligence tests really are valid. There is no question that individual differences in performance on the most widely used intelligence tests do predict (not perfectly, but still quite well) how well students perform in their different academic courses in school. Similarly, a valid measure of sociability should predict the degree to which different people are motivated to spend time with other people, and perhaps should also predict how many friends a person has.
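This kind of prediction can also be expressed as a correlation, this time between scores on the test and the outcome the scores are supposed to predict. The Python sketch below shows the idea. The test scores and school grades are invented for illustration only, and the names pearson and validity_coefficient are simply labels I have chosen; real validity studies use far larger samples.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical intelligence test scores for eight students...
test_scores = [85, 92, 100, 104, 110, 118, 125, 131]
# ...and the same students' later grade averages on a 4-point scale.
school_grades = [2.1, 2.6, 2.4, 3.0, 2.9, 3.4, 3.3, 3.8]

# How well do the test scores predict the outcome they should predict?
validity_coefficient = pearson(test_scores, school_grades)
print(f"test scores vs. school grades: {validity_coefficient:.2f}")
```

With these made-up numbers the correlation is high but below 1, which matches the point made above: the scores predict school performance quite well, but not perfectly.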

It is important to note that a test can be reliable without being valid. The measuring device described above that is used for measuring height is a very reliable measure, but would not be a valid measure of intelligence.

On the other hand, a test cannot be valid unless it is also reliable. If a test does not measure anything consistently, then it definitely does not validly measure whatever it was developed to measure. A non-reliable measure is simply not a good measure of anything at all.

One general point that I hope you’ve learned from this discussion of reliability and validity is that there actually are ways of measuring whether a test is a good test or not. Judgments regarding, for example, whether the Wechsler Adult Intelligence Scale (one of the most widely used IQ tests) is a good measure of intelligence do not have to be decided simply on the basis of someone’s opinion. The reliability and validity of the test can be assessed (and it is both reliable and valid).
