Measuring Intelligence

Early Approaches to Intelligence Testing

Binet and Simon wrote the first IQ test to measure a student's ability to think and reason rather than to acquire knowledge.

In the early 20th century, the French government hired psychologist Alfred Binet to identify students who might need extra assistance in school. Binet worked with fellow French psychologist Théodore Simon to explore ways to assess a student's learning ability. Instead of writing an achievement test, which measures the skills and knowledge acquired during school, they wanted to test innate attention, memory, and problem-solving skills.

Binet and Simon created the first intelligence test, called the Binet-Simon Scale. A heavily revised version remains a popular assessment tool today. An intelligence test is intended to measure the ability to think and reason rather than measuring accumulated knowledge. Their test provided information about a person's mental age, an expression of cognitive ability in terms of the age at which a typical person reaches that level of mental capacity. For example, a six-year-old who performs as well as the average nine-year-old would have a mental age of nine. A 30-year-old with developmental disabilities could also have a mental age of nine.

In 1916 American psychologist Lewis Terman of Stanford University standardized Binet's original test. A standardized test ensures that testing and scoring conditions are consistent across test takers. Standardization makes it possible to compare the performance of test takers without other variables affecting the results. The resulting standardized test was named the Stanford-Binet Intelligence Scale. It produced an intelligence quotient (IQ) score representing a person's reasoning ability. In the early days of intelligence testing, IQ was calculated by dividing a person's mental age by their chronological age and multiplying by 100.
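The ratio formula described above can be written out as a short calculation. The following is a minimal sketch in Python. The function name ratio_iq is hypothetical and chosen only for illustration, and the first example reuses the six-year-old with a mental age of nine from earlier in the text.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Early ratio IQ: mental age divided by chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological_age must be positive")
    return mental_age / chronological_age * 100


# A six-year-old who performs like an average nine-year-old.
print(ratio_iq(9, 6))  # 150.0

# A child whose mental age matches their chronological age.
print(ratio_iq(8, 8))  # 100.0
```

Under this formula, a score of 100 always means that mental age and chronological age are equal, and scores above or below 100 show how far ahead of or behind same-age peers a test taker performs.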

Items from the Original Binet-Simon Intelligence Test

The original Binet-Simon intelligence test, written in 1905, used a variety of tasks to assess intellectual ability. The tasks covered perception, spatial reasoning, and basic math skills.

Binet stressed that his intelligence test had limitations. Intelligence is complex and cannot be truly captured by a single test. Binet also stated that intelligence changes over time, is influenced by many factors, and should be compared only among children from similar backgrounds. Unfortunately, this did not keep people from misusing intelligence tests. In the early 20th century, English-language tests were administered to immigrants who did not speak English. Their low scores were incorrectly used as evidence of limited intellectual ability. Such misuse lent support to the eugenics movement. Eugenics is the idea that the human population could be improved through selective breeding, and the movement aimed to stop people with "bad genes" from having children. Starting in the 1920s, many U.S. states allowed forced sterilization of people with low test scores. The practice continued through the 1960s in some parts of the country.

The Wechsler Intelligence Scales

The Wechsler Adult Intelligence Scale (WAIS) was the first intelligence test written specifically for adults, designed to measure intelligence across various mental abilities.

American psychologist David Wechsler felt the Stanford-Binet Intelligence Scale had limitations. Even before Gardner and Sternberg proposed their theories that intelligence takes multiple forms, he believed intelligence was made up of many different mental abilities instead of just one general intelligence factor (g). The Stanford-Binet Intelligence Scale had also been designed specifically for schoolchildren, so it was not valid for use with adults. Building on the Stanford-Binet Intelligence Scale, Wechsler designed the Wechsler Adult Intelligence Scale (WAIS) in 1955. He then developed tests for younger age groups as well: the Wechsler Intelligence Scale for Children (WISC) and the Wechsler Preschool and Primary Scale of Intelligence (WPPSI).

The Wechsler Adult Intelligence Scale (WAIS) was designed to measure intelligence across various mental abilities. It has gone through four major revisions over the years. The most recent version, the WAIS-IV, was published in 2008. The WAIS-IV has 10 core subtests used to create broad index scores in four major areas of intelligence: verbal comprehension, perceptual reasoning, working memory, and processing speed. It also gives two overall intelligence scores: the Full-Scale IQ combines all four index scores, and the General Ability Index combines scores from verbal comprehension and perceptual reasoning only.

WAIS-IV Index Scores and Core Subtests

Index Score             Core Subtests
Verbal Comprehension    Similarities
Perceptual Reasoning    Block Design, Matrix Reasoning, Visual Puzzles
Working Memory          Digit Span
Processing Speed        Symbol Search

The Wechsler Adult Intelligence Scale (WAIS)-IV was designed to measure intelligence across various mental abilities in four categories using ten subtests.

WAIS scoring is designed to produce an average score of 100. Approximately two-thirds of scores fall within the average intellectual range, which includes scores from 85 to 115. About 95% of all people have scores between 70 and 130. Extremely low scores may indicate an intellectual disability, while extremely high scores may indicate giftedness. This scoring method is now the standard in intelligence testing and is also used in the modern revision of the Stanford-Binet test.

The bell curve, also called the normal curve, represents the distribution of test scores in the population. IQ tests are designed to have an average score of 100.
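The percentages quoted for the score ranges above follow from the shape of the normal curve. The sketch below checks them in Python using the standard library's NormalDist class. It assumes a standard deviation of 15 points, which the text does not state directly but which is implied by the 85 to 115 and 70 to 130 ranges. The helper name share_between is hypothetical.

```python
from statistics import NormalDist

# Assumption: scores follow a normal curve with mean 100 and
# standard deviation 15 (inferred from the ranges given in the text).
iq = NormalDist(mu=100, sigma=15)


def share_between(low: float, high: float) -> float:
    """Proportion of the population expected to score between low and high."""
    return iq.cdf(high) - iq.cdf(low)


print(f"85 to 115: {share_between(85, 115):.1%}")  # roughly two-thirds
print(f"70 to 130: {share_between(70, 130):.1%}")  # roughly 95%
```

The first range spans one standard deviation on either side of the mean and the second spans two, which is why the same two proportions appear for any normally distributed score.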

Modern Intelligence Testing

Psychologists use psychometric test-design techniques to make intelligence tests as culture-fair, valid, and reliable as possible.

Many students are familiar with school admission tests such as the SAT, ACT, GRE, MCAT, and LSAT. These tests are aptitude tests, designed to measure ability in a particular skill or field of knowledge. Colleges and universities use them to select students who are likely to perform well at their institutions. Scores on these tests correlate with intelligence test scores, but they do not assess the full scope of a person's intellectual ability.

For aptitude and intelligence tests to have value, they must be reliable and valid. Psychometrics is the science behind measurements of mental capacities, abilities, and processing. In order to be fair and useful, intelligence tests must be standardized so scores can be compared across all test takers. For example, intelligence tests have strict rules about how to deliver instructions and rules against offering hints. Otherwise, a test taker with an especially helpful test giver could get an unfairly high IQ score.

Useful intelligence tests must have content validity, meaning the test measures the behavior or skill it is intended to measure. They also must have predictive validity, meaning that a score on one measure can predict the score on a related measure. For example, if a significant percentage of students who scored very high on the LSAT failed out of law school, that test would have poor predictive validity.
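One common way to put a number on predictive validity is to correlate test scores with the later outcome they are supposed to predict. The sketch below computes a Pearson correlation coefficient in Python. The data are invented purely for illustration and do not come from any real admission test, and the names pearson_r, test_scores, and first_year_gpa are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Invented example data: admission test scores and the same
# students' later first-year grade averages.
test_scores = [150, 155, 160, 165, 170]
first_year_gpa = [2.8, 3.0, 3.1, 3.5, 3.6]

r = pearson_r(test_scores, first_year_gpa)
print(f"predictive validity coefficient: {r:.2f}")
```

A coefficient near 1 would suggest that higher test scores go with better later performance, while a coefficient near 0 would suggest the test predicts little about the outcome, as in the law school example above.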

Modern intelligence and aptitude tests are also normed, meaning that they have been given to a large, representative sample. This allows test developers to know what level of performance reflects average intellectual ability.
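Norming can be illustrated by converting a raw score into a standard score relative to the norming sample. The following Python sketch is a simplified illustration, not the procedure of any particular test: the sample is invented and far smaller than a real norming sample, the function name standard_score is hypothetical, and the scale of mean 100 and standard deviation 15 is an assumption.

```python
from statistics import mean, stdev

# Invented raw scores from a hypothetical (and unrealistically small)
# norming sample.
norm_sample = [31, 35, 38, 40, 40, 42, 45, 49]


def standard_score(raw, sample, scale_mean=100, scale_sd=15):
    """Express a raw score relative to the norming sample.

    The raw score is first converted to a z-score (distance from the
    sample mean in standard deviations), then rescaled so that the
    sample mean maps to scale_mean.
    """
    z = (raw - mean(sample)) / stdev(sample)
    return scale_mean + scale_sd * z


# A raw score equal to the sample mean maps to the scale's average.
print(standard_score(mean(norm_sample), norm_sample))  # 100.0
```

Raw scores above the sample mean map to standard scores above 100 and raw scores below it map to standard scores below 100, which is how a normed test indicates what counts as average performance.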

Intelligence test scores show relatively small differences across gender, racial, and ethnic groups. Men tend to score slightly higher on spatial reasoning and women on verbal skills. In the United States, white and Asian groups tend to score slightly higher than African American and Latino groups. However, across all groups, intellectual abilities are more alike than different. Variability within a group far exceeds variability between groups. Differences between groups are also strongly linked to environmental differences rather than innate biological differences.

Aspects of some intelligence tests depend on cultural knowledge, educational experiences, and knowledge of specific vocabulary. Test bias, which occurs when a test is comparatively more difficult for one group of people than it is for others, can influence an individual's IQ scores. For example, a test asking questions related to snow may disadvantage test takers from states or nations that rarely have cold weather. Questions using common sayings from one culture, like “comparing apples to oranges,” may disadvantage people who do not have American English as a first language.

A culture-fair intelligence test is designed to ensure it does not favor any certain cultural background over another. Tests such as Raven's Progressive Matrices, which focuses on nonverbal abstract reasoning, may be less influenced by culture and life experiences. Similarly, tasks focusing on processing speed and mental rotation may lead to fewer biases. However, as culture influences attitudes toward testing and test experience, no test is truly equivalent across cultures.

Mental Rotation Task

Mental rotation tasks require people to visualize how three-dimensional objects look as they rotate through space. These tasks do not depend on specific cultural knowledge, but experience still matters: people who play 3D video games improve their mental rotation skills.

Stereotype threat occurs when people feel concerned about confirming negative expectations about their group and, because of this apprehension, are more likely to perform poorly on tests. When test conditions trigger awareness of negative stereotypes, such as the idea that women are not good at math, people tend to perform more poorly than they would otherwise. For example, women told that a math test typically detects sex differences in mathematical ability tend to perform more poorly on the test than women who take the test without hearing that message. African Americans told they are taking a test of intellectual ability tend to perform more poorly than African Americans told the test is a problem-solving task.

Stereotypes about a group’s intellectual abilities can also bias how a test administrator gives or scores a test. If administrators have to make a judgment call about whether an answer is “good enough” to count as correct, they may unconsciously give the benefit of the doubt to people they expect to perform well. They may also unconsciously err on the side of giving too few points to people they do not expect to perform well. Test administration manuals help reduce this problem by setting standardized rules for how to give directions, when to repeat or elaborate on directions, and how to score answers.