The first standardized tests that we know of were administered in China over 2,000 years ago during the Han dynasty. Chinese officials used them to determine aptitude for various government posts. The subject matter included philosophy, farming, and even military tactics.

Standardized tests continued to be used around the world for the next two millennia, and today, they're used for everything from stair-climb evaluations for firefighters in France, to language examinations for diplomats in Canada, to tests for students in schools.

Some standardized tests measure scores only in relation to the results of other test takers. Others measure performance by how well test takers meet predetermined criteria. So the stair climb for the firefighter could be measured by comparing the time of the climb to that of all other firefighters. This might be expressed in what many call a bell curve. Or it could be evaluated with reference to set criteria, such as carrying a certain amount of weight a certain distance up a certain number of stairs. Similarly, the diplomat might be measured against other test-taking diplomats, or against a set of fixed criteria that demonstrate different levels of language proficiency. And all of these results can be expressed using something called a percentile. If a diplomat is in the 70th percentile, 70% of test takers scored below her. If she scored in the 30th percentile, 70% of test takers scored above her.

Although standardized tests are sometimes controversial, they are simply a tool. As a thought experiment, think of a standardized test as a ruler. A ruler's usefulness depends on two things. First, the job we ask it to do. Our ruler can't measure the temperature outside or how loud someone is singing. Second, the ruler's usefulness depends on its design. Say you need to measure the circumference of an orange. Our ruler measures length, which is the right quantity, but it hasn't been designed with the flexibility required for the task at hand. So, if standardized tests are given the wrong job, or aren't designed properly, they may end up measuring the wrong things.

In the case of schools, students with test anxiety may have trouble performing their best on a standardized test, not because they don't know the answers, but because they're feeling too nervous to share what they've learned. Students with reading challenges may struggle with the wording of a math problem, so their test results may reflect their literacy rather than their numeracy skills. And students who are confused by examples on tests that contain unfamiliar cultural references may do poorly, telling us more about the test taker's cultural familiarity than their academic learning. In these cases, the tests may need to be designed differently. Standardized tests can also have a hard time measuring abstract characteristics or skills, such as creativity, critical thinking, and collaboration.

If we design a test poorly, or ask it to do the wrong job, or a job it's not very good at, the results may not be reliable or valid. Reliability and validity are two critical ideas for understanding standardized tests. To understand the difference between them, we can use the metaphor of two broken thermometers. An unreliable thermometer gives you a different reading each time you take your temperature, and the reliable but invalid thermometer is consistently ten degrees too hot.

Validity also depends on accurate interpretations of results. If people say the results of a test mean something they don't, that test may have a validity problem. Just as we wouldn't expect a ruler to tell us how much an elephant weighs, or what it had for breakfast, we can't expect standardized tests alone to reliably tell us how smart someone is, how diplomats will handle a tough situation, or how brave a firefighter might turn out to be.

So standardized tests may help us learn a little about a lot of people in a short time, but they usually can't tell us a lot about a single person. Many social scientists worry about test scores resulting in sweeping and often negative changes for test takers, sometimes with long-term life consequences. We can't blame the tests, though. It's up to us to use the right tests for the right jobs, and to interpret results appropriately.
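
Two ideas from the talk are quantitative: the percentile, and the difference between an unreliable thermometer and a reliable but invalid one. The Python sketch below illustrates both. It is only an illustration, not anything shown in the talk: the function names, the scores, and the thermometer readings are all made up, and "percentage of scores strictly below" is just one of several percentile conventions in use.

```python
import statistics

def percentile_rank(score, all_scores):
    """Percentage of scores in all_scores that are strictly below score."""
    if not all_scores:
        raise ValueError("all_scores must not be empty")
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)

def bias(readings, true_value):
    """Average error of the readings: a large bias suggests an invalid tool."""
    return statistics.mean(readings) - true_value

def spread(readings):
    """Standard deviation of the readings: a large spread suggests an unreliable tool."""
    return statistics.pstdev(readings)

# Hypothetical scores for ten test takers.
scores = [52, 55, 61, 64, 68, 70, 73, 77, 81, 90]
print(percentile_rank(77, scores))  # seven of ten scores are below 77
print(percentile_rank(64, scores))  # three of ten scores are below 64

# Hypothetical repeated readings of the same true temperature.
true_temp = 37.0
unreliable = [35.1, 38.9, 36.2, 39.4, 35.4]        # jumps around on every reading
reliable_invalid = [47.0, 47.1, 46.9, 47.0, 47.0]  # consistent, about ten degrees too hot

for name, readings in [("unreliable", unreliable),
                       ("reliable but invalid", reliable_invalid)]:
    print(name, "bias:", round(bias(readings, true_temp), 2),
          "spread:", round(spread(readings), 2))
```

In this sketch the unreliable thermometer has almost no average error but a wide spread, while the reliable but invalid one has a narrow spread and an average error of about ten degrees, which mirrors the distinction drawn in the talk.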
