Twenty-five years of "temperament testing" fails to produce a single scientifically valid protocol

March 28, 2019

A dog is tested with a child-sized toy in an Australian animal shelter

Major shelters in Australia have long claimed that they (and they alone) are able to assess dogs in their shelters for the purposes of public safety and matching adoptable pets to their new owners. But how scientifically valid are these so-called "temperament tests"?

This month in the Journal of Veterinary Behavior is the paper "What is the Evidence for Reliability and Validity of Behavior Evaluations for Shelter Dogs? A Prequel to 'No Better than Flipping a Coin'".

The authors (Patronek, Bradley, and Arps) conducted an in-depth review of 25+ years of research and 17 published studies that assessed the validity and reliability of commonly used temperament assessments, including the Behavioural Assessment for Re-Homing K9's (B.A.R.K.) protocol, the Safety Assessment for Evaluating Rehoming (SAFER™) test, the Canine Behavioral Assessment and Research Questionnaire (C-BARQ), and the ASPCA's Assess-A-Pet™.

The authors found that these tests fail to produce reliable and valid results. They called for a moratorium on the use of these evaluations as the sole determinant of a dog's fate, and concluded that after more than two decades of intensive effort, a valid test is unlikely ever to be produced.

Despite 25+ years of publications, including solid studies performed under good to ideal conditions by skilled investigators, findings indicate there is no evidence that any canine behavior evaluation or individual subtest has come close to meeting accepted standards justifying claims that it is validated for routine use in shelters. Furthermore, the mean reported false-positive error rate in study populations was 35.1%, whereas in more typical shelter populations it was estimated at 63.8%.


Scientifically demonstrating overall validity (i.e., having acceptable reliability, validity, and predictive ability conferring suitability for routine use in shelter settings) has not been achieved for any behavior evaluation or subtest published to date.

Traditional pass/fail temperament tests used in shelters have proven unreliable, due to the diverse settings of shelters and the varying experience levels of shelter workers.

Reliability: The most important first step is to demonstrate that an instrument is reliable – in other words, that it is at least measuring something in a reproducible fashion (i.e. reliability).

With respect to the technical measurement of reliability, the methodological approaches that tend to increase reliability are not practical in a shelter setting.


No study has demonstrated inter-shelter reliability for any canine behavior evaluation.

They have also failed to demonstrate repeatability, due to the stressful and unfamiliar environment a dog finds himself in when in a shelter.

Repeatability: We are not the first to note the numerous conceptual problems with attempting to demonstrate elicitation of similar canine behaviors upon retesting, including the inability to scientifically determine how soon after intake to perform and when to repeat an evaluation, as well as the uncertain effect of dogs' adaptation (or failure to adapt) to the shelter environment and staff.


In summary, it appears that demonstrated reliability of any type is largely absent in published studies of canine behavior evaluations. The take-home message is that if sufficient reliability cannot be established, validity is moot.

Tests have also failed to demonstrate that they actually measure what they claim to be measuring; in other words, they lack construct validity.

Construct validation studies are conducted to establish how strongly an evaluation measures what it claims to be measuring.

In summary, although a few studies reported statistically significant findings for various aspects of construct validity, none of the studies demonstrated compelling evidence of construct validity in a more global sense.

Perhaps the most important general principle to appreciate is that overall validity can only be claimed after multiple rigorous studies, using precisely the same instrument and conducted in different populations exactly the same way, have demonstrated acceptable strength of association and statistical significance.

The environmental aspect is especially relevant to establishing validity of canine behavior evaluations, which may perform very differently in home and shelter environments, and within sub-environments in the shelter (e.g., meeting rooms, outdoor play areas, offices, hallways, and kennels) where the circumstance may not seem at all equivalent when experienced through the eyes, ears, noses, and previous experiences of dogs. Another issue is the influence of attachment (or lack thereof) to persons conducting or present during the tests.

For example, in their survey of 11 shelters in Australia, Mornement et al. (2010) describe how most did not have any documentation or step-by-step instructions on how to administer whatever behavior evaluation was being used, nor was there any standardization between or within shelters as to how soon after intake the evaluations were conducted. Evidence of training and auditing of that training to ensure consistency was limited at best. Given the limited resources and other constraints faced by many shelters just to stay abreast of daily intake and animal care requirements, we believe this is not an uncommon situation in the US as well. As noted by Mornement et al. (2010), minimizing measurement error in daily field use must be an ongoing process that can only be ensured through quality of initial training and auditing of training to be sure original standards continue to be adhered to. Factoring in staff turnover in shelters, this is not a trivial concern.

These tests of current behaviour in a shelter environment are also a poor predictor of future behaviour outside of the shelter environment.

Regardless of the target expressed behavior settled upon, perhaps the most important question to consider when thinking about face validity would be whether it is reasonable to assume that any behavior on a test battery (or individual subtest) administered at a single time in a stressful, unfamiliar shelter environment surrounded by strangers can generally be considered a reasonable surrogate for that dog’s behavior in a future home, where s/he would presumably be attached, settled, and frequently in the company of familiar people, and potentially have very different concepts of territory and different reactions to stimuli. Embedded in this concept is an assumption that the dog’s behavior in the home would be invariant after adoption and that the time of evaluation after rehoming in order to evaluate the success of the behavior evaluation in the shelter would not be important. That assumption does not strike us as plausible.


For shelter dogs, it may well be that some evaluations are detecting little more than fear, uncertainty and lack of adaptation to the shelter environment when they are conducted, as opposed to some stable, generalizable behavior that will carry over into a different situation. However, performance of tests on owned dogs has not been any better.

The authors also point out that even if a valid test were available, the unregulated nature of animal shelters and the haphazard way they apply these kinds of tests would render that original test invalid anyway.

In human clinical use, it is also well-established that once validated in a specific population, an instrument cannot be modified (at least without permission of the developers), as even minor changes to a scale, series of questions, or other rating system, may alter performance and negate any claims to validity. Indeed, it is not uncommon for developers of scales to make this requirement an explicit condition for using a particular instrument, which may be copyrighted. There is evidence to suggest this core principle is regularly ignored in the practical application of canine behavior evaluations in shelters, with shelters modifying items, deleting items, and/or adding items of their own, as well as creating their own evaluations from scratch. This type of diversion from accepted practice is a cause of great concern, and underscores the pitfalls of what happens when procedures are implemented in an environment where there is little to no regulatory or professional oversight to prevent such misuse, however well-intentioned.

We concede this is currently a moot point, as it is not possible to invalidate something that had never been validated in the first place. But it does highlight a concern for the future.

Finally, the authors conclude that after over two decades of invested research, the development of a scientifically valid temperament test is highly unlikely, and that we should be looking beyond pass/fail temperament testing of shelter dogs.

In the past, when the deficiencies of behavior evaluations have been identified, most authors have called for further research. Calls for further research are understandable when deficiencies in existing work are newly recognized, and they are justifiable when there is some reasonable expectation that successful efforts will be forthcoming. The latter would imply a consensus at minimum over which evaluation(s) and/or subtests and behaviors are the most meaningful; how the work will be funded and replicated sufficiently to support a claim of overall reliability, validity and utility; how confirmation of that validation will occur and be supported at individual shelters; how continual quality control over time will be achieved; and what error rate false-positive results would be acceptable to shelters; and most importantly, how the problem of poor predictive ability of a positive test (false-positive errors) in the face of low prevalence will be overcome. We see the latter as insurmountable based on the likely low prevalence of adoption-preventing behaviors in shelter dogs, as the method of calculating false-positive and false-negative errors is a fundamental, species-invariant principle of evaluating diagnostic test performance.
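The "poor predictive ability of a positive test in the face of low prevalence" the authors describe is a standard property of diagnostic screening, and a small calculation makes it concrete. The sketch below is illustrative only: the sensitivity, specificity, and prevalence figures are assumptions chosen to show the effect, not numbers from the paper.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Probability that a dog who fails the test truly has the
    adoption-preventing behaviour (Bayes' rule for a screening test)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Assume a fairly good test (80% sensitive, 80% specific) applied to a
# population where only 5% of dogs actually have the problem behaviour.
ppv = positive_predictive_value(0.80, 0.80, 0.05)
print(f"Share of 'failed' dogs who truly have the behaviour: {ppv:.1%}")
print(f"Share of 'failed' dogs who are false positives:      {1 - ppv:.1%}")
```

Under these assumed numbers, only about 17% of dogs who fail the test actually have the behaviour; the remaining ~83% are false positives. This is the arithmetic behind the authors' point that low prevalence makes high false-positive error rates essentially unavoidable, regardless of species or test design.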

We would argue, that when looking at the cumulative results of 25+ years of publication in this field, including solid studies performed under good to ideal conditions by skilled investigators, that calls for additional research, at least for assessing aggression-related behaviors in shelter dogs, is akin to waiting for Godot. This delay only serves to dodge the need for engaging in a much-needed conversation starting with first principles, including, as Mornement et al. (2010 p. 316) noted, considering the meaning of behavior observed under “...highly artificial conditions and during a limited time”.

But most significantly, we should stop using these tests as a pass/fail measurement at all.

Ethically, we would argue, given the lack of scientific evidence for validity, reliability, and predictability of canine behavior evaluations for individual dogs in a field setting, a moratorium on any uses of these evaluations as the sole determinant of a dog’s fate is warranted, particularly when problematic behavior on the evaluation is the only cause for concern for a dog who has otherwise acted normally in the shelter. Careful observation of a dog’s daily behavior in the shelter during routine interactions is a more natural way to gauge a dog’s needs.
Read the full paper (free to download) here.

See also

Sacrificed on the altar of temperament testing (Saving Pets 2014)

Are ‘unscientific’ temperament tests costing dogs their lives? (Saving Pets 2010)

And it's not any better for cats... Ranger, vet confirm; cat 'temperament testing' can't tell a feral cat from a pet one (Saving Pets 2018)

Find this post interesting? Share it around.