
Across Laboratory and In-the-Wild Videos: How SuperAnimal Improves the Way Animal Behavior Is Tracked

Every movement an animal makes contains a wealth of quantifiable detail: the position of the nose, the angle of the limbs, the swing of the tail, shifts in the body's center of mass, and each episode of rearing, running, exploration, or stillness. All of these can serve as clues for understanding the nervous system, disease, animal welfare, and ecological adaptation. In the past, accurately tracking animal movement from video usually required researchers to select images, manually label keypoints on the animal's body, and then train a dedicated model. Open-source tools such as DeepLabCut have greatly lowered this barrier. With only tens to hundreds of labeled images, researchers can build a fairly reliable animal pose estimation model.

Yet this workflow still has an important limitation: different laboratories may study similar animals, repeatedly label similar body parts, and train similar models independently. The problem becomes even more complicated when different datasets label the same structure in different ways. For example, the nose of a mouse may be named nose, snout, or mouse1_nose. Some datasets may label only 4 keypoints, while others may label more than 20. These inconsistencies make it difficult to combine data across laboratories and limit the ability of models to generalize across laboratories and experimental settings.


DeepLabCut technology for animal behavior tracking

This is where the SuperAnimal method comes in. Its central goal is to move animal pose estimation toward a more general and reusable framework. The research team aimed to build pretrained models, meaning models that have already been exposed to large amounts of data and have learned broadly useful patterns, so they can be applied directly to many animal species, camera conditions, and experimental environments without additional manual labeling. By learning body structure and movement patterns from large, diverse animal pose datasets, these models can later be used directly by researchers. If a new dataset differs substantially from the original training data, only a small number of labeled images may be needed for fine-tuning. The team developed two representative models: SuperAnimal-TopViewMouse, designed mainly for top-view mouse videos, and SuperAnimal-Quadruped, designed for quadrupeds beyond laboratory mice. The latter covers more than 45 animal species, with data drawn from both laboratory and in-the-wild images, totaling more than 85,000 images.


SuperAnimal-Quadruped tracking of animal body parts in images (Image source: Ye S et al. (2024), CC BY 4.0)

The purpose of SuperAnimal is to treat every dataset, regardless of differences in label number, naming conventions, or imaging conditions, as part of a broader map of the animal body. Some datasets provide more complete body information, while others contain only a few labeled positions. As long as these labels can be correctly matched to shared body parts, they can contribute to model learning. The challenge is that the model must understand that "unlabeled" does not mean "absent." For example, if a mouse dataset labels only the nose, body center, and tail base, this does not mean the ears or neck are unimportant. It simply means those positions were not annotated in that dataset. SuperAnimal's training strategy avoids this misunderstanding, allowing the model to learn a more complete representation of animal body structure from datasets with different levels of annotation completeness. The method also aligns identical or similar keypoints across datasets, reducing interference caused by differences in naming habits and human annotation bias.
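The two ideas in this paragraph, mapping differently named labels onto shared body parts and making sure unlabeled keypoints contribute nothing to training, can be illustrated with a short Python sketch. The shared keypoint list, the alias table, and the function names below are all hypothetical stand-ins for illustration, not the paper's actual implementation:

```python
import numpy as np

# Hypothetical shared keypoint vocabulary (illustrative, not the paper's exact list).
SHARED_KEYPOINTS = ["nose", "left_ear", "right_ear", "neck", "body_center", "tail_base"]

# Per-dataset aliases mapping local names onto the shared vocabulary.
ALIASES = {"snout": "nose", "mouse1_nose": "nose", "tailbase": "tail_base"}

def to_shared(labels):
    """Project one dataset's {name: (x, y)} labels onto the shared keypoint list.

    Returns an (N, 2) coordinate array and an (N,) boolean mask that is True
    only where this dataset actually annotated the keypoint.
    """
    coords = np.zeros((len(SHARED_KEYPOINTS), 2))
    mask = np.zeros(len(SHARED_KEYPOINTS), dtype=bool)
    for name, xy in labels.items():
        shared = ALIASES.get(name, name)
        if shared in SHARED_KEYPOINTS:
            i = SHARED_KEYPOINTS.index(shared)
            coords[i] = xy
            mask[i] = True
    return coords, mask

def masked_loss(pred, target, mask):
    """Mean squared error over annotated keypoints only, so unlabeled
    keypoints produce no gradient ("unlabeled" does not mean "absent")."""
    diff = (pred - target)[mask]
    return float((diff ** 2).mean())
```

A dataset that labels only the nose and body center then supervises only those two rows of the prediction, while the model remains free to learn the ears and neck from other datasets that do annotate them.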


SuperAnimal therefore learns far more than the colors, edges, or textures present in an image. General image models can learn to recognize object contours from large collections of pictures, but they do not necessarily understand how animal body parts relate to one another. By learning directly from animal pose data, this method gains a more biologically grounded ability to interpret body form. It learns how the head, trunk, limbs, and tail are usually connected, which positions change during movement, and which body keypoints should remain in reasonable spatial relationships across different postures. This pose-specific prior, grounded in animal body structure, provides an important foundation for maintaining stable performance when the model is applied to new images.


The research team first tested SuperAnimal using top-view mouse images. They deliberately excluded certain datasets during training and then asked the model to analyze images it had never seen before, simulating the situation in which a researcher applies the model to a new video. The results showed that even without additional manual labeling, SuperAnimal-TopViewMouse could still identify body positions in unfamiliar mouse images. When a small number of newly labeled images were provided, performance improved further. In one mouse behavior video dataset, fine-tuning with only 10 images allowed the model to approach the performance that a conventional method achieved with roughly 100 images. This means researchers no longer need to repeatedly label large numbers of animal body parts in order to establish a usable behavioral analysis pipeline.


The task becomes even more difficult for quadruped data, because images of horses, dogs, cats, cows, sheep, and wild rodents vary widely in their visual conditions. Animals may be partly occluded by objects in the environment, backgrounds may contain grass, shadows, or clutter, and both viewing angle and camera distance may change from image to image. Tests of SuperAnimal-Quadruped showed that the model could apply body-structure regularities learned from many animal species to new image datasets. In the horse dataset, for example, fine-tuning with only 5% of the training data was enough to match the performance of a conventional method trained with the full dataset. This is especially useful for in-the-wild animal behavior studies, where manual labeling is often more difficult and time-consuming.


Video analysis introduces another common problem: a model may perform well on individual images but become unstable across continuous video frames. An animal may be moving smoothly, yet the model's predicted nose, tail tip, or limb positions may jitter from one frame to the next, disrupting downstream behavioral interpretation. To address this, the research team proposed an unsupervised video adaptation method that requires no additional manual labels. The model first analyzes the video, then uses information from the continuous frames to refine its predictions, making the trajectories of body keypoints smoother over time. This approach reduced jitter in most videos and improved the reliability of behavioral analysis.
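As a schematic of what such a pipeline can look like, the sketch below assumes a common recipe: keep only frames where the model's own predictions are confident (in a full self-training loop these would serve as pseudo-labels for adapting the model to the video), then smooth each keypoint trajectory with a running-mean filter. The threshold and function names are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def select_confident_frames(confidences, threshold=0.9):
    """Pick frames whose per-keypoint confidence scores are all above a threshold.

    confidences: (T, N) array of per-frame, per-keypoint scores. In a
    self-training loop, predictions on these frames would be reused as
    pseudo-labels to adapt the model to this specific video.
    """
    conf = np.asarray(confidences, dtype=float)
    return np.where(conf.min(axis=1) >= threshold)[0]

def mean_filter(trajectory, window=5):
    """Smooth a (T, 2) keypoint trajectory with a centered running mean,
    damping frame-to-frame jitter while preserving the overall path."""
    traj = np.asarray(trajectory, dtype=float)
    half = window // 2
    out = np.empty_like(traj)
    for t in range(len(traj)):
        lo, hi = max(0, t - half), min(len(traj), t + half + 1)
        out[t] = traj[lo:hi].mean(axis=0)
    return out
```

Applied to a jittery nose trajectory, the filter leaves a stationary point unchanged but substantially reduces frame-to-frame variance, which is exactly the kind of stabilization the figure below illustrates.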


Tracking stability of SuperAnimal-Quadruped in videos of quadrupeds in the wild. Videos of a black dog (h) and an elk (i) in the wild were used to test model performance. The curves above indicate the 2D area of the body outline formed by the keypoints predicted by the model in each frame; frequent fluctuations in the curves indicate less stable keypoint positions across consecutive frames. The images below compare three processing outputs: raw detections are the direct predictions produced by the model, +video adaptation shows the results after the model adapts itself to the video, and +VA+mean filter adds smoothing after video adaptation. After video adaptation and smoothing, the keypoint trajectories become more continuous, indicating that the model can reduce prediction jitter in complex in-the-wild videos (Image source: Ye S et al. (2024), CC BY 4.0)

SuperAnimal provides a more efficient framework for animal behavior analysis. It brings together pose data accumulated by different researchers and turns them into pretrained models that can be shared, reused, and fine-tuned. In doing so, it allows future studies of animal behavior to build on existing collective resources rather than repeatedly starting from scratch.



Author: Shui-Ye You


Reference:

Ye S et al. (2024). SuperAnimal pretrained pose estimation models for behavioral analysis. Nature Communications.





