Horse Racing: Detection & Tracking in Real-time

The Supervisely Team is pleased to share a successful solution for tracking objects in horse racing videos. It combines a DEIM detector with the NvSORT tracking algorithm inside NVIDIA's accelerated DeepStream environment to achieve real-time, high-performance object tracking in video streams. We annotated our dataset efficiently using an Active Learning approach and a pre-labeling pipeline built on the Florence 2 model, minimizing manual labeling effort. The optimized pipeline achieves 275 FPS on an NVIDIA RTX 4090 GPU while maintaining high detection accuracy of 72.93 mAP.

Project Overview

This project delivers an automated solution for tracking objects in horse racing videos with high accuracy and real-time performance. The system identifies and tracks essential racing elements including horses, riders, horse numbers, and other race-related objects throughout video sequences.

Our solution combines two core computer vision techniques: Object Detection (locating objects in individual frames) and Multi-Object Tracking (following objects across consecutive frames). This enables analysts to monitor race performance, study racing dynamics, and extract actionable insights from video footage.

  • Data type: Video

  • Task types: Object Detection, Multi-Object Tracking

  • Models used: DEIM, Florence 2, YOLOv12

  • Key techniques: Zero-shot pre-labeling with Active Learning for efficient annotation, TensorRT optimization, NVIDIA DeepStream integration with NvSORT tracker

  • Target objects: horse, horse head, rider, number plate, white stick, yellow stick

Solution Approach

Our team focused on finding the most effective solution for this use case. The solution follows these key steps:

  1. Video Import: Upload 107 horse racing videos to the Supervisely platform for processing and analysis.

  2. Smart Data Annotation: Implement an Active Learning approach to streamline the labeling process. We begin with zero-shot pre-labeling using the Florence 2 model, then iteratively train custom detectors to continuously improve annotation quality.

  3. Model Training: Train our object detection model using DEIM architecture, which demonstrated superior performance compared to alternative models like YOLOv12 in our testing.

  4. Performance Optimization: Export the trained model to TensorRT format to maximize inference speed while preserving detection accuracy.

  5. Production Deployment: Deploy using NVIDIA's DeepStream framework integrated with the NvSORT tracker, creating an accelerated pipeline that achieves 275 FPS on an NVIDIA RTX 4090 GPU.

The solution is implemented with the Supervisely platform and its features: video annotation, model training and experiments, model evaluation and comparison, and model deployment for inference (see the corresponding documentation: Experiments, Model Evaluation, Inference & Deployment). Each step is detailed in the sections below.


1. Import data

The first step is to import the video files into the Supervisely platform. There are plenty of ways to do that, and all of them are described in the Import section of the Supervisely documentation. In this case, we'll briefly describe one of the options - manual upload of the data from the local machine.

  1. Create a new project in Supervisely (If your workspace is empty, you can just click the Import Data button).

  2. Choose the Videos option in the Type of project section, and click Create.

  3. Next, drag & drop your video files into the project.

If you need to import files from a remote server or from Cloud Storage, you can use apps like Import Videos from Cloud Storage or Remote Import.
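If you prefer to script the upload, the same can be done with the Supervisely Python SDK. The snippet below is a minimal sketch: it assumes SERVER_ADDRESS and API_TOKEN are set in your environment, and the workspace ID, project name, and file paths are placeholders (exact SDK calls may differ slightly between versions).

import supervisely as sly

# Credentials are read from the SERVER_ADDRESS and API_TOKEN environment variables
api = sly.Api.from_env()

# Create a video project and a dataset (ID and names below are placeholders)
workspace_id = 123
project = api.project.create(workspace_id, "Horse Racing", type=sly.ProjectType.VIDEOS, change_name_if_conflict=True)
dataset = api.dataset.create(project.id, "races")

# Upload a local video file into the dataset
video_info = api.video.upload_path(dataset.id, name="race_01.mp4", path="/path/to/race_01.mp4")
print(video_info.id, video_info.name)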

2. Annotation & Active Learning

For efficient annotation of a large dataset, we implemented an Active Learning approach, which makes the process more effective by iteratively training models and using them to assist in annotation. This approach significantly reduces the manual labeling effort required while maintaining high annotation quality.

Pre-labeling with Florence 2

In the initial phase, when no trained model suitable for our task was available, we used the Florence 2 model's zero-shot capabilities to generate preliminary annotations for the first 500 frames, which were uniformly sampled across the videos.

  1. Sampled 500 frames uniformly from the input videos.

  2. Crafted a custom pipeline for Florence 2 to achieve the best possible accuracy from a zero-shot model (a minimal sketch of this step is shown after the list).

  3. Applied this pipeline to automatically pre-label the first 500 frames.

  4. Sent these preliminary annotations for manual review and correction.
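For reference, here is a minimal sketch of zero-shot pre-labeling with Florence 2 via the Hugging Face transformers checkpoint. Our actual custom pipeline used different prompts and post-processing; the model ID, task token, and class prompt below are assumptions shown only for illustration.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-large"  # assumed public checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.float16).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("frame_000001.jpg").convert("RGB")
task = "<CAPTION_TO_PHRASE_GROUNDING>"  # grounding task: return boxes for the listed phrases
prompt = task + "horse. horse head. rider. number plate. white stick. yellow stick."

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# post_process_generation returns {"<CAPTION_TO_PHRASE_GROUNDING>": {"bboxes": [...], "labels": [...]}}
result = processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))
print(result[task]["labels"], result[task]["bboxes"])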

After manual correction, we evaluated the quality of pre-labeling with Florence 2.

| Model | Validation Size | F1-score (avg. per-image) | Average Recall (AR) | mAP* |
|---|---|---|---|---|
| Florence-2 pipeline | 500 frames | 0.4869 | 0.411 | 0.089 |

* mAP is reported for reference. It is not the main metric for evaluating pre-labeling quality. Additionally, Florence 2 architecture does not generate confidence scores for detected objects. To calculate mAP, we set all confidence scores to 1, which, obviously, does not reflect the model's performance accurately.

The model performed quite well for initial pre-labeling. The F1-score of 0.49 indicates that nearly half of the objects were correctly identified, which is a solid starting point for manual refinement of annotations.

Florence 2 Auto Pre-labeling

Active Learning Labeling Process

After obtaining the first 500 annotated frames, we started the iterative process of training and annotation (also known as human-in-the-loop). In each iteration, we trained a DEIM model on the currently labeled dataset, used it to pre-label the next batch of images, and then had annotators correct any mistakes. This cycle was repeated until the entire dataset of 6000 frames was annotated. The DEIM models were trained with the D-FINE-L architecture at 640x640 input resolution.
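The cycle can be summarized with the following sketch. The train_deim, predict, and review callables are hypothetical placeholders for the Supervisely training app, model inference, and the manual review step; this illustrates the loop structure, not our actual tooling.

from typing import Callable, Dict, List, Tuple

Frame = str        # path or ID of a video frame
Annotation = Dict  # bounding boxes and labels for one frame

def active_learning_loop(
    seed: List[Tuple[Frame, Annotation]],
    unlabeled_batches: List[List[Frame]],
    train_deim: Callable[[List[Tuple[Frame, Annotation]]], object],
    predict: Callable[[object, List[Frame]], List[Annotation]],
    review: Callable[[List[Frame], List[Annotation]], List[Annotation]],
) -> List[Tuple[Frame, Annotation]]:
    """Human-in-the-loop cycle: train -> pre-label the next batch -> manual correction -> repeat."""
    labeled = list(seed)                        # starts with the Florence 2 pre-labeled frames
    for batch in unlabeled_batches:             # e.g. the next 1500, then the next 4000 frames
        model = train_deim(labeled)             # DEIM D-FINE-L at 640x640 in our case
        drafts = predict(model, batch)          # pre-label the batch with the current model
        corrected = review(batch, drafts)       # annotators fix the model's mistakes
        labeled += list(zip(batch, corrected))  # grow the training set and iterate
    return labeled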

We performed 3 iterations of this process with the following dataset sizes:

  • Iteration 1: 500 annotated frames

  • Iteration 2: 2000 annotated frames

  • Iteration 3: 6000 annotated frames

After this, we re-evaluated all intermediate models on the final validation set. To compare the models fairly, we used a consistent validation set of 725 images that were not included in the training set of any iteration. This validation set was built incrementally alongside the training set during the active learning process. Here are the resulting metrics after each iteration:

| Iteration | Training Size | Validation Size | mAP |
|---|---|---|---|
| 1 | 400 | 725 | 58.71 |
| 2 | 1675 | 725 | 71.98 |
| 3 | 5275 | 725 | 72.93 |

The mAP improved significantly from 58.71 in the first iteration to 71.98 in the second iteration with the addition of 1500 more annotated frames. The improvement from the second to the third iteration was smaller, a gain of only 0.95 points, indicating that the model was approaching its performance ceiling with 6000 data examples (5275 in the training set).

Data Collections: Supervisely's Collections feature helped us manage and organize train/validation splits. Once training is complete, the validation set used in that training automatically becomes a collection, which can be reused later. See the Collections documentation.

3. Training Experiments

After the annotation process was completed with 6000 annotated frames, we proceeded to train the final object detection model. We evaluated two architectures: YOLOv12-L and DEIM D-FINE-L, and selected the one that provided the best balance of accuracy and inference speed for our use case.

Model Architectures:

  • YOLO: A popular object detection model family recognized for speed and decent accuracy. While widely used in real-time applications, it is not the most efficient model available today. Additionally, its AGPL-3.0 license restricts commercial use.

  • DEIM: A state-of-the-art real-time object detection model based on the DETR architecture. Following RT-DETR principles (as described in DETRs Beat YOLOs on Real-time Object Detection), DEIM not only outperforms YOLO in real-time detection but also provides effective strategies for accelerating training convergence (detailed in the DEIM research paper). DEIM was accepted to CVPR 2025 and is released under the Apache 2.0 open-source license.

Comparing DEIM with YOLOv12

We tested the YOLOv12-L model using the same dataset and training methodology. YOLOv12-L achieved only 45.53 mAP on the first training iteration, and 53.4 mAP on the final iteration, significantly underperforming compared to DEIM.

| Model | Iteration | Training Size | mAP |
|---|---|---|---|
| YOLOv12-L | 1 | 400 | 45.53 |
| YOLOv12-L | 3 | 5275 | 53.4 |
| DEIM D-FINE-L | 1 | 400 | 58.71 |
| DEIM D-FINE-L | 3 | 5275 | 72.93 🏆 |

Architecture Comparison:

| Model | Dataset size | mAP | Params | Latency | GFLOPs |
|---|---|---|---|---|---|
| YOLOv12-L | 6000 | 53.4 | 26.4M | 6.77ms | 88.9 |
| DEIM D-FINE-L | 6000 | 72.93 🏆 | 31M | 8.07ms | 91 |

DEIM vs YOLOv12 Comparison

Model Evaluation & Comparison in Supervisely

To evaluate and compare the models in depth, we used Supervisely's Model Evaluation Benchmark, a tool for analyzing and comparing model performance in detail. It provides a comprehensive suite of metrics and visualizations: beyond common metrics like mAP or accuracy, you can study model behavior through per-image metric tables, prediction previews, confusion matrices, precision-recall curves, and more.

F1-score DEIM vs YOLOv12

In this F1-score comparison, we can see that DEIM consistently outperforms YOLOv12 across all classes, with a particularly significant advantage on smaller objects such as "number plate" and "white stick".

Learn how to evaluate and compare models in Supervisely in the Model Evaluation Benchmark documentation.

Training with Different Resolution

We also experimented with training the DEIM model at different input resolutions: 640x640, 1536x864, and 1920x1088 (note that the DEIM input size must be divisible by 32). All runs used the same NVIDIA RTX 4090 GPU with 24GB VRAM, so we selected model variants and batch sizes that fit into GPU memory.
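The divisibility rule is why a 1920x1080 video maps to a 1920x1088 model input in the table below, which matches rounding each side up to the nearest multiple of 32. A tiny helper illustrates the idea:

def snap_to_multiple(side: int, multiple: int = 32) -> int:
    """Round an image side up to the nearest multiple (DEIM input sides must be divisible by 32)."""
    return ((side + multiple - 1) // multiple) * multiple

# A 1920x1080 frame becomes a 1920x1088 network input
print(snap_to_multiple(1920), snap_to_multiple(1080))  # 1920 1088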

| Model Variant | Input Size | Batch Size | Epochs | mAP |
|---|---|---|---|---|
| DEIM D-FINE-S | 1536x864 | 4 | 100 | 64.87 |
| DEIM D-FINE-N | 1920x1088 | 12 | 110 | 70.88 |
| DEIM D-FINE-L | 640x640 | 8 | 100 | 72.93 🏆 |

We observed that training at higher resolutions did not yield better accuracy: higher resolutions required either a smaller batch size or a model variant with fewer parameters to fit into GPU memory. The DEIM D-FINE-L model trained at 640x640 resolution achieved the highest mAP of 72.93.

Finally, we chose the DEIM D-FINE-L at 640x640 for further deployment.

Training Experiments: while experimenting, it is important to keep track of all the training sessions, configurations, and results. In Supervisely, you can explore all your training runs on the Experiments page. From there, you can start a new training session, compare results, and manage your models effectively. Check the Experiments documentation for more details.

4. Optimization & Deployment

To meet the requirement of processing video at 50+ FPS in 1920x1080 resolution, we implemented several optimization techniques:

TensorRT Export

We exported our trained model to an NVIDIA TensorRT engine. TensorRT provides significant acceleration through hardware-specific optimizations on NVIDIA GPUs.

In Supervisely, you can export models to ONNX or TensorRT directly in training applications (Train DEIM): just select the "Export to TensorRT" option in the training configuration. The model is automatically converted after training and saved to Team Files.
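The training app handles this conversion automatically. If you prefer to build the engine yourself from an ONNX export, a minimal sketch using the TensorRT Python API (assuming TensorRT 8.6+, a local model.onnx, and an FP16-capable GPU) could look like this:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model exported from the trained DEIM checkpoint
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 gives a large speedup on RTX GPUs
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)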

Nvidia DeepStream Integration

We integrated our TensorRT-optimized model into the Nvidia DeepStream framework. This framework is designed for high-performance video analytics and supports efficient processing pipelines. It also provides built-in multi-object tracking algorithms. We selected the NvSORT tracker for its balance of speed and accuracy.

Our optimized pipeline consists of:

  1. Video input at 1920x1080 resolution

  2. DEIM detector running on TensorRT in 640x640 resolution

  3. NvSORT tracker for associating detections across frames

This setup achieves real-time performance of 275 FPS on an NVIDIA RTX 4090 GPU, significantly exceeding the 50 FPS requirement.

Demo video

Quick Start with DEIM and DeepStream

After training a model with the Train DEIM app, you can easily integrate it into a DeepStream tracking pipeline. We prepared a quick start guide in this GitHub repository. The guide will help you set up the environment using our prepared Dockerfile and run inference on your video with a trained DEIM model and the NvSORT tracker in the DeepStream framework. Here are the steps:

1. Clone repository

git clone https://github.com/supervisely-ecosystem/deim
cd deim

2. Build Docker image

docker build -f supervisely_integration/deepstream/Dockerfile -t deim-deepstream .

3. Prepare data and model

After training your model, download the files model.pth (or best.pth), model_config.yml, and model_meta.json from Supervisely Team Files. Create a data folder on your machine and place your input video and model files there. The folder structure should look like this:

data/
├── input_video.mp4      # your input video
└── model/               # your model folder
    ├── model.pth        # your PyTorch trained model weights  
    ├── model_config.yml # DEIM model configuration file
    └── model_meta.json  # Supervisely export metadata (classes info)

4. Run inference

When running the container, you mount your local data/ directory into the container (-v $(pwd)/data:/data) and pass environment variables to specify the input video (INPUT_VIDEO), the model directory (MODEL_DIR), and the output path (OUTPUT_FILE). These variables must point to the paths inside the container. This way the container can access your video and model files, and save the results back to your local machine.

You can choose the output mode: either render the output video with predicted bounding boxes, or output a JSON file with predictions.

Video output (MP4 with bounding boxes):

docker run --gpus all --rm \
    -v $(pwd)/data:/data \
    -e OUTPUT_MODE=video \
    -e INPUT_VIDEO=/data/input_video.mp4 \
    -e MODEL_DIR=/data/model \
    -e OUTPUT_FILE=/data/result \
    deim-deepstream

Output: data/result.mp4

JSON output (coordinates data):

docker run --gpus all --rm \
    -v $(pwd)/data:/data \
    -e OUTPUT_MODE=json \
    -e INPUT_VIDEO=/data/input_video.mp4 \
    -e MODEL_DIR=/data/model \
    -e OUTPUT_FILE=/data/predictions \
    deim-deepstream

Output: data/predictions.json

JSON format:

{"frame_id":0,"timestamp":1234567890,"objects":[{"bbox":{"left":100.5,"top":200.3,"width":50.2,"height":80.1},"confidence":0.85,"class_id":0,"track_id":1,"class_name":"person"}]}
{"frame_id":1,"timestamp":1234567891,"objects":[{"bbox":{"left":102.1,"top":201.8,"width":49.8,"height":79.5},"confidence":0.83,"class_id":0,"track_id":1,"class_name":"person"}]}

5. Exporting Data and Models

Exporting the data

At any time, you can export your assets from the Supervisely platform. This applies to both the data (video files with annotations) and the trained models. There are several ways to download and export the data, which are described in the Export section of the Supervisely documentation. In this case, we'll briefly describe one of the options - exporting the data from the platform's UI.

Export Project

Exporting the models

All of the artifacts that were created during the training process, including the trained models, are stored in the Team Files. You can just right-click on any folder or file and download it to your local machine.

There's no vendor lock-in with Supervisely, so you can use the models completely outside of the Supervisely platform, for example in your own Python scripts or in Docker containers. Check our documentation on how you can use and deploy trained models: Inference & Deployment, and Using trained models outside of Supervisely.

Export Model

Using trained models outside of Supervisely

We prepared a demo script that shows how to load the trained DEIM model and get predictions on images in pure PyTorch code (outside of Supervisely).

This way, you can download the trained model from Team Files and use it in your own code. There are also demos for using the model in ONNX and TensorRT formats.
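As a rough illustration of the ONNX route (a sketch, not the official demo), the snippet below runs an exported detector with ONNX Runtime. The input and output names ("images", "orig_target_sizes") and the preprocessing are assumptions based on common RT-DETR/DEIM exports; verify them against your own model with session.get_inputs() and session.get_outputs().

import numpy as np
import onnxruntime as ort
from PIL import Image

# Load the exported ONNX model (path is a placeholder)
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# Preprocess: resize to the training resolution and scale pixels to [0, 1] (assumed preprocessing)
image = Image.open("frame.jpg").convert("RGB")
orig_w, orig_h = image.size
resized = image.resize((640, 640))
tensor = np.asarray(resized, dtype=np.float32).transpose(2, 0, 1)[None] / 255.0

# Input/output names and output order are assumptions for a typical RT-DETR/DEIM export
labels, boxes, scores = session.run(None, {
    "images": tensor,
    "orig_target_sizes": np.array([[orig_w, orig_h]], dtype=np.int64),
})

keep = scores[0] > 0.5  # confidence threshold
print(labels[0][keep], boxes[0][keep], scores[0][keep])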
