Custom Benchmark Integration
Overview
In this guide, we will show you how to integrate a custom benchmark evaluation using Supervisely SDK. For most use cases, our Evaluator for Model Benchmark app in the Ecosystem provides a set of built-in evaluation metrics for various task types, such as object detection, instance segmentation, and semantic segmentation. However, in some cases, you may need to define custom metrics that are specific to your use case. The custom benchmark implementation allows you to achieve this goal – to evaluate the model performance with your own business metrics and visualize the results in a comprehensive report.

Key features of the custom benchmark implementation in Supervisely:
Custom Metrics and Charts: Implement custom evaluation metrics and visualizations for your specific use case.
Automate with Python SDK & API: Run evaluations from a script in your pipeline, or release the benchmark as a private app to run evaluations in a few clicks from the GUI or to automate the launch via the Supervisely API (learn more here).
Easy integration with experiments in Supervisely: Integrate the custom benchmark with your custom training app to evaluate the best model checkpoint after training automatically and visualize the results in the Experiments page.
Implement Custom Evaluation
The custom benchmark implementation consists of several classes that interact with each other to perform the evaluation process and generate the report.
🛠️ Brief overview of the relationships between the instances:

All you need to do is implement these classes with your custom logic to calculate the metrics and generate the visualizations. We will guide you through the process step by step. Let's get started!
1. Custom Evaluator
The Evaluator is the key component of a custom benchmark. Its main responsibility is to process Ground Truth (GT) and Predictions data, preparing it for evaluation.
Instead of computing every metric and chart directly, the Evaluator focuses on generating essential processed data that serves as the foundation for further analysis. Some computer vision tasks require computationally expensive operations. For example, in object detection, each predicted instance must be matched with the corresponding GT instance, which can take significant time. However, once this matching is done, calculating metrics becomes straightforward.
To optimize performance, the Evaluator:
Processes raw GT and Prediction data into a structured format suitable for metric calculation.
Handles computationally intensive tasks like matching predictions to GT.
Saves processed data to disk, avoiding redundant computations and speeding up further analysis.
By handling the heavy lifting in the evaluation pipeline, the Evaluator ensures that metric computation remains efficient and scalable.
Before you start, make sure you have downloaded Ground Truth and Predictions projects in Supervisely format with the same datasets and classes. If you need to run evaluations on a subset of classes, you can provide a classes_whitelist parameter.
BaseEvaluator is a base class that provides the basic functionality for the evaluator.
Available arguments in the BaseEvaluator class:
gt_project_path: Path to the local GT project.
pred_project_path: Path to the local Predictions project.
evaluation_params: Optional. Evaluation parameters.
result_dir: Optional. Directory to save evaluation results. Default is ./evaluation.
classes_whitelist: Optional. List of classes to evaluate.
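For reference, here is a minimal construction call using these arguments with keyword names. The paths and class names are placeholders, and MyEvaluator is the class we implement just below:
    # construct the evaluator with explicit keyword arguments (paths and classes are placeholders)
    evaluator = MyEvaluator(
        gt_project_path="APP_DATA/gt_project",
        pred_project_path="APP_DATA/pred_project",
        result_dir="APP_DATA/evaluation",     # defaults to ./evaluation if omitted
        classes_whitelist=["car", "person"],  # optional: evaluate only these classes
    )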
Let's start by creating a new class MyEvaluator that inherits from BaseEvaluator and overrides the evaluate method. In our example, the evaluate method compares GT and predicted annotations from two projects, counting, for each object class, the number of objects and the number of images in which the class appears. It iterates through datasets and images, retrieves annotations, and collects these statistics for both ground truth and predicted data. Finally, it stores the evaluation results in self.eval_data and dumps them to disk.
# src/evaluator.py
from collections import defaultdict
from pathlib import Path

import supervisely as sly
from supervisely.nn.benchmark.base_evaluator import BaseEvaluator

from src.eval_result import MyEvalResult


class MyEvaluator(BaseEvaluator):
    eval_result_cls = MyEvalResult  # we will implement this class in the next step

    def evaluate(self):
        """This method should perform the evaluation process."""
        # For example, let's iterate over all datasets and calculate some statistics
        gt_project = sly.Project(self.gt_project_path, sly.OpenMode.READ)
        pred_project = sly.Project(self.pred_project_path, sly.OpenMode.READ)
        gt_stats = {"images_count": defaultdict(int), "objects_count": defaultdict(int)}
        pred_stats = {"images_count": defaultdict(int), "objects_count": defaultdict(int)}

        for ds_1 in gt_project.datasets:
            ds_1: sly.Dataset
            ds_2 = pred_project.datasets.get(ds_1.name)
            for name in ds_1.get_items_names():
                ann_1 = ds_1.get_ann(name, gt_project.meta)
                ann_2 = ds_2.get_ann(name, pred_project.meta)
                # classes found on the current image (to count images per class)
                gt_found_classes, pred_found_classes = set(), set()
                for label in ann_1.labels:
                    class_name = label.obj_class.name
                    gt_found_classes.add(class_name)
                    gt_stats["objects_count"][class_name] += 1
                for label in ann_2.labels:
                    class_name = label.obj_class.name
                    pred_found_classes.add(class_name)
                    pred_stats["objects_count"][class_name] += 1
                for class_name in gt_found_classes:
                    gt_stats["images_count"][class_name] += 1
                for class_name in pred_found_classes:
                    pred_stats["images_count"][class_name] += 1

        # save the evaluation results
        self.eval_data = {"gt_stats": gt_stats, "pred_stats": pred_stats}

        # Dump the eval_data to disk (to be able to load it later)
        save_path = Path(self.result_dir) / "eval_data.json"
        sly.json.dump_json_file(self.eval_data, str(save_path))
        return self.eval_data
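For clarity, here is what the dumped eval_data.json might look like (class names and counts are purely illustrative):
    # Example contents of eval_data.json (illustrative values)
    {
        "gt_stats": {
            "images_count": {"car": 120, "person": 95},
            "objects_count": {"car": 340, "person": 210}
        },
        "pred_stats": {
            "images_count": {"car": 118, "person": 90},
            "objects_count": {"car": 355, "person": 198}
        }
    }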
2. Custom EvalResult
This class will be used as a data interface for accessing the evaluation metrics in the visualizer.
When an EvalResult object is initialized, it calls the _read_files method to load the evaluation metrics from disk and the _prepare_data method to prepare the data for easy access. So, you need to implement these two methods in the MyEvalResult class.
Let's create a new file eval_result.py and implement the MyEvalResult class.
# src/eval_result.py
from collections import defaultdict
from pathlib import Path

import supervisely as sly
from supervisely.nn.benchmark.base_evaluator import BaseEvalResult


class MyEvalResult(BaseEvalResult):
    def _read_files(self, path: str) -> None:  # ⬅︎ This method is required
        """This method should LOAD evaluation metrics from disk."""
        save_path = Path(path) / "eval_data.json"  # path to the saved evaluation metrics
        self.eval_data = sly.json.load_json_file(str(save_path))

    def _prepare_data(self) -> None:  # ⬅︎ This method is required
        """This method should PREPARE data to allow easy access to the data."""
        gt = self.eval_data.get("gt_stats", {})
        pred = self.eval_data.get("pred_stats", {})

        # class statistics (class names as keys and number of objects as values)
        self._objects_per_class = self._get_objects_per_class(gt, pred)

        # GT metrics
        gt_obj_num = self._get_total_objects_count(gt)
        gt_cls_num = self._get_num_of_used_classes(gt)
        gt_cls_most_freq = self._get_most_frequent_class(gt)

        # Prediction metrics
        pred_obj_num = self._get_total_objects_count(pred)
        pred_cls_num = self._get_num_of_used_classes(pred)
        pred_cls_most_freq = self._get_most_frequent_class(pred)

        self._key_metrics = {
            "Objects Count": [gt_obj_num, pred_obj_num],
            "Found Classes": [gt_cls_num, pred_cls_num],
            "Classes with Max Figures": [gt_cls_most_freq, pred_cls_most_freq],
        }

    # ---------------- ⬇︎ Properties to access the data easily ⬇︎ ----------------- #
    @property
    def key_metrics(self):
        """Return key metrics as a dictionary."""
        return self._key_metrics.copy()

    @property
    def objects_per_class(self):
        """Return the number of objects per class."""
        return self._objects_per_class.copy()

    # ------- ⬇︎ Utility methods (you can create any methods you need) ⬇︎ --------- #
    def _get_most_frequent_class(self, stats: dict):
        name = max(stats.get("objects_count", {}).items(), key=lambda x: x[1])[0]
        return f"{name} ({stats['objects_count'][name]})"

    def _get_total_objects_count(self, stats: dict):
        return sum(stats.get("objects_count", {}).values())

    def _get_objects_per_class(self, gt: dict, pred: dict):
        gt_obj_stats = gt.get("objects_count", {})
        pred_obj_stats = pred.get("objects_count", {})
        objects_per_class = defaultdict(dict)
        for name, gt_objects_count in gt_obj_stats.items():
            pred_objects_count = pred_obj_stats.get(name, 0)
            objects_per_class[name] = [gt_objects_count, pred_objects_count]
        return objects_per_class

    def _get_num_of_used_classes(self, stats: dict):
        return len(stats.get("images_count", {}))
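Once you have an evaluator instance (see the main.py script below), you can quickly sanity-check the data interface. The printed values here are illustrative:
    # Quick check of the EvalResult data interface (printed values are illustrative)
    eval_result = evaluator.get_eval_result()   # returns a MyEvalResult instance
    print(eval_result.key_metrics)              # {'Objects Count': [550, 553], ...}
    print(eval_result.objects_per_class)        # {'car': [340, 355], 'person': [210, 198]}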
3. Visualizer, Charts and Widgets
This step involves creating a custom Visualizer class that inherits from BaseVisualizer. The class should generate visualizations, save them to disk, and upload them to the Team Files (to open the visualizations in the web interface).
First, let's create a few widgets that we will use in the visualizer. We will start with the MarkdownWidget, TableWidget, and ChartWidget. Our example will include three sections in the report: Intro (header + overview), KeyMetrics (text + table), and CustomMetric (text + chart). To make the code more readable, we will split the code into separate files for each section.
Feel free to change the widget content and appearance to suit your needs. The example below with Markdown, Table, and Chart widgets is just a starting point.
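For example, the Intro section can be a small class that exposes a header and an overview as MarkdownWidget instances. The sketch below is a minimal starting point: the widget constructor arguments and report texts are assumptions, so check supervisely.nn.benchmark.visualization.widgets for the exact signatures. Remember to re-export the widget classes from src/widgets/__init__.py so the visualizer can later import them with from src.widgets import CustomMetric, Intro, KeyMetrics.
    # src/widgets/intro.py (sketch; MarkdownWidget arguments are assumptions)
    from datetime import datetime

    from supervisely.nn.benchmark.visualization.widgets import MarkdownWidget


    class Intro:
        def __init__(self, vis_text, eval_result) -> None:
            self.vis_text = vis_text
            self.eval_result = eval_result

        def get_header(self, user_login: str) -> MarkdownWidget:
            # report header with the author login and the creation date
            date = datetime.now().strftime("%d %B %Y, %H:%M")
            text = f"# Custom Benchmark Report\n\nCreated by **{user_login}** on {date}."
            return MarkdownWidget("intro_header", "Header", text=text)

        @property
        def md(self) -> MarkdownWidget:
            text = "## Overview\n\nThis report compares GT and predicted annotations class by class."
            return MarkdownWidget("intro_overview", "Overview", text=text)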
Take a look at the Intro widget:

Let's create the KeyMetrics section.
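The KeyMetrics section combines a MarkdownWidget with a TableWidget built from eval_result.key_metrics. Again, this is a hedged sketch: the TableWidget data format shown here (columns plus content rows) is an assumption, so verify it against the SDK widget source.
    # src/widgets/key_metrics.py (sketch; TableWidget data format is an assumption)
    from supervisely.nn.benchmark.visualization.widgets import MarkdownWidget, TableWidget


    class KeyMetrics:
        def __init__(self, vis_text, eval_result) -> None:
            self.vis_text = vis_text
            self.eval_result = eval_result

        @property
        def md(self) -> MarkdownWidget:
            text = "## Key Metrics\n\nBasic counts for GT and predicted annotations."
            return MarkdownWidget("key_metrics_text", "Key Metrics", text=text)

        @property
        def table(self) -> TableWidget:
            # one row per metric: [metric name, GT value, Prediction value]
            columns = ["Metric", "GT", "Predictions"]
            content = []
            for metric, (gt_value, pred_value) in self.eval_result.key_metrics.items():
                row = [metric, gt_value, pred_value]
                content.append({"row": row, "id": metric, "items": row})
            data = {"columns": columns, "content": content}
            return TableWidget("key_metrics_table", data=data)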
Here is the KeyMetrics widget in action:

Let's create the CustomMetric section.
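The CustomMetric section pairs a MarkdownWidget with a ChartWidget that wraps a Plotly figure built from eval_result.objects_per_class. This is a sketch under the same assumptions about the widget constructors:
    # src/widgets/custom_metric.py (sketch; ChartWidget arguments are assumptions)
    import plotly.graph_objects as go

    from supervisely.nn.benchmark.visualization.widgets import ChartWidget, MarkdownWidget


    class CustomMetric:
        def __init__(self, vis_text, eval_result) -> None:
            self.vis_text = vis_text
            self.eval_result = eval_result

        @property
        def md(self) -> MarkdownWidget:
            text = "## Objects per Class\n\nNumber of GT and predicted objects for each class."
            return MarkdownWidget("custom_metric_text", "Objects per Class", text=text)

        @property
        def chart(self) -> ChartWidget:
            # grouped bar chart: GT vs Prediction object counts per class
            per_class = self.eval_result.objects_per_class
            classes = list(per_class.keys())
            fig = go.Figure()
            fig.add_trace(go.Bar(name="GT", x=classes, y=[v[0] for v in per_class.values()]))
            fig.add_trace(go.Bar(name="Predictions", x=classes, y=[v[1] for v in per_class.values()]))
            return ChartWidget("custom_metric_chart", figure=fig)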
The CustomMetric widget will look like this:

Finally, let's implement the custom visualizer class that will use these widgets. All you need to do is to implement the _create_widgets and _create_layout methods.
# src/visualizer.py
from supervisely.nn.benchmark.base_visualizer import BaseVisualizer
from supervisely.nn.benchmark.visualization.widgets import (
    ContainerWidget,
    SidebarWidget,
)
from supervisely.nn.task_type import TaskType

from src.widgets import CustomMetric, Intro, KeyMetrics


class MyVisualizer(BaseVisualizer):

    @property
    def cv_task(self):
        return TaskType.OBJECT_DETECTION

    def _create_widgets(self):
        """In this method, we initialize and configure all the widgets that we will use"""
        vis_text = "N/A"  # not used in this example

        # Intro (Markdown)
        me = self.api.user.get_my_info()
        intro = Intro(vis_text, self.eval_result)
        self.intro_header = intro.get_header(me.login)
        self.intro_md = intro.md

        # Key Metrics (Markdown + Table)
        key_metrics = KeyMetrics(vis_text, self.eval_result)
        self.key_metrics_md = key_metrics.md
        self.key_metrics_table = key_metrics.table

        # Custom Metric (Markdown + Chart)
        custom_metric = CustomMetric(vis_text, self.eval_result)
        self.custom_metric_md = custom_metric.md
        self.custom_metric_chart = custom_metric.chart

    def _create_layout(self):
        """
        Method to create the layout of the visualizer.
        We define the order of the widgets in the report and their visibility in the sidebar.
        """
        # Create widgets
        self._create_widgets()

        # Configure sidebar
        # (if 1 - will display in sidebar, 0 - will not display in sidebar)
        is_anchors_widgets = [
            # Intro
            (0, self.intro_header),
            (1, self.intro_md),
            # Key Metrics
            (1, self.key_metrics_md),
            (0, self.key_metrics_table),
            # Custom Metric
            (1, self.custom_metric_md),
            (0, self.custom_metric_chart),
        ]
        anchors = []
        for is_anchor, widget in is_anchors_widgets:
            if is_anchor:
                anchors.append(widget.id)

        sidebar = SidebarWidget(widgets=[i[1] for i in is_anchors_widgets], anchors=anchors)
        layout = ContainerWidget(title="Custom Benchmark", widgets=[sidebar])
        return layout
4. Run the code
Before we run the custom benchmark, prepare the environment credentials in the supervisely.env file:
SERVER_ADDRESS= # ⬅︎ change the value
API_TOKEN= # ⬅︎ change the value
Learn about the basics of authentication in Supervisely here.
Create a main.py script to run the custom benchmark:
# src/main.py
import os

import supervisely as sly
from dotenv import load_dotenv

from src.evaluator import MyEvaluator
from src.visualizer import MyVisualizer

if sly.is_development():
    load_dotenv(os.path.expanduser("~/supervisely.env"))
    load_dotenv("local.env")

api = sly.Api()
team_id = sly.env.team_id()

gt_project_id = 73
pred_project_id = 159

workdir = sly.app.get_data_dir()
gt_path = os.path.join(workdir, "gt_project")
pred_path = os.path.join(workdir, "pred_project")
eval_result_dir = os.path.join(workdir, "evaluation")
vis_result_dir = os.path.join(workdir, "visualizations")

# 0. Download projects
for project_id, path in [(gt_project_id, gt_path), (pred_project_id, pred_path)]:
    if not sly.fs.dir_exists(path):
        sly.download_project(
            api,
            project_id,
            path,
            log_progress=True,
            save_images=False,
            save_image_info=True,
        )

# 1. Initialize Evaluator
evaluator = MyEvaluator(gt_path, pred_path, eval_result_dir)

# 2. Run evaluation
evaluator.evaluate()

# 3. Initialize EvalResult object
eval_result = evaluator.get_eval_result()

# 4. Initialize visualizer and visualize
visualizer = MyVisualizer(api, [eval_result], vis_result_dir)
visualizer.visualize()

# 5. Upload to Supervisely Team Files
remote_dir = "/model-benchmark/custom_benchmark"
api.file.upload_directory(team_id, evaluator.result_dir, remote_dir + "/evaluation")

# ⬇︎ required to open visualizations in the web interface
visualizer.upload_results(team_id, remote_dir + "/visualizations/")
🔗 Recap of the files structure:
.
├── src/
│   ├── __init__.py
│   ├── evaluator.py          # 49 lines of code
│   ├── eval_result.py        # 73 lines of code
│   ├── visualizer.py         # 64 lines of code
│   ├── widgets/
│   │   ├── __init__.py
│   │   ├── intro.py          # 36 lines of code
│   │   ├── key_metrics.py    # 38 lines of code
│   │   └── custom_metric.py  # 22 lines of code
│   └── main.py               # 53 lines of code
└── local.env                 # 1 line of code
Run the main.py script with python src/main.py in the terminal. For debugging, you can use the launch.json file in the repository (select the "Python Current File" configuration and press F5 or Run and Debug).

After the evaluation is complete, you will receive a link to the report in the logs. You can open the report in the web interface by clicking on the link. You will also find the evaluation results in Team Files, in the folder that you specified in the script (/model-benchmark/custom_benchmark/visualizations/Model Evaluation Report.lnk).

Hooray! 🎉 You have successfully implemented a custom benchmark evaluation in Supervisely!

But wait, there is another way to run the custom benchmark – using a deployed NN model. Let's move on to the next step.
Run Evaluation of your Models
In this section, we will evaluate a model that is deployed on the Supervisely platform. We will use the model to get predictions and evaluate them, instead of using a downloaded project with predictions.
For this purpose, we will create a new custom benchmark class that inherits from BaseBenchmark.
All you need to do is to change only 3 lines of code! 💫
1. Create a new file benchmark.py with the following content:
# src/benchmark.py
from supervisely.nn.benchmark.base_benchmark import BaseBenchmark
from supervisely.nn.task_type import TaskType

from src.evaluator import MyEvaluator
from src.visualizer import MyVisualizer


class CustomBenchmark(BaseBenchmark):
    visualizer_cls = MyVisualizer  # ⬅︎ the visualizer class

    @property
    def cv_task(self) -> str:
        return TaskType.OBJECT_DETECTION  # ⬅︎ the CV task type

    def _get_evaluator_class(self) -> type:
        return MyEvaluator  # ⬅︎ the evaluator class
Here is a brief overview of the relationships between the classes in this scenario. As you can see, we will use the same engine classes, but the input will be different – the GT project ID and the deployed model session ID (instead of the local projects).

2. Update the main.py script to run the custom benchmark on the deployed model session:
# src/main.py
import os

import supervisely as sly
from dotenv import load_dotenv

from src.benchmark import CustomBenchmark

if sly.is_development():
    load_dotenv(os.path.expanduser("~/supervisely.env"))
    load_dotenv("local.env")

api = sly.Api()
team_id = 8

gt_project_id = 73
pred_project_id = 159
model_session_id = 1234

# 1. Initialize benchmark
bench = CustomBenchmark(api, gt_project_id, output_dir=sly.app.get_data_dir())

# 2. Run evaluation
bench.evaluate(pred_project_id)  # ⬅︎ evaluate without inference
# bench.run_evaluation(model_session_id)  # ⬅︎ evaluate with inference

# 3. Generate charts and dashboards
bench.visualize()

# 4. Upload to Supervisely Team Files
remote_dir = f"/model-benchmark/custom_benchmark/{model_session_id}"
bench.upload_eval_results(remote_dir + "/evaluation/")

# ⬇︎ required to open visualizations in the web interface
bench.upload_visualizations(remote_dir + "/visualizations/")
That's it! You are now ready to run the custom evaluation on different deployed models.
Run the main.py script with python src/main.py in the terminal (or use the launch.json file from the source code for debugging).
As in the previous step, you will receive a link to the report in the logs or find the evaluation results in the Team Files.

Great job! 🎉 You have successfully implemented a custom benchmark evaluation on the deployed model in Supervisely!
Let's move on to the next level and integrate the custom benchmark with the GUI interface 🎨.
Plug-in the Custom Benchmark to the GUI
In this step, we will create a sly.Application (a high-level class in the Supervisely SDK that allows you to create a FastAPI application with a GUI) that will run the custom benchmark evaluation. The application will let you select the GT project, the deployed model session, and the evaluation parameters, and run the evaluation with a few clicks in the web interface.
You can take a look at the Evaluator for Model Benchmark app in the Ecosystem to see how we implemented the GUI interface for the evaluation process.
First, let's create the local.env file with the following variables:
SLY_APP_DATA_DIR = "APP_DATA"
TEAM_ID = 8
Now, we will upgrade main.py from a simple script to a sly.Application.
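The exact layout is up to you; below is a minimal sketch of what the application file could look like. The widget set (SelectProject, SelectAppSession, Button, Container), their arguments (including the "deployed_nn" session tag), and the getter methods are assumptions based on the sly.app.widgets module, so adapt them to your SDK version.
    # src/main.py (GUI sketch; widget names, arguments, and the session tag are assumptions)
    import os

    import supervisely as sly
    from dotenv import load_dotenv
    from supervisely.app.widgets import Button, Container, SelectAppSession, SelectProject

    from src.benchmark import CustomBenchmark

    if sly.is_development():
        load_dotenv(os.path.expanduser("~/supervisely.env"))
        load_dotenv("local.env")

    api = sly.Api()
    team_id = sly.env.team_id()

    # GUI widgets: GT project selector, deployed model session selector, run button
    sel_project = SelectProject()
    sel_session = SelectAppSession(team_id, tags=["deployed_nn"])
    run_button = Button("Evaluate")

    layout = Container(widgets=[sel_project, sel_session, run_button])
    app = sly.Application(layout=layout)  # `app` is the object served by uvicorn (src.main:app)


    @run_button.click
    def run_evaluation():
        gt_project_id = sel_project.get_selected_id()
        model_session_id = sel_session.get_selected_id()

        bench = CustomBenchmark(api, gt_project_id, output_dir=sly.app.get_data_dir())
        bench.run_evaluation(model_session_id)  # inference + evaluation
        bench.visualize()

        remote_dir = f"/model-benchmark/custom_benchmark/{model_session_id}"
        bench.upload_eval_results(remote_dir + "/evaluation/")
        bench.upload_visualizations(remote_dir + "/visualizations/")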
Launch the application using the following command in the terminal:
uvicorn src.main:app --host 0.0.0.0 --port 8000 --ws websockets --reload
Or use the launch.json file from the source code for debugging and press F5 or Run and Debug.
Open the browser and go to http://localhost:8000 to see the GUI. Select the GT project and the deployed model session, then press the Evaluate button to run the evaluation. The app will connect to the deployed NN model, run the inference, upload predictions to a new project, evaluate the model, and generate the report.
After the process is complete, you will see a widget with the evaluation report (click on the link to open the report in the web interface).

Check out our Developer Portal to learn more about how to release your app as a private app in Supervisely – here.
Congratulations! 🎉 You have successfully integrated the custom benchmark with the GUI interface in Supervisely!
