We have prepared a GitHub repository with the source code for this guide. You can clone the repository and follow the instructions to implement a custom benchmark evaluation in Supervisely.
In this guide, we will show you how to integrate a custom benchmark evaluation using the Supervisely SDK. For most use cases, our Evaluator for Model Benchmark app in the Ecosystem provides a set of built-in evaluation metrics for various task types, such as object detection, instance segmentation, and semantic segmentation. However, in some cases you may need metrics that are specific to your use case. A custom benchmark implementation lets you evaluate model performance with your own business metrics and visualize the results in a comprehensive report.
Key features of the custom benchmark implementation in Supervisely:
Custom Metrics and Charts: Implement custom evaluation metrics and visualizations for your specific use case.
Automate with Python SDK & API: Run evaluations from the script in your pipeline or release as a private app. Releasing as a private app allows you to run evaluations with a few clicks in the GUI interface or automate the launch using the Supervisely API (learn more here).
Easy integration with experiments in Supervisely: Integrate the custom benchmark with your custom training app to evaluate the best model checkpoint after training automatically and visualize the results in the Experiments page.
Implement Custom Evaluation
The custom benchmark implementation consists of several classes that interact with each other to perform the evaluation process and generate the report.
🧩 Key Components.
Here are the main classes that you need to subclass to implement your custom benchmark:
BaseEvaluator: This class is responsible for all calculations and evaluation procedures needed for metrics and visualizations. It should output the evaluation data from which the necessary metrics and charts will be generated, and save this data to disk.
BaseEvalResult: A data interface class that provides easy access to metrics and data for charts. The EvalResult class will be used by the Visualizer to retrieve ready-to-use metrics.
BaseVisualizer: A class that generates the resulting evaluation report. It is responsible for rendering all your code and widgets into a web page.
🛠️ Brief overview of the relationships between the instances:
And all you need to do is to implement these classes with your custom logic to calculate the metrics and generate the visualizations. We will guide you through the process step by step. Let's get started!
Check out the GitHub repository with the source code for this guide to see examples of the implementation.
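As a rough mental model, the flow between the three roles described above can be mimicked with plain Python stand-ins. The sketch below does not use the Supervisely SDK: `ToyEvaluator`, `ToyEvalResult`, `ToyVisualizer`, and the sample numbers are hypothetical and only illustrate how data moves from evaluation to report.

```python
import json
import tempfile
from pathlib import Path


class ToyEvaluator:
    """Stand-in for an evaluator: does the heavy work and saves raw results."""

    def __init__(self, result_dir: str):
        self.result_dir = Path(result_dir)

    def evaluate(self) -> dict:
        eval_data = {"objects_count": {"car": 3, "person": 5}}  # made-up numbers
        (self.result_dir / "eval_data.json").write_text(json.dumps(eval_data))
        return eval_data


class ToyEvalResult:
    """Stand-in for a result object: loads saved data and exposes metrics."""

    def __init__(self, result_dir: str):
        raw = json.loads((Path(result_dir) / "eval_data.json").read_text())
        self.total_objects = sum(raw["objects_count"].values())


class ToyVisualizer:
    """Stand-in for a visualizer: turns ready-to-use metrics into a report."""

    def __init__(self, eval_result: ToyEvalResult):
        self.eval_result = eval_result

    def render(self) -> str:
        return f"<p>Total objects: {self.eval_result.total_objects}</p>"


with tempfile.TemporaryDirectory() as tmp:
    ToyEvaluator(tmp).evaluate()
    report_html = ToyVisualizer(ToyEvalResult(tmp)).render()

print(report_html)
```

The point of the split is that the visualizer never touches raw annotations: it only reads what the result object exposes.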
1. Custom Evaluator
The Evaluator is the key component of a custom benchmark. Its main responsibility is to process Ground Truth (GT) and Predictions data, preparing it for evaluation.
Instead of computing every metric and chart directly, the Evaluator focuses on generating essential processed data that serves as the foundation for further analysis. Some computer vision tasks require computationally expensive operations. For example, in object detection, each predicted instance must be matched with the corresponding GT instance, which can take significant time. However, once this matching is done, calculating metrics becomes straightforward.
To optimize performance, the Evaluator:
Processes raw GT and Prediction data into a structured format suitable for metric calculation.
Handles computationally intensive tasks like matching predictions to GT.
Saves processed data to disk, avoiding redundant computations and speeding up further analysis.
By handling the heavy lifting in the evaluation pipeline, the Evaluator ensures that metric computation remains efficient and scalable.
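To make the "matching" step mentioned above more concrete, here is a minimal sketch of greedy IoU matching between predicted and GT boxes. It is a simplified illustration, not the algorithm used by the built-in Supervisely benchmarks: the box format, the `iou_thr` default, and the function names are assumptions, and real evaluators usually also sort predictions by confidence and match per class.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(gt: List[Box], pred: List[Box], iou_thr: float = 0.5) -> Dict[int, int]:
    """Greedily pair each prediction with the best still-unmatched GT box."""
    matches: Dict[int, int] = {}  # prediction index -> GT index
    used_gt = set()
    for p_idx, p_box in enumerate(pred):
        best_gt, best_iou = None, iou_thr
        for g_idx, g_box in enumerate(gt):
            if g_idx in used_gt:
                continue
            score = iou(p_box, g_box)
            if score >= best_iou:
                best_gt, best_iou = g_idx, score
        if best_gt is not None:
            matches[p_idx] = best_gt
            used_gt.add(best_gt)
    return matches


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 10, 10), (50, 50, 60, 60)]
matches = match_boxes(gt_boxes, pred_boxes)

# Once matching is done, the basic counts follow directly
true_positives = len(matches)
false_positives = len(pred_boxes) - true_positives
false_negatives = len(gt_boxes) - true_positives
```

Saving `matches` to disk once means that every metric derived from it later can be recomputed without repeating the pairwise comparison.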
Before you start, make sure you have downloaded Ground Truth and Predictions projects in Supervisely format with the same datasets and classes. If you need to run evaluations on a subset of classes, you can provide a classes_whitelist parameter.
BaseEvaluator is a base class that provides the basic functionality for the evaluator.
Available arguments in the BaseEvaluator class:
gt_project_path: Path to the local GT project.
pred_project_path: Path to the local Predictions project.
result_dir: Optional: Directory to save evaluation results. Default is ./evaluation.
classes_whitelist: Optional: List of classes to evaluate.
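How a whitelist is applied depends on your evaluator. As one possible approach, per-class statistics can be filtered after counting. The helper below is a hypothetical sketch in plain Python (the name `filter_by_whitelist` is not part of the SDK).

```python
from typing import Dict, Iterable, Optional


def filter_by_whitelist(
    counts: Dict[str, int], classes_whitelist: Optional[Iterable[str]] = None
) -> Dict[str, int]:
    """Keep only whitelisted classes; with no whitelist, keep everything."""
    if classes_whitelist is None:
        return dict(counts)
    allowed = set(classes_whitelist)
    return {name: n for name, n in counts.items() if name in allowed}


all_counts = {"car": 3, "person": 5, "dog": 1}  # made-up numbers
filtered = filter_by_whitelist(all_counts, ["car", "dog"])
unfiltered = filter_by_whitelist(all_counts)
```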
Let's start by creating a new class MyEvaluator that inherits from BaseEvaluator and overrides the evaluate method. In our example, the evaluate method compares GT and predicted annotations from two projects, counting for each class the number of objects and the number of images that contain it. It iterates through datasets and images, retrieves annotations, and collects statistics for both ground truth and predicted data. Finally, it saves the evaluation results in self.eval_data and dumps them to disk.
# src/evaluator.py
from collections import defaultdict
from pathlib import Path

import supervisely as sly
from supervisely.nn.benchmark.base_evaluator import BaseEvaluator

from src.eval_result import MyEvalResult


class MyEvaluator(BaseEvaluator):
    eval_result_cls = MyEvalResult  # we will implement this class in the next step

    def evaluate(self):
        """This method should perform the evaluation process."""
        # For example, let's iterate over all datasets and calculate some statistics
        gt_project = sly.Project(self.gt_project_path, sly.OpenMode.READ)
        pred_project = sly.Project(self.pred_project_path, sly.OpenMode.READ)

        gt_stats = {"images_count": defaultdict(int), "objects_count": defaultdict(int)}
        pred_stats = {"images_count": defaultdict(int), "objects_count": defaultdict(int)}

        for ds_1 in gt_project.datasets:
            ds_2 = pred_project.datasets.get(ds_1.name)
            ds_1: sly.Dataset

            for name in ds_1.get_items_names():
                ann_1 = ds_1.get_ann(name, gt_project.meta)
                ann_2 = ds_2.get_ann(name, pred_project.meta)

                # classes present in the current image (each counted once per image)
                gt_found_classes = set()
                pred_found_classes = set()

                for label in ann_1.labels:
                    class_name = label.obj_class.name
                    gt_stats["objects_count"][class_name] += 1
                    gt_found_classes.add(class_name)
                for label in ann_2.labels:
                    class_name = label.obj_class.name
                    pred_stats["objects_count"][class_name] += 1
                    pred_found_classes.add(class_name)

                for class_name in gt_found_classes:
                    gt_stats["images_count"][class_name] += 1
                for class_name in pred_found_classes:
                    pred_stats["images_count"][class_name] += 1

        # save the evaluation results
        self.eval_data = {"gt_stats": gt_stats, "pred_stats": pred_stats}

        # Dump the eval_data to disk (to be able to load it later)
        save_path = Path(self.result_dir) / "eval_data.json"
        sly.json.dump_json_file(self.eval_data, save_path)
        return self.eval_data
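The difference between the two statistics is easy to miss: objects_count grows by one for every label, while images_count grows by at most one per class per image. The following SDK-free sketch shows the same counting logic on hypothetical data (each inner list stands for the class names of the labels in one image).

```python
from collections import defaultdict

# Hypothetical annotations: one list of class names per image
images = [["car", "car", "person"], ["car"], []]

stats = {"images_count": defaultdict(int), "objects_count": defaultdict(int)}
for labels in images:
    found_classes = set()  # classes seen in this image
    for class_name in labels:
        stats["objects_count"][class_name] += 1
        found_classes.add(class_name)
    for class_name in found_classes:
        stats["images_count"][class_name] += 1

objects_count = dict(stats["objects_count"])
images_count = dict(stats["images_count"])
```

Here "car" appears three times as an object but in only two images.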
2. Custom EvalResult
This class will be used as a data interface to access the evaluation metrics in the visualizer.
When an EvalResult object is initialized, it calls the _read_files method to load the evaluation metrics from disk and then the _prepare_data method to prepare the data for easy access. So, you need to implement these two methods in the MyEvalResult class.
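This is the classic template-method pattern: the base class fixes the order of the steps and subclasses fill them in. The sketch below imitates that contract with a hypothetical `ToyBaseEvalResult`. It is not the SDK's BaseEvalResult, whose constructor may do more.

```python
import json
import tempfile
from pathlib import Path


class ToyBaseEvalResult:
    """Hypothetical base class: the constructor drives the two hooks in order."""

    def __init__(self, directory: str):
        self._read_files(directory)
        self._prepare_data()

    def _read_files(self, path: str) -> None:
        raise NotImplementedError

    def _prepare_data(self) -> None:
        raise NotImplementedError


class CountsResult(ToyBaseEvalResult):
    def _read_files(self, path: str) -> None:
        # Step 1: load raw evaluation data from disk
        self.eval_data = json.loads((Path(path) / "eval_data.json").read_text())

    def _prepare_data(self) -> None:
        # Step 2: derive ready-to-use values from the raw data
        self.total = sum(self.eval_data["objects_count"].values())


with tempfile.TemporaryDirectory() as tmp:
    sample = {"objects_count": {"car": 2, "dog": 1}}  # made-up numbers
    (Path(tmp) / "eval_data.json").write_text(json.dumps(sample))
    result = CountsResult(tmp)

total = result.total
```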
Let's create a new file eval_result.py and implement the MyEvalResult class.
# src/eval_result.py
from pathlib import Path

import supervisely as sly
from supervisely.nn.benchmark.base_evaluator import BaseEvalResult


class MyEvalResult(BaseEvalResult):
    def _read_files(self, path: str) -> None:  # ⬅︎ This method is required
        """This method should LOAD evaluation metrics from disk."""
        save_path = Path(path) / "eval_data.json"  # path to the saved evaluation metrics
        self.eval_data = sly.json.load_json_file(str(save_path))

    def _prepare_data(self) -> None:  # ⬅︎ This method is required
        """This method should PREPARE data to allow easy access to the data."""
        gt = self.eval_data.get("gt_stats", {})
        pred = self.eval_data.get("pred_stats", {})

        # class statistics (class names as keys and [GT, predictions] object counts as values)
        self._objects_per_class = self._get_objects_per_class(gt, pred)

        # GT metrics
        gt_obj_num = self._get_total_objects_count(gt)
        gt_cls_num = self._get_num_of_used_classes(gt)
        gt_cls_most_freq = self._get_most_frequent_class(gt)

        # Prediction metrics
        pred_obj_num = self._get_total_objects_count(pred)
        pred_cls_num = self._get_num_of_used_classes(pred)
        pred_cls_most_freq = self._get_most_frequent_class(pred)

        self._key_metrics = {
            "Objects Count": [gt_obj_num, pred_obj_num],
            "Found Classes": [gt_cls_num, pred_cls_num],
            "Classes with Max Figures": [gt_cls_most_freq, pred_cls_most_freq],
        }

    # ---------------- ⬇︎ Properties to access the data easily ⬇︎ ----------------- #
    @property
    def key_metrics(self):
        """Return key metrics as a dictionary."""
        return self._key_metrics.copy()

    @property
    def objects_per_class(self):
        """Return the number of objects per class."""
        return self._objects_per_class.copy()

    # ------- ⬇︎ Utility methods (you can create any methods you need) ⬇︎ --------- #
    def _get_most_frequent_class(self, stats: dict):
        name = max(stats.get("objects_count", {}).items(), key=lambda x: x[1])[0]
        return f"{name} ({stats['objects_count'][name]})"

    def _get_total_objects_count(self, stats: dict):
        return sum(stats.get("objects_count", {}).values())

    def _get_objects_per_class(self, gt: dict, pred: dict):
        gt_obj_stats = gt.get("objects_count", {})
        pred_obj_stats = pred.get("objects_count", {})
        objects_per_class = {}
        for name, gt_objects_count in gt_obj_stats.items():
            pred_objects_count = pred_obj_stats.get(name, 0)
            objects_per_class[name] = [gt_objects_count, pred_objects_count]
        return objects_per_class

    def _get_num_of_used_classes(self, stats: dict):
        return len(stats.get("images_count", {}))
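One edge case worth keeping in mind: Python's built-in `max()` raises a ValueError on an empty iterable, so the _get_most_frequent_class helper above would fail for a project without any labeled objects. A defensive variant might look like the sketch below. The standalone function name and the "N/A" placeholder are arbitrary choices for illustration.

```python
def most_frequent_class(stats: dict) -> str:
    """Return 'name (count)' for the most frequent class, or 'N/A' if there is none."""
    counts = stats.get("objects_count", {})
    if not counts:
        return "N/A"
    name, count = max(counts.items(), key=lambda item: item[1])
    return f"{name} ({count})"


with_objects = most_frequent_class({"objects_count": {"car": 3, "person": 5}})
without_objects = most_frequent_class({})
```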
3. Visualizer, Charts and Widgets
This step involves creating a custom Visualizer class that inherits from BaseVisualizer. The class should generate visualizations, save them to disk, and upload them to the Team Files (to open the visualizations in the web interface).
Key points of the visualizer implementation:
Widgets: Widgets are the building blocks of the visualizations in the report. All widgets should be initialized in the _create_widgets method of the visualizer class.
Available widgets: MarkdownWidget, TableWidget, ChartWidget, CollapseWidget, ContainerWidget, GalleryWidget, RadioGroupWidget, NotificationWidget, and SidebarWidget.
Grouping widgets: Each widget has a to_html() method, and, in the end, it is just HTML code. You can combine them as you like, but we recommend organizing them into separate classes that inherit from BaseVisMetric, which can be responsible for a specific section of the report or ML metric, for example, Precision, Recall, F1-score, etc.
Layout: The BaseVisualizer class has a _create_layout method and here you need to define the order of the widgets in the report and the anchors in the sidebar.
Upload to Team Files: To open the report in the web interface, you need to upload the visualization results to the Team Files. We will call the upload_results or upload_visualizations methods in the main script.
First, let's create a few widgets that we will use in the visualizer. We will start with the MarkdownWidget, TableWidget, and ChartWidget. Our example will include three sections in the report: Intro (header + overview), KeyMetrics (text + table), and CustomMetric (text + chart). To make the code more readable, we will split the code into separate files for each section.
Feel free to change the widget content and appearance to suit your needs. The example below with Markdown, Table, and Chart widgets is just a starting point.
# src/widgets/key_metrics.py
from supervisely.nn.benchmark.object_detection.base_vis_metric import BaseVisMetric
from supervisely.nn.benchmark.visualization.widgets import MarkdownWidget, TableWidget


class KeyMetrics(BaseVisMetric):
    @property
    def md(self) -> MarkdownWidget:
        text = (
            "## Key Metrics\n"
            "In this section, you can explore the key metrics in a table.\n\n"
            "> **Note:** Markdown syntax is supported."
        )
        return MarkdownWidget(name="key_metrics", title="Key Metrics", text=text)

    @property
    def table(self) -> TableWidget:
        columns = ["Metric", "GT Project", "Predictions Project"]
        columns_options = [{"disableSort": True}] * len(columns)
        content = []
        for metric, values in self.eval_result.key_metrics.items():
            row = [metric, *values]
            content.append({"row": row, "id": metric, "items": row})

        data = {"columns": columns, "content": content, "columnsOptions": columns_options}
        return TableWidget(name="key_metrics", data=data)
Here is the KeyMetrics widget in action:
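To see what the table data looks like before it is handed to the widget, the same structure can be built from a plain dictionary. The metric values below are made up. One deliberate difference in this sketch is that the column options are created with a list comprehension, because `[{...}] * n` repeats references to a single dict, which only matters if you later want to change the options of one column.

```python
# Made-up key metrics in the same shape as eval_result.key_metrics
key_metrics = {
    "Objects Count": [120, 115],
    "Found Classes": [4, 5],
    "Classes with Max Figures": ["car (60)", "car (58)"],
}

columns = ["Metric", "GT Project", "Predictions Project"]
columns_options = [{"disableSort": True} for _ in columns]  # independent dicts

content = []
for metric, values in key_metrics.items():
    row = [metric, *values]
    content.append({"row": row, "id": metric, "items": row})

data = {"columns": columns, "content": content, "columnsOptions": columns_options}
```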
Let's create the CustomMetric section.
# src/widgets/custom_metric.py
from supervisely.nn.benchmark.object_detection.base_vis_metric import BaseVisMetric
from supervisely.nn.benchmark.visualization.widgets import ChartWidget, MarkdownWidget


class CustomMetric(BaseVisMetric):
    @property
    def md(self) -> MarkdownWidget:
        text = (
            "## Number of Objects per Class\n"
            "In this section, you can explore the number of objects per class"
            " in the GT and predictions projects."
        )
        return MarkdownWidget(name="custom_metric", title="Custom Metric", text=text)

    @property
    def chart(self) -> ChartWidget:
        import plotly.graph_objects as go

        x = list(self.eval_result.objects_per_class.keys())
        y1, y2 = zip(*self.eval_result.objects_per_class.values())
        fig = go.Figure()
        fig.add_trace(go.Bar(y=y1, x=x, name="GT"))
        fig.add_trace(go.Bar(y=y2, x=x, name="Predictions"))
        fig.update_layout(barmode="group", bargap=0.15, bargroupgap=0.05)
        return ChartWidget(name="images_chart", figure=fig)
The CustomMetric widget will look like this:
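The `zip(*...)` idiom in the chart code transposes the per-class `[GT, predictions]` pairs into two series. The short sketch below shows the same step on made-up numbers and adds a guard for an empty dictionary, since unpacking an empty `zip` into two names raises a ValueError. Whether you need that guard depends on your data.

```python
# Made-up [GT, predictions] pairs in the same shape as objects_per_class
objects_per_class = {"car": [3, 2], "person": [5, 6]}

x = list(objects_per_class.keys())
if objects_per_class:
    # Transpose [[gt, pred], ...] into (gt series, pred series)
    y_gt, y_pred = (list(series) for series in zip(*objects_per_class.values()))
else:
    y_gt, y_pred = [], []
```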
Finally, let's implement the custom visualizer class that will use these widgets. All you need to do is to implement the _create_widgets and _create_layout methods.
# src/visualizer.py
from supervisely.nn.benchmark.base_visualizer import BaseVisualizer
from supervisely.nn.benchmark.visualization.widgets import (
    ContainerWidget,
    SidebarWidget,
)
from supervisely.nn.task_type import TaskType

from src.widgets import CustomMetric, Intro, KeyMetrics


class MyVisualizer(BaseVisualizer):
    @property
    def cv_task(self):
        return TaskType.OBJECT_DETECTION

    def _create_widgets(self):
        """In this method, we initialize and configure all the widgets that we will use"""
        vis_text = "N/A"  # not used in this example

        # Intro (Markdown)
        me = self.api.user.get_my_info()
        intro = Intro(vis_text, self.eval_result)
        self.intro_header = intro.get_header(me.login)
        self.intro_md = intro.md

        # Key Metrics (Markdown + Table)
        key_metrics = KeyMetrics(vis_text, self.eval_result)
        self.key_metrics_md = key_metrics.md
        self.key_metrics_table = key_metrics.table

        # Custom Metric (Markdown + Chart)
        custom_metric = CustomMetric(vis_text, self.eval_result)
        self.custom_metric_md = custom_metric.md
        self.custom_metric_chart = custom_metric.chart

    def _create_layout(self):
        """
        Method to create the layout of the visualizer.
        We define the order of the widgets in the report and their visibility in the sidebar.
        """
        # Create widgets
        self._create_widgets()

        # Configure sidebar
        # (if 1 - will display in sidebar, 0 - will not display in sidebar)
        is_anchors_widgets = [
            # Intro
            (0, self.intro_header),
            (1, self.intro_md),
            # Key Metrics
            (1, self.key_metrics_md),
            (0, self.key_metrics_table),
            # Custom Metric
            (1, self.custom_metric_md),
            (0, self.custom_metric_chart),
        ]
        anchors = []
        for is_anchor, widget in is_anchors_widgets:
            if is_anchor:
                anchors.append(widget.id)

        sidebar = SidebarWidget(widgets=[i[1] for i in is_anchors_widgets], anchors=anchors)
        layout = ContainerWidget(title="Custom Benchmark", widgets=[sidebar])
        return layout
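The layout logic boils down to two lists derived from the same `(flag, widget)` pairs: all widgets in report order, and the identifiers of those that should appear in the sidebar. The sketch below reproduces that with a hypothetical `ToyWidget`. It assumes each widget exposes an `id` attribute, which you should verify against the widget classes you actually use.

```python
from dataclasses import dataclass


@dataclass
class ToyWidget:
    """Hypothetical stand-in for a report widget."""

    id: str


is_anchors_widgets = [
    (0, ToyWidget("intro_header")),
    (1, ToyWidget("intro_md")),
    (1, ToyWidget("key_metrics_md")),
    (0, ToyWidget("key_metrics_table")),
    (1, ToyWidget("custom_metric_md")),
    (0, ToyWidget("custom_metric_chart")),
]

# Sidebar entries: only widgets flagged with 1
anchors = [widget.id for is_anchor, widget in is_anchors_widgets if is_anchor]
# Report body: every widget, in the order listed
widgets = [widget for _, widget in is_anchors_widgets]
```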
4. Run the code
Before we run the custom benchmark, prepare the environment credentials in the supervisely.env file:
SERVER_ADDRESS= # ⬅︎ change the value
API_TOKEN= # ⬅︎ change the value
Learn about the basics of authentication in Supervisely here.
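If the script fails with an authentication error, a quick first check is whether both variables are actually set in the process environment. The helper below is a hypothetical convenience in plain Python, not part of the SDK. It only reports which of the two variables named above are missing or empty.

```python
import os


def missing_credentials(env=None, required=("SERVER_ADDRESS", "API_TOKEN")):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment
sample_env = {"SERVER_ADDRESS": "https://example.com", "API_TOKEN": ""}
missing = missing_credentials(sample_env)
```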
Create a main.py script to run the custom benchmark:
# src/main.py
import os

import supervisely as sly
from dotenv import load_dotenv

from src.evaluator import MyEvaluator
from src.visualizer import MyVisualizer

if sly.is_development():
    load_dotenv("local.env")
    load_dotenv(os.path.expanduser("~/supervisely.env"))

api = sly.Api()
team_id = sly.env.team_id()
gt_project_id = 73
pred_project_id = 159

workdir = sly.app.get_data_dir()
gt_path = os.path.join(workdir, "gt_project")
pred_path = os.path.join(workdir, "pred_project")
eval_result_dir = os.path.join(workdir, "evaluation")
vis_result_dir = os.path.join(workdir, "visualizations")

# 0. Download projects
for project_id, path in [(gt_project_id, gt_path), (pred_project_id, pred_path)]:
    if not sly.fs.dir_exists(path):
        sly.download_project(api, project_id, path)

# 1. Initialize Evaluator
evaluator = MyEvaluator(gt_path, pred_path, eval_result_dir)

# 2. Run evaluation
evaluator.evaluate()

# 3. Initialize EvalResult object
eval_result = evaluator.get_eval_result()

# 4. Initialize visualizer and visualize
visualizer = MyVisualizer(api, [eval_result], vis_result_dir)
visualizer.visualize()

# 5. Upload to Supervisely Team Files
remote_dir = "/model-benchmark/custom_benchmark"
api.file.upload_directory(team_id, evaluator.result_dir, remote_dir + "/evaluation")
# ⬇︎ required to open visualizations in the web interface
visualizer.upload_results(team_id, remote_dir + "/visualizations/")
Please note that to open the report in the web interface, visualization results need to be uploaded to the Team Files. The upload_results method in the visualizer class will take care of this.
🔗 Recap of the files structure:
├── src/
│ ├── __init__.py
│ ├── evaluator.py # 49 lines of code
│ ├── eval_result.py # 73 lines of code
│ ├── visualizer.py # 64 lines of code
│ ├── widgets/
│ │ ├── __init__.py
│ │ ├── intro.py # 36 lines of code
│ │ ├── key_metrics.py # 38 lines of code
│ │ └── custom_metric.py # 22 lines of code
│ └── main.py # 53 lines of code
└── local.env # 1 line of code
Run the main.py script – python src/main.py in the terminal. For debugging, you can use the launch.json file in the repository (select the "Python Current File" configuration and press F5 or Run and Debug).
After the evaluation is complete, you will receive a link to the report in the logs. You can open the report in the web interface by clicking on the link. You will also find the evaluation results in the Team Files, in the folder that you specified in the script (/model-benchmark/custom_benchmark/visualizations/Model Evaluation Report.lnk).
Hooray! 🎉 You have successfully implemented a custom benchmark evaluation in Supervisely!
But wait, there is another way to run the custom benchmark: using a deployed NN model. Let's move on to the next step.
Run Evaluation of your Models
In this section, we will evaluate a model that is deployed on the Supervisely platform. We will use the model to get predictions and evaluate them, instead of using a downloaded project with predictions.
For this purpose, we will create a new custom benchmark class that inherits from BaseBenchmark.
The BaseBenchmark class is an all-in-one base class that orchestrates the whole process: it runs inference, evaluation, and visualization, and generates the report.
All you need to do is to change only 3 lines of code! 💫
1. Create a new file benchmark.py with the following content:
# src/benchmark.py
from supervisely.nn.benchmark.base_benchmark import BaseBenchmark
from supervisely.nn.task_type import TaskType

from src.evaluator import MyEvaluator
from src.visualizer import MyVisualizer


class CustomBenchmark(BaseBenchmark):
    visualizer_cls = MyVisualizer  # ⬅︎ the visualizer class

    @property
    def cv_task(self) -> str:
        return TaskType.OBJECT_DETECTION  # ⬅︎ the CV task type

    def _get_evaluator_class(self) -> type:
        return MyEvaluator  # ⬅︎ the evaluator class
Here is a brief overview of the relationships between the classes in this scenario. As you can see, we will use the same engine classes, but the input will be different – the GT project ID and the deployed model session ID (instead of the local projects).
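The three customization points (a visualizer class attribute, a task-type property, and an evaluator factory method) follow a common plug-in pattern. The toy sketch below shows how an orchestrating base class might consume them. `ToyBaseBenchmark` and its `run` method are hypothetical and much simpler than the real BaseBenchmark, which also handles inference and uploads.

```python
class ToyBaseBenchmark:
    """Hypothetical orchestrator: subclasses plug in the three pieces below."""

    visualizer_cls = None

    @property
    def cv_task(self) -> str:
        raise NotImplementedError

    def _get_evaluator_class(self) -> type:
        raise NotImplementedError

    def run(self) -> str:
        evaluator = self._get_evaluator_class()()  # instantiate the plugged-in evaluator
        eval_data = evaluator.evaluate()
        return self.visualizer_cls().render(self.cv_task, eval_data)


class ToyEvaluator:
    def evaluate(self) -> dict:
        return {"objects": 8}  # made-up result


class ToyVisualizer:
    def render(self, cv_task: str, eval_data: dict) -> str:
        return f"{cv_task}: {eval_data['objects']} objects"


class ToyBenchmark(ToyBaseBenchmark):
    visualizer_cls = ToyVisualizer

    @property
    def cv_task(self) -> str:
        return "object detection"

    def _get_evaluator_class(self) -> type:
        return ToyEvaluator


summary = ToyBenchmark().run()
```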
2. Update the main.py script to run the custom benchmark on the deployed model session:
# src/main.py
import os

import supervisely as sly
from dotenv import load_dotenv

from src.benchmark import CustomBenchmark

if sly.is_development():
    load_dotenv("local.env")
    load_dotenv(os.path.expanduser("~/supervisely.env"))

api = sly.Api()
team_id = 8
gt_project_id = 73
pred_project_id = 159
model_session_id = 1234

# 1. Initialize benchmark
bench = CustomBenchmark(api, gt_project_id, output_dir=sly.app.get_data_dir())

# 2. Run evaluation
bench.evaluate(pred_project_id)  # ⬅︎ evaluate without inference
# bench.run_evaluation(model_session_id)  # ⬅︎ evaluate with inference

# 3. Generate charts and dashboards
bench.visualize()

# 4. Upload to Supervisely Team Files
remote_dir = f"/model-benchmark/custom_benchmark/{model_session_id}"
bench.upload_eval_results(remote_dir + "/evaluation/")
# ⬇︎ required to open visualizations in the web interface
bench.upload_visualizations(remote_dir + "/visualizations/")
That's it! And you are ready to run the custom evaluation on different deployed models.
Run the main.py script – python src/main.py in the terminal (or use the launch.json file from source code for debugging).
As in the previous step, you will receive a link to the report in the logs or find the evaluation results in the Team Files.
Great job! 🎉 You have successfully implemented a custom benchmark evaluation on the deployed model in Supervisely!
With the BaseBenchmark class, you can still run evaluations on two projects, and it is even easier: just pass project IDs instead of paths and use the bench.evaluate(pred_project_id) method. The BaseBenchmark class will take care of the rest.
Let's move on to the next level and integrate the custom benchmark with the GUI interface 🎨.
Plug-in the Custom Benchmark to the GUI
In this step, we will create a sly.Application (high-level class in the Supervisely SDK that allows you to create a FastAPI application with GUI interface) that will run the custom benchmark evaluation. The application will allow you to select the GT project, the deployed model session, and the evaluation parameters and run the evaluation with a few clicks in the web interface.
You can take a look at the Evaluator for Model Benchmark app in the Ecosystem to see how we implemented the GUI interface for the evaluation process.
Open the browser and go to http://localhost:8000 to see the GUI interface. Select the GT project, the deployed model session, and press the Evaluate button to run the evaluation. The app will connect to the deployed NN model, run the inference, upload predictions to a new project, evaluate the model, and generate the report.
After the process is complete, you will see a widget with the evaluation report (click on the link to open the report in the web interface).
Check out our Developer Portal to learn more on how to release your app as a private app in Supervisely – here.
Congratulations! 🎉 You have successfully integrated the custom benchmark with the GUI interface in Supervisely!
Example of the Custom Benchmark report we will create in this guide
Schema of the benchmark process using GT and Prediction projects
Schema of the benchmark process with GT project and a deployed model