Custom Benchmark Integration
Overview
In this guide, we will show you how to integrate a custom benchmark evaluation using the Supervisely SDK. For most use cases, our Evaluator for Model Benchmark app in the Ecosystem provides a set of built-in evaluation metrics for various task types, such as object detection, instance segmentation, and semantic segmentation. However, in some cases, you may need to define custom metrics that are specific to your use case. The custom benchmark implementation allows you to evaluate model performance with your own business metrics and visualize the results in a comprehensive report.

Key features of the custom benchmark implementation in Supervisely:
Custom Metrics and Charts: Implement custom evaluation metrics and visualizations for your specific use case.
Automate with Python SDK & API: Run evaluations from a script in your pipeline or release them as a private app. Releasing as a private app allows you to run evaluations with a few clicks in the GUI or automate the launch using the Supervisely API (learn more here).
Easy integration with experiments in Supervisely: Integrate the custom benchmark with your custom training app to automatically evaluate the best model checkpoint after training and visualize the results on the Experiments page.
Implement Custom Evaluation
The custom benchmark implementation consists of several classes that interact with each other to perform the evaluation process and generate the report.
A brief overview of the relationships between the instances:

All you need to do is implement these classes with your custom logic to calculate the metrics and generate the visualizations. We will guide you through the process step by step. Let's get started!
1. Custom Evaluator
The Evaluator is the key component of a custom benchmark. Its main responsibility is to process Ground Truth (GT) and Predictions data, preparing it for evaluation.
Instead of computing every metric and chart directly, the Evaluator focuses on generating essential processed data that serves as the foundation for further analysis. Some computer vision tasks require computationally expensive operations. For example, in object detection, each predicted instance must be matched with the corresponding GT instance, which can take significant time. However, once this matching is done, calculating metrics becomes straightforward.
To optimize performance, the Evaluator:
Processes raw GT and Prediction data into a structured format suitable for metric calculation.
Handles computationally intensive tasks like matching predictions to GT.
Saves processed data to disk, avoiding redundant computations and speeding up further analysis.
By handling the heavy lifting in the evaluation pipeline, the Evaluator ensures that metric computation remains efficient and scalable.
Before you start, make sure you have downloaded Ground Truth and Predictions projects in Supervisely format with the same datasets and classes. If you need to run evaluations on a subset of classes, you can provide a classes_whitelist parameter.
BaseEvaluator is a base class that provides the basic functionality for the evaluator.
Available arguments in the BaseEvaluator class:
gt_project_path: Path to the local GT project.
pred_project_path: Path to the local Predictions project.
evaluation_params: Optional. Evaluation parameters.
result_dir: Optional. Directory to save evaluation results. Default is ./evaluation.
classes_whitelist: Optional. List of classes to evaluate.
Let's start by creating a new class MyEvaluator that inherits from BaseEvaluator and overrides the evaluate method. In our example, the evaluate method compares the GT and Predicted annotations from the two projects, counting how many images and objects contain each object class. It iterates through datasets and images, retrieves annotations, and collects statistics for both ground truth and predicted data. Finally, it stores the evaluation results in self.eval_data and dumps them to disk.
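Here is a minimal sketch of what such an evaluator could look like. The BaseEvaluator import path, the attribute names inherited from the base class (result_dir, classes_whitelist), and the JSON dump format are assumptions for illustration; adapt them to your SDK version and to the metrics you actually need.

```python
import json
import os
from collections import defaultdict

import supervisely as sly
from supervisely.nn.benchmark.base_evaluator import BaseEvaluator  # assumed import path


class MyEvaluator(BaseEvaluator):
    def evaluate(self):
        # Open the locally downloaded GT and Predictions projects (Supervisely format).
        projects = {
            "gt": sly.Project(self.gt_project_path, sly.OpenMode.READ),
            "pred": sly.Project(self.pred_project_path, sly.OpenMode.READ),
        }

        # Per-class counters: in how many images a class appears and how many objects it has.
        self.eval_data = {
            key: {"images": defaultdict(int), "objects": defaultdict(int)} for key in projects
        }

        for key, project in projects.items():
            for dataset in project.datasets:
                for item_name in dataset.get_items_names():
                    ann = dataset.get_ann(item_name, project.meta)
                    classes_in_image = set()
                    for label in ann.labels:
                        class_name = label.obj_class.name
                        if self.classes_whitelist and class_name not in self.classes_whitelist:
                            continue
                        self.eval_data[key]["objects"][class_name] += 1
                        classes_in_image.add(class_name)
                    for class_name in classes_in_image:
                        self.eval_data[key]["images"][class_name] += 1

        # Dump the processed data to disk so that the EvalResult class can load it later.
        os.makedirs(self.result_dir, exist_ok=True)
        with open(os.path.join(self.result_dir, "eval_data.json"), "w") as f:
            json.dump(self.eval_data, f, indent=2)
```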
2. Custom EvalResult
This class will be used as a data interface to access the evaluation metrics in the visualizer.
When the EvalResult object is initialized, it calls the _read_files method to load the evaluation metrics from disk and the _prepare_data method to prepare the data for easy access. So, you need to implement these two methods in the MyEvalResult class.
Let's create a new file eval_result.py and implement the MyEvalResult class.
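A minimal sketch of what this file could look like is shown below. The BaseEvalResult import path and the method signatures are assumptions; the only requirement is that _read_files loads whatever MyEvaluator dumped and _prepare_data exposes it in a convenient form.

```python
import json
import os

from supervisely.nn.benchmark.base_evaluator import BaseEvalResult  # assumed import path


class MyEvalResult(BaseEvalResult):
    def _read_files(self, path: str) -> None:
        # Load the raw evaluation data that MyEvaluator dumped to disk.
        with open(os.path.join(path, "eval_data.json"), "r") as f:
            self.eval_data = json.load(f)

    def _prepare_data(self) -> None:
        # Expose the loaded data through convenient attributes for the visualizer.
        self.gt_object_counts = self.eval_data["gt"]["objects"]
        self.pred_object_counts = self.eval_data["pred"]["objects"]

    @property
    def class_names(self) -> list:
        # All classes that appear either in GT or in Predictions.
        return sorted(set(self.gt_object_counts) | set(self.pred_object_counts))
```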
3. Visualizer, Charts and Widgets
This step involves creating a custom Visualizer class that inherits from BaseVisualizer. The class should generate visualizations, save them to disk, and upload them to Team Files (so that the visualizations can be opened in the web interface).
First, let's create a few widgets that we will use in the visualizer. We will start with the MarkdownWidget, TableWidget, and ChartWidget. Our example will include three sections in the report: Intro (header + overview), KeyMetrics (text + table), and CustomMetric (text + chart). To make the code more readable, we will split the code into a separate file for each section.
Feel free to change the widget content and appearance to suit your needs. The example below with Markdown, Table, and Chart widgets is just a starting point.
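As an illustration, the Intro section could be built from a single MarkdownWidget, as in the sketch below. The file name vis_metrics/intro.py, the widget import path, and the MarkdownWidget constructor arguments are assumptions; check the widget classes in your SDK version.

```python
from supervisely.nn.benchmark.visualization.widgets import MarkdownWidget  # assumed import path


class Intro:
    """Header + short overview of the evaluation."""

    def __init__(self, eval_result):
        self.eval_result = eval_result

    @property
    def md(self) -> MarkdownWidget:
        text = (
            "# Custom Benchmark Report\n\n"
            "This report compares Ground Truth annotations with model predictions "
            "and shows per-class statistics."
        )
        # The constructor signature is an assumption for illustration.
        return MarkdownWidget(name="intro", title="Intro", text=text)
```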
Take a look at the Intro widget:

Let's create the KeyMetrics section.
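A possible sketch of this section is shown below. The file name vis_metrics/key_metrics.py, the widget import path, and the TableWidget data format are assumptions for illustration.

```python
from supervisely.nn.benchmark.visualization.widgets import MarkdownWidget, TableWidget  # assumed import path


class KeyMetrics:
    """Text + a table with per-class object counts in GT and Predictions."""

    def __init__(self, eval_result):
        self.eval_result = eval_result

    @property
    def md(self) -> MarkdownWidget:
        # The constructor signature is an assumption for illustration.
        return MarkdownWidget(
            name="key_metrics",
            title="Key Metrics",
            text="## Key Metrics\n\nPer-class object counts in Ground Truth and Predictions.",
        )

    @property
    def table(self) -> TableWidget:
        columns = ["Class", "GT objects", "Predicted objects"]
        rows = [
            [
                name,
                self.eval_result.gt_object_counts.get(name, 0),
                self.eval_result.pred_object_counts.get(name, 0),
            ]
            for name in self.eval_result.class_names
        ]
        # The expected data format is an assumption for illustration.
        return TableWidget(name="key_metrics_table", data={"columns": columns, "content": rows})
```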
Here is the KeyMetrics widget in action:

Let's create the CustomMetric section.
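A sketch of the chart-based section, assuming ChartWidget accepts a Plotly figure; the file name vis_metrics/custom_metric.py and the constructor arguments are assumptions.

```python
import plotly.graph_objects as go

from supervisely.nn.benchmark.visualization.widgets import ChartWidget, MarkdownWidget  # assumed import path


class CustomMetric:
    """Text + a bar chart comparing GT and predicted object counts per class."""

    def __init__(self, eval_result):
        self.eval_result = eval_result

    @property
    def md(self) -> MarkdownWidget:
        return MarkdownWidget(
            name="custom_metric",
            title="Custom Metric",
            text="## Custom Metric\n\nGT vs. predicted object counts per class.",
        )

    @property
    def chart(self) -> ChartWidget:
        classes = self.eval_result.class_names
        fig = go.Figure(
            data=[
                go.Bar(name="GT", x=classes, y=[self.eval_result.gt_object_counts.get(c, 0) for c in classes]),
                go.Bar(name="Predictions", x=classes, y=[self.eval_result.pred_object_counts.get(c, 0) for c in classes]),
            ]
        )
        fig.update_layout(barmode="group")
        # Passing a Plotly figure directly is an assumption for illustration.
        return ChartWidget(name="custom_metric_chart", figure=fig)
```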
The CustomMetric widget will look like this:

Finally, let's implement the custom visualizer class that will use these widgets. All you need to do is implement the _create_widgets and _create_layout methods.
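Here is a minimal sketch of such a visualizer. The BaseVisualizer and ContainerWidget import paths, the eval_result attribute, and the module layout of the section files are assumptions; adapt them to your project.

```python
from supervisely.nn.benchmark.base_visualizer import BaseVisualizer  # assumed import path
from supervisely.nn.benchmark.visualization.widgets import ContainerWidget  # assumed import path

from src.vis_metrics.custom_metric import CustomMetric  # hypothetical module layout
from src.vis_metrics.intro import Intro
from src.vis_metrics.key_metrics import KeyMetrics


class MyVisualizer(BaseVisualizer):
    def _create_widgets(self):
        # Build the widgets of every report section from the evaluation results.
        # The eval_result attribute name is an assumption for illustration.
        intro = Intro(self.eval_result)
        key_metrics = KeyMetrics(self.eval_result)
        custom_metric = CustomMetric(self.eval_result)

        self.widgets = [
            intro.md,
            key_metrics.md,
            key_metrics.table,
            custom_metric.md,
            custom_metric.chart,
        ]

    def _create_layout(self):
        # Stack all widgets vertically in a single container.
        return ContainerWidget(widgets=self.widgets, name="layout")
```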
4. Run the code
Before we run the custom benchmark, prepare the environment credentials in the supervisely.env file:
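A typical supervisely.env contains the server address of your instance and your personal API token (the values below are placeholders):

```
SERVER_ADDRESS="https://app.supervisely.com"
API_TOKEN="your-personal-api-token"
```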
Learn about the basics of authentication in Supervisely here.
Create a main.py script to run the custom benchmark:
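Here is a sketch of what main.py might contain, wiring the three classes together. The constructor and method names of MyEvalResult and MyVisualizer, the module layout, and the local project paths are assumptions for illustration; check the base classes in your SDK version.

```python
import supervisely as sly
from dotenv import load_dotenv

from src.eval_result import MyEvalResult  # hypothetical module layout
from src.evaluator import MyEvaluator
from src.visualizer import MyVisualizer

load_dotenv("supervisely.env")
api = sly.Api.from_env()
team_id = 123  # replace with your team ID

# Paths to the locally downloaded GT and Predictions projects.
gt_path = "./data/gt_project"
pred_path = "./data/pred_project"

# 1. Run the evaluation; the processed data is dumped to ./evaluation.
evaluator = MyEvaluator(gt_path, pred_path, result_dir="./evaluation")
evaluator.evaluate()

# 2. Load the evaluation results back through the data interface.
#    The constructor argument is an assumption for illustration.
eval_result = MyEvalResult("./evaluation")

# 3. Generate the visualizations and upload them to Team Files.
#    The constructor and method names are assumptions; check BaseVisualizer in your SDK version.
visualizer = MyVisualizer(api, [eval_result], workdir="./visualizations")
visualizer.visualize()
visualizer.upload_results(team_id, remote_dir="/model-benchmark/custom_benchmark/vizualizations")
```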
Recap of the file structure:
Run the main.py script with python src/main.py in the terminal. For debugging, you can use the launch.json file in the repository (select the "Python Current File" configuration and press F5 or Run and Debug).

After the evaluation is complete, you will receive a link to the report in the logs. You can open the report in the web interface by clicking on the link. You will also find the evaluation results in Team Files, in the folder that you specified in the script (/model-benchmark/custom_benchmark/vizualizations/Model Evaluation Report.lnk).

Hooray! You have successfully implemented a custom benchmark evaluation in Supervisely!

But wait, there is another way to run the custom benchmark: using a deployed NN model. Let's move on to the next step.
Run Evaluation of your Models
In this section, we will evaluate a model that is deployed on the Supervisely platform. Instead of using a downloaded project with predictions, we will use the model to generate predictions and evaluate them.
For this purpose, we will create a new custom benchmark class that inherits from BaseBenchmark.
All you need to do is change only 3 lines of code!
1. Create a new file benchmark.py with the following content:
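The sketch below illustrates the idea, assuming the base class lets you plug in the evaluator and visualizer through overridable hooks; the exact hook names and import path are assumptions, so check BaseBenchmark in your SDK version.

```python
from supervisely.nn.benchmark.base_benchmark import BaseBenchmark  # assumed import path

from src.evaluator import MyEvaluator    # hypothetical module layout
from src.visualizer import MyVisualizer


class MyBenchmark(BaseBenchmark):
    # Point the generic benchmark engine at the custom evaluator and visualizer.
    # The hook names below are assumptions for illustration.
    evaluator_cls = MyEvaluator
    visualizer_cls = MyVisualizer
```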
Here is a brief overview of the relationships between the classes in this scenario. As you can see, we will use the same engine classes, but the input will be different: the GT project ID and the deployed model session ID (instead of the local projects).

2. Update the main.py script to run the custom benchmark on the deployed model session:
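A sketch of the updated script is shown below, assuming the benchmark class exposes run_evaluation, visualize, and upload_visualizations methods that accept a deployed model session ID and a Team Files directory; the exact names and arguments are assumptions.

```python
import supervisely as sly
from dotenv import load_dotenv

from src.benchmark import MyBenchmark  # hypothetical module layout

load_dotenv("supervisely.env")
api = sly.Api.from_env()

gt_project_id = 12345     # ID of the GT project on the platform
model_session_id = 67890  # task ID of the deployed NN model session

# The constructor and method names below are assumptions for illustration.
bm = MyBenchmark(api, gt_project_id, output_dir="./benchmark")
bm.run_evaluation(model_session=model_session_id)  # run inference and evaluation
bm.visualize()                                     # generate the report
bm.upload_visualizations("/model-benchmark/custom_benchmark/vizualizations")
```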
That's it! You are ready to run the custom evaluation on different deployed models.
Run the main.py script with python src/main.py in the terminal (or use the launch.json file from the source code for debugging).
As in the previous step, you will receive a link to the report in the logs or find the evaluation results in the Team Files.

Great job! You have successfully implemented a custom benchmark evaluation on a deployed model in Supervisely!
Let's move on to the next level and integrate the custom benchmark with a GUI interface.
Plug-in the Custom Benchmark to the GUI
In this step, we will create a sly.Application (a high-level class in the Supervisely SDK that allows you to create a FastAPI application with a GUI) that will run the custom benchmark evaluation. The application will allow you to select the GT project, the deployed model session, and the evaluation parameters, and run the evaluation with a few clicks in the web interface.
You can take a look at the Evaluator for Model Benchmark app in the Ecosystem to see how we implemented the GUI interface for the evaluation process.
First, let's create the local.env file with the following variables:
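For example, the file can contain the team and workspace IDs used during local debugging (the values below are placeholders; the exact set of variables depends on your app):

```
TEAM_ID=8
WORKSPACE_ID=349
SLY_APP_DATA_DIR="./results"
```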
Now, we will upgrade main.py from a simple script to a sly.Application.
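Below is a minimal sketch of a GUI version of main.py built with widgets from supervisely.app.widgets. The widget set, the deployed_nn session tag filter, and the MyBenchmark usage are assumptions for illustration; adapt them to your app.

```python
import supervisely as sly
from dotenv import load_dotenv
from supervisely.app.widgets import Button, Card, Container, SelectAppSession, SelectProject, Text

from src.benchmark import MyBenchmark  # hypothetical module layout

load_dotenv("local.env")
load_dotenv("supervisely.env")
api = sly.Api.from_env()
team_id = sly.env.team_id()

# GUI widgets: GT project selector, deployed NN session selector, run button.
select_project = SelectProject(workspace_id=sly.env.workspace_id())
select_session = SelectAppSession(team_id, tags=["deployed_nn"])  # tag filter is an assumption
run_button = Button("Evaluate")
result_text = Text()

layout = Card(
    title="Custom Benchmark",
    content=Container([select_project, select_session, run_button, result_text]),
)
app = sly.Application(layout=layout)


@run_button.click
def run_evaluation():
    gt_project_id = select_project.get_selected_id()
    session_id = select_session.get_selected_id()

    # The benchmark calls below mirror the previous step; names are assumptions.
    bm = MyBenchmark(api, gt_project_id, output_dir="./benchmark")
    bm.run_evaluation(model_session=session_id)
    bm.visualize()
    bm.upload_visualizations("/model-benchmark/custom_benchmark/vizualizations")

    result_text.set("Evaluation finished. Check the report in Team Files.", status="success")
```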
Launch the application using the following command in the terminal:
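A sly.Application is typically served with uvicorn; assuming the app object is defined in src/main.py, the command looks like this:

```
uvicorn src.main:app --host 0.0.0.0 --port 8000
```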
Or use the launch.json file from the source code for debugging and press F5 or Run and Debug.
Open the browser and go to http://localhost:8000 to see the GUI. Select the GT project and the deployed model session, and press the Evaluate button to run the evaluation. The app will connect to the deployed NN model, run the inference, upload the predictions to a new project, evaluate the model, and generate the report.
After the process is complete, you will see a widget with the evaluation report (click on the link to open the report in the web interface).

Check out our Developer Portal to learn more about how to release your app as a private app in Supervisely here.
Congratulations! You have successfully integrated the custom benchmark with the GUI interface in Supervisely!
