Custom Benchmark Integration

We have prepared a GitHub repository with the source code for this guide. You can clone the repository and follow the instructions to implement a custom benchmark evaluation in Supervisely.

Overview

In this guide, we will show you how to integrate a custom benchmark evaluation using the Supervisely SDK. For most use cases, our Evaluator for Model Benchmark app in the Ecosystem provides a set of built-in evaluation metrics for various task types, such as object detection, instance segmentation, and semantic segmentation. However, in some cases, you may need to define custom metrics that are specific to your use case. The custom benchmark implementation allows you to achieve this goal – to evaluate model performance with your own business metrics and visualize the results in a comprehensive report.

Example of the Custom Benchmark report we will create in this guide

Implement Custom Evaluation

The custom benchmark implementation consists of several classes that interact with each other to perform the evaluation process and generate the report.

🧩 Key Components

Here are the main classes that you need to subclass to implement your custom benchmark:

  • BaseEvaluator: This class is responsible for all calculations and evaluation procedures needed for metrics and visualizations. It should output evaluation data, from which the necessary metrics and charts will be generated. It should also save this data to disk.

  • BaseEvalResult: A data interface class that provides easy access to metrics and data for charts. The EvalResult class will be used by the Visualizer to retrieve ready-to-use metrics.

  • BaseVisualizer: A class that generates the resulting evaluation report. It is responsible for rendering all of your widgets into a web page.

๐Ÿ› ๏ธ Brief overview of the relationships between the instances:

Schema of the benchmark process using GT and Prediction projects

All you need to do is implement these classes with your custom logic to calculate the metrics and generate the visualizations. We will guide you through the process step by step. Let's get started!

Check out the GitHub repository with the source code for this guide to see examples of the implementation.

1. Custom Evaluator

The Evaluator is the key component of a custom benchmark. Its main responsibility is to process Ground Truth (GT) and Predictions data, preparing it for evaluation.

Instead of computing every metric and chart directly, the Evaluator focuses on generating essential processed data that serves as the foundation for further analysis. Some computer vision tasks require computationally expensive operations. For example, in object detection, each predicted instance must be matched with the corresponding GT instance, which can take significant time. However, once this matching is done, calculating metrics becomes straightforward.

To optimize performance, the Evaluator:

  • Processes raw GT and Prediction data into a structured format suitable for metric calculation.

  • Handles computationally intensive tasks like matching predictions to GT.

  • Saves processed data to disk, avoiding redundant computations and speeding up further analysis.

By handling the heavy lifting in the evaluation pipeline, the Evaluator ensures that metric computation remains efficient and scalable.

BaseEvaluator is a base class that provides the basic functionality for the evaluator.

Available arguments in the BaseEvaluator class:

  • gt_project_path: Path to the local GT project.

  • pred_project_path: Path to the local Predictions project.

  • evaluation_params: Optional. Evaluation parameters.

  • result_dir: Optional. Directory to save evaluation results. Defaults to ./evaluation.

  • classes_whitelist: Optional. List of classes to evaluate.

Let's start by creating a new class MyEvaluator that inherits from BaseEvaluator and overrides the evaluate method. In our example, the evaluate method compares GT and Predicted annotations from two projects, counting the occurrences of each object class across images and objects. It iterates through datasets and images, retrieves annotations, and collects statistics for both the ground truth and predicted data. Finally, it stores the evaluation results in self.eval_data and dumps them to disk.
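
Below is a minimal sketch of such an evaluator. The import path of BaseEvaluator, the inherited attribute names, and the name of the dumped file are assumptions here – the repository contains the exact implementation.

```python
import os
from collections import defaultdict

import supervisely as sly

# NOTE: the import path may differ between SDK versions; see the repository.
from supervisely.nn.benchmark.base_evaluator import BaseEvaluator


class MyEvaluator(BaseEvaluator):
    EVAL_DATA_FILE = "eval_data.json"  # assumed file name for the dumped results

    def evaluate(self):
        """Compare GT and Prediction projects by counting objects of each class."""
        # Attribute names are assumed to mirror the constructor arguments listed above.
        gt_project = sly.Project(self.gt_project_path, sly.OpenMode.READ)
        pred_project = sly.Project(self.pred_project_path, sly.OpenMode.READ)

        self.eval_data = {
            "gt": self._count_classes(gt_project),
            "pred": self._count_classes(pred_project),
        }

        # Dump the processed data to disk so the EvalResult can load it later.
        os.makedirs(self.result_dir, exist_ok=True)
        save_path = os.path.join(self.result_dir, self.EVAL_DATA_FILE)
        sly.json.dump_json_file(self.eval_data, save_path)

    def _count_classes(self, project: sly.Project) -> dict:
        """Count how many objects of each class occur in a local project."""
        counts = defaultdict(int)
        for dataset in project.datasets:
            # Each item yields (item_name, image_path, annotation_path).
            for _, _, ann_path in dataset.items():
                ann = sly.Annotation.load_json_file(ann_path, project.meta)
                for label in ann.labels:
                    name = label.obj_class.name
                    if self.classes_whitelist and name not in self.classes_whitelist:
                        continue
                    counts[name] += 1
        return dict(counts)
```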

2. Custom EvalResult

This class will be used as a data interface to access the evaluation metrics in the visualizer.

When the EvalResult object is initialized, it calls the _read_files method to load the evaluation metrics from disk and the _prepare_data method to prepare the data for easy access. So, you need to implement these two methods in the MyEvalResult class.

Let's create a new file eval_result.py and implement the MyEvalResult class.
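
Here is a rough sketch of what eval_result.py could contain, assuming _read_files receives the directory with the dumped data and that the file name matches the one used in MyEvaluator above; the exact method signatures are defined by BaseEvalResult and shown in the repository.

```python
import os

import supervisely as sly

# NOTE: the import path is an assumption; see the repository for the exact one.
from supervisely.nn.benchmark.base_evaluator import BaseEvalResult


class MyEvalResult(BaseEvalResult):
    def _read_files(self, path: str) -> None:
        # Load the raw data dumped by MyEvaluator (file name assumed above).
        self.eval_data = sly.json.load_json_file(os.path.join(path, "eval_data.json"))

    def _prepare_data(self) -> None:
        # Precompute values in a form that is convenient for the visualizer.
        gt = self.eval_data.get("gt", {})
        pred = self.eval_data.get("pred", {})
        self.class_names = sorted(set(gt) | set(pred))
        # Per-class (GT count, Prediction count) pairs for tables and charts.
        self.per_class_counts = {
            name: (gt.get(name, 0), pred.get(name, 0)) for name in self.class_names
        }
```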

3. Visualizer, Charts and Widgets

This step involves creating a custom Visualizer class that inherits from BaseVisualizer. The class should generate visualizations, save them to disk, and upload them to the Team Files (to open the visualizations in the web interface).

Key points of the visualizer implementation:

  • Widgets: Widgets are the building blocks of the visualizations in the report. All widgets should be initialized in the _create_widgets method of the visualizer class.

  • Available widgets: MarkdownWidget, TableWidget, ChartWidget, CollapseWidget, ContainerWidget, GalleryWidget, RadioGroupWidget, NotificationWidget, and SidebarWidget.

  • Grouping widgets: Each widget has a to_html() method, and, in the end, it is just HTML code. You can combine them as you like, but we recommend organizing them into separate classes that inherit from BaseVisMetric, which can be responsible for a specific section of the report or ML metric, for example, Precision, Recall, F1-score, etc.

  • Layout: The BaseVisualizer class has a _create_layout method, where you define the order of the widgets in the report and the anchors in the sidebar.

  • Upload to Team Files: To open the report in the web interface, you need to upload the visualization results to the Team Files. We will call the upload_results or upload_visualizations methods in the main script.

First, let's create a few widgets that we will use in the visualizer. We will start with the MarkdownWidget, TableWidget, and ChartWidget. Our example will include three sections in the report: Intro (header + overview), KeyMetrics (text + table), and CustomMetric (text + chart). To make the code more readable, we will split the code into separate files for each section.

Feel free to change the widget content and appearance to suit your needs. The example below with Markdown, Table, and Chart widgets is just a starting point.

src/widgets/intro.py
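
A possible sketch of the Intro section is shown below. The MarkdownWidget import path and signature are assumptions, and the class is kept framework-free for brevity – in the full implementation it would inherit from BaseVisMetric as recommended above.

```python
# NOTE: the widget import path and the MarkdownWidget signature are assumptions;
# check the repository for the exact usage.
from supervisely.nn.benchmark.visualization.widgets import MarkdownWidget


class Intro:
    def __init__(self, eval_result):
        self.eval_result = eval_result  # MyEvalResult instance

    @property
    def md(self) -> MarkdownWidget:
        text = (
            "# Custom Model Evaluation\n\n"
            "This report compares per-class object counts between the "
            "Ground Truth project and the Predictions project."
        )
        return MarkdownWidget(name="intro", title="Intro", text=text)
```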

Take a look at the Intro widget:

Let's create the KeyMetrics section.

src/widgets/key_metrics.py
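
A sketch of the KeyMetrics section, under the same assumptions as above; in particular, the data format expected by TableWidget is an assumption, so check the repository for the exact structure.

```python
# NOTE: import path, TableWidget signature, and its data format are assumptions.
from supervisely.nn.benchmark.visualization.widgets import MarkdownWidget, TableWidget


class KeyMetrics:
    def __init__(self, eval_result):
        self.eval_result = eval_result  # MyEvalResult instance

    @property
    def md(self) -> MarkdownWidget:
        return MarkdownWidget(
            name="key_metrics",
            title="Key Metrics",
            text="## Key Metrics\n\nPer-class object counts in GT vs. Predictions.",
        )

    @property
    def table(self) -> TableWidget:
        columns = ["Class", "GT objects", "Predicted objects"]
        rows = [
            [name, gt_count, pred_count]
            for name, (gt_count, pred_count) in self.eval_result.per_class_counts.items()
        ]
        # The exact structure expected in `data` may differ; see the repository.
        return TableWidget(name="key_metrics_table", data={"columns": columns, "content": rows})
```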

Here is the KeyMetrics widget in action:

Let's create the CustomMetric section.

src/widgets/custom_metric.py
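
A sketch of the CustomMetric section with a grouped bar chart; the ChartWidget signature (wrapping a Plotly figure) is an assumption.

```python
# NOTE: the import path and the ChartWidget signature are assumptions.
import plotly.graph_objects as go

from supervisely.nn.benchmark.visualization.widgets import ChartWidget, MarkdownWidget


class CustomMetric:
    def __init__(self, eval_result):
        self.eval_result = eval_result  # MyEvalResult instance

    @property
    def md(self) -> MarkdownWidget:
        return MarkdownWidget(
            name="custom_metric",
            title="Custom Metric",
            text="## Custom Metric\n\nGT vs. predicted object counts per class.",
        )

    @property
    def chart(self) -> ChartWidget:
        counts = self.eval_result.per_class_counts
        classes = list(counts.keys())
        fig = go.Figure(
            data=[
                go.Bar(name="Ground Truth", x=classes, y=[c[0] for c in counts.values()]),
                go.Bar(name="Predictions", x=classes, y=[c[1] for c in counts.values()]),
            ]
        )
        fig.update_layout(barmode="group")
        return ChartWidget(name="custom_metric_chart", figure=fig)
```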

The CustomMetric widget will look like this:

Finally, let's implement the custom visualizer class that will use these widgets. All you need to do is implement the _create_widgets and _create_layout methods.
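
A sketch of such a visualizer, e.g. in src/visualizer.py (the file name, the BaseVisualizer import path, the attribute holding the EvalResult, and the SidebarWidget/ContainerWidget signatures are all assumptions – the repository shows the exact code):

```python
# NOTE: import paths, constructor behavior, and widget signatures are assumptions.
from supervisely.nn.benchmark.base_visualizer import BaseVisualizer
from supervisely.nn.benchmark.visualization.widgets import ContainerWidget, SidebarWidget

from src.widgets.custom_metric import CustomMetric
from src.widgets.intro import Intro
from src.widgets.key_metrics import KeyMetrics


class MyVisualizer(BaseVisualizer):
    def _create_widgets(self):
        # Every widget used in the report is created once here.
        eval_result = self.eval_result  # assumed attribute with the MyEvalResult instance
        self.intro_md = Intro(eval_result).md
        key_metrics = KeyMetrics(eval_result)
        self.key_metrics_md = key_metrics.md
        self.key_metrics_table = key_metrics.table
        custom_metric = CustomMetric(eval_result)
        self.custom_metric_md = custom_metric.md
        self.custom_metric_chart = custom_metric.chart

    def _create_layout(self):
        # The order of this list defines the order of sections in the report;
        # widgets flagged with 1 also get an anchor link in the sidebar.
        is_anchors_widgets = [
            (1, self.intro_md),
            (1, self.key_metrics_md),
            (0, self.key_metrics_table),
            (1, self.custom_metric_md),
            (0, self.custom_metric_chart),
        ]
        # Widgets are assumed to expose an `id` used for the sidebar anchors.
        anchors = [w.id for is_anchor, w in is_anchors_widgets if is_anchor]
        sidebar = SidebarWidget(widgets=[w for _, w in is_anchors_widgets], anchors=anchors)
        return ContainerWidget(widgets=[sidebar], name="main_container")
```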

4. Run the code

Before we run the custom benchmark, prepare the environment credentials in the supervisely.env file:
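
A typical supervisely.env contains your server address and API token (the values below are placeholders):

```
SERVER_ADDRESS="https://app.supervisely.com"
API_TOKEN="your-api-token"
```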

Learn about the basics of authentication in Supervisely here.

Create a main.py script to run the custom benchmark:
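
A rough sketch of such a script is shown below. The module names, the way the EvalResult object is constructed, and the visualizer/upload signatures are assumptions – see the repository for the exact code.

```python
import supervisely as sly
from dotenv import load_dotenv

# NOTE: module names are assumptions; adjust them to your file layout.
from src.evaluator import MyEvaluator
from src.eval_result import MyEvalResult
from src.visualizer import MyVisualizer

load_dotenv("supervisely.env")
api = sly.Api()

team_id = 42  # placeholder: your team ID

# 1. Process the local GT and Prediction projects and dump the eval data to disk.
evaluator = MyEvaluator(
    gt_project_path="./data/gt_project",     # placeholder paths
    pred_project_path="./data/pred_project",
    result_dir="./evaluation",
)
evaluator.evaluate()

# 2. Wrap the dumped data into the data-interface object (assumed constructor).
eval_result = MyEvalResult(evaluator.result_dir)

# 3. Generate the report and upload it to the Team Files (assumed signatures).
visualizer = MyVisualizer(api, [eval_result], workdir="./visualizations")
visualizer.visualize()
visualizer.upload_results(team_id, remote_dir="/model-benchmark/custom_benchmark")
```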

Please note that to open the report in the web interface, visualization results need to be uploaded to the Team Files. The upload_results method in the visualizer class will take care of this.

🔗 Recap of the file structure:
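
It could look roughly like this (file names that are not mentioned in this guide are assumptions):

```
.vscode/
└── launch.json
src/
├── main.py
├── evaluator.py        # MyEvaluator  (file name assumed)
├── eval_result.py      # MyEvalResult
├── visualizer.py       # MyVisualizer (file name assumed)
└── widgets/
    ├── intro.py
    ├── key_metrics.py
    └── custom_metric.py
```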

Run the main.py script with python src/main.py in the terminal. For debugging, you can use the launch.json file in the repository (select the "Python Current File" configuration and press F5 or Run and Debug).

.vscode/launch.json
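
A typical configuration for debugging the current file could look like this (the repository version may differ slightly):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python Current File",
      "type": "debugpy",
      "request": "launch",
      "program": "${file}",
      "console": "integratedTerminal",
      "justMyCode": false,
      "env": {
        "PYTHONPATH": "${workspaceFolder}:${PYTHONPATH}"
      }
    }
  ]
}
```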

After the evaluation is complete, you will receive a link to the report in the logs. You can open the report in the web interface by clicking on the link. You will also find the evaluation results in the Team Files, in the folder that you specified in the script (/model-benchmark/custom_benchmark/vizualizations/Model Evaluation Report.lnk).

Hooray! 🎉 You have successfully implemented a custom benchmark evaluation in Supervisely!

But wait, there is another way to run the custom benchmark – using a deployed NN model. Let's move on to the next step.

Run Evaluation of your Models

In this section, we will evaluate a model that is deployed on the Supervisely platform. Instead of using a downloaded project with predictions, we will use the model to generate predictions and evaluate them.

For this purpose, we will create a new custom benchmark class that inherits from BaseBenchmark.

The BaseBenchmark class is an all-in-one base class that orchestrates the whole pipeline – inference, evaluation, and visualization – and generates the report.

All you need to do is change only 3 lines of code! 💫

1. Create a new file benchmark.py with the following content:
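
A sketch of benchmark.py, assuming BaseBenchmark exposes hooks for plugging in the custom engine classes; the exact attribute or method names to override are shown in the repository and in the BaseBenchmark source.

```python
# NOTE: the import path and the hook names that point BaseBenchmark at the custom
# classes are assumptions; check the repository / BaseBenchmark source code.
from supervisely.nn.benchmark.base_benchmark import BaseBenchmark

from src.evaluator import MyEvaluator
from src.visualizer import MyVisualizer


class MyBenchmark(BaseBenchmark):
    # Plug the custom engine classes into the inference-evaluation-visualization pipeline.
    evaluator_cls = MyEvaluator    # assumed hook name
    visualizer_cls = MyVisualizer  # assumed hook name
```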

Here is a brief overview of the relationships between the classes in this scenario. As you can see, we will use the same engine classes, but the input will be different – the GT project ID and the deployed model session ID (instead of the local projects).

Schema of the benchmark process with GT project and a deployed model

2. Update the main.py script to run the custom benchmark on the deployed model session:
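
A rough sketch of the deployed-model scenario; the benchmark method names and arguments below are assumptions modeled on the SDK's built-in benchmarks.

```python
import supervisely as sly
from dotenv import load_dotenv

from src.benchmark import MyBenchmark

load_dotenv("supervisely.env")
api = sly.Api()

gt_project_id = 12345     # placeholder: ID of the GT project on the platform
model_session_id = 67890  # placeholder: task ID of the deployed NN model session

# Run inference, evaluation, and visualization (assumed method names and signatures).
bench = MyBenchmark(api, gt_project_id, output_dir="./benchmark")
bench.run_evaluation(model_session=model_session_id)
bench.visualize()
bench.upload_visualizations("/model-benchmark/custom_benchmark")
```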

That's it! You are now ready to run the custom evaluation on different deployed models.

Run the main.py script with python src/main.py in the terminal (or use the launch.json file from the source code for debugging).

As in the previous step, you will receive a link to the report in the logs or find the evaluation results in the Team Files.

Great job! 🎉 You have successfully implemented a custom benchmark evaluation on the deployed model in Supervisely!

Using the BaseBenchmark class, you still have the flexibility to run evaluations on two projects. And it is even easier – just pass project IDs instead of paths and call the bench.evaluation(pred_project_id) method. The BaseBenchmark class will take care of the rest.

Let's move on to the next level and integrate the custom benchmark with the GUI interface 🎨.

Plug the Custom Benchmark into the GUI

In this step, we will create a sly.Application (a high-level class in the Supervisely SDK that allows you to create a FastAPI application with a GUI) that will run the custom benchmark evaluation. The application will allow you to select the GT project, the deployed model session, and the evaluation parameters, and run the evaluation with a few clicks in the web interface.

You can take a look at the Evaluator for Model Benchmark app in the Ecosystem to see how we implemented the GUI interface for the evaluation process.

Find the source code for this guide in the GitHub repository.

First, let's create the local.env file with the following variables:
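
A typical local.env for local development could look like this (the IDs are placeholders; the exact set of variables used by the app is in the repository):

```
TEAM_ID=8
WORKSPACE_ID=349
SLY_APP_DATA_DIR="./app_data"
```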

Now, we will upgrade main.py from a simple script to a sly.Application.

src/main.py
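
A sketch of the upgraded script; the GUI widgets below are standard Supervisely app widgets, while the benchmark calls follow the assumptions from the previous section.

```python
import os

import supervisely as sly
from dotenv import load_dotenv
from supervisely.app.widgets import Button, Card, Container, SelectAppSession, SelectProject

from src.benchmark import MyBenchmark

# Load credentials and local development variables.
load_dotenv("local.env")
load_dotenv(os.path.expanduser("~/supervisely.env"))

api = sly.Api()
team_id = sly.env.team_id()
workspace_id = sly.env.workspace_id()

# GUI: select the GT project and the deployed model session, then run the evaluation.
select_project = SelectProject(workspace_id=workspace_id)
select_session = SelectAppSession(team_id=team_id, tags=["deployed_nn"])
run_button = Button("Evaluate")

layout = Card(
    title="Custom Model Benchmark",
    content=Container([select_project, select_session, run_button]),
)

app = sly.Application(layout=layout)


@run_button.click
def run_evaluation():
    gt_project_id = select_project.get_selected_id()
    session_id = select_session.get_selected_id()

    # Benchmark method names are assumptions (see the previous section).
    bench = MyBenchmark(api, gt_project_id, output_dir="./benchmark")
    bench.run_evaluation(model_session=session_id)
    bench.visualize()
    bench.upload_visualizations("/model-benchmark/custom_benchmark")
```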

Launch the application using the following command in the terminal:
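
The command typically looks like this, assuming the sly.Application object is named app in src/main.py:

```bash
uvicorn src.main:app --host 0.0.0.0 --port 8000 --ws websockets
```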

Or use the launch.json file from the source code for debugging and press F5 or Run and Debug.

.vscode/launch.json
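
A typical debug configuration that launches the app via Uvicorn could look like this (the repository version may differ slightly):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Uvicorn",
      "type": "debugpy",
      "request": "launch",
      "module": "uvicorn",
      "args": ["src.main:app", "--host", "0.0.0.0", "--port", "8000", "--ws", "websockets"],
      "console": "integratedTerminal",
      "justMyCode": false,
      "env": {
        "PYTHONPATH": "${workspaceFolder}:${PYTHONPATH}"
      }
    }
  ]
}
```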

Open the browser and go to http://localhost:8000 to see the GUI. Select the GT project and the deployed model session, and press the Evaluate button to run the evaluation. The app will connect to the deployed NN model, run the inference, upload predictions to a new project, evaluate the model, and generate the report.

After the process is complete, you will see a widget with the evaluation report (click on the link to open the report in the web interface).

Check out our Developer Portal to learn more about how to release your app as a private app in Supervisely – here.

Congratulations! 🎉 You have successfully integrated the custom benchmark with the GUI interface in Supervisely!
