In Enterprise Edition you can not only store files on a hard drive, but also connect Azure Blob Storage, Google Cloud Storage or any S3-compatible storage (e.g. AWS S3).
You can upload files from your PC to the connected cloud storage or use files already uploaded to cloud storage as a source (without duplicating them).
How we store files
Supervisely uses DATA_PATH from .env (defaults to /supervisely/data) to keep caches, the database and so on. Here we are interested in the storage subfolder, where generated content such as uploaded images or neural networks is stored.
You can find two subfolders here:
<something>-public/
<something>-private/
That's because we maintain the same structure in local storage as you would have in remote storage, where those two folders are buckets or containers. You may notice that one has "public" in its name, but that only reflects the kind of data we store in it. Both buckets are private and do not provide anonymous read access.
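To check which folders play the role of buckets on a local installation, a small helper can scan the storage subfolder. This is a minimal sketch, not part of Supervisely: the function name find_bucket_dirs is hypothetical, and it only assumes the "-public" / "-private" naming described above.

```python
from pathlib import Path
from typing import Dict


def find_bucket_dirs(storage_dir: str) -> Dict[str, Path]:
    """Return the '<something>-public' and '<something>-private' folders, if present."""
    found: Dict[str, Path] = {}
    for entry in sorted(Path(storage_dir).iterdir()):
        if not entry.is_dir():
            continue  # plain files are never buckets
        for kind in ("public", "private"):
            if entry.name.endswith(f"-{kind}"):
                found[kind] = entry
    return found
```

With the default DATA_PATH, the call would be find_bucket_dirs("/supervisely/data/storage").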
Configure Supervisely to use S3 compatible storage (Amazon S3, Minio)
This section describes how to configure Supervisely to store its data on a cloud storage rather than on a hard drive. This won't allow you to use existing images and videos on your cloud. If you need to use existing images and videos, please check the section below.
Edit .env configuration file - you can find it by running supervisely where command.
Change STORAGE_PROVIDER from http (local hard drive) to minio (S3 storage backend).
Also, you need to provide STORAGE_ACCESS_KEY and STORAGE_SECRET_KEY credentials along with endpoint of your S3 storage.
For example, here are settings for Amazon S3:
STORAGE_ENDPOINT=s3.amazonaws.com
STORAGE_PORT=443
So in the end, here is what your .env settings could look like:
Execute sudo supervisely up -d to apply the new settings.
If you're working with large files (4GB+) you might also want to add permission for "s3:ListBucketMultipartUploads" at the bucket level, so Supervisely can initiate multipart uploads for larger artifacts.
Configure Supervisely to use Azure Blob Storage
This section describes how to configure Supervisely to store its data on a cloud storage rather than on a hard drive. This won't allow you to use existing images and videos on your cloud. If you need to use existing images and videos, please check the section below.
Edit .env configuration file - you can find it by running supervisely where command.
Change STORAGE_PROVIDER from http (local hard drive) to azure (Azure storage backend).
Also, you need to provide STORAGE_ACCESS_KEY (your storage account name) and STORAGE_SECRET_KEY (secret key) credentials along with endpoint of your blob storage.
Here is what your .env settings could look like:
JUPYTER_DOWNLOAD_FILES_BEFORE_START=true
STORAGE_JUPYTER_SYNC=true
STORAGE_ACCESS_KEY=<account name>
STORAGE_ENDPOINT=https://<account name>.blob.core.windows.net
STORAGE_PROVIDER=azure
STORAGE_SECRET_KEY=<secret key 88 chars long or so: aflmg+wg23fWA+6gAafWmgF4a>
Execute sudo supervisely up -d to apply the new settings.
Configure Supervisely to use Google Cloud Storage
This section describes how to configure Supervisely to store its data on a cloud storage rather than on a hard drive. This won't allow you to use existing images and videos on your cloud. If you need to use existing images and videos, please check the section below.
Edit .env configuration file - you can find it by running supervisely where command.
Change STORAGE_PROVIDER from http (local hard drive) to google (GCS backend).
Also, you need to provide STORAGE_CREDENTIALS_PATH credentials file generated by Google.
Run minio/mc docker image and execute the following commands:
mc config host add s3 https://s3.amazonaws.com <YOUR-ACCESS-KEY> <YOUR-SECRET-KEY>
mc cp --recursive <DATA_STORAGE_FROM_HOST>/<your-buckets-prefix>-public s3/<your-buckets-prefix>-public/
mc cp --recursive <DATA_STORAGE_FROM_HOST>/<your-buckets-prefix>-private s3/<your-buckets-prefix>-private/
Finally, restart services to apply new configuration: supervisely up -d.
Keys from IAM Role
If you want to use an IAM Role, specify STORAGE_IAM_ROLE=<role_name> in the .env file. The STORAGE_ACCESS_KEY and STORAGE_SECRET_KEY variables can then be omitted.
IAM Roles are only supported for AWS S3.
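Before restarting the services, it can help to check that the .env file contains every variable the chosen provider needs. The sketch below is an illustration only: parse_env and missing_storage_settings are hypothetical helpers, the required variables are taken from the sections above, and treating a missing STORAGE_PROVIDER as http is an assumption.

```python
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_storage_settings(settings: Dict[str, str]) -> List[str]:
    """List the storage variables that still need a value for the chosen provider."""
    # Assumption: no STORAGE_PROVIDER means the local hard drive ("http").
    provider = settings.get("STORAGE_PROVIDER", "http")
    if provider == "http":
        return []
    if provider == "google":
        required = ["STORAGE_CREDENTIALS_PATH"]
    elif provider in ("minio", "azure"):
        required = ["STORAGE_ENDPOINT"]
        # An IAM role replaces the key pair, but only for S3 (provider "minio").
        if not (provider == "minio" and settings.get("STORAGE_IAM_ROLE")):
            required += ["STORAGE_ACCESS_KEY", "STORAGE_SECRET_KEY"]
    else:
        raise ValueError(f"Unknown STORAGE_PROVIDER: {provider}")
    return [name for name in required if not settings.get(name)]
```

An empty result means nothing obvious is missing; it does not prove that the credentials themselves are valid.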
Frontend caching
Since AWS and Azure can be quite pricey in case of heavy reads, we enable image caching by default.
If an image is not in the previews cache but is in the storage cache, its preview will be generated from the cached copy and put into the previews cache. The image will not be fetched from the remote server.
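The lookup order described above can be pictured as a two-level cache. The following is a conceptual sketch only, not Supervisely's implementation: the caches are plain dictionaries, and get_preview, fetch_remote and make_preview are hypothetical names.

```python
from typing import Callable, Dict


def get_preview(
    image_id: str,
    preview_cache: Dict[str, bytes],
    storage_cache: Dict[str, bytes],
    fetch_remote: Callable[[str], bytes],
    make_preview: Callable[[bytes], bytes],
) -> bytes:
    """Serve a preview, reading from remote storage only when both caches miss."""
    if image_id in preview_cache:
        return preview_cache[image_id]
    if image_id not in storage_cache:
        # Only this branch costs a (billable) remote read.
        storage_cache[image_id] = fetch_remote(image_id)
    preview = make_preview(storage_cache[image_id])
    preview_cache[image_id] = preview
    return preview
```

A real cache would live on disk and evict old entries; the sketch only shows when the remote server is contacted.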
Here are the default values (you can alter them via docker-compose.override.yml file):
If you already have some files on Amazon S3/Google Cloud Storage/Azure Storage and you don't want to upload and store those files in Supervisely, you can use the "Links" plugin to link the files to Supervisely server.
Instead of uploading actual files (e.g. images), you will need to upload .txt file(s) that contain a list of URLs to your files. If your URLs are publicly available (e.g. a link looks like https://s3-us-west-2.amazonaws.com/test1/abc and you can open it directly in your web browser), then you can stop reading and start uploading.
If your files are protected, however, you will need to provide credentials in the instance settings or manually create a configuration file.
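The .txt file with URLs mentioned above can be generated with a few lines of code. This is a minimal sketch that assumes one URL per line; the helper name write_links_file is hypothetical and the exact file format expected by the plugin should be checked against your instance.

```python
from pathlib import Path
from typing import Iterable


def write_links_file(urls: Iterable[str], out_path: str) -> int:
    """Write one URL per line (assumed format) and return how many were written."""
    cleaned = [u.strip() for u in urls if u.strip()]  # drop blank entries
    Path(out_path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)
```

For example, write_links_file(["https://s3-us-west-2.amazonaws.com/test1/abc"], "links.txt") produces a one-line file ready for upload.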
Azure SAS Token minimal permissions
File system provider
Folder path on the server - path to folder on the host server that will be mounted
Storage ID (bucket) - mounted folder identifier. It will be used in links to the mounted folder
For instance, for the example above, when you want to add a new asset (image or video) whose local path on your hard drive is /data/datasets/persons/image1.jpg, use the following format in the API, SDK or corresponding application: fs://local-datasets/persons/image1.jpg
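The mapping from a host path to an fs:// link can be sketched as follows. The example above implies that the mounted folder is /data/datasets and the Storage ID is local-datasets; both values are inferred from that single example, and to_fs_link is a hypothetical helper, not an SDK function.

```python
from pathlib import PurePosixPath


def to_fs_link(host_path: str, mounted_folder: str, storage_id: str) -> str:
    """Translate a host path inside the mounted folder into an fs:// link."""
    # relative_to raises ValueError when host_path lies outside mounted_folder.
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(mounted_folder))
    return f"fs://{storage_id}/{relative.as_posix()}"
```

Calling to_fs_link("/data/datasets/persons/image1.jpg", "/data/datasets", "local-datasets") yields the link shown above.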
Manual configuration
If you are brave enough, you can create configuration files manually:
Example configuration file:
# amazon s3 example
my-car-datasets:
  provider: minio
  endpoint: s3.amazonaws.com
  access_key: <your access key>
  secret_key: <your secret key>
  # iam_role: <or just use your iam role>
  region: eu-central-1
  # array of buckets
  buckets:
    - cars_2020_20_10
    - cars_2020_10_10

# azure storage example
my-boats-datasets:
  provider: azure
  endpoint: https://<account name>.blob.core.windows.net
  access_key: <account name>
  secret_key: <secret key 88 chars long or so: aflmg+wg23fWA+6gAafWmgF4a>
  # secret_key: or you can also use SAS token here: ?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-10-10T00:00:00Z&st=2020-10-10T00:00:00Z&spr=https&sig=...
  # array of buckets
  buckets:
    - boats_bucket_2020_20_10
    - another_boats_bucket_2020_10_10

# google cloud storage example
my-planes-datasets:
  provider: google
  endpoint: storage.googleapis.com
  credentials_path: <path to the secret file inside the container>
  # array of buckets
  buckets:
    - planes_bucket_2020_20_10
    - another_planes_bucket_2020_10_10
Create a new file docker-compose.override.yml under cd $(sudo supervisely where):
services:
  http-storage:
    volumes:
      - <path to the configuration file>:/remote_links.yml:ro
Then execute the following to apply the changes:
sudo supervisely up -d http-storage
Google Cloud Storage secret file example, docker-compose.override.yml:
services:
  http-storage:
    volumes:
      - <path to the secret file>:/secret_planes.json:ro
Migrating existing projects to Cloud Storage
If you want to migrate only some of the projects that exist in the Supervisely storage to the linked cloud, you can achieve this using the following code snippet.
The code snippet:
Is designed to change links only for entities that are not linked yet, that is, entities that are stored in Supervisely storage.
Will change links only when all entities are uploaded to remote storage.
Can be run again in case of failure. Will not re-upload entities that are already uploaded to remote storage.
Saves nested datasets in remote storage as a flat structure. All datasets will be placed in the project directory.
Will not delete entities from Supervisely storage after migration.
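The "can be run again" property in the list above rests on a size comparison: an entity is skipped when a file of the same size already exists at its remote path. A minimal stand-alone sketch of that decision, using plain dictionaries instead of API calls (entities_to_upload and remote_sizes are hypothetical names):

```python
from typing import Dict, List, Optional


def entities_to_upload(
    entities: Dict[int, dict], remote_sizes: Dict[str, Optional[int]]
) -> List[int]:
    """Return IDs of entities whose remote copy is missing or differs in size."""
    pending: List[int] = []
    for entity_id, info in entities.items():
        # A missing remote file gives None, which never equals a real size.
        if remote_sizes.get(info["remote"]) != info["size"]:
            pending.append(entity_id)
    return pending
```

Comparing sizes is cheap but is not a content check; two different files of equal size would be treated as identical.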
Function to use in your code: migrate_project(project: Union[sly.ProjectInfo, int])
Remember to configure the REMOTE_BUCKET and MIGRATION_DIR constants in the code snippet before use.
import asyncio
import os
from typing import Union

import aiohttp
from aiohttp import FormData
from tenacity import before_sleep_log, retry, stop_after_attempt, wait_exponential
from tqdm import tqdm

import supervisely as sly
from supervisely.api.api import ApiField
from supervisely.api.image_api import ImageApi
from supervisely.api.video.video_api import VideoApi

# -------------------------------- Global Variables For Migration -------------------------------- #
entity_api = None
download_api_url = None
entities_map = {}
api = sly.Api.from_env()

# ------------------------------------ Constants For Migration ----------------------------------- #
REMOTE_BUCKET = "s3://migration-bucket/"  # TODO Change to your remote storage bucket
MIGRATION_DIR = "projects-migration-storage"  # TODO Change to your remote storage directory
IMAGES_DIR = os.path.join(REMOTE_BUCKET, MIGRATION_DIR, str(sly.ProjectType.IMAGES))
VIDEOS_DIR = os.path.join(REMOTE_BUCKET, MIGRATION_DIR, str(sly.ProjectType.VIDEOS))
IMAGES_DOWNLOAD_API_URL = api.api_server_address + "/v3/" + "images.download"
VIDEOS_DOWNLOAD_API_URL = api.api_server_address + "/v3/" + "videos.download"
REMOTE_STORAGE_UPLOAD_API_URL = api.api_server_address + "/v3/" + "remote-storage.upload"


# ----------------------------- Asynchronous Functions For Migration ----------------------------- #
@retry(
    stop=stop_after_attempt(10),
    wait=wait_exponential(multiplier=2, min=2, max=60),
    before_sleep=before_sleep_log(sly.logger, sly.logger.level),
)
async def process_entity(
    download_api_url: str,
    entity_id: int,
    info: dict,
    progress_on: bool,
    total_progress: tqdm,
):
    """This function is used in `upload_entity` to wrap the process of downloading and uploading entities with retries."""
    global api
    async with aiohttp.ClientSession() as session:
        async with session.post(
            url=download_api_url, data={ApiField.ID: entity_id}, headers=api.headers
        ) as response:
            response.raise_for_status()
            form = FormData()
            form.add_field("path", info["remote"])
            total_size = int(response.headers.get("Content-Length", 0))
            if progress_on:
                progress = tqdm(total=total_size, unit="B", unit_scale=True, desc=info["name"])

            async def file_gen():
                """This function generates chunks of entity to upload to remote storage."""
                async for chunk in response.content.iter_chunked(8192):
                    yield chunk
                    if progress_on:
                        progress.update(len(chunk))

            form.add_field(
                "file",
                file_gen(),
                filename=info["name"],
                content_type=info["mime"],
            )
            async with session.post(
                url=REMOTE_STORAGE_UPLOAD_API_URL, data=form, headers=api.headers
            ) as post_response:
                post_response.raise_for_status()
                if progress_on:
                    progress.close()
                if total_progress:
                    total_progress.update(1)
                return await post_response.text()


async def upload_entity(
    download_api_url: str,
    entity_id: int,
    info: dict,
    semaphore: asyncio.Semaphore,
    total_progress: tqdm = None,
    progress_on: bool = False,
):
    """
    This function downloads entity from Supervisely storage as a stream without saving it to disk
    and uploads it to remote storage as a stream.
    All operations are done asynchronously in memory by chunks.

    :param download_api_url: URL to download entity from Supervisely storage via API
    :type download_api_url: str
    :param entity_id: ID of the entity to download
    :type entity_id: int
    :param info: Information about entity collected during the preparation. Contains name, mime, remote path.
    :type info: dict
    :param semaphore: Semaphore to limit the number of concurrent downloads/uploads
    :type semaphore: asyncio.Semaphore
    :param total_progress: Progress bar to track the total progress of migration
    :type total_progress: tqdm
    :param progress_on: Flag to enable progress bar for the current entity.
                        Don't use it if entity has a small size in megabytes < 100.
    :type progress_on: bool
    :return None
    """
    async with semaphore:
        try:
            loop = asyncio.get_event_loop()
            try:
                remote_info = await loop.run_in_executor(
                    None, entity_api._api.remote_storage.get_file_info_by_path, info["remote"]
                )
                if remote_info.get("size") == info.get("size"):
                    sly.logger.debug(
                        f"Entity already exists in remote storage: {info.get('remote')}"
                    )
                    if total_progress:
                        total_progress.update(1)
                    return None
            except Exception:
                sly.logger.debug(
                    f"Entity does not exist in remote storage: {info.get('remote')}. Will be uploaded"
                )
            return await process_entity(
                download_api_url, entity_id, info, progress_on, total_progress
            )
        except Exception as e:
            sly.logger.error(
                f"Failed to process entity with ID - {entity_id}, name - {info.get('name')}. "
                f"Will be skipped from migration due to the error: {e}"
            )
            entities_map.pop(entity_id)
            return None


async def upload():
    """
    This function uploads entities to remote storage in parallel.
    The number of concurrent uploads is limited by the semaphore as 10.
    Don't adjust the semaphore value if you are not sure about the performance of instance.
    """
    semaphore = asyncio.Semaphore(10)
    tasks = []
    total_tasks = len(entities_map)
    with tqdm(total=total_tasks, desc="Uploading entities to remote storage") as total_progress:
        for e_id, info in entities_map.items():
            tasks.append(upload_entity(download_api_url, e_id, info, semaphore, total_progress))
        await asyncio.gather(*tasks)


@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=2, max=60),
    before_sleep=before_sleep_log(sly.logger, sly.logger.level),
)
def set_remote_with_retries(entity_api: Union[ImageApi, VideoApi], e_list: list, r_list: list):
    response = entity_api.set_remote(e_list, r_list)
    if not response.get("success"):
        raise Exception(f"Failed to set remote links for entities: {e_list}")
    return response


def migrate_project(project: Union[sly.ProjectInfo, int]):
    """
    This main function migrates entities of the project to remote storage.

    :param project: Project ID or ProjectInfo object
    :type project: Union[sly.ProjectInfo, int]
    """
    global api, entity_api, download_api_url, entities_map

    # -------------------------------- Collecting Entities Information ------------------------------- #
    if isinstance(project, int):
        project_info = api.project.get_info_by_id(project)
    elif isinstance(project, sly.ProjectInfo):
        project_info = project
    else:
        raise ValueError("Unsupported project reference of type: {}".format(type(project)))

    if project_info.type == str(sly.ProjectType.IMAGES):
        entity_api = api.image
        download_api_url = IMAGES_DOWNLOAD_API_URL
    elif project_info.type == str(sly.ProjectType.VIDEOS):
        entity_api = api.video
        download_api_url = VIDEOS_DOWNLOAD_API_URL
    else:
        raise ValueError(f"Unsupported project type: {project_info.type}")

    if not entities_map:
        for dataset in api.dataset.get_list(project_info.id, recursive=True):
            for entity_info in entity_api.get_list(dataset.id):
                if entity_info.link is not None:
                    continue
                entities_map[entity_info.id] = {}
                entities_map[entity_info.id]["name"] = entity_info.name
                if project_info.type == str(sly.ProjectType.IMAGES):
                    entities_map[entity_info.id]["mime"] = entity_info.mime
                    entities_map[entity_info.id]["size"] = entity_info.size
                    entities_map[entity_info.id]["remote"] = os.path.join(
                        IMAGES_DIR, str(project_info.id), str(dataset.id), entity_info.name
                    )
                elif project_info.type == str(sly.ProjectType.VIDEOS):
                    entities_map[entity_info.id]["mime"] = entity_info.file_meta["mime"]
                    entities_map[entity_info.id]["size"] = int(entity_info.file_meta["size"])
                    entities_map[entity_info.id]["remote"] = os.path.join(
                        VIDEOS_DIR, str(project_info.id), str(dataset.id), entity_info.name
                    )

    # --------------------------------- Uploading Entities To Remote --------------------------------- #
    if entities_map:
        asyncio.run(upload())

        # ----------------------------- Setting Remote Links For Entities ---------------------------- #
        entity_list = [int(entity_id) for entity_id in entities_map.keys()]
        remote_links_list = [entities_map[e_id]["remote"] for e_id in entity_list]
        for e_list, r_list in zip(
            sly.batched(entity_list, batch_size=1000),
            sly.batched(remote_links_list, batch_size=1000),
        ):
            set_remote_with_retries(entity_api, e_list, r_list)
        sly.logger.info(
            f"Entities have been migrated to remote storage for project: [{project_info.id}] {project_info.name}"
        )
    else:
        sly.logger.info(
            f"No entities to migrate for project: [{project_info.id}] {project_info.name}"
        )
If you need to keep the nested dataset structure in remote storage
You can modify the script to create nested directories in the remote storage. To do this, replace api.dataset.get_list(...) with api.dataset.get_tree(...) and iterate over the tree. Then change the remote path of each entity so that it includes the nested dataset path, for example the chain of parent dataset IDs followed by the ID of the dataset itself.
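One way to derive nested remote directories is to walk the dataset tree recursively and carry the chain of parent IDs along. The sketch below uses a plain nested dictionary of dataset IDs as a stand-in for the tree; the actual structure returned by api.dataset.get_tree(...) may differ, so treat the input shape and the name nested_remote_dirs as assumptions.

```python
import posixpath
from typing import Dict, Iterator, Tuple


def nested_remote_dirs(
    tree: Dict[int, dict], base_dir: str, parents: Tuple[int, ...] = ()
) -> Iterator[Tuple[int, str]]:
    """Yield (dataset_id, remote_dir) pairs, one directory level per nesting level."""
    for dataset_id, children in tree.items():
        chain = parents + (dataset_id,)
        yield dataset_id, posixpath.join(base_dir, *map(str, chain))
        # Descend into child datasets, extending the chain of IDs.
        yield from nested_remote_dirs(children, base_dir, chain)
```

The resulting directory for each dataset would replace the flat str(dataset.id) component when the remote path of an entity is built.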
If you have already uploaded entities to remote storage
In this case you can simply set remote links for them. There are two ways:
Create your own entities_map that matches the structure used in the code above and redefine it in the Global Variables section
Use the SDK API methods with lists of entity IDs and remote links:
ImageApi(...).set_remote(...)
VideoApi(...).set_remote(...)
For better performance, you can use the function sly.batched to split the lists of entities and remote links into batches. It is recommended to use batches of no more than 1000 items.
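The batching recommendation can be sketched without the SDK as follows. Here batched is a local stand-in written for illustration (in real code you would use sly.batched), and set_remote is any callable that accepts a list of IDs and a list of links, such as the set_remote method of the image or video API.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int = 1000) -> Iterator[List]:
    """Local stand-in for sly.batched: yield consecutive slices of items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


def set_remote_in_batches(
    entity_ids: Sequence[int],
    links: Sequence[str],
    set_remote: Callable[[List[int], List[str]], object],
    batch_size: int = 1000,
) -> int:
    """Call set_remote once per batch and return the number of calls made."""
    if len(entity_ids) != len(links):
        raise ValueError("entity_ids and links must have the same length")
    calls = 0
    for ids, urls in zip(batched(entity_ids, batch_size), batched(links, batch_size)):
        set_remote(ids, urls)
        calls += 1
    return calls
```

Because both lists are sliced with the same batch size, each ID stays paired with its own link.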