Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Jinghuan Shang¹,², Karl Schmeckpeper¹, Brandon B. May¹, Maria Vittoria Minniti¹, Tarik Kelestemur¹, David Watkins¹, Laura Herlant¹

¹The AI Institute ²Stony Brook University

CoRL 2024

Quick Start: Use Pre-trained Theia Models

Through Hugging Face:

import torch
from transformers import AutoModel

# trust_remote_code=True lets transformers load the custom model code from the Hub repository.
model = AutoModel.from_pretrained("theaiinstitute/theia-base-patch16-224-cdiv", trust_remote_code=True)

# Dummy input: a batch of one 224x224 RGB image in (N, H, W, C) layout, dtype uint8.
fake_input = torch.zeros((1, 224, 224, 3), dtype=torch.uint8)

# Theia (intermediate) feature, mainly used for robot learning.
# To use a different feature reduction method, pass the `feature_reduction_method` argument to AutoModel.from_pretrained().
theia_feature = model.forward_feature(fake_input)

# predicted_features is a dict[str, torch.Tensor]: each key is a target (teacher) model name, each value the predicted feature.
# These are the features Theia predicts in order to match the corresponding teacher model features.
predicted_features = model(fake_input)

The theia-<size>-patch16-224-cdiv models are used for the main evaluations in the paper.
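The dummy input above suggests that the model takes uint8 images in (N, H, W, C) layout. The sketch below converts a float, channels-first image into that layout. It assumes the layout inferred from fake_input is the expected one and that the image is already 224x224; the helper name to_theia_input is hypothetical and not part of the Theia package.

```python
import numpy as np

def to_theia_input(image_chw: np.ndarray) -> np.ndarray:
    """Convert one float (3, H, W) image in [0, 1] to a uint8 batch of shape (1, H, W, 3).

    Hypothetical helper: the target layout is inferred from the dummy input
    in the quick start, not from documented API behavior.
    """
    if image_chw.ndim != 3 or image_chw.shape[0] != 3:
        raise ValueError(f"expected shape (3, H, W), got {image_chw.shape}")
    # Scale to [0, 255], round, and clip before casting so values cannot wrap around.
    scaled = np.clip(np.rint(image_chw * 255.0), 0, 255).astype(np.uint8)
    # Move channels last, make the memory contiguous, and add a batch dimension.
    return np.ascontiguousarray(scaled.transpose(1, 2, 0))[None]

# Random stand-in for a real image.
image = np.random.default_rng(0).random((3, 224, 224), dtype=np.float32)
batch = to_theia_input(image)
print(batch.shape, batch.dtype)
```

The resulting array can be wrapped with torch.from_numpy(batch) before being passed to the model.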

Installation

Make sure you have Python >= 3.10. Create a Python virtual environment of your choice, or use the provided Dockerfile, then install the package in editable mode:

pip install -e .

Data Preparation

Datasets

The datasets should be organized in webdataset format.
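For readers new to webdataset: a shard is a plain tar archive in which files that share a key (the part of the file name before the first dot) form one sample. The sketch below writes and reads back such a shard using only the standard library. The keys and extensions used here (jpg, json) are placeholders; the actual field names Theia expects are described in dataset_format.

```python
import io
import tarfile
from collections import defaultdict

def write_shard(samples: dict[str, dict[str, bytes]]) -> bytes:
    """Write {key: {extension: payload}} into an in-memory tar shard."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()

def read_shard(shard: bytes) -> dict[str, dict[str, bytes]]:
    """Group tar members back into samples by splitting each name at the first dot."""
    grouped: dict[str, dict[str, bytes]] = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        for member in tar.getmembers():
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)

# Placeholder payloads; real shards would hold encoded images and features.
samples = {
    "000000": {"jpg": b"fake-image-bytes-0", "json": b"{}"},
    "000001": {"jpg": b"fake-image-bytes-1", "json": b"{}"},
}
restored = read_shard(write_shard(samples))
print(sorted(restored))
```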

Prepare images from ImageNet

First download and prepare ImageNet.

cd src/theia/scripts/preprocessing/image_datasets
python organize_imagenet_webdataset.py --dataset <dataset_name> --imagenet-raw-path <path_to_raw_images> --output-path <root_dir_to_hold_datasets>

To use any other image dataset, place all images in one folder (subfolders also work) and modify how their paths are collected in organize_imagenet_webdataset.py (variable image_paths).

(Optional) Prepare frames from video datasets

cd src/theia/scripts/preprocessing/video_datasets
python subsampling_videos.py --dataset <dataset_name> --dataset-path <path_to_raw_videos> --output-path <root_dir_to_hold_datasets> [--subsampling-rate] [--samples-per-shard]
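As a rough illustration of the two optional flags, the sketch below assumes that a subsampling rate of N keeps one frame in every N and that the kept frames are then grouped into shards of a fixed size. Both meanings are assumptions; check subsampling_videos.py for the actual behavior. The function names are hypothetical.

```python
def subsample_indices(num_frames: int, rate: int) -> list[int]:
    """Keep every `rate`-th frame index, starting at frame 0 (assumed meaning of the rate)."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    return list(range(0, num_frames, rate))

def split_into_shards(items: list[int], samples_per_shard: int) -> list[list[int]]:
    """Group items into consecutive shards; the last shard may be smaller."""
    if samples_per_shard < 1:
        raise ValueError("samples_per_shard must be >= 1")
    return [items[i:i + samples_per_shard] for i in range(0, len(items), samples_per_shard)]

kept = subsample_indices(num_frames=100, rate=8)
shards = split_into_shards(kept, samples_per_shard=5)
print(kept)
print([len(shard) for shard in shards])
```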

Feature Extraction

cd src/theia/scripts/preprocessing
python feature_extraction.py --dataset <dataset_name> --output-path <root_dir_to_hold_datasets> --model <model_name> --split <train or val (or test)> [--num-gpus]

You can also refer to the integrated script src/theia/scripts/preprocessing/iv_feature_extraction.py that launches feature extraction for multiple models at the same time.

During training, the mean and variance of each teacher model's features are needed to normalize the teacher features. You can compute them using src/theia/scripts/preprocessing/calc_feature_mean.py or use the stats we provide in feature_stats.
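The normalization step can be pictured as standardizing each feature dimension with precomputed statistics. This is a minimal sketch under the assumption that the statistics are taken per feature dimension over all samples; it is not the logic of calc_feature_mean.py, and the function names and the eps parameter are hypothetical.

```python
import numpy as np

def feature_stats(features: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-dimension mean and variance of (num_samples, dim) features."""
    return features.mean(axis=0), features.var(axis=0)

def normalize(features: np.ndarray, mean: np.ndarray, var: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize features; eps guards against division by zero for constant dimensions."""
    return (features - mean) / np.sqrt(var + eps)

# Random stand-in for teacher features extracted from a dataset.
rng = np.random.default_rng(0)
teacher_features = rng.normal(loc=3.0, scale=2.0, size=(1000, 8))

mean, var = feature_stats(teacher_features)
normalized = normalize(teacher_features, mean, var)
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```

Computing the statistics once and reusing them keeps the regression targets on a comparable scale across teachers.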

Expected Dataset Format

More details about the dataset format are available at dataset_format. Please use this to verify or troubleshoot your data.

Training

cd src/theia/scripts

# train theia tiny using training configuration train_rvfm_imagenet
# with teacher models CLIP, DINOv2, and ViT
torchrun --nproc_per_node=8 --nnodes 1 --rdzv_backend c10d --rdzv_endpoint localhost:11111 train_rvfm.py --config-name=train_rvfm_imagenet logging.notes=imagenet_cdiv training/target_models=cdiv dataset.dataset_ratio=1.0 model.backbone.backbone=facebook/deit-tiny-patch16-224 logging.save_ckpt_interval=50000 dataset.dataset_root=<root_dir_to_hold_datasets>

To change output paths and wandb logging configs, override or modify src/theia/configs/logging/default.yaml.

To use different teacher models, override training/target_models=<teacher model config>. Available configs are under src/theia/configs/training/target_models.

To use a different dataset, override dataset=<dataset config>. Available configs are under src/theia/configs/dataset.
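When launching several runs, it can help to assemble the command and its overrides programmatically. The sketch below only builds and prints the command shown above; every flag and override is copied from that command, while the helper name build_train_command and its parameters are hypothetical.

```python
import shlex

def build_train_command(
    dataset_root: str,
    target_models: str = "cdiv",
    backbone: str = "facebook/deit-tiny-patch16-224",
    nproc_per_node: int = 8,
) -> list[str]:
    """Assemble the torchrun invocation plus key=value config overrides."""
    launcher = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        "--nnodes", "1",
        "--rdzv_backend", "c10d",
        "--rdzv_endpoint", "localhost:11111",
    ]
    overrides = [
        f"training/target_models={target_models}",  # which teacher models to distill
        "dataset.dataset_ratio=1.0",
        f"model.backbone.backbone={backbone}",       # student backbone
        "logging.save_ckpt_interval=50000",
        f"dataset.dataset_root={dataset_root}",
    ]
    return launcher + ["train_rvfm.py", "--config-name=train_rvfm_imagenet"] + overrides

# "/data/theia" is a placeholder for <root_dir_to_hold_datasets>.
command = build_train_command(dataset_root="/data/theia")
print(shlex.join(command))
```

The printed string can be pasted into a shell from src/theia/scripts, or the list can be passed to subprocess.run.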

Decode Theia-representation to VFM outputs

You can decode Theia-predicted VFM representations to their outputs. For DINOv2 we apply PCA visualization, for SAM we use its decoder to generate segmentation masks (still using SAM's prompting pipeline), and for Depth-Anything we use the decoder head to predict depth. Below are example outputs. The Theia model must have been trained on those teachers during distillation. Among the models available online, those with cddsv in their name are trained on all teachers.

Try out our online demo or notebook example, or get outputs from local checkpoints with:

cd src/theia/scripts/decoding
python decoding_example.py --backbone <backbone_name> --checkpoint-path <path to theia model checkpoint> --feature-stat-dir <where feature mean and std are placed> --media-to-vis-path <path to the video or image to decode>
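To give an idea of what the PCA visualization mentioned above involves, the sketch below projects per-patch features onto their first three principal components and maps them to RGB. It is a generic illustration, not the code in decoding_example.py; the function name pca_rgb and the feature dimension of 64 are assumptions, while the 14x14 grid follows from 224-pixel inputs with 16-pixel patches.

```python
import numpy as np

def pca_rgb(patch_features: np.ndarray, grid_hw: tuple[int, int]) -> np.ndarray:
    """Map (num_patches, dim) features to an (H, W, 3) uint8 image via PCA."""
    height, width = grid_hw
    if patch_features.shape[0] != height * width:
        raise ValueError("number of patches does not match the grid size")
    centered = patch_features - patch_features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T  # shape: (num_patches, 3)
    # Min-max scale each component to [0, 1] so it can be shown as a color channel.
    low, high = projected.min(axis=0), projected.max(axis=0)
    scaled = (projected - low) / np.maximum(high - low, 1e-12)
    return (scaled * 255).astype(np.uint8).reshape(height, width, 3)

# Random stand-in for predicted DINOv2 patch features of one image.
features = np.random.default_rng(0).normal(size=(14 * 14, 64))
rgb = pca_rgb(features, grid_hw=(14, 14))
print(rgb.shape, rgb.dtype)
```

The small image can be upsampled to the input resolution for display next to the original frame.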

References

Webdataset, transformers, safetensors, DINOv2, CLIP, ViT, SAM, RADIO, DepthAnything

Citation

If you use Theia in your research, please use the following BibTeX entry:

@inproceedings{
    shang2024theia,
    title={Theia: Distilling Diverse Vision Foundation Models for Robot Learning},
    author={Jinghuan Shang and Karl Schmeckpeper and Brandon B. May and Maria Vittoria Minniti and Tarik Kelestemur and David Watkins and Laura Herlant},
    booktitle={8th Annual Conference on Robot Learning},
    year={2024},
    url={https://openreview.net/forum?id=ylZHvlwUcI}
}
