Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

This repository is the official implementation of Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models.

Requirements

Install all required dependencies into a new virtual environment via conda.

conda env create -f clipclap.yml
conda activate clipclap
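
As a quick sanity check of the environment (assuming PyTorch is part of the clipclap environment, which the CUDA device flags in the training commands below rely on), you can run:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"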

Download Features

You can download the CLIP and CLAP features of all three datasets here:

It does not matter where the features are stored, but the path has to be specified via the --root_dir option when running training.

unzip data.zip
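
For example, the archive can be unpacked into an arbitrary directory (the path below is hypothetical), and the corresponding dataset folder inside it is then passed to --root_dir during training:

unzip data.zip -d ~/clipclap_features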

Training

To train the model, run the following command:

python3 main.py --cfg CFG_FILE --root_dir ROOT_DIR --log_dir LOG_DIR --dataset_name DATASET_NAME --run all

arguments:
--cfg CFG_FILE is the file containing all hyperparameters for the experiments. To replicate our results, use `--cfg config/clipclap.yaml` for all three datasets.
--root_dir ROOT_DIR indicates the location where the dataset is stored.
--dataset_name {VGGSound, UCF, ActivityNet} indicates the name of the dataset.
--log_dir LOG_DIR indicates where to save the experiments.
--run {'all', 'stage-1', 'stage-2'}. 'all' runs both training stages followed by evaluation, whereas 'stage-1' or 'stage-2' runs only that particular training stage.

Example commands can also be found in commands.sh.

Run training for UCF-GZSL:

nohup python3 main.py --cfg config/clipclap.yaml \
                        --device cuda:6 \
                        --root_dir /path/to/UCF  \
                        --log_dir logs/ClipClap_UCF \
                        --dataset_name UCF \
                        --epochs 20 \
                        --lr 0.00007 \
                        --use_wavcaps_embeddings True \
                        --modality both  \
                        --word_embeddings both   \
                        --run all > logs/ClipClap_UCF.log &
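
While the job runs in the background, training progress can be followed in the log file:

tail -f logs/ClipClap_UCF.log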

Run training for ActivityNet-GZSL:

nohup python3 main.py --cfg config/clipclap.yaml \
                        --device cuda:6 \
                        --root_dir /path/to/ActivityNet  \
                        --log_dir logs/ClipClap_ActivityNet \
                        --dataset_name ActivityNet \
                        --epochs 15 \
                        --lr 0.0001 \
                        --use_wavcaps_embeddings True \
                        --modality both  \
                        --word_embeddings both   \
                        --run all > logs/ClipClap_ActivityNet.log &

Run training for VGGSound-GZSL:

nohup python3 main.py --cfg config/clipclap.yaml \
                        --device cuda:5 \
                        --root_dir /path/to/VGGSound  \
                        --log_dir logs/ClipClap_VGGSound \
                        --dataset_name VGGSound \
                        --epochs 15 \
                        --lr 0.0001 \
                        --use_wavcaps_embeddings True \
                        --modality both  \
                        --word_embeddings both   \
                        --run all > logs/ClipClap_VGGSound.log &

Evaluation

Evaluation can be done in two ways: either train with --run all, in which case evaluation runs automatically after training, or evaluate manually.

For manual evaluation, run the following command (a filled-in example is given after the argument list):

python3 get_evaluation.py --cfg CFG_FILE --load_path_stage_A PATH_STAGE_A --load_path_stage_B PATH_STAGE_B --dataset_name DATASET_NAME --root_dir ROOT_DIR

arguments:
--cfg CFG_FILE is the file containing all hyperparameters for the experiments. To replicate our results, use `--cfg config/clipclap.yaml` for all three datasets.
--load_path_stage_A indicates the path that contains the trained stage 1 network.
--load_path_stage_B indicates the path that contains the trained stage 2 network.
--dataset_name {VGGSound, UCF, ActivityNet} indicates the name of the dataset.
--root_dir points to the location where the dataset is stored.
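
As a sketch, a manual evaluation call for UCF could look as follows; the stage 1 and stage 2 checkpoint paths are hypothetical and should point to the outputs of your own training run under --log_dir:

python3 get_evaluation.py --cfg config/clipclap.yaml \
                        --load_path_stage_A logs/ClipClap_UCF/stage_1 \
                        --load_path_stage_B logs/ClipClap_UCF/stage_2 \
                        --dataset_name UCF \
                        --root_dir /path/to/UCF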

Model weights

The trained models can be downloaded from here.

Results

GZSL performance on VGGSound-GZSL, UCF-GZSL, ActivityNet-GZSL

| Method         | VGGSound-GZSL | UCF-GZSL | ActivityNet-GZSL |
|----------------|---------------|----------|------------------|
| CJME           | 7.45          | 25.87    | 11.64            |
| AVGZSLNET      | 4.71          | 42.67    | 12.70            |
| AVCA           | 11.26         | 36.69    | 21.76            |
| Hyper-multiple | 11.87         | 41.56    | 20.90            |
| Proposed       | 16.18         | 55.97    | 27.93            |

ZSL performance on VGGSound-GZSL, UCF-GZSL, ActivityNet-GZSL

| Method         | VGGSound-GZSL | UCF-GZSL | ActivityNet-GZSL |
|----------------|---------------|----------|------------------|
| CJME           | 6.84          | 20.46    | 9.92             |
| AVGZSLNET      | 5.44          | 35.66    | 12.39            |
| AVCA           | 8.16          | 38.67    | 20.88            |
| Hyper-multiple | 8.47          | 40.28    | 22.18            |
| Proposed       | 11.53         | 46.96    | 22.76            |

Extracting Features from Scratch

Install all required dependencies into a new virtual environment via conda.

conda env create -f clipclap_feature_extraction.yml
conda activate clipclap_feature_extraction

Place the model weights from WavCaps in the following directories:

WavCaps/retrieval/pretrained_models/audio_encoders/HTSAT_BERT_zero_shot.pt
WavCaps/retrieval/pretrained_models/audio_encoders/HTSAT.ckpt

The files can be downloaded from the WavCaps repository.
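
For example, assuming the two checkpoint files were downloaded into the current working directory (download location hypothetical), they can be moved into place as follows:

mkdir -p WavCaps/retrieval/pretrained_models/audio_encoders
mv HTSAT_BERT_zero_shot.pt HTSAT.ckpt WavCaps/retrieval/pretrained_models/audio_encoders/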

To extract the CLIP/CLAP features on your own, run the scripts in the clip_feature_extraction folder as follows:

python3 clip_feature_extraction/get_clip_features_activitynet.py
python3 clip_feature_extraction/get_clip_features_ucf.py
python3 clip_feature_extraction/get_clip_features_vggsound.py

Given the files extracted by the above scripts, run the following command to obtain the CLIP/CLAP features (a filled-in example follows the argument list):

python3 splitting_scripts_cls/create_pkl_files_cls.py --dataset_name DATASET_NAME --path_original_dataset PATH_ORIGINAL_DATASET --path_splitted_dataset PATH_SPLITTED_DATASET

arguments:
--dataset_name: name of the dataset.
--path_original_dataset: the path where the above scripts (those in `clip_feature_extraction`) have stored the extracted features.
--path_splitted_dataset: the path where the processed dataset will be stored.
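
A filled-in call might look like this (all paths are hypothetical):

python3 splitting_scripts_cls/create_pkl_files_cls.py \
        --dataset_name UCF \
        --path_original_dataset /path/to/extracted_features \
        --path_splitted_dataset /path/to/processed_features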

To obtain the class embeddings, run the following scripts:

python3 clip_embeddings_extraction/get_clip_embeddings_activitynet.py
python3 clip_embeddings_extraction/get_clip_embeddings_ucf.py
python3 clip_embeddings_extraction/get_clip_embeddings_vggsound.py

Project structure

src - Contains the code used throughout the project for dataloaders/models/training/testing.
WavCaps - Contains the code for the CLAP network.
clip_feature_extraction - Contains the code used to extract the CLIP/CLAP features from all 3 datasets.
clip_embeddings_extraction - Contains the code used to extract the CLIP and CLAP class embeddings from all 3 datasets.
splitting_scripts_cls - Contains the scripts for splitting the dataset into the required structure.

References

If you find this code useful, please consider citing:

@inproceedings{kurzendoerfer2024clipclap,
  author    = {Kurzendörfer, David and Mercea, Otniel-Bogdan and Koepke, A. Sophia and Akata, Zeynep},
  title     = {Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models},
  booktitle = {Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year      = {2024}
}