This repository contains the training and evaluation code for the paper Infusing fine-grained visual knowledge to Vision-Language Models.
The codebase builds on Universal Image Embeddings / UnED and uses Scenic, JAX/Flax, Grain, and supporting components from Big Vision and TIPS.
The main workflow in this repo is:
- Prepare image datasets and metadata.
- Convert the datasets into array_record files for Grain-based loading.
- Download pretrained backbone checkpoints and optional pretrained descriptors.
- Optionally extract pretrained train descriptors for the target-dataset plus generic-dataset pair.
- Train a universal embedding model with a SigLIP- or TIPS-based ViT backbone.
- Run test-set kNN and optional text evaluation from the saved best checkpoint.
.
├── universal_embedding/
│   ├── main.py # Training entrypoint
│   ├── knn_main.py # Standalone kNN evaluation entrypoint
│   ├── app.py # Shared CLI / runtime wrapper
│   ├── grain_datasets.py # Grain dataset builders
│   ├── classification_with_knn_eval_trainer.py
│   ├── model_init.py
│   ├── train_eval_steps.py
│   ├── knn_utils.py
│   ├── text_eval_utils.py
│   └── configs/
│       ├── config_train_vit.py
│       ├── config_knn_vit.py
│       └── config_knn_vit_no_finetune.py
├── prepare_data.sh # Example conversion commands for dataset records
├── download_data.sh # Downloads pretrained checkpoints / features
├── setup.sh # Environment bootstrap script
├── scripts/ # Main experiment launchers
└── convert_to_array_record.py
- Fine-tuning on a target fine-grained dataset while optionally infusing generic visual data such as our_imagenet_split.
- ViT backbones initialized from SigLIP or TIPS checkpoints.
- Grain dataloading from array_record shards.
- Periodic and final kNN evaluation during training.
- Standalone checkpoint evaluation with universal_embedding/knn_main.py.
- Optional text evaluation hooks for paired image-text benchmarks.
- Multi-device JAX execution through the standard Scenic/JAX runtime.
The repository assumes a Linux environment with Python 3.10 and a working JAX setup.
At a minimum you will need:
- Python 3.10
- venv
- JAX / Flax
- Scenic
- Grain
- TensorFlow and tensorflow_text
- wandb if you want experiment tracking
The included setup.sh script bootstraps a local environment and installs the main dependencies. It also clones Scenic, Big Vision, and TIPS during setup.
Important notes about setup.sh:
- It uses sudo apt commands.
- It creates a virtual environment named scenic_venv.
- It installs jax[tpu] by default, so you will likely want to adjust it if you are running on GPU or CPU only.
- It clones big_vision and tips next to the repository, as siblings of infusing, because the launch scripts add the parent directory to PYTHONPATH.
If you prefer a manual setup, use setup.sh as a starting point rather than assuming it is portable as-is.
The training configs in this repository default to the following fine-grained datasets:
- food2k
- cars
- sop
- inshop
- inat
The code expects:
- raw images on disk
- JSON metadata files per dataset split under an info_files directory
- array_record shards per split for training / evaluation
For evaluation and some initialization paths, you will also need pretrained checkpoints and optionally pre-extracted descriptors.
If you enable image-text evaluation, you will also need the text-image evaluation assets under data/text_image/.
This project uses Grain with array_record files rather than reading raw image folders directly during training.
The helper scripts are:
- convert_to_array_record.py
- prepare_data.sh
convert_to_array_record.py supports both:
- --input_format=json for datasets described by split JSON files
- --input_format=folder for ImageNet-style class-per-folder directories
prepare_data.sh is now a thin wrapper around that unified converter. It defines reusable shell functions for JSON-based datasets and ImageNet-style folder scans, plus a few example conversions you can uncomment.
In practice, you should:
- Prepare the dataset JSON split files under your metadata directory.
- Update the paths in prepare_data.sh or call convert_to_array_record.py directly.
- Generate array_record shards for each split you plan to train or evaluate on.
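To make the folder input format concrete, the sketch below shows how a class-per-folder directory can be mapped to (image path, label) pairs. It is only an illustration of the convention, not the code in convert_to_array_record.py: the function name scan_class_folders, the extension list, and the alphabetical label assignment are all assumptions.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def scan_class_folders(root):
    """Return (examples, class_names) for a class-per-folder image directory.

    Each immediate subdirectory of `root` is treated as one class, and labels
    are assigned by alphabetical order of the folder names (an assumption).
    """
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    label_of = {name: idx for idx, name in enumerate(class_names)}

    examples = []
    for name in class_names:
        for path in sorted((root / name).rglob("*")):
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
                examples.append((str(path), label_of[name]))
    return examples, class_names
```

Calling scan_class_folders on your image root gives a quick sanity check of class and image counts before you run the actual conversion.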
The dataset loader builds record paths from universal_embedding/dataset_infos.py. A typical local layout looks like:
data/
├── array_records/
│   ├── cars/
│   │   ├── train/
│   │   ├── val/
│   │   └── test/
│   ├── food2k/
│   ├── inat/
│   ├── inshop/
│   └── sop/
├── info_files/
│   ├── cars/
│   │   ├── train.json
│   │   ├── val.json
│   │   └── test.json
│   └── ...
└── models/
    ├── siglip/
    └── tips/
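A small check like the one below can confirm that a local data directory matches this layout before launching a run. It is a hedged sketch: the helper name missing_layout_paths is hypothetical, and it assumes every dataset uses the same train/val/test structure shown for cars, which the layout above does not spell out for the other datasets.

```python
from pathlib import Path

DATASETS = ("cars", "food2k", "inat", "inshop", "sop")
SPLITS = ("train", "val", "test")  # assumed identical for every dataset
BACKBONES = ("siglip", "tips")


def missing_layout_paths(data_root, datasets=DATASETS, splits=SPLITS,
                         backbones=BACKBONES):
    """Return the expected paths under `data_root` that do not exist."""
    data_root = Path(data_root)
    expected = [data_root / "models" / name for name in backbones]
    for name in datasets:
        for split in splits:
            expected.append(data_root / "array_records" / name / split)
            expected.append(data_root / "info_files" / name / f"{split}.json")
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    for path in missing_layout_paths("data"):
        print(f"missing: {path}")
```

Pass a narrower datasets or backbones tuple if you only work with a subset, for example a single target dataset and a SigLIP backbone.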
You will need to point the config fields below at your local paths:
- train_dataset_dir
- eval_dataset_dir
- info_files_dir
- pretrained_ckpt_dir
- pretrained_train_descriptors_dir when applicable
The helper script download_data.sh downloads the public assets from:
https://login.rci.cvut.cz/~ypsilnik/infusing_data
and places them into the repository data/ directory.
It provides:
- model checkpoints under data/models/
- dataset metadata JSON files under data/info_files/
- public ArrayRecord files under data/array_records/
- text-image evaluation assets under data/text_image/
Run:
bash download_data.sh
The script still does not download:
- Precomputed descriptor features
If you use pretrained descriptor distillation or text-evaluation features, make sure those assets exist and update the config paths accordingly.
The training and evaluation scripts enable image-text retrieval evaluation through:
data/text_image/flickr30k
data/text_image/mscoco
These paths are passed through --config.text_datasets in the shell scripts.
For each text-image dataset, the code expects:
- queries.tfrecord
- <dataset_name>_text_embeddings_siglip.npy or <dataset_name>_text_embeddings_tips.npy
- <dataset_name>_gt.npy
For example:
data/text_image/flickr30k/
├── queries.tfrecord
├── flickr30k_text_embeddings_siglip.npy
├── flickr30k_text_embeddings_tips.npy
└── flickr30k_gt.npy
data/text_image/mscoco/
├── queries.tfrecord
├── mscoco_text_embeddings_siglip.npy
├── mscoco_text_embeddings_tips.npy
└── mscoco_gt.npy
If these assets are not available, disable text evaluation in your run configuration or script invocation.
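One way to decide whether text evaluation can stay enabled is to check for the three expected files per dataset directory. The sketch below does that; the helper name missing_text_eval_assets is hypothetical, and it assumes that <dataset_name> equals the directory name, as in the flickr30k and mscoco examples above.

```python
from pathlib import Path


def missing_text_eval_assets(dataset_dir, backbone):
    """Return the expected text-evaluation files missing from `dataset_dir`.

    `backbone` is "siglip" or "tips" and selects which text-embedding file
    is required. The dataset name is taken from the directory name.
    """
    if backbone not in ("siglip", "tips"):
        raise ValueError(f"unexpected backbone: {backbone!r}")
    dataset_dir = Path(dataset_dir)
    name = dataset_dir.name
    required = [
        "queries.tfrecord",
        f"{name}_text_embeddings_{backbone}.npy",
        f"{name}_gt.npy",
    ]
    return [f for f in required if not (dataset_dir / f).is_file()]


if __name__ == "__main__":
    for directory in ("data/text_image/flickr30k", "data/text_image/mscoco"):
        print(directory, missing_text_eval_assets(directory, "siglip"))
```

An empty list for every dataset means the assets for that backbone are in place.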
The main user-facing entrypoints in this repository are the shell scripts in scripts.
Run them from the repository root:
bash scripts/<script_name>.sh ...
Available scripts:
- scripts/one_domain_pipeline.sh: full 3-step pipeline for a target-domain run: descriptor extraction, training, then final test evaluation.
- scripts/baseline.sh: baseline training run on finetuning_dataset,generic_dataset, without the proposed regularization terms used in the full method.
- scripts/full_method.sh: training run with pretrained embedding and weight regularization enabled.
- scripts/evaluation.sh: standalone kNN and text evaluation for a saved training run.
- scripts/extract_pretrained_feats.sh: extract descriptors from an off-the-shelf pretrained model.
The scripts accept an optional first argument for wandb_entity. Pass an empty string if you do not want to log to Weights & Biases.
The main public entrypoint for a complete experiment is scripts/one_domain_pipeline.sh, which performs:
- descriptor extraction for finetuning_dataset,generic_dataset
- full-method training
- final test-set kNN evaluation on the saved best checkpoint
Example public run:
bash scripts/one_domain_pipeline.sh "" sop our_imagenet_split siglip 3 cars,sop,inshop,inat,imagenet,food2k round_robin 764 1000 10000
This example fine-tunes on sop, uses our_imagenet_split as the generic source dataset, initializes from a SigLIP backbone, extracts pretrained train descriptors, trains the full method, and then evaluates on cars,sop,inshop,inat,imagenet,food2k.
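If you drive experiments from Python, the same invocation can be assembled as an argument list. The sketch below passes the positional values from the example through verbatim and attaches no meaning to the ones the text above does not describe; pipeline_command is a hypothetical helper, not part of the repository.

```python
import shlex

PIPELINE_SCRIPT = "scripts/one_domain_pipeline.sh"


def pipeline_command(positional_args):
    """Build the argv list for the pipeline script from positional values."""
    return ["bash", PIPELINE_SCRIPT, *positional_args]


# Values copied from the example run above, in the same order.
example_args = [
    "",                    # wandb_entity (empty string disables logging)
    "sop",                 # fine-tuning dataset
    "our_imagenet_split",  # generic dataset
    "siglip",              # backbone initialization
    "3",
    "cars,sop,inshop,inat,imagenet,food2k",  # evaluation datasets
    "round_robin",
    "764",
    "1000",
    "10000",
]

command = pipeline_command(example_args)
print(shlex.join(command))

# To launch from the repository root:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list rather than a single shell string keeps the empty first argument intact without any quoting.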
Use scripts/one_domain_pipeline.sh when you want the full intended workflow but with your own datasets, schedule, and weights.
Use scripts/baseline.sh when you want to train the comparison baseline without the proposed regularization.
Use scripts/full_method.sh directly when you already have pretrained train descriptors extracted and only want to train while monitoring validation kNN during training.
Use scripts/evaluation.sh directly when you already have a trained run and only want final test evaluation.
Use scripts/extract_pretrained_feats.sh directly when you only need off-the-shelf descriptors.
The primary configuration files are:
- universal_embedding/configs/config_train_vit.py
- universal_embedding/configs/config_knn_vit.py
- universal_embedding/configs/config_knn_vit_no_finetune.py
The default training config is set up for:
- model class: siglip_vit_with_embedding
- model type: B/16
- datasets: food2k,cars,sop,inshop,inat
- training epochs: 10
- batch size: 128
- kNN evaluation enabled during training
Before launching a run, review at least these fields:
- model_class
- model_type
- dataset_name
- knn_eval_names
- train_dataset_dir
- eval_dataset_dir
- info_files_dir
- pretrained_ckpt_dir
- train_dir
- batch_size
- num_training_epochs
- use_grain_dataloader
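Because several of these path fields are empty by default, a quick pre-launch check can catch an unset value early. The sketch below works on a plain dictionary; the repository's config objects may be a different type, and unset_path_fields is a hypothetical helper.

```python
# Path-like fields taken from the review list above.
PATH_FIELDS = (
    "train_dataset_dir",
    "eval_dataset_dir",
    "info_files_dir",
    "pretrained_ckpt_dir",
    "train_dir",
)


def unset_path_fields(config, fields=PATH_FIELDS):
    """Return the names of fields that are missing, None, or empty strings."""
    return [name for name in fields if not config.get(name)]


if __name__ == "__main__":
    # Illustrative values only; replace them with your own paths.
    config = {
        "train_dataset_dir": "data/array_records",
        "eval_dataset_dir": "data/array_records",
        "info_files_dir": "data/info_files",
        "pretrained_ckpt_dir": "",
    }
    print(unset_path_fields(config))
```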
The full method depends on pretrained train descriptors produced from an off-the-shelf model.
The expected flow is:
- scripts/extract_pretrained_feats.sh writes descriptors under:
data/exps/experiments/off-the-shelf/features/<pretraining>_vitB_pretrained_embeddings/descriptors/0/backbone_out_embedd/
- Inside that directory, descriptors are stored per dataset and split, for example:
data/exps/experiments/off-the-shelf/features/siglip_vitB_pretrained_embeddings/descriptors/0/backbone_out_embedd/
├── sop/train.npy
└── our_imagenet_split/train.npy
- scripts/full_method.sh loads those train.npy files through --config.pretrained_train_descriptors_dir.
If those descriptor files already exist for your finetuning_dataset,generic_dataset pair, you can skip extraction and launch training directly.
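The check for existing descriptors can be scripted from the path pattern above. This is a sketch with hypothetical helper names (train_descriptor_path, can_skip_extraction); it assumes the default features root shown above and only tests that the files exist, not that their contents are valid.

```python
from pathlib import Path

# Default location written by scripts/extract_pretrained_feats.sh (see above).
FEATURES_ROOT = Path("data/exps/experiments/off-the-shelf/features")


def train_descriptor_path(pretraining, dataset, features_root=FEATURES_ROOT):
    """Return the expected train.npy path for one dataset."""
    return (
        Path(features_root)
        / f"{pretraining}_vitB_pretrained_embeddings"
        / "descriptors"
        / "0"
        / "backbone_out_embedd"
        / dataset
        / "train.npy"
    )


def can_skip_extraction(pretraining, finetuning_dataset, generic_dataset,
                        features_root=FEATURES_ROOT):
    """True if train descriptors exist for both datasets of the pair."""
    return all(
        train_descriptor_path(pretraining, name, features_root).is_file()
        for name in (finetuning_dataset, generic_dataset)
    )


if __name__ == "__main__":
    print(can_skip_extraction("siglip", "sop", "our_imagenet_split"))
```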
Training runs through universal_embedding/main.py, but in normal use you should launch it through the training scripts in scripts/: scripts/one_domain_pipeline.sh, scripts/baseline.sh, or scripts/full_method.sh.
During training, the repository runs kNN evaluation on the validation split according to the configured cadence, so you can monitor validation behavior without running the final evaluation script separately.
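For readers unfamiliar with kNN evaluation of embeddings, the sketch below computes recall@1 with cosine similarity in plain Python. It is not the implementation in universal_embedding/knn_utils.py: the choice of metric, the separate query and index sets, and the function names are assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def recall_at_1(query_embs, query_labels, index_embs, index_labels):
    """Fraction of queries whose nearest index embedding shares their label."""
    if not query_embs or not index_embs:
        raise ValueError("query and index sets must be non-empty")
    index_unit = [_normalize(v) for v in index_embs]
    hits = 0
    for emb, label in zip(query_embs, query_labels):
        q = _normalize(emb)
        # Cosine similarity between the query and every index embedding.
        sims = [sum(a * b for a, b in zip(q, v)) for v in index_unit]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(index_labels[best] == label)
    return hits / len(query_embs)
```

In practice this search would be batched on an accelerator; the loop form is kept here only for readability.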
During startup, the shared wrapper in universal_embedding/app.py saves the resolved configuration to:
<workdir>/config.json
That saved config is later reused by standalone kNN evaluation.
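The saved file is ordinary JSON, so it can be inspected without any of the training dependencies. The sketch below loads it and picks out a few fields; the helper names are hypothetical, and the exact keys present in the file depend on your run, so missing keys are simply skipped.

```python
import json
from pathlib import Path


def load_saved_config(workdir):
    """Load <workdir>/config.json as a Python dictionary."""
    with open(Path(workdir) / "config.json", "r", encoding="utf-8") as f:
        return json.load(f)


def summarize(config, keys=("model_class", "model_type", "dataset_name",
                            "batch_size", "num_training_epochs")):
    """Return only the requested keys that are present in the config."""
    return {key: config[key] for key in keys if key in config}


if __name__ == "__main__":
    import sys

    # Usage: python inspect_config.py <workdir>
    print(summarize(load_saved_config(sys.argv[1])))
```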
Standalone evaluation runs through universal_embedding/knn_main.py, and the normal entrypoint is scripts/evaluation.sh.
In the intended one-domain workflow, this script is the final pipeline step and evaluates the test split using the best checkpoint saved by training.
For evaluation, make sure the config points to the training directory:
- train_dir: directory containing the checkpoints and the saved config.json
Behavior to know:
- If only_best_knn=True, the script evaluates only the best checkpoint.
- If only_best_knn=False, it evaluates checkpoints over the configured epoch range.
- If test_pretrained_features=True, it evaluates the initialized model before checkpoint restoration.
- If no_finetune=True, the script can evaluate a pretrained backbone without loading finetuned checkpoints.
To extract pretrained descriptors without a finetuned run, use scripts/extract_pretrained_feats.sh.
The runtime uses CLU metric writers and can optionally initialize Weights & Biases.
Outputs are written under the specified workdir, including:
- saved config.json
- TensorBoard-compatible event files and summaries
- checkpoints
- optional saved descriptors
- optional nearest-neighbor outputs
The event files written to workdir can be viewed directly with TensorBoard. If Weights & Biases logging is enabled, the same run is also synced there for remote logging and visualization.
- setup.sh is not fully self-contained. Review and adapt it before running.
- download_data.sh is the actual script for model downloads; the previous README referred to download_models.sh, which does not exist in this repository.
- The training and evaluation configs leave several path fields empty by default. They must be filled in for your environment.
- prepare_data.sh is a helper wrapper, not a full data-ingestion pipeline.
- Some features referenced by the configs, such as pretrained descriptors and text-evaluation assets, are expected to exist externally.
If you use this repository in academic work, cite:
@inproceedings{ypsilantis2025infusing,
title={Infusing fine-grained visual knowledge to Vision-Language Models},
author={Ypsilantis, Nikolaos-Antonios and Chen, Kaifeng and Araujo, Andr{\'e} and Chum, Ondrej},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={4226--4235},
year={2025}
}
This codebase builds on and/or depends on:
- Scenic
- Universal Image Embeddings / UnED
- Big Vision
- TIPS
- JAX / Flax
- Grain