CATVis

Context-Aware Thought Visualization

A cutting-edge framework that transforms brain activity into visual representations using a spatio-temporal convolutional neural network, a transformer-based encoder, contrastive language-image pre-training (CLIP), and Stable Diffusion. Our system achieves state-of-the-art performance in both EEG-based classification and image generation accuracy.

EEG Classification Accuracy: 61.09%
Image Reconstruction Accuracy: 52.65%
Inception Score: 37.50
Fréchet Inception Distance (FID): 80.30

Research paper accepted for publication at MICCAI 2025, a CORE A-ranked conference held in South Korea.

Authors

1. Lahore University of Management Sciences, Lahore, Pakistan
2. Forman Christian College (A Chartered University), Lahore, Pakistan
3. Arbisoft, Lahore, Pakistan

These authors contributed equally to this work and are co-corresponding authors.

Abstract

EEG-based brain-computer interfaces (BCIs) have shown promise in various applications, such as motor imagery and cognitive state monitoring. However, decoding visual representations from EEG signals remains a significant challenge due to their complex and noisy nature. We thus propose a novel 5-stage framework for decoding visual representations from EEG signals: (1) an EEG encoder for concept classification, (2) cross-modal alignment of EEG and text embeddings in CLIP feature space, (3) caption refinement via re-ranking, (4) weighted interpolation of concept and caption embeddings for richer semantics, and (5) image generation using a pre-trained Stable Diffusion model. We enable context-aware EEG-to-image generation through cross-modal alignment and re-ranking. Experimental results demonstrate that our method generates high-quality images aligned with visual stimuli, outperforming SOTA approaches by 27.08% in Classification Accuracy and 15.21% in Generation Accuracy, while reducing Fréchet Inception Distance by 36.61%, indicating superior semantic alignment and image quality.

Methodology

Figure: Overview of the proposed five-stage CATVis framework for reconstructing images from EEG signals.

1. Supervised EEG Encoder: a supervised EEG Conformer performs concept classification.
2. Cross-Modal Alignment: EEG and text embeddings are aligned in CLIP feature space.
3. Caption Refinement: candidate captions are re-ranked for improved descriptions.
4. Weighted Interpolation: concept and caption embeddings are dynamically balanced.
5. Image Generation: a pre-trained Stable Diffusion model produces the visual output.
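To make the data flow concrete, the sketch below chains the five stages into a single inference function. It is a minimal, hypothetical outline and not the released CATVis code: the interface of `encoder`, the prompt template, the default weight `w=0.6`, and the helper `encode_prompt` are illustrative assumptions, and it presumes a trained EEG Conformer, precomputed CLIP embeddings for the candidate captions, and a diffusers `StableDiffusionPipeline`.

```python
import torch
import torch.nn.functional as F


def encode_prompt(sd_pipe, text):
    """Token-level CLIP text embeddings from the Stable Diffusion pipeline's
    own text encoder (shape: 1 x 77 x hidden_dim)."""
    ids = sd_pipe.tokenizer(
        text,
        padding="max_length",
        max_length=sd_pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(sd_pipe.device)
    return sd_pipe.text_encoder(ids)[0]


@torch.no_grad()
def catvis_generate(eeg, encoder, caption_embs, captions, class_names, sd_pipe, w=0.6):
    """Hypothetical end-to-end pass over the five stages (all names illustrative).

    eeg          : preprocessed EEG trial, shape (channels, time)
    encoder      : trained EEG Conformer returning (class_logits, eeg_embedding)
    caption_embs : precomputed CLIP embeddings of candidate captions, shape (N, D)
    captions     : the N candidate caption strings
    class_names  : concept label for each class index
    sd_pipe      : a diffusers StableDiffusionPipeline
    w            : interpolation weight between concept and caption conditioning
    """
    # Stage 1: concept classification with the supervised EEG encoder.
    logits, eeg_emb = encoder(eeg.unsqueeze(0))
    concept = class_names[logits.argmax(dim=-1).item()]

    # Stages 2-3: the EEG embedding lives in CLIP space, so candidate captions
    # can be re-ranked by cosine similarity and the best match kept.
    sims = F.cosine_similarity(eeg_emb, caption_embs, dim=-1)
    caption = captions[sims.argmax().item()]

    # Stage 4: weighted interpolation of concept and caption conditioning.
    concept_emb = encode_prompt(sd_pipe, f"a photo of a {concept}")
    caption_emb = encode_prompt(sd_pipe, caption)
    cond = w * concept_emb + (1.0 - w) * caption_emb

    # Stage 5: generate with Stable Diffusion conditioned on the blended embedding
    # (recent diffusers versions accept precomputed prompt_embeds).
    return sd_pipe(prompt_embeds=cond).images[0]
```

The weight w in this sketch controls the concept-vs-context trade-off discussed under Key Contributions below.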

Figure: Cross-modal alignment of EEG and text embeddings in CLIP feature space.
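Conceptually, this alignment step can be trained with a CLIP-style contrastive objective: EEG embeddings are pulled toward the CLIP text embeddings of their own stimuli and pushed away from the other pairs in the batch. The snippet below is a generic symmetric InfoNCE loss of that kind, shown only as a sketch; the exact objective and temperature used in CATVis may differ.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(eeg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss between a batch of EEG embeddings and the CLIP
    text embeddings of their matching stimuli (both of shape (B, D))."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = eeg_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(eeg_emb.size(0), device=eeg_emb.device)
    # Matching EEG/text pairs lie on the diagonal; treat both directions
    # (EEG -> text and text -> EEG) as classification problems.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings (batch of 8, 512-dimensional CLIP space).
loss = clip_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```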

Key Contributions

Context-Aware Generation

Enable context-aware EEG-to-image generation through cross-modal alignment and re-ranking, boosting generation accuracy by 15.21% over baselines.

Supervised Learning

Achieve 27.08% higher classification accuracy using a supervised EEG Conformer, demonstrating that robust EEG representations can be learned without extensive pretraining.

Semantic Fidelity

Modulate the emphasis on concept vs. context, increasing the semantic fidelity of reconstructed images and aligning them more closely with the true visual content.
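The third contribution comes down to one small operation in embedding space: a weighted blend of the concept (class) conditioning and the caption (context) conditioning. The toy snippet below illustrates the idea with random tensors; the embedding shapes and the specific weights are placeholders, not values from the paper.

```python
import torch

def blend(concept_emb, caption_emb, w):
    """Weighted interpolation: w = 1 keeps only the concept embedding,
    w = 0 keeps only the caption embedding."""
    return w * concept_emb + (1.0 - w) * caption_emb

# Illustrative sweep: larger w emphasizes the object class, smaller w the context.
concept_emb = torch.randn(1, 77, 768)   # placeholder concept conditioning
caption_emb = torch.randn(1, 77, 768)   # placeholder caption conditioning
conditionings = {w: blend(concept_emb, caption_emb, w) for w in (0.25, 0.5, 0.75)}
```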

Experimental Results

Figure: Comparison of generated images (Samples 1 and 2) against ground truth (GT) across various object categories.

The qualitative samples visually affirm our quantitative results. The generated images not only resemble the visual stimuli in terms of object identity but also capture surrounding context and attributes, such as shape, color, and spatial composition. This level of detail, rarely observed in prior EEG-to-image works, validates the core premise of our framework: that context-aware thought visualization is achievable through careful alignment of EEG semantics and generative priors.

27.08% higher classification accuracy, outperforming SOTA approaches
15.21% higher generation accuracy, reflecting improved image generation quality
36.61% reduction in FID, indicating better semantic alignment and image quality

These metrics indicate superior semantic alignment between generated images and true visual stimuli, as well as significantly improved image quality. The framework successfully generates context-aware images that capture both the conceptual class and descriptive details of the visual content.
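For reference, Inception Score and FID are standard generative-image metrics (higher IS and lower FID are better). The snippet below only illustrates how such metrics could be computed with torchmetrics on generated and ground-truth images; it is not the paper's evaluation code, and the dummy tensors stand in for real image batches.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance   # needs torchmetrics[image]
from torchmetrics.image.inception import InceptionScore

# Dummy stand-ins for ground-truth and generated images: uint8, (N, 3, 299, 299).
real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())            # lower is better

inception = InceptionScore()
inception.update(fake)
is_mean, is_std = inception.compute()
print("Inception Score:", is_mean.item())      # higher is better
```

In practice, far more than 16 images per set are needed for stable estimates of either metric.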

Conclusion

Our CATVis framework successfully addresses the challenge of context-aware EEG-to-image generation through innovative cross-modal alignment and re-ranking strategies. The significant improvements in classification accuracy (27.08%), generation accuracy (15.21%), and image quality (36.61% FID reduction) demonstrate the effectiveness of our approach in bridging the gap between brain signals and visual representations.