Authors
These authors contributed equally to this work and are co-corresponding authors.
Abstract
EEG-based brain-computer interfaces (BCIs) have shown promise in applications such as motor imagery and cognitive state monitoring. However, decoding visual representations from EEG signals remains a significant challenge due to their complex and noisy nature. We therefore propose a novel five-stage framework for decoding visual representations from EEG signals: (1) an EEG encoder for concept classification, (2) cross-modal alignment of EEG and text embeddings in CLIP feature space, (3) caption refinement via re-ranking, (4) weighted interpolation of concept and caption embeddings for richer semantics, and (5) image generation using a pre-trained Stable Diffusion model. The cross-modal alignment and re-ranking stages enable context-aware EEG-to-image generation. Experimental results demonstrate that our method generates high-quality images aligned with the visual stimuli, outperforming SOTA approaches by 27.08% in Classification Accuracy and 15.21% in Generation Accuracy while reducing Fréchet Inception Distance by 36.61%, indicating superior semantic alignment and image quality.
Methodology

Overview of our proposed five-stage framework for reconstructing images from EEG signals.
1. Supervised EEG Encoder: a supervised EEG Conformer for concept classification
2. Cross-Modal Alignment: EEG and text embeddings aligned in CLIP feature space
3. Caption Refinement: re-ranking for improved descriptions
4. Weighted Interpolation: dynamic concept-caption balancing
5. Image Generation: Stable Diffusion for visual output
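
To make the data flow concrete, below is a minimal, illustrative sketch of how the five stages chain together. All components are hypothetical stubs (random tensors stand in for trained models); only the overall structure mirrors the framework described above, and shapes such as the EEG trial size, number of concept classes, and CLIP embedding width are assumptions.

```python
import torch
import torch.nn.functional as F

EMB_DIM = 768  # assumed CLIP text-embedding width

def encode_eeg(eeg):
    """Stage 1: supervised EEG encoder -> concept logits (stub)."""
    return torch.randn(eeg.shape[0], 40)          # e.g. 40 concept classes

def align_to_clip(eeg_feat):
    """Stage 2: project EEG features into CLIP feature space (stub)."""
    return F.normalize(torch.randn(eeg_feat.shape[0], EMB_DIM), dim=-1)

def rerank_captions(eeg_emb, caption_embs):
    """Stage 3: pick the candidate caption closest to the EEG embedding."""
    return (eeg_emb @ caption_embs.T).argmax(dim=-1)

def interpolate(concept_emb, caption_emb, alpha=0.5):
    """Stage 4: weighted blend of concept and caption embeddings."""
    return F.normalize(alpha * concept_emb + (1 - alpha) * caption_emb, dim=-1)

def generate_image(cond_emb):
    """Stage 5: condition a pre-trained Stable Diffusion model (stub)."""
    return torch.rand(cond_emb.shape[0], 3, 512, 512)

# Toy run over a single EEG trial; the (channels, time) shape is an assumption.
eeg = torch.randn(1, 128, 440)
eeg_emb = align_to_clip(encode_eeg(eeg))          # aligned EEG embedding stands in for the concept embedding
caption_embs = F.normalize(torch.randn(5, EMB_DIM), dim=-1)   # 5 candidate caption embeddings
best = rerank_captions(eeg_emb, caption_embs)                 # index of the best caption
cond = interpolate(eeg_emb, caption_embs[best], alpha=0.6)
image = generate_image(cond)
print(image.shape)                                            # torch.Size([1, 3, 512, 512])
```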

Cross-modal alignment showing EEG and text embedding alignment in CLIP feature space
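
As an illustration of this alignment step, the snippet below sketches a CLIP-style symmetric contrastive (InfoNCE) objective between EEG embeddings and text embeddings. The projection to CLIP width, the batch construction, and the temperature value are assumptions, not the exact training recipe used in the paper.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(eeg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling matched EEG/text pairs together in
    CLIP feature space (illustrative; the temperature is an assumption)."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = eeg_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(eeg_emb.size(0))       # i-th EEG trial matches i-th text
    loss_e2t = F.cross_entropy(logits, targets)   # EEG -> text direction
    loss_t2e = F.cross_entropy(logits.T, targets) # text -> EEG direction
    return (loss_e2t + loss_t2e) / 2

# Toy usage with random features projected to an assumed CLIP width of 768.
eeg_batch = torch.randn(8, 768)    # EEG features after projection
text_batch = torch.randn(8, 768)   # CLIP text embeddings of matching prompts
print(clip_alignment_loss(eeg_batch, text_batch).item())
```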
Key Contributions
Context-Aware Generation
Enable context-aware EEG-to-image generation through cross-modal alignment and re-ranking, boosting generation accuracy by 15.21% over baselines.
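A hedged sketch of the re-ranking idea follows: candidate captions are scored by cosine similarity to the EEG-derived embedding in CLIP space, and the closest one is kept. The CLIP checkpoint and the candidate captions are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

# Checkpoint choice is an assumption; the paper may use a different CLIP variant.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def rerank(candidates, eeg_emb):
    """Return candidate captions sorted by similarity to the EEG embedding."""
    inputs = tok(candidates, padding=True, return_tensors="pt")
    with torch.no_grad():
        text_emb = clip.get_text_features(**inputs)
    text_emb = F.normalize(text_emb, dim=-1)
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    scores = (text_emb @ eeg_emb.T).squeeze(-1)   # cosine similarity per caption
    order = scores.argsort(descending=True)
    return [(candidates[i], scores[i].item()) for i in order]

# Toy usage: a random stand-in for the EEG embedding (512-dim for this CLIP variant).
candidates = ["a brown dog running on grass",
              "a parked red car",
              "a dog lying on a sofa"]
print(rerank(candidates, torch.randn(1, 512)))
```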
Supervised Learning
Achieve 27.08% higher classification accuracy using a supervised EEG Conformer, demonstrating that robust EEG representations can be learned without extensive pretraining.
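For the supervised stage, a bare-bones training step could look like the following. The convolution-plus-transformer layout loosely mirrors an EEG Conformer, but the channel count, sequence length, layer sizes, and number of classes are placeholder assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyEEGConformer(nn.Module):
    """Down-scaled, illustrative stand-in for an EEG Conformer-style classifier.
    Dimensions (64 channels, 440 samples, 40 classes) are assumptions."""
    def __init__(self, n_channels=64, d_model=64, n_classes=40):
        super().__init__()
        # Temporal then spatial convolution front-end.
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=(1, 25), stride=(1, 5)),
            nn.Conv2d(d_model, d_model, kernel_size=(n_channels, 1)),
            nn.BatchNorm2d(d_model),
            nn.ELU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                  # x: (batch, channels, time)
        x = self.conv(x.unsqueeze(1))      # -> (batch, d_model, 1, T')
        x = x.squeeze(2).transpose(1, 2)   # -> (batch, T', d_model)
        x = self.transformer(x)
        return self.head(x.mean(dim=1))    # average pool over time tokens

# One supervised training step with cross-entropy on concept labels.
model = TinyEEGConformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
eeg = torch.randn(8, 64, 440)
labels = torch.randint(0, 40, (8,))
loss = nn.functional.cross_entropy(model(eeg), labels)
loss.backward()
opt.step()
print(f"loss: {loss.item():.3f}")
```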
Semantic Fidelity
Modulate the emphasis on concept vs. context via weighted interpolation, increasing the semantic fidelity of reconstructed images and aligning them more closely with the true visual content.
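The interpolation itself reduces to a few lines; the sketch below assumes a single blend weight w and unit-normalized CLIP embeddings, which may differ from the exact conditioning used in the paper.

```python
import torch
import torch.nn.functional as F

def blend(concept_emb, caption_emb, w=0.5):
    """Weighted interpolation in CLIP space: w -> 1 emphasizes the class
    concept, w -> 0 emphasizes the caption context. Re-normalization of the
    result is an assumption about the conditioning format."""
    return F.normalize(w * concept_emb + (1.0 - w) * caption_emb, dim=-1)

concept = F.normalize(torch.randn(1, 768), dim=-1)   # e.g. embedding of the class prompt
caption = F.normalize(torch.randn(1, 768), dim=-1)   # e.g. embedding of the re-ranked caption
for w in (0.25, 0.5, 0.75):
    cond = blend(concept, caption, w)
    # Similarity to the concept embedding grows as w increases (in this toy example).
    print(w, F.cosine_similarity(cond, concept).item())
```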
Experimental Results

Comparison of generated images (Sample 1 & 2) against ground truth (GT) across various object categories
The qualitative samples visually affirm our quantitative results. The generated images not only resemble the visual stimuli in terms of object identity but also capture surrounding context and attributes, such as shape, color, and spatial composition. This level of detail, rarely observed in prior EEG-to-image works, validates the core premise of our framework: that context-aware thought visualization is achievable through careful alignment of EEG semantics and generative priors.
Outperforming SOTA approaches: 27.08% higher classification accuracy and 15.21% higher generation accuracy
Improved image generation quality: 36.61% lower Fréchet Inception Distance
Better semantic alignment between generated images and the true visual stimuli
These metrics indicate superior semantic alignment between generated images and true visual stimuli, as well as significantly improved image quality. The framework successfully generates context-aware images that capture both the conceptual class and descriptive details of the visual content.
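For context, the snippet below shows one common way such metrics can be computed: generation accuracy via a pre-trained ImageNet classifier that checks whether each generated image is recognized as the stimulus class, and FID via torchmetrics. The paper's exact evaluation protocol (classifier, class mapping, sample counts) may differ.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchvision.models import resnet50, ResNet50_Weights

# Generation accuracy: does a pre-trained classifier assign the generated
# image to the same class as the ground-truth stimulus?
weights = ResNet50_Weights.DEFAULT
classifier = resnet50(weights=weights).eval()
preprocess = weights.transforms()

def generation_accuracy(generated, target_classes):
    """generated: float images in [0, 1], shape (N, 3, H, W);
    target_classes: ImageNet indices of the true stimulus classes."""
    with torch.no_grad():
        preds = classifier(preprocess(generated)).argmax(dim=-1)
    return (preds == target_classes).float().mean().item()

# Toy call with random images and illustrative ImageNet class indices.
print(generation_accuracy(torch.rand(4, 3, 256, 256), torch.tensor([207, 281, 207, 281])))

# FID between sets of real and generated images (expects uint8 in [0, 255]).
fid = FrechetInceptionDistance(feature=2048)
real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())
```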
Resources
Access the full research paper with detailed methodology, results, and analysis.
Explore the implementation, run experiments, and contribute to the project.
Explore in depth the foundational research that led to the development of the CATVis framework.