"triggers" explores the profound parallel between trauma's invisible rewiring of human perception and selective modifications of artificial neural networks. This work examines how traumatic experiences fundamentally alter our processing of sensory input—where ordinary stimuli like locations, smells, or sounds become powerful triggers that activate overwhelming emotional and physiological responses.
Just as trauma creates these heightened associative pathways in the human mind, where certain sensory channels become hypersensitive while others may become muted, this artwork intervenes in the attention mechanisms of the CLIP image encoder to simulate this perceptual rewiring. By selectively amplifying and silencing specific attention heads within the neural network, the work makes visible the invisible yet profound alteration of perception that trauma survivors experience.
Historically, self-portraiture has served as a powerful vehicle for expressing psychological states and emotional distress. Artists have long turned to the human face as a canvas for exploring interior states. This work draws inspiration from this tradition, using computational manipulation to fragment and distort facial representation in ways that echo how trauma disrupts one's self-perception and relationship to others. The resulting imagery invites viewers to contemplate how trauma fundamentally changes our relationship with the world around us, creating invisible but powerful filters through which all subsequent experience must pass. Through this analogy, "triggers" offers a visualization of trauma's enduring impact—revealing how certain channels of perception become permanently heightened while others fade into silence, fundamentally altering our experience of reality without leaving visible traces.
The grid structure, with stark variations between adjacent portraits, seeks to create a state of visual overstimulation. The viewer's gaze is pulled in multiple competing directions simultaneously, unable to settle on a single coherent representation—mirroring the overwhelming cognitive load experienced during trauma responses, where normal integrative processing breaks down under the weight of competing sensory information.
The inspiration for this work stems from my personal experiences with trauma. Post-traumatic responses often manifest as overwhelming sensations that either silence everything else or amplify it to unbearable levels. What fascinates me is how invisible the aftermath of trauma is on the surface, while internally my processing of certain places, sounds, and visuals has been completely changed. This is what I wanted to visualize and externalize.
CLIP (Contrastive Language-Image Pre-training) encoders are neural networks trained on diverse image-text pairs, creating a powerful bridge between visual content and textual descriptions. The architecture combines a vision encoder and a text encoder, jointly trained to align embeddings from both modalities in a shared representation space. Within CLIP's transformer architecture, attention mechanisms play a crucial role in processing and interpreting visual information.
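The idea of aligning both modalities in a shared space can be sketched with a toy contrastive objective. This is a minimal NumPy illustration, not CLIP's training code: the function names are hypothetical, and the temperature is fixed here as an assumption, whereas CLIP learns it during training.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Unit-length rows, so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    Row i of image_emb is assumed to belong with row i of text_emb.
    The fixed temperature is an assumption made for this sketch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss is small when each image embedding is closest to its own caption and grows when pairs are mismatched, which is what pulls the two encoders toward a shared representation space.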
The exploration of semantically meaningful directions in latent spaces began with GANs (Goodfellow et al., 2014; Karras et al., 2019), where researchers discovered that moving along specific trajectories enabled controlled image editing operations (Shen et al., 2020). CLIP (Radford et al., 2021) revolutionized this approach by unifying visual and textual representations, enabling text-guided manipulation and spawning numerous applications—from finding traversal directions that align images with text descriptions to enabling zero-shot domain adaptation.
Building on this foundation, Gandelsman et al. recently conducted a granular investigation of CLIP's image encoder, analyzing how individual components contribute to the final representation. Their work characterized each attention head's specific role by algorithmically identifying text representations spanning its output space, revealing that many heads specialize in particular properties such as location or shape.
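One way to picture how text representations can be found that span a head's output space is a greedy selection loop. The sketch below is a simplified illustration under stated assumptions, not the authors' exact procedure: `greedy_text_directions` is a hypothetical name, the inputs are assumed to be one head's outputs over many images plus a pool of candidate text embeddings in the same space, and the scoring rule (variance explained along a normalized text direction) is a simplification.

```python
import numpy as np

def greedy_text_directions(head_outputs, text_embs, k):
    """Greedily pick k text directions that explain the most variance
    in one attention head's outputs.

    head_outputs: (num_images, dim) contributions of a single head.
    text_embs:    (num_texts, dim) candidate text embeddings.
    Returns the indices of the chosen texts, in selection order.
    """
    H = head_outputs - head_outputs.mean(axis=0, keepdims=True)
    T = text_embs.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(T, axis=1)
        # Directions that were already projected out get a score of zero.
        safe = np.where(norms > 1e-8, norms, np.inf)
        unit = T / safe[:, None]
        scores = ((H @ unit.T) ** 2).sum(axis=0)  # variance along each text
        best = int(np.argmax(scores))
        chosen.append(best)
        d = unit[best]
        # Remove the chosen direction from both the outputs and the candidates.
        H = H - np.outer(H @ d, d)
        T = T - np.outer(T @ d, d)
    return chosen
```

Reading the selected texts for a given head is what suggests its role: a head whose top directions are all place descriptions, for example, would be labeled as a location head.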
To explore the visual effects associated with these semantic roles, these insights were paired with IP-Adapter, a mechanism that conditions image generation on reference images so that their visual characteristics guide the new image. The outputs of Gandelsman et al.'s algorithm for the CLIP ViT-H encoder were analyzed, and an LLM was used to identify common themes and patterns in each attention head's outputs. A scaling operation was then implemented for each attention head's output: given the relevant attention layer and head indices, it mutes or amplifies specific heads (such as those focused on 'colors') and thereby modifies the resulting CLIP image embeddings.
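The scaling operation can be illustrated on a single toy attention layer. This is a minimal NumPy sketch rather than the code used for the work: the function name, the weight matrices, and the `head_scales` dictionary are assumptions made for illustration, and biases are omitted. In a real CLIP ViT-H encoder the same multiplication would be applied inside the chosen layers before the output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_head_scaling(x, w_q, w_k, w_v, w_o, num_heads, head_scales=None):
    """Multi-head self-attention where each head's output is multiplied by a scalar.

    x: (tokens, dim); w_q, w_k, w_v, w_o: (dim, dim), no biases in this sketch.
    head_scales: {head_index: factor}; 0.0 mutes a head, values above 1.0
    amplify it, and heads that are not listed keep a factor of 1.0.
    """
    tokens, dim = x.shape
    head_dim = dim // num_heads

    def split(t):
        # (tokens, dim) -> (heads, tokens, head_dim)
        return t.reshape(tokens, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    per_head = weights @ v  # (heads, tokens, head_dim)

    scales = np.ones(num_heads)
    for head, factor in (head_scales or {}).items():
        scales[head] = factor
    per_head = per_head * scales[:, None, None]  # the intervention itself

    merged = per_head.transpose(1, 0, 2).reshape(tokens, dim)
    return merged @ w_o
```

Because the output projection is linear, each head's contribution to the layer output simply adds up, so scaling one head changes only that head's share of the resulting embedding.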
These manipulated embeddings were then fed into a generation pipeline that uses IP-Adapter and ControlNet to create portraits. Each portrait was conditioned on both a reference face image, which maintains structural similarity across generations, and the manipulated embeddings, which carry the semantic modifications.
The grid of portraits displayed in this work demonstrates this process: each image was generated by amplifying or silencing a different group of attention heads within CLIP's vision transformer. Text prompts and face conditioning were held constant across all generations, so the manipulated CLIP embeddings were the sole variable. The grid thus visualizes how selective attention modifications parallel the perceptual alterations experienced in traumatic processing.
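The one-variable design of the grid can be written down as a small plan of interventions. Everything in this sketch is hypothetical: the group names, the (layer, head) indices, and the mute and amplify factors are placeholders, not the values used for the portraits.

```python
def build_grid(head_groups, mute=0.0, amplify=3.0):
    """Return one intervention per (group, mode) plus an unmodified baseline.

    head_groups: {group_name: [(layer, head), ...]}
    Prompt and face conditioning are not part of a variant, because they
    stay fixed for every portrait; only the head scales differ.
    """
    variants = [{"name": "baseline", "head_scales": {}}]
    for group, heads in head_groups.items():
        for mode, factor in (("muted", mute), ("amplified", amplify)):
            variants.append({
                "name": f"{group}-{mode}",
                "head_scales": {index: factor for index in heads},
            })
    return variants

# Placeholder groups: these indices are illustrative, not the real assignments.
HEAD_GROUPS = {
    "colors": [(30, 4), (31, 9)],
    "location": [(29, 2)],
}

for variant in build_grid(HEAD_GROUPS):
    print(variant["name"], variant["head_scales"])
```

Each entry would drive one cell of the grid, which keeps any visible difference between portraits attributable to the head intervention alone.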