\[
\text{Attention}_l = \text{softmax}\!\left(\frac{Q_l K_l^\top}{\sqrt{d_k}}\right) V_l, \quad l \in \{\text{Act}, \text{Scene}, \text{Dialogue}\}
\]

To align modalities, the loss encourages matching pairs (text-image, text-audio) to have higher cosine similarity than mismatched pairs:
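
One way such an objective could be realized is a margin-based hinge loss over cosine similarities. The sketch below is only an illustration under that assumption and may differ from the loss actually used: the function name `hinge_alignment_loss`, the `margin` value, and the use of other items in the batch as mismatched pairs are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hinge_alignment_loss(text_embs, other_embs, margin=0.2):
    """Mean hinge penalty over all mismatched pairs (hypothetical form).

    text_embs[i] and other_embs[i] are assumed to be a matching pair;
    other_embs[j] with j != i is treated as a mismatch for text i.
    The penalty is zero once the matching similarity exceeds every
    mismatched similarity by at least `margin`.
    """
    n = len(text_embs)
    total, count = 0.0, 0
    for i in range(n):
        pos = cosine(text_embs[i], other_embs[i])
        for j in range(n):
            if j == i:
                continue
            neg = cosine(text_embs[i], other_embs[j])
            total += max(0.0, margin - pos + neg)
            count += 1
    return total / count if count else 0.0


# Toy 2-D embeddings: rows with the same index form a matching pair.
text = [[1.0, 0.0], [0.0, 1.0]]
image_aligned = [[1.0, 0.0], [0.0, 1.0]]
image_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = hinge_alignment_loss(text, image_aligned)
loss_swapped = hinge_alignment_loss(text, image_swapped)
```

In this toy setup the aligned embeddings incur no penalty, while the swapped ones are penalized because each text vector is closer to a mismatched image than to its own. The same function could be applied to text-audio pairs and the two terms summed.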
