
Review: CC:DAE

https://arxiv.org/abs/2402.08919

Core Idea

Conceptual similarity between images can be quantified by the complexity of the natural language descriptions required to distinguish the images.

| Description | Conceptual Similarity |
| --- | --- |
| A short description that applies well to both images | High |
| A long description that fits only one image well and not the other | Low |

Optimization Problem

Original Optimization Problem

The optimization problem is written as:

$$ h^\ast=\arg\min_{h\in H}l(x\vert h)\quad s.t. \quad I(h)\le C $$

where:

  • $h$ is a possible description for an image $x$.
  • $l(x\vert h)=-\log p(x\vert h)$ is the loss function.
    • $p(x\vert h)$ is the likelihood of $x$ given $h$, i.e., the probability that $x$ is explained by $h$.
    • A higher likelihood means $h$ is a good description of $x$; a lower likelihood means it is a poor one.
  • $I(h)=-\log p_{code}(h)$ is the complexity (or coding length) of $h$, and $C$ is an upper bound on the complexity.
    • Since simpler descriptions are assigned higher coding probability, a higher $p_{code}(h)$ indicates a simpler $h$. The negative log in $I(h)$ expresses this inverse relationship.
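As a minimal sketch, the constrained problem can be brute-forced over a small candidate set. All candidate descriptions and probabilities below are made-up illustrative numbers, not from the paper:

```python
import math

# Hypothetical candidates: p(x|h) is how well h explains the image,
# p_code(h) is the coding prior (simpler descriptions get higher mass).
candidates = {
    "a dog":                {"p_x_given_h": 0.30, "p_code": 0.20},
    "a brown dog on grass": {"p_x_given_h": 0.80, "p_code": 0.02},
    "an animal":            {"p_x_given_h": 0.15, "p_code": 0.40},
}

def loss(p_x_given_h):
    return -math.log(p_x_given_h)        # l(x|h) = -log p(x|h)

def complexity(p_code):
    return -math.log(p_code)             # I(h) = -log p_code(h)

def best_description(C):
    """arg min_h l(x|h)  subject to  I(h) <= C  (brute force)."""
    feasible = {h: v for h, v in candidates.items()
                if complexity(v["p_code"]) <= C}
    if not feasible:
        return None
    return min(feasible, key=lambda h: loss(feasible[h]["p_x_given_h"]))

print(best_description(C=1.0))  # tight budget: only the simplest description fits
print(best_description(C=5.0))  # loose budget: the detailed description wins
```

Raising $C$ admits more complex (lower-$p_{code}$) descriptions, which can achieve lower loss; tightening $C$ forces simpler, coarser ones.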

Stochastic Relaxation

Since $H$ is discrete, the optimization problem cannot be solved with gradient descent. Instead of directly searching for a single description $h$, the problem is relaxed to finding a probability distribution $q(h)$ over descriptions. Accordingly, the loss $l(x\vert h)$ is replaced by its average over $q$, denoted $E_q[l(x\vert h)]$.

Now the optimization problem is modified as:

$$ q^\ast(h)=\arg\min_{q\in \mathcal Q}E_q[l(x\vert h)]\quad s.t. \quad KL(q(h)\Vert p_{code}(h))\le C $$

Note that the relaxed constraint $KL(q(h)\Vert p_{code}(h))\le C$ replaces the strict constraint $-\log p_{code}(h)\le C$:

  • The KL divergence $KL(q(h)\Vert p_{code}(h))$ measures how far $q$ deviates from the coding distribution $p_{code}$.
  • By making $q(h)$ similar to $p_{code}(h)$, we ensure that the probability distribution $q(h)$ does not place too much weight on complex hypotheses.
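The two quantities in the relaxed problem are easy to compute for a discrete hypothesis space. A small sketch with made-up losses and a made-up coding prior:

```python
import math

# Hypothetical hypothesis space h1, h2, h3: per-hypothesis losses l(x|h)
# and a coding prior p_code(h) (later hypotheses are "simpler" here).
losses = [0.2, 1.5, 3.0]
p_code = [0.1, 0.3, 0.6]

def expected_loss(q):
    """Objective E_q[l(x|h)]."""
    return sum(qi * li for qi, li in zip(q, losses))

def kl(q, p):
    """Constraint term KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.7, 0.2, 0.1]                 # a candidate description distribution
print(expected_loss(q))             # objective to minimize
print(kl(q, p_code))                # must stay <= C
```

Putting all of $q$'s mass on the lowest-loss hypothesis minimizes $E_q[l(x\vert h)]$, but if that hypothesis has low coding probability, the KL term blows past the budget $C$; the constraint is what keeps $q$ from over-weighting complex hypotheses.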

Conceptual Distance Function

For two images $x_1$ and $x_2$, the conceptual distance is:

$$ d=\frac{\Delta_{2\to1}+\Delta_{1\to2}}{2} $$

where:

$$ \Delta_{2\to1}:=E_{q_2}[l(x_1\vert h)]-E_{q_1}[l(x_1\vert h)] \\ \Delta_{1\to2}:=E_{q_1}[l(x_2\vert h)]-E_{q_2}[l(x_2\vert h)] $$

  • $E_{q}[l(x\vert h)]$ is the average loss when using the description distribution $q$ to describe an image $x$. A small value indicates that the descriptions fit the image well; each $\Delta$ is therefore the excess loss incurred when an image is described with the other image's distribution.
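Plugging made-up loss values into the distance makes this concrete. Taking each $\Delta$ as the cross-fit loss minus the self-fit loss (the extra loss from describing an image with the other image's distribution), $d$ comes out positive for dissimilar images:

```python
# Hypothetical average losses E_{q_i}[l(x_j|h)] (illustrative numbers):
loss_q1_x1 = 0.5   # q1 describing its own image x1: fits well
loss_q2_x1 = 2.0   # q2 describing x1: fits worse
loss_q2_x2 = 0.6   # q2 describing its own image x2: fits well
loss_q1_x2 = 1.8   # q1 describing x2: fits worse

delta_2_to_1 = loss_q2_x1 - loss_q1_x1   # excess loss on x1 from using q2
delta_1_to_2 = loss_q1_x2 - loss_q2_x2   # excess loss on x2 from using q1
d = (delta_2_to_1 + delta_1_to_2) / 2    # symmetrized conceptual distance
print(d)
```

For conceptually similar images the cross-fit losses approach the self-fit losses and $d$ shrinks toward zero.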

Example:

If the description distributions are swapped, so that each image is scored under the distribution fitted to the other image, the average loss increases. Hence $\Delta_{2\to1}$, $\Delta_{1\to2}$, and $d$ all take positive values.

AUC

To summarize the conceptual distance between two images in a single number, the paper suggests computing the Area Under the Curve (AUC) of the distance function across all complexity levels $C$.

The AUC provides a single number that reflects the overall difference between the two images at all levels of description complexity.
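A sketch of the summary statistic: evaluate $d$ at several complexity budgets and integrate with the trapezoid rule. The budgets and distance values below are hypothetical, not taken from the paper:

```python
def trapezoid_auc(xs, ys):
    """Trapezoid-rule area under the curve ys(xs)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(zip(xs, ys),
                                             zip(xs[1:], ys[1:])))

C_values = [0.5, 1.0, 2.0, 4.0, 8.0]   # complexity budgets C (made up)
d_values = [0.1, 0.4, 0.9, 1.2, 1.3]   # hypothetical distance d at each C

print(trapezoid_auc(C_values, d_values))
```

A curve that stays low across all budgets (similar images that remain hard to tell apart even with detailed descriptions) yields a small AUC; a curve that rises quickly yields a large one.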

| AUC | Interpretation |
| --- | --- |
| Small | The images are conceptually similar and can be described with similar descriptions, even as the complexity increases. |
| Large | The images are conceptually distant, and more detailed descriptions are needed to distinguish between them. |