https://arxiv.org/abs/2402.08919
Core Idea
Conceptual similarity between images can be quantified by the complexity of the natural language descriptions required to distinguish the images.
| Description | Conceptual Similarity |
| --- | --- |
| Shorter descriptions that apply well to both images | High |
| Longer descriptions that fit only one image well and not the other | Low |
Optimization Problem
Original Optimization Problem
The optimization problem is written as:
$$ h^\ast=\arg\min_{h\in H}l(x\vert h)\quad \text{s.t.} \quad I(h)\le C $$
where:
- $h$ is a possible description for an image $x$.
- $l(x\vert h)=-\log p(x\vert h)$ is the loss function.
- $p(x\vert h)$ is the likelihood of $x$ given $h$, i.e., the probability that $x$ is explained by $h$.
- A higher likelihood means $h$ is a good description of $x$; a lower likelihood means it is a poor one.
- $I(h)=-\log p_{code}(h)$ is the complexity (or coding length) of $h$, and $C$ is an upper bound on the complexity.
- Simpler descriptions are more probable under the code, so a higher $p_{code}(h)$ indicates a simpler $h$; $I(h)$ expresses this inverse relationship.
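On a finite hypothesis space, the constrained problem can be solved by exhaustive search. A minimal sketch, assuming the likelihoods and coding prior are given as arrays (all names below are hypothetical, not from the paper):

```python
import numpy as np

def best_description(log_lik, log_p_code, C):
    """Exhaustive solver for h* = argmin_h l(x|h)  s.t.  I(h) <= C
    over a finite hypothesis space.

    log_lik[i]    = log p(x | h_i)   (how well description i fits image x)
    log_p_code[i] = log p_code(h_i)  (coding prior; high for simple h)
    Returns the index of the best feasible description, or None.
    """
    loss = -np.asarray(log_lik)           # l(x|h) = -log p(x|h)
    complexity = -np.asarray(log_p_code)  # I(h) = -log p_code(h)
    feasible = np.where(complexity <= C)[0]
    if feasible.size == 0:
        return None                       # no description is simple enough
    return int(feasible[np.argmin(loss[feasible])])

# Toy example: three candidate descriptions of one image.
log_lik = np.log([0.2, 0.7, 0.9])      # h_2 fits the image best ...
log_p_code = np.log([0.5, 0.3, 0.05])  # ... but is also the most complex
print(best_description(log_lik, log_p_code, C=2.0))  # prints 1
```

With budget $C=2.0$, the best-fitting description $h_2$ has complexity $-\log 0.05\approx3.0$ and is infeasible, so the search settles for the simpler $h_1$; raising $C$ above $3.0$ would select $h_2$.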
Stochastic Relaxation
Since $H$ is discrete, the optimization problem cannot be solved with gradient descent. Instead of directly searching for a single description $h$, we optimize over a probability distribution $q(h)$ on descriptions, and the loss $l(x\vert h)$ is replaced by its average under $q$, written $E_q[l(x\vert h)]$.
Now the optimization problem is modified as:
$$ q^\ast(h)=\arg\min_{q\in \mathcal Q}E_q[l(x\vert h)]\quad \text{s.t.} \quad KL(q(h)\Vert p_{code}(h))\le C $$
Note that the relaxed constraint $KL(q(h)\Vert p_{code}(h))\le C$ replaces the strict constraint $-\log p_{code}(h)\le C$:
- The KL divergence of $q$ from $p_{code}$ measures how far $q$ deviates from $p_{code}$.
- By making $q(h)$ similar to $p_{code}(h)$, we ensure that the probability distribution $q(h)$ does not place too much weight on complex hypotheses.
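One standard way to handle such a constraint (a sketch, not necessarily the paper's exact algorithm) is through the Lagrangian: for a multiplier $\beta>0$, minimizing $E_q[l]+\beta\,KL(q\Vert p_{code})$ over all distributions has the closed form $q^\ast(h)\propto p_{code}(h)\,e^{-l(x\vert h)/\beta}$, and sweeping $\beta$ traces out solutions for different budgets $C$:

```python
import numpy as np

def relaxed_q(loss, p_code, beta):
    """Minimizer of E_q[loss] + beta * KL(q || p_code) over distributions q.
    Closed form: q(h) proportional to p_code(h) * exp(-loss(h) / beta)."""
    w = p_code * np.exp(-loss / beta)
    return w / w.sum()

def kl(q, p):
    """KL(q || p) for discrete distributions with p > 0 wherever q > 0."""
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

loss = np.array([1.6, 0.36, 0.1])    # l(x|h) for three toy descriptions
p_code = np.array([0.5, 0.3, 0.2])   # coding prior: simple h is likely

for beta in (10.0, 1.0, 0.1):
    q = relaxed_q(loss, p_code, beta)
    # Large beta: q hugs p_code (small KL, i.e. small budget C).
    # Small beta: q concentrates on low-loss descriptions (large KL).
    print(f"beta={beta}: q={q.round(3)}, KL={kl(q, p_code):.3f}")
```

Each $\beta$ corresponds to the budget $C=KL(q^\ast\Vert p_{code})$ at which the constraint is tight, so sweeping $\beta$ is equivalent to sweeping $C$.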
Conceptual Distance Function
For two images $x_1$ and $x_2$, the conceptual distance is:
$$ d=\frac{\Delta_{2\to1}+\Delta_{1\to2}}{2} $$
where:
$$ \Delta_{2\to1}:=E_{q_2}[l(x_1\vert h)]-E_{q_1}[l(x_1\vert h)] \\ \Delta_{1\to2}:=E_{q_1}[l(x_2\vert h)]-E_{q_2}[l(x_2\vert h)] $$
- $E_{q}[l(x\vert h)]$ is the average loss when the description distribution $q$ is used to describe image $x$; a small value indicates that the descriptions fit the image well.
Example:
If $q_1$ and $q_2$ are exchanged, so that each image is described with the distribution fitted to the other image, the expected loss increases; hence both $\Delta$ terms, and therefore $d$, take positive values.
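The distance can be sketched numerically, taking each $\Delta$ as the excess loss incurred by describing an image with the other image's description distribution (all arrays below are toy values, not from the paper):

```python
import numpy as np

def expected_loss(q, loss):
    """E_q[l(x|h)]: average description loss for an image under q."""
    return float(q @ loss)

def conceptual_distance(q1, q2, loss_x1, loss_x2):
    """d = (Delta_{2->1} + Delta_{1->2}) / 2, with each Delta the excess
    loss from describing an image with the other image's distribution."""
    delta_21 = expected_loss(q2, loss_x1) - expected_loss(q1, loss_x1)
    delta_12 = expected_loss(q1, loss_x2) - expected_loss(q2, loss_x2)
    return (delta_21 + delta_12) / 2

# q1 favors descriptions that fit x1; q2 favors those that fit x2.
q1 = np.array([0.8, 0.1, 0.1])
q2 = np.array([0.1, 0.1, 0.8])
loss_x1 = np.array([0.2, 1.0, 2.0])   # l(x1|h) = -log p(x1|h)
loss_x2 = np.array([2.0, 1.0, 0.2])   # l(x2|h) = -log p(x2|h)
print(conceptual_distance(q1, q2, loss_x1, loss_x2))  # positive: dissimilar
```

Note that if the two images had identical description distributions ($q_1=q_2$), both $\Delta$ terms would vanish and $d=0$.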
AUC
To summarize the conceptual distance between two images, the paper suggests computing the area under the curve (AUC) of the distance $d$ as a function of the complexity budget $C$ (since $q_1$ and $q_2$ depend on $C$, so does $d$).
The AUC provides a single number that reflects the overall difference between the two images across all levels of description complexity.
| AUC | Interpretation |
| --- | --- |
| Small | The images are conceptually similar and can be described with similar descriptions, even as the complexity increases. |
| Large | The images are conceptually distant, and more detailed descriptions are needed to distinguish between them. |
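A sketch of the AUC summary via the trapezoidal rule, assuming $d(C)$ has already been evaluated on a grid of budgets (the grid and distance values below are made up for illustration):

```python
import numpy as np

# Hypothetical distances d(C) evaluated at increasing complexity budgets C.
C_grid = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
d_vals = np.array([1.3, 1.1, 0.8, 0.4, 0.1])  # d shrinks as budgets grow

# Single-number summary: area under the d(C) curve (trapezoidal rule).
auc = float(np.sum((d_vals[1:] + d_vals[:-1]) / 2 * np.diff(C_grid)))
print(round(auc, 6))  # prints 3.75
```

Comparing this score across image pairs ranks them by overall conceptual distance: a pair whose $d(C)$ curve stays high at every budget accumulates a larger area than a pair whose curve drops quickly.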