Template-Conditioned Relation-DETR

TL;DR: A closed-set DETR-style detector predicts one of C fixed classes for every object query. A template-conditioned detector instead asks whether each decoded query matches a supplied exemplar image. The geometry stays DETR-like: Hungarian matching, box regression, transformer decoding, and Relation-DETR’s spatial reasoning remain intact. The semantic head changes from (B, N, C) class logits to (B, N, 1) query-template match logits.

The shift

Standard DETR-style detectors answer a fixed-vocabulary question:

“Which predefined class does this object query represent?”

Template-conditioned detection asks a retrieval-style question:

“How similar is this object query to the exemplar I provided?”

That sounds like a small change, but it moves the class decision boundary out of the final classifier weights and into the input. Instead of learning a static matrix of class weights, the model receives a visual template at inference time and scores every decoded object query against that template.

For a Relation-DETR-like detector, this is appealing because the detector already has a strong geometric engine. The decoder refines a set of learned object queries, the matcher enforces one-to-one assignments, and the box heads learn localization. We do not need to discard that machinery. We only need to replace closed-set classification with conditional matching.

Architecture change

The original detector and the template-conditioned variant share the same image path. The change is isolated to the semantic branch after the decoder: the closed-set classifier is replaced by a query-template similarity head.

The conceptual diff is: keep the backbone, decoder, query set, and box head fixed; remove the Linear(D, C) classifier; add a template encoder and put a normalized query-template matching head in the classifier's place.

Mathematical formulation

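Let the detector decoder produce N query embeddings:

$$Q = [q_1, \dots, q_N]^\top \in \mathbb{R}^{N \times D}$$

Let a template encoder produce one visual prototype from the exemplar crop:

$$t \in \mathbb{R}^{D}$$

For each query, project the query embedding and compute a normalized similarity score against the template:

$$\ell_i = \tau \cdot \frac{(W q_i)^\top t}{\lVert W q_i \rVert_2 \, \lVert t \rVert_2}$$

where:

$W \in \mathbb{R}^{D \times D}$ is a learned projection
$\tau = \exp(s)$ is a learned temperature
$\ell_i$ is the conditional match logit for query i

Then:

$$p_i = \sigma(\ell_i)$$

is the probability that query i matches the template, with $\sigma$ the logistic sigmoid.

The output shape becomes: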

closed-set classification:       (B, N, C)
template-conditioned matching:   (B, N, 1)

The model no longer predicts a class id directly. It predicts whether each query matches the exemplar.
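The shape change can be checked with a few lines of PyTorch. This is only a minimal sketch under stated assumptions: the sizes B, N, D, and C are arbitrary, random tensors stand in for the decoder output and the template embedding, the temperature is fixed rather than learned, and the query projection is left out so that only the shapes are visible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D, C = 2, 300, 256, 80            # illustrative sizes, not from the source

queries = torch.randn(B, N, D)          # stand-in for decoder output
template = torch.randn(B, 1, D)         # stand-in for one exemplar embedding per image

# Closed-set head: one logit per class for every query.
closed_set_logits = nn.Linear(D, C)(queries)

# Template-conditioned head: one cosine-based match logit per query.
tau = 10.0                              # fixed here for illustration
q = F.normalize(queries, dim=-1)
t = F.normalize(template, dim=-1)
match_logits = tau * torch.bmm(q, t.transpose(1, 2))

print(closed_set_logits.shape)          # torch.Size([2, 300, 80])
print(match_logits.shape)               # torch.Size([2, 300, 1])
```

Because the match logit is a scaled cosine, its magnitude is bounded by the temperature, unlike the unbounded logits of a linear classifier.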

Matching and loss

The localization losses can stay almost unchanged. Hungarian assignment still matches predicted boxes to ground-truth boxes. The classification term becomes binary: positive if the ground-truth object belongs to the template’s category, negative otherwise.

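For a matched pair (i, j), where query i is assigned to ground-truth object j with category $c_j$ and box $b_j$, and $c_t$ is the template's category:

$$y_j = \mathbb{1}[c_j = c_t], \qquad \mathcal{L}_{ij} = \lambda_{\text{cls}}\, \mathcal{L}_{\text{bin}}(\ell_i, y_j) + \lambda_{L1}\, \lVert \hat{b}_i - b_j \rVert_1 + \lambda_{\text{GIoU}}\, \mathcal{L}_{\text{GIoU}}(\hat{b}_i, b_j)$$

Here $\mathcal{L}_{\text{bin}}$ is a binary classification loss on the match logit, $\hat{b}_i$ is the predicted box, and the $\lambda$ terms are loss weights. Unmatched queries receive the target 0 and contribute only to the binary term.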

This keeps the detector DETR-like in geometry and metric-learning-like in recognition. The decoder still has to produce good boxes; the head only changes how semantic evidence is computed.
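The binary target construction described above can be sketched as follows. This is an illustration under assumptions, not the author's training code: the function name and argument layout are hypothetical, the Hungarian indices are taken as given, plain binary cross-entropy stands in for whatever binary loss is actually used, and the GIoU term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def template_match_loss(match_logits, pred_boxes, gt_boxes, gt_labels,
                        template_label, matched_q, matched_gt):
    """Binary match loss plus L1 box loss for one image (hypothetical helper).

    match_logits: (N, 1) query-template logits
    pred_boxes:   (N, 4) predicted boxes
    gt_boxes:     (M, 4) ground-truth boxes
    gt_labels:    (M,)   ground-truth category ids
    matched_q, matched_gt: (K,) index tensors from Hungarian matching
    """
    # Every query starts as a negative. A matched query becomes positive only
    # when its ground-truth object has the template's category.
    targets = match_logits.new_zeros(match_logits.shape[0])
    targets[matched_q] = (gt_labels[matched_gt] == template_label).float()

    cls_loss = F.binary_cross_entropy_with_logits(match_logits.squeeze(-1), targets)
    # Box regression uses matched pairs only.
    l1_loss = F.l1_loss(pred_boxes[matched_q], gt_boxes[matched_gt])
    return cls_loss, l1_loss, targets

torch.manual_seed(0)
match_logits = torch.randn(4, 1)
pred_boxes = torch.rand(4, 4)
gt_boxes = torch.rand(2, 4)
gt_labels = torch.tensor([3, 7])
matched_q = torch.tensor([2, 0])     # query 2 <-> gt 0, query 0 <-> gt 1
matched_gt = torch.tensor([0, 1])

cls_loss, l1_loss, targets = template_match_loss(
    match_logits, pred_boxes, gt_boxes, gt_labels,
    template_label=3, matched_q=matched_q, matched_gt=matched_gt)
print(targets)                       # tensor([0., 0., 1., 0.])
```

Query 0 is matched to a real object, but that object belongs to a different category than the template, so it is trained as a negative for the match score while still receiving a box loss.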

Design pressure

Closed-set detectors work well when the vocabulary is known ahead of deployment. They become brittle when the target category is rare, private, changing, or easier to show than to name.

Template conditioning is useful when:

a user can provide a crop but not a stable class label
the category is absent from the training taxonomy
product, logo, symbol, or defect appearances change faster than retraining cycles
the same image needs to be searched repeatedly for different exemplars

In that last case, the system can cache the target image features once, then run multiple template embeddings against the same decoded representation.

Core implementation

The class head is the surgical part. A conventional DETR classifier maps every query to C logits. The template-conditioned head maps every query to one match logit.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TemplateClassHead(nn.Module):
    """Scores each decoded query against a template embedding."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.query_proj = nn.Linear(embed_dim, embed_dim)
        # Log-temperature s; tau = exp(s) starts at 10.
        self.logit_scale = nn.Parameter(torch.tensor(math.log(10.0)))
        self._template: torch.Tensor | None = None

    def set_template(self, template: torch.Tensor) -> None:
        self._template = template  # (B, 1, D)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        if self._template is None:
            raise RuntimeError('TemplateClassHead requires a template before scoring queries.')

        q = F.normalize(self.query_proj(query), dim=-1)      # (B, N, D)
        v = F.normalize(self._template, dim=-1)              # (B, 1, D)
        score = torch.bmm(q, v.transpose(1, 2))              # (B, N, 1), cosine in [-1, 1]
        return score * self.logit_scale.exp()                # scale by tau = exp(s)

A minimal visual prompt encoder can be built from pooled backbone features:

import torch
import torch.nn as nn


class VisualPromptEncoder(nn.Module):
    """Pools a backbone feature map of the template crop into one prototype."""

    def __init__(self, embed_dim: int = 256, in_ch: int = 2048):
        super().__init__()
        # 1x1 conv maps backbone channels to the detector's embedding width.
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim, kernel_size=1),
            nn.GroupNorm(32, embed_dim),  # embed_dim must be divisible by 32
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
            nn.LayerNorm(embed_dim),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.proj(feat)            # (B, in_ch, H, W) -> (B, D, H, W)
        x = self.pool(x).flatten(1)    # (B, D)
        x = self.mlp(x)
        return x.unsqueeze(1)  # (B, 1, D)

The inference path can reuse image features across templates:

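cached = model.extract_features(image)  # image path runs once

results = []
for template in templates:
    template_embedding = model.encode_template(template)
    detections = model.detect_with_cached_features(cached, template_embedding)
    results.append(detections)  # keep one detection set per template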

That matters in workflows where the same large image is queried with many exemplars: schematic symbols, retail products, industrial defects, logo search, or repeated visual inspection.
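Because the template only enters at the scoring head, the per-template loop can also be collapsed into a single matrix product over already projected query embeddings. The sketch below assumes exactly that; `score_templates` is a hypothetical helper, the temperature is fixed instead of learned, and random tensors stand in for real embeddings.

```python
import torch
import torch.nn.functional as F

def score_templates(query_embed, template_embeds, tau=10.0):
    """Score N cached query embeddings against T template embeddings.

    query_embed:     (N, D) projected decoder queries for one image
    template_embeds: (T, D) one prototype per exemplar
    returns:         (N, T) match logits, one column per template
    """
    q = F.normalize(query_embed, dim=-1)
    t = F.normalize(template_embeds, dim=-1)
    return tau * (q @ t.T)

torch.manual_seed(0)
cached_queries = torch.randn(300, 256)   # computed once per image
templates = torch.randn(5, 256)          # five different exemplars
logits = score_templates(cached_queries, templates)
print(logits.shape)                      # torch.Size([300, 5])
```

Each column of the result is what the single-template head would produce for that exemplar, so the cost of adding a template is one extra dot product per query.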

Training signal

Each training sample can be formed as:

one target image
one template crop sampled from an object category in that image
all boxes of that category as positives
other decoded queries as negatives through the binary classification loss

This gives a dense “find all objects like this” signal. The target image teaches localization, while the template controls which subset of objects should be scored as matches.
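The sampling recipe above can be sketched in plain Python. The annotation format (dicts with "box" and "label" keys), the function name, and the example labels are assumptions made for illustration; a real pipeline would also crop the template region and encode it.

```python
import random

def make_training_sample(annotations, rng):
    """Pick a template and binary targets from one image's annotations.

    annotations: list of dicts with hypothetical keys "box" and "label".
    Returns the template box, its category, and a 0/1 flag per annotation.
    """
    categories = sorted({a["label"] for a in annotations})
    category = rng.choice(categories)                  # category to search for
    candidates = [a for a in annotations if a["label"] == category]
    template = rng.choice(candidates)                  # source of the exemplar crop
    # All boxes of the sampled category are positives, the rest negatives.
    is_positive = [int(a["label"] == category) for a in annotations]
    return template["box"], category, is_positive

annotations = [
    {"box": (10, 10, 50, 50), "label": "resistor"},
    {"box": (80, 20, 120, 60), "label": "resistor"},
    {"box": (30, 90, 70, 130), "label": "capacitor"},
]
template_box, category, is_positive = make_training_sample(annotations, random.Random(0))
print(category, is_positive)
```

Resampling the category each time the image is visited means the same image yields different positive sets, which is what lets the template, not the image, decide what counts as a match.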

A useful extension is to sample hard negatives deliberately: visually similar classes, nearby parts, same-category variants, or confusing background structures. Without that pressure, cosine heads can overfire on lookalikes.

What this borrows

This design is a recombination of existing ideas rather than a new detection paradigm.

From DETR, it keeps:

set prediction
learned object queries
bipartite matching
L1 and GIoU box losses
the decoder-to-box-head interface

From Relation-DETR, it keeps the belief that the geometric decoder should remain strong and mostly untouched. The point is not to rebuild detection; the point is to change the semantic scoring layer while preserving the spatial reasoning stack.

From FS-DETR and one-shot detection work, it borrows the idea that a visual example can define the target category at inference time.

From OV-DETR, it borrows the conditional binary matching framing: the detector should predict whether a region matches a supplied query, rather than always predicting over a fixed class vocabulary.

From CLIP-style retrieval heads, it borrows normalized embeddings and a learned temperature for cosine-style scoring.

The difference is the scope of the modification. Instead of injecting templates deeply into the transformer decoder or building a full open-vocabulary text-and-image query system, this variant replaces the class head with a template-conditioned similarity head. It is intentionally narrow.

Failure modes

The obvious failure mode is template quality. If the crop contains background clutter, partial objects, scale artifacts, or ambiguous context, the prototype becomes noisy.

Near-lookalike categories are another problem. If two objects share local texture or shape, a single global template vector may not separate them reliably. This is where hard negative mining, multi-template prototypes, and contrastive auxiliary losses become useful.

Thresholding is also brittle. A fixed score cutoff may work on one dataset and fail on another because the learned temperature, template quality, and visual domain all affect score calibration.
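A small worked example makes the temperature dependence concrete. It assumes the score is passed through a sigmoid, as in a binary match head; the cosine values and temperatures are chosen only for illustration.

```python
import math

def match_probability(cosine, tau):
    """Sigmoid of a temperature-scaled cosine similarity."""
    return 1.0 / (1.0 + math.exp(-tau * cosine))

for tau in (5.0, 10.0, 20.0):
    probs = [round(match_probability(c, tau), 3) for c in (0.0, 0.2, 0.5)]
    print(tau, probs)
# 5.0 [0.5, 0.731, 0.924]
# 10.0 [0.5, 0.881, 0.993]
# 20.0 [0.5, 0.982, 1.0]
```

The same cosine of 0.2 falls below a 0.9 cutoff at a temperature of 5 or 10 and above it at 20. A head with no bias term after the scaling also maps a cosine of 0 to exactly 0.5, so a 0.5 threshold reduces to asking whether the cosine is positive.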

Promising next steps:

use multiple templates per target and aggregate a prototype set
add hard negative mining with confusing lookalikes
train a calibration head for more stable thresholds
add contrastive supervision between query and template embeddings
preserve spatial template structure instead of collapsing the exemplar into one vector
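The first item, aggregating several templates, could take at least two simple forms: average the exemplars into one prototype, or score against each exemplar and keep the best match. The sketch below shows both under assumptions; the function name, the fixed temperature, and the two modes are illustrative choices, not something the design above prescribes.

```python
import torch
import torch.nn.functional as F

def aggregate_scores(query_embed, template_set, tau=10.0, mode="mean"):
    """Score queries against K exemplars of the same target.

    query_embed:  (N, D) projected query embeddings
    template_set: (K, D) embeddings of K exemplars
    mode "mean" averages normalized templates into one prototype;
    mode "max" keeps the best-matching exemplar per query.
    """
    q = F.normalize(query_embed, dim=-1)
    t = F.normalize(template_set, dim=-1)
    if mode == "mean":
        prototype = F.normalize(t.mean(dim=0, keepdim=True), dim=-1)  # (1, D)
        return tau * (q @ prototype.T).squeeze(-1)                    # (N,)
    if mode == "max":
        return tau * (q @ t.T).max(dim=-1).values                     # (N,)
    raise ValueError(f"unknown mode: {mode}")

torch.manual_seed(0)
queries = torch.randn(300, 256)
exemplars = torch.randn(4, 256)
mean_logits = aggregate_scores(queries, exemplars, mode="mean")
max_logits = aggregate_scores(queries, exemplars, mode="max")
print(mean_logits.shape, max_logits.shape)
```

Averaging smooths out a noisy crop but can blur distinct appearances of the same target; taking the maximum preserves distinct appearances but lets a single bad exemplar raise scores on lookalikes.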

Related work and implementations

DETR: End-to-End Object Detection with Transformers introduced the set-prediction formulation, Hungarian matching, and object-query decoder. Code: facebookresearch/detr.
Relation DETR: Exploring Explicit Position Relation Prior for Object Detection adds explicit position-relation priors to strengthen DETR-style decoding. Code: xiuqhou/Relation-DETR.
FS-DETR: Few-Shot Detection Transformer with Prompting and without Re-training feeds visual templates as prompts so novel categories can be detected without test-time fine-tuning.
OV-DETR: Open-Vocabulary DETR with Conditional Matching formulates detection as conditional binary matching against text or exemplar-image queries.
OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features performs one-shot detection through learned feature matching and geometric alignment. Code: aosokin/os2d.
One-Shot Object Detection with Co-Attention and Co-Excitation uses support-query interaction for one-shot detection. Code: timy90022/one-shot-object-detection.
CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching adapts CLIP region features inside a DETR-style open-vocabulary detector.
Few-Shot Pattern Detection via Template Matching and Regression revisits template matching with modern frozen backbones and regression heads. Code: Chipmunk-g4/Template-Matching-and-Regression.
