27-05-2026
detr, relation-detr, object-detection, few-shot, computer-vision

Template-Conditioned Relation-DETR

TL;DR: A closed-set DETR-style detector predicts one of C fixed classes for every object query. A template-conditioned detector instead asks whether each decoded query matches a supplied exemplar image. The geometry stays DETR-like: Hungarian matching, box regression, transformer decoding, and Relation-DETR’s spatial reasoning remain intact. The semantic head changes from (B, N, C) class logits to (B, N, 1) query-template match logits.

The shift

Standard DETR-style detectors answer a fixed-vocabulary question:

  • “Which predefined class does this object query represent?”

Template-conditioned detection asks a retrieval-style question:

  • “How similar is this object query to the exemplar I provided?”

That sounds like a small change, but it moves the class decision boundary out of the final classifier weights and into the input. Instead of learning a static matrix of class weights, the model receives a visual template at inference time and scores every decoded object query against that template.

For a Relation-DETR-like detector, this is appealing because the detector already has a strong geometric engine. The decoder refines learned object queries, the matcher enforces one-to-one assignments, and the box heads learn localization. We do not need to discard that machinery. We only need to replace closed-set classification with conditional matching.

Architecture change

The original detector and the template-conditioned variant share the same image path. The change is isolated to the semantic branch after the decoder: the closed-set classifier is replaced by a query-template similarity head.

[Figure: architecture comparison. Original Relation-DETR: target image -> backbone + neck -> Relation-DETR decoder -> object queries (B, N, D) -> linear class head -> class logits (B, N, C), plus box head -> boxes (B, N, 4). Template-conditioned variant: same image path and box head, but the classifier is replaced by a template similarity head that outputs match logits (B, N, 1); a visual prompt encoder turns the template crop into a template embedding (B, 1, D) that feeds this head.]

The conceptual diff is: keep the backbone, decoder, query set, and box head fixed; remove the Linear(D, C) classifier; add a template encoder and replace the classifier with normalized query-template matching.

Mathematical formulation

Let the detector decoder produce N query embeddings:

$$Q = \{q_i\}_{i=1}^{N}, \qquad q_i \in \mathbb{R}^{D}$$

Let a template encoder produce one visual prototype from the exemplar crop:

$$v = E(t), \qquad v \in \mathbb{R}^{D}$$

For each query, project the query embedding and compute a normalized similarity score against the template:

$$\ell_i = \tau \left\langle \frac{W q_i}{\lVert W q_i \rVert_2}, \frac{v}{\lVert v \rVert_2} \right\rangle$$

where:

  • W in R^(D x D) is a learned projection
  • tau = exp(s) is a learned temperature
  • ell_i is the conditional match logit for query i

Then:

$$p_i = \sigma(\ell_i)$$

The output shape becomes:

closed-set classification:       (B, N, C)
template-conditioned matching:   (B, N, 1)

The model no longer predicts a class id directly. It predicts whether each query matches the exemplar.
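
To get a feel for the scale of these logits, here is a small worked example. It assumes tau = 10, which is the initial value exp(log 10) of the temperature in the head shown in the implementation section; the cosine values are made up for illustration.

```latex
\begin{aligned}
\cos = 0.8:  &\quad \ell = 10 \cdot 0.8 = 8,     &\quad p = \sigma(8)  \approx 0.9997 \\
\cos = 0.0:  &\quad \ell = 10 \cdot 0 = 0,       &\quad p = \sigma(0)  = 0.5 \\
\cos = -0.5: &\quad \ell = 10 \cdot (-0.5) = -5, &\quad p = \sigma(-5) \approx 0.0067
\end{aligned}
```

Because cosine similarity lies in [-1, 1], the logit is bounded by plus or minus tau, so p can never leave the interval [sigma(-tau), sigma(tau)]. A query orthogonal to the template scores exactly 0.5, since nothing is added to the scaled cosine before the sigmoid.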

Matching and loss

The localization losses can stay almost unchanged. Hungarian assignment still matches predicted boxes to ground-truth boxes. The classification term becomes binary: positive if the ground-truth object belongs to the template’s category, negative otherwise.

For a matched pair (i, j):

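$$\mathcal{L}_{ij} = \lambda_{\text{cls}}\, \mathcal{L}_{\text{focal}}(p_i, y_j) + \lambda_{\text{L1}} \lVert b_i - b_j \rVert_1 + \lambda_{\text{giou}} \bigl(1 - \mathrm{GIoU}(b_i, b_j)\bigr)$$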

This keeps the detector DETR-like in geometry and metric-learning-like in recognition. The decoder still has to produce good boxes; the head only changes how semantic evidence is computed.
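
The per-pair loss above can be sketched in plain PyTorch as follows. This is a minimal illustration, not the post's training code: it assumes the Hungarian step has already produced matched rows, that boxes are in (x1, y1, x2, y2) format with positive area, and that the focal parameters and loss weights are placeholder values. The helper names `sigmoid_focal`, `giou`, and `matched_pair_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F


def sigmoid_focal(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on raw logits (alpha and gamma are assumed defaults)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # probability of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return alpha_t * (1 - p_t) ** gamma * ce


def giou(a, b):
    """Element-wise GIoU for paired boxes in (x1, y1, x2, y2) format."""
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).prod(dim=1)
    union = area_a + area_b - inter
    # Area of the smallest box enclosing both a and b.
    hull = (torch.max(a[:, 2:], b[:, 2:]) - torch.min(a[:, :2], b[:, :2])).prod(dim=1)
    return inter / union - (hull - union) / hull


def matched_pair_loss(match_logits, pred_boxes, gt_boxes, y,
                      w_cls=2.0, w_l1=5.0, w_giou=2.0):
    """Loss for already-matched (query i, ground truth j) rows.

    match_logits: (M,), pred_boxes and gt_boxes: (M, 4), y: (M,) with values in {0, 1}.
    The three weights are placeholders, not values taken from the post.
    """
    cls = sigmoid_focal(match_logits, y)
    l1 = (pred_boxes - gt_boxes).abs().sum(dim=1)
    g = 1.0 - giou(pred_boxes, gt_boxes)
    return w_cls * cls + w_l1 * l1 + w_giou * g


# One matched pair: an undecided logit, a non-matching target, and a box far from its target.
pred = torch.tensor([[0.0, 0.0, 1.0, 1.0]])
gt = torch.tensor([[2.0, 0.0, 3.0, 1.0]])
print(matched_pair_loss(torch.tensor([0.0]), pred, gt, torch.tensor([0.0])))
```

Only the first term depends on the template; the two box terms are the usual DETR-style localization losses.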

Design pressure

Closed-set detectors work well when the vocabulary is known ahead of deployment. They become brittle when the target category is rare, private, changing, or easier to show than to name.

Template conditioning is useful when:

  • a user can provide a crop but not a stable class label
  • the category is absent from the training taxonomy
  • product, logo, symbol, or defect appearances change faster than retraining cycles
  • the same image needs to be searched repeatedly for different exemplars

In that last case, the system can cache the target image features once, then run multiple template embeddings against the same decoded representation.

Core implementation

The class head is the surgical part. A conventional DETR classifier maps every query to C logits. The template-conditioned head maps every query to one match logit.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TemplateClassHead(nn.Module):
    """Scores each decoded query against the template passed to set_template()."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.query_proj = nn.Linear(embed_dim, embed_dim)
        # Learned log-temperature s; tau = exp(s) starts at 10.
        self.logit_scale = nn.Parameter(torch.tensor(math.log(10.0)))
        self._template: torch.Tensor | None = None
 
    def set_template(self, template: torch.Tensor) -> None:
        self._template = template  # (B, 1, D)
 
    def forward(self, query: torch.Tensor) -> torch.Tensor:
        if self._template is None:
            raise RuntimeError('TemplateClassHead requires a template before scoring queries.')
 
        q = F.normalize(self.query_proj(query), dim=-1)      # (B, N, D)
        v = F.normalize(self._template, dim=-1)              # (B, 1, D)
        score = torch.bmm(q, v.transpose(1, 2))              # (B, N, 1) cosine similarity
        return score * self.logit_scale.exp()                # scale by tau = exp(s)

A minimal visual prompt encoder can be built from pooled backbone features:

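class VisualPromptEncoder(nn.Module):
    """Turns a backbone feature map of the template crop into one embedding."""

    def __init__(self, embed_dim: int = 256, in_ch: int = 2048):
        super().__init__()
        # 1x1 conv maps backbone channels to the decoder width.
        # GroupNorm(32, embed_dim) requires embed_dim to be divisible by 32.
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim, kernel_size=1),
            nn.GroupNorm(32, embed_dim),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
            nn.LayerNorm(embed_dim),
        )
 
    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.proj(feat)              # (B, in_ch, H, W) -> (B, D, H, W)
        x = self.pool(x).flatten(1)      # global average pool -> (B, D)
        x = self.mlp(x)                  # (B, D)
        return x.unsqueeze(1)  # (B, 1, D)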

The inference path can reuse image features across templates:

cached = model.extract_features(image)
 
for template in templates:
    template_embedding = model.encode_template(template)
    detections = model.detect_with_cached_features(cached, template_embedding)

That matters in workflows where the same large image is queried with many exemplars: schematic symbols, retail products, industrial defects, logo search, or repeated visual inspection.
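
Since the template only enters at the similarity head in the design described here, the decoded queries themselves can be scored against several templates in one pass. The sketch below assumes K template embeddings stacked into a single (K, D) tensor; `score_many_templates` is a hypothetical helper, not a method of the model above or of any library.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def score_many_templates(queries, templates, query_proj, logit_scale):
    """Score cached decoder queries against K templates at once.

    queries:   (B, N, D) decoded object queries
    templates: (K, D) one embedding per exemplar (assumed layout)
    returns:   (B, N, K) match logits, one column per template
    """
    q = F.normalize(query_proj(queries), dim=-1)       # (B, N, D)
    v = F.normalize(templates, dim=-1)                 # (K, D)
    return logit_scale.exp() * torch.einsum("bnd,kd->bnk", q, v)


torch.manual_seed(0)
B, N, D, K = 1, 300, 256, 5                            # illustrative sizes only
proj = nn.Linear(D, D)
scale = torch.tensor(math.log(10.0))
logits = score_many_templates(torch.randn(B, N, D), torch.randn(K, D), proj, scale)
print(logits.shape)  # torch.Size([1, 300, 5])
```

Each column of the result can then be thresholded or ranked independently, so adding a template costs one extra dot product per query rather than another pass through the backbone and decoder.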

Training signal

Each training sample can be formed as:

  • one target image
  • one template crop sampled from an object category in that image
  • all boxes of that category as positives
  • other decoded queries as negatives through the binary classification loss

This gives a dense “find all objects like this” signal. The target image teaches localization, while the template controls which subset of objects should be scored as matches.
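
The sampling recipe above can be sketched as a small helper. It is an illustration under assumptions: annotations are given as a (G, 4) box tensor plus integer category ids, and the function name `build_training_sample` is hypothetical. Cropping the template from the image and running the detector are left out.

```python
import random

import torch


def build_training_sample(boxes, labels, rng=random):
    """Pick a template from one image and derive binary match targets.

    boxes:  (G, 4) ground-truth boxes, labels: (G,) integer category ids.
    Returns the index of the box to crop as the template, that box, and a
    (G,) float tensor that is 1 for boxes sharing the template's category.
    """
    template_idx = rng.randrange(len(labels))
    category = labels[template_idx]
    targets = (labels == category).float()
    return template_idx, boxes[template_idx], targets


boxes = torch.tensor([[10.0, 10.0, 50.0, 50.0],
                      [60.0, 20.0, 90.0, 70.0],
                      [15.0, 80.0, 40.0, 120.0]])
labels = torch.tensor([3, 7, 3])
idx, template_box, y = build_training_sample(boxes, labels, random.Random(0))
print(idx, template_box, y)
```

The template box itself is always a positive in `y`, which matches the "find all objects like this" framing: the exemplar is one instance of the set the detector should recover.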

A useful extension is to sample hard negatives deliberately: visually similar classes, nearby parts, same-category variants, or confusing background structures. Without that pressure, cosine heads can overfire on lookalikes.
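
One possible way to find "visually similar classes" is to rank categories by the cosine similarity of their prototype embeddings. The sketch assumes a (C, D) bank with one prototype per category, for example a running mean of template embeddings; the post does not define such a bank, and `hardest_negative_categories` is a hypothetical name.

```python
import torch
import torch.nn.functional as F


def hardest_negative_categories(prototypes, category, k=2):
    """Return the k categories whose prototypes are most similar to `category`.

    prototypes: (C, D) one embedding per category (assumed to exist).
    """
    p = F.normalize(prototypes, dim=-1)
    sim = p @ p[category]                  # (C,) cosine similarity to the anchor
    sim[category] = float("-inf")          # never pick the anchor itself
    return sim.topk(k).indices.tolist()


protos = torch.tensor([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0],
                       [-1.0, 0.0]])
print(hardest_negative_categories(protos, category=0, k=2))  # [1, 2]
```

Templates drawn from the returned categories could then be paired with the same target image, so the head sees lookalike exemplars whose objects it must not score as matches.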

What this borrows

This design is a recombination of existing ideas rather than a new detection paradigm.

From DETR, it keeps:

  • set prediction
  • learned object queries
  • bipartite matching
  • L1 and GIoU box losses
  • the decoder-to-box-head interface

From Relation-DETR, it keeps the belief that the geometric decoder should remain strong and mostly untouched. The point is not to rebuild detection; the point is to change the semantic scoring layer while preserving the spatial reasoning stack.

From FS-DETR and one-shot detection work, it borrows the idea that a visual example can define the target category at inference time.

From OV-DETR, it borrows the conditional binary matching framing: the detector should predict whether a region matches a supplied query, rather than always predicting over a fixed class vocabulary.

From CLIP-style retrieval heads, it borrows normalized embeddings and a learned temperature for cosine-style scoring.

The difference is the scope of the modification. Instead of injecting templates deeply into the transformer decoder or building a full open-vocabulary text-and-image query system, this variant replaces the class head with a template-conditioned similarity head. It is intentionally narrow.

Failure modes

The obvious failure mode is template quality. If the crop contains background clutter, partial objects, scale artifacts, or ambiguous context, the prototype becomes noisy.

Near-lookalike categories are another problem. If two objects share local texture or shape, a single global template vector may not separate them reliably. This is where hard negative mining, multi-template prototypes, and contrastive auxiliary losses become useful.

Thresholding is also brittle. A fixed score cutoff may work on one dataset and fail on another because the learned temperature, template quality, and visual domain all affect score calibration.

Promising next steps:

  • use multiple templates per target and aggregate a prototype set
  • add hard negative mining with confusing lookalikes
  • train a calibration head for more stable thresholds
  • add contrastive supervision between query and template embeddings
  • preserve spatial template structure instead of collapsing the exemplar into one vector
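
As a rough sketch of the first item, here are two simple ways to score queries against several exemplars of the same target: average the exemplars into one prototype, or keep the set and take the best match per query. Both are assumptions about how aggregation might be done, not something the post specifies, and `score_with_prototype_set` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F


def score_with_prototype_set(q, templates, mode="mean"):
    """Cosine scores of queries against K exemplars of ONE target.

    q: (N, D) projected queries, templates: (K, D). Returns (N,) scores.
    """
    q = F.normalize(q, dim=-1)
    t = F.normalize(templates, dim=-1)
    if mode == "mean":
        proto = F.normalize(t.mean(dim=0), dim=-1)   # single averaged prototype
        return q @ proto
    if mode == "max":
        return (q @ t.T).max(dim=1).values           # best-matching exemplar per query
    raise ValueError(f"unknown mode: {mode}")


q = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
templates = torch.tensor([[2.0, 0.0], [0.0, 3.0]])   # two very different views
print(score_with_prototype_set(q, templates, "mean"))
print(score_with_prototype_set(q, templates, "max"))
```

In this toy case each query matches one exemplar exactly. The "max" mode keeps that perfect score, while the "mean" mode dilutes it because the averaged prototype sits between the two views, which is the usual trade-off when exemplars of one target look very different.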

Related work and implementations
