May 2026

Django · NetBox

Django internals where confident guesses fail. NetBox is a real Django codebase with cross-app signals, object-level permission QuerySets, GenericRelations, and weighted search indexes. This benchmark tests whether an agent can follow the actual implementation rather than just recall common Django patterns.

Benchmark setup

Repository: netbox-community/netbox
Stack: Django / Python · ~1,165 Python files, 9 apps
Tessra version: v2.19.x
Date evaluated: May 2026
Branch / snapshot: main branch, May 2026 snapshot

Results summary

Model	Without Tessra	With Tessra	Δ
Claude Haiku 4.5	12.375 / 25 (49.5%)	24.0 / 25 (96%)	+11.625 pts (+46.5 pp)
Claude Sonnet 4.6	13.25 / 25 (53%)	23.125 / 25 (92.5%)	+9.875 pts (+39.5 pp)

Tessra improved both lower-cost and premium models by grounding answers in the current implementation instead of generic Django or NetBox assumptions.

Results per case

Case	Question	Without Tessra	With Tessra	Δ
01	What signal receivers fire on Site.save()? Which models in other apps update?	Haiku 1.25 / 5 · Sonnet 1.25 / 5	Haiku 5.0 / 5 · Sonnet 3.125 / 5	Haiku +3.75 · Sonnet +1.875
03	What are Device's direct parents? How many mixins compose NetBoxFeatureSet?	Haiku 4.0 / 5 · Sonnet 4.0 / 5	Haiku 5.0 / 5 · Sonnet 5.0 / 5	Haiku +1.0 · Sonnet +1.0
05	Where does Interface._site update when a Rack changes Site?	Haiku 4.125 / 5 · Sonnet 4.0 / 5	Haiku 5.0 / 5 · Sonnet 5.0 / 5	Haiku +0.875 · Sonnet +1.0
07	How does Device.objects.restrict(user) apply object-level permissions?	Haiku 0.625 / 5 · Sonnet 1.625 / 5	Haiku 5.0 / 5 · Sonnet 5.0 / 5	Haiku +4.375 · Sonnet +3.375
08	What fields does DeviceIndex index for global search, and with what weights?	Haiku 2.375 / 5 · Sonnet 2.375 / 5	Haiku 4.0 / 5 · Sonnet 5.0 / 5	Haiku +1.625 · Sonnet +2.625
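
The summary totals follow directly from the per-case scores. As a quick check, the sketch below sums the five cases for each model; the dictionary layout and the `summarize` helper are illustrative names, while the numbers are copied from the table above.

```python
# Per-case scores from the table: (without Tessra, with Tessra), each out of 5.
scores = {
    "Haiku": {
        "01": (1.25, 5.0), "03": (4.0, 5.0), "05": (4.125, 5.0),
        "07": (0.625, 5.0), "08": (2.375, 4.0),
    },
    "Sonnet": {
        "01": (1.25, 3.125), "03": (4.0, 5.0), "05": (4.0, 5.0),
        "07": (1.625, 5.0), "08": (2.375, 5.0),
    },
}
MAX_TOTAL = 5 * 5  # five cases, five points each


def summarize(cases):
    """Return (total without, total with, point gain) for one model."""
    without = sum(w for w, _ in cases.values())
    with_tessra = sum(t for _, t in cases.values())
    return without, with_tessra, with_tessra - without


for model, cases in scores.items():
    without, with_tessra, gain = summarize(cases)
    print(
        f"{model}: {without} / {MAX_TOTAL} ({without / MAX_TOTAL:.1%}) -> "
        f"{with_tessra} / {MAX_TOTAL} ({with_tessra / MAX_TOTAL:.1%}), {gain:+} pts"
    )
```

The totals match the results summary: 12.375 and 24.0 for Haiku, 13.25 and 23.125 for Sonnet.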

Key findings

restrict() uses pk__in subquery, not .distinct()

Both models confidently guessed the common Django answer: .filter(...).distinct(). The actual NetBox code does something more surgical: it builds a separate pk__in subquery to avoid duplicate rows without applying DISTINCT to the outer query. Tessra surfaced the exact implementation and the inline comment explaining why.

# What both models guessed without Tessra:
return self.filter(constraints_q).distinct()

# Actual code in utilities/querysets.py:64-68:
allowed_objects = self.model.objects.filter(attrs)
return self.filter(pk__in=allowed_objects)
# "#8715: Avoid duplicates when JOIN on M2M without DISTINCT"

.distinct() applies DISTINCT to the entire outer query. The pk__in subquery confines the join that produces duplicates to the inner query, so the outer QuerySet returns each object once without needing DISTINCT.
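
The difference is easiest to see at the SQL level. The following is a minimal sketch using Python's built-in sqlite3 module rather than Django or NetBox; the tables (`device`, `tag`, `device_tags`) and their rows are hypothetical and stand in for any many-to-many relation. It shows a join returning a row twice, and the same condition wrapped in an `IN` subquery returning it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, slug TEXT);
    CREATE TABLE device_tags (device_id INTEGER, tag_id INTEGER);
    INSERT INTO device VALUES (1, 'edge-01'), (2, 'core-01');
    INSERT INTO tag VALUES (1, 'prod'), (2, 'critical');
    -- edge-01 carries both tags, so a join on tags matches it twice
    INSERT INTO device_tags VALUES (1, 1), (1, 2), (2, 1);
""")

# Filtering across the M2M join directly: one output row per matching tag.
JOINED = """
    SELECT device.name FROM device
    JOIN device_tags ON device_tags.device_id = device.id
    JOIN tag ON tag.id = device_tags.tag_id
    WHERE tag.slug IN ('prod', 'critical')
    ORDER BY device.id
"""

# Same condition, but the join lives inside an IN subquery on the primary key.
SUBQUERY = """
    SELECT name FROM device
    WHERE id IN (
        SELECT device.id FROM device
        JOIN device_tags ON device_tags.device_id = device.id
        JOIN tag ON tag.id = device_tags.tag_id
        WHERE tag.slug IN ('prod', 'critical')
    )
    ORDER BY id
"""

joined_rows = [name for (name,) in conn.execute(JOINED)]
subquery_rows = [name for (name,) in conn.execute(SUBQUERY)]

print(joined_rows)    # edge-01 appears twice
print(subquery_rows)  # each device appears once
```

Because the outer query in the second form never joins, it carries no DISTINCT clause of its own, which is the property the NetBox comment points at.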

Cross-app signal cascade hidden from the origin model

Case 01 asks what updates when Site.save() fires signals. The answer is not visible from the Site model alone: a receiver propagates cached field updates to Prefix, Cluster, and WirelessLAN via bulk_update. Without Tessra, both models missed or confused the cascade (1.25 / 5 each). With Tessra, Haiku identified the receiver, the affected apps, and the updated fields (5.0 / 5), while Sonnet improved to 3.125 / 5.

DeviceIndex weights: asset_tag outranks name

Without Tessra, both models assumed name was the highest-priority search field and returned wrong weights for serial and asset_tag. In the code, asset_tag (weight 50) and serial (weight 60) both outrank name (weight 100), because lower weights rank higher.

# What both models assumed without Tessra:
('name', 100),      # "name is always most important"
('serial', 500),    # invented
('asset_tag', 500), # invented

# Actual code in dcim/search.py:55-62:
('asset_tag',       50),   # ← HIGHER priority (lower number)
('serial',          60),
('name',           100),
('virtual_chassis', 200),
('description',    500),
('comments',      5000),
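
To make the ordering concrete, the short sketch below sorts the weights listed above in ascending order, assuming (as the benchmark states) that a lower weight means a higher search priority. The list is deliberately shuffled, and the variable names are illustrative rather than taken from NetBox.

```python
# Field weights as listed for DeviceIndex above, in arbitrary order.
device_index_fields = [
    ("name", 100),
    ("comments", 5000),
    ("asset_tag", 50),
    ("description", 500),
    ("serial", 60),
    ("virtual_chassis", 200),
]

# Lower weight ranks higher, so an ascending sort yields the priority order.
ranked = sorted(device_index_fields, key=lambda field_weight: field_weight[1])

for rank, (field, weight) in enumerate(ranked, start=1):
    print(f"{rank}. {field} (weight {weight})")
```

Sorted this way, name lands third, behind asset_tag and serial.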

Known limitations

This benchmark measures architectural navigation on NetBox only. Cases where a model already knew the answer from public training data were excluded from the canonical panel. Results may vary on other Django projects with different architecture, conventions, and documentation. LLM outputs are non-deterministic.

Try it yourself

See what Tessra surfaces in your repo.

Index an Angular, Django, or Flutter repo and try local context for 7 days.
