Open Source Benchmarks · 2026

Tested on real repos.
Results you can verify.

Four benchmarks on public open source repos spanning Angular, Django, and Flutter. Every prompt, evaluation criterion, result, and limitation is available for review.

Tessra does more than find files. It gives AI agents structured context to understand relationships, impact, and architecture in real repos.

4 repos evaluated
up to +77 pp improvement
3 stacks
evidence included

pp = percentage points; some benchmarks use points or case-level criteria.

All results

Four benchmarks. Three stacks. Fully reviewable.

Each benchmark ran on a real, public repository. The prompts, evaluation criteria, results, and known limitations are available for review.

Repo | Stack | Size (files) | Model | Without Tessra | With Tessra | Improvement | What it tested
ThingsBoard | Angular 17+ | ~8 000 | Sonnet 4.6 | 7% | 84% | +77 pp | DI inject(), lazy routes, non-obvious callers
NetBox | Django / Python | ~1 165 | Haiku 4.5 | 49.5% | 96% | +46.5 pp | Signals, QuerySet internals, SearchIndex weights
ngrx-platform | Angular + NgRx, Nx | ~1 379 | Conservative local comparison | 8 / 10 | 9 / 10 | +1 pt | NgRx internals, workflow leverage, architectural tracing
Ente Photos | Flutter / Dart | ~4 061 | Validated case | 2 / 3 | 3 / 3 | +1 criterion | Cognitive leverage, architectural tracing, cross-module flows

Most rows show normalized benchmark scores. Ente Photos is shown as a focused validated case: 2/3 to 3/3 on SelectionState, plus directional evidence of stronger architectural tracing in cross-module flows.

Improvements are shown in the unit that fits each benchmark: percentage points (pp), points, or criteria.
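As a quick check on the units, the sketch below recomputes the percentage-point lifts from the scores listed on this page and contrasts them with relative change, which is a different quantity. The function names are illustrative and not part of any Tessra tooling.

```python
from typing import Optional

# Scores copied from this page: (without Tessra, with Tessra), in percent.
scores = {
    "ThingsBoard / Sonnet 4.6": (7.0, 84.0),
    "ThingsBoard / Haiku 4.5": (0.0, 68.0),
    "NetBox / Haiku 4.5": (49.5, 96.0),
    "NetBox / Sonnet 4.6": (53.0, 92.5),
}


def lift_pp(without: float, with_tessra: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return with_tessra - without


def lift_relative(without: float, with_tessra: float) -> Optional[float]:
    """Relative change in percent; undefined when the baseline is 0%."""
    if without == 0:
        return None
    return (with_tessra - without) / without * 100


for name, (before, after) in scores.items():
    print(f"{name}: +{lift_pp(before, after):g} pp")
```

Percentage points stay well defined even for the 0% baseline row, where a relative change cannot be computed.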

In these benchmarks, Tessra helped models produce more complete architecture-level answers and reduce blind exploration. In the ThingsBoard and NetBox benchmarks, Haiku 4.5 with Tessra scored above Sonnet 4.6 without it (68% vs. 7% and 96% vs. 53%).

Some public repos may already be well represented in frontier-model training data. When a case is already solved or mostly solved without Tessra, it is marked as saturated or directional and excluded from strong lift claims.

Angular · ThingsBoard

Angular across ~8 000 TypeScript files

Real lazy routing, modern inject() DI, and deeply nested routes: questions where text search alone is not enough to explain the full relationship.

Sonnet 4.6: 7% without Tessra → 84% with Tessra (+77 pp)
Haiku 4.5: 0% without Tessra → 68% with Tessra (+68 pp)

Test cases

01 Which classes depend on CalculatedFieldFormService via inject()?
02 How does AlarmRulesComponent inject its dependencies?
03 Which component renders the /dashboards route?
04 Where to add a sibling route to /profiles/deviceProfiles?
05 Which callers use AlarmRulesService — constructor or inject()?
thingsboard/thingsboard
~8 000 TypeScript files · Angular 17+
Angular · inject() · lazy routing
Key finding
A functional caller inside a ResolveFn was correctly identified through Tessra's inject_di graph. Without Tessra, the model described how it would search. With Tessra, it returned the exact caller, invoked method, and correct dependency classification.
Django · NetBox

Django internals across 9 apps

Cross-app signal receivers, QuerySet permission logic, and SearchIndex weights: internal implementation details that are not solved by public documentation alone.

Haiku 4.5: 49.5% without Tessra → 96% with Tessra (+46.5 pp)
Sonnet 4.6: 53% without Tessra → 92.5% with Tessra (+39.5 pp)

Test cases

01 What signal receivers fire on Site.save()? Which models in other apps update?
02 What are Device's direct parents? How many mixins compose NetBoxFeatureSet?
03 Where does Interface._site update when a Rack changes Site?
04 How does Device.objects.restrict(user) apply object-level permissions?
05 What fields does DeviceIndex index for global search, and with what weights?
netbox-community/netbox
~1 165 Python files · 9 Django apps
Django · signals · QuerySet
Key finding
restrict() uses pk__in=subquery, not .distinct(). Without Tessra, both models confidently answered .distinct(). With Tessra, the context led to the actual mechanism.
Haiku 4.5 + Tessra (96%) outperformed Sonnet 4.6 alone (53%) in this benchmark.
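To make the two query shapes in this finding concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables and rows are hypothetical and this is not NetBox's code; it only illustrates the general difference between filtering through a join (which can return duplicate rows and therefore invites DISTINCT) and filtering by primary-key membership in a subquery.

```python
import sqlite3

# Hypothetical schema for illustration only; not NetBox's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE device_tag (device_id INTEGER, tag TEXT);
    INSERT INTO device VALUES (1, 'core-sw-1'), (2, 'edge-rt-1');
    INSERT INTO device_tag VALUES (1, 'prod'), (1, 'dc-east'), (2, 'lab');
""")

# Shape 1: filter through a JOIN. Device 1 matches two tag rows, so it is
# returned twice unless DISTINCT is added.
joined = conn.execute("""
    SELECT device.id FROM device
    JOIN device_tag ON device_tag.device_id = device.id
    WHERE device_tag.tag IN ('prod', 'dc-east')
    ORDER BY device.id
""").fetchall()

# Shape 2: filter by primary-key membership in a subquery. Each device row
# is tested once, so no duplicates appear and no DISTINCT is needed.
by_pk = conn.execute("""
    SELECT id FROM device
    WHERE id IN (
        SELECT device_id FROM device_tag WHERE tag IN ('prod', 'dc-east')
    )
    ORDER BY id
""").fetchall()

print(joined)  # [(1,), (1,)]
print(by_pk)   # [(1,)]
```

Both shapes can produce the same final set of rows, which is why an answer based on assumption rather than on the code can confidently name the wrong mechanism.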
Angular + NgRx · ngrx-platform

NgRx internals: conservative +1 lift, stronger navigation

On this public NgRx monorepo, Tessra reached 9/10 in a verified local-code run. Against a conservative no-Tessra baseline of 8/10, Tessra shows a +1 point lift and clearer navigation through effects internals, specs, entity adapters, and router-store state behavior.

Conservative comparison: 8 / 10 → 9 / 10 (+1 pt)
The raw no-Tessra baseline reached 9/10; it was adjusted down to 8/10 to account for possible editor/chat context retention.

Test cases

01 Which internal classes initialize an Effect during bootstrap?
02 Which test files cover createAction()?
03 Which Nx libraries declare a dependency on @ngrx/store?
04 What private methods does EntityStateAdapter expose internally?
05 How does @ngrx/router-store connect router actions to reducers?
ngrx/platform
~1 379 TypeScript files · Nx monorepo
Angular · NgRx · Nx
Key finding
The win is not only the extra point. The bigger win is a cleaner path to the engineering answer: faster symbol navigation, caller/callee context, related specs, interface boundaries, reducer behavior, and state types with less manual exploration.
Flutter · Ente Photos

More cognitive leverage for architectural tracing

Cross-module flows across ≥4 module boundaries in an active Flutter app. This benchmark measures context completeness and cognitive leverage, not failure recovery.

Validated case · SelectionState: 2 / 3 → 3 / 3 (+1 criterion)
Haiku 4.5: baseline search → stronger answer (directional)

Baseline found the core mechanism. Tessra improved answer quality, focus, and completeness. Not a zero-to-perfect case.

Test cases

01 How does hasMigratedSizes() decide whether to make the HTTP call?
02 What is the full path from UI event to DB for delete suggestions?
03 How does trashFilesOnServer validate file ownership before the request?
04 What endpoint and batch size does the hasMigratedSizes backfill use?
05 Why does SelectionState's InheritedWidget have updateShouldNotify=false?
ente-io/ente
~4 061 Dart files · Flutter
Flutter · cross-module · EventBus
Key finding
Baseline search found the core mechanism. Tessra added cognitive leverage: it gave the agent a clearer working map of the repo, connected relevant symbols faster, reduced blind exploration, and produced a cleaner architecture-level explanation. The win is not that Tessra finds a file. The win is that it helps the agent turn scattered code paths into an engineering-level explanation.
Methodology

Cases where file search is not enough.

A model can answer basic questions when it knows the public API. These cases test something harder: following internal relationships across modules, services, routes, signals, effects, tests, and dependencies.

01
Cross-module cases
Every case requires at least 4 hops across modules to reach the correct answer. They are designed so text search alone is not enough. Each case also includes plausible wrong answers found in public documentation or common model assumptions.
02
Four evaluation criteria
We evaluate accuracy, architecture, actionability, and evidence. Each case can score up to 5 points based on whether the answer identifies correct facts, explains internal relationships, proposes a useful path forward, and cites verifiable evidence.
03
Cases already known by the model
If a frontier model already answers correctly without Tessra, that case does not measure Tessra's contribution well. We mark it as saturated and exclude it from the main result.
Baseline models had standard file reading and search access. Tessra adds structured context on top: symbols, callers, impact radius, and cross-module relationships.
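The scoring rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published scoring script: the normalization (points earned divided by the maximum for non-saturated cases) is assumed, the saturated flag is set by hand, and the case data is hypothetical rather than taken from any report.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_POINTS_PER_CASE = 5  # from the methodology: each case can score up to 5 points


@dataclass
class CaseResult:
    case_id: str
    without_tessra: float  # points out of 5
    with_tessra: float     # points out of 5
    saturated: bool = False  # baseline already answers correctly without Tessra


def normalized(results: List[CaseResult], attr: str) -> Optional[float]:
    """Percent of available points earned, skipping saturated cases (assumed formula)."""
    scored = [r for r in results if not r.saturated]
    if not scored:
        return None
    total = sum(getattr(r, attr) for r in scored)
    return 100 * total / (MAX_POINTS_PER_CASE * len(scored))


# Hypothetical numbers for illustration only.
results = [
    CaseResult("01", without_tessra=1, with_tessra=4),
    CaseResult("02", without_tessra=0, with_tessra=5),
    CaseResult("03", without_tessra=5, with_tessra=5, saturated=True),
    CaseResult("04", without_tessra=2, with_tessra=3),
]

before = normalized(results, "without_tessra")
after = normalized(results, "with_tessra")
print(f"{before:g}% -> {after:g}% (+{after - before:g} pp)")
```

Excluding case 03 keeps a question the model already solves from inflating either side of the comparison.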
Why it matters

What this means for developers.

Large repos are not hard because files are hidden. They are hard because the answer is spread across routes, services, state, effects, signals, serializers, querysets, widgets, and APIs. Tessra gives that structure to the agent so it knows what connects to what before touching code. The point is not memorizing these repos; the point is giving the model real code relationships when it needs them.
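One way to picture "what connects to what" is a reverse walk over a call graph. The sketch below is a generic illustration, not Tessra's implementation: the symbol names and edges are hypothetical, and the traversal is a plain breadth-first search that lists every direct or transitive caller of a symbol with its hop distance.

```python
from collections import deque
from typing import Dict, List

# Hypothetical "who calls whom" edges: caller -> callees.
calls: Dict[str, List[str]] = {
    "routes.resolver": ["services.RulesService.load"],
    "ui.ListComponent": ["services.RulesService.load"],
    "services.RulesService.load": ["http.ApiClient.get"],
    "http.ApiClient.get": [],
}


def reverse_edges(graph: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert caller -> callees into callee -> callers."""
    callers: Dict[str, List[str]] = {name: [] for name in graph}
    for caller, callees in graph.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    return callers


def impact_radius(graph: Dict[str, List[str]], symbol: str) -> Dict[str, int]:
    """Map each direct or transitive caller of `symbol` to its hop distance."""
    callers = reverse_edges(graph)
    distance: Dict[str, int] = {}
    queue = deque([(symbol, 0)])
    while queue:
        current, hops = queue.popleft()
        for caller in callers.get(current, []):
            if caller != symbol and caller not in distance:
                distance[caller] = hops + 1
                queue.append((caller, hops + 1))
    return distance


print(impact_radius(calls, "http.ApiClient.get"))
```

Changing `http.ApiClient.get` in this toy graph touches the service one hop away and both the route resolver and the component two hops away, which a search for the method name alone would not reveal.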

Reproducibility

Inspect the evidence.

Every benchmark has a public report with tested questions, per-case results, key findings, and known limitations. No sign-up required.

Angular · ThingsBoard: Lazy routes, inject(), ResolveFn resolvers, and non-obvious cross-module callers.
Django · NetBox: Permission QuerySets, cross-app signals, and SearchIndex field weights.
Angular + NgRx · ngrx-platform: Adapters, selectors, router-store, and NgRx Effects workflow leverage.
Flutter · Ente Photos: Cross-module flows, local DB, event dispatch, and state propagation chains.

What each report includes

Questions tested
The exact prompts used to evaluate architectural navigation, without editing or cherry-picking.
Per-case results
Scores for each case with and without Tessra, including saturated and directional outcomes.
Key findings
Concrete differences in answer quality — what changed, what the model missed, and why the context mattered.
Known limitations
What the benchmark does not prove, where results may vary, and what was excluded from the main claim.

These benchmarks measure architectural navigation and context quality, not code generation. Results may vary across repos with different structural patterns. Model responses are non-deterministic — individual scores may differ across runs.

Try it yourself

See what Tessra surfaces in your repo.

Index an Angular, Django, or Flutter repo and try local context for 7 days.