Nomograph Labs

Benchmark Results

sysml-bench · 132 tasks · 4 models · 40+ conditions · N=3–5

On the Horizon

Two arXiv preprints in preparation from this data. Paper A argues that representation matters more than retrieval, with O4 and O12 as the designated primary hypothesis pair and a pre-registered confirmatory design on a second corpus. Paper B argues that aggregate benchmarks hide task-level structure, with O1, O10, and O8 as lead evidence and a methodological contribution on per-task analysis with paired effect sizes. Both frame the current study as exploratory. We will link them here when they are posted.

Overview

Corpus

Eve Online Mining Frigate SysML v2 model. 19 files, 798 elements, 1,515 relationships.

Models

Claude 3.5 Sonnet, GPT-4o, GPT-4o-mini, o3-mini.

Replication

N=3 exploratory sweeps. N=5 for key comparisons (O1, O3, O4, O12). T=0.3 for all CLI runs.

Caveat

All results exploratory. None survive multiple comparison correction across 14 observations.

sysml-bench is an exploratory benchmark evaluating how tool-augmented LLMs perform on structured engineering tasks in SysML v2. 132 tasks across 8 categories (discovery, reasoning, explanation, layer, boundary, vector-sensitive, structural trace, corpus scaling), tested with 4 models and 40+ experimental conditions.

The study generated 14 observations. Three achieved nominal statistical significance. None survive correction for running 14 tests simultaneously. We want to be clear about what this study is and isn't: it is a well-characterized exploratory study that identifies patterns and estimates effect sizes. It is not a confirmatory study that proves those patterns are real. The contribution is the benchmark methodology, the identification of task-tool interaction as a key variable, and the effect size estimates that make confirmatory follow-up designable.

That said, the patterns are consistent and the effect sizes are large. We think they are worth sharing and worth testing further.

Two Central Theses

The 14 observations, three significance tests, and extensive null results converge on two theses. These are the study's intellectual contribution: specific, testable claims that a second corpus can confirm or refute.

Thesis 1

Representation > Retrieval

Lead evidence: O4 (d=0.83), O12 (d=0.75), O8 (d=0.64)

Null results that constrain the claim: O5, O6, O9, O14

Representation matters more than retrieval

For AI systems working with structured knowledge, the form in which information is presented to the model matters more than the mechanism used to find it.

Most of the energy in the tool-augmented LLM space is going into retrieval infrastructure: vector databases, graph traversal engines, multi-hop reasoning chains. On our benchmark, every retrieval intervention we tested produced null results: vector search (O5), planning tools (O6), graph traversal at 2–3 hops (O9), and graph traversal at 4–5 hops (O14).

Every representation and guidance intervention produced large effects: pre-rendered views (O4, d=0.83), guided tool selection (O12, d=0.75), and CLI search output over RAG chunks on discovery tasks (O8, d=0.64).

The implication, if these patterns hold on other corpora, is that the marginal investment in retrieval infrastructure has near-zero return on structured engineering corpora of this scale. The marginal investment in representation infrastructure (views, summaries, structured renderings of domain artifacts) has large return. This is testable across every domain where LLMs operate on structured data.

Thesis 2

Aggregate Benchmarks Lie

Lead evidence: O1 (per-task d: −0.400 to +0.800)

Supporting: O10 (bimodal scaling), O8 (task-type interaction), O4 (range +1.000 to −0.200)

Aggregate benchmarks hide task-level structure

Standard benchmark reporting (mean accuracy across tasks) actively misleads when task-level variance dominates condition-level variance.

Our aggregate tool comparison showed no significant difference (O1, p=0.391). A naive reading: tools don't matter. But per-task analysis revealed effect sizes ranging from −0.400 to +0.800. Enormous, opposite-signed effects that cancel in the mean. The aggregate null is not "no effect." It is "large effects in both directions, hidden by averaging."

This pattern recurs throughout the data: O10's bimodal scaling (easy tasks stay easy, hard tasks become impossible), O8's task-type interaction (CLI wins discovery, RAG edges ahead on reasoning), and O4's per-task range of +1.000 to −0.200.

Every benchmark that reports a single accuracy number is potentially hiding this structure. The methodological contribution is demonstrating that per-task analysis with paired effect sizes is necessary to surface real patterns in tool-augmented LLM evaluation.
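The paired effect sizes used throughout this analysis can be computed with a small helper. This is a sketch (the function name is ours, not the benchmark's): Cohen's d for paired samples is the mean of the per-item score differences divided by the standard deviation of those differences.

```python
from statistics import mean, stdev

def paired_cohens_d(cond_a: list[float], cond_b: list[float]) -> float:
    """Cohen's d for paired samples (d_z): mean of the per-item
    score differences over the SD of those differences."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    return mean(diffs) / stdev(diffs)

# A condition that is consistently a little better yields a large
# paired d even when the raw score gap is modest:
d = paired_cohens_d([0.9, 0.8, 0.7, 0.6], [0.5, 0.5, 0.5, 0.5])
```

Note that the denominator is the SD of the differences, not of either condition: consistent per-task pairing is what lets modest raw gaps produce large paired effect sizes.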


Observations

14 observations from the benchmark. We present the five most interesting ones in detail, then summarize the remaining nine. All p-values are from individual tests and have not been corrected for running 14 tests simultaneously. When corrected, none remain significant across the full set.

0.887

Guided graph score vs 0.750 unguided. Matches the 2-tool baseline at 0.880.

p=0.009 (paired t, uncorrected)

d=0.75, N=16 tasks

Power: 0.80

O12 — Context engineering outperforms tool restriction

The naive response to "too many tools hurt performance" is to restrict the tool set. The better response is a sentence in the system prompt. When agents are instructed to start with search and read_file, escalating to graph tools only when search is insufficient, the 13-point discovery penalty from over-tooling disappears entirely. Performance with 6 tools matches and marginally exceeds the 2-tool baseline (0.887 vs 0.880).

The affected tasks (D11, D12, D16, D6) are those where unguided agents select structurally complex tools for attribute-lookup tasks that search handles trivially. The agent doesn't need graph traversal to find a part's mass. It needs search. But without guidance, it reaches for the most powerful tool available, and the overhead of using it (more tokens, more turns, more opportunities to go off track) costs accuracy.

This is the only adequately powered observation in the study (power=0.80). It is also the lowest nominal p-value (0.009). If we had to pick one finding to bet on replicating, this would be it.
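The intervention itself is tiny. A hypothetical version of the guidance hint, wired in as a prompt suffix rather than a tool-set restriction, might look like this (the study's exact wording is not reproduced here; the sentence below is illustrative):

```python
# Illustrative ~50-token tool-selection hint of the kind O12 describes.
# The benchmark's exact prompt wording may differ.
GUIDANCE = (
    "Start with search and read_file. Escalate to graph tools "
    "(trace, query, inspect) only when search is insufficient."
)

def build_system_prompt(base: str, guided: bool) -> str:
    """Append the guidance sentence instead of removing tools."""
    return f"{base}\n\n{GUIDANCE}" if guided else base
```

The design point O12 makes: the full 6-tool set stays available, and one sentence of context engineering recovers the 13-point penalty that tool restriction would otherwise be used to avoid.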

Config           Score   $/task
Pre-rendered     0.873   $0.008
Agent-assembled  0.490   $0.053

Tasks E4 and E7: 1.000 with render, 0.000 with agent assembly.

p=0.047 (Wilcoxon, uncorrected)

d=0.83, N=8 tasks

O4 — Pre-rendered views outperform agent-assembled context

On explanation tasks, pre-rendered model views scored 0.873 vs 0.490 for letting the agent assemble its own context. A 38-point gap. Two tasks collapsed entirely without rendering: 1.000 with a pre-rendered view, 0.000 with agent assembly. The agent-assembled approach exhausts turns building context that a single render call provides.

The advantage is explanation-specific. On discovery tasks, pre-rendering scored 0.719, worse than search (0.880). Pre-rendering the wrong view adds noise, not signal. This matters: it means pre-rendering is not a universal improvement. It is a task-dependent one, and the task type determines whether it helps or hurts.

The cost difference is striking: $0.008 per task with rendering vs $0.053 with agent assembly. Better results at 6.6× lower cost. The pre-rendered view does the work at index time that the agent would otherwise do at query time, and it does it once instead of per-query.

This has the largest effect size in the study (d=0.83) but is underpowered (power=0.53, needs 14 tasks for 80%). The effect is real but we can't be confident in its magnitude yet.

Task type   CLI     RAG
Discovery   0.855   0.566
Reasoning   0.323   0.459

Discovery: p=0.021 (paired t), d=0.64

Reasoning: p=0.403 (not significant)

O8 — Retrieval strategy interacts with task type

CLI tool-based search dominated structured lookup (+29 points over RAG, p=0.021, d=0.64, N=16 tasks). RAG edged ahead on cross-file reasoning (+14 points, p=0.403, not significant), likely because it injects all relevant context at once, avoiding the problem where the agent runs out of turns before it can chain together enough tool calls to answer multi-step questions.

The CLI advantage on discovery is driven by 5 tasks where RAG scores 0.000: tasks requiring iterative tool-mediated retrieval that single-shot context injection cannot perform. The model needs to search, read a result, search again based on what it found, and repeat. RAG gives it everything at once, which is the right coverage but the wrong format for these tasks.

Neither retrieval architecture is universally better. This is interesting because it suggests that the right approach is not to pick one, but to route queries to the right strategy based on task type. Whether that routing can be done cheaply enough to be practical is an open question.
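A routing layer of the kind this suggests could be as simple as a per-task-type dispatch. This is a hypothetical sketch, not something the benchmark implements, and it glosses over the hard part (classifying the incoming query cheaply):

```python
def route(task_type: str) -> str:
    """Dispatch to the strategy that won on each task type in O8:
    CLI tool search for structured lookup, single-shot RAG context
    injection for cross-file reasoning. Hypothetical sketch only."""
    return "cli_search" if task_type == "discovery" else "rag"
```

The open question from the text remains: this only pays off if the task-type classifier is cheaper than the accuracy it recovers.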

Task   Search   Graph
D11    1.000    0.200
D6     1.000    0.400
D13    0.600    1.000
D10    0.700    1.000

Aggregate: p=0.391 (not significant). Per-task: up to 0.80 difference.

O1 — Tool-task interaction is heterogeneous

Graph tools hurt discovery tasks, help layer tasks, and are near-neutral on reasoning. The aggregate difference is not statistically significant (paired t-test p=0.391, N=16) because the effect is task-dependent: graph tools help on tasks requiring structural completeness checking (D10, D13: +0.300 to +0.400) and hurt on tasks where search retrieves the answer directly (D11, D6: −0.600 to −0.800).

The pattern holds across all four models tested, making it one of the most robust qualitative observations in the benchmark despite the null aggregate test. This is the lead evidence for Thesis 2: the aggregate null is not "tools don't matter." It is "tools matter enormously, but in opposite directions on different tasks, and the average hides everything interesting."
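The cancellation is easy to reproduce from the per-task scores in the table above (a sketch over the four tasks shown; the full analysis covers all 16):

```python
from statistics import mean

# Per-task scores from the O1 table (search vs graph tool sets)
search = {"D11": 1.000, "D6": 1.000, "D13": 0.600, "D10": 0.700}
graph  = {"D11": 0.200, "D6": 0.400, "D13": 1.000, "D10": 1.000}

per_task = {t: round(graph[t] - search[t], 3) for t in search}
aggregate = round(mean(per_task.values()), 3)
# per_task swings from -0.8 to +0.4; the aggregate is a muted -0.175
```

Per-task differences of −0.8 and +0.4 average to −0.175: exactly the "large effects in both directions, hidden by averaging" pattern of Thesis 2.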

Corpus              Score
19 files            0.880
95 files, search    0.423
95 files, graph     0.389
95 files, +vectors  0.409

Failure modes: 55% budget exhaustion, 27% reasoning errors, 0% search failure.

O10 — Corpus scale is the dominant difficulty factor

Performance roughly halves from 19 to 95 files (0.880 to 0.423). Graph tools and vector search make things worse at scale: schema overhead and retrieval noise compound without compensating signal. 11 of 20 scaling tasks fall below 0.333. The distribution is bimodal: easy tasks remain easy, hard tasks become impossible.

The failure mode is revealing: 55% of the time the agent ran out of turns before finishing. 27% were reasoning errors. 0% were search failures. The agent can find the information. It just can't process enough of it within the turn budget to reach the right answer. This suggests the path forward is better orchestration (smarter turn allocation, hierarchical planning) rather than better search.

This is the observation that keeps us honest. Our other results come from a 19-file corpus. Real engineering repositories are hundreds or thousands of files. The scaling problem is unsolved by any method we tested, and small-corpus benchmarks produce optimistic estimates that may not transfer.

Remaining observations summarized. Full details in the benchmark repository.

Other observations

ID    Summary                                                                          Classification
O2    Model quality gap: Sonnet consistently outperformed OpenAI models                Descriptive
O3    o3-mini is the only model where graph tools help on reasoning (+0.056)           Exploratory (power=0.08)
O5    Vector search: exact tie with keyword search on small corpus (0.880 vs 0.880)    Null
O6    Planning tools (sysml_stat, sysml_plan): +0.035 on hard tasks, not significant   Null
O7    RFLP layer tasks: cli_full showed slight advantage (~0.25 effect)                Exploratory
O9    Graph tools at 2–3 hops: no benefit (d=0.16, power=0.07)                         Null
O11   Turn budget is a partial bottleneck but not the whole story                      Descriptive
O13   Few-shot examples hurt mini models (GPT-4o-mini, o3-mini)                        Exploratory
O14   Graph tools at 4–5 hops: not significant (d=0.44, power=0.19)                    Null (underpowered)

Methodology

Scoring

Per-field structured scoring. Each task defines expected fields with typed scorers: Bool (exact match), Float (numeric within tolerance), Str (exact string match), StrContains (case-insensitive substring), ListStr (F1 score with 0.8 threshold, binarized). Task score = mean of field scores. Condition score = mean of task scores across N runs.

Known issue

ListStr binarization at 0.8 creates cliff effects. A response scoring 0.79 on F1 receives 0.0; one scoring 0.80 receives 1.0. This amplifies variance.
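The ListStr cliff is easy to see in a reconstruction of the scorer (a sketch from the description above; the benchmark's actual implementation lives in the repository):

```python
def score_list_str(expected: list[str], got: list[str],
                   threshold: float = 0.8) -> float:
    """ListStr scorer as described: F1 over items, then binarized.
    The cliff: F1 just below the threshold scores 0.0, at it, 1.0."""
    exp, pred = set(expected), set(got)
    tp = len(exp & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(exp)
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 if f1 >= threshold else 0.0

# Recalling 4 of 5 items (F1 ≈ 0.89) scores full marks;
# recalling 3 of 5 (F1 = 0.75) scores zero.
```

One missed item moves the field score from 1.0 to 0.0 with nothing in between, which is the variance amplification the Known issue describes.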

Tool sets

Tool Set    Tools   Schema Tokens  Description
cli_search  2       ~250           search + read_file
cli_graph   6       ~1120          search + trace + check + query + inspect + read_file
cli_render  7       ~1200          cli_graph + sysml_render
cli_full    9       ~1500          cli_render + sysml_stat + sysml_plan
+guided     varies  +~50           System prompt with tool selection hint
+vectors    varies  +0             Adds fastembed HNSW vector index

Known confounds

cli_search (2 tools, ~250 schema tokens) vs cli_graph (6 tools, ~1120 schema tokens) confounds tool count, schema overhead, and selection complexity. When graph tools "hurt," the cause could be the tools themselves, the schema overhead consuming context window space, or the difficulty of choosing the right tool from a larger set. O12 (guided selection) partially disentangles selection difficulty. A schema ablation experiment (same tools, reduced schema) would isolate overhead.

Ground truth

Created by the primary author from SysML v2 model inspection. Two corrections applied during experimentation: D16 (35.0→37.0), R5 (3→2). Structural trace scoring schema corrected in v2 (ST2/ST7 scores changed 0.542→0.865). No independent verification. Single-author ground truth is a limitation.

Statistical Context

Holm-Bonferroni

Step-down correction at α=0.05 across 14 observations. Controls family-wise error rate.

Effect size conventions

Cohen's d: 0.2 = small, 0.5 = medium, 0.8 = large. Calibrated for behavioral science; benchmark score differences may have different practical significance.

Multiple comparison correction

14 observations tested at α=0.05 yields a family-wise error rate of about 51%. When corrected (Holm-Bonferroni step-down), no observation remains significant:

Rank  Observation     Raw p  Holm threshold  Survives?
1     O12 (guided)    0.009  0.0036          No
2     O8 (discovery)  0.021  0.0038          No
3     O4 (render)     0.047  0.0042          No

If O4 and O12 are designated as the only two hypotheses under test (Holm-Bonferroni at m=2), both survive: O12 adjusted p=0.018, O4 adjusted p=0.047. This designation was made after seeing the data. It becomes pre-registered only if declared before collecting new data on a second corpus.
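Both the family-wise error rate and the step-down correction are a few lines to verify (a sketch; the function name is ours):

```python
def holm_adjusted(pvals: list[float]) -> list[float]:
    """Holm-Bonferroni step-down adjusted p-values, in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])  # (m - rank) * p, monotone
        adj[i] = min(1.0, running)
    return adj

# 14 uncorrected tests at alpha=0.05: FWER = 1 - 0.95^14, about 51%
fwer = 1 - (1 - 0.05) ** 14

# At m=14 even the best p-value fails (0.009 * 14 = 0.126 > 0.05);
# at m=2 (O12 and O4 designated in advance) both survive.
adj2 = holm_adjusted([0.009, 0.047])  # [0.018, 0.047]
```

This reproduces the numbers in the text: O12 adjusted p=0.018 and O4 adjusted p=0.047 at m=2, and nothing surviving at m=14.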

Power analysis

Only one observation (O12) has enough statistical power to reliably detect its effect. The power analysis tells us exactly how large a follow-up study needs to be. This is itself a contribution: it makes the confirmatory work designable.

Observation               Effect size (d)  Current tasks  Power  Tasks for 80%  Tasks for 80% (α=0.025)
O12 (guided selection)    0.75             16             0.80   17             21
O8 (CLI vs RAG)           0.64             16             0.70   20             25
O4 (render vs assembly)   0.83             8              0.53   14             17
O1 (heterogeneity)        0.22             16             0.13   163            210
O14 (graph 4–5 hops)      0.44             8              0.19   42             53

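The "Tasks for 80%" column can be approximated from first principles. This is a normal-approximation sketch; the exact values above come from the noncentral-t calculation, which demands a few more tasks at small n:

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(d: float, n: int, z_crit: float = 1.96) -> float:
    """Normal-approximation power of a two-sided paired test at
    alpha=0.05. Optimistic at small n vs the exact t-based numbers."""
    return norm_cdf(d * sqrt(n) - z_crit)

def tasks_for_power(d: float, target: float = 0.80,
                    z_crit: float = 1.96) -> int:
    """Smallest n whose approximate power reaches the target."""
    n = 2
    while approx_power(d, n, z_crit) < target:
        n += 1
    return n

# d=0.75 gives 14 here vs 17 in the table (the t-distribution penalty
# at small n); pass z_crit=2.24 to approximate the alpha=0.025 column.
```

The gap between the approximation and the table is largest exactly where it matters, at small task counts, which is one more reason the O4 power estimate (8 tasks) should be treated cautiously.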
Threats to Validity

Six identified threats. The single-corpus limitation is the most fundamental.

T1: Single corpus. All primary observations derive from one 19-file SysML v2 model. The corpus is small enough that exhaustive search may substitute for structured traversal, potentially explaining why graph tools show no advantage. All claims are scoped to "on our benchmark corpus."

T2: Confounded tool sets. cli_search vs cli_graph differs in tool count, schema overhead, and selection complexity simultaneously. O12 partially disentangles selection difficulty.

T3: Multiple comparisons. 14 observations at α=0.05 yields ~51% family-wise error rate. No observation survives full correction.

T4: Underpowered tests. Only O12 achieves 80% power. Most observations would need 30–300+ tasks to detect their effects reliably.

T5: Scoring methodology. ListStr binarization at 0.8 creates cliff effects. StrContains scoring for explanation tasks may be too lenient.

T6: Ground truth. Created and verified by a single author. Two corrections applied mid-experiment. No inter-rater reliability assessment.

What's Next

Confirmatory study

Pre-register O4 and O12 as primary hypotheses before collecting data on a second SysML v2 corpus. Design task sets for 80% power: 20+ explanation tasks, 20+ discovery tasks. This converts the post-hoc designation into genuine pre-registration.

Additional experiments we'd like to run: continuous scoring (remove the 0.8 binarization threshold), schema overhead ablation (isolate whether the graph tool penalty is from schema tokens vs selection difficulty), and deeper analysis with model explanation techniques to understand why these effects occur, not just that they occur.

Publishing

Two arXiv preprints in preparation. The first argues that representation matters more than retrieval (O4, O12 as the designated primary hypothesis pair). The second argues that aggregate benchmarks hide task-level structure (O1, O10, O8). Both frame the current study as exploratory with a confirmatory design for follow-up.

The benchmark harness, task definitions, ground truth, and scoring code will publish as a community artifact for future comparison.