TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation#
- Status:
Accepted
- Venue:
ICSOFT 2026 - International Conference on Software Technologies, Porto, Portugal
- Authors:
AI4FM, Loyola University Chicago
- DOI / PDF:
Available upon publication
Abstract#
TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model.
Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model’s correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5× the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint, preventing the trivial-property failure mode.
Key Results#
Metric |
Result |
|---|---|
Syntactic correctness — best public baseline |
26.6% |
Semantic correctness — best public baseline (8.6% untuned) |
8.6% |
TLA-Prover Gold (passes TLC) pass@1 |
30% |
TLA-Prover Diamond pass@1 |
30% |
DPO variant Diamond pass@1 |
20% |
Improvement over untuned baseline |
~3.5× |
Training Approach#
Supervised Fine-Tuning (SFT) The model is first trained on a curated set of verified TLA+ specifications to establish a strong syntactic and semantic foundation.
Group-Relative Policy Optimization (GRPO) In the GRPO stage, the model generates multiple candidate specifications per prompt. Rejected outputs are used as repair targets, teaching the model to iteratively correct its own mistakes. TLC model-checker output serves as the reward signal — no learned reward model is involved.
Direct Preference Optimization (DPO) ablation A DPO variant is trained from the same SFT checkpoint to compare preference-optimization strategies. It reaches 20% at Diamond, versus 30% for the GRPO variant.
Grading Tiers#
Tier |
Criterion |
|---|---|
Bronze |
Specification parses (SANY) |
Silver |
No parser warnings |
Gold |
Passes TLC model checker |
Diamond |
Gold, plus TLC detects a violation when the correctness property is deliberately mutated |
Diamond prevents the trivial-property failure mode: a specification whose correctness property is always true passes TLC by vacuity and contributes nothing meaningful. The mutation test catches this. Gold and Diamond coincide at every checkpoint in TLA-Prover, indicating the model never falls back to trivial properties.
Relationship to Prior Work#
This paper extends the findings of our evaluation study Can LLMs Write Correct TLA+ Specifications?, which established the 8.6% semantic correctness baseline across 25 LLMs. TLA-Prover demonstrates that targeted fine-tuning with verifiable reward signals can close much of that gap.
Citation#
Note
The paper has been accepted at ICSOFT 2026. Full citation will be available upon publication.