Can LLMs Write Correct TLA+ Specifications? Our New Evaluation Study
We have submitted a new paper evaluating whether large language models can generate semantically correct TLA+ specifications from natural language. We evaluated 30 LLMs across eight families on 205 TLA+ specifications. The best semantic correctness any model achieved was only 8.6%, and model size did not predict quality.
Note
This paper is currently under submission.