Adoption of generative AI tools and large language models (LLMs) since OpenAI introduced its ChatGPT chatbot in late November 2022 has only accelerated, reaching deep into organizations of all shapes, sizes, and sectors, and software developers have not been immune to the siren call.
LLMs and Software Verification
A group of computer scientists led by the University of Massachusetts Amherst earlier this month said they were directing the power of generative AI and LLMs to address the gnarly challenge of verifying code to help prevent vulnerabilities in software.
Specifically, the team focused on using AI to develop a method for automatically generating whole proofs to verify software code, a painstaking, expensive, and time-consuming process that has typically been done manually. An even more difficult process is machine checking: writing a mathematical proof that the code does what's expected and then using a theorem prover to confirm the proof is correct.
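To make "machine checking" concrete, here is a toy example in the Lean proof assistant (an illustration only; Baldur itself targets Isabelle/HOL, and this theorem is not from the paper). The key property is that the checker verifies every step, so an incorrect proof is rejected rather than silently accepted:

```lean
-- A trivially small machine-checked proof: addition of natural
-- numbers is commutative. If the term on the right did not actually
-- prove the statement, Lean would reject the file at check time.
theorem add_comm_example (m n : Nat) : m + n = n + m :=
  Nat.add_comm m n
```

Proofs about real software are far larger, which is why writing them by hand is so expensive and why automating their generation is attractive.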
As such, the set of systems that have actually been formally verified is fairly small, according to Yuriy Brun, professor in the Manning College of Information and Computer Sciences at UMass Amherst and the paper's senior author.
Buggy Software Shouldn’t Just Be Accepted
Doing so will help address an even larger problem: the presence of flaws in software that can be annoying or — if exploited by cyber-attackers or present in complex systems whose failure could have broad negative impacts — dangerous.
“Software is an important part of our everyday lives,” Brun said. “You can’t do anything. You can’t drive a car, you can’t ride an elevator, without using software. Unfortunately, software today, as a rule, is buggy. Any software you buy in a store, we almost expect to have some bugs. It’s just it’s too hard a problem to solve, so there’s a bunch of different ways to try to improve quality of software.”
Building Baldur
The UMass Amherst researchers looked to LLMs as a possible solution for automatically generating proofs. However, even that posed challenges, the biggest one being that the models aren't always correct. Rather than crashing — and thus letting the developer know something is wrong — they "fail silently," producing an incorrect answer but presenting it as though it's correct. Failing silently often is the worst thing that can happen, Brun said.
Enter Baldur, a new AI-based method for developing proofs the UMass Amherst team created. The researchers used Google's Minerva LLM, which is trained on a large amount of natural-language text. It was then further trained on 118GB of mathematical scientific papers and web pages containing mathematical expressions, and then trained even more on Isabelle/HOL, a language in which mathematical proofs are written.


Enter Thor
Brun and his team — which also included Markus Rabe, who worked at Google at the time, and Talia Ringer, an assistant professor at the University of Illinois Urbana-Champaign — looked at Thor, a framework for integrating language models and automated theorem provers. By itself, Thor could produce proofs 57% of the time, he said.
Combining it with Baldur — Thor’s brother in Norse mythology — they were able to create proofs 65.7% of the time. The two methods complemented each other.
Thor “uses a large-language model to try to predict what the next likely step is of the proof, but it also uses these things called ‘hammers,’” Brun said. “Hammers are these mathematical tools that say, ‘I know a bunch of mathematical tactics. Let me try them. Let me try this, try this, try this.’ It’s like hammering at the problem, just trying things differently and seeing if anything works. It was trying all of these things at once.”
Still More to Do
Brun admitted that the degree of error is still large, but said Baldur still represents the most effective and efficient way to verify that software code is correct. AI techniques are continuing to improve, so he expects Baldur’s capabilities will as well.
Going forward, the researchers are looking to improve the 65.7% number by fine-tuning the data the LLM is trained on. A hurdle for verification is that, right now, there isn’t a lot of data out there, so creating a dataset isn’t easy. They’re working on a dataset that will enable the fine-tuning of the model.
They also want to enable developers to give the model feedback, which will further help the model to grow as it works to generate a proof.
“The developer says, ‘Okay, so it worked great,’” Brun said. “But if it didn’t work, the developer can often look [and say], ‘I see that you tried induction here, but you’re trying it on the wrong thing.’ It can give some feedback to the model, and then the model can try again. This iterative approach is likely to really simplify the developer’s job but enable the model to prove things that it can’t on its own.”
It creates a semi-automated approach.