Date: 2024-02-28

Google Security Engineering and the Carnegie Mellon University Software Engineering Institute (SEI, in collaboration with OpenAI) have sorted through the hype and done serious thinking and formal research on two fronts: developing “better approaches for evaluating LLM cybersecurity” and “AI-powered patching: the future of automated vulnerability fixes.” Both offer useful formative framing of the challenges ahead as we collectively sort out the implications of the convergence of generative AI and future cyber capabilities (offensive and defensive).

The evaluation research comes from Jeff Gennari, Shing-hon Lau, and Samuel J. Perl of the SEI; Joel Parish and Girish Sastry of OpenAI also contributed to the effort.

From the SEI post summarizing the research:

Large language models (LLMs) have shown a remarkable ability to ingest, synthesize, and summarize knowledge while simultaneously demonstrating significant limitations in completing real-world tasks. One notable domain that presents both opportunities and risks for leveraging LLMs is cybersecurity. LLMs could empower cybersecurity experts to be more efficient or effective at preventing and stopping attacks. However, adversaries could also use generative artificial intelligence (AI) technologies in kind. We have already seen evidence of actors using LLMs to aid in cyber intrusion activities (e.g., WormGPT, FraudGPT, etc.). Such misuse raises many important cybersecurity-capability-related questions including:

Can an LLM like GPT-4 write novel malware?

Will LLMs become critical components of large-scale cyber-attacks?

Can we trust LLMs to provide cybersecurity experts with reliable information?

Recently, a team of researchers in the SEI CERT Division worked with OpenAI to develop better approaches for evaluating LLM cybersecurity capabilities. This SEI Blog post, excerpted from a recently published paper that we coauthored with OpenAI researchers Joel Parish and Girish Sastry, summarizes 14 recommendations to help assessors accurately evaluate LLM cybersecurity capabilities.

The Challenge of Using LLMs for Cybersecurity Tasks

Without a clear understanding of how an LLM performs on applied and realistic cybersecurity tasks, decision makers lack the information they need to assess opportunities and risks. We contend that practical, applied, and comprehensive evaluations are required to assess cybersecurity capabilities. Realistic evaluations reflect the complex nature of cybersecurity and provide a more complete picture of cybersecurity capabilities.

Recommendations for Cybersecurity Evaluations

To properly judge the risks and appropriateness of using LLMs for cybersecurity tasks, evaluators need to carefully consider the design, implementation, and interpretation of their assessments. Favoring tests based on practical and applied cybersecurity knowledge is preferred to general fact-based assessments. However, creating these types of assessments can be a formidable task that encompasses infrastructure, task/question design, and data collection. The following list of recommendations is meant to help assessors craft meaningful and actionable evaluations that accurately capture LLM cybersecurity capabilities. The expanded list of recommendations is outlined in our paper.

Define the real-world task that you would like your evaluation to capture.

Starting with a clear definition of the task helps clarify decisions about complexity and assessment. The following recommendations are meant to help define real-world tasks:

Consider how humans do it: Starting from first principles, think about how the task you would like to evaluate is accomplished by humans, and write down the steps involved. This process will help clarify the task.

Use caution with existing datasets: Current evaluations within the cybersecurity domain have largely leveraged existing datasets, which can influence the type and quality of tasks evaluated.

Define tasks based on intended use: Carefully consider whether you are interested in autonomy or human-machine teaming when planning evaluations. This distinction will have significant implications for the type of assessment that you conduct.

Represent tasks appropriately.

Most tasks worth evaluating in cybersecurity are too nuanced or complex to be represented with simple queries, such as multiple-choice questions. Rather, queries need to reflect the nature of the task without being unintentionally or artificially limiting. The following guidelines ensure evaluations incorporate the complexity of the task:

Define an appropriate scope: While subtasks of complex tasks are usually easier to represent and measure, their performance does not always correlate with the larger task. Ensure that you do not represent the real-world task with a narrow subtask.

Develop an infrastructure to support the evaluation: Practical and applied tests will generally require significant infrastructure support, particularly in supporting interactivity between the LLM and the test environment.

Incorporate affordances to humans where appropriate: Ensure your assessment mirrors real-world affordances and accommodations given to humans.

Avoid affordances to humans where inappropriate: Evaluations of humans in higher education and professional-certification settings may ignore real-world complexity.

Make your evaluation robust.

Use care when designing evaluations to avoid spurious results. Assessors should consider the following guidelines when creating assessments:

Use preregistration: Consider how you will grade the task ahead of time.

Apply realistic perturbations to inputs: Changing the wording, ordering, or names in a question would have minimal effects on a human but can result in dramatic shifts in LLM performance. These changes must be accounted for in assessment design.

Beware of training data contamination: LLMs are frequently trained on large corpora, including news of vulnerability feeds, Common Vulnerabilities and Exposures (CVE) websites, and code and online discussions of security. These data may make some tasks artificially easy for the LLM.
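The perturbation guideline above can be illustrated with a minimal Python sketch. It is not taken from the SEI/OpenAI paper: the function names (`shuffle_options`, `rename_identifiers`, `make_variants`) and the toy question format are assumptions made for illustration. The idea is to generate several surface-level variants of one test item (reordered options, renamed identifiers) while tracking the correct answer, so that an assessor can check whether a model's score is stable across variants.

```python
import random


def shuffle_options(options, answer_index, rng):
    """Reorder the options and return the new position of the correct answer."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)


def rename_identifiers(text, renames):
    """Apply simple name substitutions (hypothetical perturbation)."""
    for old, new in renames.items():
        text = text.replace(old, new)
    return text


def make_variants(stem, options, answer_index, renames, n, seed=0):
    """Build n perturbed copies of one item; the seed keeps runs repeatable."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        opts, ans = shuffle_options(options, answer_index, rng)
        variants.append({
            "stem": rename_identifiers(stem, renames),
            "options": [rename_identifiers(o, renames) for o in opts],
            "answer_index": ans,
        })
    return variants
```

A human grader would treat every variant as the same question. Large score differences between variants suggest the model is reacting to surface form and not to the underlying task.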

Frame results appropriately.

Evaluations with a sound methodology can still misleadingly frame results. Consider the following guidelines when interpreting results:

Avoid overgeneralized claims: Avoid making sweeping claims about capabilities from the task or subtask evaluated. For example, strong model performance in an evaluation measuring vulnerability identification in a single function does not mean that a model is good at discovering vulnerabilities in a real-world web application where resources, such as access to source code, may be restricted.

Estimate best-case and worst-case performance: LLMs may have wide variations in evaluation performance due to different prompting strategies or because they use additional test-time compute techniques (e.g., Chain-of-Thought prompting). Best/worst case scenarios will help constrain the range of outcomes.

Be careful with model selection bias: Any conclusions drawn from evaluations should be put into the proper context. If possible, run tests on a variety of contemporary models, or qualify claims appropriately.

Clarify whether you are evaluating risk or evaluating capabilities: A judgment about the risk of models requires a threat model. In general, however, the capability profile of the model is only one source of uncertainty about the risk. Task-based evaluations can help understand the capability of the model.
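As a rough illustration of the best-case and worst-case guideline, the sketch below reports the spread of pass rates across prompting strategies and not a single headline number. The function name `summarize_performance` and the input shape (strategy name mapped to a list of per-task pass/fail results) are assumptions for this example, not something defined in the paper.

```python
def summarize_performance(results):
    """Summarize pass rates per prompting strategy.

    `results` maps a strategy name to a list of booleans, one per task.
    Returns every rate plus the best and worst strategy, which together
    bound the range of outcomes observed in the evaluation.
    """
    rates = {
        name: sum(outcomes) / len(outcomes)
        for name, outcomes in results.items()
        if outcomes  # skip strategies with no recorded tasks
    }
    if not rates:
        raise ValueError("no results to summarize")
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    return {
        "rates": rates,
        "best": (best, rates[best]),
        "worst": (worst, rates[worst]),
    }
```

Reporting both ends of the range makes it harder to over-read a result that depends on one favorable prompt.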

For further insights and recommendations from the SEI/OpenAI collaborators, find the full research paper at: Considerations for Evaluating Large Language Models for Cybersecurity Tasks by Jeffrey Gennari, Shing-hon Lau, Samuel Perl, Joel Parish (OpenAI), and Girish Sastry (OpenAI).

Jan Keller and Jan Nowakowski from Google Security Engineering have released a technical report on automating vulnerability fixes with generative AI, covering both the possibilities and the pitfalls.

As AI continues to advance at rapid speed, so has its ability to unearth hidden security vulnerabilities in all types of software. Every bug uncovered is an opportunity to patch and strengthen code—but as detection continues to improve, we need to be prepared with new automated solutions that bolster our ability to fix those bugs. That’s why our Secure AI Framework (SAIF) includes a fundamental pillar addressing the need to “automate defenses to keep pace with new and existing threats.” This paper shares lessons from our experience leveraging AI to scale our ability to fix bugs, specifically those found by sanitizers in C/C++, Java, and Go code.

By automating a pipeline to prompt Large Language Models (LLMs) to generate code fixes for human review, we have harnessed our Gemini model to successfully fix 15% of sanitizer bugs discovered during unit tests, resulting in hundreds of bugs patched. Given the large number of sanitizer bugs found each year, this seemingly modest success rate will with time save significant engineering effort. We expect this success rate to continually improve and anticipate that LLMs can be used to fix bugs in various languages across the software development lifecycle.

From the paper:

An LLM-powered pipeline

An end-to-end solution needs a pipeline to:

1. Find vulnerabilities

2. Isolate and reproduce them

3. Use LLMs to create fixes

4. Test the fixes

5. Surface the best fix for human review and submission
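The five stages listed above can be sketched as a small Python skeleton. This is an illustrative outline only, not Google's implementation: every name here (`Bug`, `CandidateFix`, `run_pipeline`, and the stage callables passed in) is hypothetical, and each stage is supplied by the caller so the sketch makes no claim about how the real system finds bugs, prompts the model, or ranks patches.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Bug:
    bug_id: str
    report: str  # e.g. the sanitizer output for a failing test


@dataclass
class CandidateFix:
    bug_id: str
    patch: str


def run_pipeline(
    find_bugs: Callable[[], Iterable[Bug]],
    reproduce: Callable[[Bug], bool],
    generate_fixes: Callable[[Bug], List[CandidateFix]],
    passes_tests: Callable[[Bug, CandidateFix], bool],
    rank: Callable[[List[CandidateFix]], CandidateFix],
) -> List[CandidateFix]:
    """Return one proposed fix per bug, ready for human review."""
    for_review = []
    for bug in find_bugs():                      # 1. find vulnerabilities
        if not reproduce(bug):                   # 2. isolate and reproduce
            continue
        candidates = generate_fixes(bug)         # 3. ask the LLM for fixes
        passing = [c for c in candidates
                   if passes_tests(bug, c)]      # 4. test the fixes
        if passing:
            for_review.append(rank(passing))     # 5. surface the best fix
    return for_review
```

Nothing in this sketch submits code automatically. As in the steps above, the output is a list of candidates for a human reviewer.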

Results

At the time of writing, we’ve accepted several hundred of these LLM-generated commits into Google’s codebase, with another several hundred in the process of being validated and submitted.

Instead of a software engineer spending an average of two hours to create each of these commits, the necessary patches are now automatically created in seconds. Perhaps unsurprisingly, we’ve seen the best success rate in fixing errors stemming from the use of an uninitialized value, a relatively simple fix. But the LLM-generated fixes didn’t target only simple errors. They also, for example, effectively initialized matrices and images using the appropriate library methods. In order of the highest fix success rate, the most commonly fixed sanitizer errors fell into four types:

1. Using uninitialized values

2. Data races

3. Buffer overflows

4. Temporal memory errors (e.g., use-after-scope)

Though a 15% success rate might sound low, many thousands of new bugs are found each year, and automating the fixes for even a small fraction of them saves months of engineering effort, meaning that potential security vulnerabilities are closed even faster. We expect improvements to continue pushing that number higher.
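The reasoning above is simple arithmetic, and a short Python sketch makes the inputs explicit. The 15% fix rate and the two-hour average per commit come from the report; the yearly bug count does not, since the report says only "many thousands", so the figure in the usage example is a placeholder assumption, and the function name `hours_saved` is hypothetical.

```python
def hours_saved(bugs_per_year, fix_rate=0.15, hours_per_fix=2.0):
    """Estimate engineer-hours of patch authoring avoided per year.

    fix_rate and hours_per_fix default to the figures quoted in the
    report; bugs_per_year must be supplied by the reader.
    """
    return bugs_per_year * fix_rate * hours_per_fix


if __name__ == "__main__":
    # 5000 is an illustrative assumption, NOT a number from the report.
    print(hours_saved(5000))
```

The estimate covers only the time to write a patch. Human review of each generated fix still takes time and is not subtracted here.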

Looking ahead

While these initial results are promising, this is just a first step toward a future of AI-powered automated bug patching. We’re currently working on expanding capabilities to include multi-file fixes and to integrate multiple bug sources into the pipeline.

