How Software Testing Can Help Agencies Optimize Their Generative AI Outputs

Artificial intelligence (AI) is all around us: on our phones, in our cars, and embedded in the products and services we use every day. Now, generative AI (GenAI) is unleashing more AI in more places — including state and local government.

Big Tech leaders from Facebook to Google to Microsoft continue to develop solutions based on the large language models (LLMs) behind GenAI. These tools address numerous use cases, including letting users ask complex questions about a dataset in everyday language and receive detailed, accurate answers.

That capability holds great potential for state and local government agencies. Imagine a webpage or app where residents can quickly get answers to a wide range of routine questions: how to comply with ordinances, apply for permits or licenses, access services, and more. Constituents would benefit from more satisfying experiences. Agencies would free up staff to focus on more complex issues.

But the promise of GenAI comes with risks. One is that AI outputs, especially in this early stage, could be biased, treating different groups differently. Another is that even as AI models improve, some AI responses could become less accurate over time.

The solution? As agencies take advantage of GenAI, they also need to invest in software testing tools to ensure that GenAI systems continually serve constituencies equitably and effectively.

Weeding Out Bias in GenAI Outputs

It’s well-recognized that bias can appear in AI outputs, including GenAI content. For instance, some AI systems have assigned women lower credit-card limits than their husbands. Bias can appear if an AI model is trained on a dataset that includes bias or that isn’t representative of the group it serves. Agencies have a legal requirement to ensure that systems don’t encourage or enforce bias based on protected attributes such as age, gender and race.

It’s difficult to test whether a GenAI model contains bias, because LLMs are usually trained on huge sets of information gathered from across the internet and other sources. But you can test whether built-in bias is affecting your GenAI outputs.

To do so, you’ll need the expertise of a data scientist, who can identify the best way to represent the data, and a quality assurance (QA) professional, who can put the system through its paces. It’s a data-intensive process, so it helps to use a software-testing tool that can automate as much of the process as possible.

Start by partitioning the data by attributes such as age, gender and race. Then use a testing tool to create scenarios using different combinations of attributes. For example, you could generate outputs using age and gender, then add race and see if you get a different output, and so on, cycling through attributes.
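As a loose illustration, here is a minimal sketch in Python of how a test harness might cycle through attribute combinations. Everything here is hypothetical: `query_model` is a stand-in for however your GenAI system is actually invoked, and the attributes, values and prompt template are illustrative only.

```python
import itertools

# Hypothetical stand-in for your GenAI system's API or UI driver.
def query_model(prompt: str) -> str:
    raise NotImplementedError("Connect this to your chatbot or API.")

# Protected attributes to vary; values are illustrative only.
ATTRIBUTES = {
    "age": ["25", "70"],
    "race": ["white", "Black"],
    "gender": ["male", "female"],
}

# A scenario template that is identical except for the attributes.
TEMPLATE = (
    "A {age}-year-old {race} {gender} resident asks: "
    "What documents do I need to apply for a housing permit?"
)

def generate_scenarios():
    """Yield one attribute combination per test scenario."""
    keys = list(ATTRIBUTES)
    for values in itertools.product(*(ATTRIBUTES[k] for k in keys)):
        yield dict(zip(keys, values))

def run_bias_check():
    """Collect one output per scenario for side-by-side comparison.

    If answers differ materially when only a protected attribute
    changes, that is a signal of potential bias worth investigating.
    """
    results = {}
    for scenario in generate_scenarios():
        prompt = TEMPLATE.format(**scenario)
        results[tuple(scenario.values())] = query_model(prompt)
    return results
```

In practice, a data scientist would decide which attribute combinations and prompts are meaningful to compare; the point of the harness is simply to make those comparisons systematic and repeatable.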

Taking Account of Data Drift

Another issue is that GenAI systems sometimes “hallucinate,” or invent erroneous responses. Imagine the havoc if your agency’s GenAI chatbot provided residents with misinformation — about where they can park, what they can build, what they should do in an emergency, and so on.

The statistical properties of a dataset used to train an AI model can change over time, a process called “data drift.” This can happen because of changes in the data source, changes in laws and policy, and other factors. Sometimes improvements to one aspect of a model degrade other aspects. ChatGPT, for instance, seems to be having mixed results with math. 

Software testing can identify drift in your GenAI outputs. Test your GenAI system regularly, asking the same questions over time and verifying that the answers remain consistent with past responses. A testing tool can automate detection of such changes.
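Here is a minimal sketch of that idea, again with a hypothetical `query_model` hook: record baseline answers once, then re-ask the same questions on a schedule and flag responses that diverge. Exact comparison is too brittle for GenAI, since wording varies between runs, so this sketch uses a simple similarity ratio; a real tool might compare embeddings or use more sophisticated fuzzy matching.

```python
import difflib
import json
from pathlib import Path

BASELINE_FILE = Path("genai_baseline.json")

# Hypothetical hook into your GenAI system.
def query_model(prompt: str) -> str:
    raise NotImplementedError("Connect this to your chatbot or API.")

QUESTIONS = [
    "Where can residents park overnight downtown?",
    "How do I apply for a building permit?",
]

def record_baseline():
    """Capture today's answers as the reference point."""
    baseline = {q: query_model(q) for q in QUESTIONS}
    BASELINE_FILE.write_text(json.dumps(baseline, indent=2))

def check_for_drift(threshold: float = 0.8):
    """Re-ask each question and flag answers that diverge from baseline.

    Anything below `threshold` similarity is flagged for a human
    expert to review -- the "check engine" light, not the diagnosis.
    """
    baseline = json.loads(BASELINE_FILE.read_text())
    flagged = []
    for question, old_answer in baseline.items():
        new_answer = query_model(question)
        similarity = difflib.SequenceMatcher(
            None, old_answer, new_answer
        ).ratio()
        if similarity < threshold:
            flagged.append((question, similarity))
    return flagged
```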

Note that it’s fundamentally impossible to fully automate testing of whether an output is accurate or inaccurate. Testing for drift is more like the “check engine” light in your car: it tells you something might be wrong, but you need an expert to home in on the problem.

Testing Tools for Continual Improvement

Look for a software-testing solution that can test for both bias and drift. It’s also crucial that your tool integrates with a wide range of technologies. GenAI solutions are delivered in a variety of ways — wrapped in a chatbot user interface (UI), say, or accessed through an application programming interface (API). Your tool must be able to work with those UIs and APIs to monitor outputs.
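One way to picture that flexibility is a common adapter interface, sketched below, so the same bias and drift tests can drive a GenAI system through whatever channel it exposes. The `ApiChatClient` here is hypothetical and assumes a simple JSON endpoint with `prompt` and `answer` fields; a real testing tool would ship adapters for specific UIs and APIs.

```python
from abc import ABC, abstractmethod
import json
import urllib.request

class GenAIClient(ABC):
    """Common interface so the same tests can run over any channel."""

    @abstractmethod
    def ask(self, prompt: str) -> str: ...

class ApiChatClient(GenAIClient):
    """Hypothetical adapter for a GenAI system exposed as a JSON API."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def ask(self, prompt: str) -> str:
        payload = json.dumps({"prompt": prompt}).encode("utf-8")
        request = urllib.request.Request(
            self.endpoint,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())["answer"]

# A UI-driven adapter (built on a browser automation library, say)
# would implement the same interface, keeping tests channel-agnostic.
```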

Without testing, your GenAI system might not perform as expected, and its performance might degrade over time. With effective testing, your GenAI outputs can continually improve and work as promised. That should be the goal for your investment in GenAI: to meet the needs of constituents today and to evolve with changing demands to support your mission in the future.


Dave Colwell is Vice President of Artificial Intelligence and Machine Learning for Tricentis, a provider of automated software testing solutions designed to accelerate application delivery and digital transformation. Previously he held positions in innovation, solutions architecture, and quality assurance.

Photo by ThisIsEngineering at pexels.com
