Beyond the Benchmark: Dr. Muhammad Waseem on What AI Is Actually Changing in Software Engineering

"I think we are already past the stage where the most interesting discussion is about which model performs slightly better on a benchmark or which tool generates code faster."
The picture looks different depending on whether you are in industry or academia, though the two are starting to converge. Companies have already seen that generative AI can write code and support development tasks. The question now is whether these systems can be trusted, how far their autonomy should go, and how they fit into real development processes without introducing new risks. From a research perspective, those are exactly the questions that matter most.
"The genuine new capability is not only faster code generation, but the possibility of goal-driven, multi-step software engineering support. In that sense, generative AI is starting to influence not only software products, but also software processes and development methods."
Concepts like "Vibe Coding," where developers guide AI toward an outcome through prompting rather than writing every line of code themselves, signal that software development is not just picking up new tools but rethinking how humans and AI work together. The harder questions follow naturally.
"As systems become more autonomous, questions of reliability, transparency, accountability, and validation become significantly more complex. These are not minor implementation concerns, but foundational software engineering questions."
From Demo to Practice
Waseem argues that the most important research happens in the gap between an impressive demonstration and a production-ready tool that works day-to-day inside a real organization.
"A quickly built demo using generative AI can show that a model is capable of doing something impressive in a controlled setting. However, turning that demo into a production system for real-world use is a very different challenge. In practice, a production system must be reliable, robust, secure, maintainable, and truly useful within an actual project environment."
Bridging that gap is the main goal at GPT-Lab, founded in 2023 by Professor Pekka Abrahamsson. The lab builds systems alongside real organizations and tests them with real users.
"Very often, what matters most is not just model capability, but whether the solution fits the real context in which it is used."
An example is a tender intelligence system that helps public sector procurement teams analyze tender documents, extract compliance requirements, evaluate proposals, and generate structured decision reports. Work that previously took days can be completed far more efficiently.
The lab also supports companies using AI for rapid prototyping, letting them test ideas and prove value before committing to full development.
"When it stays at the level of a demo, those surrounding questions of integration, governance, usability, and long-term value are still unanswered."
Photo: Jonne Renvall
The Technical Core
A central focus of the lab involves multi-agent systems, where several independent AI “agents” work together on tasks like analyzing requirements or generating code. Each agent handles a piece of the problem, uses external tools to look up information, and passes results along to the next.
"Building a basic multi-agent system today is not necessarily difficult. We already have quite mature frameworks like LangGraph, CrewAI, and others. The real challenge begins when you try to make these systems work reliably in practice."
The lab has identified three main hurdles. The first is coordination: a small error from one agent can ripple through the whole system. The second is memory, since these systems can struggle to retain context across a long workflow. The third is validation: when work is spread across several agents interacting with live data, verifying that the right thing was done becomes difficult.
Waseem notes, however, that these are limitations of current technology rather than permanent barriers. Context windows are expanding, memory mechanisms are evolving, and tool integration is becoming more robust, so all three fronts are improving steadily. The deeper finding is that structure matters more than raw model power.
"How tasks are decomposed, how agents communicate, where human oversight is introduced, and how outputs are validated: these design decisions often matter more than the individual model being used."
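Those design decisions can be made concrete with a minimal sketch. The snippet below is a hypothetical plain-Python pipeline, not GPT-Lab's code and not any framework's API: each "agent" is just a function, and the points Waseem names are visible as structure, namely the decomposition into steps, the hand-off of shared state, a validation gate that stops an error before it ripples downstream, and a hook where a human would approve the result.

```python
# A hypothetical sequential multi-agent pipeline. All names are illustrative
# assumptions; real systems would call an LLM inside each agent function.

def requirements_agent(task: str) -> dict:
    # Decomposition: turn a goal into a structured requirements list.
    return {"task": task,
            "requirements": [f"parse {task}", f"summarize {task}"]}

def codegen_agent(state: dict) -> dict:
    # Hand-off: this agent consumes the previous agent's output.
    state["artifacts"] = [f"module for: {r}" for r in state["requirements"]]
    return state

def validate(state: dict) -> bool:
    # Validation gate: check the hand-off before work propagates further.
    reqs = state.get("requirements", [])
    return bool(reqs) and len(state.get("artifacts", [])) == len(reqs)

def human_review(state: dict) -> dict:
    # Human-oversight hook: a real system would pause here for approval.
    state["approved"] = True
    return state

def run_pipeline(task: str) -> dict:
    state = requirements_agent(task)   # shared state acts as the "memory"
    state = codegen_agent(state)
    if not validate(state):            # fail fast instead of rippling onward
        raise ValueError("validation failed; halting pipeline")
    return human_review(state)
```

The point of the sketch is that reliability lives in the wiring: swapping in a stronger model changes none of these seams, while moving the validation gate or the review hook changes the system's behavior substantially.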
Research Meets Reality
GPT-Lab works with public bodies, small and medium businesses, and larger companies across Finland through both direct collaboration and public funding aimed at supporting responsible AI adoption. Working across these contexts reveals how strongly outcomes depend on the specific environment in which systems are used.
"What is common across all of these is that the work is anchored in real problems. We are not starting from technology and asking where it might fit."
The lab's role, as Waseem describes it, is to bridge exploration and validation by prototyping ideas quickly while observing how they hold up in practice over time, connecting practical impact with rigorous, evidence-based research.
What the Next Five Years Require
Looking ahead, Waseem expects the most significant changes in software engineering to come from how work is structured rather than from any single technological advance. The role of the engineer shifts toward designing systems that include AI components and ensuring that these systems behave as intended.
"Software engineering will become less about writing everything yourself, and more about designing, guiding, and validating systems that include AI as an active participant."
This points to a growing need for what he calls AI fluency: knowing how to frame problems clearly, delegate tasks to AI systems effectively, and evaluate the outputs critically. These are not exclusively technical skills, but they will matter for anyone working in a software-related role.
Focus on understanding how systems are designed and validated, not just how code is written. The tools will continue to evolve quickly, but the ability to work effectively with them, and to build systems that can be trusted in practice, will be the more lasting skill.
Muhammad Waseem
Postdoctoral Research Fellow
Faculty of Information Technology and Communication Sciences | Computing Sciences
Vice head of GPT-Lab Tampere
https://orcid.org/0000-0001-7488-2577
He is investigating the application of Generative AI (GenAI) in various areas of software engineering, such as requirements, design, development, testing, and deployment. In parallel, he is also exploring Quantum Software Engineering, as well as Multi-Cloud and Distributed Architectures.
Photo: Jonne Renvall







