Beyond the Benchmark: Dr. Muhammad Waseem on What AI Is Actually Changing in Software Engineering

"I think we are already past the stage where the most interesting discussion is about which model performs slightly better on a benchmark or which tool generates code faster."
The picture looks different depending on whether you are in industry or academia, though the two are starting to converge. Companies have already seen that generative AI can write code and support development tasks. The question now is whether these systems can be trusted, how far their autonomy should go, and how they fit into real development processes without introducing new risks. From a research perspective, those are exactly the questions that matter most.
"The genuine new capability is not only faster code generation, but the possibility of goal-driven, multi-step software engineering support. In that sense, generative AI is starting to influence not only software products, but also software processes and development methods."
Concepts like "Vibe Coding," where developers guide AI toward an outcome through prompting rather than writing every line of code themselves, signal that software development is not just picking up new tools but rethinking how humans and AI work together. The harder questions follow naturally.
"As systems become more autonomous, questions of reliability, transparency, accountability, and validation become significantly more complex. These are not minor implementation concerns, but foundational software engineering questions."
From Demo to Practice
Waseem argues that the most important research happens in the gap between an impressive demonstration and a production-ready tool that works day-to-day inside a real organization.
"A quickly built demo using generative AI can show that a model is capable of doing something impressive in a controlled setting. However, turning that demo into a production system for real-world use is a very different challenge. In practice, a production system must be reliable, robust, secure, maintainable, and truly useful within an actual project environment."
Bridging that gap is the main goal at GPT-Lab, founded in 2023 by Professor Pekka Abrahamsson. The lab builds systems alongside real organizations and tests them with real users.
"Very often, what matters most is not just model capability, but whether the solution fits the real context in which it is used."
An example is a tender intelligence system that helps public sector procurement teams analyze tender documents, extract compliance requirements, evaluate proposals, and generate structured decision reports. Work that previously took days can be completed far more efficiently.
The lab also supports companies using AI for rapid prototyping, letting them test ideas and prove value before committing to full development.
"When it stays at the level of a demo, those surrounding questions of integration, governance, usability, and long-term value are still unanswered."
Photo: Jonne Renvall
The Technical Core
A central focus of the lab involves multi-agent systems, where several independent AI “agents” work together on tasks like analyzing requirements or generating code. Each agent handles a piece of the problem, uses external tools to look up information, and passes results along to the next.
"Building a basic multi-agent system today is not necessarily difficult. We already have quite mature frameworks like LangGraph, CrewAI, and others. The real challenge begins when you try to make these systems work reliably in practice."
The lab has identified three main hurdles. The first is coordination: a small error from one agent can ripple through the whole system. The second is memory, since these systems can struggle to retain context across a long workflow. The third is validation: when work is spread across several agents interacting with live data, verifying that the right thing was done becomes difficult.
Waseem notes, however, that these are limitations of current technology rather than permanent barriers. Context windows are expanding, memory mechanisms are evolving, and tool integration is becoming more robust, so all three fronts are improving steadily. The deeper finding is that structure matters more than raw model power.
"How tasks are decomposed, how agents communicate, where human oversight is introduced, and how outputs are validated: these design decisions often matter more than the individual model being used."
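Those design decisions can be made concrete with a minimal sketch. The snippet below is a hypothetical plain-Python pipeline, not GPT-Lab's code and not any framework's API: each "agent" is just a function, and the points Waseem names are visible as structure, namely the decomposition into steps, the hand-off of shared state, a validation gate that stops an error before it ripples downstream, and a hook where a human would approve the result.

```python
# A hypothetical sequential multi-agent pipeline. All names are illustrative
# assumptions; real systems would call an LLM inside each agent function.

def requirements_agent(task: str) -> dict:
    # Decomposition: turn a goal into a structured requirements list.
    return {"task": task,
            "requirements": [f"parse {task}", f"summarize {task}"]}

def codegen_agent(state: dict) -> dict:
    # Hand-off: this agent consumes the previous agent's output.
    state["artifacts"] = [f"module for: {r}" for r in state["requirements"]]
    return state

def validate(state: dict) -> bool:
    # Validation gate: check the hand-off before work propagates further.
    reqs = state.get("requirements", [])
    return bool(reqs) and len(state.get("artifacts", [])) == len(reqs)

def human_review(state: dict) -> dict:
    # Human-oversight hook: a real system would pause here for approval.
    state["approved"] = True
    return state

def run_pipeline(task: str) -> dict:
    state = requirements_agent(task)   # shared state acts as the "memory"
    state = codegen_agent(state)
    if not validate(state):            # fail fast instead of rippling onward
        raise ValueError("validation failed; halting pipeline")
    return human_review(state)
```

The point of the sketch is that reliability lives in the wiring: swapping in a stronger model changes none of these seams, while moving the validation gate or the review hook changes the system's behavior substantially.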
Research Meets Reality
GPT-Lab works with public bodies, small and medium businesses, and larger companies across Finland through both direct collaboration and public funding aimed at supporting responsible AI adoption. Working across these contexts reveals how strongly outcomes depend on the specific environment in which systems are used.
"What is common across all of these is that the work is anchored in real problems. We are not starting from technology and asking where it might fit."
The lab's role, as Waseem describes it, is to bridge exploration and validation by prototyping ideas quickly while observing how they hold up in practice over time, connecting practical impact with rigorous, evidence-based research.
What the Next Five Years Require
Looking ahead, Waseem expects the most significant changes in software engineering to come from how work is structured rather than from any single technological advance. The role of the engineer shifts toward designing systems that include AI components and ensuring that these systems behave as intended.
"Software engineering will become less about writing everything yourself, and more about designing, guiding, and validating systems that include AI as an active participant."
This points to a growing need for what he calls AI fluency: knowing how to frame problems clearly, delegate tasks to AI systems effectively, and evaluate the outputs critically. These are not exclusively technical skills, but they will matter for anyone working in a software-related role.
Focus on understanding how systems are designed and validated, not just how code is written. The tools will continue to evolve quickly, but the ability to work effectively with them, and to build systems that can be trusted in practice, will be the more lasting skill.
Muhammad Waseem
Postdoctoral Research Fellow
Faculty of Information Technology and Communication Sciences | Computing Sciences
Vice head of GPT-Lab Tampere
https://orcid.org/0000-0001-7488-2577
He is investigating the application of Generative AI (GenAI) in various areas of software engineering, such as requirements, design, development, testing, and deployment. In parallel, he is also exploring Quantum Software Engineering, as well as Multi-Cloud and Distributed Architectures.
Photo: Jonne Renvall







