Studying the Quality of Source Code Generated by Different AI Generative Engines: An Empirical Evaluation

IRIS - Institutional Research Information System
IRIS è il sistema di gestione integrata dei dati della ricerca (persone, progetti, pubblicazioni, attività) adottato dall'Università degli Studi dell’Insubria.

IRInSubria - Institutional Repository Insubria
IRInSubria raccoglie, conserva, documenta e dissemina le informazioni sulla produzione scientifica dell'Università degli Studi dell’Insubria anche ai fini della valutazione della ricerca.

The advent of Generative Artificial Intelligence is opening essential questions about whether and when AI will replace human abilities in accomplishing everyday tasks. This issue is particularly true in the domain of software development, where generative AI seems to have strong skills in solving coding problems and generating software source code. In this paper, an empirical evaluation of AI-generated source code is performed: three complex coding problems (selected from the exams for the Java Programming course at the University of Insubria) are prompted to three different Large Language Model (LLM) Engines, and the generated code is evaluated in its correctness and quality by means of human-implemented test suites and quality metrics. The experimentation shows that the three evaluated LLM engines are able to solve the three exams but with the constant supervision of software experts in performing these tasks. Currently, LLM engines need human-expert support to produce running code that is of good quality.

Studying the Quality of Source Code Generated by Different AI Generative Engines: An Empirical Evaluation

Tosi D.^Primo

2024-01-01

Abstract

The advent of Generative Artificial Intelligence is opening essential questions about whether and when AI will replace human abilities in accomplishing everyday tasks. This issue is particularly true in the domain of software development, where generative AI seems to have strong skills in solving coding problems and generating software source code. In this paper, an empirical evaluation of AI-generated source code is performed: three complex coding problems (selected from the exams for the Java Programming course at the University of Insubria) are prompted to three different Large Language Model (LLM) Engines, and the generated code is evaluated in its correctness and quality by means of human-implemented test suites and quality metrics. The experimentation shows that the three evaluated LLM engines are able to solve the three exams but with the constant supervision of software experts in performing these tasks. Currently, LLM engines need human-expert support to produce running code that is of good quality.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2024
			
	Rivista
	
				FUTURE INTERNET
			
	DOI
	
				https://dx.doi.org/10.3390/fi16060188
			
	Codice Web of Science
	
				WOS:001256248100001
			
	Codice Scopus
	
				2-s2.0-85197181777
			
	Parole chiave
	
				generative artificial intelligence; software metrics; software quality; source code generation
			
	Tutti gli autori
	
						Tosi, D.
					
	Appare nelle tipologie:
	
				Articolo su Rivista

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11383/2178594

Attenzione

L'Ateneo sottopone a validazione solo i file PDF allegati

Citazioni

ND

25

12

social impact