Conference Paper


User Centric Evaluation of Code Generation Tools

Abstract

With the rapid advance of machine learning (ML) technology, large language models (LLMs) are increasingly explored as an intelligent tool to generate program code from natural language specifications. However, existing evaluations of LLMs have focused on their capabilities in comparison with humans. It is desirable to evaluate their usability when deciding on whether to use a LLM in software production. This paper proposes a user centric method for this purpose. It includes metadata in the test cases of a benchmark to describe their usages, conducts testing in a multi-attempt process that mimics the uses of LLMs, measures LLM generated solutions on a set of quality attributes that reflect usability, and evaluates the performance based on user experiences in the uses of LLMs as a tool. The paper also reports a case study with the method in the evaluation of ChatGPT’s usability as a code generation tool for the R programming language. Our experiments demonstrated that ChatGPT is highly useful for generating R program code although it may fail on hard programming tasks. The user experiences are good with overall average number of attempts being 1.61 and the average time of completion being 47.02 seconds. Our experiments also found that the weakest aspect of usability is conciseness, which has a score of 3.80 out of 5.



The fulltext files of this resource are currently embargoed.
Embargo end: 2025-09-25

Authors

Miah, Tanha
Zhu, Hong

Oxford Brookes departments

School of Engineering, Computing and Mathematics

Dates

Year of publication: 2024
Date of RADAR deposit: 2024-06-19



“© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.”


Related resources

This RADAR resource is the Accepted Manuscript of [arXiv preprint, version 3] User Centric Evaluation of Code Generation Tools
This RADAR resource is the Accepted Manuscript of User Centric Evaluation of Code Generation Tools

Details

  • Owner: Daniel Croft (removed)
  • Collection: Outputs
  • Version: 1 (show all)
  • Status: Live
  • Views (since Sept 2022): 351