Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning
Document Type
Conference Proceeding
Publication Date
4-2025
Abstract
With the advancement of AI technologies, Large Language Models (LLMs) have improved programming automation. However, LLMs often produce code with unnecessary logic, hallucinated content, and errors caused by ambiguous prompts. This research measures the efficiency of Python code generated by the GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo models using metrics such as execution time, memory usage, and maximum memory usage, while maintaining problem-solving correctness. Using the EffiBench dataset on Google’s Vertex AI Workbench with different machine configurations, the study applies the seed parameter for consistency and optimization techniques such as Chain-of-Thought (CoT) prompting and fine-tuning of GPT-4o-Mini. The results show that CoT prompting improves efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo, but not for GPT-4-Turbo. GPT-4o-Mini was selected for fine-tuning because of its stronger results with CoT prompting and its cost-effectiveness, but fine-tuning compromised both accuracy and efficiency. Overall, high-CPU machine configurations, combined with GPT-4o-Mini and CoT prompting, improve the efficiency and correctness of LLM-generated code in resource-intensive scenarios.
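As a rough illustration of the kind of efficiency measurement described above (a minimal sketch, not the authors' actual evaluation harness), the Python snippet below times a candidate solution and records its peak memory allocation using only the standard library. The solve function and its test input are hypothetical placeholders standing in for code returned by a model on an EffiBench-style task.

import time
import tracemalloc

def solve(nums):
    # Hypothetical stand-in for a model-generated solution to one task.
    return sorted(nums)

def measure(fn, *args):
    """Return (result, elapsed_seconds, peak_bytes) for a single call to fn."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) traced allocations
    tracemalloc.stop()
    return result, elapsed, peak

if __name__ == "__main__":
    result, elapsed, peak = measure(solve, list(range(10_000, 0, -1)))
    print(f"execution time: {elapsed:.4f} s, peak memory: {peak / 1024:.1f} KiB")

In practice, each generated solution would be run against the dataset's test cases, with correctness checked before efficiency metrics are recorded and averaged across runs.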
Recommended Citation
Jonnala, Ramya and Chillimuntha, Sai Kiran, "Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning" (2025). Student Research Symposium 2025. 12.
https://digitalcommons.tamusa.edu/srs_2025/12
Comments
1:00-2:00 p.m.
BLH 262
Studies in Mathematical, Physical & Engineering
Walter Den, Moderator