Background
In the last few years, every organization from Fortune 500 companies to startups has rushed to ride the generative AI wave. However, while demos produce applause-worthy performance, production-ready applications often fall short of expectations. The gap between performance in controlled demonstrations and performance in the real world is what reveals a model’s ability to generalize. Given the open-ended nature of their outputs, a rigorous evaluation of generalizability for generative AI models requires looking beyond classical evaluation and model-selection methods.
AI-generated text is difficult to evaluate objectively with quantitative metrics. Because the output space is so high-dimensional, conducting expert evaluations of text output at scale is infeasible. Natural language makes frequent use of devices such as paraphrasing that can produce distinct surface forms of the same semantic content. Additionally, the generative models that define AI’s leading edge today are all probabilistic in nature. Reproducible, large-scale evaluations of model-generated text need to address these challenges effectively.
What if we could evaluate AI-generated text with AI itself? Even though it sounds like letting the fox guard the henhouse, the idea may be key to large-scale validation of content produced by AI systems. In the first of this two-part article, we review the current methods for evaluating generated natural language text as well as emerging ideas on using LLMs for such evaluations.
Limitations of Current Methods
Evaluation metrics like BLEU, ROUGE, and METEOR are inadequate for evaluating text generated by large language models (LLMs). These metrics, which are designed to compare outputs against fixed references, end up penalizing outputs that rephrase ideas or deviate creatively from reference texts, even when those outputs are correct and well-written (Reiter, 2018). Additionally, the single numerical scores provided by these metrics are opaque, offering little actionable feedback about a model's strengths and weaknesses.
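To make this concrete, here is a minimal sketch (assuming NLTK is installed; the example sentences are our own) showing how a reference-matching metric scores a faithful paraphrase far below an exact copy of the reference:

```python
# Sketch: BLEU rewards exact reference matching and penalizes valid paraphrases.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sat on the mat".split()]
exact_copy = "the cat sat on the mat".split()
paraphrase = "a cat was sitting on the rug".split()  # same meaning, different wording

smooth = SmoothingFunction().method1  # avoid zero scores for short sentences
print(sentence_bleu(reference, exact_copy, smoothing_function=smooth))  # ~1.0
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # close to 0
```

The paraphrase is a perfectly acceptable rendering of the reference, yet its n-gram overlap is minimal, so BLEU scores it near zero.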
Metrics like BERTScore (Zhang et al., 2020) attempt to address these limitations by measuring semantic similarity using contextual embeddings. They tend to perform well in tasks with high semantic alignment, such as paraphrase detection, but struggle in subjective tasks like dialogue generation or summarization. BERTScore may, for example, assign high scores to semantically similar but poorly written outputs, or penalize creative, high-quality outputs that diverge from the reference text (Yuan et al., 2021). Such shortcomings make it less effective at capturing human-valued dimensions like creativity, coherence, or stylistic variation.
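For comparison, a brief sketch of embedding-based scoring, assuming the bert_score package is available (it downloads a pretrained model on first use; the sentences are again our own illustration):

```python
# Sketch: BERTScore compares candidate and reference via contextual embeddings,
# so a paraphrase scores much higher than it would under n-gram overlap metrics.
from bert_score import score

candidates = ["A cat was sitting on the rug."]
references = ["The cat sat on the mat."]

P, R, F1 = score(candidates, references, lang="en")  # tensors, one entry per pair
print(f"BERTScore F1: {F1[0].item():.3f}")  # high despite different surface wording
```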
A key limitation of the aforementioned metrics is their weak correlation with human judgments of quality, particularly in tasks involving diverse or semantically rich outputs. Studies have shown that metrics like BLEU and ROUGE frequently fail to align with human preferences for logical consistency, content correctness, and stylistic appropriateness (Reiter, 2018; Chaganty et al., 2018). These metrics reward reference-matching and penalize valid paraphrases or novel expressions, leading to a fundamental misalignment with human evaluative standards.

Using LLMs to Evaluate Synthesized Text
Recently, several methods have been proposed to evaluate the quality of LLM-generated text with the help of another LLM. These methods can be grouped in two ways: by technique and by reference (ground truth) availability. First, technique-based methods divide into two evaluation approaches: prompt-based evaluation uses various prompting techniques to assess text, while fine-tuned evaluation involves training a model specifically for the evaluation task (Li et al., 2024). Second, there is the scenario-based distinction between reference-based and reference-free evaluation, where a reference means ground-truth text. In most scenarios we do not have access to references, so reference-free evaluations are often preferred in production use cases.
Prompt-based methods for evaluating generated text quality are popular due to their simplicity. Typically they produce a quantitative measure of text quality, which may be a real-valued score, a scaled value that is easier to interpret as a probability, or a value on a discrete ordered scale (Likert-like). Prompt-based evaluation metrics provide an inexpensive way to carry out a baseline evaluation, and they are frequently the starting point for building custom, application-specific evaluation methods. The table below lists common prompt types, with a code sketch of the score-based variant following it.
Prompt-Type | Prompt | Output |
Score-based | Given the source document: [...] Given the model-generated text: [...] Please score the quality of the generated text out of 10. | Score: 7 |
Probability-based | Given the source document: [...] Given the model-generated text: [...] How many semantic content units from the reference text are covered by the generated text? | 0.8 |
Likert-style | Given the source document: [...] Given the model-generated text: [...] Please score the quality of the generated text on a scale of 1 (worst) to 5 (best). | 2 |
Pairwise | Given the source document: [...] Given the model-generated text 1: [...] And given the model-generated text 2: [...] Please answer which text is better-generated and more consistent. | Text 1 |
Ensemble | Given the source document: [...] Given the model-generated text 1: [...] And given the model-generated text 2: [...] We need you to compare the quality of two texts. There are other evaluators performing the same task. You should discuss with them and make a final decision. Here is the discussion history: <Output of Evaluator 1> … <Output of Evaluator n> Please give your opinion. | Text 1 |
Advanced | Given the source document: [...] Given the model-generated text: [...] Given any reference text: [...] We need you to score the generated text on the following criterion: <task introduction>. Follow the instructions when evaluating: <instructions> Based on that, please give a score out of 5. | Score: 3/5 |
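As a concrete instance of the score-based row above, here is a minimal sketch of a prompt-based evaluator. The helper call_llm is a hypothetical placeholder for whichever chat-completion client you use, and the prompt wording mirrors the table:

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: wire this to your LLM provider of choice."""
    raise NotImplementedError

SCORE_PROMPT = """Given the source document: {source}
Given the model-generated text: {generated}
Please score the quality of the generated text out of 10. Respond as "Score: <number>"."""

def score_based_eval(source: str, generated: str) -> float:
    """Score-based prompt evaluation: ask the evaluator LLM for a 0-10 quality score."""
    reply = call_llm(SCORE_PROMPT.format(source=source, generated=generated))
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from: {reply!r}")
    return float(match.group(1))
```

The other prompt types in the table follow the same pattern, differing only in the prompt template and the parsing of the evaluator's reply.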
Text quality evaluation methods can also be classified by whether they compare generated text against a reference. Reference-free methods, unlike reference-based ones, assess the quality of generated text based on intrinsic attributes such as fluency, coherence, and relevance. These methods often employ language models or classifiers trained to predict human judgment scores, aiming to provide a more flexible and context-independent evaluation. Studies have shown that reference-free metrics can exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality, especially when reference texts are unavailable or of low quality (Sheng et al., 2024).
Combining reference-based and reference-free evaluations: G-Eval
Based on chain-of-thought (CoT) prompting and a form-filling paradigm, G-Eval (Liu et al., 2023) supports both reference-based and reference-free evaluation of generated text. The figure below (taken from Liu et al., 2023) provides an overview of the method.
As outlined above, there are three main components to the framework. First, there are the criterion inputs: the task introduction and the evaluation criteria. The evaluation criteria guide the CoT of the evaluator LLM and form part of its prompt template. Second, there are the inputs to the generator LLM and its output, which are passed to the evaluator LLM as part of the prompt. The final step is a probability-weighted summation of the output scores. G-Eval shows a higher correlation with human evaluations, which are considered the gold standard, than the automated metrics reviewed at the beginning of this article.
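A minimal sketch of that final aggregation step, assuming the probability of each candidate score has already been estimated (e.g., from the evaluator's token log-probabilities or from repeated sampling):

```python
def probability_weighted_score(score_probs: dict[int, float]) -> float:
    """Weight each candidate score by its estimated probability and sum.

    score_probs maps each score on the rating scale (e.g., 1-5) to the
    probability the evaluator LLM assigns to it.
    """
    total = sum(score_probs.values())  # normalize in case probabilities don't sum to 1
    return sum(score * (p / total) for score, p in score_probs.items())

# Example: the evaluator puts most of its mass on 4, with some on 3 and 5.
print(probability_weighted_score({3: 0.2, 4: 0.6, 5: 0.2}))  # 4.0
```

Weighting by probability yields a continuous score that reflects the evaluator's uncertainty, rather than collapsing the judgment to a single integer.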

Automating text evaluations for a real-world use-case
G-Eval can be viewed as a general framework for evaluating generated text against a predetermined set of criteria. Per the authors of the method, G-Eval was conceived to address tasks like summarization and dialogue response generation (Liu et al., 2023). When using the framework to evaluate other text generation systems, adaptations to G-Eval may be needed.
In the next part of this article we will motivate the need for adapting G-Eval to a text generation problem and illustrate how the components described above (the task introduction and evaluation criteria, the evaluator prompt carrying the generator's inputs and outputs, and the score aggregation) may be tweaked to address that need. We will also compare the modified evaluation method with the other automated metrics discussed in the foregoing sections.
References
Reiter, E. (2018). A structured review of the validity of BLEU. Computational Linguistics, 44(3), 393–401.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. Proceedings of ICLR.
Yuan, W., Neubig, G., & Liu, P. (2021). BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems.
Chaganty, A., Mussmann, S., & Liang, P. (2018). The price of debiasing automatic metrics in natural language evaluation. Proceedings of ACL.
Sheng, S., Xu, Y., Fu, L., Ding, J., Zhou, L., Wang, X., & Zhou, C. (2024). Is reference necessary in the evaluation of NLG systems? When and where? Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 8580–8596.
Li, Z., Xu, X., Shen, T., Xu, C., Gu, J. C., Lai, Y., ... & Ma, S. (2024, November). Leveraging large language models for NLG evaluation: Advances and challenges. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 16028-16045).
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.