Multi-LLM Knowledge Enhancement and Agent RAG System Optimization

For Educational Applications

Improving the effectiveness of retrieval-augmented generation systems in higher education

System Architecture Comparison
Comparison of the system architectures of a traditional Retrieval-Augmented Generation (RAG) system and our RAG chatbot system based on multi-LLM synthetic Q&A generation, comparison, and integration. Prior to inference, knowledge is extracted from course documents and augmented using multiple LLMs, and a high-quality Q&A set is generated for downstream retrieval.

Background & Objective

Current Challenges

  • Traditional RAG systems suffer from heterogeneous document formats and noise interference
  • Improper segmentation strategies lead to low retrieval accuracy
  • Severe hallucination problems in course-specific knowledge scenarios
  • Limited effectiveness when handling proprietary educational content

Research Objectives

  • Construct high-quality course-specific knowledge bases
  • Generate structured Q&A knowledge bases through multi-model collaborative knowledge distillation
  • Enhance RAG system retrieval accuracy for educational applications
  • Validate methodology effectiveness in real educational scenarios
LLM Hallucination Examples
Examples of an LLM lacking critical knowledge for course-specific queries. When a student asked about ELEC3442 project requirements, the LLM gave generic hardware design steps rather than the course-specific details.
Traditional RAG Problems
Challenges faced by traditional RAG systems during knowledge base construction

Methodology

This research proposes an innovative "Multi-LLM Synthetic Q&A Generation, Comparison, and Integration" pipeline with a core workflow of "generate → evaluate → compare → integrate".
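The sketch below shows one way the four stages could be wired together. All function bodies are placeholder stubs with names of our own choosing; realistic versions of each stage are sketched in the subsections that follow.

```python
# Minimal runnable skeleton of the "generate → evaluate → compare → integrate"
# workflow; every stage function is a stub, not the paper's implementation.

MODELS = ["claude-sonnet-4", "gemini-2.5-pro", "deepseek-r1", "chatgpt-o3", "grok-4"]

def generate_qa(model, docs):
    # Stage 1: each LLM produces its own candidate Q&A set (stub).
    return [{"model": model, "q": "...", "a": "..."}]

def score_kb(qa_set):
    # Stage 2: quantitative evaluation of one candidate set (stub score).
    return len(qa_set)

def vote_and_filter(candidate_sets, base_kb):
    # Stage 3: cross-model voting keeps only complementary pairs (stub).
    return [qa for qa_set in candidate_sets for qa in qa_set]

def integrate(base_kb, extras):
    # Stage 4: merge and segment into the final knowledge base (stub).
    return base_kb + extras

def run_pipeline(docs):
    candidates = {m: generate_qa(m, docs) for m in MODELS}
    best = max(candidates, key=lambda m: score_kb(candidates[m]))
    base_kb = candidates.pop(best)          # best single-model KB
    extras = vote_and_filter(candidates.values(), base_kb)
    return integrate(base_kb, extras)

print(len(run_pipeline(["course_notes.pdf"])))  # -> 5 stub entries
```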

Multi-Model Collaborative Generation

Utilizes five frontier LLMs (Claude Sonnet 4, Gemini 2.5 Pro, DeepSeek R1, ChatGPT o3, Grok 4) with Chain-of-Thought prompts to extract meta-knowledge and generate self-contained Q&A pairs.

Multi-Model Generation
Multi-LLM Collaborative Q&A Generation Process
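A minimal sketch of the per-model generation step, assuming a hypothetical call_llm(model, prompt) wrapper around each provider's SDK; the prompt wording is illustrative, not the exact Chain-of-Thought prompt used in this work.

```python
import json

# Chain-of-Thought prompt; double braces escape literal JSON braces for .format().
COT_PROMPT = """You are building a course-specific knowledge base.
Think step by step:
1. List the key concepts (meta-knowledge) covered by the document below.
2. For each concept, write one self-contained Q&A pair that is fully
   understandable without access to the document.
Return a JSON list of {{"question": ..., "answer": ...}} objects.

Document:
{document}
"""

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper; plug in the matching provider SDK per model."""
    raise NotImplementedError

def generate_qa(model: str, document: str) -> list[dict]:
    raw = call_llm(model, COT_PROMPT.format(document=document))
    return json.loads(raw)  # expects the model to return valid JSON
```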

Quantitative Evaluation Framework

Designs a comprehensive assessment system encompassing text quality metrics (readability, coherence, technical term density) and retrieval accuracy metrics (recall, precision). Based on this evaluation, we select a high-quality Q&A knowledge base produced through knowledge distillation as the foundation for subsequent integration.

Evaluation Framework
Comprehensive Quantitative Evaluation Framework
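A minimal sketch of two of the text quality metrics, assuming the textstat package for readability; the glossary-based term density measure is an illustrative stand-in for the paper's exact definition.

```python
import textstat  # assumed dependency: pip install textstat

def readability(text: str) -> float:
    # Flesch Reading Ease: higher scores mean easier-to-read text.
    return textstat.flesch_reading_ease(text)

def term_density(text: str, glossary: set[str]) -> float:
    # Share of tokens that are domain terms from a course glossary.
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return sum(t in glossary for t in tokens) / len(tokens) if tokens else 0.0

sample = "The FPGA implements combinational logic with lookup tables."
print(readability(sample), term_density(sample, {"fpga", "verilog", "lut"}))
```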

Cross-Model Voting Mechanism

Employs semantic similarity comparison and a voting mechanism to filter and integrate complementary Q&A pairs from different models. Q&A pairs selected by the vote must have semantic similarity below 10% with the existing Q&A knowledge base, ensuring diversity and preventing redundancy.

Voting Mechanism
Cross-Model Voting and Integration Mechanism
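A minimal sketch of the diversity filter, assuming sentence-transformers embeddings (all-MiniLM-L6-v2) and interpreting "semantic similarity < 10%" as cosine similarity below 0.10; both choices are our assumptions, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util  # assumed dependency

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def filter_complementary(candidates: list[str], kb: list[str],
                         max_sim: float = 0.10) -> list[str]:
    """Keep only candidates whose cosine similarity to every already-accepted
    entry stays below max_sim (re-encoding is O(n^2), fine for a sketch)."""
    accepted = list(kb)  # assumes a non-empty base knowledge base
    for text in candidates:
        sims = util.cos_sim(encoder.encode([text], convert_to_tensor=True),
                            encoder.encode(accepted, convert_to_tensor=True))
        if float(sims.max()) < max_sim:
            accepted.append(text)
    return accepted[len(kb):]  # only the newly added complementary pairs
```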

Knowledge Base Optimization

Applies rule-based segmentation strategies to the generated structured Q&A pairs, replacing traditional automatic segmentation methods to achieve optimal knowledge base construction.

Knowledge Base Optimization
Rule-based Knowledge Base Optimization Strategy
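A minimal sketch of the rule-based segmentation, assuming the generated text marks pairs with "Q:"/"A:" prefixes (our assumption); each pair becomes exactly one retrieval chunk rather than a fixed-size window.

```python
import re

def segment_qa(text: str) -> list[str]:
    """Rule-based segmentation: one chunk per Q&A pair, so every retrieval
    unit contains a complete, self-contained question and answer."""
    # Split at each line starting with "Q:", keeping the marker with its pair.
    pairs = re.split(r"(?m)^(?=Q:)", text)
    return [p.strip() for p in pairs if p.strip()]

doc = """Q: What does the ELEC3442 project require?
A: A synthesizable design plus a written report.

Q: Which HDL is used?
A: Verilog."""
print(segment_qa(doc))  # -> two chunks, one per Q&A pair
```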

Experiment & Analysis

Dataset

  • Documents from engineering courses in higher education
  • 2k benchmark queries for evaluation
  • Comprehensive testing across diverse question types

Performance Metrics

  • Comprehensive evaluation across text quality and retrieval accuracy
  • LLM-as-Judge evaluation framework (a prompt sketch follows this list)
  • Consistent improvements across diverse educational scenarios
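A minimal sketch of an LLM-as-Judge comparison; the rubric wording and the GOOD/SAME/BAD protocol are illustrative assumptions chosen to line up with the GSB win rates reported below, and call_llm is again a hypothetical provider wrapper.

```python
JUDGE_PROMPT = """You are grading two chatbot answers to the same course query.

Query: {query}
Answer A (baseline KB): {answer_a}
Answer B (Q&A KB): {answer_b}

Judge which answer better satisfies the query on factual accuracy and
course specificity. Reply with exactly one word:
GOOD (B is better), SAME, or BAD (B is worse)."""

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical provider wrapper, as in the generation sketch above."""
    raise NotImplementedError

def judge(query: str, answer_a: str, answer_b: str) -> str:
    verdict = call_llm("judge-model", JUDGE_PROMPT.format(
        query=query, answer_a=answer_a, answer_b=answer_b))
    return verdict.strip().upper()  # one of GOOD / SAME / BAD
```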

Text Quality Enhancement

  • Gemini 2.5 Pro achieved optimal readability performance
  • Claude Sonnet 4 and DeepSeek R1 led in content richness
  • Significant improvements in text coherence and technical depth
Knowledge Base Comparison
Text readability and content richness metrics across different knowledge bases
Structural Quality Results
Structural quality metrics across different knowledge bases

Retrieval Accuracy Results

  • Recall and precision improved by ~25 percentage points
  • Claude Sonnet 4: 58.04% recall and 59.97% precision GSB (Good/Same/Bad) win rates
  • Qualitative case: traditional KB (0% recall, 0% precision) vs. Q&A KB (80% recall, 90% precision); the metric computation is sketched after this list
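A minimal sketch of how per-query recall and precision are computed over retrieved chunks; the chunk IDs are toy values, not the paper's data.

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    # Recall: fraction of relevant chunks that were retrieved.
    # Precision: fraction of retrieved chunks that are relevant.
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Toy example: 4 of 5 relevant chunks retrieved, all retrieved chunks relevant.
print(retrieval_metrics(["c1", "c2", "c3", "c4"], {"c1", "c2", "c3", "c4", "c5"}))
# -> (0.8, 1.0)
```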
Retrieval Accuracy Results
Retrieval accuracy metrics across different knowledge bases
An interactive demo of knowledge retrieval and responses by chatbots using two knowledge bases: on the left, a chatbot using the baseline knowledge base; on the right, a chatbot using our proposed Q&A knowledge base.


Contribution

Methodological Innovation

Proposed a data-centric, model-agnostic knowledge base construction workflow that enhances knowledge quality through preprocessing stages, reducing dependence on expensive fine-tuning or ultra-long context inference.

Empirical Validation

Provided empirical evidence through large-scale metrics and concrete case studies that synthetic multi-LLM Q&A corpora significantly improve RAG accuracy and answer utility in higher education settings.

Evaluation Framework

Established an integrated assessment framework with LLM-as-judge metrics, portable to other courses and domains, enabling continuous improvement of educational chatbots.

Practical Value

Demonstrated that rigorous knowledge engineering approaches offer a pragmatic and scalable path to trustworthy GenAI assistance in knowledge-intensive academic environments.
