Multi-LLM Knowledge Enhancement and Agent RAG System Optimization

For Educational Applications

Improving the effectiveness of retrieval-augmented generation systems in higher education

System Architecture Comparison
Comparison of the system architectures of a traditional Retrieval-Augmented Generation (RAG) system and our RAG chatbot system based on multi-LLM synthetic Q&A generation, comparison, and integration. Prior to inference, knowledge is extracted from course documents and augmented using multiple LLMs, and a high-quality Q&A set is generated for downstream retrieval.

Background & Objective

Current Challenges

  • Traditional RAG systems suffer from heterogeneous document formats and noise interference
  • Improper segmentation strategies lead to low retrieval accuracy
  • Severe hallucination problems in course-specific knowledge scenarios
  • Limited effectiveness when handling proprietary educational content

Research Objectives

  • Construct high-quality course-specific knowledge bases
  • Generate structured Q&A knowledge bases through multi-model collaborative knowledge distillation
  • Enhance RAG system retrieval accuracy for educational applications
  • Validate methodology effectiveness in real educational scenarios
LLM Hallucination Examples
Examples of an LLM lacking critical knowledge for course-specific queries. When a student asked about ELEC3442 project requirements, the LLM gave generic hardware design steps rather than the course-specific details.
Traditional RAG Problems
Challenges faced by traditional RAG systems during knowledge base construction

Methodology

This research proposes an innovative "Multi-LLM Synthetic Q&A Generation, Comparison, and Integration" pipeline with a core workflow of "generate → evaluate → compare → integrate".
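The sketch below shows one way the four stages could be wired together. All function bodies are placeholder stubs with names of our own choosing; realistic versions of each stage are sketched in the subsections that follow.

```python
# Minimal runnable skeleton of the "generate → evaluate → compare → integrate"
# workflow; every stage function is a stub, not the paper's implementation.

MODELS = ["claude-sonnet-4", "gemini-2.5-pro", "deepseek-r1", "chatgpt-o3", "grok-4"]

def generate_qa(model, docs):
    # Stage 1: each LLM produces its own candidate Q&A set (stub).
    return [{"model": model, "q": "...", "a": "..."}]

def score_kb(qa_set):
    # Stage 2: quantitative evaluation of one candidate set (stub score).
    return len(qa_set)

def vote_and_filter(candidate_sets, base_kb):
    # Stage 3: cross-model voting keeps only complementary pairs (stub).
    return [qa for qa_set in candidate_sets for qa in qa_set]

def integrate(base_kb, extras):
    # Stage 4: merge and segment into the final knowledge base (stub).
    return base_kb + extras

def run_pipeline(docs):
    candidates = {m: generate_qa(m, docs) for m in MODELS}
    best = max(candidates, key=lambda m: score_kb(candidates[m]))
    base_kb = candidates.pop(best)          # best single-model KB
    extras = vote_and_filter(candidates.values(), base_kb)
    return integrate(base_kb, extras)

print(len(run_pipeline(["course_notes.pdf"])))  # -> 5 stub entries
```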

Multi-Model Collaborative Generation

Utilizes five frontier LLMs (Claude Sonnet 4, Gemini 2.5 Pro, DeepSeek R1, ChatGPT o3, Grok 4) with Chain-of-Thought prompts to extract meta-knowledge and generate self-contained Q&A pairs.

Multi-Model Generation
Multi-LLM Collaborative Q&A Generation Process
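A minimal sketch of the per-model generation step, assuming a hypothetical call_llm(model, prompt) wrapper around each provider's SDK; the prompt wording is illustrative, not the exact Chain-of-Thought prompt used in this work.

```python
import json

# Chain-of-Thought prompt; double braces escape literal JSON braces for .format().
COT_PROMPT = """You are building a course-specific knowledge base.
Think step by step:
1. List the key concepts (meta-knowledge) covered by the document below.
2. For each concept, write one self-contained Q&A pair that is fully
   understandable without access to the document.
Return a JSON list of {{"question": ..., "answer": ...}} objects.

Document:
{document}
"""

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper; plug in the matching provider SDK per model."""
    raise NotImplementedError

def generate_qa(model: str, document: str) -> list[dict]:
    raw = call_llm(model, COT_PROMPT.format(document=document))
    return json.loads(raw)  # expects the model to return valid JSON
```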

Quantitative Evaluation Framework

Designs a comprehensive assessment system encompassing text quality metrics (readability, coherence, technical term density) and retrieval accuracy metrics (recall, precision). Based on this evaluation, we select a high-quality Q&A knowledge base produced through knowledge distillation as the foundation for subsequent integration.

Evaluation Framework
Comprehensive Quantitative Evaluation Framework
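A minimal sketch of two of the text quality metrics, assuming the textstat package for readability; the glossary-based term density measure is an illustrative stand-in for the paper's exact definition.

```python
import textstat  # assumed dependency: pip install textstat

def readability(text: str) -> float:
    # Flesch Reading Ease: higher scores mean easier-to-read text.
    return textstat.flesch_reading_ease(text)

def term_density(text: str, glossary: set[str]) -> float:
    # Share of tokens that are domain terms from a course glossary.
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return sum(t in glossary for t in tokens) / len(tokens) if tokens else 0.0

sample = "The FPGA implements combinational logic with lookup tables."
print(readability(sample), term_density(sample, {"fpga", "verilog", "lut"}))
```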

Cross-Model Voting Mechanism

Employs semantic similarity comparison and a voting mechanism to filter and integrate complementary Q&A pairs from different models. Q&A pairs selected by the vote must have semantic similarity below 10% with the existing Q&A knowledge base, ensuring diversity and preventing redundancy.

Voting Mechanism
Cross-Model Voting and Integration Mechanism
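A minimal sketch of the diversity filter, assuming sentence-transformers embeddings (all-MiniLM-L6-v2) and interpreting "semantic similarity < 10%" as cosine similarity below 0.10; both choices are our assumptions, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util  # assumed dependency

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def filter_complementary(candidates: list[str], kb: list[str],
                         max_sim: float = 0.10) -> list[str]:
    """Keep only candidates whose cosine similarity to every already-accepted
    entry stays below max_sim (re-encoding is O(n^2), fine for a sketch)."""
    accepted = list(kb)  # assumes a non-empty base knowledge base
    for text in candidates:
        sims = util.cos_sim(encoder.encode([text], convert_to_tensor=True),
                            encoder.encode(accepted, convert_to_tensor=True))
        if float(sims.max()) < max_sim:
            accepted.append(text)
    return accepted[len(kb):]  # only the newly added complementary pairs
```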

Knowledge Base Optimization

Applies rule-based segmentation strategies to the generated structured Q&A pairs, replacing traditional automatic segmentation methods to achieve optimal knowledge base construction.

Knowledge Base Optimization
Rule-based Knowledge Base Optimization Strategy
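A minimal sketch of the rule-based segmentation, assuming the generated text marks pairs with "Q:"/"A:" prefixes (our assumption); each pair becomes exactly one retrieval chunk rather than a fixed-size window.

```python
import re

def segment_qa(text: str) -> list[str]:
    """Rule-based segmentation: one chunk per Q&A pair, so every retrieval
    unit contains a complete, self-contained question and answer."""
    # Split at each line starting with "Q:", keeping the marker with its pair.
    pairs = re.split(r"(?m)^(?=Q:)", text)
    return [p.strip() for p in pairs if p.strip()]

doc = """Q: What does the ELEC3442 project require?
A: A synthesizable design plus a written report.

Q: Which HDL is used?
A: Verilog."""
print(segment_qa(doc))  # -> two chunks, one per Q&A pair
```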

Experiment & Analysis

Dataset

  • Documents from engineering courses in higher education
  • 2k benchmark queries for evaluation
  • Comprehensive testing across diverse question types

Performance Metrics

  • Comprehensive evaluation across text quality and retrieval accuracy
  • LLM-as-Judge evaluation framework (a prompt sketch follows this list)
  • Consistent improvements across diverse educational scenarios
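A minimal sketch of an LLM-as-Judge comparison; the rubric wording and the GOOD/SAME/BAD protocol are illustrative assumptions chosen to line up with the GSB win rates reported below, and call_llm is again a hypothetical provider wrapper.

```python
JUDGE_PROMPT = """You are grading two chatbot answers to the same course query.

Query: {query}
Answer A (baseline KB): {answer_a}
Answer B (Q&A KB): {answer_b}

Judge which answer better satisfies the query on factual accuracy and
course specificity. Reply with exactly one word:
GOOD (B is better), SAME, or BAD (B is worse)."""

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical provider wrapper, as in the generation sketch above."""
    raise NotImplementedError

def judge(query: str, answer_a: str, answer_b: str) -> str:
    verdict = call_llm("judge-model", JUDGE_PROMPT.format(
        query=query, answer_a=answer_a, answer_b=answer_b))
    return verdict.strip().upper()  # one of GOOD / SAME / BAD
```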

Text Quality Enhancement

  • Gemini 2.5 Pro achieved optimal readability performance
  • Claude Sonnet 4 and DeepSeek R1 led in content richness
  • Significant improvements in text coherence and technical depth
Knowledge Base Comparison
Text readability and content richness metrics across different knowledge bases
Structural Quality Results
Structural quality metrics across different knowledge bases

Retrieval Accuracy Results

  • Recall and precision improved by ~25 percentage points
  • Claude Sonnet 4: 58.04% recall and 59.97% precision GSB (Good/Same/Bad) win rates
  • Qualitative case: traditional KB (0% recall, 0% precision) vs. Q&A KB (80% recall, 90% precision); the metric computation is sketched after this list
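A minimal sketch of how per-query recall and precision are computed over retrieved chunks; the chunk IDs are toy values, not the paper's data.

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    # Recall: fraction of relevant chunks that were retrieved.
    # Precision: fraction of retrieved chunks that are relevant.
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Toy example: 4 of 5 relevant chunks retrieved, all retrieved chunks relevant.
print(retrieval_metrics(["c1", "c2", "c3", "c4"], {"c1", "c2", "c3", "c4", "c5"}))
# -> (0.8, 1.0)
```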
Retrieval Accuracy Results
Retrieval accuracy metrics across different knowledge bases
An interactive demo of knowledge retrieval and responses by chatbots using two knowledge bases: on the left, a chatbot using the baseline knowledge base; on the right, a chatbot using our proposed Q&A knowledge base.


Contribution

Methodological Innovation

Proposed a data-centric, model-agnostic knowledge base construction workflow that enhances knowledge quality through preprocessing stages, reducing dependence on expensive fine-tuning or ultra-long context inference.

Empirical Validation

Provided empirical evidence through large-scale metrics and concrete case studies that synthetic multi-LLM Q&A corpora significantly improve RAG accuracy and answer utility in higher education settings.

Evaluation Framework

Established an integrated assessment framework with LLM-as-judge metrics, portable to other courses and domains, enabling continuous improvement of educational chatbots.

Practical Value

Demonstrated that rigorous knowledge engineering approaches offer a pragmatic and scalable path to trustworthy GenAI assistance in knowledge-intensive academic environments.
