Paper

BitsAI-CR: Automated Code Review via LLM in Practice

https://arxiv.org/abs/2501.15134

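Key Points

Rules

  1. The issues AI review can detect form a fixed set; each one maps to a rule, and the rules are organized by domain and category
  2. Adding a rule in practice means retraining the LoRA model
  3. Experiments show that a model trained on taxonomy-organized rules is more effective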

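Pipeline

  1. Proposes a three-stage pipeline: RuleChecker - ReviewFilter - Aggregator
  2. Both the Checker and the Filter are fine-tuned LoRA models; the Aggregator merges issues by embedding similarity
  3. Experiments show that the Filter reduces hallucinated issues and raises precision
  4. Experiments show that the Conclusion-First reasoning format for the Filter gives the highest precision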

Metrics

  1. Precision: Precision matters more than Recall
  2. Outdated Rate: roughly a fix rate, i.e. whether the flagged code was modified afterwards

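Continuous Iteration

  1. New rules: rules from static analysis + comments from human reviews
  2. Online feedback iteration: user feedback + manual labeling + monitoring dashboards
  3. Metric trend dashboards + surveys and expert interviews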

Open Questions:

  1. The AI outputs a single line; how are multi-line issues evaluated?

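Takeaways

  1. Introduce a second-stage ReviewFilter and a third-stage Aggregator
  2. Sustained investment is needed in labeling data, designing experiments, and evaluating results
  3. For new business rules, training a LoRA is one option (RAG is another)
  4. Use Precision and Outdated Rate as the key metrics
  5. Continuous iteration depends on a large amount of manual labeling; someone has to label the data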

Results

Empirical evaluation demonstrates BitsAI-CR’s effectiveness, achieving 75.0% precision in review comment generation. For the Go language which has predominant usage at ByteDance, we maintain an Outdated Rate of 26.7%.

3. Methodology

3.1 Architecture

[Figure: framework]

3.2 Rule Preparation

219 rules, organized in three levels: Dimensions - Categories - Rules.

[Figure: rule taxonomy]
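
The three-level organization can be pictured as a small nested data structure. The following is a minimal sketch, not the paper's actual schema: the class names, fields, helper functions, and all example entries are hypothetical and exist only to illustrate the Dimensions - Categories - Rules layering.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Rule:
    rule_id: str
    description: str


@dataclass
class Category:
    name: str
    rules: List[Rule] = field(default_factory=list)


@dataclass
class Dimension:
    name: str
    categories: List[Category] = field(default_factory=list)


def count_rules(taxonomy: List[Dimension]) -> int:
    """Total number of leaf rules across all dimensions and categories."""
    return sum(len(c.rules) for d in taxonomy for c in d.categories)


def find_path(taxonomy: List[Dimension], rule_id: str) -> Optional[Tuple[str, str, str]]:
    """Return (dimension, category, rule_id) for a rule, or None if it is absent."""
    for d in taxonomy:
        for c in d.categories:
            for r in c.rules:
                if r.rule_id == rule_id:
                    return (d.name, c.name, r.rule_id)
    return None


# Hypothetical entries for illustration only; not taken from the paper.
taxonomy = [
    Dimension("Code Defect", [
        Category("Error Handling", [
            Rule("R001", "Returned error is ignored"),
            Rule("R002", "Error is logged but not propagated"),
        ]),
    ]),
    Dimension("Maintainability", [
        Category("Naming", [Rule("R003", "Misleading variable name")]),
    ]),
]
```

Keeping the path from dimension to rule explicit makes it straightforward to sample training data or report precision per category rather than only per rule.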

3.3 The Complete Pipeline

Context Preparation

  1. Split the code diff into hunks to avoid exceeding the context length
  2. Use tree-sitter to expand each code block to the complete function definition
  3. Add annotations: [deleted or pre-modified @line_number in old code] or [added or post-modified @line_number in new code]
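
Steps 1 and 3 above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes git-style unified diff input, the function names `split_hunks` and `annotate_hunk` are hypothetical, and the exact way the line number is rendered inside the annotation is a guess. The tree-sitter expansion in step 2 is omitted.

```python
import re

# Matches a unified-diff hunk header such as "@@ -10,3 +10,4 @@"
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def split_hunks(diff_text):
    """Split a unified diff into hunks; each hunk is a list of lines starting with its header."""
    hunks, current = [], None
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            current = None  # a new file begins: skip its "---"/"+++" header lines
        elif HUNK_HEADER.match(line):
            current = [line]
            hunks.append(current)
        elif current is not None:
            current.append(line)
    return hunks


def annotate_hunk(hunk):
    """Prefix removed and added lines with old/new line-number annotations."""
    m = HUNK_HEADER.match(hunk[0])
    old_no, new_no = int(m.group(1)), int(m.group(2))
    out = []
    for line in hunk[1:]:
        if line.startswith("\\"):
            continue  # e.g. "\ No newline at end of file"
        if line.startswith("-"):
            out.append(f"[deleted or pre-modified @{old_no} in old code] {line[1:]}")
            old_no += 1
        elif line.startswith("+"):
            out.append(f"[added or post-modified @{new_no} in new code] {line[1:]}")
            new_no += 1
        else:
            # Context line: present in both versions, so both counters advance
            out.append(line[1:] if line.startswith(" ") else line)
            old_no += 1
            new_no += 1
    return out


sample_diff = "\n".join([
    "--- a/main.go",
    "+++ b/main.go",
    "@@ -10,3 +10,3 @@",
    " func f() {",
    "-    x := 1",
    "+    x := 2",
    " }",
])
annotated = [annotate_hunk(h) for h in split_hunks(sample_diff)]
```

Processing each hunk separately keeps every model input bounded, while the explicit old/new line numbers give the model something concrete to cite when it reports an issue.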

RuleChecker

[Figure: RuleChecker]

ReviewFilter

A binary-classification filter. Three different reasoning patterns:

  1. Direct conclusion
  2. Reasoning-First (CoT)
  3. Conclusion-First

Comment Aggregation

Uses an embedding model to aggregate similar issues.
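
The notes do not say which embedding model or grouping algorithm is used. As a rough sketch of the idea, the code below greedily groups comments whose embedding vectors have high cosine similarity. The function names, the `0.9` threshold, and the toy two-dimensional vectors are all assumptions made for illustration; real embeddings would come from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate(comments, embeddings, threshold=0.9):
    """Greedy grouping: a comment joins the first group whose first member is similar enough."""
    groups = []  # each group is a list of indices into `comments`
    for i, emb in enumerate(embeddings):
        for group in groups:
            if cosine(embeddings[group[0]], emb) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found: start a new one
    return [[comments[i] for i in group] for group in groups]


# Toy data standing in for real review comments and their embeddings
comments = ["error ignored in f()", "f() drops the returned error", "typo in variable name"]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
merged = aggregate(comments, embeddings)
```

Here the first two comments end up in one group and the third stays alone, so only one comment per group would need to be shown to the developer.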

3.4 Metric Evaluation

Precision

Precision is valued over recall, so precision is the metric used.

Formally, let $C_{correct}$ represent the set of correct comments and $C_{total}$ represent all comments generated by BitsAI-CR,
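
The quoted sentence is cut off in these notes. Given the two sets it defines, the metric is presumably the standard precision ratio; the formula below is reconstructed from those definitions rather than copied from the paper.

```latex
\[
\mathrm{Precision} = \frac{|C_{correct}|}{|C_{total}|}
\]
```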

Outdated Rate

Essentially a fix rate: Outdated Rate for Automated Evaluation

$C_{seen}$ represents the set of comments reviewed by code committers within a one-week measurement window, and the function isOutdated($c$) returns true only if a comment $c$ is considered outdated, which occurs when any line within its flagged code range is modified in subsequent commits.
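
Read literally, these definitions suggest Outdated Rate $= |\{c \in C_{seen} : \mathrm{isOutdated}(c)\}| \,/\, |C_{seen}|$. The sketch below computes that ratio under simplifying assumptions: the `Comment` fields and the `modified_lines` mapping are hypothetical, and it ignores the problem of tracking line numbers as they shift across commits.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    file: str
    start_line: int  # first line of the flagged code range
    end_line: int    # last line of the flagged code range (inclusive)


def is_outdated(comment, modified_lines):
    """True if any line in the comment's flagged range was modified in later commits.

    `modified_lines` maps a file path to the set of line numbers changed afterwards.
    """
    changed = modified_lines.get(comment.file, set())
    return any(n in changed for n in range(comment.start_line, comment.end_line + 1))


def outdated_rate(seen_comments, modified_lines):
    """Share of seen comments whose flagged code was later modified."""
    if not seen_comments:
        return 0.0
    outdated = sum(1 for c in seen_comments if is_outdated(c, modified_lines))
    return outdated / len(seen_comments)


# Toy data: four seen comments, one of which overlaps a later modification
seen = [
    Comment("a.go", 10, 12),
    Comment("a.go", 40, 40),
    Comment("b.go", 5, 8),
    Comment("c.go", 1, 3),
]
modified = {"a.go": {11}, "b.go": {20}}
rate = outdated_rate(seen, modified)
```

Because the check only needs commit history, the metric can be computed automatically, without asking developers for explicit feedback.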

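3.5 Data Flywheel: Continuous Evolution

Sources of New Rules

  1. Internal static-analysis rules
  2. Comments from human reviews

After the rules are curated and sampled, a new model is trained.

Online Feedback Iteration

  1. Collect user feedback
  2. Manually sample each rule's Precision, with human labeling
  3. Monitor Outdated Rate

The metrics are considered together to decide whether to add or remove a rule.
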
Rules are assessed based on specific criteria: an Outdated Rate of around 25% (±5%) with a precision of around 65% (±5%) for 14 days
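
The quoted criteria do not say exactly how the bands are applied. One possible reading, sketched below, treats the lower edge of each band (60% precision, 20% Outdated Rate) as a floor that a rule must meet over a 14-day window. That interpretation, the `RuleStats` fields, and the `"observe"` / `"keep"` / `"review"` labels are all assumptions, not the paper's procedure.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    rule_id: str
    precision: float      # sampled, human-labeled precision in [0, 1]
    outdated_rate: float  # automatically measured Outdated Rate in [0, 1]
    days_observed: int


def assess_rule(stats, min_precision=0.60, min_outdated_rate=0.20, window_days=14):
    """Classify a rule from its online metrics (thresholds are assumed floors)."""
    if stats.days_observed < window_days:
        return "observe"  # not enough data yet
    if stats.precision >= min_precision and stats.outdated_rate >= min_outdated_rate:
        return "keep"
    return "review"  # candidate for revision or removal


decisions = {
    s.rule_id: assess_rule(s)
    for s in [
        RuleStats("R001", 0.70, 0.27, 14),
        RuleStats("R002", 0.50, 0.30, 21),
        RuleStats("R003", 0.80, 0.10, 14),
        RuleStats("R004", 0.90, 0.40, 7),
    ]
}
```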

3.6 Deployment

Integrated with GitLab. Once an issue has been fixed, it comments LGTM.

[Figure: GitLab integration]

4 Experiments

4.1 Model Training

Two LoRA models trained on top of Doubao: RuleChecker and ReviewFilter

4.2 Offline Evaluation

Test Dataset

To evaluate BitsAI-CR’s effectiveness in code review, we collect an
offline dataset consisting of 1397 cases sampled from the production
codebase, where 767 samples violate and 630 samples follow the
code best practices.

Comparison Method

The review comment is deemed correct only if the model determines it aligns with the ground truth.

Two versions of the model were evaluated:

  1. BitsAI-CR w/o Taxonomy 57.03%
  2. BitsAI-CR 16.83%

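Ablation study: shows that ReviewFilter is useful

Reasoning Patterns of ReviewFilter

Conclusion-First yields the highest precision (P), the lowest recall (R), and the highest Filter Rate.

Because P matters more, the Conclusion-First format is adopted.

[Figure: results]

4.3 Online Evaluation

Weekly trends of Precision and Outdated Rate

Qualitative analysis: surveys + expert interviews

5 Key Insights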

  1. Taxonomy of Review Rules: enables systematic code issue categorization, data collection, and performance evaluation.
  2. Two-Stage Review Generation: enhances automated code review reliability by validating identified issues.
  3. Precision and Outdated Rate Metrics: guide data flywheel optimization through user-centric evaluation.