SOLR In-Depth Source Code Series (3): Index Building and Update Mechanisms

Part 3: Index Building and Update Mechanisms

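3.1 Preface

In the previous article, we examined SOLR's overall architecture from a high level and saw how a request travels from the client to the server and passes through the core components. This article focuses on one of SOLR's core functions: building and updating the index. In both standalone and distributed deployments, the index is what makes SOLR's searches fast. We start from the index lifecycle, work step by step through how SOLR turns documents into searchable data, and use the source code to examine the key implementation details.

Building and updating the index involves several cooperating components: client request parsing, update processing logic, Lucene's low-level index operations, and the transaction log and commit strategy. By the end of this article you should understand the core mechanics of SOLR indexing, which lays the groundwork for later tuning and customization.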

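3.2 The Index Lifecycle

SOLR's indexing process can be divided into the following stages:

Client submission: documents are sent over HTTP (for example, a POST with a JSON body).
Request dispatch: SOLR receives the request and routes it to the update handling code.
Document processing: the documents are parsed and the Schema rules are applied.
Index writing: the documents are written to the Lucene index.
Commit and synchronization: the data is made durable and visible to queries.

In a distributed environment, shard assignment and replica synchronization are also involved. This article concentrates on standalone mode; SolrCloud is covered later in the series.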

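3.3 Client Submission: Starting from the Request

Indexing begins when a client submits documents. SOLR accepts several formats (such as JSON, XML, and CSV); the example below uses JSON: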

POST http://localhost:8983/solr/mycore/update?commit=true
Content-Type: application/json
[
  {"id": "1", "title": "Hello SOLR", "content": "This is a test document"}
]
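The same request can also be built from Java. The following is a minimal sketch using only the JDK's java.net.http API (Java 11 or later). The class and method names (SolrUpdateRequests, jsonUpdate) are hypothetical helpers introduced for illustration, and the base URL and core name are whatever your own installation uses.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Sketch only: builds (but does not send) a JSON update request like the one shown above.
final class SolrUpdateRequests {
  private SolrUpdateRequests() {}

  // baseUrl and core are caller-supplied, e.g. "http://localhost:8983/solr" and "mycore".
  // Assumption: the core name needs no URL encoding.
  static HttpRequest jsonUpdate(String baseUrl, String core, String jsonDocs, boolean commit) {
    String url = baseUrl + "/" + core + "/update" + (commit ? "?commit=true" : "");
    return HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(jsonDocs, StandardCharsets.UTF_8))
        .build();
  }
}
```

To actually send it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running SOLR instance.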

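3.3.1 Request Entry Point

As described in Part 2, the request is first intercepted by SolrDispatchFilter, which creates an HttpSolrCall. Based on the path /update, the call is routed to the request handler registered for that path (UpdateRequestHandler), which parses the body and ultimately drives the core's UpdateHandler.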

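3.3.2 The Role of UpdateHandler

UpdateHandler, in the org.apache.solr.update package, is the coordinator of update operations. It is not itself a request handler, so it has no handleRequestBody method (that method belongs to the request handler mapped to /update). Its core methods operate on update commands:

// Simplified from org.apache.solr.update.UpdateHandler; implemented interfaces and most members omitted
public abstract class UpdateHandler {
  public abstract int addDoc(AddUpdateCommand cmd) throws IOException;
  public abstract void delete(DeleteUpdateCommand cmd) throws IOException;
  public abstract void deleteByQuery(DeleteUpdateCommand cmd) throws IOException;
  public abstract void commit(CommitUpdateCommand cmd) throws IOException;
}

The actual logic is implemented in DirectUpdateHandler2.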

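3.4 Document Processing: DirectUpdateHandler2

DirectUpdateHandler2 is SOLR's default UpdateHandler implementation. It receives already-parsed update commands and applies them to the underlying components.

3.4.1 Handling Requests

// Simplified sketch of org.apache.solr.update.DirectUpdateHandler2, not verbatim source
public class DirectUpdateHandler2 extends UpdateHandler
    implements SolrCoreState.IndexWriterCloser {
  protected final SolrCore core;
  protected final SolrCoreState coreState;  // owns the shared IndexWriter
  protected final UpdateLog updateLog;      // may be null when the update log is disabled

  @Override
  public int addDoc(AddUpdateCommand cmd) throws IOException {
    // 1. convert the SolrInputDocument into a Lucene document
    // 2. write it through the shared IndexWriter
    // 3. record the command in the update log
    return 1;   // body elided in this sketch
  }
}

core: the SolrCore currently being served.
updateLog: the transaction log, used for crash recovery.
UpdateRequestProcessor: the processor chain that runs before a command reaches this class and carries the per-command update logic.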

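3.4.2 UpdateRequestProcessor

UpdateRequestProcessor is a pipeline-style processor: instances are linked into a chain and invoked in order, much like a plugin mechanism. The pieces involved in a default add include:

LogUpdateProcessor: logs the update.
RunUpdateProcessor: the last link, which hands the command to the UpdateHandler.
AddUpdateCommand: the command object that carries the document to add or update; it travels along the chain and is not itself a processor.

Core methods (simplified from the abstract base class; each processor does its own work and then delegates to the next link). Which method is called is decided by the request body, for example a JSON array of documents becomes add commands, not by an action parameter:

// Simplified from org.apache.solr.update.processor.UpdateRequestProcessor
public abstract class UpdateRequestProcessor {
  protected final UpdateRequestProcessor next;

  public UpdateRequestProcessor(UpdateRequestProcessor next) {
    this.next = next;
  }

  public void processAdd(AddUpdateCommand cmd) throws IOException {
    if (next != null) next.processAdd(cmd);
  }

  public void processDelete(DeleteUpdateCommand cmd) throws IOException {
    if (next != null) next.processDelete(cmd);
  }
}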
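To make the chained style of processing concrete, here is a small self-contained model of a processor chain. It is a sketch under stated assumptions: the Toy* classes are hypothetical and only mimic the idea of each link doing its own work and then passing the command on; they are not SOLR classes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a processor chain; class names are illustrative, not Solr's.
abstract class ToyProcessor {
  protected final ToyProcessor next;

  ToyProcessor(ToyProcessor next) { this.next = next; }

  // Default behaviour: pass the document id to the next link, if there is one.
  void processAdd(String docId) {
    if (next != null) next.processAdd(docId);
  }
}

// Records that an add was seen, then delegates to the next link.
class ToyLogProcessor extends ToyProcessor {
  private final List<String> events;

  ToyLogProcessor(List<String> events, ToyProcessor next) {
    super(next);
    this.events = events;
  }

  @Override
  void processAdd(String docId) {
    events.add("log:" + docId);
    super.processAdd(docId);
  }
}

// Last link: performs the "real" work (here it only records that it ran).
class ToyRunProcessor extends ToyProcessor {
  private final List<String> events;

  ToyRunProcessor(List<String> events) {
    super(null);
    this.events = events;
  }

  @Override
  void processAdd(String docId) {
    events.add("index:" + docId);
    super.processAdd(docId);
  }
}

final class ToyChainDemo {
  // Builds a two-link chain and returns the events it produced, in order.
  static List<String> run(String docId) {
    List<String> events = new ArrayList<>();
    ToyProcessor chain = new ToyLogProcessor(events, new ToyRunProcessor(events));
    chain.processAdd(docId);
    return events;
  }
}
```

Calling ToyChainDemo.run("1") yields the logging event first and the indexing event second, which is the ordering a chain like this guarantees. New behaviour is added by inserting another link rather than by editing existing ones.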

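3.4.3 Adding Documents

Taking document addition as the example, processAdd is the key method. At the end of the chain the command reaches DirectUpdateHandler2, whose add path looks roughly like this (a simplified sketch; version checks are omitted and member names vary between Solr versions):

public int addDoc(AddUpdateCommand cmd) throws IOException {
  SolrInputDocument doc = cmd.solrDoc;   // in-memory document from the client
  Document luceneDoc = DocumentBuilder.toDocument(doc, core.getLatestSchema());
  RefCounted<IndexWriter> ref = coreState.getIndexWriter(core);
  try {   // assumes the uniqueKey field is "id"; an existing document with that id is replaced
    ref.get().updateDocument(new Term("id", cmd.getIndexedId()), luceneDoc);
  } finally {
    ref.decref();
  }
  if (updateLog != null) updateLog.add(cmd);   // logged after the index write succeeds
  return 1;
}

SolrInputDocument: the document submitted by the client (its in-memory representation).
SolrIndexWriter: the low-level index writing tool (the concrete IndexWriter handed out by SolrCoreState).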
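The "add or update" behaviour can be modelled without SOLR at all. The sketch below is a hypothetical ToyUpdateHandler that keeps the latest version of each document by id and appends one log entry per add. It illustrates the semantics only and makes no claim about how SOLR stores anything.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an update handler; not Solr code.
final class ToyUpdateHandler {
  private final Map<String, Map<String, String>> index = new LinkedHashMap<>(); // id -> latest version
  private final List<String> log = new ArrayList<>();                            // one entry per add

  // Adds the document, replacing any earlier document with the same "id".
  // Returns true if an existing document was overwritten.
  boolean addDoc(Map<String, String> doc) {
    String id = doc.get("id");
    if (id == null) throw new IllegalArgumentException("document has no id");
    boolean replaced = index.put(id, Map.copyOf(doc)) != null;
    log.add("add " + id);
    return replaced;
  }

  int size() { return index.size(); }

  Map<String, String> get(String id) { return index.get(id); }

  List<String> log() { return List.copyOf(log); }
}
```

Adding two documents with the same id leaves one document in the toy index but two entries in its log, which is why a log can be replayed to rebuild the final state.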

3.5 Index Writing: SolrIndexWriter and Lucene

SOLR's indexing ultimately relies on Lucene's IndexWriter.

3.5.1 SolrIndexWriter

SolrIndexWriter is a thin subclass of Lucene's IndexWriter, located in org.apache.solr.update:

// Simplified illustration; the real class is created through a static factory method
// and its constructor takes more arguments
public class SolrIndexWriter extends IndexWriter {
  private final String name;

  SolrIndexWriter(String name, Directory dir, IndexWriterConfig conf) throws IOException {
    super(dir, conf);   // indexing itself is inherited from Lucene's IndexWriter
    this.name = name;
  }
}

// The SolrInputDocument -> Lucene Document conversion lives in
// org.apache.solr.update.DocumentBuilder rather than in the writer
Document luceneDoc = DocumentBuilder.toDocument(doc, schema);
writer.addDocument(luceneDoc);

Directory: where the index is stored (usually the file system).
DocumentBuilder.toDocument: converts a SOLR document into Lucene's format according to the schema.
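The conversion step, in which the schema decides how each input field is treated, can be illustrated with a toy model. Everything here (ToySchema, ToyField, the indexed/stored flags as plain booleans) is a hypothetical simplification for illustration, not the SOLR or Lucene API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy schema-driven conversion; names are illustrative, not Solr or Lucene APIs.
final class ToySchema {
  // One converted field plus the flags the schema assigned to it.
  static final class ToyField {
    final String name;
    final String value;
    final boolean indexed;
    final boolean stored;

    ToyField(String name, String value, boolean indexed, boolean stored) {
      this.name = name;
      this.value = value;
      this.indexed = indexed;
      this.stored = stored;
    }
  }

  private final Map<String, boolean[]> fields = new LinkedHashMap<>(); // name -> {indexed, stored}

  // Declares a field; returns this so declarations can be chained.
  ToySchema field(String name, boolean indexed, boolean stored) {
    fields.put(name, new boolean[] {indexed, stored});
    return this;
  }

  // Converts an input document field by field, rejecting fields the schema does not declare.
  List<ToyField> toDocument(Map<String, String> input) {
    List<ToyField> out = new ArrayList<>();
    for (Map.Entry<String, String> e : input.entrySet()) {
      boolean[] flags = fields.get(e.getKey());
      if (flags == null) {
        throw new IllegalArgumentException("unknown field '" + e.getKey() + "'");
      }
      out.add(new ToyField(e.getKey(), e.getValue(), flags[0], flags[1]));
    }
    return out;
  }
}
```

The point of the design is that the input document carries only names and values; whether a value is indexed, stored, or both is decided entirely by the schema at conversion time.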

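3.5.2 Lucene's Indexing Process

Lucene's addDocument method performs the following steps:

Field analysis: text is tokenized by an Analyzer (for example, StandardAnalyzer).
Inverted index construction: a mapping from terms to documents is built.
Stored fields: the original content is saved, if the field is configured as stored.
Buffer management: data is first written to memory and flushed to disk once a threshold is reached.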
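The analysis, inverted index, and buffering steps can be sketched in a few dozen lines. This is a toy model, not Lucene: the analyzer is a simple lowercase-and-split, the "disk" is just a second in-memory map, and the flush threshold counts documents (maxBufferedDocs is a parameter name chosen for this sketch).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index with an in-memory buffer; not Lucene code.
final class ToyInvertedIndex {
  private final Map<String, TreeSet<Integer>> buffer = new HashMap<>();   // not yet flushed
  private final Map<String, TreeSet<Integer>> flushed = new HashMap<>();  // stands in for on-disk segments
  private final int maxBufferedDocs;
  private int bufferedDocs = 0;
  private int nextDocId = 0;
  private int flushCount = 0;

  ToyInvertedIndex(int maxBufferedDocs) { this.maxBufferedDocs = maxBufferedDocs; }

  // Stand-in for an Analyzer: lowercase, then split on anything that is not a letter or digit.
  static List<String> analyze(String text) {
    List<String> tokens = new ArrayList<>();
    for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (!t.isEmpty()) tokens.add(t);
    }
    return tokens;
  }

  // Buffers the document's postings and flushes once the threshold is reached.
  int addDocument(String text) {
    int docId = nextDocId++;
    for (String term : analyze(text)) {
      buffer.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }
    bufferedDocs++;
    if (bufferedDocs >= maxBufferedDocs) flush();
    return docId;
  }

  // Merges buffered postings into the flushed postings and empties the buffer.
  void flush() {
    if (bufferedDocs == 0) return;
    buffer.forEach((term, ids) -> flushed.computeIfAbsent(term, k -> new TreeSet<>()).addAll(ids));
    buffer.clear();
    bufferedDocs = 0;
    flushCount++;
  }

  // All document ids containing the term, whether still buffered or already flushed.
  List<Integer> postings(String term) {
    String key = term.toLowerCase(Locale.ROOT);
    TreeSet<Integer> ids = new TreeSet<>();
    if (flushed.containsKey(key)) ids.addAll(flushed.get(key));
    if (buffer.containsKey(key)) ids.addAll(buffer.get(key));
    return new ArrayList<>(ids);
  }

  int bufferedDocs() { return bufferedDocs; }

  int flushCount() { return flushCount; }
}
```

With a threshold of 2, the first addDocument call leaves one document in the buffer and the second triggers a flush. The postings for a term are the same before and after the flush; only where they live changes.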

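3.6 Transaction Log and Commit Strategy

SOLR uses a transaction log and a commit strategy to ensure data consistency and query visibility.

3.6.1 UpdateLog

UpdateLog records every update operation so that it can be replayed after a crash:

// Simplified sketch; the real method also tracks versions and log positions
public class UpdateLog {
  public void add(AddUpdateCommand cmd) throws IOException {
    TransactionLog tlog = getCurrentLog();
    tlog.write(cmd);
  }
}

TransactionLog: a file-based log that is written sequentially (append-only).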
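The idea of a sequential, file-based log that can be replayed is easy to try in isolation. The sketch below is a hypothetical ToyTransactionLog using only java.nio.file. The tab-separated line format is invented for this example and is unrelated to SOLR's actual log format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Toy append-only log; the record format here is invented for illustration.
final class ToyTransactionLog {
  private final Path file;

  ToyTransactionLog(Path file) { this.file = file; }

  // Convenience factory for experiments: logs to a fresh temporary file.
  static ToyTransactionLog inTempFile() {
    try {
      return new ToyTransactionLog(Files.createTempFile("toy-tlog", ".log"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Appends one "ADD<TAB>id<TAB>payload" line.
  // Assumption: id contains no tabs, and neither id nor payload contains newlines.
  void add(String id, String payload) {
    try {
      Files.writeString(file, "ADD\t" + id + "\t" + payload + "\n", StandardCharsets.UTF_8,
          StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Reads the log from the beginning, returning {operation, id, payload} triples in write order.
  List<String[]> replay() {
    try {
      List<String[]> records = new ArrayList<>();
      for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
        records.add(line.split("\t", 3));
      }
      return records;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because records are only ever appended, replaying the file from the start reproduces the operations in their original order, which is the property crash recovery depends on.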

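3.6.2 Commit Strategy

SOLR supports two kinds of commit:

softCommit: opens a new searcher over the most recent changes so that new documents become visible to queries, without forcing the index files to disk. On its own it does not make the data durable.
hardCommit: flushes the index to disk and syncs it, and starts a new transaction log. By default it also opens a new searcher.

// Simplified sketch of DirectUpdateHandler2.commit, not verbatim source;
// exact arguments vary between Solr versions
public void commit(CommitUpdateCommand cmd) throws IOException {
  if (cmd.softCommit) {
    // soft commit: reopen a searcher from the writer's current state; no fsync
    core.getSearcher(true, false, null, true);
  } else {
    updateLog.preCommit(cmd);          // begin the transaction log rollover
    RefCounted<IndexWriter> ref = coreState.getIndexWriter(core);
    try {
      ref.get().commit();              // hard commit: write segments and fsync
    } finally {
      ref.decref();
    }
    updateLog.postCommit(cmd);         // finish the log rollover
  }
}

commit=true: triggers a hardCommit.
softCommit=true: triggers a softCommit, which only refreshes what searches can see.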
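The difference between visibility and durability is easier to see in a model that tracks the two separately. The ToyCommitModel below is a hypothetical sketch: the openSearcher flag mirrors SOLR's option of the same name, while everything else is invented for illustration and ignores transaction log replay.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of commit semantics; it tracks what is visible and what is durable separately.
final class ToyCommitModel {
  private final List<String> docs = new ArrayList<>(); // every added document, in order
  private int visibleUpTo = 0;                          // searches see docs[0, visibleUpTo)
  private int durableUpTo = 0;                          // docs[0, durableUpTo) are on "disk"

  void add(String id) { docs.add(id); }

  // Soft commit: makes everything added so far visible, without making it durable.
  void softCommit() { visibleUpTo = docs.size(); }

  // Hard commit: makes everything durable; also refreshes visibility if openSearcher is true.
  void hardCommit(boolean openSearcher) {
    durableUpTo = docs.size();
    if (openSearcher) visibleUpTo = docs.size();
  }

  List<String> visible() { return List.copyOf(docs.subList(0, visibleUpTo)); }

  List<String> durable() { return List.copyOf(docs.subList(0, durableUpTo)); }
}
```

After add("a") and softCommit(), "a" is visible but not durable. After add("b") and hardCommit(false), both documents are durable but "b" is still invisible until a searcher is opened. Frequent soft commits with less frequent hard commits is a common way to trade these two costs against each other.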

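3.7 Practice: Tracing the Indexing Process

Let's debug the indexing flow with a concrete example.

Steps

Start SOLR: run it in debug mode (see Part 1).

Submit a document: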

curl -X POST -H "Content-Type: application/json" \
"http://localhost:8983/solr/mycore/update?commit=true" \
--data-binary '[{"id":"2", "title":"Test Doc"}]'

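Set breakpoints:
- DirectUpdateHandler2.addDoc.
- The IndexWriter call reached from it (addDocument, updateDocument, or updateDocuments, depending on the Solr version and on whether the document overwrites by id).

Observe the flow:
- The request reaches the UpdateHandler.
- The Lucene index is updated.
- The document is written to the transaction log.

Log output (illustrative; the exact lines depend on the Solr version and logging configuration)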

[update] INFO  o.a.s.u.DirectUpdateHandler2 - Adding document id=2
[update] INFO  o.a.s.c.SolrIndexWriter - Committed changes to index

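3.8 Source Code Analysis: Key Takeaways

UpdateHandler: coordinates update operations.
SolrIndexWriter: bridges SOLR and Lucene.
UpdateLog: safeguards data consistency.
Commit strategy: balances performance against reliability.

3.9 Summary and Preview

This article walked through SOLR's index building and update mechanism, covering the full path from the client request to the Lucene index. Through the source code we saw the central roles of DirectUpdateHandler2 and SolrIndexWriter, and how the transaction log and commit strategy are designed. The next article turns to query parsing and execution and explores how SOLR returns search results efficiently.

Exercises

Modify solrconfig.xml to adjust the softCommit interval and observe how performance changes.
Add logging to SolrIndexWriter to record the time taken by each document write.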
