具体的报错信息是这样的:

表结构如下:

  1. `id` bigint(20) NOT NULL,
  2. `uid` bigint(20) DEFAULT NULL,
  3. `post` text,
  4. `note` text,
  5. PRIMARY KEY (`id`),
  6. KEY `idx_uid` (`uid`)
  7. ) ENGINE=TokuDB DEFAULT CHARSET=utf8

从报错信息来看,是引擎层返回错误的,难道是 TokuDB 数据出问题了么,我们首先要确认的是用户数据是否还能访问。

从表结构来看,出错的语句应该走了二级索引,那么我们强制走 PK 是否能访问数据呢。

  1. select * from t2 force index(primary) where uid > 1 limit 3;
  2. xxx
  3. xxx
  4. xxx
  5. 3 rows in set (0.00 sec)

上面的测试可以说明走 PK 是没问题呢,那么问题可能在二级索引。

同时我们在观察用户的其它 SQL 时发现,二级索引也是可以访问数据的。

比如下面这种:

  1. select * from t2 where uid > 1 order by uid limit 3;
  2. xxx
  3. xxx
  4. xxx
  5. 3 rows in set (0.00 sec)

都是走二级索引,为什么有的会报错呢,这 2 条语句有啥区别呢,explain 看下:

把这个优化关掉呢?

  1. set optimizer_switch='mrr=off';
  2. mysql> explain select id from t2 where uid > 1 limit 3;
  3. +----+-------------+-------+-------+---------------+---------+---------+------+--------+--------------------------+
  4. | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
  5. +----+-------------+-------+-------+---------------+---------+---------+------+--------+--------------------------+
  6. | 1 | SIMPLE | t2 | range | idx_uid | idx_uid | 9 | NULL | 523677 | Using where; Using index |
  7. +----+-------------+-------+-------+---------------+---------+---------+------+--------+--------------------------+
  8. 1 row in set (0.00 sec)
  9. select * from t2 where uid > 1 limit 3;
  10. xxx
  11. xxx
  12. xxx
  13. 3 rows in set (0.00 sec)

可以看到,关掉优化器的 MRR 后,语句就返回正常了。因此基本可以判断是 MRR 导致的。

下面我们从源码层面分析下看,到底是怎么回事。

根据报错信息,来 gdb 跟踪,发现导致报错的栈是这样的,可以看到是在 mrr 执行初始化阶段:

  1. #0 DsMrr_impl::dsmrr_init()
  2. #1 ha_tokudb::multi_range_read_init()
  3. #2 QUICK_RANGE_SELECT::reset()
  4. #3 join_init_read_record()
  5. #4 sub_select()
  6. #5 do_select()
  7. #7 mysql_execute_select()
  8. #8 mysql_select()
  9. #9 handle_select()
  10. #11 mysql_execute_command()
  11. ...

具体在 DsMrr_impl::dsmrr_init 中的逻辑是这样的:

  1. // Transfer ICP from h to h2
  2. if (mrr_keyno == h->pushed_idx_cond_keyno)
  3. {
  4. if (h2->idx_cond_push(mrr_keyno, h->pushed_idx_cond))
  5. {
  6. retval= 1;
  7. goto error;
  8. }
  9. }

我们对应看下 TokuDB 里条件下推接口实现:

可以看到 ha_tokudb::idx_cond_push 是会将原条件在返回给 server 的。因此就导致了 DsMrr_impl::dsmrr_init 返回错误码 1 (Got error 1 from storage engine)。

handler:idx_cond_push() 接口是允许引擎层返回非 NULL 值的,引擎层认为自己没有完全过滤结果集,那么是可以返回条件给 server 层,让 server 层再做一次过滤的:

  1. /**
  2. Push down an index condition to the handler.
  3. The server will use this method to push down a condition it wants
  4. the handler to evaluate when retrieving records using a specified
  5. index. The pushed index condition will only refer to fields from
  6. this handler that is contained in the index (but it may also refer
  7. to fields in other handlers). Before the handler evaluates the
  8. condition it must read the content of the index entry into the
  9. record buffer.
  10. The handler is free to decide if and how much of the condition it
  11. will take responsibility for evaluating. Based on this evaluation
  12. it should return the part of the condition it will not evaluate.
  13. If it decides to evaluate the entire condition it should return
  14. NULL. If it decides not to evaluate any part of the condition it
  15. should return a pointer to the same condition as given as argument.
  16. @param keyno the index number to evaluate the condition on
  17. @param idx_cond the condition to be evaluated by the handler
  18. @return The part of the pushed condition that the handler decides
  19. not to evaluate
  20. */

另外我们在查问题时发现,如果 mysqld 重启或者通过 flush table 关闭表的话,查询是不会出错的:

  1. mysql> flush table t2;
  2. +----+-------------+-------+-------+---------------+---------+---------+------+--------+------------------------------------+
  3. | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
  4. +----+-------------+-------+-------+---------------+---------+---------+------+--------+------------------------------------+
  5. | 1 | SIMPLE | t2 | range | idx_uid | idx_uid | 9 | NULL | 523677 | Using index condition; Using where |
  6. +----+-------------+-------+-------+---------------+---------+---------+------+--------+------------------------------------+

从 explain 结果看,是因为没有用到 MRR,这又是为什么呢?

我们看下优化器是如何选择是否用MRR优化的,在 DsMrr_impl::choose_mrr_impl() 这个函数里是有这样的逻辑的:

  1. /*
  2. If @@optimizer_switch has "mrr_cost_based" on, we should avoid
  3. using DS-MRR for queries where it is likely that the records are
  4. stored in memory. Since there is currently no way to determine
  5. this, we use a heuristic:
  6. a) if the storage engine has a memory buffer, DS-MRR is only
  7. considered if the table size is bigger than the buffer.
  8. b) if the storage engine does not have a memory buffer, DS-MRR is
  9. only considered if the table size is bigger than 100MB.
  10. c) Since there is an initial setup cost of DS-MRR, so it is only
  11. considered if at least 50 records will be read.
  12. */
  13. if (thd->optimizer_switch_flag(OPTIMIZER_SWITCH_MRR_COST_BASED))
  14. {
  15. /*
  16. If the storage engine has a database buffer we use this as the
  17. minimum size the table should have before considering DS-MRR.
  18. */
  19. longlong min_file_size= table->file->get_memory_buffer_size();
  20. if (min_file_size == -1)
  21. {
  22. // No estimate for database buffer
  23. min_file_size= 100 * 1024 * 1024; // 100 MB
  24. }
  25. if (table->file->stats.data_file_length <
  26. static_cast<ulonglong>(min_file_size) ||
  27. rows <= 50)
  28. return true; // Use the default implementation
  29. }

可以看到,MRR 选择条件是这样的:

  1. 如果引擎的 cache 比表大的话,是不会用 MRR 优化的;
  2. 如果引擎没有 cache,默认用 100M,用于自己不管理 cache 引擎,如 MyISAM;
  3. 如果要查询的行数不超过50的话,也是不会用 MRR 优化的;

这个 cache 对 InnoDB 来说,就是 innodb_buffer_pool_size;对 TokuDB 来说,就是 tokudb_cache_size。但是 TokuDB handler 层没有实现 get_memory_buffer_size() 这个接口,导致一直用 100M 做为 cache 来判断,这个是 TokuDB handler 实现的上的一个bug。

data_file_length 这个是值是内存信息,在表刚关闭重新打开的时候,是0,所以不会用MRR优化。

另外还有一个判断条件时,如果要求排序的话,也是不会用 MRR 优化的,这也就是为什么我们刚开始发现的,语句中用了 order by 后,explain 结果中就没有 MRR了。

从上面的分析来看,满足下面条件语句会被影响:

  1. 表大小超过 100M;

临时的解决方法是关闭优化器的 MRR 或者 ICP: