Hive的模式设计

按天划分表就是一种模式,其通常会在表中加入一个时间戳,例如表名为upply_2011_01_01等等.这种每天一张表的方式在数据库领域是反模式的一种方式,但是因为实际情况下数据集增长得很快,这种方式应用还是比较广泛的.

对于Hive,这种情况下应该使用分区表.Hive通过Where子句中的表达式来选择查询所需要的指定分区,这样的查询执行效率高,而且看起来清晰明了:


. . . . . . . . . . . . . . . . . .> partitioned by (day int);
0: jdbc:hive2://hadoopmaster:10000/> alter table supplybypartition add partition(day=20110102);
OK
No rows affected (0.088 seconds)
0: jdbc:hive2://hadoopmaster:10000/> alter table supplybypartition add partition(day=20110103);
OK
No rows affected (0.067 seconds)
0: jdbc:hive2://hadoopmaster:10000/> alter table supplybypartition add partition(day=20110104);
OK
No rows affected (0.083 seconds)
0: jdbc:hive2://hadoopmaster:10000/> select * from supplybypartition
. . . . . . . . . . . . . . . . . .> where day>=20110102 and day<20110103 and quantity<4;
OK
| supplybypartition.id  | supplybypartition.part  | supplybypartition.quantity  | supplybypartition.day  |
+-----------------------+-------------------------+-----------------------------+------------------------+--+
+-----------------------+-------------------------+-----------------------------+------------------------+--+
No rows selected (0.162 seconds)
0: jdbc:hive2://hadoopmaster:10000/>

Hive中分区的功能是非常有用的,这是因为Hive通常要对输入进行全盘扫描,来满足查询条件,通过创建很多的分区确定可以优化一些查询,但是同时可能会对其他一些重要的查询不利:

MapReduce会将一个任务(job)转换成多个任务(task).默认情况下,每个task都是一个新的JVM实例,都需要开启和销毁的开销.对于小文件,每个文件都会对应一个task.在一些情况下,JVM开启和销毁的时间中销毁可能会比实际处理数据的时间消耗要长

因此,一个理想的分区方案不应该导致产生太多的分区和文件夹目录,并且每个目录下的文件应该足够大,应该是文件系统中块大小的若干倍.

接时间范围进行分区的一个好的策略就是按照不同的时间颗粒度来确定合适大小的数据积累量,而且安装这个时间颗粒.随着时间的推移,分区数量的增长是均匀的,而且每个分区下包含的文件大小至少是文件系统中块或块大小的数倍.


hive> delete from test_insert_test where id=1;
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.

修改配置文件hive-site.xml

查看分桶

hadoop@hadoopmaster:/usr/local/hive/conf$ hdfs dfs -ls /user/hive/warehouse/test_insert_test
Found 3 items
-rwxrwxr-x   2 hadoop supergroup          6 2016-08-10 10:39 /user/hive/warehouse/test_insert_test/000000_0
-rwxrwxr-x   2 hadoop supergroup          5 2016-08-10 10:40 /user/hive/warehouse/test_insert_test/000000_0_copy_1
-rwxrwxr-x   2 hadoop supergroup          5 2016-08-10 10:40 /user/hive/warehouse/test_insert_test/000000_0_copy_2
hive> hadoop@hadoopmaster:/usr/local/hive/conf$ hdfs dfs -ls /user/hive/warehouse/test_trancaction
Found 3 items
drwxr-xr-x   - hadoop supergroup          0 2016-08-10 10:45 /user/hive/warehouse/test_trancaction/delta_0000001_0000001_0000
drwxr-xr-x   - hadoop supergroup          0 2016-08-10 10:46 /user/hive/warehouse/test_trancaction/delta_0000002_0000002_0000
drwxr-xr-x   - hadoop supergroup          0 2016-08-10 10:46 /user/hive/warehouse/test_trancaction/delta_0000003_0000003_0000
hive> delete from test_trancaction where user_id=1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1470228460967_0010, Tracking URL = http://hadoopmaster:8088/proxy/application_1470228460967_0010/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1470228460967_0010
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2016-08-10 10:48:36,463 Stage-1 map = 0%,  reduce = 0%
2016-08-10 10:48:41,784 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 0.97 sec
2016-08-10 10:48:46,913 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 2.0 sec
2016-08-10 10:48:48,970 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.0 sec
2016-08-10 10:48:50,020 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 4.1 sec
2016-08-10 10:48:54,117 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.76 sec
MapReduce Total cumulative CPU time: 5 seconds 760 msec
Ended Job = job_1470228460967_0010
Loading data to table default.test_trancaction
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 5.76 sec   HDFS Read: 32745 HDFS Write: 701 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 760 msec
Time taken: 26.074 seconds

最后总结一下,做Hive的Transaction其实不合适,资源耗用量大,意义不大,本身hive做离线查询还是可以的.ACID支持你饶了我吧….二点需要分桶表需要修改hive-site.xml文件,剩余的还是很简单的.

6 Hive的模式设计与事务性

Hive的模式设计