Hive的模式设计

    按天划分表就是一种模式,其通常会在表中加入一个时间戳,例如表名为upply_2011_01_01等等.这种每天一张表的方式在数据库领域是反模式的一种方式,但是因为实际情况下数据集增长得很快,这种方式应用还是比较广泛的.

    对于Hive,这种情况下应该使用分区表.Hive通过Where子句中的表达式来选择查询所需要的指定分区,这样的查询执行效率高,而且看起来清晰明了:

    1. . . . . . . . . . . . . . . . . . .> partitioned by (day int);
    2. 0: jdbc:hive2://hadoopmaster:10000/> alter table supplybypartition add partition(day=20110102);
    3. OK
    4. No rows affected (0.088 seconds)
    5. 0: jdbc:hive2://hadoopmaster:10000/> alter table supplybypartition add partition(day=20110103);
    6. OK
    7. No rows affected (0.067 seconds)
    8. 0: jdbc:hive2://hadoopmaster:10000/> alter table supplybypartition add partition(day=20110104);
    9. OK
    10. No rows affected (0.083 seconds)
    11. 0: jdbc:hive2://hadoopmaster:10000/> select * from supplybypartition
    12. . . . . . . . . . . . . . . . . . .> where day>=20110102 and day<20110103 and quantity<4;
    13. OK
    14. | supplybypartition.id | supplybypartition.part | supplybypartition.quantity | supplybypartition.day |
    15. +-----------------------+-------------------------+-----------------------------+------------------------+--+
    16. +-----------------------+-------------------------+-----------------------------+------------------------+--+
    17. No rows selected (0.162 seconds)
    18. 0: jdbc:hive2://hadoopmaster:10000/>

    Hive中分区的功能是非常有用的,这是因为Hive通常要对输入进行全盘扫描,来满足查询条件,通过创建很多的分区确定可以优化一些查询,但是同时可能会对其他一些重要的查询不利:

    MapReduce会将一个任务(job)转换成多个任务(task).默认情况下,每个task都是一个新的JVM实例,都需要开启和销毁的开销.对于小文件,每个文件都会对应一个task.在一些情况下,JVM开启和销毁的时间中销毁可能会比实际处理数据的时间消耗要长

    因此,一个理想的分区方案不应该导致产生太多的分区和文件夹目录,并且每个目录下的文件应该足够大,应该是文件系统中块大小的若干倍.

    接时间范围进行分区的一个好的策略就是按照不同的时间颗粒度来确定合适大小的数据积累量,而且安装这个时间颗粒.随着时间的推移,分区数量的增长是均匀的,而且每个分区下包含的文件大小至少是文件系统中块或块大小的数倍.

    1. hive> delete from test_insert_test where id=1;
    2. FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.

    修改配置文件hive-site.xml

    查看分桶

    1. hadoop@hadoopmaster:/usr/local/hive/conf$ hdfs dfs -ls /user/hive/warehouse/test_insert_test
    2. Found 3 items
    3. -rwxrwxr-x 2 hadoop supergroup 6 2016-08-10 10:39 /user/hive/warehouse/test_insert_test/000000_0
    4. -rwxrwxr-x 2 hadoop supergroup 5 2016-08-10 10:40 /user/hive/warehouse/test_insert_test/000000_0_copy_1
    5. -rwxrwxr-x 2 hadoop supergroup 5 2016-08-10 10:40 /user/hive/warehouse/test_insert_test/000000_0_copy_2
    6. hive> hadoop@hadoopmaster:/usr/local/hive/conf$ hdfs dfs -ls /user/hive/warehouse/test_trancaction
    7. Found 3 items
    8. drwxr-xr-x - hadoop supergroup 0 2016-08-10 10:45 /user/hive/warehouse/test_trancaction/delta_0000001_0000001_0000
    9. drwxr-xr-x - hadoop supergroup 0 2016-08-10 10:46 /user/hive/warehouse/test_trancaction/delta_0000002_0000002_0000
    10. drwxr-xr-x - hadoop supergroup 0 2016-08-10 10:46 /user/hive/warehouse/test_trancaction/delta_0000003_0000003_0000
    11. hive> delete from test_trancaction where user_id=1;
    12. WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    13. Total jobs = 1
    14. Launching Job 1 out of 1
    15. In order to change the average load for a reducer (in bytes):
    16. set hive.exec.reducers.bytes.per.reducer=<number>
    17. In order to limit the maximum number of reducers:
    18. set hive.exec.reducers.max=<number>
    19. In order to set a constant number of reducers:
    20. set mapreduce.job.reduces=<number>
    21. Starting Job = job_1470228460967_0010, Tracking URL = http://hadoopmaster:8088/proxy/application_1470228460967_0010/
    22. Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1470228460967_0010
    23. Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
    24. 2016-08-10 10:48:36,463 Stage-1 map = 0%, reduce = 0%
    25. 2016-08-10 10:48:41,784 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 0.97 sec
    26. 2016-08-10 10:48:46,913 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 2.0 sec
    27. 2016-08-10 10:48:48,970 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.0 sec
    28. 2016-08-10 10:48:50,020 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 4.1 sec
    29. 2016-08-10 10:48:54,117 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.76 sec
    30. MapReduce Total cumulative CPU time: 5 seconds 760 msec
    31. Ended Job = job_1470228460967_0010
    32. Loading data to table default.test_trancaction
    33. MapReduce Jobs Launched:
    34. Stage-Stage-1: Map: 3 Reduce: 3 Cumulative CPU: 5.76 sec HDFS Read: 32745 HDFS Write: 701 SUCCESS
    35. Total MapReduce CPU Time Spent: 5 seconds 760 msec
    36. Time taken: 26.074 seconds

    最后总结一下,做Hive的Transaction其实不合适,资源耗用量大,意义不大,本身hive做离线查询还是可以的.ACID支持你饶了我吧….二点 需要分桶表 需要修改hive-site.xml文件,剩余的还是很简单的.