hive实现随机抽样（附详解）-ag真人游戏

select  
    *  
from 
    tab 
order by rand()
limit 1000

select 
    *
from 
(  
    select 
        e.*, cast(rand() * 100000 as int) as idx 
    from  e        
) t 
order by t.idx 
limit 1000

表e为一个存有数据普通表，我们要从表e中随机抽出1000条数据作为数据样本。

rand() 函数产生一个0到1的随机数字，cast(rand() * 100000 as int) as idx为一个0到100000之间的一个随机整数。

数据块取样（block sampling）

block_sample: tablesample (n percent)

根据输入的inputsize，取样n%

比如：输入大小为1g，tablesample (50 percent)将会取样约512m的数据；

使用下面的sql，从表table1中取样50%的数据，创建一个table_new新表：

create table table_new as
select * from table1 tablesample (50 percent);

block_sample: tablesample (nm)

指定取样数据的大小，单位为m

使用下面的sql,将会从表table1中取样30m的数据：

create table table_new as
select * from table1 tablesample (30m);

block_sample: tablesample (n rows)

可以根据行数来取样，注意：这里指定的行数，是在每个inputsplit中取样的行数，也就是每个map中都取样n rows

select count(1) from (select * from table1 tablesample (200 rows)) t;

若有5个maptask(inputsplit),每个map取样200行，一共取样1000行

分桶表取样（sampling bucketized table）

hive中的分桶表（bucket table），根据某一个字段hash取模，放入指定数据的桶中，比如将表table1按照id分成100个桶，其算法是hash(id) % 100，这样，hash(id) % 100 = 0的数据被放到第一个桶中，hash(id) % 100 = 1的记录被放到第二个桶中。

分桶表取样的语法:

table_sample: tablesample (bucket x out of y [on colname])

其中x是要抽样的桶编号，桶编号从1开始，colname表示抽样的列，y表示桶的数量。

select count(1)
from table1 tablesample (bucket 1 out of 10 on rand());

该sql语句表示将表table1随机分成10个桶，抽样第一个桶的数据，出来的结果基本上是原表的十分之一，

注意：这个结果每次运行是不一样的，是按照随机数进行分桶取样的

如果基于一个已经分桶表进行取样，将会更有效率。

执行下面的sql语句，创建一个分桶表（分桶表在创建时候使用cluster by语句创建），并插入数据：

create table table_bucketed (id string)
clustered by(id) into 10 buckets;
 
insert overwrite table table_bucketed
select id from table1;

表table_bucketed按照id字段分成10个桶，下面的语句表示从10个桶中抽样第一个桶的数据:

select count(1) from table_bucketed tablesample(bucket 1 out of 10 on id);

结果差不多是源表记录的1/10.

如果从源表中直接分桶抽样，也能达到一样的效果，比如：

select count(1) from table1 tablesample(bucket 1 out of 10 on id);

区别在于基于已经分桶的表抽样，查询只会扫描相应桶中的数据，而未分桶表的抽样，查询时候需要扫描整表数据，先分桶，再抽样。

hive实现随机抽样（附详解）-ag真人游戏

数据块取样（block sampling）

分桶表取样（sampling bucketized table）

相关推荐

菜鸟笔记

热门教程