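When a join involves a small table, typically under 100 MB, a mapjoin can be used to avoid the long tail caused by shuffling rows on the join key. Take the example above: suppose the item table has only a few tens of thousands of records (purely for illustration; in real business the item table is usually very large), while 80% of the item ids in the IPV log table are the invalid value 0 and the log table holds several billion records. Written as the SQL above, the join is obviously skewed, but mapjoin effectively solves the problem: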
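-- The small item table b is loaded into memory on the map side, so the log rows are not shuffled by item_id.
select /*+ MAPJOIN(b) */
       a.*
from ipv_log_table a
left outer join item_table b
on a.item_id = cast(b.item_id as string)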
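-- Join only the rows with a valid item_id; rows with item_id = 0 skip the join and are appended with union all.
select a.visitor_id
      ,b.seller_id
from (
    select visitor_id
          ,item_id
    from ipv_log_table
    where item_id > 0
) a
left outer join item_table b
on a.item_id = b.item_id
union all
select a.visitor_id
      ,cast(null as bigint) seller_id
from ipv_log_table a
where a.item_id = 0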
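-- Invalid item_ids are replaced with random keys, so those rows spread across reducers instead of piling onto one.
select a.visitor_id
      ,b.seller_id
from ipv_log_table a
left outer join item_table b
on if(a.item_id > 0, cast(a.item_id as string), concat('rand', cast(rand() as string))) = cast(b.item_id as string)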
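To see why the random key in the statement above removes the long tail, the following is a minimal Python simulation, not anything the SQL engine actually runs. It assumes a shuffle that assigns each row to a reducer by hashing its join key; the reducer count, the use of CRC32 as the hash, and the row counts are all hypothetical choices made for illustration.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 8  # hypothetical number of reduce tasks


def reducer_for(key: str) -> int:
    """Assign a join key to a reducer by hashing it (a stand-in for the shuffle)."""
    # CRC32 is deterministic across runs, unlike Python's built-in str hash.
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def join_key(item_id: int, rng: random.Random, salted: bool) -> str:
    """Mirror the if(...) expression: valid ids keep their value, invalid ones get a random key."""
    if item_id > 0 or not salted:
        return str(item_id)
    return "rand" + str(rng.random())


def max_share(item_ids, salted: bool, seed: int = 0) -> float:
    """Return the fraction of all rows that land on the busiest reducer."""
    rng = random.Random(seed)
    load = Counter(reducer_for(join_key(i, rng, salted)) for i in item_ids)
    return max(load.values()) / len(item_ids)


# 80% of the rows carry the invalid item_id 0, as in the example in the text.
item_ids = [0] * 8000 + list(range(1, 2001))

plain = max_share(item_ids, salted=False)
salted = max_share(item_ids, salted=True)
print(f"busiest reducer without random keys: {plain:.0%} of rows")
print(f"busiest reducer with random keys:    {salted:.0%} of rows")
```

Without the random key, every row with item_id 0 hashes to the same reducer, so that reducer receives at least 80% of the input. With the random key, those rows hash to different values and are spread over all reducers. Since a key starting with 'rand' never equals a numeric item id, these rows still match nothing in the left outer join and keep a null seller_id.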
-- The item table is first reduced to the item_ids that actually appear in the log; the reduced result b is then mapjoined.
select /*+ MAPJOIN(b) */
       a.visitor_id
      ,b.seller_id
from ipv_log_table a
left outer join (
    select /*+ MAPJOIN(log) */
           itm.seller_id
          ,itm.item_id
    from (
        select item_id
        from ipv_log_table
        where item_id > 0
        group by item_id
    ) log
    join item_table itm
    on log.item_id = itm.item_id
) b
on a.item_id = b.item_id
-- Both id columns of auctions are exposed as a single auction_id column before the join.
select ...
from ipv_log_table a
join (
    select auction_id as auction_id
    from auctions
    union all
    select auction_string_id as auction_id
    from auctions
    where auction_string_id is not null
) b
on a.auction_id = b.auction_id