11.4.4 统计视频观看数Top50所关联视频的所属类别Rank

目标:

  • 找到 top50 视频的关联视频 id, 然后对这些关联视频的类别分组统计视频的个数, 然后排序 Rank

分析:

  • 找到 top50 视频: 只相关视频就可以了. 记为t1
  • 炸裂相关视频id. 记为表t2
  • t2与原表 join 查找到每个相关视频的类别. 记为表t3
  • 炸裂类别, 按照类别分类,统计每个类别出现的次数
  • 添加rank
- 找到 top50 视频: 只view和相关视频就可以了.  // t1
    select 
        views,
        relatedId
    from gulivideo_orc
    order by views desc
    limit 50;   // t1
- 炸裂相关视频id. 记为表t1
    select 
        distinct(videoId)
    from t1
    lateral view explode(relatedId) temp as videoId;     // t2
- t2与原表 join 查找到每个相关视频的类别. 记为表t3
    select 
        t2.videoId,
        gulivideo_orc.category
    from t2 join gulivideo_orc
    on t2.videoId = gulivideo_orc.videoId;   // t3
- 炸裂类别, 统计每个类别出现的次数
    select
        category_name,
        count(*) num
    from t3
    lateral view explode(category) tmp as category_name
    group by category_name; // t4

- 添加 rank 排名
    select
        category_name,
        num,
        rank() over(order by num)
    from t4;

合并成一个语句:

select
    category_name,
    num,
    rank() over(order by num desc)
from(
    select
        category_name,
        count(*) num
    from(
        select 
            t2.videoId,
            gulivideo_orc.category
        from (
            select 
                distinct(videoId)
            from (
                select 
                    views,
                    relatedId
                from gulivideo_orc
                order by views desc
                limit 50
            ) t1
            lateral view explode(relatedId) temp as videoId
        )t2 join gulivideo_orc 
        on t2.videoId = gulivideo_orc.videoId    
    )t3
    lateral view explode(category) tmp as category_name
    group by category_name
)t4;
+----------------+------+----------------+
| category_name  | num  | rank_window_0  |
+----------------+------+----------------+
| Comedy         | 232  | 1              |
| Entertainment  | 216  | 2              |
| Music          | 195  | 3              |
| Blogs          | 51   | 4              |
| People         | 51   | 4              |
| Film           | 47   | 6              |
| Animation      | 47   | 6              |
| News           | 22   | 8              |
| Politics       | 22   | 8              |
| Games          | 20   | 10             |
| Gadgets        | 20   | 10             |
| Sports         | 19   | 12             |
| Howto          | 14   | 13             |
| DIY            | 14   | 13             |
| UNA            | 13   | 15             |
| Places         | 12   | 16             |
| Travel         | 12   | 16             |
| Animals        | 11   | 18             |
| Pets           | 11   | 18             |
| Autos          | 4    | 20             |
| Vehicles       | 4    | 20             |
+----------------+------+----------------+
21 rows selected (100.953 seconds)

注意:

如果抛异常: Java heap spaceyarn-site.xml 添加如下配置

<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>2048</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
</property>
Copyright © 尚硅谷大数据 2019 all right reserved,powered by Gitbook
该文件最后修订时间: 2019-03-16 07:15:56

results matching ""

    No results matching ""