数据仓库DWS与DWT


1. DWS-每日设备行为

dws_uv_detail_daycount

1.1 每日设备行为表存什么?

dws_uv_detail_daycount

存储 : 主要存储以设备ID为分组,统计每个设备的 基本信息 , 每日的登录次数页面访问的统计

来源 : 主要来源于dwd_start_log

1.1.1 dwd_start_log存什么?

data_warehouse_detail = 粒度最小

dwd_start_log是启动日志表 以每次启动信息为粒度

存储 : 启动日志表中每行数据对应一个启动记录,一个启动记录包含日志中的 公共信息(common) 和 启动信息。

来源 : 先将所有包含start字段的日志过滤出来,然后使用get_json_object函数从**ods_log(源日志表)**里解析每个字段(line,’$start’)。

1.2 建表

create external table dws_uv_detail_daycount
(
    `mid_id`      string COMMENT '设备id',--基本信息
    `brand`       string COMMENT '手机品牌',--基本信息
    `model`       string COMMENT '手机型号',--基本信息
    `login_count` bigint COMMENT '活跃次数',--每日的登录次数
    `page_stats`  array<struct<page_id:string,page_count:bigint>> COMMENT '页面访问统计'
) COMMENT '每日设备行为表'
partitioned by(dt string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_daycount'
tblproperties ("parquet.compression"="lzo");

1.3 字段分析

现在分析一下这个表格

字段 设备id Type
mid_id 设备id string
brand 手机品牌 string
model 手机型号 string
login_count 活跃次数 bigint
page_stats 页面访问统计 “array”
dt

活跃次数应该为启动次数统计 应该使用dwd_start_log进行统计

字段分析

页面统计应该为页面的信息,应该使用dwd_page_log进行统计

1605259370236

1.4 拉取sql书写

首先统计登录启动次数(dwd_start_log) login_count

select
       mid_id,
       brand,
       model,
       count(*) login_count--启动信息为粒度,启动一次就活跃一次,count(*)就可以
from dwd_start_log
group by mid_id, brand, model--按理说,设备ID就可以直接区别
--因为数据问题,所以要这样分组

分组后统计page_status字段 一个包含页面与次数 K-V 的数组

--page_stats 统计page_id:info  counts:info
select
       tmp.mid_id,
       tmp.brand,
       tmp.model,
       collect_set(named_struct('page_id',tmp.page_id,'page_count',tmp.page_count)) page_stats--dwd_page_log中的page_id为浏览的一次记录一次,分组统计
from (select 
            mid_id,
            brand,
            model,
            page_id,
            count(*) page_count--获取每个页面的访问次数
        from dwd_page_log
        group by mid_id,brand,model,page_id--按照page_id进行分组统计
    ) tmp
group by tmp.mid_id,tmp.brand,tmp.model;

with
tmp_start as
(
    select  
        mid_id,
        brand,
        model,
        count(*) login_count--登录次数
    from dwd_start_log
    where dt='2020-06-14'
    group by mid_id,brand,model
),
tmp_page as
(
    select
        mid_id,
        brand,
        model,        collect_set(named_struct('page_id',page_id,'page_count',page_count)) page_stats--页面访问次数统计
    from
    (
        select
            mid_id,
            brand,
            model,
            page_id,
            count(*) page_count
        from dwd_page_log
        where dt='2020-06-14'
        group by mid_id,brand,model,page_id
    )tmp
    group by mid_id,brand,model
)
insert overwrite table dws_uv_detail_daycount partition(dt='2020-06-14')
select
    nvl(tmp_start.mid_id,tmp_page.mid_id),
    nvl(tmp_start.brand,tmp_page.brand),
    nvl(tmp_start.model,tmp_page.model),
    tmp_start.login_count,
    tmp_page.page_stats
from tmp_start 
full outer join tmp_page
on tmp_start.mid_id=tmp_page.mid_id
and tmp_start.brand=tmp_page.brand
and tmp_start.model=tmp_page.model;

2. DWT-设备主题宽表

dwt_uv_topic

2.1 设备主题宽表存什么

dwt_uv_topic

主要围绕设备的每日行为进行解析拉取出一个跨时间维度的表

包括 设备的 首末时间, 累计活跃,周活,月活等

粒度是一个设备的整个生命周期,不需要分区

来源 : dws_uv_detail_daycount

2.2 建表

create external table dwt_uv_topic
(
    `mid_id` string comment '设备id',
    `brand` string comment '手机品牌',
    `model` string comment '手机型号',
    `login_date_first` string  comment '首次活跃时间',
    `login_date_last` string  comment '末次活跃时间',
    `login_day_count` bigint comment '当日活跃次数',
    `login_count` bigint comment '累积活跃天数'
) COMMENT '设备主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_uv_topic'
tblproperties ("parquet.compression"="lzo");

根据建表语句稍微分析一下字段

1605260488291

2.3 拉取sql书写

指定来源表:dws_uv_detail_daycount

废话不多说直接开整

select
       nvl(old.mid_id,new.mid_id),
       nvl(old.brand,new.brand),
       nvl(old.model,new.model),
       `if`(old.login_date_first is null,'2020-06-14',old.login_date_first) login_date_first,--首次活跃时间
       `if`(new.login_count>0,'2020-06-14',old.login_date_last) login_date_last,
       `if`(new.login_count>0,new.login_count,0) login_day_count,
       `if`(old.login_count is not null,old.login_count,0) + `if`(new.login_count > 0,1,0) login_count
from (
         select mid_id,
                brand,
                model,
                login_date_first,
                login_date_last,
                login_day_count,
                login_count
         from dwt_uv_topic--老表拉取
     ) old
full outer join
     (
         select mid_id,
                brand,
                model,
                login_count,
                page_stats
         from dws_uv_detail_daycount
         where dt = '2020-06-14'--新表直接拿过来
     ) new
on old.brand = new.brand
and old.mid_id = new.mid_id
and old.model = new.model;

3. DWS-每日会员行为

dws_user_action_daycount

3.1 每日会员行为存什么?

dws_user_action_daycount

主要是以每一个会员user_id为单位存储会员的登录 加购 下单 支付等动作的综合

来源 : 先建表,在分析来源把

3.2 建表

create external table dws_user_action_daycount
(   
    user_id string comment '用户 id',
    login_count bigint comment '登录次数',
    cart_count bigint comment '加入购物车次数',
    order_count bigint comment '下单次数',
    order_amount    decimal(16,2)  comment '下单金额',
    payment_count   bigint      comment '支付次数',
    payment_amount  decimal(16,2) comment '支付金额',
    order_detail_stats array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>> comment '下单明细统计'
) COMMENT '每日会员行为'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_action_daycount/'
tblproperties ("parquet.compression"="lzo");

3.3 字段分析

说实话这里面的字段挺蛋疼的

加购次数 : dwd_action_log 动作日志表

登录次数 : dwd_start_log 启动日志表

下单次数与下单金额 : dwd_fact_order_info 订单事实表(累计快照事实表)

支付次数与支付金额 : dwd_fact_payment_info 支付事实表(事务型事实表)

四个表确定下来,现在连连看,有没有需要特殊处理的

1605264501894

连起来,大体有个了解,下单明细统计还不知道是个啥

这是下单明细需要的数组

array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>>

显然这个信息来源于订单明细表(事务性事实表)

dwd_fact_order_detail

1605264746465

显然我们将dwd_fact_order_detail表中按照user_id进行分组,统计

3.4 拉取sql书写

书写之前先思考一个问题

每日会员行为是分区的,不需要新旧对比,直接拉取出来即可

拉取的表

  1. dwd_start_log
  2. dwd_action_log
  3. dwd_fact_order_info
  4. dwd_fact_payment_info
  5. dwd_fact_order_detail

架构采用

with

table_name as ()

开整

login_count登录次数

dwd_start_log

--dwd_start_log  启动 
--获取login_count bigint comment '登录次数'
select
	user_id,
	count(*) login_count--启动活跃次数
from dwd_start_log

cart_count 加购次数

dwd_action_log

--dwd_action_log 
select
       user_id,
       count(action_id) cart_count--简单
from dwd_action_log
where dt = '2020-06-14'
  and user_id is not null
  and action_id = 'cart_add'

dwd_fact_order_info 下单次数与下单金额

select
       user_id,
       count(*) order_count,
       sum(final_total_amount) order_amount--简单
from dwd_fact_order_info
where dt = '2020-06-14'
group by user_id

dwd_fact_payment_info 支付次数与支付金额

select
       user_id,
       count(*) payment_count,--支付次数
       sum(payment_amount) payment_amount
from
dwd_fact_payment_info
where dt = '2020-06-14'
group by user_id

dwd_fact_order_detail 下单明细统计

这是下单明细需要的数组,数组里面的还有字段,所以需要分两步

第一步先将数组中的字段(sku_id粒度)统计出来 group_by user_id,sku_id

第二步进行造结构(user_id) group_by user_id

array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>>

按照user_id分组,统计

 -----------------------------------------1-----------------------
 --先将需要的字段拿出来
 select --每个会员的购买的每个商品的
       user_id,
       sku_id,--商品名称
       sum(sku_num) sku_num,--此类商品购买的商品总数量
       count(*) order_count,--包含这个商品的订单总数
       cast(sum(final_amount_d) as decimal(20,2)) order_amount--商品的总价格
from
dwd_fact_order_detail-- 订单明细事实表,存储 订单里商品的明细
where dt = '2020-06-14'
group by user_id,sku_id
------------------------------------------2--------------------
 --转换结构
 select
        user_id,
 collect_set(named_struct('sku_id',sku_id,'sku_num',sku_num,'order_count',order_count,'order_amount',order_amount)) order_stats
        --name_struct是处理各个字段,变成struct,是以一行为粒度的
        --collect是将每行的处理结果返回成一个数组
        --想更好的理解这里可以看范例
 from
    (
        select
            user_id,
            sku_id,
            sum(sku_num) sku_num,
            count(*) order_count,--统计订单
            cast(sum(final_amount_d) as decimal(20,2)) order_amount
        from dwd_fact_order_detail
        where dt='2020-06-14'
        group by user_id,sku_id
    )tmp
    group by user_id
order by user_id;

其实走到这里这个named_struct有点懵的,为什么呢,数据有问题

比如

1号结果为19条

1605269449570

2号结果也为19条

1605269495672

每个都只有数组仅仅有一个元素,一个sku_id是一个元素

真真难以体会named_struct与collect_set()的区别.但是其实两者区别挺大

named_struct是横向处理,把横向的列变成一个KV的struct结构

collect_set把一组的数据纵向处理,把struct结构进行捏成一个set数组

测试一下吧

--感悟测试
--使用1号创建数组
create table tmp_test as
select --每个会员的购买的每个商品的
       user_id,
       sku_id,
       sum(sku_num) sku_num,--商品总数量
       count(*) order_count,--订单总数
       cast(sum(final_amount_d) as decimal(20,2)) order_amount--商品的总价格
from
dwd_fact_order_detail
where dt = '2020-06-14'
group by user_id,sku_id
order by user_id desc;

先把内部的拿出来,测试一下,

--插入一条跟原来重复sku的数据
insert into table tmp_test values(996,2,1,1,2232.00);
--查看一下数据
select * from  tmp_test;
--然后执行一下2
select
        user_id,
        collect_set() order_stats
    from
        tmp_test--1
    group by user_id
order by user_id;

查看结果

1605269895480

这样的话,我们array中就变成两个元素了,也就是说这个user_id买了两种sku_id,内部数组根据sku_id变成一个struct

--最后删除掉自己的测试表
drop table tmp_test;

五剑合璧,整合一下

with tmp_login as
         (
             select user_id,
                    count(*) login_count--启动活跃次数
             from dwd_start_log
             where dt = '2020-06-14'
         ),
    tmp_cart as
        (
            select user_id,
                   count(action_id) cart_count
            from dwd_action_log
            where dt = '2020-06-14'
              and user_id is not null
              and action_id = 'cart_add'
        ),
     tmp_order as
         (
             select user_id,
                    count(*)                order_count,
                    sum(final_total_amount) order_amount
             from dwd_fact_order_info
             where dt = '2020-06-14'
             group by user_id
         ),
     tmp_payment as
         (
             select user_id,
                    count(*)            payment_count,--支付次数
                    sum(payment_amount) payment_amount
             from dwd_fact_payment_info
             where dt = '2020-06-14'
             group by user_id
         ),
     tmp_order_detail as
         (
             select user_id,
                    collect_set(named_struct('sku_id', sku_id, 'sku_num', sku_num, 'order_count', order_count,
                                             'order_amount', order_amount)) order_stats
                    --name_struct是处理各个字段,变成struct,是以一行为粒度的
                    --collect是将每行的处理结果返回成一个数组
                    --想更好的理解这里可以看范例
             from (
                      select user_id,
                             sku_id,
                             sum(sku_num)                                sku_num,
                             count(*)                                    order_count,
                             cast(sum(final_amount_d) as decimal(20, 2)) order_amount
                      from dwd_fact_order_detail
                      where dt = '2020-06-14'
                      group by user_id, sku_id
                  ) tmp
             group by user_id
         )
insert overwrite table dws_user_action_daycount partition(dt='2020-06-14')
select
    tmp_login.user_id,
    login_count,
    nvl(cart_count,0),
    nvl(order_count,0),
    nvl(order_amount,0.0),
    nvl(payment_count,0),
    nvl(payment_amount,0.0),
    order_stats
from tmp_login--登录活跃
left join tmp_cart on tmp_login.user_id=tmp_cart.user_id--加购
left join tmp_order on tmp_login.user_id=tmp_order.user_id--下单
left join tmp_payment on tmp_login.user_id=tmp_payment.user_id--支付
left join tmp_order_detail on tmp_login.user_id=tmp_order_detail.user_id;--细节

4. DWT-会员主题宽表

主题表存累积量,不多比比,直接建表

存储会员相关的累积量

dwt_user_topic

4.1 建表语句

create external table dwt_user_topic
(
    user_id string  comment '用户id',
    login_date_first string  comment '首次登录时间',
    login_date_last string  comment '末次登录时间',
    login_count bigint comment '累积登录天数',
    login_last_30d_count bigint comment '最近30日登录天数',
    order_date_first string  comment '首次下单时间',
    order_date_last string  comment '末次下单时间',
    order_count bigint comment '累积下单次数',
    order_amount decimal(16,2) comment '累积下单金额',
    order_last_30d_count bigint comment '最近30日下单次数',
    order_last_30d_amount bigint comment '最近30日下单金额',
    payment_date_first string  comment '首次支付时间',
    payment_date_last string  comment '末次支付时间',
    payment_count decimal(16,2) comment '累积支付次数',
    payment_amount decimal(16,2) comment '累积支付金额',
    payment_last_30d_count decimal(16,2) comment '最近30日支付次数',
    payment_last_30d_amount decimal(16,2) comment '最近30日支付金额'
)COMMENT '会员主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_user_topic/'
tblproperties ("parquet.compression"="lzo");

4.2 字段分析

首先,想都要想,来源肯定是同等DWS

dws_user_action_daycount

说白了要想统计累积的不就靠的是每日的么

画画线

1605271068813

大体有这么意思就行,原理还是新旧表对比,跟设备主题宽表一样,关键在于30天,所以拉取新表的时候既要拉取一下30天的信息,还要拉取新增当天的信息进行对比,这个新30天要顶替原来的旧30天

注意这种新旧表体系

拉取新表因为要参与对比,需要直接拉取成旧表的样子

所以废话不说,直接照着旧表拉取新表把

4.3 拉取sql书写

书写之前也要明白

需要拉30天数据

因为从每日会员行为变成累积

所有要group by user_id

new表的拉取

注意分析粒度 : 每个会员每天的累积量

dws_user_action_daycount

select
       user_id,
       --xxx x1,--首次登录时间,这个对比的时候取老表就可以了
       '2020-06-14' login_date_last,--末次登录时间取今天
       sum(`if`(dt='2020-06-14',login_count,0)) login_count,--参与累加,只取一天,旧表+1累积登录天数
       sum(`if`(login_count>0,1,0)) login_last_30dcount,--最近30日登录天数
       --xxx x2,--首次下单时间,对比获得老表数据
       '2020-06-14' order_date_last,--末次下单时间,获取当前日期
       sum(`if`(dt='2020-06-14',order_count,0)) order_count,--累积下单次数,当前取1,加上老表
       sum(`if`(dt='2020-06-14',order_amount,0)) login_last_30dcount,--累积下单金额,取当前下单金额,加上老表的累积下单金额
       sum(`if`(order_count>0,order_count,0)) order_last_30d_count,--30日下单次数,发生订单取1
       sum(`if`(order_amount>0,order_amount,0)) order_last_30d_amount,--30日下单金额
       --xxx x3,--首次支付时间
       '2020-06-14' payment_date_last,--末次支付时间
       sum(`if`(dt='2020-06-14',payment_count,0)) payment_count,--参与累加,取1
       sum(`if`(dt='2020-06-14',payment_amount,0)) payment_amount,--参与累加,取1
       sum(`if`(payment_count>0,order_count,0)) payment_last_30d_count,--30日支付次数
       sum(`if`(payment_amount>0,order_amount,0)) payment_last_30d_amount--30日支付金额
from dws_user_action_daycount
where  dt > date_add('2020-06-14',-30)--取30天数据
group by user_id;

旧表拉取

--一个不分区表的寂寞
select * from dwt_user_topic;

倚天屠龙

select nvl(new.user_id, old.user_id)                                                                  user_id,             --很简单,如果新增会员,取新增会员
       `if`(old.login_date_first is null and new.login_count > 0, '2020-06-14', old.login_date_first) login_date_first,--没有首次登陆时间-新会员,有活动,取当前日期
       `if`(new.login_count > 0, '2020-06-14', old.login_date_last)                                   login_date_last,--末次登录天数
       nvl(old.login_count, 0) + if(new.login_count > 1, 1, 0)                                        login_count,--累积天数
       nvl(new.order_last_30d_count, 0)                                                               order_last_30d_count,--最近30日登录天数
       `if`(old.order_date_first is null and new.order_count > 0, '2020-06-14', old.order_date_first) order_date_first,--首次下单时间
       if(new.order_count > 0, '2020-06-14', old.order_date_last),--末次下单时间
       nvl(old.order_count, 0) + nvl(new.order_count, 0),--累积下单次数
       nvl(old.order_amount, 0) + nvl(new.order_amount, 0),--累积下单金额
       nvl(new.order_last_30d_count, 0),--最近30日下单次数
       nvl(new.order_last_30d_amount, 0),--最近30日下单金额
       if(old.payment_date_first is null and new.payment_count > 0, '2020-06-14', old.payment_date_first),--首次支付时间
       if(new.payment_count > 0, '2020-06-14', old.payment_date_last),--末次支付时间
       nvl(old.payment_count, 0) + nvl(new.payment_count, 0),--累积支付次数
       nvl(old.payment_amount, 0) + nvl(new.payment_amount, 0),--累积支付金额
       nvl(new.payment_last_30d_count, 0),--最近30日支付次数
       nvl(new.payment_last_30d_amount, 0)--最近30日支付金额
from dwt_user_topic old
         full outer join
     (select user_id,
             --xxx x1,--首次登录时间
             '2020-06-14'                                    login_date_last,
             sum(`if`(dt = '2020-06-14', login_count, 0))    login_count,--参与累加,只取一天,旧表+1累积登录天数
             sum(`if`(login_count > 0, 1, 0))                login_last_30dcount,--最近30日登录天数
             --xxx x2,--首次下单时间
             '2020-06-14'                                    order_date_last,--末次下单时间
             sum(`if`(dt = '2020-06-14', order_count, 0))    order_count,--参与累加,取1
             sum(`if`(dt = '2020-06-14', order_amount, 0))   order_amount,--参与累加,取1
             sum(`if`(order_count > 0, order_count, 0))      order_last_30d_count,--30日下单次数
             sum(`if`(order_amount > 0, order_amount, 0))    order_last_30d_amount,--30日下单金额
             --xxx x3,--首次支付时间
             '2020-06-14'                                    payment_date_last,--末次支付时间
             sum(`if`(dt = '2020-06-14', payment_count, 0))  payment_count,--参与累加,取1
             sum(`if`(dt = '2020-06-14', payment_amount, 0)) payment_amount,--参与累加,取1
             sum(`if`(payment_count > 0, order_count, 0))    payment_last_30d_count,--30日支付次数
             sum(`if`(payment_amount > 0, order_amount, 0))  payment_last_30d_amount--30日支付金额
      from dws_user_action_daycount
      where dt > date_add('2020-06-14', -30)
      group by user_id
     ) new
     on new.user_id = old.user_id;

5. DWS-每日商品行为

dws_sku_action_daycount

5.1 每日商品行为存什么?

其实我直接看这个表,我是懵的,每日商品还有行为?

其实是从商品角度去看待用户对它的操作,比如

  1. 被下单次数,金额
  2. 被支付金额,次数
  3. 被退款件数,金额
  4. …..

还有其他的被加购收藏等,都可以每日商品行为,就是从以sku_id为粒度,从每一个商品的角度去分析

5.2 建表语句

create external table dws_sku_action_daycount 
(   
    sku_id string comment 'sku_id',
    order_count bigint comment '被下单次数',
    order_num bigint comment '被下单件数',
    order_amount decimal(16,2) comment '被下单金额',
    payment_count bigint  comment '被支付次数',
    payment_num bigint comment '被支付件数',
    payment_amount decimal(16,2) comment '被支付金额',
    refund_count bigint  comment '被退款次数',
    refund_num bigint comment '被退款件数',
    refund_amount  decimal(16,2) comment '被退款金额',
    cart_count bigint comment '被加入购物车次数',
    favor_count bigint comment '被收藏次数',
    appraise_good_count bigint comment '好评数',
    appraise_mid_count bigint comment '中评数',
    appraise_bad_count bigint comment '差评数',
    appraise_default_count bigint comment '默认评价数'
) COMMENT '每日商品行为'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_sku_action_daycount/'
tblproperties ("parquet.compression"="lzo");

5.3 字段分析

1605772430257

这个每日商品行为直接从五个表中拉取

5.4 拉取sql

先拉取每日商品的下单字段段

被下单次数/被下单件数/被下单金额

from dwd_fact_order_detail

1605774126324

select
       sku_id,
       count(*) order_count,
       sum(sku_num) order_num, --统计所有订单中此商品的数量
       '2020-06-25' dt
from dwd_fact_order_detail --订单明细表
where dt='2020-06-25'--限定时间
group by sku_id --按照sku_id进行将所有的订单明细(单个sku_id在订单中的分摊)分组

1605774148325

--找出被支付次数,被支付件数,与被支付金额,需要找出订单明细事实表中被支付的订单
--被支付的订单
select
       order_id
from dwd_fact_payment_info
where dt = '2020-06-25';
--然后过滤出订单明细中被支付的订单
select
       sku_id,
       count(*) payment_count,
       sum(sku_num) payment_num,--统计所有订单中此商品的数量
       sum(final_amount_d) payment_amount,
       '2020-06-25' dt
from dwd_fact_order_detail --订单明细表
where dt='2020-06-25' and order_id in
                          (select order_id
                           from dwd_fact_payment_info
                           where dt = '2020-06-25'
                          )--限定时间
group by sku_id --按照sku_id进行将所有的订单明细(单个sku_id在订单中的分摊)分组

1605774789074

--退款相关的直接从退款事实表中拉取就可以了
select
       sku_id,
       count(*) refund_count,--被退款次数
       sum(refund_num) refund_num,--件数
       sum(refund_amount) refund_amount,--金额
       '2020-06-25' dt
from dwd_fact_order_refund_info
where dt='2020-06-25'
group by sku_id

1605775564212

--被加入购物车的次数,这个比较简单,直接从加购事实表拉取
select
       sku_id,
       count(*) cart_count,
       '2020-06-25' dt
from dwd_fact_cart_info
where dt='2020-06-25'
group by sku_id;

被收藏次数

1605775592012

select
       sku_id,
       count(*) cart_count,
       '2020-06-15' dt
from gmall.dwd_fact_favor_info
where dt='2020-06-25'
group by sku_id

好评字段

1201,好评
1202,中评
1203,差评
1204,自动

1605776263144

select
       sku_id,
       sum(`if`(appraise=1201,1,0)) appraise_good_count,
       sum(`if`(appraise=1202,1,0)) appraise_mid_count,
       sum(`if`(appraise=1203,1,0)) appraise_bad_count,
       sum(`if`(appraise=1204,1,0)) appraise_default_count,
       '2020-06-25' dt
from dwd_fact_comment_info
where dt='2020-06-25'
group by sku_id;

合并方案

这是比较好的方法,适合数字

with
     tmp_order as (
         select sku_id,
                count(*)            order_count,
                sum(sku_num)        order_num,--统计所有订单中此商品的数量
                sum(final_amount_d) order_amount,
                '2020-06-25'        dt
         from dwd_fact_order_detail --订单明细表
         where dt = '2020-06-25'--限定时间
         group by sku_id
--按照sku_id进行将所有的订单明细(单个sku_id在订单中的分摊)分组
--找出被支付次数,被支付件数,与被支付金额,需要找出订单明细事实表中被支付的订单
--然后过滤出订单明细中被支付的订单
     ),
     tmp_pay as (
         select sku_id,
                count(*)            payment_count,
                sum(sku_num)        payment_num,--统计所有订单中此商品的数量
                sum(final_amount_d) payment_amount,
                '2020-06-25'        dt
         from dwd_fact_order_detail --订单明细表
         where (dt='2020-06-25' or dt=date_add('2020-06-25',-1))--拿取两天的,下单与支付时间可能分开,今天支付可能是昨日下的单
           and order_id in
               (select order_id
                from dwd_fact_payment_info
                where dt = '2020-06-25'
               )--限定时间
         group by sku_id
--按照sku_id进行将所有的订单明细(单个sku_id在订单中的分摊)分组
     ),
     tmp_refund as (
--退款相关的直接从退款事实表中拉取就可以了
         select sku_id,
                count(*)           refund_count,--被退款次数
                sum(refund_num)    refund_num,--件数
                sum(refund_amount) refund_amount,--金额
                '2020-06-25'       dt
         from dwd_fact_order_refund_info
         where dt = '2020-06-25'
         group by sku_id
     ),
     tmp_cart as (
--被加入购物车的次数,这个比较简单,直接从加购事实表拉取
         select sku_id,
                count(*)     cart_count,
                '2020-06-25' dt
         from dwd_fact_cart_info
         where dt = '2020-06-25'
         group by sku_id
     ),
     tmp_favor as (
         select sku_id,
                count(*)     favor_count,
                '2020-06-15' dt
         from gmall.dwd_fact_favor_info
         where dt = '2020-06-25'
         group by sku_id
     ),
     tmp_appraise as (
         select sku_id,
                sum(`if`(appraise = 1201, 1, 0)) appraise_good_count,
                sum(`if`(appraise = 1202, 1, 0)) appraise_mid_count,
                sum(`if`(appraise = 1203, 1, 0)) appraise_bad_count,
                sum(`if`(appraise = 1204, 1, 0)) appraise_default_count,
                '2020-06-25'                     dt
         from dwd_fact_comment_info
         where dt = '2020-06-25'
         group by sku_id
     )
select
       sku_id,
       sum(order_count),
       sum(order_num),
       sum(order_amount),
       sum(payment_count),
       sum(payment_num),
       sum(payment_amount),
       sum(refund_count),
       sum(refund_num),
       sum(refund_amount),
       sum(cart_count),
       sum(favor_count),
       sum(appraise_good_count),
       sum(appraise_mid_count),
       sum(appraise_bad_count),
       sum(appraise_default_count)

from (
         select sku_id,
                order_count,
                order_num,
                order_amount,
                0 payment_count,
                0 payment_num,
                0 payment_amount,
                0 refund_count,
                0 refund_num,
                0 refund_amount,
                0 cart_count,
                0 favor_count,
                0 appraise_good_count,
                0 appraise_mid_count,
                0 appraise_bad_count,
                0 appraise_default_count
         from tmp_order
         union all
         select sku_id,
                0 order_count,
                0 order_num,
                0 order_amount,
                payment_count,
                payment_num,
                payment_amount,
                0 refund_count,
                0 refund_num,
                0 refund_amount,
                0 cart_count,
                0 favor_count,
                0 appraise_good_count,
                0 appraise_mid_count,
                0 appraise_bad_count,
                0 appraise_default_count
         from tmp_pay
         union all
         select sku_id,
                0 order_count,
                0 order_num,
                0 order_amount,
                0 payment_count,
                0 payment_num,
                0 payment_amount,
                refund_count,
                refund_num,
                refund_amount,
                0 cart_count,
                0 favor_count,
                0 appraise_good_count,
                0 appraise_mid_count,
                0 appraise_bad_count,
                0 appraise_default_count
         from tmp_refund
         union all
         select sku_id,
                0 order_count,
                0 order_num,
                0 order_amount,
                0 payment_count,
                0 payment_num,
                0 payment_amount,
                0 refund_count,
                0 refund_num,
                0 refund_amount,
                cart_count,
                0 favor_count,
                0 appraise_good_count,
                0 appraise_mid_count,
                0 appraise_bad_count,
                0 appraise_default_count
         from tmp_cart
         union all
         select sku_id,
                0 order_count,
                0 order_num,
                0 order_amount,
                0 payment_count,
                0 payment_num,
                0 payment_amount,
                0 refund_count,
                0 refund_num,
                0 refund_amount,
                0 cart_count,
                favor_count,
                0 appraise_good_count,
                0 appraise_mid_count,
                0 appraise_bad_count,
                0 appraise_default_count
         from tmp_favor
         union all
         select sku_id,
                0 order_count,
                0 order_num,
                0 order_amount,
                0 payment_count,
                0 payment_num,
                0 payment_amount,
                0 refund_count,
                0 refund_num,
                0 refund_amount,
                0 cart_count,
                0 favor_count,
                appraise_good_count,
                appraise_mid_count,
                appraise_bad_count,
                appraise_default_count
         from tmp_appraise
     ) t
group by sku_id;

6. DWT-商品主题宽表

每日商品行为存储的是每一个sku_id每天的行为

那么商品主题宽表存储的就是每一个sku_id累积的行为

6.1 建表语句

create external table dwt_sku_topic
(
    sku_id string comment 'sku_id',
    spu_id string comment 'spu_id',
    order_last_30d_count bigint comment '最近30日被下单次数',
    order_last_30d_num bigint comment '最近30日被下单件数',
    order_last_30d_amount decimal(16,2)  comment '最近30日被下单金额',
    order_count bigint comment '累积被下单次数',
    order_num bigint comment '累积被下单件数',
    order_amount decimal(16,2) comment '累积被下单金额',
    payment_last_30d_count   bigint  comment '最近30日被支付次数',
    payment_last_30d_num bigint comment '最近30日被支付件数',
    payment_last_30d_amount  decimal(16,2) comment '最近30日被支付金额',
    payment_count   bigint  comment '累积被支付次数',
    payment_num bigint comment '累积被支付件数',
    payment_amount  decimal(16,2) comment '累积被支付金额',
    refund_last_30d_count bigint comment '最近三十日退款次数',
    refund_last_30d_num bigint comment '最近三十日退款件数',
    refund_last_30d_amount decimal(16,2) comment '最近三十日退款金额',
    refund_count bigint comment '累积退款次数',
    refund_num bigint comment '累积退款件数',
    refund_amount decimal(16,2) comment '累积退款金额',
    cart_last_30d_count bigint comment '最近30日被加入购物车次数',
    cart_count bigint comment '累积被加入购物车次数',
    favor_last_30d_count bigint comment '最近30日被收藏次数',
    favor_count bigint comment '累积被收藏次数',
    appraise_last_30d_good_count bigint comment '最近30日好评数',
    appraise_last_30d_mid_count bigint comment '最近30日中评数',
    appraise_last_30d_bad_count bigint comment '最近30日差评数',
    appraise_last_30d_default_count bigint comment '最近30日默认评价数',
    appraise_good_count bigint comment '累积好评数',
    appraise_mid_count bigint comment '累积中评数',
    appraise_bad_count bigint comment '累积差评数',
    appraise_default_count bigint comment '累积默认评价数'
 )COMMENT '商品主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_sku_topic/'
tblproperties ("parquet.compression"="lzo");

DWT商品主题宽表不会分区


文章作者: Jinxin Li
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Jinxin Li !
  目录