cube.js 特性分析
本文尝试记录我在试用 cube.js 过程中,对其的理解和分析。
Data Modeling
我最感兴趣的是 cube.js 的模型建模能力。
数据集:NorthWind 为了测试多事实支持能力,对该数据集做了简单扩展:
create table northwind.Orders
(
OrderID int auto_increment
primary key,
CustomerID int null,
EmployeeID int null,
OrderDate datetime null,
ShipperID int null,
freight decimal(10, 2) null, -- 新增运费字段
);
create table northwind.Inventories
(
InventoryId int not null
primary key,
ProductId int not null,
quantity int not null comment '库存数量'
);
-
Query: ( products.productname -> order_detail.count, orders.count, order_detail.quantity)
SELECT `products`.`ProductName` `products__productname`, count(`order_details`.OrderDetailId) `order_details__count`, count(distinct `orders`.orderId) `orders__count`, sum(`order_details`.`Quantity`) `order_details__quantity` FROM northwind.`OrderDetails` AS `order_details` LEFT JOIN northwind.`Orders` AS `orders` ON `order_details`.`OrderID` = `orders`.`OrderID` LEFT JOIN northwind.`Products` AS `products` ON `order_details`.`ProductID` = `products`.`ProductID` GROUP BY 1 ORDER BY 2 DESC LIMIT 10000
- 查询限制在单个 star-model 中
- 对 star-model 中的非明细粒度的 count 会改写为 count(distinct) 符合预期。
-
Query: (product.productname -> order_details.quantity, orders.freight)
SELECT q_0.`products__productname`, `order_details__quantity` `order_details__quantity`, `orders__freight` `orders__freight` FROM ( SELECT `main__products`.`ProductName` `products__productname`, sum(`main__order_details`.`Quantity`) `order_details__quantity` FROM northwind.`OrderDetails` AS `main__order_details` LEFT JOIN northwind.`Orders` AS `main__orders` ON `main__order_details`.`OrderID` = `main__orders`.`OrderID` LEFT JOIN northwind.`Products` AS `main__products` ON `main__order_details`.`ProductID` = `main__products`.`ProductID` GROUP BY 1 ) as q_0 INNER JOIN ( SELECT `keys`.`products__productname`, sum(`orders_key__orders`.freight) `orders__freight` FROM ( SELECT DISTINCT `orders_key__products`.`ProductName` `products__productname`, `orders_key__orders`.orderId `orders__order_id` FROM northwind.`OrderDetails` AS `orders_key__order_details` LEFT JOIN northwind.`Orders` AS `orders_key__orders` ON `orders_key__order_details`.`OrderID` = `orders_key__orders`.`OrderID` LEFT JOIN northwind.`Products` AS `orders_key__products` ON `orders_key__order_details`.`ProductID` = `orders_key__products`.`ProductID` ) AS `keys` LEFT JOIN northwind.`Orders` AS `orders_key__orders` ON `keys`.`orders__order_id` = `orders_key__orders`.orderId GROUP BY 1 ) as q_1 ON (q_0.`products__productname` = q_1.`products__productname` OR (q_0.`products__productname` IS NULL AND q_1.`products__productname` IS NULL)) ORDER BY 2 DESC LIMIT 10000
结论:
- cube.js 支持在同一个 star-model 中在不同层级上的度量(类似于多事实的概念),但生成的 SQL 不够优化,性能可能不佳。
-
Query: ( product.productname -> inventories.quantity, order_details.quantity)
查询错误:
Can't find join path to join 'inventories', 'order_details', 'products'
结论:- cube.js 查询中,限定只能使用单个查询图(限定一个 star-model ),并不支持多事实的概念。
- 根据 query2:cube.js 支持在一个 star-model 中访问不同粒度的度量,能正确处理查询,但未能生成优化的 SQL。
-
YTD Query
cube.js 支持的复杂计算相对较为有限,主要是 https://cube.dev/docs/product/data-modeling/concepts/multi-stage-calculations#period-to-date 文中介绍的:
- Rolling Window, 包括 YTD, QTD, MTD 都是通过 rolling window 来实现的
- Ranking 这里以一个简单的 YTD 为例来查看其执行过程:
SELECT q_0.`orders__orderdate_day`,
`order_details__quantity` `order_details__quantity`,
`order_details__quantity_ytd` `order_details__quantity_ytd`
FROM (
SELECT CAST(DATE_FORMAT(CONVERT_TZ(`main__orders`.`OrderDate`, @@session.time_zone, '+00:00'),
'%Y-%m-%dT00:00:00.000') AS DATETIME) `orders__orderdate_day`,
sum(`main__order_details`.`Quantity`) `order_details__quantity`
FROM northwind.`OrderDetails` AS `main__order_details`
LEFT JOIN northwind.`Orders` AS `main__orders`
ON `main__order_details`.`OrderID` = `main__orders`.`OrderID`
WHERE (`main__orders`.`OrderDate` >=
TIMESTAMP(convert_tz('1996-07-01T00:00:00.000', '+00:00', @@session.time_zone)) AND
`main__orders`.`OrderDate` <=
TIMESTAMP(convert_tz('1996-12-31T23:59:59.999', '+00:00', @@session.time_zone)))
GROUP BY 1
) as q_0
INNER JOIN (
SELECT `orders.orderdate_series`.`date_from` `orders__orderdate_day`,
sum(`order_details__quantity_ytd`) `order_details__quantity_ytd`
FROM (
SELECT TIMESTAMP(dates.f) date_from, TIMESTAMP(dates.t) date_to
FROM (
select '1996-07-01T00:00:00.000' f, '1996-07-01T23:59:59.999' t
UNION ALL
... -- 这里省略掉几百行类似的 SQL 代码
UNION ALL
select '1996-12-31T00:00:00.000' f, '1996-12-31T23:59:59.999' t
) AS dates
) AS `orders.orderdate_series`
LEFT JOIN (
SELECT CAST(DATE_FORMAT(CONVERT_TZ(`order_details_quantity_ytd_cumulative__orders`.`OrderDate`,@@session.time_zone, '+00:00'), '%Y-%m-%dT00:00:00.000')
AS DATETIME) `orders__orderdate_day`,
sum(`order_details_quantity_ytd_cumulative__order_details`.Quantity) `order_details__quantity_ytd`
FROM northwind.`OrderDetails` AS `order_details_quantity_ytd_cumulative__order_details`
LEFT JOIN northwind.`Orders` AS `order_details_quantity_ytd_cumulative__orders`
ON `order_details_quantity_ytd_cumulative__order_details`.`OrderID` =
`order_details_quantity_ytd_cumulative__orders`.`OrderID`
WHERE (CONVERT_TZ(`order_details_quantity_ytd_cumulative__orders`.`OrderDate`,@@session.time_zone, '+00:00') >=
CAST(DATE_FORMAT(TIMESTAMP('1996-07-01T00:00:00.000'), '%Y-01-01T00:00:00.000') AS DATETIME) AND
CONVERT_TZ(`order_details_quantity_ytd_cumulative__orders`.`OrderDate`,@@session.time_zone, '+00:00') <= TIMESTAMP('1996-12-31T23:59:59.999'))
GROUP BY 1
) AS `order_details_quantity_ytd_cumulative__base`
ON (`order_details_quantity_ytd_cumulative__base`.`orders__orderdate_day` >=
CAST(DATE_FORMAT(`orders.orderdate_series`.`date_from`, '%Y-01-01T00:00:00.000') AS DATETIME) AND
`order_details_quantity_ytd_cumulative__base`.`orders__orderdate_day` <= `orders.orderdate_series`.`date_to`)
GROUP BY 1
) as q_1 ON (q_0.`orders__orderdate_day` = q_1.`orders__orderdate_day` OR
(q_0.`orders__orderdate_day` IS NULL AND
q_1.`orders__orderdate_day` IS NULL))
ORDER BY 1 ASC
LIMIT 10000
分析结论:
- cube.js 对计算度量的额表达能力相比 MDX 来说非常有限,其更接近于 Smartbi 支持的快速计算,都是内置支持的,不太具备了良好的扩展能力。
- 在 YTD 计算时,目前的实现方式是通过 conditional join 来实现的,其效果与我们的基于窗口函数的实现相似,但执行效率估计会低一些
- 目前进行 YTD 计算时,对查询条件有一定的限制,必须指定 开始日期、结束日期。
- CubeJS 有一个 试验性的 Tesseract 引擎(估计与 powerbi 的 VertiPaq 类似)。
总体来看:CubeJS 的建模能力相比 PowerBI, SmartBi 目前的建模能力,要弱很多。