cube.js 特性分析

本文尝试记录我在试用 cube.js 过程中,对其的理解和分析。

Data Modeling

我最感兴趣的是 cube.js 的模型建模能力。

数据集:NorthWind 为了测试多事实支持能力,对该数据集做了简单扩展:

create table northwind.Orders ( OrderID int auto_increment primary key, CustomerID int null, EmployeeID int null, OrderDate datetime null, ShipperID int null, freight decimal(10, 2) null, -- 新增运费字段 ); create table northwind.Inventories ( InventoryId int not null primary key, ProductId int not null, quantity int not null comment '库存数量' );
  1. Query: ( products.productname -> order_detail.count, orders.count, order_detail.quantity)

    SELECT `products`.`ProductName` `products__productname`, count(`order_details`.OrderDetailId) `order_details__count`, count(distinct `orders`.orderId) `orders__count`, sum(`order_details`.`Quantity`) `order_details__quantity` FROM northwind.`OrderDetails` AS `order_details` LEFT JOIN northwind.`Orders` AS `orders` ON `order_details`.`OrderID` = `orders`.`OrderID` LEFT JOIN northwind.`Products` AS `products` ON `order_details`.`ProductID` = `products`.`ProductID` GROUP BY 1 ORDER BY 2 DESC LIMIT 10000
    • 查询限制在单个 star-model 中
    • 对 star-model 中的非明细粒度的 count 会改写为 count(distinct) 符合预期。
  2. Query: (product.productname -> order_details.quantity, orders.freight)

    SELECT q_0.`products__productname`, `order_details__quantity` `order_details__quantity`, `orders__freight` `orders__freight` FROM ( SELECT `main__products`.`ProductName` `products__productname`, sum(`main__order_details`.`Quantity`) `order_details__quantity` FROM northwind.`OrderDetails` AS `main__order_details` LEFT JOIN northwind.`Orders` AS `main__orders` ON `main__order_details`.`OrderID` = `main__orders`.`OrderID` LEFT JOIN northwind.`Products` AS `main__products` ON `main__order_details`.`ProductID` = `main__products`.`ProductID` GROUP BY 1 ) as q_0 INNER JOIN ( SELECT `keys`.`products__productname`, sum(`orders_key__orders`.freight) `orders__freight` FROM ( SELECT DISTINCT `orders_key__products`.`ProductName` `products__productname`, `orders_key__orders`.orderId `orders__order_id` FROM northwind.`OrderDetails` AS `orders_key__order_details` LEFT JOIN northwind.`Orders` AS `orders_key__orders` ON `orders_key__order_details`.`OrderID` = `orders_key__orders`.`OrderID` LEFT JOIN northwind.`Products` AS `orders_key__products` ON `orders_key__order_details`.`ProductID` = `orders_key__products`.`ProductID` ) AS `keys` LEFT JOIN northwind.`Orders` AS `orders_key__orders` ON `keys`.`orders__order_id` = `orders_key__orders`.orderId GROUP BY 1 ) as q_1 ON (q_0.`products__productname` = q_1.`products__productname` OR (q_0.`products__productname` IS NULL AND q_1.`products__productname` IS NULL)) ORDER BY 2 DESC LIMIT 10000

    结论:

    1. cube.js 支持在同一个 star-model 中在不同层级上的度量(类似于多事实的概念),但生成的 SQL 不够优化,性能可能不佳。
  3. Query: ( product.productname -> inventories.quantity, order_details.quantity)

    查询错误:Can't find join path to join 'inventories', 'order_details', 'products' 结论:

    1. cube.js 查询中,限定只能使用单个查询图(限定一个 star-model ),并不支持多事实的概念。
    2. 根据 query2:cube.js 支持在一个 star-model 中访问不同粒度的度量,能正确处理查询,但未能生成优化的 SQL。
  4. YTD Query

    cube.js 支持的复杂计算相对较为有限,主要是 https://cube.dev/docs/product/data-modeling/concepts/multi-stage-calculations#period-to-date 文中介绍的:

    • Rolling Window, 包括 YTD, QTD, MTD 都是通过 rolling window 来实现的
    • Ranking 这里以一个简单的 YTD 为例来查看其执行过程:
SELECT q_0.`orders__orderdate_day`, `order_details__quantity` `order_details__quantity`, `order_details__quantity_ytd` `order_details__quantity_ytd` FROM ( SELECT CAST(DATE_FORMAT(CONVERT_TZ(`main__orders`.`OrderDate`, @@session.time_zone, '+00:00'), '%Y-%m-%dT00:00:00.000') AS DATETIME) `orders__orderdate_day`, sum(`main__order_details`.`Quantity`) `order_details__quantity` FROM northwind.`OrderDetails` AS `main__order_details` LEFT JOIN northwind.`Orders` AS `main__orders` ON `main__order_details`.`OrderID` = `main__orders`.`OrderID` WHERE (`main__orders`.`OrderDate` >= TIMESTAMP(convert_tz('1996-07-01T00:00:00.000', '+00:00', @@session.time_zone)) AND `main__orders`.`OrderDate` <= TIMESTAMP(convert_tz('1996-12-31T23:59:59.999', '+00:00', @@session.time_zone))) GROUP BY 1 ) as q_0 INNER JOIN ( SELECT `orders.orderdate_series`.`date_from` `orders__orderdate_day`, sum(`order_details__quantity_ytd`) `order_details__quantity_ytd` FROM ( SELECT TIMESTAMP(dates.f) date_from, TIMESTAMP(dates.t) date_to FROM ( select '1996-07-01T00:00:00.000' f, '1996-07-01T23:59:59.999' t UNION ALL ... -- 这里省略掉几百行类似的 SQL 代码 UNION ALL select '1996-12-31T00:00:00.000' f, '1996-12-31T23:59:59.999' t ) AS dates ) AS `orders.orderdate_series` LEFT JOIN ( SELECT CAST(DATE_FORMAT(CONVERT_TZ(`order_details_quantity_ytd_cumulative__orders`.`OrderDate`,@@session.time_zone, '+00:00'), '%Y-%m-%dT00:00:00.000') AS DATETIME) `orders__orderdate_day`, sum(`order_details_quantity_ytd_cumulative__order_details`.Quantity) `order_details__quantity_ytd` FROM northwind.`OrderDetails` AS `order_details_quantity_ytd_cumulative__order_details` LEFT JOIN northwind.`Orders` AS `order_details_quantity_ytd_cumulative__orders` ON `order_details_quantity_ytd_cumulative__order_details`.`OrderID` = `order_details_quantity_ytd_cumulative__orders`.`OrderID` WHERE (CONVERT_TZ(`order_details_quantity_ytd_cumulative__orders`.`OrderDate`,@@session.time_zone, '+00:00') >= CAST(DATE_FORMAT(TIMESTAMP('1996-07-01T00:00:00.000'), '%Y-01-01T00:00:00.000') AS DATETIME) AND CONVERT_TZ(`order_details_quantity_ytd_cumulative__orders`.`OrderDate`,@@session.time_zone, '+00:00') <= TIMESTAMP('1996-12-31T23:59:59.999')) GROUP BY 1 ) AS `order_details_quantity_ytd_cumulative__base` ON`order_details_quantity_ytd_cumulative__base`.`orders__orderdate_day` >= CAST(DATE_FORMAT(`orders.orderdate_series`.`date_from`, '%Y-01-01T00:00:00.000') AS DATETIME) AND `order_details_quantity_ytd_cumulative__base`.`orders__orderdate_day` <= `orders.orderdate_series`.`date_to`GROUP BY 1 ) as q_1 ON (q_0.`orders__orderdate_day` = q_1.`orders__orderdate_day` OR (q_0.`orders__orderdate_day` IS NULL AND q_1.`orders__orderdate_day` IS NULL)) ORDER BY 1 ASC LIMIT 10000

分析结论:

  1. cube.js 对计算度量的额表达能力相比 MDX 来说非常有限,其更接近于 Smartbi 支持的快速计算,都是内置支持的,不太具备了良好的扩展能力。
  2. 在 YTD 计算时,目前的实现方式是通过 conditional join 来实现的,其效果与我们的基于窗口函数的实现相似,但执行效率估计会低一些
  3. 目前进行 YTD 计算时,对查询条件有一定的限制,必须指定 开始日期、结束日期。
  4. CubeJS 有一个 试验性的 Tesseract 引擎(估计与 powerbi 的 VertiPaq 类似)。

总体来看:CubeJS 的建模能力相比 PowerBI, SmartBi 目前的建模能力,要弱很多。