StreamSets Transformer:
Design Patterns for Slowly Changing Dimensions

Posted in Engineering, November 19, 2019

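In this blog, we will look at a few design patterns for Slowly Changing Dimensions (SCD) Type 2 and see how StreamSets Transformer, the latest addition to the StreamSets DataOps Platform, makes them easy to implement.

Relatively static data, such as the locations and addresses of entities like customers, changes rarely (if ever) over time, yet in most cases it is critical to maintain a history of all changes. This is where the concepts of dimensions and slowly changing dimensions come in; both are important components of DataOps by way of management and automation of such datasets.

“Dimensions in data management and data warehousing contain relatively static data about such entities as geographical locations, customers, or products. Data captured by Slowly Changing Dimensions (SCDs) change slowly but unpredictably, rather than according to a regular schedule” (Wikipedia).

There are six types of SCD operations (Type 1 through Type 6), and StreamSets Transformer enables you to handle and implement two of the most common ones: Type 1 and Type 2.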

Type 1 SCD: Doesn’t require the history of dimension changes to be maintained; the old dimension value is simply overwritten with the new one. This type of operation is easy to implement (similar to a normal SQL update) and is often used for things like removing special characters or correcting typos and spelling mistakes in record field values.

Type 2 SCD: Requires maintaining the history of all changes made to each key in a dimensional table. Here are some of the challenges you run into when handling Type 2 SCD manually:

  • Every process that updates these tables has to honor the Type 2 SCD pattern of expiring old records and replacing them with new ones
  • There might not be a built-in constraint to prevent overlapping start and end dates for a given dimension key
  • Converting an existing table to Type 2 SCD will most likely require you to update every single query that reads from or writes to that table
  • Every query against that table will need to account for the historical Type 2 SCD pattern by filtering only for current data or for a specific point in time

As you can imagine, Type 2 SCD operations can become complex, and hand-written code, SQL queries, etc. may not scale and can be difficult to maintain.
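To make that bookkeeping concrete, here is a minimal Python sketch of what a hand-written Type 2 update might look like on an in-memory list. The `start_date` and `end_date` fields, the helper names `apply_type2_change` and `as_of`, and the dates are all hypothetical and chosen only for illustration; they are not part of any product API.

```python
from datetime import date


def apply_type2_change(rows, key, new_attrs, effective):
    """Expire the current row for `key`, then append a new current row."""
    for row in rows:
        if row["customer_id"] == key and row["end_date"] is None:
            row["end_date"] = effective  # expire the old record
    rows.append({"customer_id": key, **new_attrs,
                 "start_date": effective, "end_date": None})
    return rows


def as_of(rows, key, when):
    """Return the row for `key` that was valid on `when`, or None."""
    for row in rows:
        started = row["start_date"] <= when
        not_ended = row["end_date"] is None or when < row["end_date"]
        if row["customer_id"] == key and started and not_ended:
            return row
    return None


# Hypothetical history for one customer (dates are made up).
history = [{"customer_id": 2, "city": "Littleton",
            "start_date": date(2019, 1, 1), "end_date": None}]
apply_type2_change(history, 2, {"city": "Park City"}, date(2019, 11, 1))

current = as_of(history, 2, date(2019, 11, 19))  # the Park City row
earlier = as_of(history, 2, date(2019, 6, 1))    # the Littleton row
```

Every writer has to call something like `apply_type2_change`, and every reader something like `as_of`, which is exactly the duplication the list above warns about.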

Meet the Slowly Changing Dimension processor. This processor makes it easy to implement Type 2 SCD operations by enabling data engineers to centralize all the “logic” (via configuration; not SQL queries or code!) in one place.

Let’s take a look at some common design patterns.

Pattern 1: One-Time Migration - File Based (Batch Mode)

Let’s first take a very simple yet concrete example of managing customer records (with updates to addresses) for existing and new customers. In this case, the assumption is that the destination is empty, so it’s more of a one-time migration scenario for ingesting “master” and “change” records from their respective origins into a new file destination.

This scenario involves:

  • Creating one record for every row in the “master” origin
  • Creating one record for every row in the “change” origin
    • New customers: Version set to 1 where the customer id does not exist in the “master” origin
    • Existing customers: Version set to the current value in the “master” origin + 1 where the customer id exists in the “master” origin

Sample Pipeline

Note: For details on configuration attributes, click here.

Master Origin Input: Sample master records for existing customers

customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode,version
1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521,1
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble ember Ridge,Littleton,CO,80126,1
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,3422 Blue Pioneer Bend,Caguas,PR,00725,1

Change Origin Input: Sample change records for existing and new customers

customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,4963 Ponderosa Ct,Park City,UT,80126
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,1991 Margo Pl,San Francisco,00725
11,Mark,Barrett,XXXXXXXXX,XXXXXXXXX,4963 Ponderosa Ct,Park City,UT,80126

Final Output: Given the two datasets above, the resulting output will look like this

customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode,version
1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521,1
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble ember Ridge,Littleton,CO,80126,1
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,4963 Ponderosa Ct,Park City,UT,80126,2
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,3422 Blue Pioneer Bend,Caguas,PR,00725,1
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,1991 Margo Pl,San Francisco,00725,2
11,Mark,Barrett,XXXXXXXXX,XXXXXXXXX,4963 Ponderosa Ct,Park City,UT,80126,1

Notice that the total number of output records is 6: 3 records from the master origin for existing customers and 3 records from the change origin. Of those three change records, two are for existing customers Mary and Ann, with their updated addresses and version incremented to 2, and one is for new customer Mark, with version set to 1.
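The versioning rule in this pattern can be sketched in a few lines of plain Python. This is only an illustration of the logic, not how the pipeline is implemented: the `migrate` and `read_csv` helpers are hypothetical, and the sample rows are trimmed to three columns for brevity.

```python
import csv
import io

# Trimmed-down stand-ins for the master and change inputs.
MASTER = """customer_id,customer_fname,customer_city,version
1,Richard,Brownsville,1
2,Mary,Littleton,1
3,Ann,Caguas,1
"""

CHANGES = """customer_id,customer_fname,customer_city
2,Mary,Park City
3,Ann,San Francisco
11,Mark,Park City
"""


def read_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def migrate(master_rows, change_rows):
    """Keep every master row; version each change row as master version + 1."""
    latest = {r["customer_id"]: int(r["version"]) for r in master_rows}
    output = [dict(r, version=int(r["version"])) for r in master_rows]
    for change in change_rows:
        # Unknown customer ids start at version 1 (0 + 1).
        version = latest.get(change["customer_id"], 0) + 1
        output.append(dict(change, version=version))
    return sorted(output, key=lambda r: (int(r["customer_id"]), r["version"]))


result = migrate(read_csv(MASTER), read_csv(CHANGES))
for row in result:
    print(row["customer_id"], row["customer_city"], row["version"])
```

Note that `latest` is never updated inside the loop, so a second change for the same customer would also receive version 2.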

Pattern 2: Incremental Updates - JDBC Based (Streaming Mode)

Now let’s assume there is a JDBC-enabled database (for example, MySQL) with a dimension table “customers” that has a composite primary key: customer_id, version. In this case, the goal is still the same as in pattern 1: we’d like to capture and maintain the history of updates for new and existing customer records.

Sample Pipeline

Note: For details on configuration attributes, click here.

The main differences between this pattern and pattern 1 are as follows:

  • Pattern 1 is designed to run in batch mode and terminate automatically after all the data has been processed, whereas the pipeline in pattern 2 is configured to run in streaming mode, continuously, until the pipeline is stopped manually. This means it will “listen” for customer updates being dropped into an S3 bucket and process them as soon as they’re available, without user intervention.
  • Pattern 1 can only handle one additional update for any given customer record, because the master origin is not updated with a new version number for each corresponding change record. Effectively, every update record coming in via the change origin will be assigned version 2.
  • Unlike pattern 1, in pattern 2 the master gets updated with the latest version (via the JDBC Producer destination), so every update record coming in via the change origin will get a new version assigned to it.
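The last two points can be illustrated with a small Python sketch. It assumes a plain dictionary as a stand-in for the master table and a hypothetical `assign_versions` helper; the `write_back` flag mimics whether the master is updated after each change.

```python
def assign_versions(changes, latest, write_back):
    """Assign a version to each incoming change record.

    changes:    iterable of customer ids, one per incoming change record
    latest:     dict of customer_id -> latest known version (the "master")
    write_back: True updates the master after every change (pattern 2);
                False leaves the master untouched (pattern 1)
    """
    assigned = []
    for customer_id in changes:
        version = latest.get(customer_id, 0) + 1
        if write_back:
            latest[customer_id] = version
        assigned.append((customer_id, version))
    return assigned


# Three successive updates for customer 2, whose master version is 1.
updates = [2, 2, 2]
pattern_1 = assign_versions(updates, {2: 1}, write_back=False)
pattern_2 = assign_versions(updates, {2: 1}, write_back=True)

print(pattern_1)  # [(2, 2), (2, 2), (2, 2)]
print(pattern_2)  # [(2, 2), (2, 3), (2, 4)]
```

Without the write-back, every update collides on version 2; with it, each update gets the next version number.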

Query customers in MySQL

SELECT * FROM customer WHERE customer_id = 1

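Pattern 3: Incremental Updates - Databricks Delta Lake (Streaming Mode)

This is very similar to pattern 2. The main differences are:

  • Single origin
  • Delta Lake Lookup: For every update/change record coming in, a lookup against the current Delta Lake table is performed based on the dimension key customer_id. If there is a match, the values of customer_id and version are returned and passed on to the SCD processor. The SCD processor increments the version number based on the lookup value, and a new record with the updated version is inserted into the Delta Lake table.

Sample Pipeline

Note: For details on configuration attributes, click here.

Query customers in Delta Lake on DBFS

SELECT * FROM delta.`/DeltaLake/customers` WHERE customer_id IN (1)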

Pattern 4: Upserts - Databricks Delta Lake and Time Travel (Streaming Mode)

If you’re using Delta Lake, another option is to leverage Delta Lake’s built-in upserts using merge functionality. The underlying concept is the same as SCD, which is to maintain versions of dimensions, but the implementation is much simpler.

Sample Pipeline

Note: For details on configuration attributes, click here.

In this pattern, for every record coming in via the (S3) origin, an insert or an update operation is performed in Delta Lake based on the conditions configured for new (“When Not Matched”) and existing (“When Matched”) products respectively. And because the Delta Lake storage layer supports ACID transactions, it is able to create new (Parquet) files for updates while still allowing you to query the most recent record with simple SQL, without requiring a tracking field (for example, “version”) to be present in the table and in the WHERE clause.

For example, consider the following original record:

product_id,product_category_id,product_name,product_description,product_price
1,2,"Quest Q64 10 FT. x 10 FT. Slant Leg Instant U","",59.98

And this change record, with the price updated from 59.98 to 69.98:

product_id,product_category_id,product_name,product_description,product_price
1,2,"Quest Q64 10 FT. x 10 FT. Slant Leg Instant U","",69.98

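Query the products Delta Lake table

SELECT * FROM products WHERE product_id = 1

Note that the products table does not have a tracking-type field (for example, “version”), while the query still retrieves the most “current” version of the record, with a product price of $69.98.

To query older versions of the data, Delta Lake provides a feature called “Time Travel”. In our example, to retrieve the previous version (0) of the product price, the SQL query would look like:

SELECT * FROM products VERSION AS OF 0 WHERE product_id = 1

Notice the product price of $59.98. To learn more details and options about Delta Lake Time Travel, click here.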
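As a mental model only, the combination of upserts and Time Travel can be mimicked with a toy in-memory table that keeps one snapshot per commit. This is not how Delta Lake stores data (it relies on a transaction log over Parquet files); the `VersionedTable` class and its methods are hypothetical names invented for this sketch.

```python
import copy


class VersionedTable:
    """Toy table that keeps one full snapshot of its rows per commit."""

    def __init__(self, key):
        self.key = key
        self.snapshots = []  # snapshots[n] is the table as of version n

    def merge(self, records):
        """Upsert: update rows whose key matches, insert the rest."""
        state = copy.deepcopy(self.snapshots[-1]) if self.snapshots else {}
        for record in records:
            state[record[self.key]] = dict(record)
        self.snapshots.append(state)

    def get(self, key_value, version=None):
        """Read the latest row, or the row as of a given version."""
        index = -1 if version is None else version
        return self.snapshots[index].get(key_value)


products = VersionedTable(key="product_id")
products.merge([{"product_id": 1, "product_price": 59.98}])  # version 0
products.merge([{"product_id": 1, "product_price": 69.98}])  # version 1

latest = products.get(1)               # like the plain SELECT
original = products.get(1, version=0)  # like VERSION AS OF 0

print(latest["product_price"], original["product_price"])  # 69.98 59.98
```

The point of the sketch is that the rows themselves carry no version column; the version belongs to the table's commit history.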

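Conclusion

This blog post highlighted some common patterns for handling SCD Type 2 and illustrated how easy it is to implement those patterns using the Slowly Changing Dimension (SCD) processor in StreamSets Transformer.

If you’d like to learn more about StreamSets Transformer, here are some helpful resources to get you started: Product Overview | Technical Documentation | Overview Video | Datasheet | Blogs.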
