CDC in BODS
See more in our ebook
Changed-Data Capture
This post introduces the concept of changed-data capture (CDC). You use CDC techniques to identify changes in a source table at a given point in time (such as since the previous data extraction). CDC captures changes such as inserting a row, updating a row, or deleting a row. CDC can involve variables, parameters, custom (user-defined) functions, and scripts.
Exercise overview
You will create two jobs in this exercise. The first job (Initial) initially loads all of the rows from a source table. You will then introduce a change to the source table. The second job (Delta) identifies only the rows that have been added or changed and loads them into the target table. You will create the target table from a template.
Both jobs contain the following objects.
• An initialization script that sets values for two global variables: $GV_STARTTIME and $GV_ENDTIME
• A data flow that loads only the rows with dates that fall between $GV_STARTTIME and
$GV_ENDTIME
• A termination script that updates a database table that stores the last $GV_ENDTIME
The source table in this exercise, Customer, contains a column called Cust_timestamp that a user would modify when creating or updating a row. The job checks this datetime, determines if it is later than the last time the job ran, and if so, passes only that row to the target template table.
The target database contains a job status table called CDC_time that stores the last value of $GV_ENDTIME. When you execute the delta-load job, it updates that value to use for the next execution.
Therefore, this post introduces the following new concepts:
• Changed-data capture
• Global variables
• Scripts
• Template tables
Adding and defining the initial-load job
First create the job that initially loads the target table.
JOB_CDC_Initial will contain an initialization script, a data flow, and a termination script.
Adding the job and defining global variables
Variables are symbolic placeholders for values. You can increase the flexibility and reusability of work flows and data flows using variables when you design your jobs.
In this exercise, you will create two global variables that serve as placeholders for time stamps (a start time and an end time). These time stamps will identify which rows in a source have been updated and therefore will be captured as being changed.
Global variables are exclusive within the context of the job in which they are created. Add and define the job and job-level global variables as follows.
Create a variable-
Click Tools > Variables.
The "Variables and Parameters" window appears. Notice that the job name displays in the Context box.
Right-click Global Variables and click Insert.
Data Services adds a variable with an automatically generated name. A focus box appears around the name cell and the cursor shape changes to an arrow with a yellow pencil.
Click the name cell to edit the name of the new global variable. Name the variable $GV_STARTTIME.
Click the data type cell and select datetime.
Repeat this procedure to create the global datetime variable $GV_ENDTIME.
Close the "Variables and Parameters" window.
To add and define the work flow
1. With the JOB_CDC_Initial definition open in the workspace, add a work flow named
WF_CDC_Initial.
2. Open the work flow definition for WF_CDC_Initial by clicking its name in the project area.
3. Click the script icon on the tool palette and place a script on the left side of the workspace. Name it SET_START_END_TIME.
4. Add a data flow to the right of the script and name it DF_CDC_Initial.
5. Add another script to the right of DF_CDC_Initial and name it UPDATE_CDC_TIME_TABLE.
6. Connect the two scripts to DF_CDC_Initial.
To define the set and update time scripts
When typing scripts, be sure to follow the date syntax required by your database.
1. In the object library, open the Customer source table by double-clicking it. Click the View Data tab.
To view the entire table, click the Open in new window icon.
Notice the oldest timestamp in the Cust_timestamp column is 2008.03.27 00:00:00.
2. In the project area, click WF_CDC_Inital to go back to its definition in the workspace.
3. Click the SET_START_END_TIME script's name to open the script editor. This script defines the initial start and end times. To capture all rows in the source, set the start time global variable as datetime 2008.01.01 00:00:000. Set the end time global variable as datetime sysdate (the current datetime).
For example:
$GV_STARTTIME = '2008.01.01 00:00:000'; $GV_ENDTIME = sysdate();
4. Click the Validate icon to validate the script.
5. Close the script editor for SET_START_END_TIME.
6. Click the UPDATE_CDC_TIME_TABLE script name to open the script editor. This script resets the $GV_ENDTIME value in the CDC_time table.
For example for MS SQL Server, type the following:
sql('Target_DS', 'DELETE FROM DBO.CDC_TIME'); sql('Target_DS', 'INSERT INTO DBO.CDC_TIME VALUES ({$GV_ENDTIME})');
Note:
Ensure that the user name in your script matches the user name defined for the table.
For example for Oracle, type the following:
sql('Target_DS', 'DELETE FROM TARGET.CDC_TIME'); sql('Target_DS', 'INSERT INTO TARGET.CDC_TIME VALUES (to_date({$GV_ENDTIME},'YYYY.MM.DD HH24:MI:SS'))');
7. Click the Validate icon to validate the script. The warning message that data types will be converted is acceptable (Data Services preserves the data type in the output schema).
Defining the data flow
Next define the data flow. The target table for this data flow will be a template table. When you use a template table, you do not have to specify the table's schema or import metadata. Instead, during job execution, Data Services has the DBMS create the table with the schema defined by the data flow.
Template tables appear in the object library under each datastore. To add a template table to a target datastore, drag the template table icon from that datastore to the workspace and name it.
1. In the project area, click the name of data flow DF_CDC_Initial.
2. In the object library from the ODS_DS datastore, drag the Customer table to the workspace and
click Make Source.
3. Add a query to right of the source table and name it QryCDC.
4. From the Target_DS datastore, drag the Template Tables icon to the right of the query in the workspace.
5. In the "Create template" dialog box, name the template table CUST_CDC.
6. Connect the source, query, and target together.
7. Click the name of the target table CUST_CDC to open its definition.
8. Click the Options tab.
9. Select the check box Delete data from table before loading.
10. Click the Back arrow on the tool bar to close the target table definition and return to the data flow view.
To define the data flow query
1. In the project area, click QryCDC to open the query editor.
2. Drag the following columns from the Schema In pane to the Schema Out pane:
CUST_ID
CUST_CLASSF
NAME1
ZIP
CUST_TIMESTAMP
3. On the Where tab, type:
(ODS_CUSTOMER.CUST_TIMESTAMP >= $GV_STARTTIME) and (ODS_CUSTOMER.CUST_TIMESTAMP <= $GV_ENDTIME)
Note:
You can drag the column CUSTOMER.CUST_TIMESTAMP from Schema In to the Where tab. When you type $, a list appears where you can select the $GV_STARTTIME and $GV_ENDTIME variable names.
4. Click the Validate icon on the toolbar to validate the query.
5. Correct any errors and close the "Output" window.
6. In the project area, click JOB_CDC_Initial.
7. Click the Validate All icon on the toolbar and correct any errors. As before, warning messages are ok.
8. Click the Save All toolbar icon to save the job.
9. Close the job, work flow, data flow, template table target editor, and query editor windows.
Adding and defining the delta-load job
To build the delta-load job more quickly, you will first replicate the initial-load data flow and modify it.
To add the delta-load data flow
1. In the object library, click the Data Flows tab.
2. Right-click DF_CDC_Initial and click Replicate.
3. Right-click the copy, click Rename, and name the data flow DF_CDC_Delta.
4. Double-click the name of the data flow in the object library to open it in the workspace.
5. Click the name of the target table CUST_CDC.
6. Click the Options tab.
7. Clear the check boxes for the Delete data from table before loading and Drop and re-create table options.
To add the job and define the global variables
1. In the project area, right-click the Class_Exercises project and click New Batch Job.
2. Name the job JOB_CDC_Delta.
3. Open JOB_CDC_Delta by clicking its name in the project area.
4. Click Tools > Variables.
The "Variables and Parameters" window appears. Notice that the job name displays in the Context box.
5. Right-click Global Variables and click Insert.
Data Services adds a variable with an automatically generated name. A focus box appears around the name cell and the cursor shape changes to an arrow with a yellow pencil.
6. Click the name cell to edit the name of the new global variable. Name the variable $GV_STARTTIME.
7. Click the data type cell and select datetime.
8. Repeat this procedure to create the global datetime variable $GV_ENDTIME.
9. Close the "Variables and Parameters" window.
To add and define the work flow
1. Add a work flow to JOB_CDC_Delta and name it WF_CDC_Delta.
2. Click the name of the work flow in the project area to open it.
3. Click the script icon in the tool palette and add it to the work flow. Name the script
SET_NEW_START_END_TIME.
4. From the object library Data Flows tab, drag DF_CDC_Delta to the work flow.
5. Add another script to the right of DF_CDC_Delta and name it UPDATE_CDC_TIME_TABLE.
6. Connect the scripts and data flow together.
To define the scripts
1. Click the name of the SET_NEW_START_END_TIME script to open the script editor.
2. For the delta-load job, you define the start-time global variable to be the last time stamp recorded in the CDC_time table.
For example for MS SQL Server:
$GV_STARTTIME = to_date(sql('Target_DS', 'SELECT LAST_TIME FROM DBO.CDC_TIME'), 'YYYY-MM-DD HH24:MI:SS'); $GV_ENDTIME = sysdate();
For example for Oracle:
$GV_STARTTIME = sql('Target_DS', 'SELECT to_char(LAST_TIME,'YYYY.MM.DD HH24:MI:SS') FROM TARGET.CDC_TIME'); $GV_ENDTIME = sysdate();
3. Click the UPDATE_CDC_TIME_TABLE script's name to open the script editor. This script resets the $GV_ENDTIME value in the CDC_time table.
For example for MS SQL Server:
sql('Target_DS', 'UPDATE DBO.CDC_TIME SET LAST_TIME ={$GV_ENDTIME}');
For example for Oracle:
sql('Target_DS', 'INSERT INTO TARGET.CDC_TIME VALUES (to_date({$GV_ENDTIME},'YYYY.MM.DD HH24:MI:SS'))');
4. Click the job name and click the Validate All icon. Correct any errors (warnings are ok).
5. Click the Save All toolbar icon and click OK to save the job.
Executing the jobs
First you will execute the initial-load job and view the results to verify that the job returned all rows from the source. Then you will open the source table and introduce a change. Finally, you will run the delta-load job to extract only the changes and update the target.
See more in our ebook
Comments
Post a Comment