FTP copy and load customers extract


Load the customers data into a data warehouse according to the business requirements.
The customer details change frequently in the source system, and we want to reflect those changes in the data warehouse. We also want to be able to keep track of those changes.

Scenario overview and details

The customers data is extracted from the source system on a monthly basis and placed on an FTP server. The ETL process needs to fetch the file locally and load it into the data warehouse. There are several possible scenarios which need to be handled in the process:
- a new customer is added
- a customer already exists and needs to be updated
- an existing customer remains unchanged
- a record may be invalid
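The first step, getting the monthly extract from the FTP server onto the ETL server, could be sketched as follows. This is a minimal Python illustration using the standard `ftplib` module; the host, credentials, and file names are placeholders, not values from the tutorial.

```python
from ftplib import FTP

def fetch_extract(host, user, password, remote_name, local_name):
    """Download the monthly customers extract from the FTP server
    to a local file on the ETL server (binary transfer)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_name, "wb") as f:
            # RETR streams the remote file; each chunk is written locally
            ftp.retrbinary("RETR " + remote_name, f.write)
```

In a production flow this step is usually wrapped in a shell script or a job scheduler step, with error handling and an archive of already-loaded extracts.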

Additionally, we want to keep track of the changes made to the customers data and be able to see all the changes to a given record over time. To handle historical data we will use a Type 4 Slowly Changing Dimension. In SCD Type 4, the changed or deleted data is stored in a separate history table. A timestamp and a change index are added to handle records which change more than once.
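The SCD Type 4 mechanics described above can be sketched in a few lines of Python. The in-memory dictionaries stand in for the D_CUSTOMER and D_CUSTOMER_HIST tables, and the field names are illustrative assumptions, not the real table layout.

```python
from datetime import datetime

# Stand-ins for the warehouse tables (illustrative only)
d_customer = {}       # cust_id -> current record (non-key fields)
d_customer_hist = []  # historical versions with timestamp and change index

def apply_scd4_change(cust_id, new_record, load_ts=None):
    """SCD Type 4: move the current version of a changed record into the
    history table, then overwrite it in the main table."""
    load_ts = load_ts or datetime.now()
    old = d_customer.get(cust_id)
    if old is not None and old != new_record:
        # change index = number of versions already archived for this key + 1,
        # so a record changed more than once stays distinguishable
        change_idx = sum(1 for h in d_customer_hist
                         if h["cust_id"] == cust_id) + 1
        d_customer_hist.append({"cust_id": cust_id,
                                "change_idx": change_idx,
                                "load_ts": load_ts,
                                **old})
    d_customer[cust_id] = new_record
```

Note that the main table always holds only the current version, which keeps queries against D_CUSTOMER simple; the history table grows one row per change.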

Input data:
- D_CUSTOMER - data warehouse table with customers
- Dwh_cust_extract_1705.txt - customers extract for the current month

Output data:
- D_CUSTOMER - the updated table, which stores only current records.
- D_CUSTOMER_HIST - a table with historical data for the customers dimension. It stores deleted records and previous versions of customers that have been updated.
- Cust_errors_170523.txt - a log file with loading errors

Proposed solution

The design of an ETL process flow for the customers loading will be as follows:

  • A text file with customer extract is generated by a source system and placed on an FTP server
  • The file is retrieved to an ETL server
  • The customer extract is loaded into a temporary table
  • The existing DW customers table is loaded into a temporary lookup file.
  • Each record from the customers extract is validated and looked up against the existing customers. The transform needs to apply the following rules:
       a. If a record is malformed and does not pass validation, it is redirected to a reject flow.
       b. If the lookup finds no match, the record represents a new customer and it needs to be inserted into the customers table.
       c. If the lookup finds a match, we need to compare the non-key fields to check whether the customer details have changed. There are two options: either all fields remain the same (then we leave the record as it is and proceed to the next one), or a field has changed. In that case the current record needs to be inserted into the historical table and replaced by the new one in the main customers table.
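The validation and lookup rules a-c above amount to classifying each extract record into one of four outcomes. A minimal Python sketch, assuming hypothetical field names (`cust_id` as the key, `name`/`city` as non-key fields), might look like this:

```python
def classify_record(record, lookup, required=("cust_id", "name")):
    """Classify one extract record against the current customers lookup.

    Returns 'reject', 'insert', 'unchanged', or 'update'.
    Field names are illustrative; the real extract layout may differ.
    """
    # rule a: a malformed record (missing or empty required fields)
    # goes to the reject flow
    if any(not record.get(f) for f in required):
        return "reject"
    existing = lookup.get(record["cust_id"])
    # rule b: no match in the lookup -> a new customer to insert
    if existing is None:
        return "insert"
    # rule c: match found -> compare the non-key fields
    non_key = {k: v for k, v in record.items() if k != "cust_id"}
    return "unchanged" if non_key == existing else "update"
```

Records classified as "update" would then go through the SCD Type 4 step: archive the current row to D_CUSTOMER_HIST and replace it in D_CUSTOMER; "insert" rows are loaded directly; "reject" rows are written to the error log file.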


    Implementations of the customers loading ETL process in various environments:
  • Teradata MultiLoad and an FTP shell script to load customers extract
  • Pentaho PDI 3.0 (Kettle)
  • SAS / BASE