Data warehouse for monitoring and analyzing site traffic

Goal

Design an ETL process to populate a data warehouse structure and data marts with traffic statistics for the ETL Tools Info portal and the other sites owned by the GoliInfo company. The data needs to be flexible and easily accessible for reporting and OLAP analysis.

Scenario overview and details

GoliInfo currently owns seven different web portals which are hosted on one server with a common logging database. One of the company's revenue streams is a cooperation with an advertising company that publishes ads and banners on the sites.

Each site visit is logged on the web hosting server in a MySQL database and contains fairly detailed information. The advertising partner also provides input for the data marts: data about page impressions, user visits and the revenue they generate.

It is possible to generate simple reports directly from the database using SELECT statements, but this is slow, the information is incomplete and the reporting is not flexible.

In order to increase the revenue generated by the websites, there is a need to analyze the traffic in more detail and find the sites which generate most of the revenue.

Input data:
- Site traffic log - an extract generated on the web server by a PHP script. It takes a start date and an end date as parameters, and those dates are reflected in the filename (a filename-parsing sketch follows this list)
- A CSV summary file with the advertising campaign data, provided via email
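
The tutorial does not specify the extract's naming convention, only that the start and end dates appear in the filename. The sketch below assumes a hypothetical pattern of traffic_YYYYMMDD_YYYYMMDD.csv and shows how the date range could be recovered from it:

    # Minimal sketch: recover the date range encoded in an extract filename.
    # The pattern "traffic_<start>_<end>.csv" is an assumption for illustration;
    # adjust the regular expression to whatever the PHP export script produces.
    import re
    from datetime import datetime, date

    FILENAME_RE = re.compile(r"traffic_(\d{8})_(\d{8})\.csv$")

    def parse_extract_dates(filename: str) -> tuple[date, date]:
        """Return the (start, end) dates encoded in an extract filename."""
        match = FILENAME_RE.search(filename)
        if match is None:
            raise ValueError(f"unexpected extract filename: {filename}")
        start, end = (datetime.strptime(p, "%Y%m%d").date() for p in match.groups())
        return start, end

    print(parse_extract_dates("traffic_20240101_20240107.csv"))
    # -> (datetime.date(2024, 1, 1), datetime.date(2024, 1, 7))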

Output data:
- Data Warehouse fact tables with the processed data, based on the time of the visit.
- A few Data Marts which contain aggregations and are subject-oriented. The subjects are: portal, visiting host, visited page, referrer, time of day, top revenue generators, etc. (an illustrative schema sketch follows this list)
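
The tutorial leaves the exact warehouse schema open. Purely as an illustration of the structures named above - a fact table at the grain of a single visit plus subject-oriented dimensions - the following sketch uses assumed table and column names:

    # Illustrative star-schema layout for the site traffic warehouse.
    # All class, table and column names are assumptions, not the tutorial's schema.
    from dataclasses import dataclass

    @dataclass
    class DimPortal:              # one row per GoliInfo portal
        portal_key: int           # surrogate key
        portal_name: str

    @dataclass
    class DimTime:                # one row per date/hour bucket
        time_key: int
        visit_date: str           # e.g. "2024-01-01"
        hour_of_day: int

    @dataclass
    class FactSiteTraffic:        # grain: one logged page visit
        time_key: int             # FK to DimTime
        portal_key: int           # FK to DimPortal
        host_key: int             # FK to a visiting-host dimension
        page_key: int             # FK to a visited-page dimension
        referrer_key: int         # FK to a referrer dimension
        page_impressions: int
        ad_revenue: float         # joined in from the advertising partner's CSV

The data marts are then aggregations of the fact table grouped by one of these subjects.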

ETL Process step by step

Extraction

Two source extracts feed the process:

  • The extraction process is invoked by a PHP script which places the extract in a specific folder; once the extraction is finished, it prints the filename in the web browser (a download sketch follows this list).
  • The ads CSV file provided by the advertising company is sent via email. It is generated on request, so this step involves manual intervention.
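
Since the script only prints the name of the file it has produced, the download can be scripted. In the sketch below the endpoint URL, the extract folder and the request parameters are all hypothetical; the response is assumed to contain nothing but the filename:

    # Sketch of automating the traffic-log extract download.
    import urllib.parse
    import urllib.request

    EXPORT_URL = "https://example.com/admin/export_traffic.php"   # hypothetical endpoint
    EXTRACT_DIR = "https://example.com/extracts/"                 # hypothetical folder

    def fetch_extract(start_date: str, end_date: str, local_path: str) -> str:
        # 1. Invoke the export script; it replies with the name of the file it created.
        query = urllib.parse.urlencode({"start": start_date, "end": end_date})
        with urllib.request.urlopen(f"{EXPORT_URL}?{query}") as response:
            filename = response.read().decode().strip()
        # 2. Download the extract from the folder the script placed it in.
        urllib.request.urlretrieve(EXTRACT_DIR + filename, local_path)
        return filename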

Transformation

  • The key column for the extracts is a datetime value. As the files are generated in different time zones, the time value needs to be adjusted in one of the extracts.
  • The extracts are joined together and assigned the data warehouse surrogate keys.
  • Data quality tasks are performed to prepare string data for analysis: string operations on the host name and referrer, removing unneeded characters and trimming overly long values (see the transformation sketch after this list).
  • Other business rules are applied in this step.
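
A minimal sketch of these transformation rules, assuming the ads extract is logged in CET while the traffic log uses UTC, and using an in-memory dictionary as a stand-in for the warehouse surrogate-key lookup; the field names and the one-hour offset are assumptions:

    from datetime import datetime, timedelta
    from itertools import count

    CET_OFFSET = timedelta(hours=1)          # assumed offset between the two extracts
    _key_sequence = count(1)
    _surrogate_keys: dict[tuple[str, str], int] = {}

    def adjust_timezone(visit_time: datetime) -> datetime:
        """Align the ads extract's timestamps with the traffic log's time zone."""
        return visit_time - CET_OFFSET

    def clean_string(value: str, max_length: int = 255) -> str:
        """Trim whitespace, drop non-printable characters and cap overly long values."""
        cleaned = "".join(ch for ch in value.strip() if ch.isprintable())
        return cleaned[:max_length].lower()

    def surrogate_key(dimension: str, natural_key: str) -> int:
        """Return a stable surrogate key for a dimension value, creating one if new."""
        key = (dimension, natural_key)
        if key not in _surrogate_keys:
            _surrogate_keys[key] = next(_key_sequence)
        return _surrogate_keys[key]

    row = {"host": "  Crawler-01.Example.NET ", "referrer": " https://example.org/page?x=1 "}
    print(surrogate_key("host", clean_string(row["host"])))          # e.g. 1
    print(surrogate_key("referrer", clean_string(row["referrer"])))  # e.g. 2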

Loading steps

  1. Load technical tables - tables with surrogate keys and reference tables.
  2. Fill in the main data warehouse fact table and update the dimensions.
  3. Calculate and generate the site traffic data marts (a loading-order sketch follows this list).
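
In the tutorial the loading is done with Pentaho Kettle transformations. The sketch below only illustrates the ordering of the three steps against a stand-in SQLite database; all table and column names are assumed:

    import sqlite3

    def load_warehouse(conn: sqlite3.Connection, staged_rows: list[tuple]) -> None:
        """staged_rows: (portal_name, visit_hour, page_impressions, ad_revenue) tuples."""
        cur = conn.cursor()
        # Step 1: technical/reference tables first, so the fact load can resolve its keys.
        cur.execute("CREATE TABLE IF NOT EXISTS dim_portal "
                    "(portal_key INTEGER PRIMARY KEY, portal_name TEXT UNIQUE)")
        cur.executemany("INSERT OR IGNORE INTO dim_portal (portal_name) VALUES (?)",
                        {(portal,) for (portal, *_rest) in staged_rows})
        # Step 2: fill the main fact table, resolving dimension keys on the way.
        cur.execute("CREATE TABLE IF NOT EXISTS fact_site_traffic "
                    "(portal_key INTEGER, visit_hour INTEGER, "
                    " page_impressions INTEGER, ad_revenue REAL)")
        cur.executemany("INSERT INTO fact_site_traffic "
                        "SELECT portal_key, ?, ?, ? FROM dim_portal WHERE portal_name = ?",
                        [(hour, impressions, revenue, portal)
                         for (portal, hour, impressions, revenue) in staged_rows])
        # Step 3: derive a subject-oriented data mart (traffic and revenue per portal/hour).
        cur.execute("CREATE TABLE IF NOT EXISTS dm_portal_hourly AS "
                    "SELECT portal_key, visit_hour, "
                    "       SUM(page_impressions) AS impressions, "
                    "       SUM(ad_revenue) AS revenue "
                    "FROM fact_site_traffic GROUP BY portal_key, visit_hour")
        conn.commit()

    conn = sqlite3.connect(":memory:")
    load_warehouse(conn, [("etl-tools.info", 14, 120, 3.50),
                          ("etl-tools.info", 15, 80, 2.10)])
    print(conn.execute("SELECT * FROM dm_portal_hourly").fetchall())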

Information delivery

  • Report generation - scheduled batch report generation and on-demand reporting. The reports may be accessible on a web portal, sent via email or distributed across the organization.
  • Feeding crosstabs and OLAP cubes, which are accessible via a web browser (a crosstab sketch follows this list).
  • Data in the data marts is available for data mining analysis.
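
As an illustration of the crosstab idea, the sketch below pivots a small, made-up slice of a portal/hour data mart with pandas; in the tutorial the cubes are served by a separate OLAP layer, so this is only a stand-in and the column names are assumptions:

    import pandas as pd

    # A tiny, fabricated slice of the portal/hour data mart.
    mart = pd.DataFrame(
        [("etl-tools.info", 14, 120, 3.5),
         ("etl-tools.info", 15, 80, 2.1),
         ("other-portal.example", 14, 40, 0.9)],
        columns=["portal", "hour_of_day", "impressions", "revenue"],
    )

    # Crosstab of ad revenue by portal and hour of day, ready to be published
    # on the reporting portal or sent out in a scheduled batch.
    crosstab = mart.pivot_table(index="portal", columns="hour_of_day",
                                values="revenue", aggfunc="sum", fill_value=0)
    print(crosstab)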

Implementation

ETL process data flow for loading the Site Traffic Data Warehouse (Pentaho Kettle Spoon)




