Processing a header and trailer textfile in SAS / BASE



Example case

The input file in our scenario is a stream of records representing invoices. It is a text file in a header-trailer format and has the following structure:
- A header starts with a letter H which is followed by an invoice number, a customer number, a date and currency
- Every item starts with a letter I which is followed by product ID, product age, quantity and net value
- Trailer starts with a T and contains two values and may be used for validation: total number of invoice lines and total net value.

SAS solution



SAS is a powerful application when dealing with unusually structured files. To prove this we will use only 2 simple SAS Base scripts to process the input file organized in header and trailer format.

  • In the first step the entire extract while the first character (letter H, I or T) is separated from other strings in the line

    /*

    Read the entire header and trailer input file into a ht_input dataset

    Two fields will be filled:

    - one with type of record (kind)

    - other with a complete dataline (starting from the second character)

    */

    data ht_input;

    format kind $1.;

    format dataline $256.;

     

    /*

    Despite the fact that there are no commas in the extract, we use it as a separator to tell SAS not to use a default blank space as a separator

    */

    infile 'D:\data\etl-examples\in_invoices.txt' MISSOVER dlm=',';

    input kind $ 1 dataline $;

    run;



    SAS header and trailer input loaded into a dataset:
    SAS header and trailer input loaded into a dataset


  • Now the loaded header and items dataset will be processed in the way that records will be placed in three outputs simultaneously: headers, items and trailers
  • RETAIN SAS statement will be used to save appropriate invoice header IDs accross all the input lines. We will also use it to assign corresponding invoice lines.

    * Columns definition for 3 output sas datasets: headers, items and trailers ;

     

    data headers(keep=invc_header_id cust_id dte crncy)

         items (keep=invc_header_id invc_line_id

    prod_id prod_age quantity net_value)

         trailers (keep=invc_header_id no_lines total_net_value);

     

    set ht_input;

     

    * format records ;

          format invc_header_id $8.;

          format cust_id $6.;

          format prod_id $8.;

          format dte $8.;

          format crncy $3.;

          format prod_age 2.0;

          format quantity 7.0;

          format net_value 8.2;

          format no_lines 2.0 ;

          format total_net_value 8.2;

     

          * processing in case the dataline is a header ;

          IF KIND = "H" THEN

          do;

          * populate fields;

          invc_header_id = substr(dataline,1,8);

          cust_id = substr(dataline,9,6);

          dte = substr(dataline,15,8);

          crncy = substr(dataline,23,3);

     

          * create record in the headers dataset;

          output headers;

          invc_line_id = 0; * reset invoice lines counter ;

          end;

     

          * invoice line processing (detail, item);

          IF KIND = "I" THEN

          do;

          * increase invoice line sequence;

          invc_line_id = invc_line_id + 1;

     

          * populate fields;

          prod_id = substr(dataline,1,7);

          prod_age = substr(dataline,8,2);

          quantity = substr(dataline,11,7);

          net_value = substr(dataline,19,11);

     

          * create new record in the items dataset;

          output items;

     

    * retain will make it possible to remember the current invoice line ID in the next processing loop;

          retain invc_line_id ;

          end;

     

          * process trailer ;

          IF KIND = "T" THEN

          do;

          no_lines = substr(dataline,1,2);

          total_net_value = substr(dataline,3,11);

     

    * create new trailer record;

          output trailers;

          end;

     

    * The processed invoice header ID needs to be stored for the next loop ;

    retain invc_header_id;

    run;



    Header and trailer format input file processed in SAS:
    Header and trailer format input file processed in SAS


  • Header and trailer example main page



    Back to the Data Warehousing tutorial home