Processing a header and trailer textfile in SAS / BASE
Example case
The input file in our scenario is a stream of records representing invoices. It is a text file in a header-trailer format and has the following structure:
- A header starts with a letter H which is followed by an invoice number, a customer number, a date and currency
- Every item starts with a letter I which is followed by product ID, product age, quantity and net value
- Trailer starts with a T and contains two values and may be used for validation: total number of invoice lines and total net value.
SAS solution
SAS is a powerful application when dealing with unusually structured files. To prove this we will use only 2 simple SAS Base scripts to process the input file organized in header and trailer format.
/*
Read the entire header and trailer
input file into a ht_input dataset
Two fields will be filled:
- one with
type of record (kind)
- other with
a complete dataline (starting from the second
character)
*/
data ht_input;
format kind $1.;
format dataline $256.;
/*
Despite the fact that there are no
commas in the extract, we use it as a separator to tell SAS not to use a default
blank space as a separator
*/
infile 'D:\data\etl-examples\in_invoices.txt' MISSOVER
dlm=',';
input kind $ 1
dataline $;
run;
* Columns definition for 3 output sas datasets: headers, items and trailers
;
data headers(keep=invc_header_id
cust_id dte crncy)
items (keep=invc_header_id invc_line_id
prod_id prod_age
quantity net_value)
trailers (keep=invc_header_id no_lines total_net_value);
set ht_input;
* format
records ;
format invc_header_id
$8.;
format cust_id $6.;
format prod_id $8.;
format dte $8.;
format crncy $3.;
format prod_age 2.0;
format quantity 7.0;
format net_value 8.2;
format no_lines 2.0 ;
format total_net_value
8.2;
* processing in case the dataline is a header ;
IF KIND = "H" THEN
do;
* populate
fields;
invc_header_id = substr(dataline,1,8);
cust_id = substr(dataline,9,6);
dte = substr(dataline,15,8);
crncy = substr(dataline,23,3);
* create
record in the headers dataset;
output headers;
invc_line_id = 0; *
reset invoice lines counter ;
end;
* invoice line processing (detail,
item);
IF KIND = "I" THEN
do;
* increase
invoice line sequence;
invc_line_id = invc_line_id
+ 1;
* populate
fields;
prod_id = substr(dataline,1,7);
prod_age = substr(dataline,8,2);
quantity = substr(dataline,11,7);
net_value = substr(dataline,19,11);
* create
new record in the items dataset;
output items;
* retain will make it
possible to remember the current invoice line ID in the next processing loop;
retain invc_line_id ;
end;
* process
trailer ;
IF KIND = "T" THEN
do;
no_lines = substr(dataline,1,2);
total_net_value = substr(dataline,3,11);
* create
new trailer record;
output trailers;
end;
* The processed invoice header ID
needs to be stored for the next loop ;
retain
invc_header_id;
run;