ETL – Extract Transform and Load

Extract Transform Load

There are several common use case scenarios where we need to EXTRACT data from a given source and TRANSFORM the data that we extracted so that it can be LOADED into a database or another system. In this article, I am going to talk about a CSV upload and a few of the considerations one needs to make to Extract, Transform and Load the CSV data into a database.

Extract

To put it in simple words, a CSV file is a plain text file with data separated by commas, tabs, or semi-colons. When a spreadsheet application opens a CSV, it displays the data as a table. Independently of your choice of language (Java, Python, JavaScript, PHP, etc), you will find a CSV library code to deal with the data.

Below an example of a comma-separated CSV data:

"Date","Pupil","Grade"
"25 May","Bloggs, Fred","C"
"25 May","Doe, Jane","B"
"15 July","Bloggs, Fred","A"
"15 April","Muniz, Alvin ""Hank""","A"

Transform

Usually, data will have to be transformed, for example by combining two columns into a single one (first name + middle name), or some data needs to be split into several individual pieces, i.e. when addresses come as a single cell record but displayed into several lines. The two tables below show an example of this case.

Here we can already predict some difficulties as a UK address can vary from two to six lines in length. Some clever code needs to be placed to “guess” what each line actually means, which is not in scope for this article.

Load

To keep performance high, the CSV is read line by line and the data in every line is inserted into a database or printed onto the screen so that the user can select one or more roles to be imported. You want to process the single row straight away and discard it from memory as soon as you finish.

In many cases, the first row contains or displays column names, which can then be used to match pre-existent data in the system. This allows flexibility when parsing the data in the order it is presented.


A more complex case is described in this case study, where the data is requested from the source system, then transformed before finally loaded into the destination system.