plsql

Parsing a string with a CSV into multiple columns

Posted on Updated on

The Problem

We have a lot of data to process (lots of rows), while one of the columns contains a character string that contains a concatenation of values. The values are separated by a comma, or some different separator character.

An example could be 1,2,55,3 or 55_63_88_42_22

We want to split this string into individual columns. The typical approach would be to use a combination of INSTR and SUBSTR calls as follows:

SELECT
  SUBSTR(bins,1,INSTR(bins,',',1,1) - 1) AS bin_1,
  SUBSTR(bins,INSTR(bins,',',1,1) + 1,INSTR(bins,',',1,2) - INSTR(bins,',',1,1) - 1) AS bin_2,
  SUBSTR(bins,INSTR(bins,',',1,2) + 1,INSTR(bins,',',1,3) - INSTR(bins,',',1,2) - 1) AS bin_3,
  SUBSTR(bins,INSTR(bins,',',1,3) + 1,INSTR(bins,',',1,4) - INSTR(bins,',',1,3) - 1) AS bin_4,
  SUBSTR(bins,INSTR(bins,',',1,4) + 1) AS bin_5
FROM
  (
    SELECT '2,33,55,43,32' AS bins FROM DUAL
  ) m

This works nicely, but the problem is with processing huge amounts of data. SUBSTR, and INSTR are already slow parsers, but for above to work, they need to start searching every time from the beginning of the String (position 1). Additionally, it is complex to repeat this for every column, and so makes it error prone.

Attempt 1: Hide complexity in a Stored Procedure

CREATE OR REPLACE FUNCTION GET_ELEMENT(source_string VARCHAR2,element_index PLS_INTEGER,separator CHAR := ',')
RETURN VARCHAR2
AS
  pos1 PLS_INTEGER := CASE element_index WHEN 1 THEN 0 ELSE INSTR(source_string,separator,1,element_index-1) END;
  pos2 PLS_INTEGER := INSTR(source_string,separator,pos1+1,1);
BEGIN
  CASE
    WHEN pos1 = 0 AND element_index > 1 THEN RETURN NULL;
    WHEN pos2 = 0 THEN RETURN SUBSTR(source_string,pos1+1);
    ELSE RETURN SUBSTR(source_string,pos1+1,pos2-1-pos1);
  END CASE;
END;

This is already a nice first attempt. When looking for each ending separator (pos2), it starts at the position of the beginning separator (pos1). However the calculation of each beginning separator (pos1) always starts at position 1. So for bulk operations, this is again going to cause overhead in processing.

But as can be seen in SQL below, it also simplified the parsing by a lot:

SELECT
  get_element(bins,1) AS bin_1,
  get_element(bins,2) AS bin_2,
  get_element(bins,3) AS bin_3,
  get_element(bins,4) AS bin_4,
  get_element(bins,5) AS bin_5,
  get_element(bins,6) AS bin_6
FROM
  (
    SELECT '2,33,55,43,32' AS bins FROM DUAL
  ) m;

Attempt 2: Fully Iterative Function

CREATE OR REPLACE PACKAGE splitter AS
  TYPE n5_r IS RECORD (n1 VARCHAR2(50),n2 VARCHAR2(50),n3 VARCHAR2(50),n4 VARCHAR2(50),n5 VARCHAR2(50));
  TYPE n5_t IS TABLE OF n5_r;
  FUNCTION split_n5(source_string IN VARCHAR2,separator IN CHAR) RETURN n5_t PIPELINED;
END splitter;
/

CREATE OR REPLACE PACKAGE BODY splitter AS
  FUNCTION safe_instr(source_string IN VARCHAR2,separator IN CHAR,start_position IN PLS_INTEGER) RETURN PLS_INTEGER IS
    pos PLS_INTEGER := INSTR(source_string,separator,start_position,1);
  BEGIN
    RETURN CASE pos WHEN 0 THEN NULL ELSE pos END;
  END;

FUNCTION safe_substr(source_string IN VARCHAR2,start_position IN PLS_INTEGER,next_position IN PLS_INTEGER) RETURN PLS_INTEGER IS
  BEGIN
    IF next_position IS NULL THEN RETURN SUBSTR(source_string,start_position); END IF;
    RETURN SUBSTR(source_string,start_position,next_position-start_position);
  END;

FUNCTION split_n5(source_string IN VARCHAR2,separator IN CHAR) RETURN n5_t PIPELINED IS
    pos1 PLS_INTEGER := 0;
    pos2 PLS_INTEGER;
    n n5_r;
  BEGIN
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n1 := safe_substr(source_string,pos1+1,pos2);
    pos1 := pos2;
    --
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n2 := safe_substr(source_string,pos1+1,pos2);
    pos1 := pos2;
    --
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n3 := safe_substr(source_string,pos1+1,pos2);
    pos1 := pos2;
    --
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n4 := safe_substr(source_string,pos1+1,pos2);
    pos1 := pos2;
    --
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n5 := safe_substr(source_string,pos1+1,pos2);
    --
    PIPE ROW(n);
  END;
END splitter;
/

Now every next delimiter is always based on the position of the previous delimiter. This will perform well for bigger batches of data. It’s use is however a little trickier, because we are dealing with a PIPELINED function. It requires to be wrapped inside a TABLE( ... ) function:

SELECT /*+ CARDINALITY(bin 1) USE_NL(m bin) */
  bin.n1 AS bin_1,
  bin.n2 AS bin_2,
  bin.n3 AS bin_3,
  bin.n4 AS bin_4,
  bin.n5 AS bin_5
FROM
  (
    SELECT '2,33,55,43,32' AS bins FROM DUAL
  ) m
  CROSS JOIN TABLE(splitter.split_n5(m.bins, ',')) bin;

It is clear however that it did not hurt readability, in fact, it made it probably easier to read.

Note that the Optimizer might have problems with identifying what to do with the PIPELINED (pickler) function, hence hints were added. Be warned however that the cardinality hint is not supported by Oracle.

Issues

Note that with given methods, we just split and parse very simple strings. It does not deal with nested comma’s, and quotes. If this is your aim, then I would suggest the reading onĀ http://ora-00001.blogspot.com/2010/04/select-from-spreadsheet-or-how-to-parse.html.

Also note that none of the methods in PL/SQL and SQL are really going to be high performers. When parsing millions of records, I have much better results parson results using Java. In Java I found it also much easier to distribute my workload over multiple CPU cores using multithreading.