oracle SQL
Parsing a string with a CSV into multiple columns
The Problem
We have a lot of data to process (lots of rows), while one of the columns contains a character string that contains a concatenation of values. The values are separated by a comma, or some different separator character.
An example could be 1,2,55,3
or 55_63_88_42_22
We want to split this string into individual columns. The typical approach would be to use a combination of INSTR
and SUBSTR
calls as follows:
SELECT SUBSTR(bins,1,INSTR(bins,',',1,1) - 1) AS bin_1, SUBSTR(bins,INSTR(bins,',',1,1) + 1,INSTR(bins,',',1,2) - INSTR(bins,',',1,1) - 1) AS bin_2, SUBSTR(bins,INSTR(bins,',',1,2) + 1,INSTR(bins,',',1,3) - INSTR(bins,',',1,2) - 1) AS bin_3, SUBSTR(bins,INSTR(bins,',',1,3) + 1,INSTR(bins,',',1,4) - INSTR(bins,',',1,3) - 1) AS bin_4, SUBSTR(bins,INSTR(bins,',',1,4) + 1) AS bin_5 FROM ( SELECT '2,33,55,43,32' AS bins FROM DUAL ) m
This works nicely, but the problem is with processing huge amounts of data. SUBSTR
, and INSTR
are already slow parsers, but for above to work, they need to start searching every time from the beginning of the String (position 1). Additionally, it is complex to repeat this for every column, and so makes it error prone.
Attempt 1: Hide complexity in a Stored Procedure
CREATE OR REPLACE FUNCTION GET_ELEMENT(source_string VARCHAR2,element_index PLS_INTEGER,separator CHAR := ',') RETURN VARCHAR2 AS pos1 PLS_INTEGER := CASE element_index WHEN 1 THEN 0 ELSE INSTR(source_string,separator,1,element_index-1) END; pos2 PLS_INTEGER := INSTR(source_string,separator,pos1+1,1); BEGIN CASE WHEN pos1 = 0 AND element_index > 1 THEN RETURN NULL; WHEN pos2 = 0 THEN RETURN SUBSTR(source_string,pos1+1); ELSE RETURN SUBSTR(source_string,pos1+1,pos2-1-pos1); END CASE; END;
This is already a nice first attempt. When looking for each ending separator (pos2
), it starts at the position of the beginning separator (pos1
). However the calculation of each beginning separator (pos1
) always starts at position 1. So for bulk operations, this is again going to cause overhead in processing.
But as can be seen in SQL below, it also simplified the parsing by a lot:
SELECT get_element(bins,1) AS bin_1, get_element(bins,2) AS bin_2, get_element(bins,3) AS bin_3, get_element(bins,4) AS bin_4, get_element(bins,5) AS bin_5, get_element(bins,6) AS bin_6 FROM ( SELECT '2,33,55,43,32' AS bins FROM DUAL ) m;
Attempt 2: Fully Iterative Function
CREATE OR REPLACE PACKAGE splitter AS TYPE n5_r IS RECORD (n1 VARCHAR2(50),n2 VARCHAR2(50),n3 VARCHAR2(50),n4 VARCHAR2(50),n5 VARCHAR2(50)); TYPE n5_t IS TABLE OF n5_r; FUNCTION split_n5(source_string IN VARCHAR2,separator IN CHAR) RETURN n5_t PIPELINED; END splitter; / CREATE OR REPLACE PACKAGE BODY splitter AS FUNCTION safe_instr(source_string IN VARCHAR2,separator IN CHAR,start_position IN PLS_INTEGER) RETURN PLS_INTEGER IS pos PLS_INTEGER := INSTR(source_string,separator,start_position,1); BEGIN RETURN CASE pos WHEN 0 THEN NULL ELSE pos END; END; FUNCTION safe_substr(source_string IN VARCHAR2,start_position IN PLS_INTEGER,next_position IN PLS_INTEGER) RETURN PLS_INTEGER IS BEGIN IF next_position IS NULL THEN RETURN SUBSTR(source_string,start_position); END IF; RETURN SUBSTR(source_string,start_position,next_position-start_position); END; FUNCTION split_n5(source_string IN VARCHAR2,separator IN CHAR) RETURN n5_t PIPELINED IS pos1 PLS_INTEGER := 0; pos2 PLS_INTEGER; n n5_r; BEGIN pos2 := safe_instr(source_string,separator,pos1+1); n.n1 := safe_substr(source_string,pos1+1,pos2); pos1 := pos2; -- pos2 := safe_instr(source_string,separator,pos1+1); n.n2 := safe_substr(source_string,pos1+1,pos2); pos1 := pos2; -- pos2 := safe_instr(source_string,separator,pos1+1); n.n3 := safe_substr(source_string,pos1+1,pos2); pos1 := pos2; -- pos2 := safe_instr(source_string,separator,pos1+1); n.n4 := safe_substr(source_string,pos1+1,pos2); pos1 := pos2; -- pos2 := safe_instr(source_string,separator,pos1+1); n.n5 := safe_substr(source_string,pos1+1,pos2); -- PIPE ROW(n); END; END splitter; /
Now every next delimiter is always based on the position of the previous delimiter. This will perform well for bigger batches of data. It’s use is however a little trickier, because we are dealing with a PIPELINED
function. It requires to be wrapped inside a TABLE( ... )
function:
SELECT /*+ CARDINALITY(bin 1) USE_NL(m bin) */ bin.n1 AS bin_1, bin.n2 AS bin_2, bin.n3 AS bin_3, bin.n4 AS bin_4, bin.n5 AS bin_5 FROM ( SELECT '2,33,55,43,32' AS bins FROM DUAL ) m CROSS JOIN TABLE(splitter.split_n5(m.bins, ',')) bin;
It is clear however that it did not hurt readability, in fact, it made it probably easier to read.
Note that the Optimizer might have problems with identifying what to do with the PIPELINED
(pickler) function, hence hints were added. Be warned however that the cardinality
hint is not supported by Oracle.
Issues
Note that with given methods, we just split and parse very simple strings. It does not deal with nested comma’s, and quotes. If this is your aim, then I would suggest the reading on http://ora-00001.blogspot.com/2010/04/select-from-spreadsheet-or-how-to-parse.html.
Also note that none of the methods in PL/SQL and SQL are really going to be high performers. When parsing millions of records, I have much better results parson results using Java. In Java I found it also much easier to distribute my workload over multiple CPU cores using multithreading.
Generate a List of Dates
There are many methods to accomplish the same. A few are listed. It probably serves more as a demo to some of the oracle features.
Method 1 – PLSQL Loop
You can also simply build a loop in PL/SQL to generate the list of dates. However for this to work in regular SQL, it will have to be defined using an Oracle ‘Pickler’ or PIPELINED
Function.
More info can be found in “Developing and Optimizing Pipelined Functions”.
Method 2 – MODEL Queries
Oracle has the lesser known feature of Model Queries.
Those type of queries allow accessing and modifying oracle SQL output like it is Excel.
The Model allows you to modify the generated data by accessing cells like excel cells to:
- access it for read,
- access existing cells/rows for changing (UPDATE)
- access existing + non existing (UPDATE + INSERT = UPSERT), allowing you to create additional data that didn’t exist before.
SELECT i,stamp FROM (SELECT 1 AS i,TRUNC(SYSDATE) AS tstamp FROM DUAL) MODEL DIMENSION BY (i) MEASURES (tstamp) RULES UPSERT SEQUENTIAL ORDER ( tstamp[FOR i FROM 2 TO 10 INCREMENT 1] = tstamp[i=cv(i)-1]+1 );
MEASURES
defines the columns you want to access in an array fashion (like you would access a multidimensional array in java or other programming languages.
DIMENSION BY
lists out what the indexes are you want to use for accessing the array.
The RULES defines what the edits are that need to take place.
So with above example we can in combination with a loop build a range of dates, starting from one single preexisting date.
Method 3 – Recursive Subquery Factoring
WITH t AS ( -- Anchor Query (first Row) SELECT TRUNC(SYSDATE) AS tstamp FROM DUAL -- Recursive Query (subsequent Rows) UNION ALL SELECT t.tstamp+1 AS stamp FROM t WHERE t.tstamp+1 <= TRUNC(SYSDATE)+10 ) SELECT * FROM t
Method 4 – Hierarchical Queries using CONNECT BY
SELECT TRUNC(SYSDATE) + (LEVEL-1) AS tstamp FROM DUAL CONNECT BY LEVEL <= 10
Elegance in Simplicity …
References
- Oracle-Developer.net article on improving performance with pipelined table functions
- Oracle 11gR2 Documentation on
pivot
Ever saw oracle’s official documentation on the SQL PIVOT ? Couldn’t find the matching documentation for it, and no examples? I have turned the PIVOT inside-out, and will share you what it does.
Starting with pages of unrelated data points
SELECT student, subject, minor, quarter, score, score_max FROM metric
student | subject | minor | quarter | score | score_max |
Jeff | Math | Algebra | 1 | 14 | 20 |
Jeff | Math | Trigonometry | 1 | 14 | 20 |
Jeff | English | Grammar | 2 | 9 | 10 |
Jeff | English | Literature | 2 | 9 | 10 |
… |
This table basically stores for each student his scores on a couple of subjects, further subdivided into minor subject, and for each individual quarter of the year.
Now, lets try to apply a full fledged PIVOT, including all whistles and bells …
SELECT * FROM ( SELECT student, subject, minor, score, score_max FROM metric ) PIVOT ( COUNT(DISTINCT quarter) AS "CNT", SUM(score) AS num, SUM(score_max) AS den FOR (subject,minor) IN ( ('Math','Trigonometry') AS "M_TRI", ('Math','Algebra') AS "M_ALG", ('English','Grammar') AS "E_GRA", ('English','Literature') AS "E_LIT" ) )
And again the output … maybe to some surprises …
student | m_tri_cnt | m_tri_num | m_tri_den | m_alg_cnt | m_alg_num | m_alg_den | e_gra_cnt | e_gra_num | e_gra_den | e_lit_cnt | e_lit_num | e_lit_den |
Jeff | 2 | 28 | 40 | 2 | 14 | 20 | 4 | 69 | 100 | 3 | 356 | 400 |
Mieke | 3 | 38 | 40 | 2 | 18 | 20 | 4 | 69 | 100 | 4 | 301 | 400 |
… |
Ok, we all expected rows be turned into columns based on some determinator. But a lot more happened, that is obvious. I will list out –
FOR (determinators)
identifies the determinators used to identify the pivoted columns, in this case a combination of two columns(subject,minor)
.IN (...)
lists out any combination of the determinators, and the associated alias (the new column name).- The first part within the
PIVOT (aggregators FOR ...)
, the aggregators (MIN
,MAX
,COUNT
, …). These aggregators basically describe a method to be used to collapse multiple rows into one. In the simplified case, the chosen aggregation method wont matter, But in this example it does. Any column not used in the aggregator sector, will be part of an implicitGROUP BY
. When providing multiple aggregators, you need to associate each aggregator with it’s own alias. This alias will be concatenated at the end of the alias provided for the determinator alias.
English is a very ambiguous medium to try to explain something like this, but this is why the provided examples should shed light on the obscurities. Don’t forget to keep the Oracle SQL Reference Guide handy.
We started with the most complex example. Now to show that it does not need to be always that verbose, we will simplify this example. We will assume:
- All scores are 100-based, and to come to a total score we can just
SUM(...)
- There are no minor subjects anymore
- No quarters either, just a list of students and their score for each main subject
SELECT student, subject, score FROM metric
student | subject | score |
Jeff | Math | 80 |
Jeff | English | 93 |
… |
Now, if we want to pivot this, it becomes much more simpler, and less verbose:
SELECT * FROM metric PIVOT ( SUM(score) FOR (subject) IN ('Math','English') )
What simplified this is the following:
- With a single aggregator, no more need for a separate alias suffix.
- Single determinator.
- The list of determinator combinations does not have an alias defined either. However Columns will appear literally as the expression, showing ‘Math’, including the single quotes.
- The pivoted query is now just the table itself, no subquery.
student | ‘Math’ | ‘English’ |
Jeff | 212 | 534 |
Joe | 365 | 653 |
John | 456 | 543 |
Jo-Ann | 383 | 645 |
… |