Parsing a string with a CSV into multiple columns

Posted on November 4, 2013 Updated on January 5, 2014

The Problem

We have a lot of data to process (lots of rows), while one of the columns contains a character string that contains a concatenation of values. The values are separated by a comma, or some different separator character.

An example could be 1,2,55,3 or 55_63_88_42_22

We want to split this string into individual columns. The typical approach would be to use a combination of INSTR and SUBSTR calls as follows:

SELECT
  SUBSTR(bins,1,INSTR(bins,',',1,1) - 1) AS bin_1,
  SUBSTR(bins,INSTR(bins,',',1,1) + 1,INSTR(bins,',',1,2) - INSTR(bins,',',1,1) - 1) AS bin_2,
  SUBSTR(bins,INSTR(bins,',',1,2) + 1,INSTR(bins,',',1,3) - INSTR(bins,',',1,2) - 1) AS bin_3,
  SUBSTR(bins,INSTR(bins,',',1,3) + 1,INSTR(bins,',',1,4) - INSTR(bins,',',1,3) - 1) AS bin_4,
  SUBSTR(bins,INSTR(bins,',',1,4) + 1) AS bin_5
FROM
  (
    SELECT '2,33,55,43,32' AS bins FROM DUAL
  ) m

This works nicely, but the problem is with processing huge amounts of data. SUBSTR, and INSTR are already slow parsers, but for above to work, they need to start searching every time from the beginning of the String (position 1). Additionally, it is complex to repeat this for every column, and so makes it error prone.

Attempt 1: Hide complexity in a Stored Procedure

CREATE OR REPLACE FUNCTION GET_ELEMENT(source_string VARCHAR2,element_index PLS_INTEGER,separator CHAR := ',')
RETURN VARCHAR2
AS
  pos1 PLS_INTEGER := CASE element_index WHEN 1 THEN 0 ELSE INSTR(source_string,separator,1,element_index-1) END;
  pos2 PLS_INTEGER := INSTR(source_string,separator,pos1+1,1);
BEGIN
  CASE
    WHEN pos1 = 0 AND element_index > 1 THEN RETURN NULL;
    WHEN pos2 = 0 THEN RETURN SUBSTR(source_string,pos1+1);
    ELSE RETURN SUBSTR(source_string,pos1+1,pos2-1-pos1);
  END CASE;
END;

This is already a nice first attempt. When looking for each ending separator (pos2), it starts at the position of the beginning separator (pos1). However the calculation of each beginning separator (pos1) always starts at position 1. So for bulk operations, this is again going to cause overhead in processing.

But as can be seen in SQL below, it also simplified the parsing by a lot:

SELECT
  get_element(bins,1) AS bin_1,
  get_element(bins,2) AS bin_2,
  get_element(bins,3) AS bin_3,
  get_element(bins,4) AS bin_4,
  get_element(bins,5) AS bin_5,
  get_element(bins,6) AS bin_6
FROM
  (
    SELECT '2,33,55,43,32' AS bins FROM DUAL
  ) m;

Attempt 2: Fully Iterative Function

CREATE OR REPLACE PACKAGE splitter AS
  TYPE n5_r IS RECORD (n1 VARCHAR2(50),n2 VARCHAR2(50),n3 VARCHAR2(50),n4 VARCHAR2(50),n5 VARCHAR2(50));
  TYPE n5_t IS TABLE OF n5_r;
  FUNCTION split_n5(source_string IN VARCHAR2,separator IN CHAR) RETURN n5_t PIPELINED;
END splitter;
/

CREATE OR REPLACE PACKAGE BODY splitter AS
  FUNCTION safe_instr(source_string IN VARCHAR2,separator IN CHAR,start_position IN PLS_INTEGER) RETURN PLS_INTEGER IS
    pos PLS_INTEGER := INSTR(source_string,separator,start_position,1);
  BEGIN
    RETURN CASE pos WHEN 0 THEN NULL ELSE pos END;
  END;

FUNCTION safe_substr(source_string IN VARCHAR2,start_position IN PLS_INTEGER,next_position IN PLS_INTEGER) RETURN PLS_INTEGER IS
  BEGIN
    IF next_position IS NULL THEN RETURN SUBSTR(source_string,start_position); END IF;
    RETURN SUBSTR(source_string,start_position,next_position-start_position);
  END;

FUNCTION split_n5(source_string IN VARCHAR2,separator IN CHAR) RETURN n5_t PIPELINED IS
    pos1 PLS_INTEGER := 0;
    pos2 PLS_INTEGER;
    n n5_r;
  BEGIN
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n1 := safe_substr(source_string,pos1+1,pos2);
    pos1 := pos2;
    --
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n2 := safe_substr(source_string,pos1+1,pos2);
    pos1 := pos2;
    --
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n3 := safe_substr(source_string,pos1+1,pos2);
    pos1 := pos2;
    --
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n4 := safe_substr(source_string,pos1+1,pos2);
    pos1 := pos2;
    --
    pos2 := safe_instr(source_string,separator,pos1+1);
    n.n5 := safe_substr(source_string,pos1+1,pos2);
    --
    PIPE ROW(n);
  END;
END splitter;
/

Now every next delimiter is always based on the position of the previous delimiter. This will perform well for bigger batches of data. It’s use is however a little trickier, because we are dealing with a PIPELINED function. It requires to be wrapped inside a TABLE( ... ) function:

SELECT /*+ CARDINALITY(bin 1) USE_NL(m bin) */
  bin.n1 AS bin_1,
  bin.n2 AS bin_2,
  bin.n3 AS bin_3,
  bin.n4 AS bin_4,
  bin.n5 AS bin_5
FROM
  (
    SELECT '2,33,55,43,32' AS bins FROM DUAL
  ) m
  CROSS JOIN TABLE(splitter.split_n5(m.bins, ',')) bin;

It is clear however that it did not hurt readability, in fact, it made it probably easier to read.

Note that the Optimizer might have problems with identifying what to do with the PIPELINED (pickler) function, hence hints were added. Be warned however that the cardinality hint is not supported by Oracle.

Issues

Note that with given methods, we just split and parse very simple strings. It does not deal with nested comma’s, and quotes. If this is your aim, then I would suggest the reading on http://ora-00001.blogspot.com/2010/04/select-from-spreadsheet-or-how-to-parse.html.

Also note that none of the methods in PL/SQL and SQL are really going to be high performers. When parsing millions of records, I have much better results parson results using Java. In Java I found it also much easier to distribute my workload over multiple CPU cores using multithreading.

This entry was posted in plsql, sql and tagged oracle, oracle SQL.

Working with Timezones in Oracle

Posted on November 2, 2013 Updated on November 24, 2013

Calculate the Timezone Offset

SELECT TZ_OFFSET('US/Pacific') FROM DUAL;

The timezone offset can change based on the current time, as this takes into account if the current date is Daylight Saving (DT) or Standard (ST).

So result of TZ_OFFSET is never constant, it depends on the current date as well!

Working with the `TIMESTAMP WITH TIMEZONE` datatype

You can represent the same time in different timezones. The TIMESTAMP WITH TIMEZONE datatype is tagged already for a particular timezone, you can convert it to move it into a different timezone. Allthough the time is in essence the same, it will look very different when output.

Moving a TIMESTAMP WITH TIMEZONE to a different Time Zone can be simply done using the following syntax:

datetimetz AT TIME ZONE tz

Where datetimetz is any value of type TIMESTAMP WITH TIMEZONE, and tz is a string indicating the TZ you want to convert over.

Example:

SELECT tz, TO_CHAR(ts AT TIME ZONE tz,format) AS ts
FROM
(SELECT SYSTIMESTAMP AS ts FROM dual),
(SELECT 'UTC' AS tz FROM dual),
(SELECT 'DD-MON-YYYY HH24:MI:SSxFF TZH:TZM (TZD, TZR)' AS format FROM dual);

Note we are using TO_CHAR, to control the output of the time, by explicitly converting it to a VARCHAR.

Time Zones

Time Zones can be expressed in two ways:

As on Offset: '-06:00', which also matches the 'TZH:TZM' TO_CHAR format.
or, as Region Name: America/Denver‘

It is important to note that if you need to automatically adjust for Daylight Saving, to make sure to use the Region Name rather than just an offset. An offset is always fixed at the offset provided, and will never adjust for Daylight Saving time.

More information can be found at:

Note that even for the Region Name, in Oracle, there are multiple options for the same region. However sometimes the names and abreviations are conflicting. It is highly recomended to use the official long form as in 'America/Los_Angeles'. If you would use the short form 'PST', it is first of all dubious whether this means only Standard Time, or if it should adjust for Daylight Saving Time. Also the same abreviations might be custom in two totally different areas.

Oracle considders PDT, PST, and America/Los_Angeles to be all synonyms. This means they will all adjust for Daylight Saving Time.

To see a list of official Timezone Names, and their abbreviated alias in Oracle, you can use following Query:

SELECT *
FROM gv$timezone_names

Or you can extend this query to lear the exact Offset

SELECT<
tzname,
tzabbrev,
tz_offset_now,
tz_offset_later,
CASE WHEN tz_offset_now != tz_offset_later THEN 'Y' ELSE 'N' END AS supports_dst
FROM (
SELECT
tzname,
tzabbrev,
TO_CHAR((SYSTIMESTAMP AT TIME ZONE tzname)                     ,'TZH:TZM (TZD)') AS tz_offset_now,
TO_CHAR((SYSTIMESTAMP AT TIME ZONE tzname) + INTERVAL '6' MONTH,'TZH:TZM (TZD)') AS tz_offset_later
FROM gv$timezone_names
)

We did not use the TZ_OFFSET function to figure out the Timezone Offset, because it is not parameterized, and as a consequence it only works on the current time (now).

Used SQL Functions: SYSTIMESTAMP, TO_CHAR

One More example:

SELECT
  tz,
  TO_CHAR(ts AT TIME ZONE tz                     ,format) AS ts_now,
  TO_CHAR((ts AT TIME ZONE tz) + INTERVAL '6' MONTH,format) AS ts_later
FROM
  (SELECT SYSTIMESTAMP AS ts FROM dual),
  (
              SELECT 'UTC'              AS tz FROM dual
    UNION ALL SELECT 'GMT'              AS tz FROM dual
    UNION ALL SELECT 'MST'              AS tz FROM dual
    UNION ALL SELECT 'PST'              AS tz FROM dual
    UNION ALL SELECT 'US/Mountain'      AS tz FROM dual
    UNION ALL SELECT 'America/Denver'   AS tz FROM dual
    UNION ALL SELECT 'America/Phoenix'  AS tz FROM dual
  ),
  (
     SELECT
       'DD-MON-YYYY HH24:MI:SSxFF TZH:TZM (TZD, TZR)' AS format
     FROM dual
  );

SYSTIMESTAMP is a TIMESTAMP WITH TIMEZONE, while SYSDATE is just of type DATE which does not carry Timezone information. SYSDATE is always in the timezone of the Database.

Date Calculations

If the datatype is of type DATE, you can operate date calculations on it by simply adding +1 for one day, or +1/24 for one hour (and other fractions). If you are working with a TIMESTAMP WITH TIMEZONE data type however, the correct way is to use an INTERVAL.

The interval was already used in previous examples too:

SELECT
TO_CHAR((SYSTIMESTAMP AT TIME ZONE 'UTC')+INTERVAL '6' MONTH,'DD-MON-YYYY HH24:MI:SSxFF TZH:TZM (TZD, TZR)')
FROM dual

Conversion

It was already shown how you can convert either DATE or TIMESTAMP WITH TIMEZONE to a VARCHAR by means of the TO_CHAR function.

You can also convert a SYSDATE into a TIMESTAMP WITH TIMEZONE:

First Cast it to a TIMESTAMP, and then apply the appropriate TZ using the oracle FROM_TZ function to get a TIMESTAMP WITH TIMEZONE. After having it in a TIMESTAMP WITH TIMEZONE data type, you move it to different Timezones using the AT TIME ZONE syntax.

A TIMESTAMP (with or without TZ), you can convert back to simple type of DATE using a Cast.

SELECT
  CAST(
    FROM_TZ(
      CAST( (TRUNC(SYSDATE)-1) AS TIMESTAMP),
      'US/Pacific'
    ) AT TIME ZONE 'UTC'
    AS DATE
  )
FROM DUAL

Used SQL Functions SYSDATE, TRUNC, FROM_TZ

Functions returning current time or date

The details are in the fine print of the documentation. Take a look at the Return Type, and the actual Timezone the DATE or TIMESTAMP is calculated in.

SYSDATE
- Return Type: DATE
- Time Zone: Host OS of Database Server
CURRENT_DATE
- Return Type: DATE
- Time Zone: Session
SYSTIMESTAMP
- Return Type: TIMESTAMP WITH TIME ZONE
- Time Zone: Host OS of Database Server
CURRENT_TIMESTAMP
- Return Type: TIMESTAMP WITH TIME ZONE
- Time Zone: Session
LOCALTIMESTAMP
- Return Type: TIMESTAMP
- Time Zone: Session
DBTIMEZONE
- Time Zone: DB Time Zone. Inherits from DB Server OS, but can be overridden using set at DB Creation or Alter using Database Parameter SET TIME_ZONE=.... This affects the time zone used for TIMESTAMP WITH LOCAL TIME ZONE datatypes.
SESSIONTIMEZONE
- Time Zone: Session Time zone. Inherits from Session hosting OS, but can be overridden using Session Parameter ALTER SESSION SET TIME_ZONE=....

Return Type, indicates whether or not the Timezone is available within the Datatype. If you try to print TZR if datatype does not carry TimeZone, then it will just show up as +00:00 (doesn’t mean it is GMT). Otherwise It will show the TimeZone matching either the Database or Session as indicated.

Time Zone, indicates in which Timezone the time is calculated. For matching TimeZone, the same Date/Time will be shown (HH24:MI).

Note that none of the Functions return the time in the Time Zone set with the Database Time Zone (or as returned by the DBTIMEZONE function). That is, none of the functions also return a datatype of TIMESTAMP WITH LOCAL TIME ZONE. However you can convert the output of any of the functions that does return a timezone into a different timezone (including DBTIMEZONE) as follows:

SELECT SYSTIMESTAMP AT TIME ZONE DBTIMEZONE FROM DUAL;

Conversion methods for Timestamp With Time Zone (memoryisolation.wordpress.com)

This entry was posted in sql and tagged oracle, plsql, SQL, TimeZone, tz.

Generate a List of Dates

Posted on November 2, 2013 Updated on December 2, 2013

There are many methods to accomplish the same. A few are listed. It probably serves more as a demo to some of the oracle features.

Method 1 – PLSQL Loop

You can also simply build a loop in PL/SQL to generate the list of dates. However for this to work in regular SQL, it will have to be defined using an Oracle ‘Pickler’ or PIPELINED Function.
More info can be found in “Developing and Optimizing Pipelined Functions”.

Method 2 – MODEL Queries

Oracle has the lesser known feature of Model Queries.
Those type of queries allow accessing and modifying oracle SQL output like it is Excel.

The Model allows you to modify the generated data by accessing cells like excel cells to:

access it for read,
access existing cells/rows for changing (UPDATE)
access existing + non existing (UPDATE + INSERT = UPSERT), allowing you to create additional data that didn’t exist before.

SELECT i,stamp
FROM (SELECT 1 AS i,TRUNC(SYSDATE) AS tstamp FROM DUAL)
MODEL
DIMENSION BY (i)
MEASURES (tstamp)
RULES UPSERT SEQUENTIAL ORDER (
tstamp[FOR i FROM 2 TO 10 INCREMENT 1] = tstamp[i=cv(i)-1]+1
);

MEASURES defines the columns you want to access in an array fashion (like you would access a multidimensional array in java or other programming languages.
DIMENSION BY lists out what the indexes are you want to use for accessing the array.
The RULES defines what the edits are that need to take place.

So with above example we can in combination with a loop build a range of dates, starting from one single preexisting date.

Method 3 – Recursive Subquery Factoring

WITH t AS (
  -- Anchor Query (first Row)
  SELECT TRUNC(SYSDATE) AS tstamp FROM DUAL
  -- Recursive Query (subsequent Rows)
  UNION ALL SELECT t.tstamp+1 AS stamp FROM t WHERE t.tstamp+1 <= TRUNC(SYSDATE)+10
)
SELECT * FROM t

Method 4 – Hierarchical Queries using `CONNECT BY`

SELECT TRUNC(SYSDATE) + (LEVEL-1) AS tstamp FROM DUAL
CONNECT BY LEVEL <= 10

Elegance in Simplicity …

References

Oracle-Developer.net article on improving performance with pipelined table functions
Oracle 11gR2 Documentation on
- Model Queries
- Hierarchical Queries

This entry was posted in sql and tagged connect by, model queries, oracle, oracle SQL, SQL.

pivot

Posted on November 2, 2013 Updated on November 4, 2013

Ever saw oracle’s official documentation on the SQL PIVOT ? Couldn’t find the matching documentation for it, and no examples? I have turned the PIVOT inside-out, and will share you what it does.

Starting with pages of unrelated data points

SELECT
student,
subject,
minor,
quarter,
score,
score_max
FROM metric

output
student	subject	minor	quarter	score	score_max
Jeff	Math	Algebra	1	14	20
Jeff	Math	Trigonometry	1	14	20
Jeff	English	Grammar	2	9	10
Jeff	English	Literature	2	9	10
…

This table basically stores for each student his scores on a couple of subjects, further subdivided into minor subject, and for each individual quarter of the year.
Now, lets try to apply a full fledged PIVOT, including all whistles and bells …

SELECT * FROM (
  SELECT student, subject, minor, score, score_max FROM metric
) PIVOT (
  COUNT(DISTINCT quarter) AS "CNT",
  SUM(score) AS num,
  SUM(score_max) AS den
  FOR (subject,minor) IN (
    ('Math','Trigonometry') AS "M_TRI",
    ('Math','Algebra') AS "M_ALG",
    ('English','Grammar') AS "E_GRA",
    ('English','Literature') AS "E_LIT" )
  )

And again the output … maybe to some surprises …

After the Pivot
student	m_tri_cnt	m_tri_num	m_tri_den	m_alg_cnt	m_alg_num	m_alg_den	e_gra_cnt	e_gra_num	e_gra_den	e_lit_cnt	e_lit_num	e_lit_den
Jeff	2	28	40	2	14	20	4	69	100	3	356	400
Mieke	3	38	40	2	18	20	4	69	100	4	301	400
…

Ok, we all expected rows be turned into columns based on some determinator. But a lot more happened, that is obvious. I will list out –

FOR (determinators) identifies the determinators used to identify the pivoted columns, in this case a combination of two columns (subject,minor).
IN (...) lists out any combination of the determinators, and the associated alias (the new column name).
The first part within the PIVOT (aggregators FOR ...), the aggregators (MIN,MAX, COUNT, …). These aggregators basically describe a method to be used to collapse multiple rows into one. In the simplified case, the chosen aggregation method wont matter, But in this example it does. Any column not used in the aggregator sector, will be part of an implicit GROUP BY. When providing multiple aggregators, you need to associate each aggregator with it’s own alias. This alias will be concatenated at the end of the alias provided for the determinator alias.

English is a very ambiguous medium to try to explain something like this, but this is why the provided examples should shed light on the obscurities. Don’t forget to keep the Oracle SQL Reference Guide handy.
We started with the most complex example. Now to show that it does not need to be always that verbose, we will simplify this example. We will assume:

All scores are 100-based, and to come to a total score we can just SUM(...)
There are no minor subjects anymore
No quarters either, just a list of students and their score for each main subject

SELECT
  student,
  subject,
  score
FROM metric

output
student	subject	score
Jeff	Math	80
Jeff	English	93
…

Now, if we want to pivot this, it becomes much more simpler, and less verbose:

SELECT *
FROM
  metric
  PIVOT (
    SUM(score) FOR (subject) IN ('Math','English')
  )

What simplified this is the following:

With a single aggregator, no more need for a separate alias suffix.
Single determinator.
The list of determinator combinations does not have an alias defined either. However Columns will appear literally as the expression, showing ‘Math’, including the single quotes.
The pivoted query is now just the table itself, no subquery.

output
student	‘Math’	‘English’
Jeff	212	534
Joe	365	653
John	456	543
Jo-Ann	383	645
…

This entry was posted in sql and tagged oracle, oracle SQL, pivot, SQL.

Adventures in PLSQL

Nurture the Language

Parsing a string with a CSV into multiple columns

The Problem

Attempt 1: Hide complexity in a Stored Procedure

Attempt 2: Fully Iterative Function

Issues

Working with Timezones in Oracle

Calculate the Timezone Offset

Working with the `TIMESTAMP WITH TIMEZONE` datatype

Time Zones

Date Calculations

Conversion

Functions returning current time or date

Generate a List of Dates

Method 1 – PLSQL Loop

Method 2 – MODEL Queries

Method 3 – Recursive Subquery Factoring

Method 4 – Hierarchical Queries using `CONNECT BY`

References

pivot

Starting with pages of unrelated data points

	hiflitetm on pivot
	David on pivot

Parsing a string with a CSV into multiple columns

The Problem

Attempt 1: Hide complexity in a Stored Procedure

Attempt 2: Fully Iterative Function

Issues

Working with Timezones in Oracle

Calculate the Timezone Offset

Working with the TIMESTAMP WITH TIMEZONE datatype

Time Zones

Date Calculations

Conversion

Functions returning current time or date

Generate a List of Dates

Method 1 – PLSQL Loop

Method 2 – MODEL Queries

Method 3 – Recursive Subquery Factoring

Method 4 – Hierarchical Queries using CONNECT BY

References

pivot

Starting with pages of unrelated data points

Working with the `TIMESTAMP WITH TIMEZONE` datatype

Method 4 – Hierarchical Queries using `CONNECT BY`