Tally 'Em Up

How tally tables facilitate set-based solutions

@feaselkl

feasel@catallaxyservices.com

Who Am I? What Am I Doing Here?

	Catallaxy Services	@feaselkl
	Curated SQL
	We Speak Linux

What Is A Tally Table?

N
0
1
2
3
...
19999
20000

A classic tally table is nothing more than a table of numbers.

There is no fixed limit on size. Keep in mind that a SQL Server page contains approximately 8K worth of data, and an integer is 4 bytes. Factoring in overhead, we can put approximately 2000 values on a single page.

In my experience, I find that a tally table needn't be much larger than 20,000 records.

How To Create A Tally Table

There are a number of methods which you can use to generate the numbers in a tally table. Here is a relatively efficient method:


--Suppress the auto-display of rowcounts for appearance/speed
SET NOCOUNT ON 

IF OBJECT_ID('dbo.Tally') IS NOT NULL 
        DROP TABLE dbo.Tally

SELECT TOP 20000
	IDENTITY(INT,0,1) AS N
INTO 
	dbo.Tally
FROM 
	Master.dbo.SysColumns sc1
	CROSS JOIN Master.dbo.SysColumns sc2

ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

How To Create A (Faster) Tally Table

For small numbers (less than approximately 10,000), it is often faster to generate your tally table as a common table expression rather than storing it on disk. The following code shows you how to generate 10,000 numbers quickly.


WITH E1(N) AS 
(
	SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
	SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
	SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),					--10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b),	--10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b),	--10E+4 or 10,000 rows max
cteTally(N) AS 
(
     SELECT 0 UNION ALL
     SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) 
		FROM E4
)
SELECT N FROM cteTally;

Reasons To Use Tally Tables

Eliminate common cases of RBAR (Row By Agonizing Row) processing

In the procedural programming world, loops are outstanding ways of iterating over a number of items. When procedural programmers learn SQL, they apply the same lesson to SQL and typically get poor performance. This is because the WHILE and CURSOR operators in T-SQL are slow.

We can use tally tables to replace looping in a large number of common problem areas, and that gets rid of a RBAR headache.

Reasons To Use Tally Tables

Eliminate common cases of RBAR (Row By Agonizing Row) processing
Solve some common (or sometimes uncommon) problems

The rest of this presentation will cover a number of problems. Most of these are common enough that you'll recognize the pattern quickly, but some are a bit more esoteric in nature.

This presentation is not the end-all, be-all of what tally tables can do; they are quite a bit more useful than even this presentation can show.

Tally Table Use Cases

Splitting text
Splitting out camel case
Generating random strings
Generating multiple rows from a single row
Determining unique ranges
Date checking and filtering
Splitting out 2D arrays dynamically

Splitting Text

The Problem Description

We need to find a way to split some text. This might be to strip out titles from a name or to normalize a flat file that somebody dumped into a one-column table.

Suppose that we have a table with information on former Congressmen. This information came in from a comma-delimited file, and the end result looks a little like:


create table #CongressInformation
(
	Id int IDENTITY(1,1),
	String varchar(500)
);
 
insert into #CongressInformation(String) values
('Bob,Dole,Kansas,Senate,1969,1996'),
('Robert,Taft,Ohio,House,1920,1930'),
('Robert,Taft,Ohio,Senate,1938,1953');

Splitting Text

The Bad Way

When we need to loop through a string, looking for commas, the obvious answer is a cursor or WHILE loop.

Sample code to parse text using a WHILE loop:


WHILE LEN(@StringInput) > 0
BEGIN
	SET @Value = LEFT(
		@StringInput, 
		ISNULL(NULLIF(CHARINDEX(@pDelimiter, @StringInput) - 1, -1),
		LEN(@StringInput))
			);
	SET @StringInput = SUBSTRING(
		@StringInput,
		ISNULL(NULLIF(CHARINDEX(@pDelimiter, @StringInput), 0), 
			LEN(@StringInput)) + 1, 
		LEN(@StringInput)
				);
	SET @ColumnNumber = @ColumnNumber + 1;

	INSERT INTO @OutputTable ( [ID], [Value], [ColumnNumber] )
	VALUES ( @CurrentID, @Value, @ColumnNumber )
END

How The Tally Split Works

Splitting Text

The Tally Table Way

Instead of using a WHILE loop (a RBAR technique), we can use a tally table to create a set-based solution and give us ALL of the results simultaneously, rather than one row at a time.

Sample code to parse text using a tally table:


SELECT
	Id,
	ROW_NUMBER() over (partition by Id order by N) as ColumnNumber,
	SUBSTRING
	(
	    ',' + p.String + ',', 
	    N+1, 
	    CHARINDEX(',', ',' + p.String + ',', N+1 ) - N-1
	) as Value
FROM
	dbo.Tally t
	CROSS JOIN #CongressInformation p
WHERE
	N < LEN(',' + p.String + ',')
	AND SUBSTRING(',' + p.String + ',', N, 1) = ','

Demo Time

In order to show which solution is truly better, let's go to Management Studio.

Open and run Demo 1 - Splitting Text.sql.

Lessons Learned

The tally table solution scales much better than RBAR.

The tally table solution is more concise than a WHILE loop solution.

Both solutions return the same results. Using a tally table is therefore a free performance boost and a way to simplify code.

Tally Table Use Cases

Splitting text
Splitting out camel case
Generating random strings
Generating multiple rows from a single row
Determining unique ranges
Date checking and filtering
Splitting out 2D arrays dynamically

Splitting Out Camel Case

The Problem Description

We have a flat set of questions following a common format. We want to pivot this question set dynamically (because it may change later) with as little hassle as possible. Unfortunately, we don't have good descriptions for each question, so we're going to use the column text to describe the question.


CREATE TABLE dbo.AllQuestions
(
	PersonID int,
	IsFeelingOkay bit,
	IsHungry bit,
	IsTired bit,
	IsWearingBlue bit
);

Splitting Out Camel Case

Getting The Question List

PersonID	IsFeelingOkay	IsHungry	IsTired	IsWearingBlue
Bob	1	0	1	0

PersonID	Question	Answer
Bob	Feeling Okay	1
Bob	Hungry	0
Bob	Tired	1
Bob	Wearing Blue	0

In our sample, all questions start with "Is" and are of type bit. We want to unpivot this list so that each column is now a row.

Splitting Out Camel Case

Before we describe how to use a tally table to solve this problem, first we need to get a valid set of questions. The following code uses the system tables to grab the correct columns.


SELECT
    sc.name as ColumnName,
    SUBSTRING(sc.name, 3, LEN(sc.name)) as SplitColumnName      
FROM
    sys.columns sc
    INNER JOIN sys.tables st ON sc.object_id = st.object_id
    INNER JOIN sys.schemas ss ON st.schema_id = ss.schema_id
    INNER JOIN sys.types styp ON sc.system_type_id = styp.system_type_id
WHERE
    ss.name = 'dbo'
    and st.name = 'AllQuestions'
    and sc.name like 'Is%'
    and styp.name = 'bit'

Splitting Out Camel Case

With SplitColumnName in a Questions CTE, we have enough to use a tally table. The BINARY_CHECKSUM function is case-sensitive.


SELECT
	q.ColumnName,
	q.SplitColumnName,
	SUBSTRING(q.SplitColumnName, n.N, 1) as Letter,
	n.N,
	CASE 
		WHEN BINARY_CHECKSUM(SUBSTRING(q.SplitColumnName, n.N, 1)) = BINARY_CHECKSUM(LOWER(SUBSTRING(q.SplitColumnName, n.N, 1))) THEN 0 
		ELSE 1 
	END AS IsUppercaseLetter
FROM
	dbo.Tally n
	CROSS JOIN Questions q
WHERE
	n.N <= LEN(q.SplitColumnName)

Now we can put a space before each upper-case letter (other than at the beginning) and use STUFF + FOR XML PATH to glue our letters back together.

Demo Time

Time to put everything together in Management Studio.

Open and run Demo 2 - Splitting Out Camel Case.sql.

Tally Table Use Cases

Splitting text
Splitting out camel case
Generating random strings
Generating multiple rows from a single row
Determining unique ranges
Date checking and filtering
Splitting out 2D arrays dynamically

Generating Random Strings

The Problem Description

We need to generate some random data for a user. This might go into a QA environment (giving us de-identified data). In our case, we want to generate a random salt of 20 characters for each user (with sys.all_objects standing in as a proxy for a user table).

Once again, we're going to contrast the RBAR method with the tally table method.

Generating Random Strings

The Bad Way

Our procedural brains tell us that this is a good place for a loop. This code generates 20 characters per record using a cursor:


DECLARE @counter tinyint;
DECLARE @nextChar char(1);
DECLARE @randomString varchar(20);
DECLARE @objectID bigint;
declare @randomStrings table 
( 
	id bigint,
	string varchar(20) 
);

DECLARE userCursor CURSOR FOR
SELECT Object_ID
FROM sys.all_objects;

OPEN userCursor
FETCH NEXT FROM userCursor INTO @objectID;

WHILE @@FETCH_STATUS = 0
BEGIN	
	SET @counter = 1
	SET @randomString = ''

	WHILE @counter <= 20
	BEGIN
	  SELECT @nextChar = CHAR(ROUND(RAND(cast(abs(checksum(newid())) as int)) * 93 + 33, 0))
	  SELECT @randomString = @randomString + @nextChar
	  SET @counter = @counter + 1
	END
	
	insert into @randomStrings(id, string)
	select @objectID, @randomString;

	FETCH NEXT FROM userCursor INTO @objectID;
END
CLOSE userCursor;
DEALLOCATE userCursor;

select * from @randomStrings;

Splitting Text

The Tally Table Way

Instead of using a cursor loop (a RBAR technique), we can use a tally table to create a set-based solution and give us ALL of the results simultaneously, rather than one row at a time.

Sample code to parse text using a tally table:


with saltchars as
(
	select
	    object_id as Id,
	    N,
	    CHAR(ROUND(RAND(cast(abs(checksum(newid())) as int)) * 93 + 33, 0)) as randomvalue
	from 
	    sys.all_objects u
	    cross join dbo.Tally t
	where
	    t.N between 1 and 20
)
select distinct
	Id,
	substring((select ('' + randomvalue )
			    from saltchars s2 
			    where s1.Id = s2.Id
			    order by N 
			    for xml path ( '' )
	            ), 1, 20) as RandomValue
from
	saltchars s1;

Demo Time

In order to show which solution is truly better, let's go to Management Studio.

Open and run Demo 3 - Generating Random Text.sql.

Lessons Learned

The tally table solution scales much better tha RBAR.

The tally table solution is more concise than a cursor-based solution.

Both solutions return the same results. Using a tally table is therefore a free performance boost and a way to simplify code.

Doesn't the above text look familiar?

Tally Table Use Cases

Splitting text
Splitting out camel case
Generating random strings
Generating multiple rows from a single row
Determining unique ranges
Date checking and filtering
Splitting out 2D arrays dynamically

Generating Multiple Rows

The Problem Description

We have a data set that we need to validate. There are a number of potential errors, and we want to report on every error we find, rather than stopping at the first and quitting.

In this case, we'll contrast two set-based methods: a tally table and a UNION ALL combination.

Generating Multiple Rows

The Sample Data And Rules

Here is our data table structure:


CREATE TABLE #Data
(
    Id int NOT NULL,
    Field1 varchar(50) NOT NULL,
    Field2 varchar(50) NULL,
    SomeNum1 int NOT NULL,
    SomeNum2 int NULL
);

And here are the scenarios that would trigger an error:

Field2 is NULL
Field1 has a greater value than Field2
SomeNum2 is NULL
SomeNum1 is larger than SomeNum2

Generating Multiple Rows

The (Sort Of) Bad Way

No RBAR here, but we have four table scans.


SELECT Id, 'Field2 is NULL!' as ErrorMessage FROM #Data WHERE Field2 IS NULL
UNION ALL
SELECT Id, 'SomeNum2 is NULL!' FROM #Data WHERE SomeNum2 IS NULL
UNION ALL
SELECT Id, 'Field1 is bigger than Field2!' FROM #Data WHERE Field1 > Field2
UNION ALL
SELECT Id, 'SomeNum1 is bigger than SomeNum2!' FROM #Data WHERE SomeNum1 > SomeNum2
ORDER BY Id, ErrorMessage;

Generating Multiple Rows

The Tally Table Way

This is the first time in which our code has become more complicated as a result of having a tally table. Cutting from N scans to 1 does make it worthwhile, though.


WITH tally as
(
    SELECT 1 as N, 'Field2 is NULL!' as ErrorMessage
    UNION ALL
    SELECT 2 as N, 'SomeNum2 is NULL!' as ErrorMessage
    UNION ALL
    SELECT 3 as N, 'Field1 is bigger than Field2!' as ErrorMessage
    UNION ALL
    SELECT 4 as N, 'SomeNum1 is bigger than SomeNum2!' as ErrorMessage
)
SELECT
    d.Id,
    t.ErrorMessage
FROM
    #Data d
    CROSS JOIN tally t
WHERE
    (
        (t.N = 1 AND d.Field2 IS NULL)
        OR (t.N = 2 AND d.SomeNum2 IS NULL)
        OR (t.N = 3 AND d.Field1 > d.Field2)
        OR (t.N = 4 AND d.SomeNum1 > d.SomeNum2)
    );

Demo Time

In order to show which solution performs better, let's go to Management Studio.

Open and run Demo 4 - Generating Multiple Rows.sql.

Lessons Learned

The tally table solution scales better than the UNION ALL technique, but the difference isn't as drastic as a RBAR solution.

The tally table solution is a bit more complex than the UNION ALL technique.

Sometimes a tally table can have a little bit more context--for example, we can include easy-to-understand error messages for our data loads.

Tally Table Use Cases

Splitting text
Splitting out camel case
Generating random strings
Generating multiple rows from a single row
Determining unique ranges
Date checking and filtering
Splitting out 2D arrays dynamically

Determining Unique Ranges

The Problem Description

We have ranges set up in a table. These ranges should be unique but aren't. Given a precedence rule, how do we generate guaranteed unique ranges?


create table #Data
(
	Id int identity(1,1),
	Person int,
	BeginValue int,
	EndValue int
);

Determining Unique Ranges

Understanding The Problem

Data
Person	Begin Value	End Value
Bob	1	8
Bob	10	15
Bob	13	17
Jane	2	9
Jane	3	6

Corrected
Person	Begin Value	End Value
Bob	1	8
Bob	10	15
Bob	16	17
Jane	2	9

In our sample, Bob's 10-15 overlaps with 13-17. We take the earlier range as "more correct" and update the latter range. We would eliminate any rows which are completely engulfed within an earlier range.

Determining Unique Ranges

We use a tally table to give us the first record which hit a particular value:


select
	t.N,
	d.Person,
	MIN(d.Id) as FirstId
from
	#Data d
	inner join dbo.Tally t
		on t.N between d.BeginValue and d.EndValue
group by
   t.N,
   d.Person

Determining Unique Ranges

We can now create our corrected ranges:


select
    Person,
    FirstId as Id,
    MIN(N) as NewBeginValue,
    MAX(N) as NewEndValue
from
(
	select
		t.N,
		d.Person,
		MIN(d.Id) as FirstId
	from
		#Data d
		inner join dbo.Tally t
			on t.N between d.BeginValue and d.EndValue
	group by
		t.N,
		d.Person
) arrangements
group by
    Person,
    FirstId

Demo Time

Time to put everything together in Management Studio.

Open and run Demo 5 - Determining Unique Ranges.sql.

Lessons And Warnings

Tally tables solve another problem caused by de-normalization.
This code can work for dates as well as numbers.
This code does not scale beyond ~1 million rows.

Tally Table Use Cases

Splitting text
Splitting out camel case
Generating random strings
Generating multiple rows from a single row
Determining unique ranges
Date checking and filtering
Splitting out 2D arrays dynamically

Date Checking And Filtering

The Problem Description

We receive date as a text field. They should come in as YYYYMMDD strings, but there is no input validation. We need to filter out all of the bad dates.


--Date validation
declare @somedates table
(
    adate varchar(8)
);
 
insert into @somedates(adate) values
('19990501'),
('20111499'),   /* Obviously bad. */
('20090101'),
('20111305'),   /* Bad...or is it? */
('ababyyyy'),   /* Obviously bad. */
('9901'),       /* Should be bad. */
('Jun 1 09'),   /* Should be bad. */
('20030419');

Date Checking And Filtering

We use a tally table to give us the set of valid dates in our date range:


DECLARE @StartDate date = '1991-01-01';
DECLARE @EndDate date = '2015-01-01';

SELECT 
	CONVERT(varchar(8), DATEADD(d, N, @StartDate), 112) as ValidDate
FROM 
	dbo.Tally
WHERE 
	DATEADD(d, N, @StartDate) <= @EndDate

Date Checking And Filtering

We can now determine which fields are valid input without having to perform a RBAR function:


with tallyDates as
(
	SELECT 
	    CONVERT(varchar(8), DATEADD(d, N, @StartDate), 112) as ValidDate
	FROM 
		dbo.Tally
	WHERE 
		DATEADD(d, N, @StartDate) <= @EndDate
)
SELECT 
	sd.*
FROM 
	@somedates sd
	left outer join tallyDates td on sd.adate = td.ValidDate
WHERE 
	td.ValidDate is null;

Date Checking And Filtering

ISDATE()

Why shouldn't we just use the ISDATE() function within SQL Server?

Input	ISDATE Result
19990501	1
20111499	0
20090101	1
20111305	0 in EN; 1 in FR!
ababyyyy	0
9901	1
Jun 1 09	1
20030419	1

Demo Time

Time to put everything together in Management Studio.

Open and run Demo 6 - Date Checking and Filtering.sql.

Lessons And Warnings

Tally tables can be part of your validation routines.
Collation and region may affect your results using ISDATE; this will always use YYYYMMDD (due to the CONVERT syntax).
Think about creating a permanent Date table, like in data warehousing.

Tally Table Use Cases

Splitting text
Splitting out camel case
Generating random strings
Generating multiple rows from a single row
Determining unique ranges
Date checking and filtering
Splitting out 2D arrays dynamically

Dynamic 2D Array Splitting

The Problem Description

We have an input "array" for a stored procedure. Because SQL Server doesn't handle array types, a developer created a 2D pseudo-array with two delimiters: comma for attributes and pipe for elements. Our job is to split out the elements in that array and re-create a table based on that string.


DECLARE @Parameter VARCHAR(8000) = 
	'Part#1,PartName1,1|Part#2,PartName2,2|Part#3,PartName3,3';

PartNumber	PartName	QuantityOrdered
Part#1	PartName1	1
Part#2	PartName2	2
Part#3	PartName3	3

Dynamic 2D Array Splitting

We use one tally table, but need a little bit of setup work first:


DECLARE @GroupCount INT; 
SELECT @GroupCount =
	  -- Find the length of the first group...
	  LEN(SUBSTRING(@Parameter,1,CHARINDEX('|',@Parameter)))
	  -- ... subtract the length of the first group without any commas...
	- LEN(REPLACE(SUBSTRING(@Parameter,1,CHARINDEX('|',@Parameter)),',',''))
	  -- ... and add 1 because there is always 1 element more than commas.
	+ 1;
 
SET @Parameter = ',' + REPLACE(@Parameter,'|',',') + ',';

Dynamic 2D Array Splitting

Now join in the tally table and set rows and columns based on GroupCount.


INSERT INTO Elements (ElementValue)
SELECT
	SUBSTRING(@Parameter, N+1, CHARINDEX(',',@Parameter, N+1) - N-1)
FROM 
	dbo.Tally
WHERE 
	N < LEN(@Parameter)
	AND SUBSTRING(@Parameter, N, 1) = ',';
 
--===== Calculate and update the "row" and "column" indices
 UPDATE Elements
 SET 
	RowNum = (ElementNumber-1) / @GroupCount, --"Zero" based row
	ColNum = (ElementNumber-1) % @GroupCount;  --"Zero" based col

Dynamic 2D Array Splitting

Reminder: How Does This Work?


--',Part#1,PartName1,1,Part#2,PartName2,2,Part#3,PartName3,3,'
SELECT
	SUBSTRING(@Parameter, N+1, CHARINDEX(',',@Parameter, N+1) - N-1)
FROM 
	dbo.Tally
WHERE 
	N < LEN(@Parameter)
	AND SUBSTRING(@Parameter, N, 1) = ',';

N < LEN(@Parameter): only get needed pages.
SUBSTRING(@Parameter, N, 1) = ',': get the beginning of each item.
SUBSTRING(@Parameter...): split out the actual text.
N+1: start AFTER the leading comma.
CHARINDEX(',', @Parameter, N+1): get NEXT comma.
CHARINDEX(.) - N - 1: get only the characters BETWEEN the commas.

Dynamic 2D Array Splitting

Constructing A Table

To begin constructing a table, we need column names and numbers. We can join the Elements table to our columns table.


declare @columns table
(
    ColumnNumber int,
    Name nvarchar(30)
);

insert into @columns(ColumnNumber, Name) values
(0, 'PartNumber'), (1, 'PartName'), (2, 'QuantityOrdered');

select
		e.ColNum,
		'max(case when colnum = '
		+ cast(e.ColNum as nvarchar(2))
		+ ' then ElementValue end) as '
		+ c.Name
from
	Elements e
	inner join @columns c on e.ColNum = c.ColumnNumber
where
	e.RowNum = 0

Dynamic 2D Array Splitting

Constructing A Table

To begin constructing a table, we need column names and numbers. We can join the Elements table to our columns table.

Expected Results After Joining Elements Table To Columns Table
ColumnNumber	ColumnText
0	max(case when colnum = 0 then ElementValue end) as PartNumber
1	max(case when colnum = 1 then ElementValue end) as PartName
2	max(case when colnum = 2 then ElementValue end) as QuantityOrdered

Dynamic 2D Array Splitting

Unpivoting

With the columns generated by records, we can put it back together as one SQL statement.


select distinct
	RecordText = STUFF((
				select ( ', ' + ColumnText )
				from records r2
				order by ColumnNumber
				for xml path ( '' )
			), 1, 1, '')
from
	records r1

/* RESULTS
max(case when colnum = 0 then ElementValue end) as PartNumber, 
max(case when colnum = 1 then ElementValue end) as PartName, 
max(case when colnum = 2 then ElementValue end) as QuantityOrdered
*/

Demo Time

Time to put everything together in Management Studio.

Open and run Demo 7 - Dynamic 2D Array Splitting.sql.

Lessons And Warnings

Tally tables can solve complicated messes while still performing well.
We can break complex problems down into simpler problems and apply known solutions.
This code can be adapted to fix Entity-Attribute-Value (EAV) designs.
This is an anti-pattern, not a pattern! Do not use this as a replacement for arrays!

Wrapping Up

Tally tables excel in certain administrative and data cleansing activities:

Splitting strings
Normalizing data
Error checking
Data generation

They also work great with pattern search problems:

Filtering dates
Finding sequences
Maximum chains of common values

Although tally tables are extremely useful when dealing with denormalized data, they are NOT a replacement for sound database design and normalization techniques.

Resources

Slides and code are available at http://www.catallaxyservices.com/presentations/Tally-Em-Up

@feaselkl

feasel@catallaxyservices.com