Created by Kevin Feasel
@feaselkl
feasel@catallaxyservices.com
Catallaxy Services | @feaselkl |
|
Curated SQL | ||
We Speak Linux |
N |
0 |
1 |
2 |
3 |
... |
19999 |
20000 |
A classic tally table is nothing more than a table of numbers.
There is no fixed limit on size. Keep in mind that a SQL Server page contains approximately 8K worth of data, and an integer is 4 bytes. Factoring in overhead, we can put approximately 2000 values on a single page.
In my experience, I find that a tally table needn't be much larger than 20,000 records.
There are a number of methods which you can use to generate the numbers in a tally table. Here is a relatively efficient method:
--Suppress the auto-display of rowcounts for appearance/speed
SET NOCOUNT ON
IF OBJECT_ID('dbo.Tally') IS NOT NULL
DROP TABLE dbo.Tally
SELECT TOP 20000
IDENTITY(INT,0,1) AS N
INTO
dbo.Tally
FROM
Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
For small numbers (less than approximately 10,000), it is often faster to generate your tally table as a common table expression rather than storing it on disk. The following code shows you how to generate 10,000 numbers quickly.
WITH E1(N) AS
(
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT 0 UNION ALL
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E4
)
SELECT N FROM cteTally;
In the procedural programming world, loops are outstanding ways of iterating over a number of items. When procedural programmers learn SQL, they apply the same lesson to SQL and typically get poor performance. This is because the WHILE and CURSOR operators in T-SQL are slow.
We can use tally tables to replace looping in a large number of common problem areas, and that gets rid of a RBAR headache.
The rest of this presentation will cover a number of problems. Most of these are common enough that you'll recognize the pattern quickly, but some are a bit more esoteric in nature.
This presentation is not the end-all, be-all of what tally tables can do; they are quite a bit more useful than even this presentation can show.
We need to find a way to split some text. This might be to strip out titles from a name or to normalize a flat file that somebody dumped into a one-column table.
Suppose that we have a table with information on former Congressmen. This information came in from a comma-delimited file, and the end result looks a little like:
create table #CongressInformation
(
Id int IDENTITY(1,1),
String varchar(500)
);
insert into #CongressInformation(String) values
('Bob,Dole,Kansas,Senate,1969,1996'),
('Robert,Taft,Ohio,House,1920,1930'),
('Robert,Taft,Ohio,Senate,1938,1953');
When we need to loop through a string, looking for commas, the obvious answer is a cursor or WHILE loop.
Sample code to parse text using a WHILE loop:
WHILE LEN(@StringInput) > 0
BEGIN
SET @Value = LEFT(
@StringInput,
ISNULL(NULLIF(CHARINDEX(@pDelimiter, @StringInput) - 1, -1),
LEN(@StringInput))
);
SET @StringInput = SUBSTRING(
@StringInput,
ISNULL(NULLIF(CHARINDEX(@pDelimiter, @StringInput), 0),
LEN(@StringInput)) + 1,
LEN(@StringInput)
);
SET @ColumnNumber = @ColumnNumber + 1;
INSERT INTO @OutputTable ( [ID], [Value], [ColumnNumber] )
VALUES ( @CurrentID, @Value, @ColumnNumber )
END
Instead of using a WHILE loop (a RBAR technique), we can use a tally table to create a set-based solution and give us ALL of the results simultaneously, rather than one row at a time.
Sample code to parse text using a tally table:
SELECT
Id,
ROW_NUMBER() over (partition by Id order by N) as ColumnNumber,
SUBSTRING
(
',' + p.String + ',',
N+1,
CHARINDEX(',', ',' + p.String + ',', N+1 ) - N-1
) as Value
FROM
dbo.Tally t
CROSS JOIN #CongressInformation p
WHERE
N < LEN(',' + p.String + ',')
AND SUBSTRING(',' + p.String + ',', N, 1) = ','
In order to show which solution is truly better, let's go to Management Studio.
Open and run Demo 1 - Splitting Text.sql.
We have a flat set of questions following a common format. We want to pivot this question set dynamically (because it may change later) with as little hassle as possible. Unfortunately, we don't have good descriptions for each question, so we're going to use the column text to describe the question.
CREATE TABLE dbo.AllQuestions
(
PersonID int,
IsFeelingOkay bit,
IsHungry bit,
IsTired bit,
IsWearingBlue bit
);
PersonID | IsFeelingOkay | IsHungry | IsTired | IsWearingBlue |
Bob | 1 | 0 | 1 | 0 |
PersonID | Question | Answer |
Bob | Feeling Okay | 1 |
Bob | Hungry | 0 |
Bob | Tired | 1 |
Bob | Wearing Blue | 0 |
Before we describe how to use a tally table to solve this problem, first we need to get a valid set of questions. The following code uses the system tables to grab the correct columns.
SELECT
sc.name as ColumnName,
SUBSTRING(sc.name, 3, LEN(sc.name)) as SplitColumnName
FROM
sys.columns sc
INNER JOIN sys.tables st ON sc.object_id = st.object_id
INNER JOIN sys.schemas ss ON st.schema_id = ss.schema_id
INNER JOIN sys.types styp ON sc.system_type_id = styp.system_type_id
WHERE
ss.name = 'dbo'
and st.name = 'AllQuestions'
and sc.name like 'Is%'
and styp.name = 'bit'
With SplitColumnName in a Questions CTE, we have enough to use a tally table. The BINARY_CHECKSUM function is case-sensitive.
SELECT
q.ColumnName,
q.SplitColumnName,
SUBSTRING(q.SplitColumnName, n.N, 1) as Letter,
n.N,
CASE
WHEN BINARY_CHECKSUM(SUBSTRING(q.SplitColumnName, n.N, 1)) = BINARY_CHECKSUM(LOWER(SUBSTRING(q.SplitColumnName, n.N, 1))) THEN 0
ELSE 1
END AS IsUppercaseLetter
FROM
dbo.Tally n
CROSS JOIN Questions q
WHERE
n.N <= LEN(q.SplitColumnName)
Now we can put a space before each upper-case letter (other than at the beginning) and use STUFF + FOR XML PATH to glue our letters back together.
Time to put everything together in Management Studio.
Open and run Demo 2 - Splitting Out Camel Case.sql.
We need to generate some random data for a user. This might go into a QA environment (giving us de-identified data). In our case, we want to generate a random salt of 20 characters for each user (with sys.all_objects standing in as a proxy for a user table).
Once again, we're going to contrast the RBAR method with the tally table method.
Our procedural brains tell us that this is a good place for a loop. This code generates 20 characters per record using a cursor:
DECLARE @counter tinyint;
DECLARE @nextChar char(1);
DECLARE @randomString varchar(20);
DECLARE @objectID bigint;
declare @randomStrings table
(
id bigint,
string varchar(20)
);
DECLARE userCursor CURSOR FOR
SELECT Object_ID
FROM sys.all_objects;
OPEN userCursor
FETCH NEXT FROM userCursor INTO @objectID;
WHILE @@FETCH_STATUS = 0
BEGIN
SET @counter = 1
SET @randomString = ''
WHILE @counter <= 20
BEGIN
SELECT @nextChar = CHAR(ROUND(RAND(cast(abs(checksum(newid())) as int)) * 93 + 33, 0))
SELECT @randomString = @randomString + @nextChar
SET @counter = @counter + 1
END
insert into @randomStrings(id, string)
select @objectID, @randomString;
FETCH NEXT FROM userCursor INTO @objectID;
END
CLOSE userCursor;
DEALLOCATE userCursor;
select * from @randomStrings;
Instead of using a cursor loop (a RBAR technique), we can use a tally table to create a set-based solution and give us ALL of the results simultaneously, rather than one row at a time.
Sample code to parse text using a tally table:
with saltchars as
(
select
object_id as Id,
N,
CHAR(ROUND(RAND(cast(abs(checksum(newid())) as int)) * 93 + 33, 0)) as randomvalue
from
sys.all_objects u
cross join dbo.Tally t
where
t.N between 1 and 20
)
select distinct
Id,
substring((select ('' + randomvalue )
from saltchars s2
where s1.Id = s2.Id
order by N
for xml path ( '' )
), 1, 20) as RandomValue
from
saltchars s1;
In order to show which solution is truly better, let's go to Management Studio.
Open and run Demo 3 - Generating Random Text.sql.
We have a data set that we need to validate. There are a number of potential errors, and we want to report on every error we find, rather than stopping at the first and quitting.
In this case, we'll contrast two set-based methods: a tally table and a UNION ALL combination.
Here is our data table structure:
CREATE TABLE #Data
(
Id int NOT NULL,
Field1 varchar(50) NOT NULL,
Field2 varchar(50) NULL,
SomeNum1 int NOT NULL,
SomeNum2 int NULL
);
And here are the scenarios that would trigger an error:
No RBAR here, but we have four table scans.
SELECT Id, 'Field2 is NULL!' as ErrorMessage FROM #Data WHERE Field2 IS NULL
UNION ALL
SELECT Id, 'SomeNum2 is NULL!' FROM #Data WHERE SomeNum2 IS NULL
UNION ALL
SELECT Id, 'Field1 is bigger than Field2!' FROM #Data WHERE Field1 > Field2
UNION ALL
SELECT Id, 'SomeNum1 is bigger than SomeNum2!' FROM #Data WHERE SomeNum1 > SomeNum2
ORDER BY Id, ErrorMessage;
This is the first time in which our code has become more complicated as a result of having a tally table. Cutting from N scans to 1 does make it worthwhile, though.
WITH tally as
(
SELECT 1 as N, 'Field2 is NULL!' as ErrorMessage
UNION ALL
SELECT 2 as N, 'SomeNum2 is NULL!' as ErrorMessage
UNION ALL
SELECT 3 as N, 'Field1 is bigger than Field2!' as ErrorMessage
UNION ALL
SELECT 4 as N, 'SomeNum1 is bigger than SomeNum2!' as ErrorMessage
)
SELECT
d.Id,
t.ErrorMessage
FROM
#Data d
CROSS JOIN tally t
WHERE
(
(t.N = 1 AND d.Field2 IS NULL)
OR (t.N = 2 AND d.SomeNum2 IS NULL)
OR (t.N = 3 AND d.Field1 > d.Field2)
OR (t.N = 4 AND d.SomeNum1 > d.SomeNum2)
);
In order to show which solution performs better, let's go to Management Studio.
Open and run Demo 4 - Generating Multiple Rows.sql.
We have ranges set up in a table. These ranges should be unique but aren't. Given a precedence rule, how do we generate guaranteed unique ranges?
create table #Data
(
Id int identity(1,1),
Person int,
BeginValue int,
EndValue int
);
Data | ||
Person | Begin Value | End Value |
Bob | 1 | 8 |
Bob | 10 | 15 |
Bob | 13 | 17 |
Jane | 2 | 9 |
Jane | 3 | 6 |
Corrected | ||
Person | Begin Value | End Value |
Bob | 1 | 8 |
Bob | 10 | 15 |
Bob | 16 | 17 |
Jane | 2 | 9 |
We use a tally table to give us the first record which hit a particular value:
select
t.N,
d.Person,
MIN(d.Id) as FirstId
from
#Data d
inner join dbo.Tally t
on t.N between d.BeginValue and d.EndValue
group by
t.N,
d.Person
We can now create our corrected ranges:
select
Person,
FirstId as Id,
MIN(N) as NewBeginValue,
MAX(N) as NewEndValue
from
(
select
t.N,
d.Person,
MIN(d.Id) as FirstId
from
#Data d
inner join dbo.Tally t
on t.N between d.BeginValue and d.EndValue
group by
t.N,
d.Person
) arrangements
group by
Person,
FirstId
Time to put everything together in Management Studio.
Open and run Demo 5 - Determining Unique Ranges.sql.
We receive date as a text field. They should come in as YYYYMMDD strings, but there is no input validation. We need to filter out all of the bad dates.
--Date validation
declare @somedates table
(
adate varchar(8)
);
insert into @somedates(adate) values
('19990501'),
('20111499'), /* Obviously bad. */
('20090101'),
('20111305'), /* Bad...or is it? */
('ababyyyy'), /* Obviously bad. */
('9901'), /* Should be bad. */
('Jun 1 09'), /* Should be bad. */
('20030419');
We use a tally table to give us the set of valid dates in our date range:
DECLARE @StartDate date = '1991-01-01';
DECLARE @EndDate date = '2015-01-01';
SELECT
CONVERT(varchar(8), DATEADD(d, N, @StartDate), 112) as ValidDate
FROM
dbo.Tally
WHERE
DATEADD(d, N, @StartDate) <= @EndDate
We can now determine which fields are valid input without having to perform a RBAR function:
with tallyDates as
(
SELECT
CONVERT(varchar(8), DATEADD(d, N, @StartDate), 112) as ValidDate
FROM
dbo.Tally
WHERE
DATEADD(d, N, @StartDate) <= @EndDate
)
SELECT
sd.*
FROM
@somedates sd
left outer join tallyDates td on sd.adate = td.ValidDate
WHERE
td.ValidDate is null;
Why shouldn't we just use the ISDATE() function within SQL Server?
Input | ISDATE Result |
19990501 | 1 |
20111499 | 0 |
20090101 | 1 |
20111305 | 0 in EN; 1 in FR! |
ababyyyy | 0 |
9901 | 1 |
Jun 1 09 | 1 |
20030419 | 1 |
Time to put everything together in Management Studio.
Open and run Demo 6 - Date Checking and Filtering.sql.
We have an input "array" for a stored procedure. Because SQL Server doesn't handle array types, a developer created a 2D pseudo-array with two delimiters: comma for attributes and pipe for elements. Our job is to split out the elements in that array and re-create a table based on that string.
DECLARE @Parameter VARCHAR(8000) =
'Part#1,PartName1,1|Part#2,PartName2,2|Part#3,PartName3,3';
PartNumber | PartName | QuantityOrdered |
Part#1 | PartName1 | 1 |
Part#2 | PartName2 | 2 |
Part#3 | PartName3 | 3 |
We use one tally table, but need a little bit of setup work first:
DECLARE @GroupCount INT;
SELECT @GroupCount =
-- Find the length of the first group...
LEN(SUBSTRING(@Parameter,1,CHARINDEX('|',@Parameter)))
-- ... subtract the length of the first group without any commas...
- LEN(REPLACE(SUBSTRING(@Parameter,1,CHARINDEX('|',@Parameter)),',',''))
-- ... and add 1 because there is always 1 element more than commas.
+ 1;
SET @Parameter = ',' + REPLACE(@Parameter,'|',',') + ',';
Now join in the tally table and set rows and columns based on GroupCount.
INSERT INTO Elements (ElementValue)
SELECT
SUBSTRING(@Parameter, N+1, CHARINDEX(',',@Parameter, N+1) - N-1)
FROM
dbo.Tally
WHERE
N < LEN(@Parameter)
AND SUBSTRING(@Parameter, N, 1) = ',';
--===== Calculate and update the "row" and "column" indices
UPDATE Elements
SET
RowNum = (ElementNumber-1) / @GroupCount, --"Zero" based row
ColNum = (ElementNumber-1) % @GroupCount; --"Zero" based col
--',Part#1,PartName1,1,Part#2,PartName2,2,Part#3,PartName3,3,'
SELECT
SUBSTRING(@Parameter, N+1, CHARINDEX(',',@Parameter, N+1) - N-1)
FROM
dbo.Tally
WHERE
N < LEN(@Parameter)
AND SUBSTRING(@Parameter, N, 1) = ',';
To begin constructing a table, we need column names and numbers. We can join the Elements table to our columns table.
declare @columns table
(
ColumnNumber int,
Name nvarchar(30)
);
insert into @columns(ColumnNumber, Name) values
(0, 'PartNumber'), (1, 'PartName'), (2, 'QuantityOrdered');
select
e.ColNum,
'max(case when colnum = '
+ cast(e.ColNum as nvarchar(2))
+ ' then ElementValue end) as '
+ c.Name
from
Elements e
inner join @columns c on e.ColNum = c.ColumnNumber
where
e.RowNum = 0
To begin constructing a table, we need column names and numbers. We can join the Elements table to our columns table.
Expected Results After Joining Elements Table To Columns Table | |
ColumnNumber | ColumnText |
0 | max(case when colnum = 0 then ElementValue end) as PartNumber |
1 | max(case when colnum = 1 then ElementValue end) as PartName |
2 | max(case when colnum = 2 then ElementValue end) as QuantityOrdered |
With the columns generated by records, we can put it back together as one SQL statement.
select distinct
RecordText = STUFF((
select ( ', ' + ColumnText )
from records r2
order by ColumnNumber
for xml path ( '' )
), 1, 1, '')
from
records r1
/* RESULTS
max(case when colnum = 0 then ElementValue end) as PartNumber,
max(case when colnum = 1 then ElementValue end) as PartName,
max(case when colnum = 2 then ElementValue end) as QuantityOrdered
*/
Time to put everything together in Management Studio.
Open and run Demo 7 - Dynamic 2D Array Splitting.sql.
Tally tables excel in certain administrative and data cleansing activities:
They also work great with pattern search problems:
Although tally tables are extremely useful when dealing with denormalized data, they are NOT a replacement for sound database design and normalization techniques.
Slides and code are available at http://www.catallaxyservices.com/presentations/Tally-Em-Up
@feaselkl
feasel@catallaxyservices.com