From Exponential to Efficient: Building Commafree Codes with Knuth's Algorithm C

Introduction

What if you could create a set of codes that can be uniquely decoded without any commas or delimiters – what would that mean for DNA sequencing, data compression, and error correction?

This is not purely academic curiosity – commafree codes have real-world applications in all of these fields. In my previous post, we explored a recursive algorithm with exponential complexity. Today, we’ll dive into Knuth’s Algorithm C, which achieves remarkable efficiency through clever backtracking and sparse data structures.

What you’ll learn in this post:

How Algorithm C reduces complexity from O(2^n) to manageable levels
The ingenious “poison array” technique for pruning the search space
Practical implementation tips and performance optimizations

Whether you’re working through TAOCP 4B or simply love elegant algorithms, this deep dive will change how you think about combinatorial search.

Algorithm C

Framework

The algorithm framework consists of five steps:

Initialize arrays
Select the primary codeword
Propagate the effects of the selection
Recurse and print results
Backtrack

Initialization

This step deserves its own section because, beyond the data structures introduced in the previous post, there are no fewer than 10 arrays to initialize. Their main purposes are:

Tracking colors and candidate sets
Recording recursion state for backtracking

I tried describing this in pseudocode but found it far too verbose, and single-letter variable names like x and c are a headache to follow. Here we’ll focus only on the MEM array.

MEM Array Initialization Example

Let’s return to the example from the end of the previous post:

alt text

This is the MEM array we need to initialize when m=2. It happens to have exactly 16 entries per row, storing all 4-bit binary values.

The first row records the color of each code:

#0000 #0100 #1000 #1010 #1111 are already red
Three of these being red makes sense – they follow the abab pattern and are inherently excluded

Here’s the puzzle: why are #0100 and #1000 also colored red?

Try to figure it out yourself first – I’ll reveal the answer at the end of the post.

We’ll skip the remaining auxiliary arrays and variables and move on to the next step.

Trying Primary Codewords

Next, we boldly pick a blue-colored code from the candidate set. Once selected, we turn it green. Suppose we pick #0010 this time. We record this choice and proceed to the next step.

Update & Mark

1. Updating Sparse Sets

After selecting #0010, we can quickly update the 7 sparse sets according to their definitions, which in turn updates the colors in the state array.

Handling P sets:

We simply “close” the prefix linked lists corresponding to #0010
For example, P1 in MEM[20-24] records codes starting with \(0\star \star \star\)
We just set MEM[30] to the list head minus 1, i.e., MEM[30]=1f
P2 works the same way – “close” \(00\star \star\)
S sets are handled identically

Handling CL sets:

The CL set records the cyclic shift set of blue codewords
After selecting #0010, we need to remove all entries in its shift set
That is, the set represented by MEM[140-141], setting MEM[150]=13f

Before and after comparison:

Before: before cl

After: after cl

State Marking

You may have noticed that #1001 is also missing from the updated CL set. Why is that? This is where we introduce yet another array: poison. After selecting #0010, we add

\(\)(\star001, 0\star \star \star), (\star \star00,10\star \star), (\star \star \star 0,010\star)\(\)

to the poison array. For the first pair, since \(0\star \star \star\) contains the green #0010, \(\star001\) must be red. This means #1001 becomes red, which explains why it was also removed from the CL set above (and from the other 6 sets as well). In other words, every prefix-suffix pair must contain at least one green and one red. Similarly, for the third pair, since \(\star \star \star0\) already contains a green code, everything matching \(010\star\) must be red. For the second pair, since both components contain only blue codes, we can’t eliminate anything yet, so we leave it for now.

Backtracking

When we reach a dead end with no blue-colored codes remaining, it’s time to backtrack. Thanks to the elegant data structures, we simply reverse the Update & Mark operations. For entries recorded in the poison array, we just add back the individually removed codes to their respective sets. A special note here: since the sparse set “deletion” only swaps the target element to the end without physically removing it, restoring it is as simple as adjusting the TAIL pointer. For the selected candidate code itself, we simply “open” its set back up – again, just a TAIL pointer adjustment.

Summary

Key Takeaways

While search algorithms inevitably have exponential complexity, clever data structures and search strategies can still dramatically improve efficiency.

Key optimization techniques:

Sparse sets: Fast updates and backtracking
Poison array: Intelligent search space pruning
Color marking: Efficient state management

Answer to the Earlier Puzzle

Why are #0001 and #0010 red during initialization?

Because the problem definition requires the solution to include either #0001 or #0010. Regardless of which one is chosen, #0100 and #1000 belong to the same primary code as #0001 or #0010, so they are excluded.

🚀 Try It Yourself

Want to experience Algorithm C firsthand? Here's a simplified implementation skeleton:

```python # Simplified Algorithm C core logic def algorithm_c_skeleton(): """ Add your implementation code here. Start with m=2 and work your way up to understand the algorithm flow. """ pass ```

Challenge: Can you implement the poison array update logic?

Exercises and Food for Thought

The most complex part is arguably the tables in the book. If you happen to have a copy, you can find Table 1 and Table 2 for Algorithm C, which illustrate Steps 1 and 2 respectively. But when I worked through them hands-on, I discovered some interesting things. Here are a few questions to help you better understand the inner workings of MEM.

Challenge Problems

The Sorting Mystery: In Table 1, row 20 is split into two halves with 5 entries each. Why does the first half list #0001 #0010 #0011 in ascending order, but the first entry of the second half is #1100 rather than #1001?
Address Resolution: Following up on the previous question, can your answer explain MEM[5d] and MEM[5e]? What about MEM[bc] and MEM[bd]?
Assignment Order: In Table 1, what is the assignment order for the last 3 rows of the cl array? Why is MEM[146] set to #1100 while MEM[147] is #1001, placed after it?
Initialization Logic: Why are #0100 and #1000 excluded first during initialization? (Were you paying attention earlier? 😉)

💡 Hint

These questions all point to the same core concept: the storage strategy of sparse sets. Think carefully about list heads, TAIL pointers, and element swapping logic.

Afterword

In truth, these 3 posts are merely a surface-level interpretation of Algorithm C from Knuth’s TAOCP 4B. While the book lays out the algorithm in numbered steps 1-2-3-4-5, I don’t think presenting branching code as a numbered list is a great idea. Pseudocode is far clearer, so in my exposition I completely abandoned the book’s tedious step-by-step format in favor of an approach that feels more natural to a programmer’s intuition.

A Fun Discovery

PS: In Ex41 and the implementation of Algorithm C, Knuth occasionally writes MEM addresses as 40d when they should be 4d. Unfortunately, someone has already spotted this error. If you want to get your hands on one of Knuth’s signed checks from the Bank of San Serriffe, you’d better hurry!

🎯 What did you think of this post?

Feel free to share in the comments:

Your experience implementing Algorithm C
Insights on sparse set optimizations
Other interesting backtracking algorithms

Like + Bookmark to help more people discover this deep dive!

MoreFreeze's Sanctuary

算法 · Go · Vibe Coding

From Exponential to Efficient: Building Commafree Codes with Knuth's Algorithm C 中文

Introduction

Algorithm C

Framework

Initialization

MEM Array Initialization Example

Trying Primary Codewords

Update & Mark

1. Updating Sparse Sets

State Marking

Backtracking

Summary

Key Takeaways

Answer to the Earlier Puzzle

🚀 Try It Yourself

Exercises and Food for Thought

Challenge Problems

💡 Hint

Further Reading

Advanced Resources

Hands-on Projects

Afterword

A Fun Discovery

🎯 What did you think of this post?

About Me

Recent Posts

Categories

Tags

Blogroll

Archives

Copyright Notice

Introduction

Algorithm C

Framework

Initialization

MEM Array Initialization Example

Trying Primary Codewords

Update & Mark

1. Updating Sparse Sets

State Marking

Backtracking

Summary

Key Takeaways

Answer to the Earlier Puzzle

🚀 Try It Yourself

Exercises and Food for Thought

Challenge Problems

💡 Hint

Further Reading

Advanced Resources

Hands-on Projects

Afterword

A Fun Discovery

🎯 What did you think of this post?

推荐阅读

About Me

订阅更新

Recent Posts

Categories

Tags

Blogroll

Archives

Copyright Notice