Talk:Huffman coding
This is the talk page for discussing improvements to the Huffman coding article. This is not a forum for general discussion of the article's subject. |
Article policies
|
Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL |
Archives: 1 |
This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | ||||||||||||||||||||||||||||||||||||||
|
Optimality
[edit]This section has several inaccuracies and would benefit from links to other pure entropy-encoding algorithms.
"Although Huffman's original algorithm is optimal for a symbol-by-symbol coding (i.e., a stream of unrelated symbols) with a known input probability distribution,"
This is incorrect. Huffman is optimal for a prefix code algorithm, but there are non-prefix code coders that give better compression so clearly Huffman is not optimal out of all of them. (What it has in advantage is simplicity, especially in hardware implementations.) I also don't understand why Arithmetic coding and LZW are grouped together here. That is muddling up modelling with entropy encoding. Many of the most widely used LZ algorithms also utilise huffman (eg gzip / Deflate) - these are totally orthogonal issues and it is confusing to the reader to imply otherwise. I would therefore drop LZW from this statement and add in Asymmetric Numeral System instead as it fits along side Arithmetic Coding as a direct alternative to entropy encoding and we should be unbiased in our listing of alternatives.
"Prefix codes tend to have inefficiency on small alphabets, where probabilities often fall between these optimal points. The worst case for Huffman coding can happen when the probability of a symbol exceeds 2−1 = 0.5, making the upper limit of inefficiency unbounded. Combining symbols together ("blocking") can be used to make Huffman coding as efficient as theoretically possible."
Agreed on the principle of symbol combining to improve efficiency, but it cannot make huffman "as efficient as theoretically possible" unless we want to encode the entire stream as a single combined symbol (and then how do we encode the table?). Rather it reduces the inefficiency. Nothing quite matches the theoretical shannon limit. Huffman isn't even as efficient as most arithmetic coders, even with blocking. Also if we have a large alphabet with one very common symbol (eg byte 0 out of 256 byte values - all used) then symbol combining cannot solve this unless we move to a larger alphabet size than 256 which brings its own set of inefficiencies when implementing. I see therefore why RLE is added here too, but again it starts to feel like we're muddling up modelling with entropy encoding at this point.
Symbol combining and RLE are valid and well used techniques (I've used both myself, even with other more advanced entropy encoders) so it's good to have their inclusion here to educate readers on how to practically work around the limitations if and when they crop up. However it could perhaps do with some practical examples. Eg the most obvious is a stream of just two symbols with no inter-symbol correlation; eg A A A B A A B A B A. p(A)=7/10, p(B)=3/10 giving around 0.88 bits per symbol, but huffman can only ever assign 0 and 1 giving exactly 1 bit per symbol and 0% compression. We could combine them into AA AB AA BA BA symbols, which on a larger data set would offer some compression using huffman bit lengths 1(AA), 2(AB) and 3(BA, BB) offering .905 bits per symbol. Further improvements can be made by combining more symbols together into successively large alphabets. Such techniques do not work with an already large alphabet, requiring alternative methods such as Run Length Encoding.
80.229.230.181 (talk) 09:32, 24 July 2017 (UTC) James
- Regarding your last paragraph, what you want sounds very similar to Arithmetic coding#Huffman coding, so we could copy that text or have a "see also." As far as what entropy-approaching alternatives we should list, they should be those the reader would be more likely to be familiar with. (The idea to only have one or two is to avoid confusion by grouping together many arcane compression methods. Googling the various compression methods shows that arithmetic coding is far more commonly known than similar alternatives, so it's best to stick with that.) It wouldn't hurt to name only arithmetic coding (leaving out mention of LZW).
- I don't think there's any inherent problem with mixing entropy coding techniques with modeling, though, since both are approaches to achieving/improving compression, especially with blocking and run-length coding, which turn Huffman from something that can be useless (e.g., for compressing binary symbols) to something that approaches entropy. And "approaches entropy" would be more accurate than "as efficient as theoretically possible," but the latter is more readable. A lot of this is about readability versus completeness and strict accuracy. We could say, "The Huffman code approaches entropy - i.e., optimal compression - using blocking. However, the practicality of such an approach is usually limited by the size of the resulting coding trees, in terms of both storage and complexity. There is one form of blocking, run-length encoding, without this drawback, at least on binary inputs." That's more accurate.
- My concern is that this section will explode into something far too big for a simple encyclopedia article, or become a place for people to plug their favorite compression technique. But there is certainly room for improvement. Calbaer (talk) 14:49, 24 July 2017 (UTC)
- Your revised summary is better, but still inaccurate. You are mistaking pattern removal with entropy coding by implying that run-length encoding has no drawbacks and solves the issue of Huffman being unable to compress a bit stream. Consider the example of a random selection of heads and tails with a weighted coin that gives tails 2/3rds of the time and heads 1/3rd. RLE just adds noise and doesn't help here. Try it using zlib with Z_HUFFMAN_ONLY vs Z_RLE and you'll see the file grows with Z_RLE, and neither of them even get to 1 bit per symbol (due to EOF). Accuracy *does* matter - this is an encyclopedia and accuracy should always be paramount. If it's too complex to explain accurately then perhaps it's not appropriate for the site, but we should never slide into inaccuracy simply to make it easy to read. Therefore my suggestion here would be to simply state that huffman alone does not encode optimally in some cases and suggest that in *some* cases this can be worked around, but keep the specifics minimal. (James) — Preceding unsigned comment added by 193.62.202.135 (talk) 08:47, 21 August 2017 (UTC)
- Which part of the wording do you find inaccurate? Characterizing RLE as a form of "blocking"? Or something else? I don't think there's any implication that RLE has no drawbacks, since the text specifically says that it works, "For the simple case of Bernoulli processes," implying suboptimality elsewhere.
- Also, your example of a 2/3-1/3 coin seems wrong here. Without combining symbols, compression would be 0%. With RLE and Golomb coding (encoding run length using a unary code plus one bit per run of heads), it's about 6.7%. (That's equivalent to n=1, p=2/3 in the equation at Golomb_coding#Use_for_run-length_encoding.) That's not quite the 8.2% you could get with optimal entropy coding, but it is compression, not just noise. (As for whether a given library would implement this properly, that's another matter all together.)
- The idea of fixed-length blocks is largely of theoretical import, but is important for that reason. Using RLE instead of dumbly encoding input is an important demonstration of how doing the right thing before entropy coding can turn a coding method that looks "bad" (because it's being used poorly) to one that's "good" (because it's being used well). Your example appears to illustrate a common misconception here; you're not the first person to present me with a Bernoulli example as evidence of where Huffman coding doesn't compress, when proper use of RLE shows that it does.
- Still, I'm not sure how important it is to distinguish which part of compression is categorized as "entropy coding" and which part isn't. Like I said, I don't see how the current text - including omitting that distinction in favor of just linking to the RLE article - is inaccurate. Calbaer (talk) 14:18, 21 August 2017 (UTC)
n-ary Huffman coding
[edit]Under the n-ary Huffman coding section, it is mentioned that "The same algorithm applies as for binary (n equals 2) codes, except that the n least probable symbols are taken together, instead of just the 2 least probable." - but is that really the case?
Let's look at a simple example: trinary alphabet (n equals 3), with four symbols (call them w, x, y and z, in increasing order of probability). The algorithm as described will group w, x and y under a single node, and then (supposedly, as there are only two nodes remaining, not three, so this isn't explained explicitly, but I think this is the reasonable way to understand it) that new node and z under one more new node, which ends up as the root:
root / \ wxy z / | \ w x y
So the root only has two children (which is OK, the article does mention the possibility that "additional 0-probability place holders must be added"). But surely this can't be the optimum, as better compression would be achieved by moving one of the first three symbols (optimally y) to occupy the free position directly under the root, rather than its current, deeper position:
root / | \ wx y z / \ w x
This prompted me to look at Huffman's paper (http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf as referred to from the article), where I think the issue is actually addressed, by saying (left column of the fourth - and last - page, which is page 1101) "... it will be required that all these brackets, with the possible exception of one combining the least probable messages of the original ensemble, always combine a number of messages equal to D." (emphasis mine).
I think the article should be fixed to reflect this and to explain the algorithm more clearly. And I'd rather leave this task to someone more proficient than myself (but I might do it myself eventually if I notice in some time that this hasn't been addressed). 95.90.235.121 (talk) 12:00, 29 December 2020 (UTC)
¿Huffman tree?
[edit]What it's called Huffman tree it's called in bibliography as Kraft Tree--138.4.177.124 (talk) 15:20, 16 February 2021 (UTC)
Better applications section?
[edit]The applications section needs some work. The first paragraph compares Huffman codes to Arithmetic coding (see the rest of the drama in this talk page), which is not an application. The second paragraph describes some compression algorithms that use prefix codes which are specifically described as NOT Huffman codes.
So... the applications section has no applications in it.
J2kun (talk) 02:00, 4 May 2021 (UTC)
- Gof for it. Prefixed codes are not huffman codes for sure (judging by papers that I've seen so far). AXONOV (talk) ⚑ 15:14, 3 December 2021 (UTC)
Finite vs infinite alphabet
[edit]The definition speaks of sets, but doesn't say finite. Does the tree construction still work out OK for infinite alphabets and trees of infinite depth? Clearly one can't work from the leaves combining least-probably nodes if there are infinitely many. See Golomb code, the last part of this edit where I'm not so sure about what I said. Dicklyon (talk) 16:07, 12 July 2021 (UTC)
Simpler explanation
[edit]Current § Problem definition subsection is overly abstract. I got a paper described in much simpler terms how to get huffman code for a given set of symbols (probability values). No hard math involved, take a look: Huffman Offline.pdf AXONOV (talk) ⚑ 15:18, 3 December 2021 (UTC)
Wrong bit quantity/comparison in the example at the beginning
[edit]I think there´s some kind of "mistake" in the description: "this code requires 135 (or 147) bits, as opposed to 288 (or 180) bits if 36 characters of 8 (or 5) bits were used.". There are only 16 sympbols listed in the table given below the tree-graphics, so for a FAIR comparison, only 4 bits are required to binary code 16 different symbols. Please don´t compare most wasteful ASCII-Coding with the Huffman Code, please don´t limit Huffman Coding to text-data! If it comes to the GAIN everybody is greedily looking for, it´s 135 bits used by huffman-coding (without codebook and decoder) versus 36 chars * 4 binary code bits = 144 bits (NOT 288 or 180 given in the description!). I know it does look much less impressive, if you honestly say "only 144-135=9 bits gain in this example, and overhead for codebook and decoder is even discounted!". I ask for a fair comparison, because everybody should face the truth how hard it is to find powerful algorithms for data compression. Don´t try to make Huffman Coding better than it is, don´t feed the "magic function theory". And PLEASE also show an example where beloved Huffman Coding clearly FAILS! — Preceding unsigned comment added by 84.115.238.220 (talk) 05:47, 3 June 2023 (UTC)