The compression principle and implementation of ZIP

  1. File compression principles

    When we use a computer to process files, every file occupies a certain amount of disk space. We would like the files that are seldom used but too important to delete (such as backup files, a bit like chicken ribs: of little use yet a pity to discard) to occupy as little disk space as possible. However, many file formats are stored rather loosely, so valuable storage resources are wasted. A compression tool solves this problem: by compressing the original file we can store it in less space and decompress it when it is needed again, which greatly saves disk space. Compression also improves efficiency when copying many small files, because frequent file-positioning operations take a lot of time; if the small files are first packed into a single archive, copying becomes much more convenient.

    Because a computer represents all information as binary numbers, compression software works by marking identical strings in the binary data with shorter special codes. To get a feel for this, picture a photograph of a blue sky with white clouds. For the thousands of monotonous, repeated blue pixels, instead of storing "blue, blue, blue, ..." as a long string of colors, it is better to tell the computer: "starting from this position, store 1117 blues". That is much simpler and saves a great deal of storage space. This is a very simple example of image compression. In fact all computer files are ultimately stored as "1"s and "0"s, and just like the blue pixels, as long as a reasonable mathematical formula is applied they can be greatly compressed, achieving lossless data compression.

    In general, compression is divided into two kinds, lossy and lossless. If the loss of a few individual data points does not matter much, ignoring them is a good idea; this is lossy compression. Lossy compression is widely used for video, sound, and image files, the typical representatives being the video format MPEG, the music format MP3, and the image format JPG. In many more cases, however, the data must be restored exactly, so lossless compression formats were designed, such as the familiar ZIP and RAR. Compression software is simply a tool that applies these compression principles to data. The file produced by compression is called an archive, and its size may be only a quarter of the original or even smaller. Of course, the archive is a different file format; to use the data it must first be restored with the compression software, a process called decompression. Common compression software includes WinZip and WinRAR.

  2. The principle part
    There are two forms of repetition in computer data, and ZIP compresses by exploiting both. One is repetition in phrase form, that is, repetition of three or more bytes. For this kind of repetition ZIP uses two numbers: 1) the distance between the earlier occurrence and the current compression position, and 2) the length of the repetition. Assuming these two numbers each take one byte, the data is compressed, which is easy to understand.
    One byte has 256 possible values (0 to 255), three bytes have 256 * 256 * 256, more than 16 million, possible combinations, and longer phrases have even more, so the probability of a phrase repeating by chance seems extremely low. In reality, however, all kinds of data tend to repeat. In a paper, a few technical terms appear again and again; in a background picture with a vertical gradient, the pixels repeat in the horizontal direction; in program source code, syntax keywords appear over and over (how many times do we copy and paste while writing a program?). In loosely formatted, uncompressed data of tens of kilobytes, a large number of phrase repetitions appear. After the phrase compression just described, the tendency of phrases to repeat is completely destroyed, so applying phrase compression a second time to the result generally has little effect.
    The second kind of repetition is single-byte repetition. A byte has only 256 possible values, so this repetition is inevitable. Some byte values appear more often and others less often; the distribution tends to be uneven. This is easy to understand: in an ASCII text file, for example, some symbols are rarely used while letters and digits appear often, and even among the letters the frequencies differ, the letter e reportedly being the most frequent; many pictures are predominantly dark or predominantly light, so dark (or light) pixels are used more (a digression here: the PNG picture format is a lossless format whose core algorithm is the ZIP algorithm; its main difference from a ZIP-format file is that, being a picture format, it stores information such as the image size and the number of colors used at the head of the file). The result of the phrase compression mentioned above also shows this tendency: repetitions tend to appear close to the current compression position, and match lengths tend to be short (within 20 bytes). This opens up a possibility for compression: re-encode the 256 byte values so that the more frequent bytes get shorter codes and the less frequent ones get longer codes. If the bytes that become shorter outweigh the bytes that become longer, the total file length decreases, and the more uneven the byte usage, the greater the compression ratio.
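    To see this unevenness for yourself, here is a minimal Python sketch (an illustration only, not part of ZIP or PNG) that counts how often each byte value occurs in a file; the file name is only a placeholder.

        # Count byte frequencies to see how uneven the distribution of values is.
        from collections import Counter

        def byte_histogram(path, top=10):
            with open(path, "rb") as f:
                data = f.read()
            counts = Counter(data)
            total = len(data)
            for value, count in counts.most_common(top):
                print(f"byte 0x{value:02x}: {count} times ({100.0 * count / total:.1f}%)")

        # byte_histogram("some_text_file.txt")   # hypothetical file name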
    Before discussing the requirements and methods of encoding, it should first be mentioned that encoding compression must be performed after phrase compression, because encoding compression destroys the original eight-bit byte boundaries, and with them the tendency of short phrases to repeat would also be destroyed (unless one decoded first). In addition, the result of phrase compression (the remaining unmatched single and double bytes, and the match distance and length values) still has an uneven value distribution. Therefore the order of the two compression passes cannot be changed.
    After encoding compression, consecutive groups of eight bits are taken as a byte, and the uneven distribution of byte values seen in the original uncompressed file is completely destroyed: the values become essentially random. According to statistics, random values tend toward a uniform distribution (as in a coin-tossing experiment: toss a coin a thousand times and the numbers of heads and tails are both close to 500). Therefore the result of encoding compression cannot be compressed again.
    Phrase compression and encoding compression are the two lossless compression methods developed by the computer science community; a compression algorithm that could be applied to its own output again and again without limit is unimaginable, because it would eventually compress everything down to 0 bytes.
    The repetition tendency of phrases and the uneven distribution of byte values are the foundation that makes compression possible. The reason the order of the two compression passes cannot be swapped has been explained above. Now let us look at the requirements and methods of encoding compression:
    First of all, in order to represent single characters with variable-length codes, the codes must satisfy the prefix-coding requirement: a shorter code must never be the prefix of a longer code. In other words, no character's code may be another character's code with several bits of 0 or 1 appended; otherwise the decompressor will not be able to decode the stream.
    A simple example of a prefix code:
    symbol code
    a 0
    b 10
    c 110
    d 1110
    e 11110
    With the above code table, you can easily pick out the real information content from the following binary stream:
    10011011110, which can only be read as b, a, c, e.
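    Here is a minimal Python sketch (illustrative only) of how a decoder exploits the prefix property with the code table above: it accumulates bits until they match exactly one codeword, emits that character, and starts over.

        # Prefix decoding: no codeword is a prefix of another, so the first
        # codeword that matches the accumulated bits is the only possible one.
        CODES = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "11110"}
        DECODE = {bits: ch for ch, bits in CODES.items()}

        def decode(stream):
            out, buf = [], ""
            for bit in stream:
                buf += bit
                if buf in DECODE:
                    out.append(DECODE[buf])
                    buf = ""
            if buf:
                raise ValueError("stream ends in the middle of a codeword")
            return "".join(out)

        print(decode("10011011110"))   # prints "bace"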
    How can we construct variable-length codes that satisfy the prefix requirement? A binary tree is an ideal choice. Examine the binary tree below:
                        (root)
                     0 /      \ 1
                      /        \
                  0 / \ 1    0 / \ 1
                   /   \      /   \
                  a     *    d     e
                     0 / \ 1
                      /   \
                     b     c

    (the * is an unlabeled internal node) The characters to be encoded always appear on the leaves. Suppose that, walking from the root down to a leaf, we write 0 for every left turn and 1 for every right turn; then a character's code is the path from the root to the leaf on which that character sits. Because characters appear only on leaves, no character's path can be the prefix of another character's path, so the prefix-coding requirement is satisfied. The codes in the tree above are: a - 00, b - 010, c - 011, d - 10, e - 11.
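    Here is a minimal Python sketch (illustrative only) that derives each character's code from its root-to-leaf path in the tree above; representing the tree as nested tuples is simply an assumption made for this example.

        # Walk the tree, writing 0 for a left branch and 1 for a right branch.
        # Leaves are strings; internal nodes are (left, right) pairs.
        tree = (("a", ("b", "c")), ("d", "e"))

        def assign_codes(node, path=""):
            if isinstance(node, str):        # a leaf: the path walked so far is its code
                return {node: path}
            left, right = node
            codes = assign_codes(left, path + "0")
            codes.update(assign_codes(right, path + "1"))
            return codes

        print(assign_codes(tree))   # {'a': '00', 'b': '010', 'c': '011', 'd': '10', 'e': '11'}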
    Next, let us look at the process of encoding compression. To simplify the problem, assume that a file contains only the five characters a, b, c, d, and e, with the following frequencies:
    a: 6 times
    b: 15 times
    c: 2 times
    d: 9 times
    e: 1 time
    Encode each character with three bits: a: 000 b: 001 c: 010 d: 011 e: 100
    Then the length of the entire file is 3*6 + 3*15 + 3*2 + 3*9 + 3*1 = 99 bits.
    Represent these codes with a binary tree (the number on each leaf node is the character's frequency, and the number on each non-leaf node is the sum of its children's numbers):

                    root
                     |
             -------33-------
             |              |
          --32--            1
          |    |            |
        -21-  -11-         -1-
        |  |  |  |          |
        6  15 2  9          1

    (If a node has only one child, the child node can be removed.)

                    root
                     |
             -------33-------
             |              |
          --32--            1
          |    |
        -21-  -11-
        |  |  |  |
        6  15 2  9
    The codes are now: a: 000 b: 001 c: 010 d: 011 e: 1, which still satisfies the prefix-coding requirement.
    Step 1: if a lower-layer node is greater than an upper-layer node, swap their positions and recalculate the values of the non-leaf nodes.
    Swap 11 and 1 first: the 11 bytes are each shortened by one bit and the 1 byte is lengthened by one bit, so the total file length shrinks by 10 bits.
                    root
                     |
             -------33-------
             |              |
          --22--          --11--
          |    |          |    |
        -21-   1          2    9
        |  |
        6  15
    Then swap 15 with 1 and 6 with 2, finally obtaining this tree:
                    root
                     |
             -------33-------
             |              |
          --18--          --15--
          |    |          |    |
         -3-   15         6    9
         |  |
         1  2
    Now every upper-layer node is greater than the nodes of the layers below it, and it seems that no further compression is possible. But if we combine the two smallest nodes of each layer, we often find that there is still room for compression.
    Step 2: combine the two smallest nodes of each layer under a common parent and recalculate the values of the related nodes. In the tree above, the first, second, and fourth layers have only one or two nodes and cannot be recombined, but the third layer has four nodes. We combine the smallest two, 3 and 6, recalculate the related node values, and obtain the tree below.
                    root
                     |
             -------33-------
             |              |
           --9--          --24--
           |   |          |    |
          -3-  6          15   9
          |  |
          1  2
    Then repeat step 1.
    The 9 on the second layer is now smaller than the 15 on the third layer, so they can be swapped: the 9 bytes are each lengthened by one bit and the 15 bytes are each shortened by one bit, so the total file length shrinks by another 6 bits. Then recalculate the values of the related nodes.
                    root
                     |
             -------33-------
             |              |
             15           --18--
                          |    |
                        --9--  9
                        |   |
                       -3-  6
                       |  |
                       1  2
    We now find that all upper-layer nodes are greater than or equal to the lower-layer nodes, and that on each layer the smallest nodes have been combined, so no pair of nodes can produce a parent smaller than the other nodes of its layer. At this point the length of the entire file is 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits. Here we can also see a basic prerequisite for encoding compression: the node values must differ from one another, and the disparity must be large enough that the sum of two nodes is smaller than some other node of the same or a lower layer, for only then does exchanging nodes bring a benefit.
    So in the final analysis, the byte usage frequencies in the original file must differ considerably; otherwise no sum of two node frequencies will be smaller than the frequency of another node of the same or a lower layer, and nothing can be compressed. Conversely, the greater the disparity, the more often the sum of two nodes' frequencies is smaller than same-layer or lower-layer frequencies, and the greater the benefit gained by exchanging nodes. In this example, repeating the two steps above produced the optimal binary tree, but it cannot be guaranteed that in all cases these two steps alone will produce the optimal binary tree; here is another example:

    (diagram: the counterexample coding tree)
    In this tree every upper-layer node is greater than or equal to the nodes of the layers below it, and the two smallest nodes of each layer have already been combined, yet the tree can still be optimized further:

    (diagram: the counterexample tree after rearranging its lowest-layer nodes)
    After swapping nodes on the lowest (fourth) layer, the 8 on the third layer is greater than the 7 on the second layer, so step 1 can be applied again and the tree becomes shorter.
    From this we can conclude that an optimal binary coding tree (one in which no upper-layer node can be exchanged with a lower-layer node to advantage) must satisfy two conditions:
    1. All upper-layer nodes are greater than or equal to all lower-layer nodes.
    2. For any node, let its larger child be m and its smaller child be n; then every node on any given layer of m's subtree must be greater than or equal to every node on the corresponding layer of n's subtree. When these two conditions are met, no lower-layer nodes can be combined into something larger to exchange with an upper-layer node, and no upper-layer nodes can be recombined into something smaller to exchange with a lower-layer node. The two examples above are relatively simple; in a real file a byte has 256 possible values, so the binary tree has as many as 256 leaf nodes and countless possible shapes, and it clearly cannot be built by trial and error. There is a very elegant algorithm that builds the optimal binary tree quickly; it was proposed by D. Huffman (David Huffman). Let us first introduce the steps of the Huffman algorithm and then prove that the tree obtained through these simple steps is indeed an optimal binary tree.
    The steps of the Huffman algorithm are: · Find the two smallest nodes in the node sequence and build a parent node for them; the parent's value is the sum of the two nodes.
    · Then remove these two nodes from the node sequence and add their parent node to the sequence. Repeat the above two steps until only one node remains in the sequence; at that point the optimal binary tree has been built, and its root is the remaining node. Let us rebuild the example above with this algorithm.
    The initial node sequence is:
    a (6)   b (15)   c (2)   d (9)   e (1)
    Combine the two smallest nodes, c (2) and e (1), under a new parent:

        a (6)   b (15)   d (9)   -3-
                                 |  |
                                 c  e

    Continuing in the same way, the tree finally obtained looks like this:

                    root
                     |
             -------33-------
             |              |
             15           --18--
                          |    |
                          9  --9--
                             |   |
                             6  -3-
                                |  |
                                1  2
    The code length assigned to each character is the same as with the method described earlier, so the total file length is also the same: 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits. Now look at how the node sequence changes at each step of building the Huffman tree (a code sketch follows the sequence):
    6 15 2 9 1
    6 15 9 3
    15 9 9
    15 18
    33
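    Here is a minimal Python sketch of the Huffman construction just described: repeatedly take the two smallest nodes, give them a parent whose value is their sum, and put the parent back until only the root remains. It is an illustration of the algorithm, not ZIP's actual code; it returns code lengths rather than the codes themselves, and it reproduces the 63-bit total from the walkthrough above.

        import heapq
        from itertools import count

        def huffman_code_lengths(freqs):
            """Return {symbol: code length} for an optimal prefix code."""
            tick = count()   # tie-breaker so heapq never has to compare the dicts
            heap = [(f, next(tick), {s: 0}) for s, f in freqs.items()]
            heapq.heapify(heap)
            while len(heap) > 1:
                f1, _, d1 = heapq.heappop(heap)          # the two smallest nodes...
                f2, _, d2 = heapq.heappop(heap)
                merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
                heapq.heappush(heap, (f1 + f2, next(tick), merged))   # ...get a parent
            return heap[0][2]

        freqs = {"a": 6, "b": 15, "c": 2, "d": 9, "e": 1}
        lengths = huffman_code_lengths(freqs)
        total = sum(freqs[s] * lengths[s] for s in freqs)
        print(sorted(lengths.items()), total)
        # [('a', 3), ('b', 1), ('c', 4), ('d', 2), ('e', 4)] 63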
    Now let us prove, by working backward over the construction of the Huffman tree, that for any node sequence the tree built by the Huffman algorithm is always an optimal binary tree:
    When the node sequence in this process contains only two nodes (for example the 15 and 18 in the case above), the tree must be optimal: one code is 0 and the other is 1, and it cannot be optimized further.
    Then step backward: at each step remove one node from the node sequence and put back the two nodes it was merged from. Throughout this step-by-step process the tree remains an optimal binary tree, because:
    1. By the way the Huffman tree is built, the two newly restored nodes are the smallest in the current node sequence, so the parent of any other two nodes is greater than (or equal to) their parent. Since the earlier steps produced an optimal binary tree, the parents of all other node pairs must lie on the same layer as their parent or above it, so these two nodes must lie on the lowest layer of the current binary tree.
    2. These two new nodes are the smallest, so they cannot be exchanged with any upper-layer node. This satisfies the first condition of the optimal binary tree given earlier.
    3. As long as the previous step yielded an optimal binary tree, then because these two new nodes are the smallest, even if there are other nodes on the same layer, they cannot be recombined with those nodes to produce a larger node to exchange with an upper-layer node: their parent is smaller than the parent of any other pair and smaller than all other upper-layer nodes. So as long as the previous step satisfied the second condition of the optimal binary tree, this step still satisfies it. Stepping backward in this way, each step removing one node from the sequence and restoring two, the Huffman tree remains an optimal binary tree throughout the process.
    Appendix: "The Art of Computer Programming" contains a completely different proof for the Huffman tree: 1. The number of internal nodes (non-leaf nodes) of a binary coding tree equals the number of external nodes (leaf nodes) minus 1.
    2. The weighted path length of the external nodes of a binary coding tree (each value multiplied by its path length) equals the sum of the values of all internal nodes. (Both statements can be proven by induction on the number of nodes and are left as an exercise.) In the example above, the internal nodes of the final tree are 3, 9, 18, and 33, and 3 + 9 + 18 + 33 = 63, exactly the total length computed earlier.
    3. The proof for the Huffman tree construction works from the bottom up: when there is only one internal node, the tree must be optimal.
    4. Step forward and add the two smallest external nodes. Combining them produces a new internal node, and if the original set of internal nodes had the minimum possible total, the set is still minimal after this new internal node is added. (Because the two smallest nodes are combined and placed on the lowest layer, compared with combining either of them with any node on the same or an upper layer, this at least does not increase the weighted path length.)
    5. As the number of internal nodes grows one by one, the set of internal nodes keeps its total value minimal; by property 2, the weighted path length, that is, the total encoded length, is therefore minimal as well.
  3. The implementation part
    If no compression program had ever existed in the world, then after studying the preceding compression principles we would be confident that we could write a program capable of compressing data in most formats and content. But when we start writing it, we find many problems that have to be solved one by one. The following describes these problems and analyzes in detail how the ZIP algorithm solves them. Many of them have universal significance, for example finding matches and sorting arrays; these are inexhaustible topics, so let us dig into them and think them through.
    As we said earlier, for phrase repetition we use two numbers, the distance from the current position and the length of the repetition, to represent the repeated phrase and thereby achieve compression. A single byte can represent values 0-255, but the position and length of a repetition may well exceed 255. In fact, once the number of binary digits is fixed, the range of values it can represent is fixed: an n-bit number can be at most 2^n - 1. If the numbers of digits are made too large, then for the many short matches they may not only fail to compress but actually enlarge the final result. There are two different algorithms, embodying two different design ideas, for dealing with this. One is the LZ77 algorithm, which takes a very natural approach: limit the size of these two numbers to obtain a compromise compression effect. For example, use 15 bits for the distance and 8 bits for the length; then the maximum distance is 32 KB - 1 and the maximum length is 255, and the two numbers together occupy 23 bits, one bit less than three bytes, which meets the requirement for compression. Let us picture what happens while the LZ77 algorithm is compressing; an interesting model emerges:
    farthest match position ->                     current processing position ->
    |---------------------------------------------+---------------------------->  direction of compression
                   compressed part                 |        uncompressed part

    The region from the farthest match position up to the current processing position is the dictionary area that can be used to find matches. As compression proceeds, this dictionary area keeps sliding along from the head of the file being compressed toward its end, until it reaches the end of the file and the phrase compression is finished.
    Decompression is also very simple:
              +-------- match position --------+  match length
    ----------+--------------------------------+---------------------->  direction of decompression
                   decompressed part                not yet decompressed part
    Keep reading match-position and match-length values from the compressed file, locate the matching content in the already-decompressed part, and copy it to the tail of the output file; the single or double bytes that could not be matched during compression and were stored directly are simply copied to the tail of the output in order, until the whole compressed file has been processed.
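    Here is a minimal, deliberately naive Python sketch of the sliding-window idea described above. It is not ZIP's real Deflate code: the token format, the brute-force search, and the WINDOW / MAX_LEN constants (standing in for the 15-bit distance and 8-bit length mentioned earlier) are simplifying assumptions made for the example.

        WINDOW = 32 * 1024 - 1   # farthest allowed match distance
        MAX_LEN = 255            # longest allowed match
        MIN_LEN = 3              # shorter repeats are not worth two numbers

        def lz77_compress(data):
            i, tokens = 0, []
            while i < len(data):
                best_len, best_dist = 0, 0
                for j in range(max(0, i - WINDOW), i):   # brute-force window search
                    length = 0
                    while (length < MAX_LEN and i + length < len(data)
                           and data[j + length] == data[i + length]):
                        length += 1
                    if length > best_len:
                        best_len, best_dist = length, i - j
                if best_len >= MIN_LEN:
                    tokens.append(("match", best_dist, best_len))
                    i += best_len
                else:
                    tokens.append(("literal", data[i]))  # unmatched byte, stored as is
                    i += 1
            return tokens

        def lz77_decompress(tokens):
            out = bytearray()
            for t in tokens:
                if t[0] == "literal":
                    out.append(t[1])
                else:
                    _, dist, length = t
                    for _ in range(length):      # byte by byte, so overlapping
                        out.append(out[-dist])   # matches copy correctly
            return bytes(out)

        sample = b"abcabcabcabcxyz"
        assert lz77_decompress(lz77_compress(sample)) == sample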
    The LZ77 model is also called the "sliding dictionary" or "sliding window" model. Because it limits the maximum match length, this compromise shows its weakness on some very long matches. The LZW algorithm takes a completely different approach to the design of the compressed data: it uses only a single number to represent a phrase. Let us describe LZW's compression and decompression processes, and then compare the two to see where each is applicable.
    LZW's compression process:
    1) Initialize a dictionary of a specified size and add the 256 single-byte values to it.
    2) At the current processing position of the file being compressed, find the longest match present in the dictionary and output that match's dictionary index.
    3) If the dictionary has not reached its maximum capacity, add the match followed by the next byte of the file to the dictionary.
    4) Move the current processing position past the match.
    5) Repeat steps 2, 3, and 4 until the whole file has been processed. (A sketch of both the compression and decompression processes follows the decompression steps below.)
    LZW's decompression process:
    1) Initialize a dictionary of a specified size and add the 256 single-byte values to it.
    2) Read a dictionary index from the compressed file and copy the corresponding dictionary entry to the tail of the decompressed output.
    3) If the dictionary has not reached its maximum capacity, add the previous match followed by the first byte of the current match to the dictionary.
    4) Repeat steps 2 and 3 until the whole compressed file has been processed.
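    Here is a minimal Python sketch of the LZW steps just listed. It is illustrative only: real implementations add details such as variable-width code output, and MAX_ENTRIES merely stands in for the "maximum dictionary capacity" mentioned above.

        MAX_ENTRIES = 4096   # assumed dictionary capacity (a 12-bit index)

        def lzw_compress(data):
            dictionary = {bytes([i]): i for i in range(256)}
            out, current = [], b""
            for byte in data:
                candidate = current + bytes([byte])
                if candidate in dictionary:
                    current = candidate                  # keep extending the match
                else:
                    out.append(dictionary[current])      # emit the longest match found
                    if len(dictionary) < MAX_ENTRIES:
                        dictionary[candidate] = len(dictionary)   # match + next byte
                    current = bytes([byte])
            if current:
                out.append(dictionary[current])
            return out

        def lzw_decompress(codes):
            dictionary = {i: bytes([i]) for i in range(256)}
            out, previous = bytearray(), b""
            for code in codes:
                if code in dictionary:
                    entry = dictionary[code]
                else:                                # the code being defined right now:
                    entry = previous + previous[:1]  # previous match + its first byte
                out += entry
                if previous and len(dictionary) < MAX_ENTRIES:
                    dictionary[len(dictionary)] = previous + entry[:1]
                previous = entry
            return bytes(out)

        sample = b"abababababab"
        assert lzw_decompress(lzw_compress(sample)) == sample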
    From LZW's compression process we can summarize its main differences from the LZ77 algorithm:
    1) For a phrase it outputs only one number: its index in the dictionary. (The number of bits used for this index determines the maximum capacity of the dictionary. If it has too many bits, say more than 24, the compression ratio may be very poor because most matches are short; if it has too few, the dictionary capacity is limited. So a trade-off likewise has to be chosen.)
    2) For a phrase such as abcd: the first time it appears in the file being compressed, ab is added to the dictionary; the second time, abc is added; only the third time is abcd itself added. For a long match to end up in the dictionary it must appear with high frequency, and the dictionary must have a large capacity. By contrast, LZ77 can use any long match directly, as long as it lies within the dictionary (window) area.
    3) A long match thus enters the dictionary starting from two bytes and growing one byte at a time. The maximum capacity of the dictionary is fixed when it is set up, which indirectly determines the possible maximum match length. LZ77 uses two numbers to represent a phrase while LZW uses only one, so with the same number of bits the dictionary index can cover more entries; in other words, the longest representable match can be longer than with LZ77. When some very long matches appear with high frequency, once they have been fully added to the dictionary LZW begins to make up for its initial inefficiency and gradually shows its advantage.
    It can be seen that in most cases LZ77 achieves a higher compression ratio; only when the file to be compressed consists mostly of very long matches that repeat at high frequency does LZW have the advantage. GIF uses the LZW algorithm because it compresses simple pictures with monotonous colors; ZIP is meant to compress general-purpose files, which is why it uses LZ77, the algorithm that achieves a higher compression ratio for most files.
