Text compression using a 4-bit coding scheme.
by J. Pike.
The Computer Journal, Vol 24, No 4, pp. 324-330, 1981.

Summary.
The most frequently used words in natural or printed English are found unexpectedly to contain only an average proportion of the most frequently used letters. This independence of the word and letter frequency distributions is used to minimise the number of bits necessary to code natural English text. It is shown that mean bit rates of less than 4 per character can be achieved for text using the full ASCII set of 96 characters, by combining a variable bit length representation of each character with a character combination dictionary of 100 or more common words. A simple practical scheme is presented which uses 4, 8 or 12 bits to code the characters and dictionary words. Using this scheme with a 205 word dictionary, a mean code rate of 3.87 bits per character is achieved. It is indicated how even this rate might be improved with a larger dictionary or by basing the dictionary on the more common word prefixes.

Comment.
Minimum bit rates are analysed mathematically as a function of letter frequency and dictionary size. A table of the 240 most used words in 100,000 words of text is presented and it is shown how these can be incorporated into the simple coding scheme. Subsequent use of the scheme has shown that an average of under 4 bits per character is achievable in practice, compared with the 8 used for normal ASCII encoding.
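
As an illustration only, the idea of coding characters and dictionary words in 4, 8 or 12 bits can be sketched in Python as below. The nibble layout, the function names and the convention of storing dictionary words with a trailing space are assumptions made for this sketch; they are not taken from the paper, whose actual code tables are not reproduced on this page.

```python
# Hypothetical nibble layout (an assumption, not the paper's actual table):
#   0x0-0xC        -> 13 symbols coded in one nibble    (4 bits)
#   0xD n, 0xE n   -> 32 symbols coded in two nibbles   (8 bits)
#   0xF n n        -> 256 symbols coded in three nibbles (12 bits)
# A "symbol" is either a single character or a dictionary word.


def build_codebook(ranked_symbols):
    """Map symbols, most frequent first, to tuples of nibbles."""
    codes = [(n,) for n in range(13)]
    codes += [(p, n) for p in (13, 14) for n in range(16)]
    codes += [(15, a, b) for a in range(16) for b in range(16)]
    if len(ranked_symbols) > len(codes):
        raise ValueError("too many symbols for this layout")
    return dict(zip(ranked_symbols, codes))


def encode(text, codebook):
    """Greedy longest-match encoding of text into a flat list of nibbles."""
    max_len = max(len(symbol) for symbol in codebook)
    nibbles, i = [], 0
    while i < len(text):
        # Try the longest candidate first so words win over single letters.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in codebook:
                nibbles.extend(codebook[piece])
                i += size
                break
        else:
            raise KeyError(f"no code for {text[i]!r}")
    return nibbles


def decode(nibbles, codebook):
    """Invert encode(); the first nibble of each code gives its length."""
    inverse = {code: symbol for symbol, code in codebook.items()}
    out, i = [], 0
    while i < len(nibbles):
        first = nibbles[i]
        size = 1 if first < 13 else 2 if first < 15 else 3
        out.append(inverse[tuple(nibbles[i:i + size])])
        i += size
    return "".join(out)


def bits_per_character(text, codebook):
    """Mean code rate: 4 bits per nibble divided by the source length."""
    return 4 * len(encode(text, codebook)) / len(text)


# Toy ranking for demonstration; real rankings would come from text statistics.
toy_codebook = build_codebook(["the ", " ", "e", "t", "h", "c", "a"])
toy_rate = bits_per_character("the cat", toy_codebook)
```

With this toy ranking, "the cat" is coded as one nibble for the word "the " plus one nibble for each of the three remaining letters, so the rate falls well below 4 bits per character. A realistic rate depends entirely on how the ranked characters and words are distributed over the 4, 8 and 12 bit codes.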

Availability.
Download from The Computer Journal
or email J A C K @ J A C K P I K E . C O . U K

Last amended: Dec 2013.