Information theory

Information theory is a framework for understanding the transmission of data and the effects of complexity and interference with these transmissions. The theory is often applied to genetics to show how information held within a genome can actually increase, despite the apparent randomness of mutations.

Shannon information
Claude Shannon developed a model of information transmission in terms of information entropy. It was developed to describe the transfer of information through a noisy channel.

Digitized information consists of bits with quantized amounts. Computers typically use a binary system, with 0 or 1 as allowed values. Genetic information can be thought of as digitized, with A, C, G, and T as allowed values. If each position has one specific possible value, it can be said to have low information content, or in more colloquial terms, "no news." As more values are possible at each point in the signal, it becomes less predictable, and hence the information content of any particular "message," or instance of the signal, increases.
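The link between the number of allowed values and information content can be illustrated with a minimal sketch. It assumes that each symbol's probability is estimated from its frequency in the message; the function name `shannon_entropy` and the three sample strings are illustrative choices, not part of any standard library.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Empirical Shannon entropy of a message, in bits per symbol."""
    counts = Counter(message)
    total = len(message)
    # Sum p * log2(1/p) over the symbols that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy("AAAAAAAA"))  # one possible value per position ("no news")
print(shannon_entropy("01010011"))  # two equally frequent values
print(shannon_entropy("ACGTACGT"))  # four equally frequent values
```

With equally frequent symbols, the result is the base-2 logarithm of the number of allowed values: zero bits for a single value, one bit for binary, and two bits for the four DNA bases.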

Shannon developed his theory to provide a rigorous model of the transmission of information. Importantly, information entropy provides an operational and mathematical way to describe the amount of information that is transmitted, as well as the amount of redundancy required to get a certain amount of information reliably through a band-limited noisy channel. It also provides the mathematical foundation of data compression. For example, it explains why data that has already been compressed cannot usefully be compressed again: data compression is, by definition, an encoding scheme whose goal is to maximize the information entropy of a message by encoding it with as few bits as possible.
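The point about recompression can be checked with a rough sketch. It uses Python's standard `zlib` module as a stand-in for "a compressor", and the test message (20,000 random symbols from a four-letter alphabet, stored one byte each) is a hypothetical input chosen for illustration.

```python
import random
import zlib

random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical message: four possible symbols, but eight bits of storage each.
raw = "".join(random.choice("ACGT") for _ in range(20_000)).encode("ascii")

once = zlib.compress(raw, 9)    # first pass removes most of the redundancy
twice = zlib.compress(once, 9)  # second pass has almost nothing left to remove

print(len(raw), len(once), len(twice))
```

The first pass shrinks the message substantially because each stored byte carries only about two bits of entropy. The second pass should leave the size essentially unchanged, since the output of the first pass is already close to maximum entropy per bit.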

In genetics, a point mutation increases the information entropy of a DNA base pair. However, natural selection counteracts this increase through eliminating organisms with harmful mutations and consequent higher information entropy (or colloquially, lower information content). While information theory does not describe how a sequence of DNA bases is expressed into features for development, it clearly indicates that genetic information is transmitted from one generation to another mathematically. Any feature of a string that preserves fitness will have a lower information entropy or higher information content than a random string. Richard Dawkins's weasel program that investigates cumulative selection shows a lowering of information entropy.
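For readers unfamiliar with the weasel program, the following is a minimal sketch of the idea rather than a reproduction of Dawkins's original code. The target phrase is the one Dawkins used; the brood size, mutation rate, random seed, generation cap, and the choice to retain the parent when no offspring is better are all assumptions made for this illustration.

```python
import random
import string

TARGET = "METHINKS IT IS LIKE A WEASEL"
ALPHABET = string.ascii_uppercase + " "


def fitness(candidate: str) -> int:
    """Number of positions that already match the target."""
    return sum(a == b for a, b in zip(candidate, TARGET))


def mutate(parent: str, rate: float, rng: random.Random) -> str:
    """Copy the parent, replacing each character with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else ch for ch in parent
    )


def weasel(offspring: int = 100, rate: float = 0.05, seed: int = 0,
           max_generations: int = 10_000) -> tuple[str, int]:
    """Run cumulative selection; return the final string and generation count."""
    rng = random.Random(seed)
    current = "".join(rng.choice(ALPHABET) for _ in TARGET)
    generation = 0
    while current != TARGET and generation < max_generations:
        brood = [mutate(current, rate, rng) for _ in range(offspring)]
        # Selection step: keep the fittest string, never worse than the parent.
        current = max(brood + [current], key=fitness)
        generation += 1
    return current, generation


result, generations = weasel()
print(result, generations)
```

Mutation alone would scramble the string; it is the selection step that steadily narrows the set of surviving strings, which is the lowering of information entropy referred to above.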

While there are similarities between the mathematical form used to describe thermodynamic entropy and information entropy (the equations relating to each vary only because one has a minus sign where the other does not), the former refers exclusively to the distribution of energy. The Second Law describes the observation that entropy increases in an energetic system; there is no corresponding universal observation in information theory, and hence it is not a scientific law. At the least, natural selection influences the propagation of genetic coding.
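For reference, the standard textbook forms of the quantities being compared are given here. The symbols are notation chosen for this summary rather than taken from the text above: $$p_i$$ is the probability of symbol or microstate $$i$$, $$k_B$$ is Boltzmann's constant, and $$W$$ is the number of equally likely microstates.

 * $$H = -\sum_i p_i \log_2 p_i$$ (Shannon information entropy, in bits)
 * $$S = -k_B \sum_i p_i \ln p_i$$ (Gibbs form of thermodynamic entropy)
 * $$S = k_B \ln W$$ (Boltzmann form, obtained from the Gibbs form by setting $$p_i = 1/W$$)

The Boltzmann form is the one written without a minus sign. The thermodynamic expressions also carry the physical constant $$k_B$$, which ties them to energy and temperature, while the Shannon expression is a pure number of bits.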

Creationists really don't like this stuff, but won't say why. It's likely because they don't want an actual definition of information that can be argued against. It is also generally clear that, despite appeals to information theory, creationists understand neither the notions of information entropy nor information content.

To illustrate the concepts, consider the following:

A string of ten fair coin flips encodes ten bits of entropy (one bit being defined to be equivalent to the 50/50 uncertainty of a single fair coin flip). Knowing the outcomes of five of the coin flips will reduce the information entropy of the string significantly. Knowing both the position and outcome of those five coin flips eliminates their contribution to entropy entirely, giving you a string of ten coin flips which encompasses only five bits of entropy (the five whose positions and outcomes are unknown).


 * Notation:
 * $$N$$ represents a flip of unknown outcome.
 * $$H$$ represents a flip known to result in heads.
 * $$T$$ represents a flip known to result in tails.
 * (Note that N does not represent part of the output space, which is only H and T; it simply represents a yet-unknown choice of one of the two values from the output space)

Start with the 10 fair coin flips:


 * $$NNNNNNNNNN$$

The random variable representing this string of flips has $$2^{10}$$ (1,024) possible values (10 bits of entropy).

If we have five flips that are known to result in heads, but no knowledge of their position and no knowledge about the other five flips, the space of possible values shrinks considerably:


 * Some values that remain in the space:
 * $$HHHHHNNNNN$$
 * $$HHHHNHNNNN$$
 * $$HHNNNNNHHH$$
 * As for a value (stipulating some of the N's) that is now excluded:
 * $$TTTTTTHHHH$$ (more tails than are possible with the knowledge we have)

The ordering of the flips, both known and unknown, means that even the knowledge we have doesn't completely eliminate the five heads' contribution to entropy. But if we further stipulate that five known heads are also the first five flips in the string, the entire space of remaining possible strings is of the form $$HHHHHNNNNN$$. This has 32 ($$2^5$$) possibilities, meaning that these stipulations have reduced the entropy (and information content) of the string by half, and the known flips no longer carry information; the same information conveyed by our stipulated string $$HHHHHNNNNN$$ could just as easily, and more efficiently, be encoded by the five-flip string $$NNNNN$$.
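The three stages of this example can be counted directly by brute force. The sketch below makes two assumptions: "five flips known to result in heads, positions unknown" is read as "the string contains at least five heads", and the entropy of each stage is taken as the base-2 logarithm of the number of remaining, equally likely strings. The variable names are illustrative.

```python
import math
from itertools import product

# Every possible outcome of ten flips, written as strings of H and T.
all_strings = ["".join(flips) for flips in product("HT", repeat=10)]

# At least five heads somewhere in the string, positions unknown.
five_heads_somewhere = [s for s in all_strings if s.count("H") >= 5]

# The first five flips are known to be heads.
five_heads_first = [s for s in all_strings if s.startswith("HHHHH")]

for label, space in [
    ("no knowledge", all_strings),
    ("five heads, positions unknown", five_heads_somewhere),
    ("first five flips are heads", five_heads_first),
]:
    print(f"{label}: {len(space)} strings, {math.log2(len(space)):.2f} bits")
```

Under these assumptions the counts are 1,024, 638, and 32 strings. Knowing only that five heads exist removes well under one bit (about 9.32 bits remain), while also knowing their positions removes a full five bits.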

Kolmogorov complexity
Kolmogorov complexity (also known as Chaitin information, or algorithmic information) deals with the use of algorithms to compress or decompress information. Computer scientists developed it to discuss how to compress data in the most efficient way possible to take up less disk space.

The Kolmogorov complexity of a piece of information is the length of the shortest algorithmic description that reproduces it. Thus "A20" can be thought of as a compression of "AAAAAAAAAAAAAAAAAAAA," and "(AB)9" can be thought of as a compression of "ABABABABABABABABAB." Any instruction including insertion, repeating, deletion, etc. can change the Kolmogorov complexity. Thus, the Kolmogorov complexity can be thought of as the maximum amount of information "in the string" or "in the sequence."
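The two compressed forms above can be recovered with a short sketch. It only searches for one kind of description, "repeat this unit this many times", so at best it gives an upper bound on the Kolmogorov complexity; the true value is defined over all possible programs and is not computable in general. The function name `repeat_description` is a hypothetical helper for this illustration.

```python
def repeat_description(text: str) -> tuple[str, int]:
    """Return (unit, count) for the shortest unit whose repetition rebuilds `text`."""
    n = len(text)
    for size in range(1, n + 1):
        # A unit qualifies if it divides the length evenly and tiles the text.
        if n % size == 0 and text[:size] * (n // size) == text:
            return text[:size], n // size
    return text, 1  # only reached for the empty string


print(repeat_description("A" * 20))   # the "A20" case
print(repeat_description("AB" * 9))   # the "(AB)9" case
print(repeat_description("GATTACA"))  # no shorter repeating unit exists
```

A string with no repeating unit comes back unchanged with a count of one, meaning this particular scheme offers no shorter description than the string itself.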

The Kolmogorov complexity depends entirely on the algorithm used. Hence, while there are uses in genetics, determining the change in Kolmogorov complexity would require a description of all the processes used to reproduce the developmental information from the DNA sequence; one cannot tell the amount of information (or Kolmogorov complexity) by just looking at a string of letters, symbols or DNA. (This is part of the reason why the amount of information in the words "car" and "vehicle" cannot be compared, as it is dependent on the algorithms of linguistic interpretation, and why the number of letters is insignificant.) Notably, because the processes "change" (or "mutate") and "delete" can each be thought of as an additional algorithmic step, they can increase the Kolmogorov complexity (or information content). More significant is that they can potentially change the content.

Creationists (including IDers) make assertions about Kolmogorov complexity (or something like it) and get it wrong. For example, more than once a creationist has said that an extra copy of a string of information does not add "any new information" when in fact, it certainly does: the instruction "repeat entire string."
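The duplication point can be illustrated with a rough sketch that uses compressed size as a crude, computable stand-in for Kolmogorov complexity. This is an assumption made for the demonstration: `zlib` output length is only an upper bound on description length, and the 2,000-base random "gene" is a hypothetical input.

```python
import random
import zlib

random.seed(1)  # fixed seed so the sketch is repeatable
# Hypothetical "gene": 2,000 random bases, then the same sequence duplicated.
gene = "".join(random.choice("ACGT") for _ in range(2_000)).encode("ascii")
duplicated = gene + gene

single_size = len(zlib.compress(gene, 9))
double_size = len(zlib.compress(duplicated, 9))

print(single_size, double_size)
```

The duplicated sequence should compress to slightly more than the single copy, but to far less than twice as much. The small difference corresponds to the extra "repeat entire string" instruction: not much new information, but more than none.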

Note that any comparison of information "in the string" used by creationists (in the guise of meaning) is Kolmogorov complexity, while the "increase of noise" or "information loss by loss of DNA sequence fidelity" by mutations usually refers to Shannon's information entropy. The two cannot be used interchangeably.

Word analogies
Word analogies are tricky when using concepts of information theory.

Any change of a string of text by nature of being a change is an increase of information entropy (or a loss of information content). This will be true if the string is a word ("rational" changed to "rasional") or is nonsense ("alkfd" to "alkfg"). (However, a proofreader, acting as an agent of natural selection, could reject erroneous copies of a text to retain the information entropy.)

In terms of Kolmogorov complexity, changes in letters can supply more or less information, but this depends on the linguistic structure. The number of processes required to interpret the word through an algorithm may or may not depend on the number of letters and the identity of the letters, and hence "more" or "less" has little meaning. In the same way, mutations in genetics can potentially change how an organism develops, but without a complete understanding of the processes of development, a mutation cannot be said to be "more" or "less" information.

An attempt to interpret a word analogy by both concepts at the same time fails because the two concepts are not independent but also not the same. It can be true that a change of a letter ("lost" to "post") is less copying fidelity (increased information entropy) and yet changes some linguistic meaning (different Kolmogorov complexity).

Information theory and genetics, evolution, and development
The relationship between biology and information theory given above and other approaches in the literature suggest that the words "biological information", "developmental information" or "genetic information" are ambiguous without clarification. Even then, there will be ambiguity: In biology the term information is used with two very different meanings. The first is in reference to the fact that the sequence of bases in DNA codes for the sequence of amino acids in proteins. In this restricted sense, DNA contains information, namely about the primary structure of proteins. The second use of the term information is an extrapolation: it signifies the belief or expectation that the genome somehow also codes for the higher or more complex properties of living things. It is clear that the second type of information, if it exists, must be very different from the simple one-to-one cryptography of the genetic code. This extrapolation is based, loosely, on information theory. But to apply information theory in a proper and useful way it is necessary to identify the manner in which information is to be measured (the units in which it is to be expressed in both sender and receiver, and the total amount of information in the system and in a message), and it is necessary to identify the sender, the receiver and the information channel (or means by which information is transmitted). As it is, there exists no generally accepted method for measuring the amount of information in a biological system, nor even agreement of what the units of information are (atoms, molecules, cells?) and how to encode information about their number, their diversity, and their arrangement in space and time.

Creationist information theory
Creationists, in an attempt to coat their myths with a veneer of science, have co-opted the idea of information theory to use as a plausible-sounding attack on evolution. Essentially, the claim is that the genetic code is like a language and thus transmits information, and in part due to the usual willful misunderstandings of the second law of thermodynamics (which is about energy, not information), they maintain that information can never be increased. Therefore, the changes they cannot outright deny are defined as "losing information", while changes they disagree with are defined as "gaining information", which by their definition is impossible. Note that at no point do creationists specify what information actually is, and often (even in the allegedly scientific case of complex specified information) they purposefully avoid defining the concept in any useful way. Creationists tend to change their meaning on an ad hoc basis depending on the argument, relying on colloquial, imprecise definitions of information rather than quantifiable ones - or worse, switching between different definitions depending on the context of the discussion or argument.

The deliberate conflation of the totally unrelated concepts of thermodynamic and informational entropy is, while an obvious flaw in the argument, a flaw that the creationists' intended audience is less likely to pick up on, so it remains a popular argument, as seen in Ken Ham's... debate with Bill Nye at the Creation Museum.

Dr. Werner Gitt and In the Beginning was Information
Understanding that information theory has a relationship to genetics and evolution, creationists have used the language of information theory in an attempt to discredit evolution. Dr. Werner Gitt published a monograph In the Beginning was Information that creationists invariably refer to when arguing about information theory and evolution. Gitt's book is problematic in its structure and in its assertions about information theory.

Gitt separates the scientific version of information from other types. He singles out Shannon information as "statistical" and then partitions information into syntax, semantic (or "meaningful") information, pragmatic information, and apobetics. In doing so, he makes a number of claims about how genetics works. The text develops a number of statements which Gitt numbers as "theorems", as if the text were a mathematics textbook, and claims "[this] series of theorems which should also be regarded as laws of nature, although they are not of a physical or a chemical nature".

This form of argument is problematic on multiple counts. First, theorems are usually mathematical statements based on postulates and definitions, and they are proved using formal logic. Gitt does not state his assumptions and leaves many terms undefined. More problematically, the theorems themselves are not mathematical statements; his theorems are actually assertions. (His binning of Shannon information as statistical and the "lowest level" of information indicates Gitt's disdain for mathematics.) Second, theorems are the result of deductive logic, while scientific laws result from inductive logic based on observation. The two cannot be equated. Gitt does not refer to any observation in the development of his theorems, and hence, by definition they are not laws. It is unclear how one can make statements about the natural world without any observation to support them. Third, as will be described below, it is an untestable model and hence cannot be deemed valid or invalid.

In essence, Gitt uses the language of mathematics and science, but does not perform a mathematical proof or employ the scientific method. Instead, he makes a number of assertions that cannot be validated, and Gitt's text is a poorly constructed rhetorical argument, not a scientific one.

Semantic or meaningful information
At the heart of Gitt's text is the concept of meaningful information. Gitt does not define semantic information, but instead he relies on references to hieroglyphics, language, and computer programs. Hence, he generalizes in his theorems concepts of linguistics into genetics that are unjustified. Essentially, Gitt conflates concepts of the informal definition of information (such as knowledge in a book) with that of information theory to provide statements/assertions meaningless to genetics. His statements provide examples:


 * "There can be no information without a sender." It is certainly true in the case of books and writing that a human entity must have written or typed the original source. A reasonably educated person has observed other people writing, and has written himself or herself. However, applying that generalization to genetics is problematic. An intelligent source has never been observed to create a genetic code naturally, nor is there any inferential evidence that this occurs. (The only exception is, of course, scientists in the laboratory who have only recently done so.) The assumption that there must be a sender or an intelligent source of information cannot be validated.
 * "It is impossible for information to exist without having been established voluntarily by a free will." Again, this makes sense in the case of writing books and computer programs because we observe others generating this type of information (or have ourselves).  There is no evidence that during procreation a supernatural being is deciding which genes to pass on, or was the original source of a genetic code. Further, it abrogates a basic tenet of information theory (as established by Shannon) that any physical system or model of a physical system which produces a discrete sequence of symbols from a finite, or at least countable, output space (more concisely, any stochastic process) constitutes a discrete source of information. The random vibration of atoms in any piece of matter being modeled is an example of such a process, as is the random decay of unstable nuclei in a given sample. This property was widely exploited (and still is, in some cases) for the generation of truly random numbers before it became practical to generate useful pseudorandom numbers using computers.

Books, language and computer programs do at times provide useful analogies to genetic 'information,' but they are not relevantly similar when comparing their origin, and the claims about information in books or computer programs that creationists tend to use as analogies cannot be accurately generalized to DNA.

Statements on evolution
Gitt concludes the following about evolution: "We find proposals for the way the genetic code could have originated, in very many publications [e. g. O2, E2, K1]. But up to the present time nobody has been able to propose anything better than purely imaginary models. It has not yet been shown empirically how information can arise in matter, and, according to Theorem 11, this will never happen." "Theorem" 11 (deduced without postulates or definitions) states that "A code system is always the result of a mental process (see footnote 14) (it requires an intelligent origin or inventor)." Gitt basically uses an argument from ignorance to attempt to invalidate evolution and then uses Theorem 11, an invalid deductive statement (per the last section) based entirely on his model and not based on evidence, to entirely invalidate evolution. (Ironically, Gitt's model itself is "purely imaginary".)

This statement introduces further problems, in that it builds intelligent authorship into the definition of a "code system". One would now have to prove that an intelligent mind was the origin of DNA before one could call it a "code system". Just assuming it's a code system, and all that that now implies, is not evidence that there is an intelligent mind behind it; that's simply an attempt to sneak in an assumption of intelligent origin, which then makes the argument for the source of the DNA "code system" into an example of the begging the question fallacy. If you redefine a "code system" as needing a mind, then assume DNA is a code system, you've only proved that you assume DNA needed a mind, not that it actually did need one. You'd actually need to prove that it indeed needed a mind to create it before you could show it met your definition of a "code system".

His statement on mutations is similar: "This idea is central in representations of evolution, but mutations can only cause changes in existing information. There can be no increase in information, and in general the results are injurious. New blueprints for new functions or new organs cannot arise; mutations cannot be the source of new (creative) information." Unfortunately, without any measurement to back this up, his assertion about no increase of information cannot be validated. Gitt never defines "meaningful information", nor provides any way to measure it, nor gives any means by which to quantify the presence of more or less "information". Hence his proposition is untestable and unfalsifiable. Gitt has constructed his model such that the status quo is meaningful and anything that manipulates information that is not intelligent (or God) makes information less meaningful. Even more problematically, Gitt isn't even well versed in real information theory: his description of genetic "information" as a "coding system" implies that he considers genes a universal set of instructions for the "Make lifeform X" algorithm - an interpretation of information which is not readily distinguishable from Kolmogorov complexity (described above). Kolmogorov complexity describes basic instructions for a particular process in the form of basic actions, which means that two mutations that have been shown to occur - gene duplication and chromosome duplication - do add information; whatever the string being duplicated, it results in the addition of a new instruction to the algorithm: "Repeat string Y." (This is generally less information than string Y contains, and would be lost if all the copies of string Y were deleted, but it's still more than zero.) Further, altering one copy of a gene and not another also adds information: "Repeat string Y" becomes "Repeat string Y, but then replace base pairs N1 through N2 with string Z".

In summary, Gitt has conflated deductive and inductive logic to generate an invalid model that rests on tenuous assertions and on a false comparison between DNA sequences and humanly produced texts and algorithms. The model is not based on observations of the natural world, despite making extraordinary claims about it. It makes statements about more and less information and yet the information cannot be quantified. The model is untestable and unfalsifiable. Overall, Gitt's model is worthless at describing information in the natural world.

Real information theory can quantify such information, including the sensitivity of information entropy (or information content, as strictly defined) to knowledge about the information in a particular context.