Notes on modeling channel noise in language games

At JCoLE 2022 I saw a talk by Rahma Chaabouni on Anti-efficient encoding in emergent communication. In a nutshell, they have neural agents playing language games. In their simulations, the agents spontaneously invent discrete codes for communicating 1000 concepts, and unexpectedly converge to an “anti-efficient” regime: frequent meanings are encoded using the longest possible messages, in direct opposition to Zipf’s law of abbreviation (ZLA). In their paper, Chaabouni et al. (2019) show this arises from the asymmetry of interests between speaker and listener in a noise-free channel. Shorter codes reappear only after manually penalizing message length.

This made me wonder whether assuming a perfect channel might actually be driving their result. As MacKay (2005) argues, real communication is inherently noisy, and compact, robust encodings are preferred. Introducing random bit-flip noise on arithmetic-coded messages between speaker and listener could potentially produce Zipfian brevity and robustness as natural emergent properties, with no need for explicitly handcrafted constraints.

So the question is: should we model channel noise?

I think that modeling channel noise might imply some desirable loss function terms “automatically” (if I remember the presentation well: impatience and word-length penalties) when optimizing for efficiency under channel noise, and ZLA might appear spontaneously. And it does appear so in the toy model I study below.

Channel noise?

For example, if agents transmitted speech waveforms, you could add channel noise by summing e.g. white noise onto the signal, which would also combat some overfitting.
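As a minimal sketch of that continuous-channel setting (the stand-in tone, sampling rate, and SNR here are illustrative assumptions, not anything from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in "speech waveform": one second of a 440 Hz tone at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# Additive white Gaussian channel noise at a chosen SNR (here 10 dB).
snr_db = 10.0
noise_power = signal.var() / (10 ** (snr_db / 10))
noisy = signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
```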

But I was thinking about something different: in a naturally discrete channel, where you communicate bits, you could just flip a random fraction of bits to model generic errors.

So an agent can encode a message using arithmetic coding, which turns it into an almost perfectly random bitstring if the message is a typical sample from the agent’s message distribution. Conversely, if the message to be sent is the result of some optimization or search process (i.e., inference), the bitstring will be less uniform, as it is associated with a more compact posterior mass.

Either way, generic “fair” noise can be introduced by flipping random bits, because from the point of view of the noise generating process, the encoded bitstring is random. Arithmetic encoding of communication in language games affords generic channel noise because all model details are abstracted away to bitstrings.
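A minimal sketch of such generic bit-flip noise, assuming the encoded message is already available as a 0/1 string (the helper name `flip_bits` and the example message are mine):

```python
import random

def flip_bits(bits: str, eps: float, rng: random.Random) -> str:
    """Flip each bit of a 0/1 string independently with probability eps."""
    return "".join(b if rng.random() >= eps else "10"[int(b)] for b in bits)

rng = random.Random(42)
msg = "1011001110001011"          # e.g. the output of an arithmetic coder
corrupted = flip_bits(msg, eps=0.1, rng=rng)
```

The noise process needs to know nothing about the language model that produced `msg`; it only sees bits.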

These errors when decoding would cascade down in a model-specific way, just like flipping “a bit” in a DNA string can have system-wide (i.e. body-wide) consequences.1

So it seems that this channel noise can also model the listener’s confusion, or the listener’s bias toward messages that are more expected, because the corrupted bitstring (the message) is interpreted within the decoding model. In fact, any mechanism foreseen in the language model can be made susceptible to error: exactly what we’d expect from realistic models of sending and receiving communication.

Channel noise leads to ZLA in a toy model

When you implement this idea with a simple toy monogram model, optimal coding, and flipping a random bit, the ballpark chance of hitting symbol $i$ is proportional to $-p_i \log_2 p_i$, where $p_i$ is the probability assigned to symbol $i$. In other words, the probability of corrupting a symbol with probability $p_i$ is equal to that symbol’s contribution to the total entropy of the symbol’s probability distribution.
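This can be checked with a small Monte Carlo sketch, assuming an arbitrary four-symbol monogram distribution and the ideal code lengths $\ell_i = -\log_2 p_i$: a uniformly random bit in a long encoded stream lands on symbol $i$ with frequency proportional to $p_i \ell_i$, which under ideal lengths is $-p_i \log_2 p_i$.

```python
import math
import random

rng = random.Random(1)

p = [0.5, 0.25, 0.125, 0.125]                 # toy monogram distribution
lengths = [round(-math.log2(q)) for q in p]   # ideal code lengths: 1, 2, 3, 3

# Draw a long stream of symbols; each symbol occupies lengths[s] bits,
# so a uniformly random bit position hits it with weight p_i * l_i.
symbols = rng.choices(range(len(p)), weights=p, k=100_000)
total_bits = sum(lengths[s] for s in symbols)
hits = [0] * len(p)
for s in symbols:
    hits[s] += lengths[s]
empirical = [h / total_bits for h in hits]

expected = [q * l for q, l in zip(p, lengths)]
norm = sum(expected)
expected = [e / norm for e in expected]       # normalized p_i * l_i
```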

So infrequent words are more likely to be hit (they need more bits), but they’re also rare, so they get hit less often overall. That’s an interesting tradeoff in itself, but not really the point here. Rather, the point is what the noise forces you to optimize if you desire efficient, successful communication. If each bit flips with small probability $\varepsilon$, the expected corruption cost for type $i$ scales like $p_i \ell_i$ (frequency $\times$ length). Minimizing $\sum_i p_i \ell_i$ under the Kraft constraint gives $\ell_i \approx -\log_2 p_i$. ZLA is a consequence of navigating the channel noise.

Here’s the short derivation.

Let symbols $i$ occur i.i.d. with probabilities $p_i$, and be encoded by prefix-free bitstrings of length $\ell_i$. Assume each bit flips independently with probability $\varepsilon$. Then the expected corruption cost is:

$$C(\ell) = \sum_i p_i\left[1 - (1 - \varepsilon)^{\ell_i}\right] \;\approx\; \varepsilon \sum_i p_i\,\ell_i \quad (\varepsilon \ll 1).$$
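The small-$\varepsilon$ approximation can be sanity-checked numerically; the toy distribution and $\varepsilon$ below are arbitrary choices:

```python
import math

p = [0.5, 0.25, 0.125, 0.125]
lengths = [-math.log2(q) for q in p]   # ideal (real-valued) code lengths
eps = 1e-3

# Exact expected corruption cost vs. its first-order approximation.
exact = sum(q * (1 - (1 - eps) ** l) for q, l in zip(p, lengths))
approx = eps * sum(q * l for q, l in zip(p, lengths))
rel_err = abs(exact - approx) / exact
```

Note that $1 - (1-\varepsilon)^{\ell} \le \varepsilon \ell$, so the linear term slightly overestimates the exact cost.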

This cost is independent of the language model. Minimize it under the Kraft constraint:

$$\sum_i 2^{-\ell_i} \leq 1.$$

Using a Lagrange multiplier:

$$\mathcal{L} = \sum_i p_i \ell_i + \mu \left( \sum_i 2^{-\ell_i} - 1 \right),$$

the optimum satisfies

$$\frac{\partial \mathcal{L}}{\partial \ell_i} = p_i - (\mu \ln 2)\, 2^{-\ell_i} = 0,$$

which gives $2^{-\ell_i} \propto p_i$, so

$$\ell_i^* = -\log_2 p_i \quad (+\ \text{constant}).$$

That’s the usual entropy-optimal code. But here it emerges naturally by minimizing the expected impact of model-independent channel noise, not by manually adding a length or energy penalty that depends on details of the language model.
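A quick numerical sanity check of this optimum, under an assumed toy distribution: any real-valued lengths meeting the Kraft constraint with equality can be written $\ell_i = -\log_2 q_i$ for some distribution $q$, so the cost $\sum_i p_i \ell_i$ is the cross-entropy $H(p, q)$, which Gibbs’ inequality says is minimized at $q = p$.

```python
import math
import random

rng = random.Random(0)
p = [0.5, 0.25, 0.125, 0.125]

def cost(lengths):
    """Expected corruption cost, up to the factor eps: sum_i p_i * l_i."""
    return sum(q * l for q, l in zip(p, lengths))

# The claimed optimum: l_i = -log2(p_i), with cost H(p) = 1.75 bits.
optimal = cost([-math.log2(q) for q in p])

# Random alternative length assignments, each meeting Kraft with equality.
best_alternative = float("inf")
for _ in range(1000):
    w = [rng.random() + 1e-9 for _ in p]          # random positive weights
    q = [x / sum(w) for x in w]                   # normalize to a distribution
    best_alternative = min(best_alternative, cost([-math.log2(x) for x in q]))
```

No random alternative should beat the analytic optimum.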

Conclusion

In a simple toy model, optimizing for efficient communication leads to ZLA purely through the presence of channel noise; no other mechanisms or loss terms of the kind used by e.g. Chaabouni et al. (2019) or, more recently, Ferrer-i-Cancho, Bentz, and Seguin (2022) are necessary. Instead, the objective function derived here is a consequence of the desideratum of efficient and successful communication under channel noise, and nothing more.

More research into different models and the nature of the proposed corruption cost objective is required; but this preliminary line of thought at least supports the idea that ZLA may be a model-independent strategy for successful communication across noisy channels such as speech and sign language.

Chaabouni, Rahma, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni. 2019. “Anti-Efficient Encoding in Emergent Communication.” Advances in Neural Information Processing Systems 32.
Ferrer-i-Cancho, Ramon, Christian Bentz, and Caio Seguin. 2022. “Optimal Coding and the Origins of Zipfian Laws.” Journal of Quantitative Linguistics 29 (2): 165–94. https://doi.org/10.1080/09296174.2020.1778387.
MacKay, David J C. 2005. Information Theory, Inference, and Learning Algorithms. https://doi.org/10.1198/jasa.2005.s54.

Footnotes

  1. In our line of thinking, a DNA message is an example of a highly non-uniform bitstring message, as the posterior mass associated with it is absolutely minuscule and was found by eons of evolution.
