Minicolumns as a language

the strange similarities with formal semantics

John Erling Blad

04/10/19 (updated 11/12/20)

There are some striking similarities between reasoning systems: processing in cortical minicolumns, representations in formal languages (both in logic and in linguistics), and the forms used in mathematical logic. In particular the triplets from the semantic web show some striking similarities. There are input vectors representing symbols, functions that relate those input vectors to output vectors, and those output vectors are also symbols. These can be directly compared to the subject–predicate–object expression used in a semantic triple. When trying to implement this in machine learning, and in particular in artificial neural networks, the similarities have some very real consequences.

This article is reworked from the previously published version, with corrections and additions.

These similarities are apparent when neural nets are used in inference engines, and not merely as correlation engines. Many of the current high-profile use cases are for correlation engines, such as image classifiers and shallow query answering engines often known as chat bots. In inference engines a number of simple claims formed by single triplets are strung together into composite statements by composition rules. Such statements can then be used both as questions to the engine (analysis) and answers from the engine (synthesis).

proposition: Language for Simple World
A sufficient language for a Simple World can be described as a set of constants used as input values to a single function that outputs new constants.

A single triplet can be viewed as the language $\mathcal{L}$, with a slight abuse of notation since we are lacking the zero element

\[\begin{equation} \mathcal{L} = \left \{ \underbrace{ x ^{\left ( k \right ) }} _{ \text{subject} }, \underbrace{ f (\cdot) } _{ \text{predicate} }, \underbrace{ y ^{ \left ( k \right ) }} _{ \text{object} } \right \} \end{equation}\]

where the symbols $x$ and $y$ represent input and output, and $f$ is a function transferring symbols, $f(x) \to y$. The sets $x ^{ \left ( k \right ) }$ and $y ^{ \left ( k \right ) }$ would be subsets of a larger common set to be formally correct. Generalizing the symbols into vectors should not be a big leap of faith, and likewise the function into a vector function.
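
A minimal sketch of eq. 1 in Python, assuming a single linear layer with a tanh non-linearity as the predicate; the weight matrix `W`, the vector sizes, and the variable names are hypothetical placeholders, not something given above.

```python
import numpy as np

# A minimal sketch of eq. 1, assuming a single linear layer with a tanh
# non-linearity as the predicate; the weight matrix W and the vector sizes
# are hypothetical placeholders, not something given in the text.
rng = np.random.default_rng(0)
N, M = 8, 8                       # dimensionality of subject and object vectors
W = rng.normal(size=(M, N))       # stand-in for a learned transfer function

def predicate(x: np.ndarray) -> np.ndarray:
    """f(x) -> y: transform a subject symbol into an object symbol."""
    return np.tanh(W @ x)

x_k = rng.normal(size=N)          # subject: one symbol from the set x^(k)
y_k = predicate(x_k)              # object: the produced symbol y^(k)
triple = (x_k, predicate, y_k)    # the subject-predicate-object triplet
```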

Given Searle’s terms (Searle, 1980), the symbols could be interpreted as semantics, while the functions could be interpreted as syntax. The functions (syntax) give the structure whereby the symbols (semantics) are interpreted.

A neural net trained as a typical correlation engine has no natural zero-element. It learns to classify known states. In particular, it does not learn to distinguish what is unknown, but it can guesstimate such states from previously learned states. This creates some pretty weird problems, which often show up in the semantic web as exceptions. In Wikidata they are handled as part of snaks, given as no value and unknown value. The problem is also known from natural language processing, where unknown words are marked as <unk>.

The concept of “weak classifications”, the less probable states, could be interpreted as a zero-element. If all possible classifications go towards zero, then that will approximate a zero-element. Thus there might be a rather exact zero-element for the output $y ^{ \left ( k \right ) }$, but for the input $x ^{ \left ( k \right ) }$ there might be a large subspace that acts as a zero value.
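
One way to read this in code: if every class score stays low, the output is treated as unknown, analogous to <unk>. The per-class sigmoid scores and the 0.5 threshold below are assumptions made purely for illustration, not something prescribed above.

```python
import numpy as np

# A sketch of the "weak classification" reading of a zero-element: if every
# class score stays low, the output is treated as unknown, analogous to <unk>.
# The per-class sigmoid scores and the 0.5 threshold are illustrative assumptions.
def classify(logits: np.ndarray, labels: list[str], threshold: float = 0.5) -> str:
    scores = 1.0 / (1.0 + np.exp(-logits))   # independent per-class scores
    if scores.max() < threshold:             # all classifications go towards zero
        return "<unk>"                       # the approximate zero-element
    return labels[int(scores.argmax())]

print(classify(np.array([-3.0, -2.5, -4.0]), ["cat", "dog", "bird"]))  # <unk>
print(classify(np.array([-3.0, 2.5, -4.0]), ["cat", "dog", "bird"]))   # dog
```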

A rather brilliant description of reasoning is “algebraically manipulating previously acquired knowledge in order to answer a new question”, given by (Bottou, 2011). The previously acquired knowledge has been used to train the neural net, the question is new inputs, the answer is new outputs, and the algebraic manipulation is how the functions are chained together. The algebraic manipulations can be strict logical expressions, but they can be a lot more.
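
The chaining can be sketched as follows: simple predicates are composed so that the object of one triplet becomes the subject of the next. The predicate names and the random weight matrices are hypothetical stand-ins for trained functions.

```python
import numpy as np

# A sketch of "algebraically manipulating previously acquired knowledge":
# simple predicates are chained so the object of one triplet becomes the
# subject of the next. The predicate names and random weights are stand-ins
# for trained functions.
rng = np.random.default_rng(0)
dim = 8

W_capital_of = rng.normal(size=(dim, dim))
W_located_in = rng.normal(size=(dim, dim))

def capital_of(x): return np.tanh(W_capital_of @ x)
def located_in(x): return np.tanh(W_located_in @ x)

def answer(question, x):
    """Chain the predicates of a composite question over the subject x."""
    for predicate in question:
        x = predicate(x)          # the output symbol feeds the next predicate
    return x

norway = rng.normal(size=dim)                 # subject symbol
y = answer([capital_of, located_in], norway)  # composite statement as a chain
```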

To further obfuscate the problem, the inputs live on a manifold in one World, while the outputs live on a manifold in another World. When the input manifold is equal to the output manifold, partial answers given as outputs can be fed back as reformulated questions to the input. It is not a given that this happens, but if the network tries to make a faithful reproduction (in the mean) of the input, then it will.

Some current attempts at building inference engines are tree-graphs for program translation (Chen, Liu, Song, 2018) and for logical entailment (Evans, Saxton, Amos, Kohli, Grefenstette, 2018). These work quite well for unambiguous tree structures. Another alternative is reasoning by composition of attention (Hudson, Manning, 2018).

In an ideal World a single function, that is a single total learned transfer function for a single neural layer, might take the whole input vector and produce the whole output vector in one go. That would be a tempting approach, but note that it would also imply adjusting the whole learned function, which is a representation of a training set of $T$ samples. With $T$ very large, the learning rate must be very small to achieve stable learning. The contribution from each sample would be $1/T$, thus the learning rate has to be less than this number. (This might be viewed as on-line transfer learning, but note that $1/T$ is a gross oversimplification.)

A corollary to this is that a small correction in transfer learning should be taken in smaller steps if the previous training set was large, otherwise learning would step out of the finer manifold defined by the larger training set. The large single function leads to a large training set, which gives small training steps. Together with a large layer this makes the overall processing needs explode. In a fully connected neural layer, learning is an $O(\alpha TM + \beta NM)$ problem, where $N$ is the number of input nodes, $M$ the number of output nodes, and $T$ the number of training samples. In general $\alpha$ and $\beta$ are unknown, so both terms should be kept low; since $M$ appears in both, that in particular means keeping the number of output nodes low.
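
A back-of-envelope sketch of the scaling argument: the constants $\alpha$ and $\beta$ are unknown in general, so $\alpha = \beta = 1$ and the sizes below are assumed purely to get concrete numbers.

```python
# A back-of-envelope sketch of the scaling argument. alpha = beta = 1 and the
# sizes below are assumed purely to get concrete numbers.
N, M, T = 4096, 4096, 1_000_000    # input nodes, output nodes, training samples

cost = 1.0 * T * M + 1.0 * N * M   # O(alpha*T*M + beta*N*M)
max_stable_rate = 1.0 / T          # the per-sample contribution bounds the rate

print(f"cost ~ {cost:.2e}, learning rate below ~ {max_stable_rate:.0e}")
```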

Can the implemented function be kept simple, without sacrificing too much precision, and hopefully also with a limited training set?

The Global World can be partitioned into smaller Local Worlds. By doing so, the size of the function (that is, the number of nodes) is limited, and the necessary training set is also limited. Note that the function is the predicate in the triplet; it is thus a kind of subworld within a larger World, and its scope is somewhat limited.

corollary: Language for Complex World
A sufficient language for a Complex World can be described as a set of vector constants used as input values to multiple functions that output new vector constants.

A single triplet can be viewed as the language $\mathcal{L}$, organized as a 2-dimensional map, indexed by $i$ and $j$, such that

\[\begin{equation} \mathcal{L} = \left \{ \underbrace{ \mathbf{x} ^{ij \left ( k \right )}} _{ \text{subject} }, \underbrace{ f ^{ij}(\cdot) } _{ \text{predicate} }, \underbrace{ \mathbf{y} ^{ij \left ( k \right )}} _{ \text{object} } \right \} \end{equation}\]

where $f^{ij}(\cdot)$ is a function that takes an input $\mathbf{x} ^{ij}$ representing a symbol and transforms it into an output $\mathbf{y} ^{ij}$ that represents a new symbol, like $f(x) \to y$. Except for the superscripts there isn’t anything new compared to eq. 1 above.

The two symbols $\mathbf{x} ^{ij \left ( k \right )}$ and $\mathbf{y} ^{ij \left ( k \right )}$ are high-dimensional vectors, organized as a set indexed by $k$. The indexes $i$ and $j$ form a 2-dimensional index over a map of subworlds. The map does not have to be 2-dimensional; it is just convenient, and it fits well with the minicolumn analogy.
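
A sketch of eq. 2: a 2-dimensional map of small local predicates $f^{ij}$, each transforming its own local subject vector into a local object vector. The grid size and vector dimensionality below are arbitrary illustration values.

```python
import numpy as np

# A sketch of eq. 2: a 2-dimensional map of small local predicates f^{ij},
# each transforming its own local subject vector into a local object vector.
# The grid size and vector dimensionality are arbitrary illustration values.
rng = np.random.default_rng(0)
rows, cols, dim = 4, 4, 16

# one small weight matrix per (i, j) cell of the map
W = {(i, j): rng.normal(size=(dim, dim)) for i in range(rows) for j in range(cols)}

def f(i: int, j: int, x: np.ndarray) -> np.ndarray:
    """The local predicate f^{ij}: local subject vector in, local object out."""
    return np.tanh(W[i, j] @ x)

x_map = {(i, j): rng.normal(size=dim) for (i, j) in W}   # subjects x^{ij(k)}
y_map = {(i, j): f(i, j, x_map[i, j]) for (i, j) in W}   # objects  y^{ij(k)}
```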

Consequences

The consequence of this is that when the brain adjusts weights in a minicolumn, it has fewer nodes in a smaller World and needs less training. The processing complexity the minicolumn faces can be several orders of magnitude lower, and thus much less demanding. Compare this to a digital neural network where input vectors with thousands of entries are processed. That makes the brain tick faster than a comparable computer, even though the processing elements in the brain are quite slow. It is not because the brain is in any way inferior to a digital computer; it is because nature has found a better paradigm for computing these relations.
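
A rough comparison of the two regimes, reusing the cost sketch from above with $\alpha = \beta = 1$; all sizes are chosen only for illustration, and the gap widens as the global training set grows relative to the local ones.

```python
# A rough comparison of the two regimes, with alpha = beta = 1 and all sizes
# chosen only for illustration: one global function versus a map of local ones.
def layer_cost(N, M, T, alpha=1.0, beta=1.0):
    return alpha * T * M + beta * N * M

one_global = layer_cost(N=4096, M=4096, T=1_000_000)   # a single large function
many_local = 256 * layer_cost(N=64, M=64, T=1_000)     # 256 minicolumn-sized functions

print(f"global: {one_global:.2e}  local: {many_local:.2e}")
```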

Sources

  1. Hudson, Drew A.; Manning, Christopher D.; Compositional Attention Networks for Machine Reasoning [arXiv] (2018) CoRR volume abs/1803.03067
  2. Manning, Christopher D.; A Neural Network Model That Can Reason (2018-05-04)
  3. Bottou, Léon; From Machine Learning to Machine Reasoning [arXiv] (2011) CoRR volume abs/1102.1808
  4. Chen, Xinyun; Liu, Chang; Song, Dawn; Tree-to-tree Neural Networks for Program Translation [arXiv] (2018) CoRR volume abs/1802.03691
  5. Evans, Richard; Saxton, David; Amos, David; Kohli, Pushmeet; Grefenstette, Edward; Can Neural Networks Understand Logical Entailment? [arXiv] (2018) CoRR volume abs/1802.08535
  6. Jeblad; User:Jeblad/Minicolumns as a language [Norwegian Bokmål Wikipedia] (2019)
  7. Searle, John R.; Minds, brains, and programs (1980) Behavioral and Brain Sciences volume 3 Cambridge University Press