Some Thoughts on Language Processing Algorithms

Published: April 14, 2008
Tags: language linguistics phd psycholinguistics algorithms

My approach to understanding natural language is what I imagine is the approach taken by most materialist scientists - that the brain is a computer made of meat, and that in trying to understand things like the acquisition of language we are really searching for the algorithms implemented in this meat which achieve this task.

The problem of confirming that a given algorithm actually bears some resemblance to that running in our brain is an interesting one - not strictly necessary if we're only interested in cool applications like talking computers (in which case the performance of the algorithm is about all we're interested in), but probably deserving of attention if we're operating under some pretense of being psychologists, which I suppose I am now (though I don't like to think of it that way because I still haven't had complete success in cleansing the word "psychologist" of the stigma of pseudoscience that it carries in my mind).

An obvious approach is to implement the algorithm in silicon rather than meat and get it to perform various tasks, making as many observations as possible about its performance and comparing these to similar measurements made on humans using the meaty algorithm. There's a wide range of observations that could be used here (for example, some measure of susceptibility to linguistic "slip ups", like spoonerisms) and I expect a lot of thought could be devoted to determining which tests are the most appropriate and reliable along these lines - a kind of "psycholinguistic Voight-Kampff test" which, rather than aiming to determine whether or not a machine can understand and converse in a way which is similar to humans on the surface, like a Turing test, aims to determine whether or not that machine is understanding and conversing in a way similar to humans "under the hood".

But before we can even get to the stage where we could perform such testing, we need an algorithm to test, and I wonder if a lot of effort might not be saved by designing our algorithms from the outset to have a better chance of resembling the brain's natural algorithms. The motivating question here is "What can we deduce about the brain's language processing algorithms from the knowledge that they have been hard-coded into an organic organ by evolutionary forces?". I'm a little bit out of my league here, having no real background in evolutionary biology or neurophysiology (which may well not even be a word), but studying maths gives you a fantastic arrogance when it comes to feeling qualified to talk about other people's disciplines (after all, biology is just applied chemistry, which is just applied physics, which is just applied maths. Right?).

I have three somewhat solid thoughts on this front at the moment, all stemming from the idea that the brain, like most (all?) organic organs, probably displays a high level of self-similarity, i.e. has the property, or is composed of sub-parts which have the property, of containing lots of copies of a similar sub-structure. This tendency is a pretty obvious and natural consequence of organs growing via a process of repeated cell division. So what does this self-similarity suggest?

Parallelism. Some algorithms are highly susceptible to being made to run in parallel, with linear or sometimes even super-linear speed-up achievable, whereas others are inherently very serial. It seems natural to expect that the brain is much more likely to be running parallel algorithms (with similar activity happening in several similarly structured parts of the brain), so perhaps we ought to cast some doubt over any language processing algorithm which seems hard to parallelise.
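To make the contrast concrete, here's a toy sketch in Python (my own illustration - the per-word scoring function is completely made up). The first function's work can be spread across processors because each word is handled independently; the second has a dependency chain, since each step needs the previous result before it can start.

```python
# Toy contrast between a data-parallel task and a serial dependency chain.
from concurrent.futures import ProcessPoolExecutor

def score_word(word):
    # Hypothetical per-word computation - independent of every other word.
    return sum(ord(c) for c in word) % 17

def parallel_scores(words):
    # Each word can be scored on a different processor, so the work splits
    # up naturally and speed-up is roughly linear in the number of workers.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(score_word, words))

def serial_chain(words):
    # Each step feeds the previous result into the next computation, so the
    # steps have to run one after another no matter how many processors
    # are available.
    state = 0
    for word in words:
        state = score_word(word + str(state))  # next step needs last result
    return state

if __name__ == "__main__":
    words = ["the", "cat", "sat", "on", "the", "mat"]
    print(parallel_scores(words))
    print(serial_chain(words))
```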

Recursion and iteration. The more recursion and iteration involved in an algorithm, the less need there is in a meat implementation for different pieces of meat which do different things. If we are supposing that evolution will tend to produce a lot of similar brain parts rather than a wide range of unique brain parts, then perhaps we ought to cast some doubt over any language processing algorithm which does not contain a lot of recursion or iteration. This particular "restriction" (really more of an intuitive guide, I guess) puts the apparent current trend toward using Bayesian statistics in cognitive modelling in a good light, because the Bayesian paradigm is really all about iteration, in the sense of constantly updating our prior probability distribution in response to observations.
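As a toy illustration of what I mean by iteration in the Bayesian sense (my own example, not drawn from any particular cognitive model), here's one small update rule applied over and over to a flat prior about whether an ambiguous word is a noun or a verb:

```python
# Iterative Bayesian updating: the same tiny step repeated per observation.
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]           # candidate values of P(noun)
prior = [1 / len(hypotheses)] * len(hypotheses)  # start with a flat prior

def update(prior, observed_noun):
    # One Bayesian step: weight each hypothesis by the likelihood of the
    # observation under it, then renormalise to get the new distribution.
    likelihoods = [h if observed_noun else (1 - h) for h in hypotheses]
    unnormalised = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# The identical update is iterated for every observation: noun, noun, verb, noun.
for observation in [True, True, False, True]:
    prior = update(prior, observation)

print([round(p, 3) for p in prior])
```

The point, from the self-similarity angle, is that the whole computation is just one small step repeated for every observation - exactly the kind of structure that lots of similar brain parts could plausibly implement.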

Sharing of data structures. There is more than one computational task in language processing. Sometimes we're trying to translate a string of words into a logical relation between concepts, and sometimes we're trying to translate in the other direction. Obviously there are some data storage and searching issues involved here - we need to store words, concepts and some sorts of mappings between them. Thus there are data structures involved - not necessarily perfect analogues of the data structures one meets in a CS course (I doubt our brain uses literal hash tables, for instance), but data structures nevertheless. Presumably this data is stored in our brain only once, in one particular fashion. Thus, if you have one algorithm for translating in one direction and another for translating in the other, but they use different data structures to represent concepts, words or the links between them, then regardless of how well the algorithms perform, perhaps we ought to suspect that at least one of them is not an accurate model of how the human brain actually works.
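Here's a toy sketch of the shared-structure idea (invented words and concepts, and no claim at all about how the brain actually stores them): a single lexicon holds the word-concept links, and both the comprehension direction and the production direction consult that one store rather than each keeping its own private copy.

```python
# One shared store of word-concept links, used by both directions.
class Lexicon:
    def __init__(self):
        self.links = []  # each word-concept link is stored exactly once

    def link(self, word, concept):
        self.links.append((word, concept))

    def comprehend(self, word):
        # Word -> concept (understanding) searches the shared links.
        return next((c for w, c in self.links if w == word), None)

    def produce(self, concept):
        # Concept -> word (production) searches the very same links.
        return next((w for w, c in self.links if c == concept), None)

lexicon = Lexicon()
lexicon.link("dog", "CANINE")
lexicon.link("cat", "FELINE")
print(lexicon.comprehend("dog"))   # CANINE
print(lexicon.produce("FELINE"))   # cat
```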

Of course, it would be very foolish to interpret these as hard and fast guidelines, and I don't mean to suggest that I will constrain my own studies only to algorithms fitting these criteria. But the very act of coming up with such a list is an interesting and, in my opinion, worthwhile exercise. I would be surprised if all three of these ideas were substantially wrong, and would advise that they at least be kept in mind while designing language processing algorithms that are supposed to mimic actual human language processing.
