Convolved images as hypotheses

The traditional way to record the image location of a feature is by its retinal location, i.e. an image $(x,y)$ coordinate (red grid). When an image is convolved at different spatial scales (right, middle), e.g. with a Laplacian of Gaussian operator, the output indicates which regions of the image are darker than 'average', where the definition of 'average' depends on the scale applied. The schematic image below shows how fine-scale 'darker-than-average' regions fall within the boundaries of larger-scale regions, at least in the MIRAGE algorithm for combining filter outputs[1].
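A minimal Python sketch of the convolution just described (not the MIRAGE algorithm itself; the toy image, the two sigma values and the small tolerance are all assumptions for illustration) applies a Laplacian of Gaussian at a fine and a coarse scale and marks the 'darker-than-average' region at each:

  import numpy as np
  from scipy.ndimage import gaussian_laplace

  # Toy image: a single dark blob on a brighter background.
  y, x = np.mgrid[0:128, 0:128]
  image = 1.0 - 0.8 * np.exp(-((x - 64) ** 2 + (y - 64) ** 2) / (2 * 12.0 ** 2))

  masks = {}
  for sigma in (2.0, 8.0):                       # fine and coarse scales
      response = gaussian_laplace(image, sigma)  # LoG filter output
      # A positive LoG response marks a region darker than the local
      # 'average'; what counts as 'average' is set by sigma. The small
      # tolerance ignores numerical noise in the flat background.
      masks[sigma] = response > 1e-3 * response.max()

  # For this image the fine-scale region nests inside the coarse-scale
  # boundary, as in the caption above: this prints True.
  print(np.all(masks[2.0] <= masks[8.0]))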
If the brain really used retinal location as a code for the visual direction of objects, then all sorts of problems would follow, for example whenever the observer made a saccade. Instead, the output of neurons that are organised in a retinotopic frame (as most visual neurons are) can be treated as a 'vote': 'at this retinal location, there is such-and-such a probability that the feature I am tuned for is present in the image'. These votes can be combined, e.g. by multiplying the probabilities, which is equivalent to taking the logarithm of each signal and summing to give a log likelihood[2], in order to estimate the retinal location of the dark feature.

A simple example of this procedure is finding the centroid of a zero-bounded region of responses from a centre-surround filter, but the same principle can be applied to a 'face detector' convolved with an image: the response in the convolved output may rise gradually to a peak, which may be centred on the face, but there is no separate hypothesis or signal for each $(x,y)$ location in the image. Instead, there is one hypothesis and multiple sources of evidence from many $(x,y)$ locations. Many locations contribute to a single 'centroid', which is itself a hypothesis about the most likely location of the feature to which the filter is tuned: in this case, a dark feature. One might argue that the location of the centroid must be reported in a retinotopic coordinate frame, but that leads to another story [expand later]. The important point here is that many pixels (or firing neurons) contribute to one maximum-likelihood estimate that is reported.

The combination of neural firing rates across space can be extended to combination across time and across eye movements (e.g. micro-saccades). Given that the eye has moved in the latter case, this is a critical step in abandoning the retinal frame.
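The centroid example can be sketched directly (a minimal illustration, assuming a toy image, a LoG stand-in for the centre-surround filter, and a small tolerance standing in for 'zero-bounded'): many pixels vote, one location hypothesis per region is reported.

  import numpy as np
  from scipy.ndimage import gaussian_laplace, label, center_of_mass

  # Toy image: one dark feature centred at (row, col) = (90, 40).
  y, x = np.mgrid[0:128, 0:128]
  image = 1.0 - 0.8 * np.exp(-((x - 40) ** 2 + (y - 90) ** 2) / (2 * 10.0 ** 2))

  response = gaussian_laplace(image, sigma=4.0)  # centre-surround output
  # Zero-bounded region(s) of positive response; the tolerance ignores
  # numerical noise in the flat background.
  votes = response > 1e-3 * response.max()
  regions, n = label(votes)

  # Many pixels contribute; each region is summarised by one
  # response-weighted centroid, a single hypothesis about where the
  # dark feature is.
  centroids = center_of_mass(response * votes, regions, np.arange(1, n + 1))
  print(n, centroids)  # expect one region with a centroid near (90.0, 40.0)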
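The extension across time and eye movements can be sketched in the spirit of the product-of-experts rule cited above[2]: per-location probabilities from two fixations are multiplied, i.e. logged and summed, once the known eye movement has been undone. Everything here (the prob_map helper, the noise level, the size of the micro-saccade) is an assumption for illustration, not a model of any particular neural circuit.

  import numpy as np

  rng = np.random.default_rng(0)

  def prob_map(true_pos, shift=(0, 0), size=64, noise=0.05):
      # Noisy per-location probability that the feature is present there.
      y, x = np.mgrid[0:size, 0:size]
      ty, tx = true_pos[0] + shift[0], true_pos[1] + shift[1]
      p = np.exp(-((x - tx) ** 2 + (y - ty) ** 2) / (2 * 5.0 ** 2))
      return np.clip(p + noise * rng.standard_normal((size, size)), 1e-6, 1.0)

  eye_shift = (3, -2)                    # a known micro-saccade (d_row, d_col)
  view1 = prob_map((30, 30))             # evidence from fixation 1
  view2 = prob_map((30, 30), eye_shift)  # evidence from fixation 2, displaced
  view2 = np.roll(view2, (-eye_shift[0], -eye_shift[1]), axis=(0, 1))  # undo it

  # Multiplying probabilities = summing their logs: one combined hypothesis.
  log_lik = np.log(view1) + np.log(view2)
  print(np.unravel_index(log_lik.argmax(), log_lik.shape))  # close to (30, 30)

Note that because the evidence is combined only after the known shift has been removed, the summed log likelihood no longer lives in a purely retinal frame.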

Back to Hypotheses

References

  1. Watt, R. J., & Morgan, M. J. (1985). A theory of the primitive spatial code in human vision. Vision Research, 25(11), 1661-1674.
  2. Hinton, G. E. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN 99) (Vol. 1, pp. 1-6). IET.