A century ago, the idea of translating brain activity into speech through technological means was hardly imaginable. Long before scientists started conceiving techniques to this end, it was artists who first explored this uncharted territory. The literary technique known as stream of consciousness, which attempts to convey the way in which words flow as we think, reached its most refined expression in the famous Molly Bloom monologue at the end of James Joyce’s Ulysses, published in 1922.
Today, a century later, as Artificial Intelligence and Natural Language Processing techniques rapidly evolve, along with non-invasive methods for recording brain activity, the verbal content flowing through our brains is slowly sliding beyond the boundaries of art and imagination and into the realm of science and technology.
Decoding Brain Activity
Recently, Meta publicly shared some of the steps its researchers are taking toward this goal. Pointing toward a future in which decoding brain activity into speech can benefit many people, they used non-invasive brain-recording techniques known as electroencephalography (EEG) and magnetoencephalography (MEG). Unlike deeply invasive brain-recording techniques such as stereotactic electroencephalography and electrocorticography, these methods are less risky and require no neurosurgical intervention, which makes them far more scalable in the long run.
But to serve the goal of Meta's project, these recordings had to be combined with Artificial Intelligence models, based on Machine Learning and Natural Language Processing, that can decode them.
First, the researchers developed a deep learning model trained with contrastive learning. And what exactly is contrastive learning? It is a machine learning technique, also used in Natural Language Processing, that teaches a model to identify the general features of a dataset by contrasting pairs of data points and determining which are similar and which are different.
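The idea can be sketched with a minimal, self-contained example. This is not Meta's actual model; it is a toy InfoNCE-style contrastive loss (a common formulation of contrastive learning) in plain NumPy, where row i of each batch is assumed to be a matched brain/audio pair and every other row serves as a negative example:

```python
import numpy as np

def info_nce_loss(brain_emb, audio_emb, temperature=0.1):
    """Toy contrastive (InfoNCE-style) loss.

    brain_emb, audio_emb: (batch, dim) arrays where row i of each
    is a matched pair. The loss pulls matched pairs together and
    pushes mismatched pairs apart.
    """
    # L2-normalize so the dot product becomes cosine similarity
    b = brain_emb / np.linalg.norm(brain_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = b @ a.T / temperature  # (batch, batch) similarity matrix
    # Softmax over candidate audio clips; the diagonal holds the positives
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    diag = np.arange(len(probs))
    return -np.mean(np.log(probs[diag, diag]))
```

Training then consists of adjusting the two encoders so this loss shrinks: matched brain/audio pairs score as similar, everything else as dissimilar.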
In this case, they leveraged, on the one hand, non-invasive technologies that measure the fluctuations of electric and magnetic fields elicited by neuronal activity, taking approximately 1,000 snapshots of macroscopic brain activity every second. Since these recordings vary extensively across individuals, depending on anatomy and recording conditions, they are fed into a “brain model” that realigns each person's brain signals onto a template brain.
On the other hand, the researchers had a large pool of speech sounds drawn from audiobooks that subjects listened to while their brain activity was recorded. The deep learning model was then used to align the non-invasive brain recordings with the speech sounds. The system was able to determine, with a certain degree of precision, which of the audio clips a subject was hearing, and from there to infer the words the subject most likely heard.
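Once both sides are embedded in the same space, decoding reduces to retrieval: take the embedding of a brain recording, compare it against the pool of candidate audio-clip embeddings, and pick the closest one. A minimal sketch of that matching step (assuming the embeddings already exist; the function name and shapes are illustrative, not Meta's API):

```python
import numpy as np

def decode_clip(brain_vec, candidate_audio_embs):
    """Rank a pool of candidate audio-clip embeddings against one
    brain-recording embedding by cosine similarity.

    brain_vec: (dim,) embedding of the brain recording.
    candidate_audio_embs: (n_clips, dim) embeddings of the audio pool.
    Returns the index of the best-matching clip and all similarities.
    """
    b = brain_vec / np.linalg.norm(brain_vec)
    c = candidate_audio_embs / np.linalg.norm(
        candidate_audio_embs, axis=1, keepdims=True
    )
    sims = c @ b  # cosine similarity of each candidate clip
    return int(np.argmax(sims)), sims
```

In the published setup the prediction is probabilistic rather than a single hard pick, but the principle is the same: the clip whose embedding sits closest to the brain embedding is the one the subject most likely heard.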
Thoughts into Speech: From Speech Perception to Speech Production
Of course, this is only a step toward a bigger goal. Next, Meta researchers plan to extend the model to decode speech directly from brain activity, without the need for a pool of audio clips. The final step is to go from decoding speech perception to producing speech from brain signals alone.
The most urgent use of this type of technology is aimed at patients who have suffered a traumatic brain injury and cannot communicate through any other means. But as the technology matures and becomes more widely available, the capacity to translate brain activity into speech could have a profound influence on our societies and our culture.