Note that PhD funding is available on this subject. The PhD will start at the end of the internship, in September 2023.

Deep learning for understanding brain representations of sounds

The internship topic is part of a collaborative ANR (French National Research Agency) project with the Institut de Neurosciences de la Timone (INT; Bruno Giordano) and Maastricht University, the Netherlands (Elia Formisano).

Supervision team:

  • Thierry Artières: Machine Learning team (Qarma), Computer Science Lab (LIS), Aix-Marseille University
  • Bruno Giordano: Institut de Neurosciences de la Timone (INT), Marseille, Aix-Marseille University
  • Thomas Schatz: Machine Learning team (Qarma), Computer Science Lab (LIS), Aix-Marseille University

The internship is expected to start between February and April 2023 and to last about six months.


As we cross a quiet street on a spring day, we get distracted by the birds chirping in the trees. We are suddenly startled by the sound of a car approaching fast from out of sight, and we quickly reach the other side. Our ability to effortlessly recognize sound sources in the real world supports adaptive behaviour and is essential to our well-being. Natural sounds have a rich and highly diverse acoustic structure, and they carry meaning for the listener (What generated the sound I hear? A bird? A car? Is the car arriving fast? Am I in danger?). This internship is part of a broader project that seeks to advance our knowledge of how the brain transforms incoming acoustic information into semantic representations of the sound sources in our environment.

The main objective of the internship will be to train a variety of neural computational models able to learn new representation spaces for natural sounds (for instance autoencoders) and to evaluate to what extent these learned representation spaces correlate with brain activity during natural listening. More precisely, we will compute how the representation of a sound computed by a model correlates with the brain activation of a subject who is listening to this sound. For the evaluation stage, we will rely on a large, high-resolution MEG dataset that is already available.
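To make the evaluation idea concrete, here is a minimal sketch of one common way to compare a model's representation space with brain responses: a representational similarity analysis. This is an assumption for illustration, not necessarily the method the project will use. The arrays are synthetic stand-ins (no real MEG data), and the function names are hypothetical.

```python
import numpy as np

def pairwise_dissimilarity(features):
    """Upper triangle of the (1 - Pearson correlation) matrix between rows.

    `features` has shape (n_sounds, n_dims): one row per sound.
    """
    corr = np.corrcoef(features)                      # (n_sounds, n_sounds)
    upper = np.triu_indices(features.shape[0], k=1)   # skip diagonal and duplicates
    return 1.0 - corr[upper]

def rank(values):
    """Rank values from 0 to n-1 (ties not averaged; fine for continuous data)."""
    ranks = np.empty(len(values))
    ranks[np.argsort(values)] = np.arange(len(values))
    return ranks

def rsa_score(model_features, brain_responses):
    """Spearman correlation between model and brain dissimilarity structures."""
    model_rdm = pairwise_dissimilarity(model_features)
    brain_rdm = pairwise_dissimilarity(brain_responses)
    return np.corrcoef(rank(model_rdm), rank(brain_rdm))[0, 1]

# Synthetic stand-ins: 40 sounds, 16-dimensional model latents, 64 "sensors".
rng = np.random.default_rng(0)
n_sounds = 40
latents = rng.normal(size=(n_sounds, 16))             # hypothetical model embeddings
mixing = rng.normal(size=(16, 64))
brain_related = latents @ mixing + 0.1 * rng.normal(size=(n_sounds, 64))
brain_unrelated = rng.normal(size=(n_sounds, 64))      # no link to the latents

score_related = rsa_score(latents, brain_related)
score_unrelated = rsa_score(latents, brain_unrelated)
print(f"related: {score_related:.2f}, unrelated: {score_unrelated:.2f}")
```

In this toy setting, the responses generated from the latents should score higher than the unrelated ones. With real recordings, the same comparison would be made per subject and possibly per time window or sensor group.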

In the proposed internship, we will build on and extend deep learning methods that have fuelled recent progress in machine learning but that have not yet been applied to the audio domain, or whose potential to model human brain activity remains to be explored.

First, we will explore encoder-based models, which do not necessarily include a decoder, to learn a relevant representation space for sounds. We will consider autoencoders, adversarial or variational models, and models trained with a contrastive learning strategy. We will build on work such as variational autoencoders [Chorowski et al. 2019], contrastive learning [van den Oord et al. 2018, Wang et al. 2022], and regularization-based self-supervised methods [Bardes et al. 2021]. Each of these models aims to learn a new “latent” representation of sounds, the quality of which will be assessed by its correlation with brain activity.
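As an illustration of the contrastive strategy mentioned above, the following sketch computes an InfoNCE-style loss on a batch of embeddings. It is a simplified variant that assumes cosine similarity and a temperature parameter, not the exact scoring function of [van den Oord et al. 2018]. The embeddings are random placeholders rather than outputs of a trained audio encoder.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss for a batch of paired embeddings.

    Row i of `positives` is the positive example for row i of `anchors`;
    every other row in the batch serves as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                         # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))                  # cross-entropy on the diagonal

# Placeholder embeddings: 8 "sounds" and slightly perturbed views of them.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
aligned = emb + 0.05 * rng.normal(size=emb.shape)  # view i matches sound i
mismatched = aligned[::-1]                         # reversed order: row i pairs with row 7 - i

loss_aligned = info_nce(emb, aligned)
loss_mismatched = info_nce(emb, mismatched)
print(f"aligned: {loss_aligned:.3f}, mismatched: {loss_mismatched:.3f}")
```

The loss is low when each embedding is closest to its own positive and high when the pairing is wrong, which is the signal an encoder would be trained to minimize.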

Then, building on the results of the first step, we will take the most promising representation space among those learned by our encoder models and develop a synthesis model capable of producing a natural sound from its latent representation. We will mainly draw inspiration from generative adversarial networks (GANs) [Donahue et al. 2018] and latent diffusion models [Baas et al., 2022], [Goel et al., 2022]; see also [Dhariwal and Nichol 2021].


[Baas et al., 2022] Baas, M., and H. Kamper. “GAN you hear me? Reclaiming unconditional speech synthesis from diffusion models.” arXiv preprint arXiv:2210.05271 (2022).

[Bardes et al. 2021] Bardes, Adrien, Jean Ponce, and Yann LeCun. “VICReg: Variance-invariance-covariance regularization for self-supervised learning.” arXiv preprint arXiv:2105.04906 (2021).

[Chorowski et al. 2019] Chorowski, Jan, et al. “Unsupervised speech representation learning using WaveNet autoencoders.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.12 (2019): 2041-2053.

[Dhariwal and Nichol 2021] Dhariwal, Prafulla, and Alexander Nichol. “Diffusion models beat GANs on image synthesis.” Advances in Neural Information Processing Systems 34 (2021): 8780-8794.

[Donahue et al. 2018] Donahue, Chris, Julian McAuley, and Miller Puckette. “Synthesizing audio with generative adversarial networks.” arXiv preprint arXiv:1802.04208 (2018).

[Goel et al., 2022] Goel, Karan, Albert Gu, Chris Donahue, and Christopher Ré. “It’s raw! Audio generation with state-space models.” arXiv preprint arXiv:2202.09729 (2022).

[Guzhov et al. 2022] Guzhov, Andrey, et al. “AudioCLIP: Extending CLIP to image, text and audio.” ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.

[Sarkar and Etemad 2021] Sarkar, Pritam, and Ali Etemad. “Self-supervised audio-visual representation learning with relaxed cross-modal temporal synchronicity.” arXiv preprint arXiv:2111.05329 (2021).

[van den Oord et al. 2016] Oord, Aaron van den, et al. “WaveNet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).

[van den Oord et al. 2018] Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. “Representation learning with contrastive predictive coding.” arXiv preprint arXiv:1807.03748 (2018).

[Wang et al. 2021] Wang, Luyu, et al. “Multimodal self-supervised learning of general audio representations.” arXiv preprint arXiv:2104.12807 (2021).

[Wang et al. 2022] Wang, Luyu, et al. “Towards learning universal audio representations.” ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.