AM '23: Proceedings of the 18th International Audio Mostly Conference
SESSION: CHAPTER 1: MUSIC
Instrumental Agency and the Co-Production of Sound: From South Asian Instruments to Interactive Systems
In this paper, we will look at sympathetic resonance as seen in South Asian instruments as a source of complex performer-instrument interaction. In particular, we will compare this rich tradition to the various types of human/machine interactions that arise in digital instruments endowed with computational agency. In reflecting on the spectrum of agency that exists between the extremes of instrumental performance and machine partnership, we will arrive at two concepts to help frame our study of complex interactions in acoustic instruments: the co-production of sound and material agency. As a case study, we asked musicians of these South Asian instruments questions about their perceived relationship with their sympathetic strings. Building upon this, we designed and created an interactive system that models the phenomenon of performing with sympathetic strings. We then asked musicians to interact with this new system and answer questions based on this experience. The results of these sessions were examined both to uncover any similarities between the two sets of interviews, and to situate this entangled performer-instrument interaction with respect to markers of perceived control, influence, co-creation, and agency.
Stringesthesia: Dynamically Shifting Musical Agency Between Audience and Performer Based on Trust in an Interactive and Improvised Performance
This paper introduces Stringesthesia, an interactive and improvised performance paradigm. Stringesthesia uses real-time neuroimaging to connect performers and audiences, enabling direct access to the performer’s mental state and determining audience participation during the performance. Functional near-infrared spectroscopy (fNIRS), a noninvasive neuroimaging tool, was used to assess metabolic activity of brain areas collectively associated with a metric we call “trust”. A visualization representing the real-time measurement of the performer’s level of trust was projected behind the performer and used to dynamically restrict or promote audience participation: e.g., as the performer’s trust in the audience grew, more participatory stations for playing drums and selecting the performer’s chords were activated. Throughout the paper we discuss prior work that heavily influenced our design, conceptual and methodological issues with using fNIRS technology, and our system architecture. We then describe feedback from the audience and performer in a performance setting with a solo guitar player.
This paper presents an application of affective conditional modifiers (ACMs) in adaptive video game music – a technique whereby the emotional intent of background music is adapted, based on biofeedback, to enforce a target emotion state in the player, thus providing a more immersive experience. The proposed methods are explored in a bespoke horror game titled "The Hidden", which uses ACMs to enforce states of calmness in stressed players, and states of stress in calm players, through the procedural adaptation of background music timbre and instrumentation. These two conditions, along with a control condition, are investigated through an experimental study. Due to the low number of participants, the results of the user study provide limited insight into the effectiveness of the proposed ACMs. Nevertheless, the experiment design and user feedback highlight a number of important considerations and potential directions for future work. Namely, the need for consideration of the individual affective profile of the player, the audio-visual and narrative cues that may reduce the impact of affective audio, the effects of game familiarity on affective responses, and the need for ACM thresholds that are well-suited to the context and narrative of the game.
Inner City in the Listener's Auditory Bubble: Altering the Listener's Perception of the Inner City through the Intervention of Composed Soundscapes
This paper describes the effect on the listeners’ experience of headphone listening to a music composition including inner-city sound while being in an inner-city environment, using a research through design approach. The study focuses on the listeners’ described experiences through the lens of Berleant’s aesthetic sensibility and Bull’s phenomenon of the auditory bubble. We produce a composition which participants listen to in an urban context and discuss the two main themes found, soundtrack and awareness, together with the indications of the possibility to direct listeners’ attention and level of immersion by including inner-city ambience and sound in music when listening with headphones in an urban environment.
An Interactive Tool for Exploring Score-Aligned Performances: Opportunities for Enhanced Music Engagement
Music scholars and enthusiasts have long been engaged with both performance recordings and musical scores, but inconveniently, these two closely connected mediums are usually stored separately. Currently, digital music libraries tend to have fairly traditional user interfaces for browsing music recordings, and more importantly, performance recordings are organized separately from their musical scores. In recent years, however, the same technological advances that have made vast troves of sound recordings and musical scores more widely available have also created tremendous potential for innovative new interfaces that can facilitate enhanced engagement with the music. In this paper, we present a web-based prototype tool that allows users to navigate classical piano recordings interactively in conjunction with their scores. We describe the technologies involved, and provide access to the actual website. Our pilot testing results are very positive, confirming the usefulness and potential of such a tool, especially in the areas of music education and scholarly research. We also discuss future development of this prototype.
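While the abstract does not detail the alignment method, score-to-audio alignment in tools of this kind is commonly built on dynamic time warping (DTW) over feature sequences such as chroma vectors. A minimal sketch of the core DTW step, assuming a precomputed pairwise cost matrix (the matrix below is illustrative):

```python
import numpy as np

def dtw_path(cost):
    """Dynamic time warping over a pairwise cost matrix.
    Rows index score events, columns index audio frames; the returned
    path links each score event to the audio frames it aligns with."""
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover the optimal alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Illustrative cost matrix: cheap on the diagonal, so the path follows it
cost = 1.0 - np.eye(4)
print(dtw_path(cost))  # -> [(0, 0), (1, 1), (2, 2), (3, 3)]
```

In practice the cost matrix would hold, e.g., cosine distances between score-derived and audio-derived chroma frames.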
In this paper, we will present a pilot study that explores the relationship between music and movement in dance phrases spontaneously choreographed to follow phrases of electroacoustic music. Motion capture recordings from the dance phrases were analyzed to get measurements of contraction-expansion and kinematic features, and the temporal location of the peaks of these measurements was subsequently compared with the peaks of a set of audio features analyzed from the musical phrases. The analyses suggest that the dancers variably accentuate their movements to the peaks or accents in the music. The paper discusses the findings in addition to possible improvements of the research design in further studies.
SESSION: CHAPTER 2: MUSIC INFORMATION RETRIEVAL (MIR)
Music offers a meaningful way for people living with dementia to interact with others and can provide health and wellbeing benefits. Enjoying shared activities helps couples affected by dementia retain a sense of couplehood and can support a spousal caregiver’s mental health. This paper describes the development of the Music Memory Makers (MMM) Duet System, a prototype that has been developed as part of a qualitative, multi-phase, iterative research study to test its feasibility for use with people living with dementia and their spousal caregivers. Through the iterative process, the diverse individual needs of the participants directly led to the addition, adjustment, or removal of features and components to better fit their needs and to make the system require as little technical experience from the users as possible for quick and easy engagement. In line with our work of developing system hardware and software to meet users’ needs, including 3D printed cases, coordination facilitation processes, a visual interface, and source separation tools to create familiar duets, participants found the duet system offered them an opportunity to enjoyably interact with one another by playing meaningful songs together.
Music classification algorithms use signal processing and machine learning approaches to extract and enrich metadata for audio recordings in music archives. Common tasks include music genre classification, where each song is assigned a single label (such as Rock, Pop, or Jazz), and musical instrument classification. Since music metadata can be ambiguous, classification algorithms cannot always achieve fully accurate predictions. Therefore, our focus extends beyond the correctly estimated class labels to include realistic confidence values for each potential genre or instrument label. In practice, many state-of-the-art classification algorithms based on deep neural networks exhibit overconfident predictions, complicating the interpretation of the final output values. In this work, we examine whether the issue of overconfident predictions and, consequently, non-representative confidence values is also relevant to music genre classification and musical instrument classification. Moreover, we describe techniques to mitigate this behavior and assess the impact of deep ensembles and temperature scaling in generating more realistic confidence outputs, which can be directly employed in real-world music tagging applications.
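Temperature scaling, one of the calibration techniques assessed here, divides the network's logits by a scalar T fitted on held-out data before the softmax; with T > 1 this softens overconfident probabilities without changing the predicted class. A minimal sketch (the genre logits are illustrative values, not real model outputs):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def temperature_scale(logits, T):
    """Divide logits by a temperature T before the softmax.
    T > 1 softens overconfident probabilities; the argmax (and thus the
    predicted genre or instrument) is unchanged."""
    return softmax(np.asarray(logits, dtype=float) / T)

# Illustrative logits for (Rock, Pop, Jazz)
logits = [8.0, 2.0, 1.0]
raw = softmax(np.asarray(logits, dtype=float))
calibrated = temperature_scale(logits, T=3.0)
print(raw.max(), calibrated.max())  # confidence drops, prediction stays the same
```

In practice T is fitted on a validation set by minimising negative log-likelihood, and deep ensembles instead average the probability outputs of several independently trained models.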
An Empirical Study on the Effectiveness of Feature Selection and Ensemble Learning Techniques for Music Genre Classification
Classical machine learning has long been utilized for classification and regression tasks, primarily focusing on tabular data or handcrafted features derived from various data modalities, such as music signals. Music Information Retrieval (MIR) is an emerging field that seeks to automate the management process of musical data. This paper explores the potential of employing ensemble learning techniques to enhance classification performance while assessing the impact of feature selection methods on accuracy and computational efficiency across three publicly available datasets: Spotify, TCC_CED, and GTZAN. The Spotify and TCC_CED datasets contain high-level musical features, such as energy, key, and duration, while the GTZAN dataset incorporates low-level acoustic features extracted from audio recordings. The empirical experiments and qualitative analysis reveal a significant performance improvement when employing ensemble learning techniques for handling high-level features. Furthermore, the findings suggest that applying appropriate feature selection methods can substantially reduce computational time. As a result, by strategically combining optimal feature selection and classification models, the performance can be boosted in terms of accuracy and computational time. This study provides insights for optimizing music genre classification tasks through the strategic selection and balancing of model performance, ensemble learning techniques, and feature selection methods, ultimately contributing to advancements of musical genre classification tasks in MIR.
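As an illustration of the two ingredients the study combines, the sketch below pairs a simple filter-style feature-selection score with a hard-voting ensemble; the scoring function and data are illustrative stand-ins, not the paper's exact pipeline:

```python
import numpy as np

def select_k_best(X, y, k):
    """Filter-style selection: rank features by the separation of binary class
    means (a simple stand-in for the scoring functions used in practice)."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    scores = np.abs(m0 - m1) / (X.std(axis=0) + 1e-9)
    return np.argsort(scores)[::-1][:k]

def majority_vote(predictions):
    """Hard-voting ensemble: each row of `predictions` is one model's labels."""
    P = np.asarray(predictions)
    return (P.sum(axis=0) * 2 > P.shape[0]).astype(int)

# Illustrative: three models' binary genre predictions on five tracks
preds = [[1, 0, 1, 1, 0],
         [1, 1, 0, 1, 0],
         [0, 0, 1, 1, 1]]
print(majority_vote(preds))  # -> [1 0 1 1 0]

# Illustrative feature matrix: feature 0 separates the classes best
X = np.array([[0.0, 5.0], [0.1, 5.1], [1.0, 5.0], [1.1, 4.9]])
y = np.array([0, 0, 1, 1])
print(select_k_best(X, y, k=1))
```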
In recent years, Accessible Digital Musical Instruments (ADMIs) designed for motor-impaired individuals that incorporate gaze-tracking technologies have become more prevalent. To ensure a reliable user experience and minimize delays between actions and sound production, interaction methods must be carefully studied. This paper presents Kiroll, an affordable and open-source software ADMI specifically designed for quadriplegic users. Kiroll can be played by motor-impaired users through eye gaze for note selection and breath for sound control. The interface features the infinite keyboards context-switching interaction method, which exploits the smooth-pursuit capabilities of human eyes to provide an indefinitely scrolling layout so as to resolve the Midas Touch issue typical of gaze-based interaction. This paper outlines Kiroll’s interaction paradigm, features, implementation processes, and design approach.
SESSION: CHAPTER 3: SONIFICATION
The problem of noise in hospitals is commonly tackled through noise abatement practices, which consider 'quietness' as a quality indicator. However, the influence of positive or negative subjective reactions to these sounds is rarely examined. Recent efforts emphasize the importance of considering the benefits of wanted sound while minimizing unwanted noise to reach a positive healthcare soundscape. The authors identified sound zones in shared hospital spaces as a means to achieve this through sound separation, noise masking and designed sound zone content. Listening evaluations were conducted to assess the subjective responses of individuals hearing a hospital soundscape across a variety of sound zone interventions. The authors conclude that sound zone interventions in shared hospital spaces offer subjective benefits that move beyond noise reduction. As an area for future work, sound zone interventions will be deployed in hospital settings to study potential long-term restorative effects on patients and better working conditions for staff.
Using design dimensions to develop a multi-device audio experience through workshops and prototyping
Designing audio experiences for heterogeneous arrays of multiple devices is challenging, and researchers have tried to identify useful design practices. A set of design dimensions have been proposed, providing researchers and creative practitioners with a framework for understanding the different design considerations for multi-device audio; however, they have yet to be used for scoping and developing a new experience. This work investigates the utility of the design dimensions for exploring and prototyping new multi-device audio experiences. Three workshops were conducted with audio professionals to see how the design dimensions could be used to form new ideas. Using the resulting ideas, a multi-device audio system combining loudspeakers and earbuds, and an experience based on that system, were created and demonstrated. The design dimensions were found to be useful for understanding multi-device audio experiences and for quickly forming new ideas. In addition, the dimensions were a helpful reference during experience development for testing different design choices, particularly for audio allocation.
We present an interactive modular system built in Cycling ‘74 Max and interfaced with Grame’s FAUST for the purpose of analyzing, processing and mapping electrophysiological signals to sound. The system architecture combines an understanding of domain-specific (biophysiological) signal processing techniques with a flexible, modular and user-friendly interface. We explain our design process and decisions towards artistic usability, while maintaining a clear electrophysiological data flow. The system allows users to customize and experiment with different configurations of sensors, signal processing and sound synthesis algorithms, and has been tested in a range of different musical settings from user studies to concerts with a diverse range of musicians.
When developing auditory display systems, one must balance the tendency for sonification algorithms to produce potentially informative, but less engaging, direct representations of data, with more aesthetically pleasing transformations where the underlying data is prone to obfuscation. In a scientific communication context, the successful navigation of this continuum becomes increasingly critical. As such, we take air quality data as a vehicle to explore this concept, with the ultimate goal of raising awareness of declining air quality in modern urban landscapes in order to drive societal change in response.
Employing an aesthetically driven, artistic practice-based approach, we transform field recordings into an ever-evolving soundscape using generative music and algorithmic composition methods. Specifically, we present a novel, real-time granular synthesis-based sonification method that draws upon auditory icon, parameter mapping, and model-based sonification concepts, to create an output that invites an emotional connection with the underlying data. Finally, we discuss the design implications and constraints of this approach, before challenging some fundamental assumptions and conventions of modern sonification practice, while advocating for a tighter integration between the worlds of traditional sonification and sound art.
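As a rough illustration of parameter mapping within a granular approach (a much-simplified stand-in for the system described, with made-up data and an arbitrary mapping range):

```python
import numpy as np

def grain(freq, dur, fs):
    """A single Hann-windowed sine grain."""
    t = np.arange(int(dur * fs)) / fs
    return np.sin(2 * np.pi * freq * t) * np.hanning(len(t))

def sonify(values, fs=8000, grain_dur=0.05):
    """Parameter-mapping granular sonification: each data value sets the pitch
    of one grain, mapped linearly onto 220-880 Hz (an arbitrary choice)."""
    lo, hi = min(values), max(values)
    grains = []
    for v in values:
        f = 220.0 + (v - lo) / (hi - lo + 1e-12) * (880.0 - 220.0)
        grains.append(grain(f, grain_dur, fs))
    return np.concatenate(grains)

# Hypothetical air-quality readings (e.g. hourly PM2.5 values)
signal = sonify([12, 35, 80, 55, 20])
```

The system described in the paper additionally draws grain material from field recordings rather than synthetic tones; this sketch only shows the data-to-parameter step.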
A perceived stagnation in the field of sonification research has been discussed, even as sonification has spread in simpler forms. We present a data set from Google Scholar that provides insights into the state of sonification research. Based on these data, the literature, and a small expert poll, we propose criteria for effective sonification design: the use of easily perceptible sounds that are mapped naturally, do not contradict the data metaphor, and are appropriate to the task. A quantitative analysis of the data found no correlation between effective sonifications and either the number of citations or the year of publication.
This paper describes an ideation workshop aiming to explore the intersection of sonic interactions and energy use. As part of a larger research project exploring the role that sound can play in efficient energy behaviours, the workshop encouraged users to look for overlaps between their home resource use, potential sonic feedback, and the feelings and emotions elicited by both. The workshop design was successful in providing non-experts with space and tools to reflect on the complex relationship between household, sound, energy and our feelings towards them. On a more practical level, 15 “hotspots” were identified where sound and energy concerns could be potentially addressed with sonic interventions, and four speculative prototypes were developed during the workshop, each one revealing original considerations and relationships between sound and energy to be developed further in future work.
As interaction design has advanced, increased attention has been directed to the role that aesthetics play in shaping factors of user experience. Historically stemming from philosophy and the arts, aesthetics in interaction design has gravitated towards visual aspects of interface design thus far, with sonic aesthetics being underrepresented. This article defines and describes key dimensions of sonic aesthetics by drawing upon the literature and the authors’ experiences as practitioners and researchers. A framework is presented for discussion and evaluation, which incorporates aspects of classical and expressive aesthetics. These aspects of aesthetics are linked to low-level audio features, contextual factors, and user-centred experiences. It is intended that this initial framework will serve as a lens for the design, and appraisal, of sounds in interaction scenarios and that it can be iterated upon in the future through experience and empirical research.
SESSION: CHAPTER 4: ARTIFICIAL INTELLIGENCE (AI) AND MACHINE LEARNING (ML)
Tone Transfer is a novel deep-learning technique for interfacing a sound source with a synthesizer, transforming the timbre of audio excerpts while preserving their musical form. Owing to its good audio quality and continuous controllability, it has recently been applied in several audio processing tools. Nevertheless, it still presents several shortcomings related to poor sound diversity and limited transient and dynamic rendering, which we believe hinder its possibilities for articulation and phrasing in a real-time performance context.
In this work, we present a discussion on current Tone Transfer architectures for the task of controlling synthetic audio with musical instruments and discuss their challenges in allowing expressive performances. Next, we introduce Envelope Learning, a novel method for designing Tone Transfer architectures that map musical events using a training objective at the synthesis parameter level. Our technique can render note beginnings and endings accurately and for a variety of sounds; these are essential steps for improving musical articulation, phrasing, and sound diversity with Tone Transfer. Finally, we implement a VST plugin for real-time live use and discuss possibilities for improvement.
A Free Verbalization Method of Evaluating Sound Design: The Effectiveness of Artificially Intelligent Natural Language Processing Methods and Tools
Research on sound design evaluation methodologies relating to connotation, or the evocation of mental imagery, is limited. Prior tools for data analysis have fallen short, making the process time-consuming and difficult. We explore here a variety of new AI-powered Natural Language Processing tools to evaluate the data. Results showed that free verbalization is a fruitful method for answering some research questions about sound, giving rise to many interesting insights and leading to further research questions.
This paper applies supervised contrastive learning to musical onset detection to alleviate the issue of noisy annotated data for onset datasets. The results are compared against a state-of-the-art, convolutional, cross-entropy model. Both models were trained on two datasets. The first dataset comprised a manually annotated selection of music. This data was then augmented with inaccurate labelling to produce the second dataset. When trained on the original data, the supervised contrastive model produced an F1 score of 0.878, close to the cross-entropy model's score of 0.888. This showed that supervised contrastive loss is applicable to onset detection but does not outperform cross-entropy models in an ideal training case. When trained on the augmented set, the contrastive model consistently outperformed the cross-entropy model across increasing percentage inaccuracies, with a difference in F1 score of 0.1 for the most inaccurate data. This demonstrates the robustness of supervised contrastive learning to inaccurate data for onset detection, suggesting that supervised contrastive loss could provide a new onset detection architecture that is invariant to noise in the data or inaccuracies in labelling.
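The supervised contrastive objective underpinning this approach (in the formulation popularised by Khosla et al.) pulls embeddings with the same label together and pushes embeddings with different labels apart. A minimal numpy sketch of the loss, not the paper's training code, on toy embeddings:

```python
import numpy as np

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor, the average negative
    log-probability of its same-label positives against all other samples."""
    Z = np.asarray(embeddings, dtype=float)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # L2-normalise
    labels = np.asarray(labels)
    sim = Z @ Z.T / tau                                # temperature-scaled similarities
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        mask = np.arange(n) != i                       # exclude the anchor itself
        pos = mask & (labels == labels[i])             # same-class positives
        if not pos.any():
            continue
        log_prob = sim[i] - np.log(np.exp(sim[i][mask]).sum())
        loss += -log_prob[pos].mean()
        count += 1
    return loss / count

# Two tight clusters: matching labels give a low loss, mismatched a high one
Z = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(supcon_loss(Z, [0, 0, 1, 1]), supcon_loss(Z, [0, 1, 0, 1]))
```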
Onset Detection for String Instruments Using Bidirectional Temporal and Convolutional Recurrent Networks
Recent work in note onset detection has centered on deep learning models such as recurrent neural networks (RNN), convolutional neural networks (CNN) and more recently temporal convolutional networks (TCN), which achieve high evaluation accuracies for onsets characterized by clear, well-defined transients, as found in percussive instruments. However, onsets with less transient presence, as found in string instrument recordings, still pose a relatively difficult challenge for state-of-the-art algorithms. This challenge is further exacerbated by a paucity of string instrument data containing expert annotations. In this paper, we propose two new models for onset detection using bidirectional temporal and recurrent convolutional networks, which generalise to polyphonic signals and string instruments. We perform evaluations of the proposed methods alongside state-of-the-art algorithms for onset detection on a benchmark dataset from the MIR community, as well as on a test set from a newly proposed dataset of string instrument recordings with note onset annotations, comprising approximately 40 minutes and over 8,000 annotated onsets with varied expressive playing styles. The results demonstrate the effectiveness of both presented models, as they outperform the state-of-the-art algorithms on string recordings while maintaining comparative performance on other types of music.
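Onset detection is conventionally scored with an F-measure in which a detection counts as correct if it falls within a small tolerance window (commonly 25-50 ms) of an annotated onset. The sketch below uses a greedy one-to-one matching; standard tools such as mir_eval use a maximum bipartite matching, so treat this as an approximation:

```python
def onset_f1(detected, reference, tolerance=0.05):
    """F-measure for onset detection: a detected onset matches an annotated
    one if within +/- tolerance seconds; each annotation matches at most once."""
    ref = sorted(reference)
    matched, used = 0, set()
    for d in sorted(detected):
        for j, r in enumerate(ref):
            if j not in used and abs(d - r) <= tolerance:
                matched += 1
                used.add(j)
                break
    precision = matched / len(detected) if detected else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative onset times in seconds: one annotated onset (1.40 s) is missed
print(onset_f1([0.10, 0.52, 1.00], [0.11, 0.50, 0.98, 1.40]))  # ≈ 0.857
```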
The term impact sound, as used in this paper, can be broadly defined as the sudden burst of short-lasting impulsive noise generated by the collision of two objects. This type of sound effect is prevalent in multimedia productions. However, conventional methods of sourcing these materials are often costly in time and resources. This paper explores the potential of neural audio synthesis for generating realistic impact sound effects, targeted for use in multimedia such as films, games, and AR/VR. The designed system uses a Realtime Audio Variational autoEncoder (RAVE) model trained on a dataset of over 3,000 impact sound samples for inference in a Digital Audio Workstation (DAW), with latent representations exposed as user controls. The performance of the trained model is assessed using various objective evaluation metrics, revealing both the prospects and limitations of this approach. The results and contributions of this paper are discussed, with audio examples and source code made available.
SESSION: CHAPTER 5: SPATIAL AUDIO
With the development of consumer haptic devices that transform audio signals into vibrations, the question arises of their capacity to further immerse users and players. This study aims to evaluate how haptic feedback associated with audio reinforces our immersion in a virtual space, more specifically in VR video games. A preliminary study was carried out with a haptic belt, the Woojer Strap Edge. 17 participants played two VR shooting games, with and without haptic feedback, and answered questionnaires between each session. A post-hoc questionnaire was used to gather free-form feedback from the participants. Results show no significant differences between the conditions with and without haptic feedback in the between-session questionnaires; however, the final questionnaire reveals very strong inter-subject variability in the perception and appreciation of haptic feedback.
We present a new speaker array composed of five spherical speakers with 12 independent channels each. The prototype is open source, and its design choices are motivated here. It is designed to be a flexible device allowing a wide range of use cases, described in more detail in the paper: simultaneous rendering with surround speaker arrays, artistic installations, and acoustical measurements. The sources in the repository include filter impulse responses for frequency response correction. The measurement methodology, based on sine sweeps, is documented and allows the reader to reproduce the measurement and correction. Finally, the paper describes several use cases for which feedback is provided, and demonstrates the versatility, mobility, and ease of deployment of our proposed implementation.
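The sine-sweep methodology mentioned is typically the exponential-sweep deconvolution of Farina: play a sweep through the speaker, record it, and convolve the recording with an inverse filter to recover the impulse response. A small numpy sketch of the sweep and inverse-filter construction (parameters are illustrative, not those of the paper's repository):

```python
import numpy as np

def exp_sweep(f1, f2, duration, fs):
    """Exponential (logarithmic) sine sweep from f1 to f2 Hz (Farina's method)."""
    t = np.arange(int(duration * fs)) / fs
    R = np.log(f2 / f1)
    return np.sin(2 * np.pi * f1 * duration / R * (np.exp(t * R / duration) - 1.0))

def inverse_filter(sweep, f1, f2, fs):
    """Time-reversed sweep with a decaying amplitude envelope (+6 dB/octave
    spectral tilt), so that sweep convolved with inverse gives an impulse."""
    t = np.arange(len(sweep)) / fs
    R = np.log(f2 / f1)
    return sweep[::-1] * np.exp(-t * R / (len(sweep) / fs))

fs = 8000
sweep = exp_sweep(50.0, 2000.0, 0.5, fs)
inv = inverse_filter(sweep, 50.0, 2000.0, fs)
ir = np.convolve(sweep, inv)  # impulse-like peak near the centre of the result
```

In a real measurement the recorded loudspeaker response replaces the raw sweep in the final convolution, and the peak position gives the system delay.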
Invoke: A Collaborative Virtual Reality Tool for Spatial Audio Production Using Voice-Based Trajectory Sketching
VR could transform creative engagement with spatial audio, given its affordances for spatial visualisation and embodied interaction. However, open questions remain about how to support collaboration in spatial audio production (SAP). Exploring this problem, we built Invoke, a VR voice-based trajectory sketching tool that allows two users to shape sonic ideas together. In this paper, thematic analysis is used to review two areas of a formative evaluation with expert users: (i) video analysis of VR interactions; and (ii) analysis of open questions about using the tool. The implications present new opportunities to explore co-creative VR tools for SAP.
This paper describes and evaluates the use of 3D spatial in-air body movement interaction for human control of music software. This technique was implemented, prior to this work, in an input device prototype, MoveMIDI, which allows users to initiate rhythmic musical events by hitting zones of 3D geometry in a virtual environment using position-tracked motion controllers. This work evaluates MoveMIDI’s spatial interaction strategy for music in a usability study measuring timing accuracy of participants performing rhythms using MoveMIDI in comparison to two other input devices. The study revealed spatial unsureness of participants using MoveMIDI due to visualization and haptic shortcomings. While results for the MoveMIDI prototype are not positive, points of improvement are revealed, and our methodology provides a novel comparison for input devices in the context of rhythmic performance accuracy.
A series of auditory cues was designed to assist firefighters with navigation and general safety in a fire emergency. Firefighters must maintain situational awareness at all times, and this can be lost through disorientation, which is one of the main causes of injury and even death. Disorientation can be caused by restricted vision due to heavy smoke, a lack of familiarity with the surroundings, and hearing and communication difficulties caused by the intensity of fireground sounds. Five professional firefighters were interviewed to identify ways in which auditory affordances could be used to support their work. Existing sounds from both the emergency environment and those generated by firefighting equipment were assessed to determine their importance in maintaining situational awareness. Noise reduction technology was investigated to assess its potential for limiting the levels of noise exposure experienced. A series of auditory cues was then designed to address the issues that were found, using binaural spatialization and Augmented Reality methods. A prototype system was presented to firefighters to determine its effectiveness. The firefighters found that noise reduction would be effective in improving their situational awareness and ability to communicate effectively. Additionally, they found that spatially placed auditory cues had the potential to be effective for navigation and orientation in a fire emergency. The findings suggest that the use of noise reduction and auditory affordances has the potential to improve situational awareness for firefighters, increase safety, and potentially save lives.
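Binaural spatialization of the kind mentioned relies on interaural time and level differences; production systems convolve with measured HRTFs, but a crude two-cue sketch (the constants are rough textbook approximations, not the prototype's values) illustrates the idea:

```python
import numpy as np

def spatialize(mono, azimuth_deg, fs):
    """Crude binaural cue sketch: an interaural time difference (ITD) as an
    integer-sample delay plus an interaural level difference (ILD).
    Returns a (2, n) array ordered (left, right)."""
    az = np.radians(azimuth_deg)
    itd = 0.0007 * np.sin(az)                           # up to ~0.7 ms across the head
    delay = int(round(abs(itd) * fs))
    gain_far = 10.0 ** (-6.0 * abs(np.sin(az)) / 20.0)  # far ear up to ~6 dB quieter
    far = np.concatenate([np.zeros(delay), mono]) * gain_far
    near = np.concatenate([mono, np.zeros(delay)])
    if azimuth_deg >= 0:                                # source on the right
        return np.stack([far, near])
    return np.stack([near, far])

# A click placed 90 degrees to the right: the left ear hears it later and quieter
click = np.zeros(32); click[0] = 1.0
out = spatialize(click, 90.0, fs=8000)
```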
Over the past decade, personal listening technologies have proliferated, with headphones offering novel features and smart speaker systems becoming widespread. A technology expected to reach the public in coming years is sound zone systems, which allow interface-free personal listening for multiple users in the same space without overlap. As with all new technologies, it is difficult to ascertain what barriers may lie in the way of successful user adoption. To that end, we conducted a four-week field study in five households. Through a thematic analysis, we discovered 10 barriers to the adoption of sound zones, grouped into three aspects: Interaction and Use, Current Standards and Practices, and Limitations of Sound Zone Technology. In addition to the barriers, we discuss our findings in relation to current personal listening technologies.
A tool to study the apparent trajectories evoked by sounds (auditory trajectories) is presented. This tool is built with the aim of easing the task of the experimenters (building and analyzing interventions) and the task of the participants (reporting their opinions). By using infrared tracked controllers in a Virtual Reality environment, participants can freely describe the three-dimensional path evoked by a stimulus. The implemented tool also assists participants in recording trajectories by providing additional visual cues and feedback on the recorded data. A mock-up study is presented to demonstrate the benefits of the proposed system. Results from this study show that participants are able to accurately report elicited trajectories. While the implemented tool has limitations, such as the number of available blocks (only practice and main blocks), it could cover the needs of several laboratories. The tool is a valuable resource for researchers seeking to explore the perception and processing of auditory stimuli.