This signature work explores voice synthesis and its applications in multilingual, multimodal, and multidisciplinary contexts. The field advanced rapidly during the author's undergraduate studies, a period that coincided with the emergence of large generative models. This research contributes new methods for voice synthesis, beginning with BiSinger, a bilingual singing voice synthesis system that handles both English and Mandarin Chinese within a single model. It then introduces Stable Diffusion-Enhanced Voice Generation (SD-EVG), which leverages visual data to produce more authentic generated voices. Finally, it presents a novel method for decoding speech envelopes from EEG data using the S4 model together with a diffusion denoising block, demonstrating the potential of multidisciplinary research to improve voice synthesis technologies. These advances integrate data from multiple domains and refine existing models to push the limits of voice synthesis, with the ultimate goal of a voice synthesis system that is both universally accessible and expressive. The findings and methodologies presented here underscore the importance of innovative and inclusive approaches in advancing the field of voice synthesis for the benefit of the community.