Researchers at the University of Texas have used generative AI to create images of street views. However, they didn’t use text prompts. The AI created the images purely from audio recordings.
Research paper
The researchers published their findings in Computers, Environment and Urban Systems, available on the ScienceDirect website. In the paper, the team explained how they trained a soundscape-to-image AI model to create street view images. Yuhao Kang, an assistant professor of geography and the environment at UT and a co-author of the study, explained, “Our study found that acoustic environments contain enough visual cues to generate highly recognizable streetscape images that accurately depict different places. This means we can convert the acoustic environments into vivid visual representations, effectively translating sounds into sights.”
Training the AI
The research team worked with YouTube videos and audio recorded in cities across North America, Asia and Europe. They created 10-second clips from the different locations to train the AI. The model was designed to produce high-resolution images prompted by audio samples alone. Once trained, it was asked to generate images from 100 audio clips. These images were evaluated computationally by comparing the relative proportions of greenery, buildings and sky between the source and generated images. In addition, the team challenged human judges to match one of three generated images to an audio sample.
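The computational evaluation described above can be sketched in a few lines: for each scene element (greenery, buildings, sky), measure its proportion in each source image and in the matching generated image, then correlate the two series. The sketch below uses hypothetical proportion values and a plain Pearson correlation; the study's actual segmentation pipeline and data are not reproduced here.

```python
# Minimal sketch of the evaluation idea: correlate the relative proportions
# of a scene element (e.g. greenery) between source and generated images.
# All numbers are hypothetical, for illustration only.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-image greenery proportions (fraction of pixels)
# across five source/generated image pairs.
source_greenery = [0.30, 0.05, 0.15, 0.40, 0.10]
generated_greenery = [0.28, 0.08, 0.12, 0.42, 0.11]

r = pearson(source_greenery, generated_greenery)
print(f"greenery correlation: {r:.2f}")
```

A correlation near 1 would indicate that the generated images closely track the real scenes for that element; the study reports this agreement was stronger for sky and greenery than for buildings.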
Results
The computer analysis found strong correlations between the AI-generated and real-world images in the proportions of sky and greenery, though the correlation was lower for the proportions of buildings. The human judges correctly matched the AI-generated image to the audio source around 80% of the time, well above the one-in-three chance rate. The team’s analysis also revealed that the AI images often preserved the architectural styles of the real-world scenes, and reflected whether the audio samples were recorded in sunny, cloudy or night-time conditions.
What next
Yuhao Kang said, “Traditionally, the ability to envision a scene from sounds is a uniquely human capability, reflecting our deep sensory connection with the environment. Our use of advanced AI techniques supported by large language models (LLMs) demonstrates that machines have the potential to approximate this human sensory experience. This suggests that AI can extend beyond mere recognition of physical surroundings to potentially enrich our understanding of human subjective experiences at different places.”
What we think
The use of audio samples alone to generate AI images isn’t new in itself. However, using them to create a realistic image of a specific environment is an interesting development, and the accuracy the University of Texas researchers achieved is impressive. As the research continues, it will be interesting to see whether the AI is truly creating these images or merely copying from the data it was trained on.
Additional source: ScienceDirect