
audioLens

Type

Coursework for Intro to Generative AI

Role

Development, UX Design

Team

Solo, with help from course TA Jiabin Wei

Year

2023

Inspiration

In my walks, I used to record sounds,
Simple things: footsteps, the wind, town's hums.
Then, forgotten, those recordings lay around,
Gathering digital dust, silent, numb.

Memory faded like old photographs,
What those moments held, I couldn't recall.
But listening back, a curious path
Unfolds - the sounds, once vague, now enthrall.

Through the crackle, the rustle, the distant tone,
I piece together scenes, reshaping the known.
A re-imagination through what was caught,
A revisiting of thought, through sounds, sought.

Okay, ChatGPT wrote me this poem. To summarize the motivation for this project in my plainest human language: I wondered, would it be cool to

revive and visualize memories from audio clips?

And through the lens of the audio, we could probably even discover things that eyes might have missed.

Technical pipeline

🤔 Problem: there are few sound-to-image or sound-to-video models
🔍 Research: there are plenty of text-to-image and audio-to-text models

💡 Solution:

Breaking the problem down into pieces, I decided to achieve audio-to-video generation through: 1) audio-to-text, where I get a description of the audio; 2) text-to-image, where I can have more control over the image generation through prompting or fine-tuning; and 3) image-to-video, to produce the final clip. Alternatively, an express route would be to combine steps 2 and 3 by using a text-to-video generative model directly.
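To make the plan concrete, here is a rough skeleton of the pipeline (all function names below are placeholders for illustration, not the actual notebook code); each step gets a concrete model in the sections that follow.

```python
# Rough skeleton of the planned pipeline; each placeholder function is
# replaced by a concrete model in the sections below.
def audio_to_text(audio_path: str) -> str:
    """Step 1: turn the audio clip into a scene description (AST / LTU)."""
    raise NotImplementedError

def text_to_image(description: str):
    """Step 2: render the description as a styled image (SDXL / DreamBooth)."""
    raise NotImplementedError

def image_to_video(image, description: str) -> str:
    """Step 3: animate the image into a short clip (SVD-XT / Runway Gen-2)."""
    raise NotImplementedError

def audio_to_video(audio_path: str) -> str:
    """Chain the three steps; the express route would swap steps 2-3
    for a single text-to-video model."""
    description = audio_to_text(audio_path)
    image = text_to_image(description)
    return image_to_video(image, description)
```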

🚧 Google Colab overview:

Audio-to-text

The first audio-to-text model I tested out was AST: Audio Spectrogram Transformer (Gong et al.). Published in 2021, AST is a deep learning model that applies the Transformer architecture to audio processing. Notable for its effectiveness in audio classification and sound event detection, AST operates on spectrogram-transformed audio data. It is particularly suitable for tasks requiring detailed frequency-time analysis, such as acoustic scene classification and audio tagging. AST can automatically learn complex audio patterns, offering advantages over traditional methods that rely on hand-crafted features.

I tested the model using a chunk of the audio presented above. I resampled the audio to 16 kHz before transforming it into a spectrogram, to match the standardized input format. The code was taken from the original GitHub repo and tested on Google Colab.
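For reference, an equivalent test can be sketched with the Hugging Face port of AST using its AudioSet-finetuned checkpoint (the audio filename here is a placeholder, and this is not the exact notebook code):

```python
import torch
import torchaudio
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Load the recording (placeholder filename) and resample to 16 kHz mono,
# the format AST's feature extractor expects.
waveform, sr = torchaudio.load("walk_recording.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"
extractor = ASTFeatureExtractor.from_pretrained(ckpt)
model = ASTForAudioClassification.from_pretrained(ckpt)

# The extractor converts the waveform into a log-mel spectrogram internally.
inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Print the top predicted AudioSet sound classes.
top = torch.topk(logits[0], k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{model.config.id2label[idx.item()]}: {score:.2f}")
```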

AST results
Though not perfect, the result was satisfying. AST successfully predicted sound classes that are essential for constructing a scene.
Listen to the audio clip:
As I was experimenting with GPT-4 for scene description generation based on the sound classes, I encountered another audio recognition model, LTU...
screenshot of the test result
AST: Audio Spectrogram Transformer
LTU: Listen, Think and Understand
LTU: Listen, Think and Understand (Gong et al.), published in 2023, primarily uses a neural-network-based approach, combining elements from both audio signal processing and natural language processing, to recognize and understand audio.
It is one of the first multimodal LLMs to provide general audio (beyond speech) understanding. This approach leverages the interrelated aspects of audio and speech signals, facilitating a comprehensive understanding and interpretation of sound.

Importantly for my audio-to-image goal, LTU provides a streamlined way to produce a textual scene description from audio.

Since the model processes higher-sample-rate audio efficiently, I tested it using the same audio clip, but in its original m4a format. I ran the model on Hugging Face using the Inference API.
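The exact request depends on how LTU is hosted; as a rough illustration of the Inference API pattern (the endpoint URL, token, and payload format here are assumptions, not the exact calls I made):

```python
import requests

# Hypothetical endpoint: substitute the actual LTU model/Space ID and a
# personal Hugging Face access token.
API_URL = "https://api-inference.huggingface.co/models/<ltu-model-id>"
HEADERS = {"Authorization": "Bearer hf_xxx"}

def describe_audio(path: str):
    """POST the raw audio bytes and return the model's JSON response."""
    with open(path, "rb") as f:
        audio_bytes = f.read()
    response = requests.post(API_URL, headers=HEADERS, data=audio_bytes)
    response.raise_for_status()
    return response.json()

# The m4a clip is kept at its original sample rate, as in the test above.
print(describe_audio("walk_recording.m4a"))
```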

the AST paper by Gong et al.
LTU test results
LTU demonstrated a fascinating capability for understanding the audio and generating a detailed description of it.

Since LTU leverages an LLM, I played with prompting to elicit better scene descriptions. Some of the prompts I tested:
- What can be inferred from the spoken text and sounds? Why?
- Infer and describe a scene from the spoken text and sounds
- Describe the scene in this audio
screenshot of the test result

Text-to-video

For direct text-to-video generation, I tried the text-to-video-synthesis model developed by Alibaba Vision Lab. I cloned the model to Google Colab for the later audio-to-video workflow.
The text prompt for the video generation was chosen from the scene descriptions generated by the LTU model mentioned above.
Alibaba Vision Lab Damo text-to-video-synthesis
damo text-to-video-synthesis model on Hugging Face
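In the notebook I cloned the original ModelScope repo; an equivalent call through the diffusers port of the same model looks roughly like this (the diffusers route is an assumption, not the exact notebook code):

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Diffusers port of the damo text-to-video-synthesis model.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Scene description produced by LTU in the previous step.
prompt = (
    "A person is walking down the street, and as they walk, they hear the "
    "sound of footsteps. Suddenly, a car honks its horn, and the person "
    "stops to look around for traffic."
)

frames = pipe(prompt, num_inference_steps=25).frames[0]
print(export_to_video(frames))  # path of the rendered .mp4
```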
Results
Prompt: A person is walking down the street, and as they walk, they hear the sound of footsteps. Suddenly, a car honks its horn, and the person stops to look around for traffic.
I also overlaid the audio clip used to generate this video onto the video itself.


Text-to-image

An alternative to direct text-to-video generation is text-to-image-to-video generation. I tested this method in the hope that it would give me more control over the quality and visual style of the generated video.
To add style to the images, I tried 1) adding the style through prompting, and 2) fine-tuning techniques such as DreamBooth.

Through these tests, I found that DreamBooth is not the best training technique for adding style, since it is more subject/object focused. Nevertheless, both the prompting technique and the customized DreamBooth training yielded satisfyingly styled images.

DreamBooth paper
DreamBooth training
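Once training finishes, the customized weights can be loaded like any other diffusers checkpoint; a minimal sketch, assuming the fine-tune was saved in diffusers pipeline format (the local path and the "sks" style token are hypothetical):

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical local path to the DreamBooth fine-tuned pipeline.
pipe = DiffusionPipeline.from_pretrained(
    "./dreambooth-audiolens-style", torch_dtype=torch.float16
).to("cuda")

# "sks" stands in for the rare identifier token bound during training;
# the real token depends on how the DreamBooth run was configured.
prompt = (
    "The scene of a person walking on snowy ground while carrying "
    "an object that makes noise, in sks style"
)
image = pipe(prompt=prompt).images[0]
image.save("dreambooth_scene.png")
```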
Results
Prompt:
Imagine a first-person perspective scene where you are sitting in a room. In the far top left of your view, there's a window through which a cicada can be seen in the distance, perched on a tree branch outside. Close to you in the bottom right, there's a table fan, its blades slightly blurred as they spin. Directly ahead in the center, at a distance, is a TV on a stand, displaying a vibrant, colorful show. The room is softly lit, creating a cozy, relaxed atmosphere.
Prompt:
A landscape with a river in the bottom left corner, surrounded by lush greenery, and a small airplane with blue and white colors in the top right corner against a backdrop of an orange and purple sky at dawn or dusk.
Prompt:
Cicada in background left, fan in foreground right, and TV in foreground middle.
Adding style via prompt
Alternatively, I tried to style the image directly through the prompt. The styling part of the text prompt for this set of images was appended to the LTU-generated scene description by GPT-4. The image generation model used was stable-diffusion-xl-base-1.0. The results were consistent, accurate and pleasing.
Prompt:
An image of fountain far in the middle, violin music close on the right, children chattering far on the left, detailed, realistic, in style of Amy Friend, Man Ray, Kurt Schwitter.
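Under the hood, this step is a plain SDXL text-to-image call; a minimal sketch with diffusers (not the exact notebook code) would look like:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

# LTU scene description with the GPT-4-added style suffix (example above).
prompt = (
    "An image of fountain far in the middle, violin music close on the right, "
    "children chattering far on the left, detailed, realistic, in style of "
    "Amy Friend, Man Ray, Kurt Schwitter."
)
image = pipe(prompt=prompt).images[0]
image.save("styled_scene.png")
```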
With or without styling?
Since I was impressed by the diffusion model, I started to wonder whether it was even necessary to style the image generation at all. To find out, I compared the generated images with and without styling side by side.
Prompt:
A person is walking down the street, and as they walk, they hear the sound of footsteps. Suddenly, a car honks its horn, and the person stops to look around for traffic.
stabilityai/stable-diffusion-xl-base-1.0
with customized DreamBooth training
Prompt:
The scene of a person walking on snowy ground while carrying an object that makes noise.
stabilityai/stable-diffusion-xl-base-1.0
with customized DreamBooth training
Without directed styling, the diffusion model took the liberty of adding one, and it is hard to know beforehand which style it will choose. While the customized DreamBooth training may limit the imagination, it did give a more consistent outcome. So I decided to move forward with the diffusion model customized with DreamBooth training.

Image-to-video

For image-to-video generation, I tested out SVD-XT and Runway-Gen2.
input image for video generation
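For the SVD-XT side, the generation can be sketched with the diffusers image-to-video pipeline (the input image filename is a placeholder for the styled frame generated above; not the exact notebook code):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# The styled image from the text-to-image step, resized to SVD's native size.
image = load_image("styled_scene.png").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "audio_lens_clip.mp4", fps=7)
```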
SVD-XT vs. Runway-Gen2
Results
Runway-Gen2 generated a way more compelling result in this test.
SVD-XT
Runway-Gen2
SVD-XT vs. Runway-Gen2
I noticed that Runway-Gen2 also supports text prompting on top of image prompting for video generation. I wondered: would a supplementary text description improve Gen-2's generation? Or even help to generate a narrative timeline? The prompt used was one of the LTU-generated scene descriptions: "A person is walking down the street, and as they walk, they hear the sound of footsteps. Suddenly, a car honks its horn, and the person stops to look around for traffic."
SVD-XT
Runway-Gen2 with prompt
It turns out that the prompt is probably more confusing than helpful for Runway-Gen2's video generation. We can just keep it simple then!

Further steps

Streamline the pipeline
Add different styling options for different moods
Need-finding & user research
Explore spatial audio / 3D scene possibilities