Building AI Workflows: Combining LLMs and Voice Models - Part 2
This is part two of a two-part guide on building an AI podcast with Nitric. In this part we'll build on Part 1 to complete our fully autonomous AI podcast, adding an LLM for script writing alongside the existing text-to-speech model.
Across the two parts we're building a fully autonomous AI podcast, combining an LLM for script writing with a text-to-speech model to produce the audio content. By the end we'll be able to produce podcast-style audio from simple text prompts like "a 10 minute podcast about [add your topic here]".
Part 1 focused on generating the audio content; in this part we'll add an LLM agent to the project to automatically generate podcast scripts from short prompts. Here's an example of what we'll be able to generate from a prompt requesting a podcast about "Daylight Saving Time":
Welcome to the Dead Internet Podcast, I'm your host. I'm a writer and a curious person, and I'm here to explore the weird and fascinating side of the internet.
Today, we're going to talk about a topic that's often met with eye-rolling and groans: daylight saving time. You know, that bi-annual ritual where we spring forward or fall back an hour, supposedly to make better use of natural light. But is it really worth it? Let's take a closer look.
Daylight saving time has been around for over a century, first implemented during World War I as a way to conserve energy. The idea was simple: move the clock forward in the summer to make better use of daylight, and then fall back in the winter to conserve energy. It's a system that's been adopted by over 70 countries around the world, but it's also been criticized and ridiculed by many.
One of the main criticisms is that it's just plain annoying. Who likes waking up an hour earlier in the spring, or going to bed an hour later in the fall? It's a disruption to our daily routines, and it can be especially difficult for people who have to deal with it, like parents trying to get their kids to school on time, or people who work non-traditional hours.
But the implications of daylight saving time go beyond just annoyance. Research has shown that it can actually have negative effects on our health, including increased risk of heart attacks, strokes, and other cardiovascular problems. And then there's the economic impact, which can be significant, especially for people who work night shifts or have to deal with disruptions to their schedules.
So, why do we still do it? Well, the answer is complicated. Some argue that it's a necessary evil, that the benefits of daylight saving time outweigh the costs. Others argue that it's a relic of a bygone era, a reminder of the mistakes of the past. And then there are those who just plain don't get it, who think that the whole thing is a waste of time.
As we wrap up our discussion on daylight saving time, let's take a moment to summarize. Daylight saving time is a complex and contentious topic, with both benefits and drawbacks. While some people argue that it's a necessary ritual, others see it as a disruption to our daily lives. Whether or not you're a fan of daylight saving time, it's clear that it's a topic that sparks a lot of debate and discussion. Thanks for joining me on this journey down the rabbit hole of timekeeping. Until next time, stay curious.
Prerequisites
- uv - for Python dependency management
- The Nitric CLI
- Complete Part 1 of this guide
- (optional) An AWS account
Continuing from Part 1
If you haven't already completed Part 1, we recommend you start there. In Part 1 we set up the project structure and deployed the audio generation job to the cloud. In this part we'll add an LLM agent to our project to automatically generate scripts for our podcasts from prompts.
We'll start by adding llama-cpp-python which will allow us to perform LLM inference in Python using Llama models.
# Add the extra llm dependencies
uv add llama-cpp-python --optional llm
We add llama-cpp-python to an 'llm' optional dependency group to keep it separate from the core dependencies. This allows only the services that need it to include the dependency, reducing the size of the other docker images.
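After running the command, the pyproject.toml should contain an optional dependency group along these lines (the exact version constraint will depend on when you run the command):

[project.optional-dependencies]
llm = [
    "llama-cpp-python>=0.3.0",
]

Later in this guide, the LLM service's dockerfile installs this group with uv sync --extra llm, while the images from Part 1 leave it out.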
Download the LLM Model
In part 1, we downloaded the audio model using the huggingface_hub
package's snapshot_download
function. In this step we'll show another method to include the model in the docker image at build time. This method is useful if you want the model to be immutable and included in the docker image. If you prefer to download the model at runtime, you can follow similar steps to part 1.
In this example we'll use a quantized version of the Llama 3.2 3B model (specifically Llama-3.2-3B-Instruct-Q4_K_L.gguf), a smaller Llama model that still produces decent quality results with fast performance, even in CPU-only environments.
Create a new ./models/ directory in the project and download the Llama model file into it:
mkdir models
curl -L https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_L.gguf -o models/Llama-3.2-3B-Instruct-Q4_K_L.gguf
Let's also add this new directory to the existing .dockerignore files, to prevent the model from being included in the docker images from Part 1.
echo "models/" >> python.dockerfile.dockerignoreecho "model/" >> torch.dockerfile.dockerignore
We'll create another dockerfile for the LLM model, since it requires a different set of dependencies from the audio model. This will allow us to keep the docker images for the audio and LLM models separate.
Add the Script Generation Service
Next, we'll add a new service to our project to generate scripts for our podcasts. We'll use the llama-cpp-python
package to interact with the LLM model. Create a new file batches/script.py
with the following content:
touch batches/script.py
from common.resources import gen_podcast_job, gen_audio_job, scripts_bucket
from nitric.context import JobContext
from nitric.application import Nitric
from llama_cpp import Llama
import os

system_prompt = """You're a writer for the Dead Internet Podcast.
The podcast only has the host and no guests so the writing style is more like a speech than a script and should just be simple text with no cues or speaker names.
The host always starts with a brief introduction and then dives into the topic, and always finishes with a summary and a farewell."""

# Allow the model to be set via an environment variable
model = os.environ.get("LLAMA_MODEL", "./models/Llama-3.2-3B-Instruct-Q4_K_L.gguf")

llm = Llama(model_path=model, chat_format="llama-3", n_ctx=4096)

# allow this service to write scripts to the scripts bucket
scripts = scripts_bucket.allow("write")
# allow this job to submit the script to be turned into audio
audio_job = gen_audio_job.allow("submit")

@gen_podcast_job()
async def do_gen_script(ctx: JobContext):
    prompt = ctx.req.data["prompt"]
    title = ctx.req.data["title"]
    preset = ctx.req.data["preset"]

    print('generating script')
    completion = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": prompt
            },
        ],
        # unlimited tokens, set a limit if you prefer
        max_tokens=-1,
        temperature=0.9,
    )

    # extract just the text from the output
    text_response = completion["choices"][0]["message"]["content"]

    # store the script in the scripts bucket
    script_file = f'{title}.txt'
    await scripts.file(script_file).write(str.encode(text_response))

    print(f'script written to {script_file}')

    # send the script for audio generation
    await audio_job.submit({
        "text": text_response,
        "file": title,
        "preset": preset,
    })

Nitric.run()
You'll notice this new script service references two new resources, gen_podcast_job and scripts_bucket. Let's add those now to the common/resources.py file:
from nitric.resources import api, bucket, job, topic
import os
import tempfile

# Our main API for submitting audio generation jobs
main_api = api("main")

# A job for generating our audio content
gen_audio_job = job("audio")

+# A job for generating our audio script
+gen_podcast_job = job("podcast")

# A bucket for storing output audio clips
clips_bucket = bucket("clips")

# And another bucket for storing our models
models_bucket = bucket("models")

+# A bucket for storing our scripts
+scripts_bucket = bucket("scripts")

# Many cloud API Gateways impose hard response time limits on synchronous requests.
# To avoid these limits, we can use a Pub/Sub topic to trigger asynchronous processing.
download_audio_model_topic = topic("download-audio-model")

model_dir = os.path.join(tempfile.gettempdir(), "ai-podcast", ".model")
cache_dir = os.path.join(tempfile.gettempdir(), "ai-podcast", ".cache")
zip_path = os.path.join(tempfile.gettempdir(), "ai-podcast", "model.zip")
Add a new API route
Next, we'll add a new API route to our project to allow users to submit prompts for script generation. We can do this by editing the services/api.py
file:
from common.resources import (
    main_api, model_dir, cache_dir, zip_path,
    gen_audio_job, download_audio_model_topic, models_bucket,
+    gen_podcast_job
)
from nitric.application import Nitric
from nitric.context import HttpContext, MessageContext
from huggingface_hub import snapshot_download
import os
import zipfile
import requests

models = models_bucket.allow('write')
generate_audio = gen_audio_job.allow('submit')
download_audio_model = download_audio_model_topic.allow("publish")
+generate_podcast = gen_podcast_job.allow("submit")

audio_model_id = "suno/bark"
default_voice_preset = "v2/en_speaker_6"

@download_audio_model_topic.subscribe()
# ... (subscriber handler from Part 1, unchanged)

@main_api.post("/download-model")
async def download_audio(ctx: HttpContext):
    model_id = ctx.req.query.get("model", audio_model_id)
    if isinstance(model_id, list):
        model_id = model_id[0]

    # asynchronously download the model
    await download_audio_model.publish({ "model_id": model_id })

@main_api.post("/audio/:filename")
# ... (audio route handler from Part 1, unchanged)

+# Generate a full podcast from script to audio
+@main_api.post("/podcast/:title")
+async def submit_script(ctx: HttpContext):
+    title = ctx.req.params["title"]
+    preset = ctx.req.query.get("preset", default_voice_preset)
+    body = ctx.req.data
+
+    if body is None:
+        ctx.res.status = 400
+        return
+
+    await generate_podcast.submit({"title": title, "prompt": body.decode('utf-8'), "preset": preset})

Nitric.run()
Add the LLM Dockerfile
Next, we'll create a new Dockerfile for the LLM model. This Dockerfile will be used to build a new docker image for the script service. Create a new file llama.dockerfile
and add the content below:
touch llama.dockerfile
# The python version must match the version in .python-version
FROM ghcr.io/astral-sh/uv:python3.11-bookworm AS builder

ARG HANDLER
ENV HANDLER=${HANDLER}

# Set flags for common execution environment
ENV CMAKE_ARGS="-DLLAMA_NATIVE=OFF -DGGML_NATIVE=OFF -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_AVX512=OFF -DGGML_AVX512_VNNI=OFF -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_BF16=OFF"

ENV UV_COMPILE_BYTECODE=1 UV_LINK_MODE=copy PYTHONPATH=.
WORKDIR /app
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=uv.lock,target=uv.lock \
    --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
    uv sync --frozen --no-install-project --extra llm --no-dev --no-python-downloads
COPY . /app
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --extra llm --no-python-downloads

# Then, use a final image without uv
FROM python:3.11-bookworm

ARG HANDLER
ENV HANDLER=${HANDLER} PYTHONPATH=.

# Copy the application from the builder
COPY --from=builder --chown=app:app /app /app
WORKDIR /app

# Place executables in the environment at the front of the path
ENV PATH="/app/.venv/bin:$PATH"

# Run the service using the path to the handler
ENTRYPOINT python -u $HANDLER
We can also add a .dockerignore
file to prevent unnecessary files from being included in the docker image:
touch llama.dockerfile.dockerignore
.mypy_cache/
.nitric/
.venv/
.model/
nitric-spec.json
nitric.yaml
nitric.*.yaml
README.md
model.zip
Update our .env file
We'll add an environment variable to our .env file to account for the extra initialization time it will take to start our LLM job:
PYTHONPATH=.
+WORKER_TIMEOUT=300
Finally, we need to tell Nitric to use these files to create the script service. We can do this by updating the nitric.yaml
file:
name: ai-podcast
services:
  - match: services/*.py
    start: uv run watchmedo auto-restart -p *.py --no-restart-on-command-exit -R uv run $SERVICE_PATH
    runtime: python
batch-services:
-  - match: batches/*.py
+  - match: batches/podcast.py
+    start: uv run watchmedo auto-restart -p *.py --no-restart-on-command-exit -R uv run $SERVICE_PATH
+    runtime: torch
+  - match: batches/script.py
+    start: uv run watchmedo auto-restart -p *.py --no-restart-on-command-exit -R uv run $SERVICE_PATH
+    runtime: llama
runtimes:
  python:
    dockerfile: './python.dockerfile'
  torch:
    dockerfile: './torch.dockerfile'
+  llama:
+    dockerfile: './llama.dockerfile'
preview:
  - batch-services
Run or Deploy the Project
Now that we've added the LLM model and script generation service to our project, we can run or deploy the project to test it out. If you haven't already, you can run the project locally using the Nitric CLI:
nitric run
Or deploy the project to the cloud, like we did in Part 1:
nitric up
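Once the project is running, you can trigger an end-to-end generation by posting a prompt to the new route. A rough example using curl is shown below; the base URL is whatever the Nitric CLI prints for the main API (for example http://localhost:4001 when running locally, or the API Gateway URL shown by nitric up), and the title "daylight-saving-time" is just an illustration:

# Submit a prompt for a new podcast
curl -X POST "http://localhost:4001/podcast/daylight-saving-time" \
  -d "A 10 minute podcast about daylight saving time and why it is so controversial."

This submits the prompt to the podcast job, which writes the generated script to the scripts bucket and then submits it to the audio job, so the finished clip should eventually appear in the clips bucket from Part 1. You can also pass a preset query parameter to choose a different voice preset.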
Next Steps
In this guide we added an LLM agent to our project to automatically generate scripts for our podcasts from prompts. We also added a new API route to allow users to submit prompts for script generation.
Now that you've seen how to include and connect models using Nitric, you can experiment with different models and services to build your own AI workflows. Here are some ideas that might help get you started:
- Try using other models for script generation
- Perform a different action with the LLM instead of generating scripts
- Include other Nitric resources like databases or storage in your project
If you'd like to see a Part 3 of this guide, or have ideas on what we could add or improve, let us know over at the Nitric Discord.