entangling exuberance
Create a blog from a conversation

May 05, 2023 • 3 minutes

This blog post is still a work in progress. Please check back later for more updates.


I was talking to our marketing team at Sequoia. One of the things they do is look at podcasts or talks we record and then extract

But what else?

For that we use many SaaS tools to transcribe the recordings, then read through the transcripts and convert them to the appropriate format. This is a very time-consuming process, and I was wondering whether we could automate it. Come to think of it, if we break it down into smaller steps, it is pretty straightforward. Also, with the help of AI or LLMs, we can actually do a better job when we keep humans in the loop: the AI acts as a first pass, and humans then review and correct the output.

I know that there are a lot of tools out there to help, but I wanted to see if we could do it locally using open source tools.

Some of the basic tasks we can perform are:

The possibilities are endless, so let's get started with the following tasks:

And we are done! We have a blog post from a talk or podcast. Now on to the real work and some code.

The code for this is available at ajeygore/Video2Blog on GitHub.

Packages and Software

Installing various packages

To install PyTorch, please read the details in my Running PyTorch on M1/M2 Mac blog post.

But for now, the following code snippets should do the trick.

#Install dependencies
brew install ffmpeg
brew install imagemagick
brew install libsndfile

#Create conda environment
conda create -n video2blog python=3.8
conda activate video2blog

#Install packages
conda install hmmlearn
pip install pydub

#Install pytorch
conda install pytorch-nightly::pytorch torchvision torchaudio -c pytorch-

#Install pyannote (this version has the latest patches to run on ARM chipsets)
pip install -qq https://github.com/pyannote/pyannote-audio/archive/develop.zip

#And export DYLD_LIBRARY_PATH so that on ARM we pick up libraries from /opt/homebrew/lib
export DYLD_LIBRARY_PATH="/opt/homebrew/lib:$DYLD_LIBRARY_PATH"

Extract audio from video

ffmpeg -i video.mp4 -acodec pcm_s16le -ac 1 -ar 16000 out.wav
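If you would rather drive this step from Python, the same command can be wrapped in a small helper. This is only a sketch: the function names `build_ffmpeg_command` and `extract_audio` are made up for illustration and are not taken from the Video2Blog repository. It assumes `ffmpeg` is on your `PATH`.

```python
import subprocess
from typing import List


def build_ffmpeg_command(video_path: str, wav_path: str,
                         sample_rate: int = 16000) -> List[str]:
    """Build the ffmpeg argument list for a mono, 16-bit PCM WAV export."""
    return [
        "ffmpeg",
        "-i", video_path,         # input video file
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM audio
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to this rate in Hz
        wav_path,                 # output WAV file
    ]


def extract_audio(video_path: str, wav_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
```

Calling `extract_audio("video.mp4", "out.wav")` should then produce the same file as the command above. Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.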

Use Pyannote for diarization

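from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (needs a Hugging Face access token)
pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization',
                                use_auth_token="<Your Hugging Face Authorization Token>")

# 'uri' is just a label for the recording; 'audio' is the WAV file extracted earlier
audio_input_file = {'uri': 'blabal', 'audio': './out.wav'}
dz = pipeline(audio_input_file)

# The snippet is cut off after the `with` line; writing the text form of the
# result is an assumption about the missing body.
with open("diarization.txt", "w") as text_file:
    text_file.write(str(dz))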

Create individual speaker segments

The second step is to create individual speaker segments, so that we can pass them to the Whisper model.
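One way to approach this step, sketched with only the Python standard library, is shown here. It assumes that `diarization.txt` contains one line per speaker turn in the form `[ 00:00:00.497 -->  00:00:09.762] A SPEAKER_00`, which is how pyannote annotations usually print. Check your own file before relying on this. The helper names (`parse_diarization`, `merge_turns`, `write_segment`) are hypothetical and are not necessarily what the Video2Blog repository does.

```python
import re
import wave
from typing import List, Tuple

Turn = Tuple[int, int, str]  # (start_ms, end_ms, speaker label)

# Assumed line format: [ HH:MM:SS.mmm -->  HH:MM:SS.mmm] TRACK SPEAKER
LINE_RE = re.compile(
    r"\[\s*(\d+):(\d+):(\d+\.\d+)\s*-->\s*(\d+):(\d+):(\d+\.\d+)\]\s+\S+\s+(\S+)"
)


def to_ms(hours: str, minutes: str, seconds: str) -> int:
    """Convert an HH, MM, SS.mmm triple to whole milliseconds."""
    total = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return int(round(total * 1000))


def parse_diarization(text: str) -> List[Turn]:
    """Parse diarization text into turns, skipping lines that do not match."""
    turns = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            start = to_ms(*match.group(1, 2, 3))
            end = to_ms(*match.group(4, 5, 6))
            turns.append((start, end, match.group(7)))
    return turns


def merge_turns(turns: List[Turn]) -> List[Turn]:
    """Join consecutive turns by the same speaker into one longer segment."""
    merged: List[Turn] = []
    for start, end, speaker in turns:
        if merged and merged[-1][2] == speaker:
            merged[-1] = (merged[-1][0], end, speaker)
        else:
            merged.append((start, end, speaker))
    return merged


def write_segment(src_path: str, dst_path: str, start_ms: int, end_ms: int) -> None:
    """Copy the audio between start_ms and end_ms into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Clamp the start so a turn past the end of the file yields an empty clip.
        src.setpos(min(start_ms * rate // 1000, src.getnframes()))
        frames = src.readframes(max(0, (end_ms - start_ms) * rate // 1000))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames)
```

A typical use would be to read `diarization.txt`, call `merge_turns(parse_diarization(text))`, and then call `write_segment("out.wav", f"segment_{i}.wav", start, end)` for each merged turn. Merging consecutive turns by the same speaker gives the transcription model longer, more coherent clips to work with.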


I have used code from various places, but here is an incomplete list of the articles I went through.
