I was talking to our marketing team at Sequoia. One of the things they do is look at the podcasts or talks we record and then extract blog posts from them.
But what else?
For that, we use a number of SaaS tools to transcribe the recordings, then read through the transcripts and convert them into an appropriate format. This is a very time-consuming process, and I started wondering whether we could automate it. Come to think of it, if we break it down into smaller steps, it is pretty straightforward. With the help of AI or LLM models we can actually do a better job if we keep humans in the loop: the AI acts as a first pass, and humans then review and correct the output.
I know there are a lot of tools out there to help, but I wanted to see if we could do it locally using open-source tools.
Some of the basic tasks we can perform are:
1. Extract the audio track from the video recording
2. Run speaker diarization to figure out who speaks when
3. Split the audio into individual speaker segments
4. Transcribe each segment with the Whisper model
5. Clean up the transcript and format it into a blog post
And we are done! We have a blog post from a talk or podcast. Now on to the real work and some code.
The code for this is available at ajeygore/Video2Blog on GitHub.
To install PyTorch, please read the details in my Running PyTorch on M1/M2 Mac blog post.
For now, the following code snippets should do the trick.
#Install dependencies
brew install ffmpeg
brew install imagemagick
brew install libsndfile
#Create conda environment
conda create -n video2blog python=3.8
conda activate video2blog
#Install packages
conda install hmmlearn
pip install pydub
#Install pytorch
conda install pytorch torchvision torchaudio -c pytorch-nightly
#Install pyannote (this version has the latest patches to run on ARM chipsets)
pip install -qq https://github.com/pyannote/pyannote-audio/archive/develop.zip
#And export DYLD_LIBRARY_PATH so that on ARM we pick up libraries from /opt/homebrew/lib
export DYLD_LIBRARY_PATH="/opt/homebrew/lib:$DYLD_LIBRARY_PATH"
The first step is to extract a mono, 16 kHz WAV audio track from the video, which is the format the diarization and transcription steps work with:
ffmpeg -i video.mp4 -acodec pcm_s16le -ac 1 -ar 16000 out.wav
With the audio extracted, we can run speaker diarization using pyannote to find out who speaks when:
from pyannote.audio import Pipeline

# Load the pretrained speaker diarization pipeline from Hugging Face
pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization',
                                    use_auth_token="<Your Hugging Face Authorization Token>")

# Run the pipeline on the extracted audio
audio_input_file = {'uri': 'blabal', 'audio': './out.wav'}
dz = pipeline(audio_input_file)

# Save the diarization output (speaker turns with timestamps) for later use
with open("diarization.txt", "w") as text_file:
    text_file.write(str(dz))
The second step is to create individual speaker segments so that we can pass them to the Whisper model.
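The exact segmentation code lives in the repository linked above; as a rough sketch, the diarization result can be iterated and each speaker turn exported with pydub (the file naming below is just my own convention, not what the repo uses):
from pydub import AudioSegment

# Load the full audio track we extracted earlier with ffmpeg
audio = AudioSegment.from_wav("./out.wav")

# Iterate over speaker turns in the diarization result (dz from the previous step)
# and export each turn as its own WAV file
for i, (turn, _, speaker) in enumerate(dz.itertracks(yield_label=True)):
    start_ms = int(turn.start * 1000)  # pydub works in milliseconds
    end_ms = int(turn.end * 1000)
    segment = audio[start_ms:end_ms]
    segment.export(f"segment_{i:03d}_{speaker}.wav", format="wav")
Each of these segment files can then be fed to Whisper for transcription, with the speaker label carried along from the file name.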
I have used code from various places, but here is an incomplete list of the articles I went through.