aTrain – How to transcribe and diarize audio in seconds


Transcribing interviews, or “decoupage” in journalistic jargon, is one of the most tedious tasks of the profession. It means hours and hours of going back and forth, listening to excerpts of speech, writing, reviewing and correcting, only to use a fraction of it all in the end. It is an inglorious task, but technology makes it simpler.

Not THAT A-Train (Credit: Vought International)

With Whisper, transcribing audio has become much easier, but what about when we are transcribing a debate, a conversation, a movie scene? This is where diarization comes into play: separating out each person’s speech. Done by hand, it is hell, even more tedious than just transcribing audio.
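To make the idea concrete, here is a minimal Python sketch of one simple way to combine a transcript with speaker turns: give each transcribed segment the speaker whose turn overlaps it the most. The sample data, the `assign_speakers` function and the tuple layouts are all made up for illustration; this is not necessarily how aTrain or Pyannote do it internally.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each (start, end, text) segment with the best-overlapping speaker.

    turns is a list of (start, end, speaker) tuples. Segments that overlap
    no turn at all are labeled "UNKNOWN".
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled


# Hypothetical data: what a transcriber and a diarizer might each produce.
segments = [
    (0.0, 4.0, "Thanks for joining us today."),
    (4.2, 9.0, "Happy to be here."),
]
turns = [
    (0.0, 4.1, "SPEAKER_01"),
    (4.1, 9.5, "SPEAKER_02"),
]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Overlapping speech is exactly where a scheme like this struggles, since one segment can legitimately belong to two people at once.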

There are tools for this. One of the most popular is Pyannote, a neural network in Python, but it is notoriously complicated to install and configure: it is basically all command line, the thing that makes people under 30 break out in hives just thinking about it.

To use Pyannote you need Git, Python, a development environment, Conda and a bunch of other names that sound weird to the average person. You can’t demand that normal people be familiar with these tools; they want solutions. Fortunately, solutions exist.

This time it’s thanks to the people at the University of Graz, in Austria. They unified everything into one easy-to-use tool, aTrain. And yes, it’s open source. aTrain is a complete tool for transcribing, diarizing and subtitling audio and video.

Even better: it runs either on a video card (Nvidia) with at least 6GB of VRAM, or directly on the CPU, where the transcription just takes (much) longer. Still, it is orders of magnitude faster than doing it manually.

It understands English, Portuguese and a bunch of other languages.

Where to Download aTrain?

You can install it directly from the Microsoft Store or download the installer from the University of Graz, but CALM DOWN. Before clicking, a warning: it’s a 10GB download, so if your bandwidth isn’t great, get comfortable, because you’re in for a long wait. And make sure you have space on your C: drive; at this point aTrain is NOT friendly at all.

It has no option to configure ANYTHING: it will install itself on drive C:, use its own document directories, and you don’t get a say.
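If you would rather check the free space before committing to the download, a few lines of Python will do it. This is a minimal sketch using the standard library; the 10GB threshold is just the download size quoted above (the installed size may differ), and the `free_gigabytes` and `has_room` helpers are made-up names.

```python
import os
import shutil

REQUIRED_GB = 10  # download size quoted above; the installed size may differ


def free_gigabytes(path):
    """Return the free space, in GB, on the drive that contains path."""
    return shutil.disk_usage(path).free / 1024**3


def has_room(path, required_gb=REQUIRED_GB):
    """True if the drive holding path has at least required_gb free."""
    return free_gigabytes(path) >= required_gb


if __name__ == "__main__":
    # aTrain installs to C:, so that is the drive to check on Windows.
    # The fallback only exists so the script also runs on other systems.
    drive = "C:\\" if os.name == "nt" else "/"
    print(f"Free on {drive}: {free_gigabytes(drive):.1f} GB")
    print("Enough room" if has_room(drive) else "Free up some space first")
```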

Once everything is downloaded and installed, run aTrain (great idea – Hughie). It will open a screen like this. By clicking on Choose File we… choose the file to be transcribed. As an example, I’m using a random interview downloaded from YouTube.

The next option is where we select the AI model to do the transcription. The smaller the model, the faster it runs, but we lose accuracy. The medium one is sufficient for most uses but, to be honest, the thing is so fast that the large-v2 model should be your default.

Model selection screen. In a good way, without a sofa (Credit: Meio Bit)

Language selection is optional, but I recommend it: it prevents aTrain from getting confused.

Here’s the big jump: Multispeaker is where you go from Whisper’s simple transcription to diarization, separated by participant. Select Multispeaker. It will ask for the number of participants. It’s not mandatory, but it helps a lot. In this case, as it is a simple interview, I selected two.

Selecting the participants (Credit: Meio Bit)

Once that’s done, just press START and go get a coffee.

Tip: In Advanced Settings there is an option for Compute Type, int8 or float16. It is the internal numerical representation used by the GPU. If your card is a decent Nvidia, select float16; the speed gain is absurd.
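For the curious, here is roughly what those two names mean. This is a simplified Python illustration, not a look inside aTrain: it stores a few made-up numbers as 16-bit floats (2 bytes each) and as 8-bit integers (1 byte each) and prints what comes back. The shared-scale scheme in `to_int8` is an assumption chosen for simplicity, and the sketch shows only precision, not speed.

```python
import struct


def to_float16(value):
    """Round-trip a number through a 16-bit float (2 bytes per value)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_int8(values):
    """Quantize numbers to 8-bit integers (1 byte per value) with one shared scale.

    Returns the integers and the scale needed to map them back.
    """
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale


weights = [0.1234, -0.5678, 0.9012]  # made-up numbers, not real model weights

as_float16 = [to_float16(w) for w in weights]
quantized, scale = to_int8(weights)
as_int8 = [q * scale for q in quantized]

for original, f16, i8 in zip(weights, as_float16, as_int8):
    print(f"{original:+.4f}  float16: {f16:+.4f}  int8: {i8:+.4f}")
```

Both formats lose a little precision compared with the original numbers; int8 takes half the memory of float16 at the cost of a coarser grid of values.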

If you don’t have this option enabled… patience (Credit: Meio Bit)

The finishing screen gives the option to open the folder with the transcripts. Typically they are at:

C:\Users\<username>\Documents\aTrain\transcriptions

Not friendly, I know, but worth the effort.

Final files (Credit: Meio Bit)

The metadata.txt file contains information about the length of the transcribed audio, language and other data. It’s great for comparing results and settings.

transcription.json is the main file with all the transcribed information. A good programmer can have fun with it, but the rest of us don’t need to worry about this file.

transcription.srt is the transcript in subtitle format, ready to be read in VLC or practically any other video player.

transcription.txt is the plain transcript. If the Multispeaker option was selected, each participant is identified. The default labels are SPEAKER_01, SPEAKER_02, SPEAKER_03… and you, of course, will use a simple replacement command to change them to the speakers’ names.

transcription_maxqda.txt contains the transcript with timestamp information, showing where in the timeline each sentence was said. This is essential for finding a passage in the recording when you need to pull that part of the video for use elsewhere.

transcription_timestamps.txt breaks the transcript up line by line, each marked with its timestamp. It makes it easier to pinpoint the moment, but is more boring to read.

The resulting files (Credit: Meio Bit)
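Swapping the SPEAKER_01-style labels in transcription.txt for real names can be done in any text editor, but here is a minimal Python sketch for those who prefer a script. The names, the sample text and the UTF-8 encoding are assumptions for illustration; the exact layout of aTrain's file may differ, which is why the code only looks for the labels themselves.

```python
import re
from pathlib import Path


def rename_speakers(text, names):
    """Replace diarization labels such as SPEAKER_01 with real names.

    names maps label -> name. Labels missing from the mapping are left alone.
    """
    pattern = re.compile(r"SPEAKER_\d+")
    return pattern.sub(lambda m: names.get(m.group(0), m.group(0)), text)


def rename_in_file(source, target, names):
    """Read source, rename the speakers, and write the result to target."""
    text = Path(source).read_text(encoding="utf-8")
    Path(target).write_text(rename_speakers(text, names), encoding="utf-8")


# Hypothetical names and sample text for illustration only.
names = {"SPEAKER_01": "Interviewer", "SPEAKER_02": "Guest"}
sample = "SPEAKER_01\nWhere did it all start?\nSPEAKER_02\nIn a garage, of course."
print(rename_speakers(sample, names))
```

To process a real file, something like `rename_in_file("transcription.txt", "transcription_named.txt", names)` writes a renamed copy and leaves the original untouched.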

Speed

aTrain transcribed and diarized a 13-minute audio file in 2 minutes flat. It’s no Insanely Fast Whisper, but it’s quite reasonable.
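To put the 13-minute, 2-minute figure in perspective, here is the arithmetic as a tiny Python sketch. The one-hour estimate assumes speed scales linearly with audio length, which is a guess on my part and not something measured here.

```python
def speedup(audio_minutes, processing_minutes):
    """How many times faster than real time the transcription ran."""
    return audio_minutes / processing_minutes


ratio = speedup(13, 2)
print(f"{ratio:.1f}x real time")  # 13 / 2 = 6.5

# Rough extrapolation, assuming speed scales linearly with audio length.
one_hour_estimate = 60 / ratio
print(f"A 60-minute interview: about {one_hour_estimate:.0f} minutes")
```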

aTrain is fast but not perfect (Credit: Amazon Prime Video)

Problems with aTrain

Although it is infinitely better than any intern, aTrain is not perfect. It sometimes gets confused when people speak quickly, and when people speak at the same time, it can’t transcribe the audio.

It is not advisable to run aTrain and publish the result immediately, without review, but that applies to any minimally serious journalistic work.
