July 19, 2024

The new Descript: How we multiplied the app's speed and performance

We just made a generational improvement to Descript's speed and performance. Here's how we did it.
Marcello Bastea-Forte

For the better part of four years, the engineering team at Descript has been working on building a new media engine — one that would power both a web-based video editor and our desktop app.

The goal was to make Descript equally fast and powerful on both platforms. Most web-based video editors lack the functionality, power, and precision of desktop apps. Desktop video editors lack the convenience and accessibility of web-based tools. We wanted to create a media engine that would eliminate that tradeoff so our users could create video and podcasts in either platform without sacrificing power or capability.

Now, we’ve done it. We released the new web app to production in May; in most cases it’s faster and more robust than our desktop app. And last week, we released a preview version of our new desktop app, now running on the same media engine that powers the web app.

By the end of the year, we’ll have made a generational leap in Descript’s speed and performance. I’m super proud of the work this team did. If you’re interested in understanding what we did on the backend, read on. (I’m going to nerd out on the tech here; if you don’t care about that and just want to know what the rebuilt media engine does for you, watch our video below.)

Previously on Descript…

We’ll start with some historical context. Back in 2018, we built Descript as an Electron application, the same approach used by apps like Slack, Figma, and Notion.

Using Electron allowed us to power our user interface with well-known web technologies while retaining access to low-level native C/C++ libraries like FFmpeg/libav for processing media files and SQLite for storing local state. It served us well for six years.

Doubling down on web technology

To make Descript work in browsers, we had to remove those dependencies on native libraries and rebuild key pieces of our media engine from the ground up.

While we were already using many web technologies (e.g. WebAudio for audio playback and WebGL for compositing and realtime effects), recent enhancements in WebAssembly and the availability of the new WebCodecs API unlocked the remaining pieces.

The key user experience change you'll notice is that we don't download anything to your computer: everything streams from the cloud. But we also made Descript a lot faster and unlocked some new features you’ll see in the coming months.

New Descript

We did three things to make this happen:

1. Decode video in the browser

Before WebCodecs, if you wanted to build a web-based video engine, you had three main options for decoding video:

  1. HTML <video> element
  2. Software decoding
  3. Cloud rendering

HTML <video> element

The HTMLVideoElement is great: it demuxes, decodes, and renders video! It will even use the user’s graphics card for efficient playback.

What it doesn’t provide is precision, or the ability to accurately step through frames (though you can get some of this now through requestVideoFrameCallback).

This gets more challenging when you try to manage A/V sync across multiple layers concurrently. You also cannot create these elements in a Web Worker, which further limits performance.

Check out VideoContext for a great example of this approach.
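
To make the precision problem concrete: a <video> element is positioned by setting its currentTime in seconds, so any frame index has to be converted to a time first. One common workaround, shown here as an assumption rather than something this article describes, is to aim at the middle of the target frame so that rounding does not land on a neighboring frame. The sketch assumes a constant frame rate, and the function names are illustrative.

```typescript
// Time (in seconds) at the middle of a frame, assuming a constant frame rate.
// Aiming at the middle leaves half a frame of slack on either side.
function frameToSeekTime(frameIndex: number, fps: number): number {
  return (frameIndex + 0.5) / fps;
}

// Frame that is on screen at a given time, under the same assumption.
function timeToFrame(timeSeconds: number, fps: number): number {
  return Math.floor(timeSeconds * fps);
}
```

Assigning the result of frameToSeekTime to a video element's currentTime only requests a seek; whether the browser lands on exactly that frame is up to the browser, which is the limitation described above.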

Software decoding

The next thing to try is decoding video on the CPU. The best performance you can get in the browser is with WebAssembly, but that runs at roughly half the speed of native code. To do this efficiently you need multi-threaded WebAssembly, which adds a lot more complexity to your stack.

But even that will be much slower than hardware decoding, and gets slower and slower as you operate at higher resolutions.

This is great for handling a variety of formats (we use WebAssembly to process ProRes files), but is not ideal for efficient playback.

Cloud rendering

The last, and perhaps most expensive, approach is to do all your video decoding and compositing on a cloud GPU and stream the final result down to the user’s computer.

This is the same technique that game-streaming services such as NVIDIA's use to stream games to your computer, and it requires a dedicated computer in the cloud for every active user of your product.

One noteworthy limitation of this approach is that the file needs to be uploaded before the user can see it. No instant gratification!

Enter WebCodecs

As part of Interop 2023, web browsers started implementing a set of APIs called WebCodecs. WebCodecs provides a zero-copy interface between hardware video decoders/encoders and WebGL/WebGPU. It takes advantage of hardware support for the most common video capture codecs (H.264, HEVC, and VP8/VP9) available in computers from the last 5–10 years.

In Electron, we previously used native FFmpeg bindings via Beamcoder and copied the decoded frames into WebGL as a custom texture. This resulted in two extra memory copies that effectively limited us to 720p playback on the average computer, and made exporting 4K video much slower.

A single frame of decoded 4K video takes 33 MB of memory, and at 30 frames per second, that’s nearly 1 GB per second. With WebCodecs we can decode the frames, composite and process them, and encode to a final file all within the GPU, which is much faster. In practice, our users have been seeing two to three times faster 4K exports on average.
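
The arithmetic behind those figures can be checked with a short sketch. It assumes a 3840×2160 frame stored as 8-bit RGBA (4 bytes per pixel), which the article does not state explicitly; the function names are illustrative.

```typescript
// Bytes needed for one uncompressed frame.
// Assumption: 8-bit RGBA, i.e. 4 bytes per pixel (not stated in the article).
function frameBytes(width: number, height: number, bytesPerPixel: number = 4): number {
  return width * height * bytesPerPixel;
}

// Bytes moved per second when every frame is copied once at the given frame rate.
function copyBandwidth(bytesPerFrame: number, framesPerSecond: number): number {
  return bytesPerFrame * framesPerSecond;
}

const uhdFrame = frameBytes(3840, 2160);       // 33,177,600 bytes, about 33 MB
const perSecond = copyBandwidth(uhdFrame, 30); // 995,328,000 bytes, just under 1 GB
```

Each extra memory copy of the decoded frames adds that same amount of traffic again, which is why avoiding copies matters so much at 4K.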

2. Demux files and decode audio with WebAssembly

There are two things WebCodecs doesn’t give us: file demuxing (defined below) and cross-browser audio decoding. To solve this, we sponsored an open-source WebAssembly port of FFmpeg called libav.js.

Demuxing files

Decoding is only half the battle. To use WebCodecs, we first need to extract the raw compressed data from the files, a process known as demuxing. To handle a variety of container formats (e.g. MP4, MOV, WebM, WAV), we need code that understands each of them. libav.js gives us access to all the container demuxing functionality of FFmpeg.
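
As a rough illustration of what a demuxer produces, the sketch below models packets of still-compressed data, splits them by track, and converts one into the fields that a WebCodecs EncodedVideoChunk is constructed from (type, timestamp in microseconds, data). The Packet shape and function names are hypothetical and are not the libav.js API.

```typescript
// Hypothetical packet shape: what a demuxer hands back for each piece of
// compressed data. Field names are illustrative, not the libav.js API.
interface Packet {
  streamIndex: number; // which track (video, audio, ...) the packet belongs to
  pts: number;         // presentation timestamp, in time-base units
  keyframe: boolean;   // true if decodable without earlier packets
  data: Uint8Array;    // compressed bytes, still encoded
}

// Split interleaved packets into per-stream lists, preserving file order.
function splitByStream(packets: Packet[]): Record<number, Packet[]> {
  const streams: Record<number, Packet[]> = {};
  for (const p of packets) {
    const list = streams[p.streamIndex] || [];
    list.push(p);
    streams[p.streamIndex] = list;
  }
  return streams;
}

// Build the fields an EncodedVideoChunk is constructed from. The time base is
// given as a fraction (for example 1/90000 seconds per tick).
function toChunkInit(p: Packet, timeBaseNum: number, timeBaseDen: number) {
  return {
    type: p.keyframe ? ("key" as const) : ("delta" as const),
    timestamp: Math.round((p.pts * timeBaseNum * 1000000) / timeBaseDen),
    data: p.data,
  };
}
```

In a browser, the object returned by toChunkInit could be passed to the EncodedVideoChunk constructor and then to a configured VideoDecoder; that step is left out here so the sketch stays runnable outside a browser.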

Decoding audio

Not all browsers support WebCodecs for audio (yet), so we can only rely on it for video. Fortunately, processing audio on the CPU is much less taxing than video (you can fit almost 3 minutes of uncompressed audio in the same memory as one 4K video frame), so WebAssembly is a good match. libav.js can do this, too!
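
The "almost 3 minutes" comparison works out as follows under one plausible set of assumptions: a 3840×2160 RGBA frame at 4 bytes per pixel, and 48 kHz, 16-bit, stereo PCM audio. Neither format is specified in the article, and other formats shift the number slightly.

```typescript
// Assumptions (not stated in the article): the 4K frame is 3840x2160 RGBA at
// 4 bytes per pixel; the audio is 48 kHz, 16-bit (2 bytes), stereo PCM.
const FRAME_BYTES = 3840 * 2160 * 4; // 33,177,600 bytes

// Bytes per second of uncompressed PCM audio.
function pcmBytesPerSecond(sampleRate: number, channels: number, bytesPerSample: number): number {
  return sampleRate * channels * bytesPerSample;
}

// How many seconds of audio fit in a given number of bytes.
function secondsOfAudioIn(bytes: number, bytesPerSecond: number): number {
  return bytes / bytesPerSecond;
}

const audioRate = pcmBytesPerSecond(48000, 2, 2);              // 192,000 bytes per second
const audioSeconds = secondsOfAudioIn(FRAME_BYTES, audioRate); // 172.8 seconds, about 2.9 minutes
```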

3. Transcode files with a new Media Transform Server

Most importantly, we built what we're calling a Media Transform Server.

In order to support the widest range of computers and file types, we decided to transcode user files into a consistent format.

Previously, we did this on the user’s computer. It would spin up your fans and slow down your computer for minutes every time you added a file. Now we've moved this to the cloud so we can do it faster and at higher quality.

We use the user’s system capabilities and window size (including Retina/HiDPI configuration) to decide how to efficiently stream video while the user edits.
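
One way such a decision could work is sketched below: multiply the preview's height in CSS pixels by the device pixel ratio to get physical pixels, then pick the smallest available rendition that covers it. The list of heights and the function name are hypothetical, and this sketch ignores system capabilities; it is not a description of Descript's actual logic.

```typescript
// Hypothetical rendition ladder (heights in pixels); the article does not
// list the resolutions Descript actually streams.
const LADDER = [360, 540, 720, 1080, 2160];

// Pick the smallest rendition that covers the preview's physical pixel height.
// cssHeight: preview height in CSS pixels.
// devicePixelRatio: physical pixels per CSS pixel, e.g. 2 on a Retina display.
function pickRenditionHeight(
  cssHeight: number,
  devicePixelRatio: number,
  ladder: number[] = LADDER
): number {
  const needed = cssHeight * devicePixelRatio;
  const sorted = ladder.slice().sort((a, b) => a - b);
  for (const h of sorted) {
    if (h >= needed) return h;
  }
  return sorted[sorted.length - 1]; // preview is larger than every rendition
}
```

With this ladder, a 400-pixel-tall preview would get the 540 rendition on a standard display and the 1080 rendition on a 2x display.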

To do this quickly, we don’t process the entire file at once, but instead stream it in small chunks, on demand, as you move about your document or hit play!
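
A minimal sketch of on-demand chunking, assuming fixed-length chunks: map the playhead time to a chunk index, then request that chunk plus a little lookahead. The 2-second chunk length, the lookahead of two chunks, and the function names are all hypothetical; the article only says "small chunks".

```typescript
// Hypothetical chunk length; the article only says "small chunks".
const CHUNK_SECONDS = 2;

// Index of the chunk that contains a given time in the media file.
function chunkIndexAt(timeSeconds: number, chunkSeconds: number = CHUNK_SECONDS): number {
  return Math.floor(timeSeconds / chunkSeconds);
}

// Chunks to request when the playhead lands at `timeSeconds`: the current
// chunk plus up to `lookahead` more, clamped to the end of the file.
function chunksToRequest(
  timeSeconds: number,
  durationSeconds: number,
  lookahead: number = 2,
  chunkSeconds: number = CHUNK_SECONDS
): number[] {
  const last = Math.max(0, Math.ceil(durationSeconds / chunkSeconds) - 1);
  const first = Math.min(chunkIndexAt(timeSeconds, chunkSeconds), last);
  const end = Math.min(first + lookahead, last);
  const indices: number[] = [];
  for (let i = first; i <= end; i++) indices.push(i);
  return indices;
}
```

Because each request names only a file position and a chunk index, a server handling it does not need to remember anything about earlier requests.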

This architecture is stateless, allowing us to improve and optimize the transcoding quality and efficiency over time to handle a wider variety of files.

Unlocking instant AI effects

Now that we stream everything on demand, we can spin up cloud GPUs to add high-quality AI effects like Green Screen and Eye Contact in real time. With our old technology stack, applying these effects to large videos sometimes took hours, but now they can be applied in just a few seconds.

To make this magic happen, we built specialized AI servers that carefully keep computation in the GPU. For example, using the dedicated video encoding hardware available on GPUs, we sped up video encoding/decoding by 10x. This allows us to pull a frame out of a video stream, apply an AI effect, and re-encode the video stream, all faster than the video plays back. Other than a small latency right after seeking to a new spot in the video, the server makes sure frames are available to the client with the effect applied before they are needed, making it appear as though the effect is available instantaneously.
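
"Faster than the video plays back" can be expressed as a simple ratio: the time available per frame at the playback rate, divided by the time the server spends on each frame. The sketch below uses made-up timings purely for illustration; the article gives no per-frame numbers.

```typescript
// How many times faster than playback a pipeline runs, given the time it
// spends on each frame (decode + effect + re-encode). Values above 1 mean
// the pipeline stays ahead of the playhead.
function realTimeFactor(msPerFrame: number, playbackFps: number): number {
  const frameIntervalMs = 1000 / playbackFps;
  return frameIntervalMs / msPerFrame;
}

// Hypothetical timings: 20 ms per frame keeps up with 30 fps playback,
// 50 ms per frame does not.
const keepsUp = realTimeFactor(20, 30) > 1;     // true
const fallsBehind = realTimeFactor(50, 30) < 1; // true
```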

This feature is currently being developed and will be released in the coming months.

Marcello Bastea-Forte
Marcello is a lead engineer at Descript, where he is among the company’s earliest hires. He’s worked extensively on Descript’s media platform and more recently on bringing the app to the browser. Previously, he worked on conversational assistants at Apple and Samsung.