My dad got his first video camera the day I was born nearly three decades ago, which also happened to be Father’s Day. “Say hello to the camera!” are the first words he caught on tape, as he pointed it at a red, puffy baby (me) in a hospital bassinet. The clips got more embarrassing from there, as he continued to film through many a diaper change, temper tantrum, and—worst of all—puberty.
Most of those potential blackmail tokens sat trapped on miniDV tapes or scattered across SD cards until two years ago when my dad uploaded them all to Google Drive. Theoretically, since they were now stored in the cloud, my family and I could watch them whenever we wanted. But with more than 456 hours of footage, watching it all would have been a herculean effort. You can only watch old family friends open Christmas gifts so many times. So this year, for Father’s Day, I decided to build my dad an AI-powered searchable archive of our family videos.
If you’ve ever used Google Photos, you’ve seen the power of using AI to search and organize images and videos. The app uses machine learning to identify people and pets, as well as objects and text in images. So, if I search “pool” in the Google Photos app, it’ll show me all the pictures and videos I ever took of pools.
The Photos app is a great way to index photos and videos in a hurry, but as a developer (just like my dad), I wanted to get my hands dirty and build my own custom video archive. In addition to doing some very custom file processing, I wanted to add the ability to search my videos by things people said (the transcripts) rather than just what’s shown on camera, a feature the Photos app doesn’t currently support. This way, I could search using my family’s lingo (“skutch” for someone who’s being a pain) and for phrases like “first word” or “first steps” or “whoops.” Plus, my dad is a privacy nut who’d never give his fingerprint for fingerprint unlock, and I wanted to make sure I understood where all of our sensitive family video data was being stored and have concrete privacy guarantees. So, I built my archive on Google Cloud. Here’s what it looked like:
Building a searchable, indexed video archive is fun for personal projects, but it’s useful in the business world, too. Companies can use this technology to automatically generate metadata for large video datasets, caption and translate clips, or quickly search brand and creative assets.
So how do you build an AI-powered video archive? Let’s take a look.
How to build an AI-powered video archive
The main workhorse of this project was the Video Intelligence API, a tool that can:
Transcribe audio (i.e., “automatic subtitles”)
Recognize objects (e.g., plane, beach, snow, bicycle, cake, wedding)
Extract text (e.g., on street signs, T-shirts, banners, and posters)
Detect shot changes
Flag explicit content
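To make the feature list above concrete, here is a minimal sketch of the JSON body you might send to the API's videos:annotate REST method to request all five capabilities in one pass. The feature names come from the API's Feature enum; the object names (example.mp4, example.json) are hypothetical, and this is an illustration rather than the exact code behind this project.

```python
import json

# Feature names from the Video Intelligence REST API's Feature enum,
# matched to the capabilities listed above.
FEATURES = [
    "SPEECH_TRANSCRIPTION",        # transcribe audio
    "LABEL_DETECTION",             # recognize objects and scenes
    "TEXT_DETECTION",              # extract on-screen text
    "SHOT_CHANGE_DETECTION",       # detect shot changes
    "EXPLICIT_CONTENT_DETECTION",  # flag explicit content
]


def build_annotate_request(input_uri, output_uri, language_code="en-US"):
    """Build the JSON body for a videos:annotate request."""
    return {
        "inputUri": input_uri,
        "outputUri": output_uri,  # where the API writes its JSON results
        "features": FEATURES,
        "videoContext": {
            # Speech transcription requires a language code.
            "speechTranscriptionConfig": {
                "languageCode": language_code,
                "enableAutomaticPunctuation": True,
            }
        },
    }


request = build_annotate_request(
    "gs://input_videos/example.mp4",  # hypothetical object name
    "gs://video_json/example.json",   # hypothetical object name
)
print(json.dumps(request, indent=2))
```

Sending this body requires an authenticated POST to the API, which the sketch leaves out.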
Making videos searchable
I used the Video Intelligence API in a couple of different ways. First, and most importantly, I used it to pull out features I could later use for search. For example, the audio transcription feature allowed me to find the video of my first steps by pulling out this cute quote:
“All right, this is one of Dale’s First Steps. Even we have it on camera. Let’s see. What are you playing with Dale?” (This is the word-for-word transcription output from the Video Intelligence API.)
The object recognition feature, powered by computer vision, recognized entities like “bridal shower,” “wedding,” “bat and ball games,” “baby,” and “performance,” which were great sentimental searchable attributes.
And the text extraction feature let me search videos by text featured on the screen, so I could search for writing on signs, posters, t-shirts, and even birthday cakes. That’s how I was able to find both my brother’s and my first birthdays:
Splitting long videos and extracting the dates
One of the most challenging parts of this project was dealing with all the different file types from all the different cameras my dad has owned over the years. His digital camera produced mostly small video clips with the date stored in the filename (e.g., clip-2007-12-31 22;44;51.mp4). But before 2001, he used a camera that wrote video to miniDV tapes. When he digitized it, all the clips got slammed together into one big, two-hour file per tape. The clips contained no information about when they were filmed, unless my dad chose to manually hit a button that showed a date marker on the screen:
Happily, the Video Intelligence API was able to solve both of these problems. Automatic shot change detection recognized where one video ended and another began, even though they were mashed up into one long MOV file, so I was able to automatically split the long clips into smaller chunks. The API also extracted the dates shown on the screen, so I could match videos with timestamps. Since these long videos amounted to about 18 hours of film, I saved myself some 18 hours (minus development time) of manual labor.
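Both steps can be sketched in a few lines of Python. The first helper reads the date out of a filename shaped like the example above. The second turns a list of shot boundaries (start and end times in seconds) into ffmpeg commands that cut the tape into clips. Using ffmpeg, the clip_NNN.mov naming scheme, the tape1.mov filename, and the one-second minimum are all assumptions for illustration, not necessarily how this project did the splitting.

```python
import re
from datetime import datetime

# Matches names like "clip-2007-12-31 22;44;51.mp4".
FILENAME_DATE = re.compile(r"(\d{4}-\d{2}-\d{2}) (\d{2});(\d{2});(\d{2})")


def date_from_filename(filename):
    """Return the datetime embedded in the filename, or None if absent."""
    match = FILENAME_DATE.search(filename)
    if match is None:
        return None
    day, hour, minute, second = match.groups()
    return datetime.strptime(
        f"{day} {hour}:{minute}:{second}", "%Y-%m-%d %H:%M:%S"
    )


def split_commands(source, shots, min_seconds=1.0):
    """Build one ffmpeg command per shot; shots are (start, end) in seconds."""
    commands = []
    for index, (start, end) in enumerate(shots):
        if end - start < min_seconds:
            continue  # skip shots too short to be a real clip
        output = f"clip_{index:03d}.mov"  # hypothetical naming scheme
        commands.append([
            "ffmpeg", "-i", source,
            "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
            "-c", "copy",  # copy streams without re-encoding
            output,
        ])
    return commands


print(date_from_filename("clip-2007-12-31 22;44;51.mp4"))
for command in split_commands("tape1.mov", [(0.0, 12.5), (12.5, 12.9), (12.9, 40.0)]):
    print(" ".join(command))
```

Stream copying is fast but cuts on keyframes, so clip edges can be slightly off; re-encoding instead trades speed for frame-accurate cuts.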
Keeping big data in the cloud
One of the challenges of dealing with videos is that they’re beefy data files, and doing any development locally, on your personal computer, is slow and cumbersome. It’s best to keep all data handling and processing in the cloud. So, I started off by transferring all the clips my dad stored in Google Drive into a Cloud Storage bucket. To do this efficiently, keeping all data within Google’s network, I followed this tutorial, which uses a Colab notebook to do the transfer.
My goal was to upload all video files to the Google Cloud, analyze them with the Video Intelligence API, and write the resulting metadata to a source I could later query and search from my app.
For this, I used a technique I use all the time to build machine learning pipelines: upload data to a Cloud Storage bucket, use a Cloud Function to kick off analysis, and write the results to a database (like Firestore). Here’s what that architecture looks like for this project:
If you’ve never used these tools before, Cloud Storage provides a place to store all kinds of files, like movies, images, text files, PDFs—really anything. Cloud Functions are a “serverless” way of running code in the cloud: Rather than use an entire virtual machine or container to run your code, you upload a single function or set of functions (in Python or Go or Node.js or Java) which runs in response to an event—an HTTP request, a Pub/Sub event, or when a file is uploaded to Cloud Storage.
Here, I uploaded a video to a Cloud Storage bucket (“gs://input_videos”) which triggered a Cloud Function that called the Video Intelligence API to analyze the uploaded video. Because this analysis can take a while, it runs in the background and finishes by writing data to a JSON file in a second Cloud Storage bucket (“gs://video_json”). As soon as this JSON file is written to storage, a second Cloud Function is triggered, which parses the JSON data and writes it to a database—in this case, Firestore. If you want an even more in-depth review of this design and the code that goes with it, take a look at this post.
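The second function's job is mostly reshaping: it takes the API's nested JSON and boils it down to a flat record that is easy to store and search. Here is a rough sketch of that step as a plain Python function. The key names (annotation_results, speech_transcriptions, and so on) are my assumption about the shape of the output file, and the record fields are made up for illustration, so check them against a real output file before relying on this.

```python
def flatten_annotations(video_uri, payload):
    """Collapse one annotation payload into a flat, searchable record."""
    transcripts, labels, screen_text = [], [], []
    for result in payload.get("annotation_results", []):
        # Keep the top-ranked alternative for each stretch of speech.
        for speech in result.get("speech_transcriptions", []):
            alternatives = speech.get("alternatives", [])
            if alternatives and alternatives[0].get("transcript"):
                transcripts.append(alternatives[0]["transcript"].strip())
        # Labels describe what appears on camera ("wedding", "baby", ...).
        for label in result.get("segment_label_annotations", []):
            description = label.get("entity", {}).get("description")
            if description:
                labels.append(description)
        # Text detected on screen (signs, posters, birthday cakes).
        for text in result.get("text_annotations", []):
            if text.get("text"):
                screen_text.append(text["text"])
    return {
        "video": video_uri,
        "transcript": " ".join(transcripts),
        "labels": labels,
        "screen_text": screen_text,
    }


# A tiny hand-written payload standing in for a real output file.
sample = {
    "annotation_results": [{
        "speech_transcriptions": [
            {"alternatives": [{"transcript": "Say hello to the camera! "}]},
            {"alternatives": []},
        ],
        "segment_label_annotations": [{"entity": {"description": "baby"}}],
        "text_annotations": [{"text": "Happy Birthday"}],
    }]
}
record = flatten_annotations("gs://input_videos/example.mp4", sample)
print(record)
```

In the deployed version, a function like this would run inside the Cloud Function triggered by the JSON upload, and the returned dictionary would be written to Firestore.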
Firestore is a real-time, NoSQL database designed with app and web developers in mind. As soon as I wrote the video metadata to Firestore, I could access that data in my app quickly and easily.
Simple search with Algolia
With all this information extracted from my videos (transcriptions, screen text, object labels), I needed a good way to search through it all. I needed something that could take a search word or phrase, even if the user made a typo (e.g., “birthdy party”), and search through all my metadata to return the best matches. I considered using Elasticsearch, an open-source search and analytics engine that’s often used for tasks like this, but decided it looked a bit heavy-handed for my use case. I didn’t want to create a whole search cluster just to search through videos.
Instead, I turned to the Search API from a company called Algolia. It’s a neat tool that lets you upload JSON data and provides a slick interface to easily search through it all, handling things like typo correction and sorting. It was the perfect serverless search solution to complement the rest of my serverless app.
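To give a feel for what that involves, here is a small sketch. The first function shapes a video's metadata as a JSON record with an objectID, the attribute Algolia uses to identify each record. The rest is a toy, local stand-in for typo-tolerant search built on Python's difflib; it is not Algolia's algorithm, only an illustration of the idea. The field names and record IDs are hypothetical.

```python
import difflib


def to_search_record(record_id, metadata):
    """Shape one video's metadata as a JSON record keyed by objectID."""
    return {
        "objectID": record_id,
        "transcript": metadata.get("transcript", ""),
        "labels": metadata.get("labels", []),
        "screen_text": metadata.get("screen_text", []),
    }


def searchable_words(record):
    """Collect every lowercase word from the record's text fields."""
    text = " ".join(
        [record["transcript"], *record["labels"], *record["screen_text"]]
    )
    words = {word.strip(".,!?\"'").lower() for word in text.split()}
    words.discard("")
    return words


def toy_search(query, records, cutoff=0.8):
    """Return objectIDs where every query word fuzzily matches a record word."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = []
    for record in records:
        words = list(searchable_words(record))
        if all(
            difflib.get_close_matches(term, words, n=1, cutoff=cutoff)
            for term in terms
        ):
            hits.append(record["objectID"])
    return hits


records = [
    to_search_record("video-1", {
        "transcript": "Happy birthday Dale!",
        "labels": ["party", "cake"],
    }),
    to_search_record("video-2", {
        "labels": ["wedding"],
        "screen_text": ["Just Married"],
    }),
]
print(toy_search("birthdy party", records))
```

With a hosted service, the records would be uploaded once and the typo handling and ranking would happen on the service's side instead of in a loop like this.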
Putting it all together
And that’s pretty much it! After analyzing all the videos and making them searchable, all I had to do was build a nice UI. I decided to use Flutter, but you could build a frontend using Angular or React, or even a mobile app. Here’s what mine looks like:
Finding lost memories
What I hoped more than anything for this project was that it would let my dad search for memories that he knew he’d once recorded but that were almost impossible to find. So when I gifted it to him a few days before Father’s Day, that’s exactly what I asked: Dad, is there a memory of us you want to find?
He remembered the time he surprised me with a Barbie bicycle for my fourth birthday. We searched “bicycle” and the clip appeared. I barely remembered that day and had never seen the video before, but from the looks of it, I was literally agape. “I love it!” I yelled as I pedaled around the living room. It might be the best birthday/Father’s Day we have on record.
Want to see for yourself? Take a look