Artificial Intelligence Meets Public Broadcasting’s Archives

Since the 1940s, America’s public television and radio stations have produced a remarkable collection of programs, creating what is now a sizable archive.

Six years ago, the Corporation for Public Broadcasting funded a project to digitize and preserve 40,000 hours of its programming. Boston’s WGBH-TV and the Library of Congress were awarded stewardship of this digitized material, collectively known as the American Archive of Public Broadcasting.

Today, the AAPB includes more than 100,000 items and is growing. In many cases, however, its contents have not been thoroughly cataloged. Some shows lack transcripts. Some containers of audiotapes and films don’t indicate what’s inside. “Having a librarian sit down and catalog every single item would have been insane,” says Karen Cariani, the David O. Ives Executive Director of the WGBH Media Library and Archives. “We needed something faster.”

Enter James Pustejovsky, the TJX Feldberg Professor of Computer Science. After visiting WGBH’s vault last spring, he and doctoral students Kelley Lynch, MS’17; Keigh Rim; and Ken Lai volunteered to develop AI programs to automate the AAPB indexing process.

“These materials are part of our national heritage,” says Pustejovsky. “It’s critical they’re widely available.”

Staff rarely had the time to properly tag or identify shows when they were made, Cariani says. “The first thing on a producer or writer’s mind is getting the next show on the air, not archiving the one you just completed.”

For their part, Pustejovsky and his team are taking a slow, steady approach to indexing the materials. Many shows begin with a film slate, or clapboard, showing text — often handwritten — that states the broadcast date, the producer and the show title. To extract this text, Pustejovsky’s team uses optical character recognition. Where a program transcript exists, they use timestamps to align the transcript’s words with the spoken dialogue down to the millisecond.
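The alignment step can be pictured as pairing each transcript word with a timestamp produced by a forced aligner. This is a minimal sketch under that assumption; the function name and data are hypothetical, not the team's actual code:

```python
# Minimal sketch of transcript-to-audio alignment. It assumes a forced
# aligner has already produced one start time (in milliseconds) per
# transcript word. All names and data here are hypothetical.

def align_transcript(transcript_words, word_timestamps_ms):
    """Pair each transcript word with the millisecond at which it is spoken."""
    if len(transcript_words) != len(word_timestamps_ms):
        raise ValueError("transcript and timestamps must be the same length")
    return [
        {"word": word, "start_ms": start_ms}
        for word, start_ms in zip(transcript_words, word_timestamps_ms)
    ]

words = ["Good", "evening", "from", "Boston"]
times = [0, 420, 910, 1180]  # hypothetical aligner output, in ms
aligned = align_transcript(words, times)
```

The result is a searchable index: a researcher can jump straight to the millisecond where a given word is spoken.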

Facial recognition is trickier. Algorithms developed by Pustejovsky’s team will look for instances where an onscreen name identifies an interviewee. The next time that person appears, even without a caption, the computer will recognize the face. This information will become part of the transcript, noting each instance, and the exact time, at which the individual speaks.
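The linking logic amounts to remembering which face a lower-third caption named, then propagating that name to later, uncaptioned appearances of the same face. Here is a sketch of that bookkeeping; the face IDs stand in for the output of a face-recognition model, and all names and data are hypothetical:

```python
# Sketch of propagating an on-screen name to later, unlabeled
# appearances of the same face. A face-recognition model is assumed to
# have already assigned a stable ID to each detected face.

def label_speakers(detections):
    """detections: list of (time_ms, face_id, onscreen_name_or_None).
    Returns (time_ms, name) pairs, reusing the first on-screen name seen
    for each face to label its later, uncaptioned appearances."""
    names = {}  # face_id -> name learned from an on-screen caption
    labeled = []
    for time_ms, face_id, onscreen_name in detections:
        if onscreen_name is not None:
            names[face_id] = onscreen_name
        labeled.append((time_ms, names.get(face_id, "unknown")))
    return labeled

detections = [
    (1000, "face-7", "Julia Child"),  # name shown on screen
    (5000, "face-9", None),           # never captioned
    (9000, "face-7", None),           # same face, no caption this time
]
labeled = label_speakers(detections)
```

Once labeled, these entries can be merged into the time-aligned transcript so a search for a person's name surfaces every moment they appear.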

Pustejovsky also wants to develop algorithms that can determine whether a segment in a show was shot indoors or outdoors, in a forest or on a mountaintop, inside a house or an office. Here, the computer must search for clues in the transcript or identify visual elements in the video, such as a tree or water.
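One simple way to mine the transcript for such clues is keyword matching against lists of setting-specific words. This sketch illustrates the idea only; the cue lists are invented for illustration and are not the project's actual vocabulary:

```python
# Sketch of guessing a segment's setting from transcript clues.
# The keyword lists are illustrative, not from the actual project;
# a real system would also use visual elements detected in the video.

SETTING_CUES = {
    "forest": {"tree", "trees", "forest", "woods"},
    "mountaintop": {"mountain", "summit", "peak"},
    "office": {"desk", "meeting", "office"},
    "house": {"kitchen", "bedroom", "porch"},
}

def guess_setting(transcript):
    """Return the setting whose cue words overlap the transcript most."""
    words = set(transcript.lower().split())
    scores = {setting: len(words & cues) for setting, cues in SETTING_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

A production system would combine these textual clues with visual detections, such as a tree or water, as the article describes.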

Perhaps the most challenging task will be generating summaries of what happens onscreen, says Pustejovsky. For now, the descriptions will be basic: for example, “storm in Louisiana.” Over time, with enough refinements, the algorithms may be able to produce more complete summaries, such as “The banks of a river in Louisiana are overflowing during a hurricane.”
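The progression from basic to richer descriptions can be thought of as filling increasingly detailed templates with labels detected by the models. This is a purely illustrative sketch; the templates and labels are hypothetical stand-ins for model output:

```python
# Sketch of template-based summary generation from detected labels.
# The labels (event, location, etc.) are hypothetical stand-ins for
# what vision and language models might detect in a segment.

def basic_summary(event, location):
    """Early-stage output: a terse description like 'storm in Louisiana'."""
    return f"{event} in {location}"

def fuller_summary(subject, action, location, context):
    """A later, more refined template drawing on more detected elements."""
    return f"{subject} {action} in {location} during {context}."

terse = basic_summary("storm", "Louisiana")
```

As the article notes, moving from the terse form to a full sentence is the hard part; the refinement lies in detecting enough reliable elements to fill the richer template.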

Though some of the tools for doing this work already exist, they’re proprietary and expensive. Pustejovsky and his lab are creating open-source software that will be available free of charge to libraries, TV stations and radio stations.

You can currently search the AAPB materials online — for instance, by series title or general topic. Pustejovsky’s work is expected to generate much more information about series and individual episodes that researchers and the general public can access during their searches.

WGBH officials hope that, down the road, Pustejovsky’s team will also help the station index its own archival materials, including iconic programs like “American Experience,” “Frontline” and “Nova.”

Cariani says it may be as much as a decade before all the work is completed. “There’s a huge amount of really great content that’s been produced and created over the years that we need to preserve and make accessible to the American people,” she says. “We have to start somewhere.”

— Lawrence Goodman