Artificial intelligence meets public broadcasting’s archives
Computational linguist James Pustejovsky and his lab are using AI to index and catalog some of the most famous programs in public television and radio history.
Since the 1940s, America’s public television and radio stations have produced a remarkable collection of programs. Their prodigious output has created an equally sizable archive.
Six years ago, the Corporation for Public Broadcasting funded the digitization of 40,000 hours of public broadcast programming and awarded WGBH and the Library of Congress stewardship of this material, collectively known as the American Archive of Public Broadcasting (AAPB). Today, the AAPB has more than 100,000 items and is growing annually.
In many cases though, the programs in the AAPB are not thoroughly
But “having a librarian sit down and catalog every single item would have been insane,” says Karen Cariani, the David O. Ives Executive Director of WGBH Media Library and Archives. “We needed something faster.”
Enter James Pustejovsky, the TJX Feldberg Professor of Computer Science, and his lab. After visiting WGBH’s vault last spring, he and his students, Kelley Lynch,
“These materials are part of our national heritage,” says Pustejovsky. “It’s critical they’re widely available.”
Cariani says staff rarely had the time to properly tag or identify the programs when they were made. “The first thing on a producer or writer’s mind is getting the next show on the air, not archiving the one you just completed,” she says.
Pustejovsky and his lab have been taking a slow, steady approach to indexing the materials in the AAPB. They started with the easiest part. Many shows begin with a film slate, or clapboard, with the broadcast date, producer and title.
Pustejovsky’s team used optical character recognition (OCR) to extract this text, which is often handwritten. Where there’s a program transcript, Pustejovsky and his collaborators use timestamps to align the transcript’s words with the spoken dialogue down to the millisecond.
Facial recognition is trickier. Algorithms developed by Pustejovsky’s team will look for instances where an onscreen name is used to identify the interviewee. The second time the person appears, even if unidentified, the computer will recognize him or her. That information will become part of the transcript, marking each occasion on which the individual speaks and the time at which it occurs.
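The idea of carrying a name from a captioned appearance to later, uncaptioned ones can be sketched roughly as follows. This is only an illustration, not the lab's software: the `Appearance` record, the face "embedding" vectors, the cosine-similarity comparison and the 0.9 threshold are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Appearance:
    """One on-screen appearance of a face (hypothetical record)."""
    start_ms: int                        # when the face appears, in milliseconds
    embedding: List[float]               # assumed numeric description of the face
    onscreen_name: Optional[str] = None  # name read from a caption, if present


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def propagate_names(
    appearances: List[Appearance], threshold: float = 0.9
) -> List[Tuple[int, Optional[str]]]:
    """Label each appearance, reusing names seen earlier in the program."""
    known: List[Tuple[str, List[float]]] = []  # faces already tied to a name
    labeled = []
    for app in sorted(appearances, key=lambda a: a.start_ms):
        name = app.onscreen_name
        if name is not None:
            # A caption identifies this face; remember it for later.
            known.append((name, app.embedding))
        elif known:
            # No caption: compare against every face named so far.
            best_name, best_emb = max(
                known, key=lambda k: cosine(k[1], app.embedding)
            )
            if cosine(best_emb, app.embedding) >= threshold:
                name = best_name
        labeled.append((app.start_ms, name))
    return labeled
```

Given one captioned appearance of "Jane Doe" and a later, uncaptioned face with a nearly identical vector, the sketch labels both with her name and leaves clearly different faces unlabeled.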
Then there’s identifying the location. Pustejovsky wants the computer program to determine whether a segment in a television show was shot indoors or outdoors, in a forest or on top of a mountain, inside a house or office. The computer must search for clues in the transcript or identify visual elements, such as a tree or water, in the video.
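A toy version of this clue-gathering might look like the sketch below. It is an assumption-laden illustration rather than the team's method: the clue lists are invented for the example, and the `visual_labels` argument stands in for whatever an image-recognition step might report.

```python
from typing import List

# Hypothetical clue lists; a real system would likely learn these from data.
OUTDOOR_CLUES = {"tree", "water", "forest", "mountain", "river", "sky"}
INDOOR_CLUES = {"desk", "sofa", "kitchen", "office", "lamp", "bookshelf"}


def guess_setting(transcript: str, visual_labels: List[str]) -> str:
    """Guess 'outdoors', 'indoors' or 'unknown' from text and visual clues."""
    # Normalize transcript words: strip surrounding punctuation, lowercase.
    words = {w.strip(".,;:!?\"'").lower() for w in transcript.split()}
    # Pool the transcript words with the labels reported for the video.
    clues = words | {label.lower() for label in visual_labels}
    outdoor = len(clues & OUTDOOR_CLUES)
    indoor = len(clues & INDOOR_CLUES)
    if outdoor > indoor:
        return "outdoors"
    if indoor > outdoor:
        return "indoors"
    return "unknown"
```

For instance, a segment whose transcript mentions a forest and a river, and whose video shows a tree, would be tagged "outdoors"; a segment with no matching clues stays "unknown" rather than being guessed.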
Perhaps the most challenging task will be generating summaries, says Pustejovsky.
These descriptions will be basic for now; for example, “storm in Louisiana” for one segment. Over time, with enough refinements, the algorithms may be able to produce more complete summaries, such as, “The banks of the river are overflowing during the hurricane.”
Some of the tools Pustejovsky and his lab are developing already exist, but they’re proprietary and expensive. Pustejovsky plans to create open-source software so it’s available free to libraries and to TV and radio stations.
WGBH’s Cariani says it may be years or even a decade before all the work is completed. “There’s a huge amount of really great content that’s been produced and created over the years that we need to preserve and make accessible to the American people,” she says. “We have to start somewhere.”
WGBH also expects that down the road Pustejovsky’s team will help the station index its own archival materials. This collection includes iconic programs like “American Experience,” “Frontline,” and “NOVA,” and more obscure ones, like “Gallimaufry,” “Hot Nights” and a 1961 lecture by Harvard philosophy professor Gabriel Marcel on “The Existential Backgrounds of Human Dignity.”
The materials are currently searchable online by title and description, but WGBH hopes that Pustejovsky’s AI program will generate even more information that the public and researchers can use to study episodes and programs.