Automatically Segmenting Oral History Transcripts
This work addresses automated segmentation of oral history transcripts, which could help archivists speed up a taxing task, though low agreement among human segmenters complicates evaluation.
The paper addresses the problem of automatically segmenting oral history transcripts into topically coherent segments to improve online accessibility. It finds that the BayesSeg algorithm performs slightly better than TextTiling, while TextTiling does not perform significantly better than uniform segmentation.
Dividing oral histories into topically coherent segments can make them more accessible online. People regularly make judgments about where coherent segments can be extracted from oral histories. But making these judgments can be taxing, so automated assistance is potentially attractive to speed the task of extracting segments from open-ended interviews. When different people are asked to extract coherent segments from the same oral histories, they often do not agree about precisely where such segments begin and end. This low agreement makes the evaluation of algorithmic segmenters challenging, but there is reason to believe that for segmenting oral history transcripts, some approaches are more promising than others. The BayesSeg algorithm performs slightly better than TextTiling, while TextTiling does not perform significantly better than a uniform segmentation. BayesSeg might be used to suggest boundaries to someone segmenting oral histories, but this segmentation task needs to be better defined.
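To make the uniform baseline concrete, the following is a minimal sketch of how such a segmentation might be produced and compared against a reference. It rests on assumptions: the abstract does not say which evaluation measure was used, so the Pk measure (a commonly used window-based segmentation error rate) stands in here for illustration only. The function names and the convention that a boundary is the index of the unit that starts a new segment are hypothetical.

```python
def uniform_boundaries(n_units, n_segments):
    """Place boundaries at (roughly) equal intervals.

    A boundary is the index of the unit (e.g. sentence) that starts
    a new segment. This convention is an assumption of this sketch.
    """
    if n_segments < 1 or n_segments > n_units:
        raise ValueError("n_segments must be between 1 and n_units")
    return [(i * n_units) // n_segments for i in range(1, n_segments)]


def segment_ids(n_units, boundaries):
    """Label each unit with the index of the segment it falls in."""
    starts = set(boundaries)
    labels, current = [], 0
    for i in range(n_units):
        if i in starts:
            current += 1
        labels.append(current)
    return labels


def pk(reference, hypothesis, n_units, window=None):
    """Pk error: the fraction of unit pairs, `window` apart, on which
    reference and hypothesis disagree about whether the two units lie
    in the same segment. Lower is better; 0.0 means no disagreement.

    Pk is an assumed stand-in metric, not one named in the abstract.
    """
    if window is None:
        # Conventional default: half the mean reference segment length.
        window = max(1, n_units // (2 * (len(reference) + 1)))
    if n_units - window <= 0:
        raise ValueError("window must be smaller than n_units")
    ref = segment_ids(n_units, reference)
    hyp = segment_ids(n_units, hypothesis)
    errors = sum(
        (ref[i] == ref[i + window]) != (hyp[i] == hyp[i + window])
        for i in range(n_units - window)
    )
    return errors / (n_units - window)


if __name__ == "__main__":
    n = 10
    human = [5]                       # hypothetical reference boundary
    baseline = uniform_boundaries(n, 3)
    print(baseline, pk(human, baseline, n))
```

Under this sketch, a suggested segmentation from any algorithm could be scored against each human segmentation separately. Given the low human agreement the abstract reports, it would likely be useful to score human segmenters against each other in the same way to see how much disagreement to expect.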