
Apart from player reviews, they focus on cheats: instructions on how to play games more easily. For current television content, new content-alert systems based on programme schedules provide automatic notification of broadcasts that fit certain criteria, e.g. MeeVee (category 1) or Radio Times (category 1). The BBC has announced its commitment to making 1 million hours of television and radio searchable and available online. On the horizon of searchability, systems that bridge different media are under active development.

For instance, search engines that range across television and web content have been designed. However, these systems do not actually search the content of film, video or broadcast: as in the case of spoken-word resources, they still rely on prior cataloguing, annotation or transcription. Even the most advanced video upload sites, such as Yahoo! Video (category 1), require submitters to supply the keywords used for indexing and cataloguing the clip.

Web search engines will probably index video at a finer-grained level as collaborative annotation techniques develop (see Section 3); this topic is discussed further at the end of the following section. More generally, online suppliers of content for online and offline viewing are rapidly increasing in number. For example, iTunes allows the download of video content (e.g. TV shows or media-company podcasts) for transfer to a video-capable iPod, and TiVo now supports transfer of content recorded by the TiVo to an iPod or PlayStation Portable.

Video coming through these mechanisms has established charging models, some per month, some per content unit.



We distinguish these offerings from emerging Web search tools aimed at locating audiovisual material. Most of these fall into categories 1 or 2, and they share many similarities with search engines for text. One of the earliest was Speechbot, a Web-deployed tool for indexing audio using speech-recognition transcriptions. Speechbot supported many of the functions now familiar from text-based searching, allowing free-text, advanced or power searches, and produced a results list in which each item comprised a ten-second errorful transcription around the located and highlighted query terms, the ability to play the corresponding ten-second extract, and the date of the recording.

Speechbot is now unavailable due to the closure of the Compaq Cambridge Research Lab (US), but in the past couple of years a number of similar services have emerged. For this reason, we describe typical functionalities rather than specific systems in depth. Some tools crawl the Web for audio and video made openly available on websites.

For example, podscope offers the ability to search audio blogs and podcasts, as does Blinkx; Truveo offers a similar service (Rev2). Some tools support the search of video or audio submitted by users. For example, podscope allows users to submit content (Price, a), while Google Video operates the Google Video Upload program (Google, c), whereby video and, optionally, a transcript are submitted to the system. Some tools index content legitimately provided by media companies and archives.

The tools perform the search in different ways. Some rely on metadata associated with videos, such as web-page captions or user-uploaded transcripts (the current version of Google Video may fall into this category).



Others extract closed captioning or use speech-to-text technology to allow more precise indexing, as discussed earlier, returning results which play from the point of the first matching query term, e.g. TV Eyes (Price, b). Most of the sites described offer services in English; services are also appearing in Mandarin. The business models of these companies are still evolving. Services such as blinkx appeared to be inserting advertisements into searchable content (netimperative), while others offer premium fee-based services, e.g. TVEyes (Price, b).
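To illustrate the kind of index behind this play-from-the-query-term behaviour, the sketch below builds a simple word-to-timestamp index over a time-stamped transcript and returns the earliest playback offset for a query. It is a toy illustration only, not a description of any of the services above, which work with errorful recogniser output, confidence scores and far larger collections.

```python
from collections import defaultdict

def build_index(timed_words):
    """Map each lower-cased word to the times (in seconds) at which it was spoken."""
    index = defaultdict(list)
    for word, start_seconds in timed_words:
        index[word.lower()].append(start_seconds)
    return index

def first_hit(index, query):
    """Earliest playback offset at which any query word occurs, or None."""
    hits = [t for word in query.lower().split() for t in index.get(word, [])]
    return min(hits) if hits else None

# Toy time-stamped transcript: (word, start time in seconds).
transcript = [("the", 0.0), ("prime", 0.4), ("minister", 0.7), ("announced", 1.2),
              ("new", 1.8), ("broadcasting", 2.1), ("legislation", 2.9)]

index = build_index(transcript)
print(first_hit(index, "broadcasting legislation"))  # 2.1 -> play the clip from here
```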

Video and audio are large media. On the web, in film and video databases, on DVDs, and in legacy collections of video and film, there is no shortage of film footage or television content. Many scholars collect large amounts of this material on their own computers and on portable storage media. While professionally curated online archives usually have extensive catalogues and indexes, personal collections of audiovisual materials sometimes suffer from a lack of organisation. At one level, the folder and directory structures available on desktop computers allow virtually any material to be organised.

However, other means of organising audiovisual materials are available. Most music- and video-player software, such as iTunes, XMMS or Windows Media Player, embodies some idea of bookmarks, libraries or playlists. Annotation software often includes file-management features, sometimes for thousands of files.


Dedicated personal information management software such as DEVONthink (Devon Technologies) handles multimedia and text files equally. Some software attempts to automatically index still images and text files added to it, but its search capabilities use only tags and metadata for sound and image files. In the context of time-based media, annotation associates extra information, often textual but not necessarily so, with a particular point in an audiovisual document or media file. In humanities research, annotation has long been important, but in the context of sound and image it takes on even greater importance.

Rich annotation of content is required to access and analyse audiovisual materials, especially given the growing quantities of this material. Annotation software for images, video, music and speech is widely available, but it does not always meet the needs of scholars, who annotate for different reasons. Sometimes annotation simply allows quick access to, or an index of, different sections or scenes. Annotation has particular importance for film and video, where it is sometimes used for thematic or formal analysis of visual forms or narratives.

At more fine-grained levels, some film scholars analyse a small number of film frames in detail, following camera movements, lighting, figures, and the framing of scenes. Annotation tools designed for the analysis of cinema are not widely available; most video analysis software concentrates on a higher level of analysis. There are many different approaches with regard to standards in annotation. There are several well-known metadata standards applicable to humanities research, such as library standards like MARC and Z39.50. These are useful standards, but they are dominated by the resource-level approach: most similar metadata standards describe content at the level of an entire entity within a library.

This level of metadata is very useful, but it does not satisfy the requirements of annotation as described above: the standards do not have robust models for marking points within the content. MPEG-7, by contrast, is intended to be a comprehensive multimedia content description framework, enabling detailed metadata description aimed at multiple levels within the content. It is worthwhile to go into a little detail on the standard and what it might offer to humanities researchers.

A key to understanding MPEG-7 is appreciating the goals that shaped its conception and the environment in which it was born. It was conceived at a time when the World Wide Web was just showing its potential to be a massively interconnected multimedia resource.

Text search on the web was beginning, and throwing into relief the opacity of multimedia files: there was then no reliable way of giving human or computer access inside a multimedia resource without a human viewing it in its entirety. Spurred on by developments in the general area of query by example (including query by image content and query by humming), it was thought that MPEG could bring its considerable signal processing prowess to bear on those problems.

Along the way to the standard, people discovered that the problem of multimedia content description was not all that trivial, nor could it rely wholly upon signal processing. It had to bring in higher-level concerns, such as knowledge representation and digital library and archivist expertise. In doing so, the nascent standard became much more complex, but had the potential to be much more complete. The standard, as delivered, has a blend of high- and low-level approaches.

The visual part of the standard kept closest to MPEG's old guard, concentrating on features unambiguously based upon signal processing and very compact representations. The newly created Multimedia Description Schemes (MDS) subgroup brought in a very rich, often complex set of description structures that could be adopted for many different applications. MPEG-7 Audio took a middle path, offering both generic, signal-processing-inspired feature descriptors and high-level description schemes geared towards specific applications.

Data validation is offered by the computationally rich, but somewhat complex XML Schema standard. Users and application providers may customise the precise schema via a variety of methods. There are numerous descriptive elements available throughout the standard, which can be mixed and matched as appropriate. Most significantly, it allows for both simple and complex time- and space-based annotations, and it enables both automated and manual annotations.
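To make the idea of time- and space-based description concrete, the sketch below assembles a minimal MPEG-7-style fragment in Python, attaching a free-text annotation to a fifteen-second video segment. The element names (VideoSegment, FreeTextAnnotation, MediaTimePoint, MediaDuration) come from the Multimedia Description Schemes part of the standard, but the fragment is purely illustrative and has not been validated against the MPEG-7 XML Schema.

```python
import xml.etree.ElementTree as ET

# Illustrative only: a free-text annotation attached to a 15-second video segment
# starting 90 seconds into the content.
mpeg7 = ET.Element("Mpeg7", xmlns="urn:mpeg:mpeg7:schema:2001")
description = ET.SubElement(mpeg7, "Description")
video = ET.SubElement(ET.SubElement(description, "MultimediaContent"), "Video")
decomposition = ET.SubElement(video, "TemporalDecomposition")

segment = ET.SubElement(decomposition, "VideoSegment")
annotation = ET.SubElement(segment, "TextAnnotation")
ET.SubElement(annotation, "FreeTextAnnotation").text = "Interview begins; close-up of the speaker"

media_time = ET.SubElement(segment, "MediaTime")
ET.SubElement(media_time, "MediaTimePoint").text = "T00:01:30"   # segment start
ET.SubElement(media_time, "MediaDuration").text = "PT15S"        # segment length

print(ET.tostring(mpeg7, encoding="unicode"))
```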

Industrial take-up and generally available implementations of MPEG-7 have been inconsistent at best so far. The representation format offered by MPEG-7, however, seems to be one that would serve arts and humanities research very well. It is agnostic to media type and format. It is very general, and can be adapted to serve a variety of different applications. Despite its flexibility, it is far more than a guideline standard: it has very specific rules for ensuring compatibility and interoperability.

If someone were to invent a framework serving the arts and humanities research community for its metadata needs, it would resemble MPEG-7, at least conceptually. A fine-grained approach to the problem of re-using annotations relies on developing shared standards for annotation, and standards for the annotation of video content have been developed. Annodex (category 2) is an open standard for annotating and indexing networked media, and draws to some extent upon experience gained from MPEG; its aim is to provide pointers or links into time-based video resources on the web.

The Metavid project demonstrates Annodex in action on videos of U.S. Congressional proceedings. There are numerous tools and formats for creating linguistic annotations, many catalogued by the Linguistic Data Consortium. According to the LDC, linguistic annotation covers any descriptive or analytic notations applied to raw language data. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, named-entity identification, co-reference annotation, and so on.

The focus is on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases. Some of the analysis tools mentioned earlier also support annotation. There is also the open-source Transcriber tool and numerous other commercial solutions for more general transcription of digital speech recordings, such as NCH Swift Sound; these tools fall variously into categories. For video, a typical annotation tool is Transana (category 1), developed by WCER, University of Wisconsin, which allows researchers to identify analytically interesting clips, assign keywords to clips, arrange and rearrange clips, create complex collections of interrelated clips, explore relationships between applied keywords, and share their analysis with colleagues.

Annotation of music associates non-textual information with the original data more often than is the case for other media. Essentially the task remains the same: to associate some symbolic information with points or segments of the audiovisual medium. In the case of multi-channel or multi-track data, it is possible that annotations might be applied to separate channels or tracks, but we have found no instances of this.

The kinds of annotations which researchers wish to make include structural or quasi-semantic labels, among others. These annotations can be attached to time points in the original stream, or to segments. In the latter case the annotations may or may not form a hierarchical structure, with segments contained within segments, and may or may not contain overlapping segments. Annotation tools are unfortunately rarely explicit about which of these kinds of annotation they support. Though not intended explicitly for annotation, music-editing or music-composition software can have annotation capabilities or be repurposed to perform annotation tasks.

Two MIDI tracks were added using the software, one indicating where the beats fell and the other indicating each percussion stroke and the kind of instrument (bass drum, snare drum, etc.). An advantage of using the software was that the MIDI track could be played either with or without the original audio track, allowing the user to check by ear whether or not the percussion strokes had been correctly identified and correctly timed. Tools intended for the annotation of speech can also be used for annotating music (cf. Lesaffre et al.).
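The beat-marking part of this workflow can be reproduced with general-purpose MIDI libraries. The sketch below, which assumes the third-party mido package and uses an illustrative percussion note and tempo grid, writes a MIDI file containing a click on every beat that could then be played alongside, or instead of, the original audio for checking by ear.

```python
from mido import Message, MidiFile, MidiTrack  # assumes the third-party mido package

midi_file = MidiFile()            # default: 480 ticks per beat
track = MidiTrack()
midi_file.tracks.append(track)

TICKS_PER_BEAT = midi_file.ticks_per_beat
CLICK_NOTE = 37                   # side stick on the General MIDI percussion channel

# One short click on every beat for four bars of 4/4.
for beat in range(16):
    # Each click starts one beat (in ticks) after the previous click started.
    track.append(Message("note_on", channel=9, note=CLICK_NOTE, velocity=100,
                         time=(TICKS_PER_BEAT - 30) if beat else 0))
    track.append(Message("note_off", channel=9, note=CLICK_NOTE, velocity=0, time=30))

midi_file.save("beat_track.mid")  # load into a sequencer next to the original audio
```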

Similarly, generic tools for annotating video or audio, such as Project Pad (Northwestern University, category 2), can be used for annotating music. This software allows different kinds of annotations to be attached to time points or to segments, and different kinds of annotations can attach to different segmentations. Annotation types can be defined using an XML schema, and software elements can be added to automate some annotation processes (see Section 3).
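As a hypothetical illustration of the data model these tools share, the sketch below distinguishes point annotations from segment annotations and permits both nesting and overlap; it is one way of expressing the distinction drawn above, not the representation used by Project Pad or any other particular tool.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PointAnnotation:
    time: float           # seconds into the recording
    label: str            # e.g. "downbeat", "cadence", "speaker change"

@dataclass
class SegmentAnnotation:
    start: float
    end: float
    label: str
    children: List["SegmentAnnotation"] = field(default_factory=list)  # optional hierarchy

    def overlaps(self, other: "SegmentAnnotation") -> bool:
        # Overlap is permitted; callers can decide whether to forbid it.
        return self.start < other.end and other.start < self.end

verse = SegmentAnnotation(0.0, 32.5, "verse")
phrase = SegmentAnnotation(0.0, 8.1, "phrase 1")
verse.children.append(phrase)                     # nested segment
beat = PointAnnotation(0.52, "downbeat")          # point annotation
print(verse.overlaps(SegmentAnnotation(30.0, 60.0, "chorus")))  # True: segments may overlap
```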

Annotation of audiovisual materials can take a lot of time, and even if material has been annotated by one researcher, the problem remains of how any other researcher can make use of the annotation. It is therefore not a surprise that projects have investigated sharing the effort and the results. We see much activity in this area, and some promising early ideas. Annotation for the purpose of finding audiovisual material seems successful, but we have not seen anything like the sophisticated and consistent analysis that would be needed to write even a basic film or book review.

Simple collaborative annotation of audiovisual materials is now common on the web. Sites such as Google Video (Google, b; category 1) or YouTube (category 1) partly rely on tags supplied by contributors. Producers and consumers of audiovisual material, whether photographs, speech, sound and music, or video, tag it with keywords.

These keywords then become searchable via web search engines or through subscription mechanisms. While people often choose very generic keywords, and the keywords often apply to whole, large video files, the tags and keywords are clearly useful. There is a synergy between the descriptions supplied by different users: for example, one may annotate the style of an image while another marks the presence of a street sign. Combinations of the annotations supplied by users are exploited by database-driven websites such as flickr.com.
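The way separate users' keywords combine can be pictured as a simple set operation: each user contributes tags for an item, and a query succeeds if the pooled tags cover all of the query terms. The sketch below is a toy model of that behaviour, not the implementation of any particular site.

```python
# Tags contributed independently by different users for the same photographs.
tags_by_item = {
    "photo_001": {"alice": {"street", "night"}, "bob": {"street sign", "rain"}},
    "photo_002": {"alice": {"portrait"}, "carol": {"studio", "lighting"}},
}

def pooled_tags(contributions):
    """Union of every contributor's tags for one item."""
    return set().union(*contributions.values())

def search(query_terms):
    """Items whose pooled tags include every query term."""
    return [item for item, contributions in tags_by_item.items()
            if set(query_terms) <= pooled_tags(contributions)]

print(search({"street", "rain"}))   # -> ['photo_001']: no single user supplied both tags
```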

Currently, commercial video uploading and downloading services are growing rapidly, and they offer increasingly sophisticated annotation features, e.g. Viddler (category 3). However, by and large, the annotations describe only the most obvious features, which limits the searches that can be done. A number of projects have attempted to design and construct collaborative software environments for video annotation. In collaborative video annotation, a number of people can work on the same video footage.



Efficient Video Annotation (EVA) (Volkmer, category 2) is a novel Web tool designed to support distributed, collaborative indexing of semantic concepts in large image and video collections. Another approach to collaborative annotation is to set annotation up as a game in which annotations are generated as a side-result; the best-known example is the ESP Game from Carnegie Mellon University. The same team has a later game, Peekaboom (Carnegie Mellon University, b), which helps in generating labels for segmented images, useful, for example, in computer vision.

The designers claim to have generated over 10 million descriptive words for one million images. A different approach is taken by the application mediaBase (Institute for Multimedia Literacy, category 2), which requires some manual annotation or tagging of any media file put into the system. However, after this initial tagging, it encourages rich-media authorship as a way of investigating relations between different media components. MediaBase publishes the resulting compositions on the web, and they can be altered, edited, revised or added to by others.

The goal of MixedMediaGrid (NCeSS, category 4), an ESRC e-Science funded project, is to generate tools and techniques for social scientists to collaboratively analyse audio-visual qualitative data and related materials over the Grid. Certainly, these tools and techniques could be used in the humanities too. MediaMatrix (category 2), developed at Michigan State University, is a similar online application that allows users to isolate, segment and annotate digital media.

Similarly in music, a number of projects have suggested processes of collaborative annotation to allow researchers to pool effort and benefit from each other's annotations. Project Pad (Northwestern University) is designed explicitly to allow teams (envisaged as students, but they could be researchers) to share annotations. Collaborative annotation of music in education has been reported, but none in research. A collaborative music-annotation project intended for research has been set up at Pompeu Fabra University, using either the CLAM Music Annotator or a Wavesurfer-based client to a web portal (Herrera et al.).

The BBC also has a project in this area, with the aim that listeners will progressively annotate recordings of radio programmes (Ferne). This is an internal BBC research project, but a public launch has been mooted. Interestingly, this project uses a wiki-like approach, allowing the public to edit existing annotations, including viewing histories and reverting to previous versions, but with the underlying assumption that there is a single canonical annotation.


Already established and an everyday part of music on the web, though not really a research tool, is the Gracenote database of CD tracks (Gracenote). The database supplies annotations that media players use to display artist and title information, which is not recorded in the electronic data on an audio CD. Publishers of CDs can supply the original information to Gracenote, but many CDs were published long before there were media players on computers, let alone before the Gracenote database existed.

Commonly, when a media player finds there is no information in the database for a CD, the user is invited to supply this information, which is then sent to the database. Thus Gracenote is effectively a global collaborative annotation tool. However, in the area of classical music recordings it is notoriously inaccurate, largely because the categories used by the database (Artist, Song, etc.) map poorly onto classical works and performers. Research uses of the database are therefore likely to be confined to popular music.

An alternative response to the time-consuming nature of manual annotation is to automate part of the process. Clearly, different kinds of annotations present different levels of difficulty in automation, and it is in the simple and explicit partitioning of audio, in particular, that automatic annotation has had the greatest success. The challenges at more semantic levels are much greater, though some projects in this area have had a degree of success, particularly with respect to music. The goal of audio partitioning systems is to divide the input audio into homogeneous segments and, typically, to determine their type.
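A very simple form of such partitioning is to split a recording into sound and silence on the basis of frame energy. The sketch below, which assumes NumPy and a mono signal already loaded as an array of samples, labels fixed-length frames and merges runs of identical labels into segments; real systems use far richer acoustic features and statistical models to separate speech, music and other sound classes.

```python
import numpy as np

def partition_by_energy(samples, sample_rate, frame_seconds=0.02, threshold_db=-35.0):
    """Label 20 ms frames as 'sound' or 'silence' and merge them into segments."""
    frame_len = int(sample_rate * frame_seconds)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10 * np.log10(np.mean(frames.astype(float) ** 2, axis=1) + 1e-12)
    labels = np.where(energy_db > threshold_db, "sound", "silence")

    segments, start = [], 0
    for i in range(1, n_frames + 1):
        if i == n_frames or labels[i] != labels[start]:
            segments.append((start * frame_seconds, i * frame_seconds, labels[start]))
            start = i
    return segments

# Toy signal: one second of near-silence followed by one second of noise, at 16 kHz.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0, 0.001, 16000), rng.normal(0, 0.3, 16000)])
print(partition_by_energy(signal, 16000))
```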

The resulting partitioning may provide useful metadata for the purpose of flexible access, but such partitioning is also an important prerequisite for speech-to-text transcription systems. The past decade has seen the birth and rapid growth of the field of Music Information Retrieval (MIR), fed in part by the interest of music businesses in technologies to facilitate user interaction with large databases of downloadable music.

Query by humming (see Section 3) is one well-known MIR task, and a range of others have been addressed in evaluation competitions, some of which are concerned with partitioning. The audio competitions used recorded sound as the raw data, while the symbolic competitions used MIDI files. The best symbolic systems, interestingly, performed at similar levels of accuracy, despite the much lower complexity of the input data. Other tasks, such as pitch spelling (i.e. assigning written note names to numeric pitches), apply only to symbolic data.

Most MIR software falls into categories; only Marsyas has become sufficiently widely used to take on the status of category 2 (released, but not yet finished, software), but its use is currently as a toolkit for MIR research rather than as a tool for musicologists. It would be a mistake, however, to think that MIR research will not assist musicological and music-analytical research.

While it is true that tools which automate the typical tasks of music analysis are not, as yet, in prospect, MIR tools do produce a wealth of potentially useful and interesting data about musical sound of a somewhat different nature. With a change of focus by music analysts, and a certain amount of re-education (since the acoustics and mathematics involved are not part of the general knowledge of music analysts), these tools promise novel and fruitful areas of research which focus on the analysis of music as sound rather than music as notated structure.

A video can be partitioned into shots: a shot is an uninterrupted sequence of video frames continuous in time, space and graphical configuration. For the last decade, many research projects have worked on the automated partitioning of footage into shots and topics, and on face recognition, particularly in news video processing. Some of this research has led to commercial products.
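The most basic of these partitioning steps, shot-boundary detection, is commonly illustrated by comparing colour histograms of consecutive frames and flagging sharp changes as cuts. The sketch below assumes the third-party opencv-python package and a hypothetical local file, and it detects only abrupt cuts, not gradual transitions such as fades or dissolves.

```python
import cv2  # assumes the third-party opencv-python package

def detect_shot_boundaries(video_path, threshold=0.4):
    """Flag frames whose colour histogram differs sharply from the previous frame."""
    capture = cv2.VideoCapture(video_path)
    boundaries, previous_hist, frame_index = [], None, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if previous_hist is not None:
            # Correlation near 1.0 means similar frames; a sharp drop suggests a cut.
            similarity = cv2.compareHist(previous_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < 1.0 - threshold:
                boundaries.append(frame_index)
        previous_hist, frame_index = hist, frame_index + 1
    capture.release()
    return boundaries

# Hypothetical local file; prints the frame numbers at which shots appear to change.
print(detect_shot_boundaries("news_bulletin.mp4"))
```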

Some of these systems start from manual annotation and then automatically annotate and index related video materials. For instance, the Marvel video annotation system (IBM, category 3) demonstrates the ability to generate semantic and formal labels from television news footage: Marvel builds statistical models from visual features using training examples and applies the models to annotate large repositories automatically. Other projects seek to generate topic structures for TV content using TV viewers' comments in live web chat rooms (Miyamori et al.). Transcription is typically applicable only to the audio within time-based multimedia. More technically, as it is a process of writing down events in a canonical form, it applies to events that are transitory and constrained.

As such, music, dance and speech are the most commonly transcribed sources of those within the project's remit. Automatic general video transcription makes little sense in the near term because it essentially requires a model of the whole world; with constrained worlds some transcription is possible, and there has been some automatic video understanding of sports footage as well. Speech-to-text, or automatic speech recognition, systems aim to convert a speech signal into a sequence of words. Progress in the field has been driven by standardised metrics, corpora and benchmark testing through NIST, with systems developed for ever more challenging tasks or speech domains: developing from single-person dictation systems to today's research into systems for the meetings and lectures domain.

A brief history of speech and speaker recognition research can be found in Furui (a). Some of the differences between speech domains can create additional difficulty for automatic systems. For example, speech from the lecture domain has much in common with speech from a more conversational domain, including false starts, extraneous filler words like 'okay' and filled pauses ('uh'). It also exhibits poor planning at higher structural levels as well as at the sentence level, often digressing from the primary theme.

Development of a system for a new speech domain or application ideally builds upon a large amount of manually transcribed in-domain training data, often of the order of hundreds if not thousands of hours for state-of-the-art systems (Kim et al.), in order to build a speech transcription system tailored to that domain. The transcriptions need not be perfect: techniques have recently been developed to handle less-than-perfect transcriptions such as closed captions (Kim et al.). Where sufficient adequately transcribed data cannot be made available, for financial or other reasons, as much adequately transcribed in-domain acoustic data as is feasible is obtained (which will sometimes be none) and models from a similar domain are adjusted or adapted, in terms of their acoustic, vocabulary or word-predictor components, to match the new domain as well as possible.

Vocabulary and language-model (word-predictor) adjustments can also be made based upon in-domain textual information such as transcripts, textbooks or other metadata, where available. There is a computation-time versus accuracy trade-off: a real-time system will typically perform less well than a ten-times-real-time (10xRT) or even unconstrained system, but the degradation will vary with the situation.
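The vocabulary and language-model adjustment described above can be pictured, in grossly simplified form, as mixing word-frequency estimates from a small amount of in-domain text with those from a larger out-of-domain corpus. The sketch below interpolates unigram probabilities only, which is far cruder than the n-gram models real systems adapt, and uses invented toy corpora.

```python
from collections import Counter

def unigram_probabilities(text):
    """Relative frequency of each word in a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def interpolate(in_domain, out_of_domain, weight=0.5):
    """Weighted mix of in-domain and out-of-domain unigram estimates."""
    vocabulary = set(in_domain) | set(out_of_domain)
    return {word: weight * in_domain.get(word, 0.0)
                  + (1 - weight) * out_of_domain.get(word, 0.0)
            for word in vocabulary}

# Toy corpora: a general news model adapted with a few in-domain lecture transcripts.
general = unigram_probabilities("the government announced the budget to the press")
lectures = unigram_probabilities("the lecture covers annotation of the audio corpus")
adapted = interpolate(lectures, general, weight=0.3)
print(sorted(adapted, key=adapted.get, reverse=True)[:5])
```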

Similarly, memory constraints can affect performance; state-of-the-art systems typically use hardware beyond that of today's average desktop. The word-error rates for English speech referred to above were achieved under constraints of 10xRT and 20xRT respectively (Le). It is important to note that speech recognition systems developed for one domain cannot, in many if not most situations, be employed as a black box that can handle any domain: even speech from the same domain that differs from the training data may be problematic.

Some components of the system are brittle or sensitive to such changes: the system has been trained to recognise certain types of speech and, whilst it may perform quite well on those types, it may perform badly on speech which is different. Such differences can take many forms. System adaptation techniques exist to compensate for them to some extent (e.g. Gales), but despite significant progress in this area, the development of systems which are robust to differences in data remains a key research goal (Le; Ostendorf et al.).

Systems have also been developed for some domains in many other major European languages. Mention should also be made of the recently started DARPA Global Autonomous Language Exploitation (GALE) program (see Linguistic Data Consortium), which is developing technologies to absorb, analyse and interpret huge volumes of speech and text in multiple languages. As part of this, projects such as AGILE (Autonomous Global Integrated Language Exploitation), involving multiple sites including the University of Cambridge and the University of Edinburgh, are developing combined speech-to-text translation systems that can ingest foreign-language news programmes and TV shows and generate synchronised English subtitles (Machine Intelligence Laboratory). Differences in speech transcription performance across domains mean that speech transcription tools fall into different development categories depending upon the difficulty of the domain.

For example, desktop systems for dictated speech-to-text and desktop control are readily available (e.g. Dragon NaturallySpeaking), as are systems for constrained domains such as medical transcription (e.g. Philips SpeechMagic, which supports 23 languages and specialised vocabularies). The Microsoft SDK can be freely downloaded and used for the development of speech-driven applications, and is supplied with recognisers for US English, simplified Chinese and Japanese (Microsoft). All of these tools fall into categories, but will perform well only in certain situations.

State-of-the-art speech-to-text systems are typically made available through joint projects with universities or commercial organisations such as Philips and Scansoft; these tend to fall into categories.

For the enthusiast with time to spare, the HTK project offers downloadable software that will let you build a reasonable word-level or phonetic-transcription system, and it now offers an API called ATK for building experimental applications (HTK, n.d.).

These tools fall into categories. Church presents a chart showing that speech-to-text transcription researchers have achieved 15 years of continuous error-rate reduction, and we might wonder what the future holds. At present, the accuracy of current systems lags about an order of magnitude behind that of human transcribers on the same task (Moore; David Nahamoo, quoted in Howard-Spink, n.d.). Researchers have also investigated automatically extracting textual transcriptions comprising a sequence of sub-word units rather than words.
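Error rates of this kind are conventionally reported as word error rate (WER): the minimum number of substitutions, insertions and deletions needed to turn the system's output into the reference transcript, divided by the length of the reference. A straightforward dynamic-programming version is sketched below with an invented example.

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # distances[i][j] = edits needed to turn hyp[:j] into ref[:i]
    distances = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        distances[i][0] = i
    for j in range(len(hyp) + 1):
        distances[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = distances[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            distances[i][j] = min(substitution,              # substitute or match
                                  distances[i - 1][j] + 1,   # a reference word was missed
                                  distances[i][j - 1] + 1)   # a spurious word was inserted
    return distances[len(ref)][len(hyp)] / len(ref)

reference = "the committee will meet on tuesday morning"
hypothesis = "the committee we meet on tuesday"
print(round(word_error_rate(reference, hypothesis), 3))  # 2 errors / 7 words = 0.286
```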

This task has not been as heavily researched in recent years, but it has relevance to search and indexing applications, since such sub-word transcriptions often form the basis of techniques for searching for out-of-vocabulary query words with which the word-level transcription system is not familiar; new words appear in the news every day. A discussion of techniques for handling OOV (out-of-vocabulary) queries in spoken audio can be found in Logan et al. Phonetic transcription tools typically fall into development category 5 and exist within universities and research labs, though only for specific phone sets and not necessarily in forms which are easily packaged.
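To see why sub-word transcriptions help with out-of-vocabulary queries, consider searching a phone-level transcription for the phone sequence of a query word that the word-level recogniser does not know. The sketch below uses an invented phone stream and a hypothetical grapheme-to-phoneme expansion of the query, purely for illustration.

```python
def find_phone_sequence(phone_stream, query_phones):
    """Positions at which the query's phone sequence occurs in the stream."""
    span = len(query_phones)
    return [i for i in range(len(phone_stream) - span + 1)
            if phone_stream[i:i + span] == query_phones]

# Phone-level recogniser output for a stretch of audio (toy, ARPAbet-like symbols).
phone_stream = ["dh", "ah", "m", "ih", "n", "ih", "s", "t", "er", "s", "p", "ow", "k"]

# A query word absent from the recogniser's vocabulary, expanded to phones by a
# hypothetical grapheme-to-phoneme step.
query_phones = ["m", "ih", "n", "ih", "s", "t", "er"]
print(find_phone_sequence(phone_stream, query_phones))  # -> [2]
```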

As in some other problem domains, there is some convergence between research based on audio and research based on video. Audiovisual speech-to-text systems, which combine information about the movement of the lips (and possibly a wider region of interest around the mouth) with audio information, have been found to improve over audio-only speech-to-text in certain conditions.

Category 2 tools of this type are under development for constrained domains such as finance and in-car use. Allowing the combined use of audio and video was also found to improve the segmentation of stories on video, relative to purely speech-transcript-based approaches, for most systems at TRECVID (Kraaij et al.). Multimodality can also be usefully exploited in presentation. All of these research areas are still at quite preliminary stages, with the exception of audiovisual speech recognition work, and fall mostly into categories. However, it seems likely that solutions which make use of multiple rather than single information sources, where this is an option, will prove most successful in the future.

A convenient property of the most popular statistical approach to speech recognition is that the same algorithm used for speech-to-text transcription can be used to time-align a word-level transcript, such as an existing script, against the audio.
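One lightweight way to picture this alignment is to match the words of a trusted script against an errorful, time-stamped recogniser output and carry the timestamps across wherever the two agree. The sketch below uses Python's difflib for the matching and an invented example; real systems instead run the recogniser itself in a constrained alignment mode over the audio.

```python
import difflib

def transfer_timestamps(script_words, recognised):
    """Attach times from an errorful recogniser output to matching script words."""
    hyp_words = [word for word, _ in recognised]
    matcher = difflib.SequenceMatcher(a=script_words, b=hyp_words, autojunk=False)
    aligned = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            aligned[block.a + offset] = recognised[block.b + offset][1]
    return [(word, aligned.get(i)) for i, word in enumerate(script_words)]

script = ["good", "evening", "and", "welcome", "to", "the", "programme"]
# Recogniser output with one error ("welcome" heard as "will come") and word start times.
recognised = [("good", 0.0), ("evening", 0.5), ("and", 1.1), ("will", 1.3),
              ("come", 1.5), ("to", 1.8), ("the", 1.9), ("programme", 2.1)]

for word, time in transfer_timestamps(script, recognised):
    print(f"{word:10s} {time}")   # 'welcome' is left untimed where the output disagreed
```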

In a robust system, the algorithm may be lightly modified to allow for errors in the script. The speech-to-text transcriptions discussed above have historically comprised an unpunctuated and unformatted stream of text, and there is growing interest in enriching them with further structural annotation; such annotations may improve applications which involve the presentation of transcripts. Areas of interest include the automatic generation of punctuation, as well as of speech-specific structural information such as marking interruption points, edit regions and the boundaries of sentence-like units.

Associated tasks include speaker detection and tracking (identifying a number of speakers and grouping utterances from the same speaker, although absolute speaker identities remain unknown), speaker identification (determining who is speaking at a particular time), speaker verification (determining whether a particular speaker corresponds to a claimed identity) and tasks related to speaker localisation.

The named-entity task involves annotating transcripts to mark word sequences corresponding to items such as proper names, people, locations and organisations, or dates and times. The BBN Identifinder (BBN Technologies, b) is a category 1 named-entity extractor that has been quite widely used in the technical community. Tasks investigated include the detection of topic boundaries in a stream of data, the clustering of related segments of data, the automatic detection of later occurrences of data relating to a story of interest, and story link detection (determining whether two given stories are related).

As this description suggests, these tasks make most sense for news data, which comprises a sequence of stories, although there has been related work on conversational speech, such as that in the MALACH project.