Understanding and Generating Spoken Language
... Simple inquiries about bank balance, movie schedules, and phone call transfers can already be handled by telephone-speech recognizers. ... Voice activated data entry is particularly useful in medical or darkroom applications, where hands and eyes are unavailable, or in hands-busy or eyes-busy command and control applications. Speech could be used to provide more accessibility for the handicapped ... and to create high-tech amenities (intelligent houses, cars, etc.)
Definition of the Area
"Automatic speech recognition (ASR) is one of the fastest growing and commercially most promising applications of natural language technology. Speech is the most natural communicative medium for humans in many situations, including applications such as giving dictation; querying database or information-retrieval systems; or generally giving commands to a computer or other device, especially in environments where keyboard input is awkward or impossible (for example, because oneís hands are required for other tasks)." From Linguistic Knowledge and Empirical Methods in Speech Recognition. By Andreas Stolcke. (1997). AI Magazine 18 (4): 25-32.
"Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.%quot; From Making Computers Talk (below).
Good Starting Places
Making Computers Talk - Say good-bye to stilted electronic chatter: new synthetic-speech systems sound authentically human, and they can respond in real time. By Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003). "Scientists have attempted to simulate human speech since the late 1700s, when Wolfgang von Kempelen built a 'Speaking Machine' that used an elaborate series of bellows, reeds, whistles and resonant chambers to produce rudimentary words." Excellent overview.
Linguistic Knowledge and Empirical Methods in Speech Recognition. By Andreas Stolcke. (1997). AI Magazine 18 (4): 25-32.
FAQs. Topics covered include: general information, signal processing, speech coding and compression, natural language processing, speech synthesis, and speech recognition. Older site but still relevant.
The online version of Hal's Legacy: 2001's Computer as Dream and Reality. Edited by David G. Stork. Full text available. In particular, see chapters 6, 7, and 8.
Speech Recognition Speaks To Businesses. By Samuel Greengard. Baseline (Feb. 8, 2012). "Research firm Global Industry Analysts (GIA) predicts that speech recognition will grow from a $5.2 billion market at the end of 2009 to $20.9 billion by 2015. It's not difficult to understand the appeal of these systems. Automated speech recognition systems trim call center costs by about 50 percent while improving overall productivity, GIA reports. They also provide more secure and private interactions."
Talking PCs? Talk to the hand. By Nick Hampshire. ZDNet UK (June 12, 2006). "Voice recognition and speech synthesis technologies may not have developed to the degree some science fiction writers hoped, but have nevertheless seen some startling successes. ... Voice synthesis has been around for a long time. Bell Labs demonstrated a computer-based speech synthesis system running on an IBM704 in 1961, a demonstration seen by the author Arthur C. Clarke, giving him the inspiration for the talking computer HAL9000 in his book and film '2001: A Space Odyssey'. Forty-five years later, voice synthesis technology can be found in products as diverse as talking dolls, car information systems and various text-to-speech conversion services such as the one recently launched by BT. Many of these modern systems can convert text into a computer synthesised voice of quite respectable quality. ... Voice recognition has turned out to be a much harder task than researchers realised when work began on the problem over forty years ago. However, limited voice recognition applications are starting to creep into everyday use, voice input telephone menu systems are now commonplace, speech-to-text dictaphones are increasingly used for note-taking by doctors and lawyers, and voice input has started to appear in computer games systems. The success of some of these limited-application voice recognition systems has recently prompted the big software heavyweights, Microsoft and IBM, to make further investments. ... However, there are still a lot of technological hurdles to overcome; to understand what these are, we need to delve further into the technology. ... Speech recognition - Speech recognition, on the other hand, is a much harder task, and commercial off-the-shelf systems have only been available since the 1990s. 
Because every person's voice is different, and words can be spoken in a range of different nuances, tones and emotions, the computational task of successfully recognising spoken words is considerable, and has been the subject of many years of continuing research work around the world. A variety of different approaches are used, dynamic algorithms, neural networks, and knowledge bases, with the most widely used underlying technology being the Hidden Markov Model. These techniques all attempt to search for the most likely word sequence given the fact that the acoustic signal will also contain a lot of background noise."
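The Hidden Markov Model search mentioned in the excerpt above can be illustrated with a toy Viterbi decoder. This is a minimal sketch, not the method of any particular recognizer: the two hidden "word" states, the coarse acoustic labels ("hiss", "buzz"), and all probabilities are invented for illustration, and real systems work with phoneme-level states and continuous acoustic features.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path and its log probability."""
    # best[t][s] = (log probability of the best path ending in s at time t,
    #               the state that preceded s on that path)
    first = observations[0]
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)

    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    log_prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, log_prob


# Hypothetical toy model: hidden states are words, observations are
# coarse, noisy acoustic labels. All numbers are made up.
states = ("yes", "no")
start_p = {"yes": 0.5, "no": 0.5}
trans_p = {"yes": {"yes": 0.7, "no": 0.3},
           "no": {"yes": 0.3, "no": 0.7}}
emit_p = {"yes": {"hiss": 0.8, "buzz": 0.2},
          "no": {"hiss": 0.1, "buzz": 0.9}}

path, log_prob = viterbi(["hiss", "hiss", "buzz"],
                         states, start_p, trans_p, emit_p)
print(path, math.exp(log_prob))
```

Working in log probabilities avoids numerical underflow on long utterances; the toy model keeps every probability nonzero so that the logarithms are defined.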
Are you talking to me? Speech recognition: Technology that understands human speech could be about to enter the mainstream. The Economist Technology Quarterly (June 7, 2007). "Speech recognition has taken a long time to move from the laboratory to the marketplace. Researchers at Bell Labs first developed a system that recognised numbers spoken over a telephone in 1952, but in the ensuing decades the technology has generally offered more promise than product, more science fiction than function. ... Optimistic forecasts from market-research firms also suggest that the technology is on the rise. ... An area of great interest at the moment is in that of voice-driven 'mobile search' technology, in which search terms are spoken into a mobile device rather than typed in using a tiny keyboard. ... The resulting lower cost and greater reliability mean that speech-based systems can even save companies money. Last August, for example, Lloyds TSB, a British bank, switched all of its 70m annual incoming calls over to a speech-recognition system based on technology from Nuance and Nortel, a Canadian telecoms-equipment firm. ... Another promising area is in-car use. ... There are military uses, too. ..."
Computer Vision and Speech. Crossroads, The ACM Student Magazine. Fall 2007; Issue 13.4. As stated in the Introduction, by Niels Ole Bernsen: "If you are interested in computers with human capabilities, vision and speech open an entirely new world of computers that can see and talk like we do. Computer vision is the moody input cousin of computer graphics -- in graphics, you have all the time you can afford to program the rendering, but visual input is an unpredictable and messy reality. Computer speech is both input and output, like in systems capable of spoken dialogue. Viewed as enabling technologies, computer speech arguably holds the lead over computer vision. Even though a speech signal is enormously rich in information and we are still far from mastering important aspects of it like online recognition and generation of speech prosody, it is still much easier to shut up the people in a room in order to get a clear speech signal than it is to control the room's lighting conditions and to identify and track all of its 3-D contents independently of the viewing angle. Given the state of the art, it makes good sense that the papers in this issue of Crossroads are about speech or vision. Two articles address different stages of the process of making computers understand what is commonly called the speaker's communicative intention, i.e., what the speaker really wishes to say by uttering a sequence of words. Deepti Singh and Frank Boland [Voice Activity Detection] discuss approaches to the important pre-(speech)-recognition problem of detecting if and when the acoustic signal includes speech in the first place. ... Nitin Madnani's introduction [Getting Started on Natural Language Processing with Python] to natural language processing, or NLP, is likely to tempt computer scientists to try out NLP for themselves."
Common sense boosts speech software. By Eric Smalley. Technology Research News (March 23 / 30, 2005). "Speech recognition software matches strings of phonemes -- the sounds that make up words -- to words in a vocabulary database. The software finds close matches and presents the best one. The software does not understand word meaning, however. This makes it difficult to distinguish among words that sound the same or similar. The Open Mind Common Sense Project database contains more than 700,000 facts that MIT Media Lab researchers have been collecting from the public since the fall of 2000. These are based on common sense like the knowledge that a dog is a type of pet rather than the knowledge that a dog is a type of mammal. The researchers used the phrase database to reorder the close matches returned by speech recognition software. ... 'One surprising thing about testing interfaces like this is that sometimes, even if they don't get the absolutely correct answer, users like them a lot better,' said [Henry] Lieberman. 'This is because they make plausible mistakes, for example 'tennis clay court' for 'tennis player', rather than completely arbitrary mistakes that a statistical recognizer might make, for example 'tennis slayer',' he said. "
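The reordering idea described in the excerpt above can be sketched as follows. The article does not say how the MIT researchers scored candidates, so the scoring rule here is an assumption: each candidate's recognizer score gets a bonus for every pair of its words that appears in a small set of "related" word pairs. The candidate phrases, scores, pair set, and weight are all hypothetical.

```python
from itertools import combinations


def coherence(phrase, related_pairs):
    """Count word pairs in the phrase that the knowledge base calls related."""
    words = phrase.split()
    return sum(1 for a, b in combinations(words, 2)
               if frozenset((a, b)) in related_pairs)


def rerank(candidates, related_pairs, weight=0.1):
    """Reorder (phrase, recognizer_score) pairs, best first.

    The combined score is an illustrative assumption:
    recognizer_score + weight * number of related word pairs.
    """
    return sorted(
        candidates,
        key=lambda item: item[1] + weight * coherence(item[0], related_pairs),
        reverse=True,
    )


# Hypothetical close matches from a recognizer (higher score is better).
candidates = [
    ("tennis slayer", 0.62),
    ("tennis player", 0.60),
    ("tennis prayer", 0.55),
]

# Hypothetical stand-in for a common-sense fact database.
related_pairs = {
    frozenset(("tennis", "player")),
    frozenset(("tennis", "court")),
}

ranked = rerank(candidates, related_pairs)
print([phrase for phrase, _ in ranked])
```

With these made-up numbers the plausible phrase overtakes the acoustically stronger but implausible one, which is the effect the article describes.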
Spoken Language Systems Group, MIT Computer Science and Artificial Intelligence Laboratory.
Conversations control computers. By Eric Smalley. Technology Research News (January 12/19, 2005). "Because information from spoken conversations is fleeting, people tend to record schedules and assignments as they discuss them. Entering notes into a computer, however, can be tedious -- especially when the act interrupts a conversation. Researchers from the Georgia Institute of Technology are aiming to decrease day-to-day data entry and to augment users' memories with a method that allows handheld computers to harvest keywords from conversations and make use of relevant information without interrupting the personal interactions. ... The researchers' system protects privacy by only using speech from the user's side of the conversation, said [Kent] Lyons."
Listen to Wade Roush's podcast profile of Paris Smaragdis and his work. From 2006 Young Innovators Under 35. Technology Review (September 8, 2006). "Since 1999, the editors of Technology Review have honored the young innovators whose inventions and research we find most exciting; today that collection is the TR35, a list of technologists and scientists, all under the age of 35. Their work -- spanning medicine, computing, communications, electronics, nanotechnology, and more -- is changing our world. ... Paris Smaragdis, 32, Mitsubishi Electric Research Lab. Computer scientist Paris Smaragdis is building some of the world's most advanced 'machine listening' systems -- software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents."
IBM's Interactive U.S. English Demo: "This demonstration of our work in unconstrained text-to-speech research allows users to submit text to be synthesized into speech."
Ernestine, Meet Julie - Natural language speech recognition is markedly improving voice-activated self-service. By Karen Bannan. CFO Magazine (January 1, 2005). "A new technology, called natural language speech recognition, is markedly improving voice-activated self-service. Powered by artificial intelligence, these speech-recognition systems are altering consumer perceptions about phone self-service, as calls for help no longer elicit calls for help. That, in turn, is spurring renewed corporate interest in the concept of phone self-service. In 2004, sales of voice self-service systems topped $1.2 billion. 'We've seen voice systems move from emerging technology to applied technology over the last few years,' says Steve Cramoysan, principal analyst at Stamford, Connecticut-based research firm Gartner. 'It's still fairly immature. But it's proven and moving toward the mainstream.'"
The Futurist - The Intelligent Internet. The Promise of Smart Computers and E-Commerce. By William E. Halal. Government Computer News Daily News (June 23, 2004). "Scientific advances are making it possible for people to talk to smart computers, while more enterprises are exploiting the commercial potential of the Internet. ... [F]orecasts conducted under the TechCast Project at George Washington University indicate that 20 commercial aspects of Internet use should reach 30% 'take-off' adoption levels during the second half of this decade to rejuvenate the economy. Meanwhile, the project's technology scanning finds that advances in speech recognition, artificial intelligence, powerful computers, virtual environments, and flat wall monitors are producing a 'conversational' human-machine interface. These powerful trends will drive the next generation of information technology into the mainstream by about 2010. ... The following are a few of the advances in speech recognition, artificial intelligence, powerful chips, virtual environments, and flat-screen wall monitors that are likely to produce this intelligent interface. ... IBM has a Super Human Speech Recognition Program to greatly improve accuracy, and in the next decade Microsoft's program is expected to reduce the error rate of speech recognition, matching human capabilities. ... MIT is planning to demonstrate their Project Oxygen, which features a voice-machine interface. ... Amtrak, Wells Fargo, Land's End, and many other organizations are replacing keypad-menu call centers with speech-recognition systems because they improve customer service and recover investment in a year or two. ... General Motors OnStar driver assistance system relies primarily on voice commands, with live staff for backup; the number of subscribers has grown from 200,000 to 2 million and is expected to increase by 1 million per year. 
The Lexus DVD Navigation System responds to over 100 commands and guides the driver with voice and visual directions."
From Your Lips to Your Printer. By James Fallows. The Atlantic (December 2000). "First, the computer captures the sound waves the speaker generates, tries to filter them from coughs, hmmmms, and meaningless background noise, and looks for the best match with the phonemes available. (A phoneme is the basic unit of the spoken word.)"
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. By Daniel Jurafsky and James H. Martin. Prentice-Hall, 2000. Both the Preface and Chapter 1 are available online as are the resources for all of the chapters.
Speech Recognition Using Neural Networks. By John-Paul Hosom, Ron Cole, and Mark Fanty at the Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology. "There are four basic steps to performing recognition. ... First, we digitize the speech that we want to recognize; for telephone speech the sampling rate is 8000 samples per second. Second, we compute features that represent the spectral-domain content of the speech (regions of strong energy at particular frequencies). ... Third, a neural network (also called an ANN, multi-layer perceptron, or MLP) is used to classify a set of these features into phonetic-based categories at each frame. Fourth, a Viterbi search is used to match the neural-network output scores to the target words (the words that are assumed to be in the input speech), in order to determine the word that was most likely uttered." This tutorial also includes several diagrams that clarify many of the concepts.
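The first two steps quoted above (digitizing at 8000 samples per second and computing spectral-domain features) can be sketched in plain Python. This is an illustration under stated assumptions, not the tutorial's code: the 10 ms frame length, the synthetic 1000 Hz test tone, and the use of a naive discrete Fourier transform are choices made here. Practical front ends typically use an FFT with windowing and further feature processing, and the neural-network and Viterbi steps are not shown.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second, the telephone rate quoted above
FRAME_SIZE = 80      # assumed frame length: 80 samples = 10 ms


def synth_tone(freq_hz, n_samples):
    """Stand-in for digitized speech: a pure sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


def split_frames(samples, size):
    """Cut the signal into consecutive, non-overlapping frames."""
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


def spectral_energy(frame):
    """Energy in each frequency bin, computed with a naive DFT."""
    n = len(frame)
    energies = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        energies.append(abs(coeff) ** 2)
    return energies


signal = synth_tone(1000, 800)                 # 0.1 s of a 1000 Hz tone
features = [spectral_energy(f) for f in split_frames(signal, FRAME_SIZE)]

# The "region of strong energy" in each frame, converted from bin to Hz.
peak_bins = [max(range(len(e)), key=e.__getitem__) for e in features]
peak_hz = [b * SAMPLE_RATE / FRAME_SIZE for b in peak_bins]
print(len(features), peak_hz[0])
```

Each frame yields one feature vector (41 bin energies here); a classifier such as the neural network in step three would take vectors like these as input.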
Experts Use AI to Help GIs Learn Arabic. By Eric Mankin. USC News (June 21, 2004). "To teach soldiers basic Arabic quickly, USC computer scientists are developing a system that merges artificial intelligence with computer game techniques. The Rapid Tactical Language Training System, created by the USC Viterbi School of Engineering's Center for Research in Technology for Education (CARTE) and partners, tests soldier students with videogame missions in animated virtual environments where, to pass, the students must successfully phrase questions and understand answers in Arabic."
ACM Queuecasts.
Automatic Speech Recognition, Spring 2003. Staff Instructors: Dr. James Glass and Professor Victor Zue. Available from MIT OpenCourseWare. "6.345 is a course in the department's 'Bioelectrical Engineering' concentration. This course offers a full set of lecture slides with accompanying speech samples, as well as homework assignments and other materials used in the course. 6.345 introduces students to the rapidly developing field of automatic speech recognition. Its content is divided into three parts. Part I deals with background material in the acoustic theory of speech production, acoustic-phonetics, and signal representation. Part II describes algorithmic aspects of speech recognition systems including pattern classification, search algorithms, stochastic modelling, and language modelling techniques. Part III compares and contrasts the various approaches to speech recognition, and describes advanced techniques used for acoustic-phonetic modelling, robust speech recognition, speaker adaptation, processing paralinguistic information, speech understanding, and multimodal processing."
After Years of Effort, Voice Recognition Is Starting to Work. By Lee Gomes. The Wall Street Journal (January 10, 2007: page B1). "So maybe you won't be talking to your car anytime soon, the way Microsoft and Ford would like you to be. Odds are, though, that you are already on speaking terms with silicon, probably more than you realize. And you can expect to be chatting it up more and more. Almost since computers were invented, computer scientists have been working to get the machines to understand what people are saying to them. Until the past few years, they hadn't been successful enough to offer anything but lab demos. Now, though, computer speech recognition is sufficiently advanced that it is showing up in a surprising variety of places. Like automobiles. ... While voice-controlled computers are sci-fi staples, in practice most people find a keyboard and a mouse are fine for telling a PC what to do. Bill Meisel, a veteran observer of the speech-recognition market, says the main use of speech recognition at the moment is in specialized applications like law and medicine. Radiologists, for example, are increasingly dictating their diagnoses and observations into a speech-recognition program rather than into a tape recorder that must later be transcribed. At its core, speech recognition takes advantage of extraordinarily complex statistical methods to match the sounds you say with the right words. ... One of the biggest applications of the technology is in call centers. ... David Nahamoo, who oversees IBM's speech research, says that some other new applications are already at hand. One is a system that produces automatic translations of foreign-language broadcasts, such as those in Arabic, first by performing speech recognition of the spoken words and then by using translation software to render things in English."
IBM gets smart about Artificial Intelligence. By Pamela Kramer. IBM Think Research (June 2001). "Computer vision is important to speech recognition, too. Visual cues help computers decipher speech sounds that are obscured by environmental noise. Chalapathy Neti, manager of IBM's audiovisual speech technologies (AVST) group at Watson, often cites HAL's lip-reading ability in 2001 in promoting the group's work."
The Power of Speech. By Lawrence Rabiner, Center for Advanced Information Processing, Rutgers University. Science (September 12, 2003; Volume 301, Number 5639: 1494-1495). "In the multimedia world of future communications, speech will play an increasingly important role. From speaker verification to automatic speech recognition and the understanding of key phrases by computers, the spoken word will replace keyboards and pointing devices like the mouse. In his Perspective, Rabiner discusses recent advances and remaining challenges in the processing of speech by communication devices. The key challenge is to make the user interface for 21st-century services and devices as easy to learn and use as a telephone is today for voice conversations."
Computers That Speak Your Language. By Wade Roush. Technology Review (June 2003).
Related Videos from the AAAI Video Archive: Speech (includes tagged articles)
The Centre for Speech Technology Research at the University of Edinburgh [CSTR]: "Founded in 1984, CSTR is concerned with research in all areas of speech technology including speech recognition, speech synthesis, speech signal processing, information access, multimodal interfaces and dialogue systems. We have many collaborations with the wider community of researchers in language, cognition and machine learning for which Edinburgh is renowned." Be sure to see their collection of current research projects.
The Meeting Recorder Project at ICSI [The International Computer Science Institute]. "Despite recent advances in speech recognition technology, successful recognition is limited to co-operative speakers using close-talking microphones. There are, however, many other situations in which speech recognition would be useful - for instance to provide transcripts of meetings or other archive audio. Speech researchers at ICSI, UW, SRI, and IBM are very interested in new application domains of this kind, and we have begun to work with recorded meeting data." - from the Introduction
NovaSpeech: "developing next-generation speech technologies and related educational materials. ... Our research and development projects build on our team's extensive experience and expertise in multi-language and multi-voice speech synthesis, speech perception, linguistics, digital signal processing, acoustic phonetics, software development, and speech product development and marketing."
Quantifying Room Acoustic Quality Using Artificial Neural Networks Project. Salford Acoustics Audio and Video at the University of Salford. "This project was concerned with spaces where good acoustics are required for speech. Such spaces include shopping malls and railway stations where announcements need to be intelligible, and theatres where the quality of sound plays a crucial role in the enjoyment of a performance. The project researched a novel measurement technique intended to increase understanding of acoustics by enabling in-use, non-invasive evaluation of room acoustics to be made. ... The measurement system proposed derives the acoustic quality from a speech signal as received by a microphone in a room. Neural networks learn how to extract the determining characteristics from the speech signals that lead to the objective parameters. In this way, the neural networks predict the reverberation time, early decay time, STI (Speech Transmission Index) and RASTI (RApid Speech Transmission Index). In addition to enabling occupied measurements, the development of the neural network sensing system is of academic interest, as it is forming an artificial intelligence system to mimic the behaviour of human perception."
Speech at CMU Web Page. An extensive collection of speech resources from Carnegie Mellon University with links to many exciting projects (both at CMU and around the world).
Dennis Klatt's History of Speech Synthesis. "Audio clips of synthetic speech illustrating the history of the art and technology of synthetically produced human speech."
Other References Offline
Aaron, A., Eide, E., and Pitrelli, J.F., Conversational Computers. Scientific American, v. 292, no. 6, June 2005, pp. 64-69. (Subscription required.) "Call a large company these days, and you will probably start by having a conversation with a computer. Until recently, such automated telephone speech systems could string together only prerecorded phrases. ... Computer-generated speech has improved during the past decade, becoming significantly more intelligible and easier to listen to. But researchers now face a more formidable challenge: making synthesized speech closer to that of real humans--by giving it the ability to modulate tone and expression, for example--so that it can better communicate meaning. This elusive goal requires a deep understanding of the components of speech and of the subtle effects of a person's volume, pitch, timing and emphasis. That is the aim of our research group at IBM and those of other U.S. companies, such as AT&T, Nuance, Cepstral and ScanSoft, as well as investigators at institutions including Carnegie Mellon University, the University of California at Los Angeles, the Massachusetts Institute of Technology and the Oregon Graduate Institute."
Erman, Lee D., Frederick Hayes-Roth, Victor R. Lesser, and D. Raj Reddy. 1980. The Hearsay-II Speech-Understanding System: Integrating Knowledge to Resolve Uncertainty. ACM Computing Surveys 12(2): 213-253. "The Hearsay-II speech-understanding system ... recognizes connected speech in a 1000-word vocabulary with correct interpretations for 90 percent of test sentences. Its basic methodology involves the application of symbolic reasoning as an aid to signal processing. A marriage of general artificial intelligence techniques with special acoustic and linguistic knowledge was needed to accomplish satisfactory speech-understanding performance." (Available free to subscribers only.)
Developments in Artificial Intelligence, Chapter 9 of Funding a Revolution: Government Support for Computing Research. Committee on Innovations in Computing and Communications: Lessons from History, Computer Science and Telecommunications Board, Commission on Physical Sciences, Mathematics, and Applications, National Research Council. Washington, D.C.: National Academy Press, 1999. "SUCCESS IN SPEECH RECOGNITION - The history of speech recognition systems illustrates several themes common to AI research more generally: the long time periods between the initial research and development of successful products, and the interactions between AI researchers and the broader community of researchers in machine intelligence. Many capabilities of today's speech-recognition systems derive from the early work of statisticians, electrical engineers, information theorists, and pattern-recognition researchers. Another key theme is the complementary nature of government and industry funding. Industry supported work in speech recognition at least as far back as the 1950s, when researchers at Bell Laboratories worked on systems for recognizing individual spoken digits 'zero' through 'nine.'"