
Applying machine learning in recognizing Vietnamese speech


Student: Nguyễn Duy Thái Sơn
Supervisor: Đặng Đình Quân, MSc

Abstract- Speech recognition is a field with many practical and helpful applications, and those applications are becoming more popular, portable, and accessible to the general public. However, speech recognition applications for the Vietnamese language are not yet as accessible as those for more widely supported languages. Nevertheless, over the past decade, many Vietnamese researchers have devoted their time to developing and improving speech recognition solutions for their mother tongue. Acknowledging the advances made in Vietnamese speech recognition, this paper examines the steps required to create a portable, accessible solution that lets Vietnamese speakers conveniently make their utterances understood by mobile systems, and then attempts to develop such a solution.

Keywords- Speech recognition, machine learning, multipart/form-data, Android

I. INTRODUCTION

Speech recognition, the ability of computers to understand human utterances and turn them into data, is a useful subfield of computer science with popular, thriving applications. Smart home systems such as Amazon Echo and Google Home can process and execute their owners' commands at a moment's notice; operating systems offer speech as an alternative method of interaction and input for people with visual or motor impairments; and doctors use speech recognition-enabled software to take notes while performing surgical operations. The possibilities that this technology brings are promising and seemingly limitless.

II. BACKGROUND

To help Vietnamese people take advantage of the benefits of automatic speech recognition (ASR) systems, many scholars and researchers have devoted their time to developing ASR systems for the Vietnamese language. A large-vocabulary continuous ASR system with a Word Error Rate (WER) of 11.7% on digital newspaper speech was constructed using a statistical language model [1]. Other scholars turned to neural network-based models and obtained similar results [2]. Efforts have also been made to find the most suitable acoustic models for Vietnamese using methods such as "knowledge-based and data-driven phone mapping" [3] or using "diphthongs and triphthongs as phonemes" rather than just vowels and monophthongs [4] [5]. Large tech firms such as Viettel and FPT have invested in the field and currently offer commercial speech-to-text solutions to their customers.

Researching and developing Vietnamese speech solutions requires a large amount of Vietnamese speech data. While many institutions keep their data for internal or commercial use, others have made their datasets available to the public, or upon request, for researchers working on speech recognition projects.
These datasets are discussed further in the next section.

III. COLLECTING SPEECH DATA

A pivotal part of any speech recognition research is collecting speech utterances from actual speakers. For this purpose, we used corpora that were previously collected for research and are either publicly available [6] or granted upon request [7]. Furthermore, we invited ten speakers (eight male and two female) from various parts of the country, with different accents, to record around two hours of speech. We gave each of them an Android .apk file to install our speech collection tool, along with a script of 99 phrases composed mostly of the most frequently used words in the Vietnamese language: và, của, có, các, là, không, cho, etc. [8].

The speech collection tool uses the MediaRecorder API to record speech directly from the device's microphone and saves it in the .3gp format for data efficiency (7 KB per file on average). The app then sends an HTTP POST request to a PHP form, using the multipart/form-data encoding type so that the content of the recorded file can be sent to the server. The server attempts to process the file and returns a status accordingly. If processing succeeds, the file is persisted on the server, where it can later be accessed using an FTP client. The files are then manually transcribed and renamed for later use.

Figure 1. Example multipart/form-data POST request
Figure 2. Multipart/form-data POST request code
Figure 3. PHP code to process data from POST request
Figure 4. The SpeechCollect app

IV. ADAPTING THE MODEL

Unfortunately, since training an acoustic model from scratch requires a lot ...
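
As a rough illustration of the upload step described in Section III, the following Python sketch assembles a multipart/form-data request body for a single recorded file. It is not the Android or PHP code shown in the paper's figures, which is not reproduced in this extract. The helper name `build_multipart_body`, the form field name `audio`, the file name `sample.3gp`, and the fixed boundary `demo` are all hypothetical choices made for the example; `audio/3gpp` is assumed as the content type for .3gp recordings.

```python
import uuid


def build_multipart_body(field_name, filename, payload,
                         content_type="audio/3gpp", boundary=None):
    """Encode one file as a multipart/form-data body (hypothetical helper).

    Returns a tuple: (body bytes, value for the Content-Type request header).
    """
    # A random boundary makes a collision with the payload bytes unlikely.
    boundary = boundary or uuid.uuid4().hex

    # Part header: names the form field and the uploaded file, then a blank
    # line separates the header from the raw file content.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")

    # Closing delimiter: the boundary followed by two extra hyphens.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")

    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


if __name__ == "__main__":
    # Placeholder bytes stand in for a real .3gp recording.
    body, header = build_multipart_body(
        "audio", "sample.3gp", b"fake-audio-bytes", boundary="demo"
    )
    print("Content-Type:", header)
    print(body.decode("latin-1"))
```

The returned body and header value would be sent together in an HTTP POST request; on the server side, a PHP script typically reads such an upload through the field name used in the `Content-Disposition` line.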
