PhD student | Google Scholar | Contact me: yoohao.zhang@gmail.com
Research interests: speech translation, machine translation, and multi-task learning.
Served as a reviewer for NeurIPS, ACL, AAAI, EMNLP, ICASSP, and other conferences on multiple occasions.
An end-to-end speech-to-text toolkit, NiuTrans.ST | 2023.09-present |
The toolkit supports three tasks: ASR, MT, and ST. It is built on the NiuTensor library, and our goal is a high-speed inference framework for speech-to-text generation.
Project website: https://github.com/xiaozhang521/NiuTrans.ST
Soft alignment method for end-to-end speech translation (ICASSP2024) | 2023.05-2023.09 |
To address the inconsistency between speech and text representations, we propose a soft alignment method in the representation space that leverages an adversarial training strategy. The approach not only achieves strong speech translation performance but also handles speech recognition, text translation, and speech translation with a single model, whose performance closely matches that of the individually trained models.
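As an illustration of adversarial representation alignment in general (not the exact architecture of this paper), the minimal sketch below uses a gradient-reversal layer and a modality discriminator; the module names, the mean-pooling step, and the loss setup are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Reverses gradients so the encoders learn to fool the modality discriminator."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lamb * grad, None

class ModalityDiscriminator(nn.Module):
    """Predicts whether a pooled representation came from speech (1) or text (0)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, 1))

    def forward(self, states, lamb: float = 1.0):
        pooled = GradReverse.apply(states.mean(dim=1), lamb)  # pool over time, reverse grads
        return self.net(pooled).squeeze(-1)

# Toy usage: push speech/text encoder outputs toward indistinguishability.
disc = ModalityDiscriminator(d_model=256)
speech_h, text_h = torch.randn(4, 120, 256), torch.randn(4, 30, 256)
logits = torch.cat([disc(speech_h), disc(text_h)])
labels = torch.cat([torch.ones(4), torch.zeros(4)])
adv_loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
```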
An End-to-End Automatic Speech Recognition Method Based on Multi-scale Modeling (CCL2023) | 2023.02-2023.05 |
We design a multi-scale modeling method that aligns the original audio features to phonemes, characters, and sub-words step by step, and a gated network that integrates the multi-scale features. This method effectively alleviates the modality gap caused by length inconsistency.
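A minimal sketch of the gated fusion step, assuming the phoneme-, character-, and sub-word-level features have already been brought to a common length and dimension; the module name and shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class GatedMultiScaleFusion(nn.Module):
    """Minimal sketch: fuse features from several granularities with a learned gate.

    Assumes all scales are already aligned to shape (batch, T, d_model).
    """
    def __init__(self, d_model: int, num_scales: int = 3):
        super().__init__()
        self.gate = nn.Linear(num_scales * d_model, num_scales)

    def forward(self, features):                          # list of (B, T, d) tensors
        stacked = torch.stack(features, dim=-2)           # (B, T, S, d)
        concat = torch.cat(features, dim=-1)              # (B, T, S*d)
        weights = torch.softmax(self.gate(concat), -1)    # (B, T, S) gate per scale
        return (weights.unsqueeze(-1) * stacked).sum(-2)  # (B, T, d)

# Toy usage with phoneme/character/sub-word scale features
fusion = GatedMultiScaleFusion(d_model=256)
feats = [torch.randn(2, 50, 256) for _ in range(3)]
fused = fusion(feats)  # (2, 50, 256)
```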
End-to-end Speech Translation Evaluation (IWSLT2023) | 2023.02-2023.04 |
We participated in the IWSLT 2023 English-to-Chinese offline end-to-end speech translation (constrained data) track, mainly implementing multi-task learning, multi-scale modeling, stacked acoustic-and-textual encoding, and SHAS-based VAD segmentation. Our system ranked 1st in this track.
Information Magnitude Based Dynamic Sub-sampling for Speech-to-text (Interspeech2023) | 2022.05-2023.01 |
We design a dynamic down-sampling strategy for audio frames to address the problem of long speech sequences. A Gaussian mixture model is used to distinguish the information magnitude of each frame, and the sampling stride is then set dynamically according to that magnitude. For speech translation and speech recognition, this method achieves a larger compression ratio than static down-sampling without losing performance.
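A minimal sketch of the idea, measuring frame magnitude by the mean absolute feature value and assigning larger strides to lower-magnitude mixture components; the component count and stride values are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dynamic_subsample(frames: np.ndarray, strides=(1, 2, 4)) -> np.ndarray:
    """Minimal sketch: choose a per-frame stride from the frame's information magnitude.

    frames: (T, d) acoustic features. A GMM is fit on per-frame magnitudes;
    frames in higher-magnitude components are kept more densely.
    """
    magnitude = np.abs(frames).mean(axis=1, keepdims=True)              # (T, 1)
    gmm = GaussianMixture(n_components=len(strides), random_state=0).fit(magnitude)
    # Components with larger mean magnitude get the smaller stride.
    order = np.argsort(-gmm.means_.ravel())
    comp_to_stride = {comp: strides[rank] for rank, comp in enumerate(order)}

    labels = gmm.predict(magnitude)
    kept, t = [], 0
    while t < len(frames):
        kept.append(frames[t])
        t += comp_to_stride[labels[t]]
    return np.stack(kept)

# Toy usage: 1000 frames of 80-dim filter-bank features
sampled = dynamic_subsample(np.random.randn(1000, 80))
```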
Rethinking and Improving Multi-task Learning for End-to-end Speech Translation (EMNLP2023) | 2022.06-2023.06 |
Through a quantitative analysis of when and how strongly speech interacts with the auxiliary text tasks, we find that the length gap between the speech and text modalities hinders the alignment methods, and that the difference between speech and text representations remains significant. In light of these observations, we design a lookback mechanism and a local-to-global training method, which improve performance and achieve state-of-the-art (SOTA) results under limited data.
End-to-end Speech Translation Evaluation (IWSLT2022) | 2022.02-2022.05 |
We participated in the offline end-to-end English-to-Chinese speech translation task, mainly using decoupled pre-training, multi-stage pre-training, and multi-view fusion, further combined with VAD-consistency training and model ensembling.
Improving end-to-end speech translation by leveraging auxiliary speech and text data (AAAI2023) | 2021.03-2021.11 |
To address the problem of data scarcity in end-to-end speech translation training, we propose a multi-stage pre-training strategy for building the speech translation system. Our method can use all types of unlabeled and labeled text data, as well as speech data, and achieves a new state-of-the-art performance. The work was partially completed during an internship at the Information Security Department of Tencent.
Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders (ACL2021) | 2020.09-2021.01 |
To address the issue of unstable training in end-to-end speech translation models, we propose a decoupled pre-training method that separately pre-trains the speech recognition model and the text translation model; an adapter is then employed to integrate the two. We are the first to show an end-to-end system outperforming the cascade system under limited data.
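A minimal structural sketch of stacking a pre-trained ASR encoder and a pre-trained MT encoder through an adapter; the module interfaces and the adapter design below are placeholders, not the released implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Minimal sketch: map acoustic encoder states into the textual encoder's space."""
    def __init__(self, d_acoustic: int, d_text: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_acoustic, d_text), nn.ReLU(),
                                  nn.Linear(d_text, d_text))
        self.norm = nn.LayerNorm(d_text)

    def forward(self, acoustic_states):               # (B, T, d_acoustic)
        return self.norm(self.proj(acoustic_states))  # (B, T, d_text)

class StackedSTEncoder(nn.Module):
    """Pre-trained ASR encoder -> adapter -> pre-trained MT encoder (placeholders)."""
    def __init__(self, asr_encoder: nn.Module, mt_encoder: nn.Module, adapter: Adapter):
        super().__init__()
        self.asr_encoder, self.adapter, self.mt_encoder = asr_encoder, adapter, mt_encoder

    def forward(self, speech_features):
        acoustic = self.asr_encoder(speech_features)   # acoustic (ASR) representation
        textual_in = self.adapter(acoustic)            # bridge the two spaces
        return self.mt_encoder(textual_in)             # textual (MT) representation

# Toy usage with identity placeholders standing in for the pre-trained encoders
enc = StackedSTEncoder(nn.Identity(), nn.Identity(), Adapter(80, 80))
out = enc(torch.randn(2, 100, 80))
```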
Tensor computing library for offline devices | 2020.11-2020.12 |
We use the OpenCL interface of the Arm Compute Library to rewrite the operations in NiuTensor and run them on Mali-architecture chips. The system has been applied to a translation pen scanner product.
Quantifying Transfer Learning for Multilingual Neural Machine Translation (Under review) | 2020.06-2021.03 |
We conduct a quantitative analysis of transfer and interference between high-resource and low-resource languages in multilingual translation models, and find that transfer mainly occurs in the early training stage, while interference arises primarily in the later stage.
Machine Translation Evaluation (WMT2020) | 2020.03-2020.05 |
We participated in three high-resource tasks (English to/from Japanese, English to Chinese) and two low-resource tasks (Tamil to English and Inuktitut to English). Our training strategies included multilingual models, large-capacity models, iterative fine-tuning, pseudo-data generation with top-p sampling, domain adaptation with a pre-trained language model, and more. English to/from Japanese ranked first in automatic evaluation; English to Japanese and Inuktitut to English won first place in human evaluation.
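For reference, the sampling step used for pseudo-data generation can be sketched as follows; this is a generic nucleus (top-p) sampling routine for a single decoding step, not our competition code.

```python
import torch

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Minimal sketch of nucleus (top-p) sampling for one decoding step.

    logits: (vocab,) unnormalized scores. Keeps the smallest set of tokens whose
    cumulative probability reaches p, then samples from that set.
    """
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1      # tokens inside the nucleus
    keep_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return sorted_idx[:cutoff][torch.multinomial(keep_probs, 1)]

# Toy usage: sample one token id from a random distribution over a 32k vocabulary
token = top_p_sample(torch.randn(32000), p=0.9)
```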
Learning architectures from an extended search space for language modeling (ACL2020) | 2019.10-2019.12 |
We apply neural architecture search to neural machine translation. We first use a differentiable method (e.g., DARTS) to search for intra-cell and inter-cell structures on a language modeling task, and then transfer the searched structure to neural machine translation. Experimental results show that the searched structure improves performance on the IWSLT English-to-Vietnamese task.
Inference acceleration method of neural machine translation system based on coarse-to-fine (CCMT2019) | 2019.06-2019.08 |
We accelerate the attention operation during inference of the Transformer model. Using the attention distribution of each layer, we compute its information entropy as a measure of how much information that layer carries. We find that the amount of information differs across layers while the amount of computation is the same, so we design a coarse-to-fine method that compresses the parameters of the attention computation, improving decoding speed by about 10% without performance degradation.
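The per-layer information measure can be illustrated with a minimal sketch that averages the entropy of the attention distributions over heads and positions; the exact measurement and aggregation used in the paper may differ.

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> float:
    """Minimal sketch: average entropy of the attention distributions in one layer.

    attn: (batch, heads, query_len, key_len) attention weights summing to 1 over
    the last dimension. Lower average entropy suggests more concentrated, more
    informative attention for that layer.
    """
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (batch, heads, query_len)
    return entropy.mean().item()

# Toy usage: compare the entropy of saved attention maps across six layers
layers = [torch.softmax(torch.randn(2, 8, 20, 20), dim=-1) for _ in range(6)]
per_layer = [attention_entropy(a) for a in layers]
```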
Machine Translation Evaluation (WMT2019) | 2019.02-2019.04 |
We participated in the Gujarati-to-English translation task, primarily using transfer learning, linguistic prior knowledge, back-translation with diversity, ensemble search, multi-feature re-ranking, and the DLCL deep network. Our system ranked first in both automatic and human evaluation.
NiuTensor Deep Learning Open Source Computing Library | 2018.01-present |
We have developed an efficient tensor computing library (similar to TensorFlow or PyTorch) named NiuTensor, based on C++ and CUDA. Its main features are: 1. low-level, high-efficiency CUDA operators; 2. support for speech-to-text and text generation tasks; 3. support for mobile devices; 4. optimized inference for neural machine translation (kernel fusion, high-concurrency GPU algorithms). The project has been applied to the online system of NiuTrans.
Project website: https://github.com/NiuTrans/NiuTensor |