17/01/2020

Data augmentation and generation for natural language processing

This job offer has expired


  • ORGANISATION/COMPANY
    CNRS
  • RESEARCH FIELD
    Computer science
    Engineering
    Mathematics
  • RESEARCHER PROFILE
    First Stage Researcher (R1)
  • APPLICATION DEADLINE
    07/02/2020 23:59 - Europe/Brussels
  • LOCATION
    France › ORSAY
  • TYPE OF CONTRACT
    Temporary
  • JOB STATUS
    Full-time
  • HOURS PER WEEK
    35
  • OFFER STARTING DATE
    01/03/2020

LIMSI, a strongly multidisciplinary laboratory, hosts researchers from Engineering Sciences and Computer Science, as well as from Life Sciences and Social Sciences. Its scientific fields cover Natural Language Processing, Human-Machine Interaction, Augmented and Virtual Reality, Fluid Mechanics, and Energetics.
We share a common long-term goal: to improve human well-being in the surrounding environment, from both a material and an immaterial point of view.

Natural language processing, and automatic document understanding in particular, aims to provide methods for extracting relevant information from documents. The most effective approaches today rely on supervised machine learning and very large amounts of manually annotated examples.

In many domains, applications, and languages, it is almost impossible to obtain the necessary data. Many studies over the past several years have addressed this issue through active learning (see for example [Tomanek and Hahn, 2009] for sequence labeling tasks such as the recognition and identification of named entities) or through transfer learning, domain adaptation, or multi-task learning (see for example [Arnold et al., 2008] for entity recognition with adaptation to new domains, genres, and concepts). More recently, work has examined the feasibility of transfer learning in NLP with neural networks [Mou et al., 2016].

Another possible approach is to generate annotated data according to the application objective. This has been the subject of much work in machine translation, where comparable but initially non-aligned corpora are used to produce usable aligned segments (see for example [Abdul-Rauf and Schwenk, 2009]), and in term extraction (see for example [Tamura et al., 2012]). In the context of task-oriented dialogue systems, Bordes and his colleagues [Bordes et al., 2016] generated data from sentence patterns and from knowledge bases related to the task and domain (restaurant reservation). However, the artificial and systematic nature of this generation method has shown its limitations. In particular, for understanding and intent detection, almost any system can achieve an excellent result if it is not tested on natural data [Williams et al., 2017]. This issue has also been addressed at LIMSI [Neuraz et al., 2018].
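
As a minimal sketch of the pattern-based generation described above, the following Python snippet fills sentence patterns with values from a toy knowledge base and emits token-level BIO labels. The knowledge base, the patterns, and the slot names are invented for illustration and are not taken from [Bordes et al., 2016].

```python
import itertools
import re

# Hypothetical toy knowledge base for a restaurant-reservation domain.
KB = {
    "cuisine": ["italian", "french"],
    "city": ["paris", "rome"],
}

# Hypothetical sentence patterns; {slot} marks a position filled from KB.
PATTERNS = [
    "book a {cuisine} restaurant in {city}",
    "i want {cuisine} food",
]

def fill(pattern, values):
    """Return (tokens, BIO tags) for one pattern and one slot assignment."""
    tokens, tags = [], []
    for word in pattern.split():
        match = re.fullmatch(r"\{(\w+)\}", word)
        if match:
            slot = match.group(1)
            value_tokens = values[slot].split()
            tokens.extend(value_tokens)
            # First token of a slot value gets B-, any following tokens get I-.
            tags.extend(["B-" + slot] + ["I-" + slot] * (len(value_tokens) - 1))
        else:
            tokens.append(word)
            tags.append("O")
    return tokens, tags

def generate():
    """Enumerate every pattern with every combination of slot values."""
    for pattern in PATTERNS:
        slots = re.findall(r"\{(\w+)\}", pattern)
        for combo in itertools.product(*(KB[s] for s in slots)):
            yield fill(pattern, dict(zip(slots, combo)))

examples = list(generate())
for tokens, tags in examples:
    print(list(zip(tokens, tags)))
```

Every generated sentence follows one of a handful of fixed patterns, which illustrates why a model trained and tested only on such data can look deceptively strong.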

This thesis topic addresses two problems:
- How can synthetic data be generated? This question covers both the generation method (simple methods such as patterns, as well as learning methods based on neural networks, such as GANs or variational auto-encoders) and the robustness of the generated data for training an NLP analysis model (with various possible applications).
- How can the combinatorial explosion of what can be generated be controlled? Various methods will be explored here, in particular a priori filtering, in which feedback from the learning system guides the generation, and a posteriori filtering, by sampling the generation space.
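
To make the combinatorial explosion concrete, here is a minimal Python sketch of one possible reading of "sampling the generation space": the number of sentences a pattern can produce is the product of its slot vocabulary sizes, and a random subset can be drawn by decoding random integers instead of enumerating every sentence. The pattern, the slot vocabularies, and the function names are hypothetical, and uniform sampling is only one of many possible strategies.

```python
import math
import random

# Hypothetical slot vocabularies; real ones would be far larger.
SLOTS = {
    "cuisine": ["italian", "french", "thai"],
    "city": ["paris", "rome", "lyon", "nice"],
    "party": ["two", "four"],
}
PATTERN = "a table for {party} at a {cuisine} place in {city}"

def space_size(slots):
    """Number of distinct slot assignments (product of vocabulary sizes)."""
    return math.prod(len(values) for values in slots.values())

def decode(index, slots):
    """Map an integer in [0, space_size) to one slot assignment (mixed radix)."""
    assignment = {}
    for name, values in slots.items():
        index, position = divmod(index, len(values))
        assignment[name] = values[position]
    return assignment

def sample_sentences(k, seed=0):
    """Draw k distinct sentences uniformly without enumerating the space."""
    rng = random.Random(seed)
    indices = rng.sample(range(space_size(SLOTS)), k)
    return [PATTERN.format(**decode(i, SLOTS)) for i in indices]

print(space_size(SLOTS))  # 3 * 4 * 2 = 24 possible sentences
for sentence in sample_sentences(5):
    print(sentence)
```

Because the sample is drawn over integer indices, the cost depends on the number of sentences kept rather than on the size of the full space, which grows multiplicatively with every added slot or vocabulary entry.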

This thesis will take place within the framework of the PSPC AIDA project. The application domain will be information extraction. In order to verify the generality of the proposed approaches and methods, other application areas will be explored, such as automatic speech understanding in dialogue systems or medical NLP tasks such as patient file analysis.

[Abdul-Rauf and Schwenk, 2009] Abdul-Rauf, S. and Schwenk, H. (2009). On the use of comparable corpora to improve SMT performance. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 16–23. Association for Computational Linguistics.
[Arnold et al., 2008] Arnold, A., Nallapati, R., and Cohen, W. W. (2008). Exploiting feature hierarchy for transfer learning in named entity recognition. In Proceedings of ACL-08: HLT, pages 245–253.
[Bordes et al., 2016] Bordes, A., Boureau, Y.-L., and Weston, J. (2016). Learning End-to-End Goal-Oriented Dialog. arXiv:1605.07683 [cs].
[Mou et al., 2016] Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., and Jin, Z. (2016). How transferable are neural networks in NLP applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 479–489.
[Neuraz et al., 2018] Neuraz, A., Campillos Llanos, L., Burgun, A., and Rosset, S. (2018). Natural language understanding for task oriented dialog in the biomedical domain in a low resources context. In Machine Learning for Health (ML4H): Moving beyond supervised learning in healthcare, NIPS Workshop, Montréal, Québec, Canada.
[Tamura et al., 2012] Tamura, A., Watanabe, T., and Sumita, E. (2012). Bilingual lexicon extraction from comparable corpora using label propagation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 24–36. Association for Computational Linguistics.
[Tomanek and Hahn, 2009] Tomanek, K. and Hahn, U. (2009). Semi-supervised active learning for sequence labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1039–1047. Association for Computational Linguistics.
[Williams et al., 2017] Williams, J. D., Asadi, K., and Zweig, G. (2017). Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 665–677, Vancouver, Canada. Association for Computational Linguistics.

Required Research Experiences

  • RESEARCH FIELD
    Engineering
  • YEARS OF RESEARCH EXPERIENCE
    None
  • RESEARCH FIELD
    Computer science
  • YEARS OF RESEARCH EXPERIENCE
    None
  • RESEARCH FIELD
    Mathematics
  • YEARS OF RESEARCH EXPERIENCE
    None

Offer Requirements

  • REQUIRED EDUCATION LEVEL
    Engineering: Master Degree or equivalent
    Computer science: Master Degree or equivalent
    Mathematics: Master Degree or equivalent
  • REQUIRED LANGUAGES
    FRENCH: Basic
Work location(s)
1 position(s) available at
Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
France
ORSAY

EURAXESS offer ID: 482166
Posting organisation offer ID: 13869
