We all speak different languages. In fact, there are over 7000 languages in the world, so communication can get tricky at times. In recent years, technological solutions such as Machine Translation (MT) have been helping to bring language barriers down. Unfortunately, some languages remain an enigma to most of us. However, every language deserves to be heard and should benefit from the translation tools powered by Machine Learning systems.

This is why the OPUS project was created: to make every language, however big or small, accessible to everyone across the globe.

OPUS is a project managed by Jörg Tiedemann, Professor of Language Technology at the University of Helsinki, with help from his students and fellow researchers. It is a Finnish-based open access database that provides an incredible and, more importantly, free collection of multilingual texts in different styles and genres. Since its inception in 2003, OPUS has been supported by various organisations, including the University of Helsinki, the Nordic Language Processing Laboratory, and the Finnish IT Center for Science. However, OPUS remains a side project of Professor Tiedemann's, without regular funding or any income. Despite being a ‘hobby’, OPUS has already produced over 1000 free MT models under the label of OPUS-MT, ready for online translation, and gets up to a thousand visits from across the globe daily.

OPUS’s mission is to collect as much electronic textual data in multiple languages as possible - especially for languages currently underrepresented in data-driven natural language processing (NLP). This way, every language becomes more accessible to anyone interested - be it a language enthusiast, a language learner looking for real-world sample texts, or a scholar looking to solve a linguistic puzzle. As the OPUS creators describe it: ‘The mission of OPUS-MT is to provide open translation services and tools that are free from commercial interests and restrictions’.

To improve MT, OPUS is looking primarily for textual data that has not yet been made openly accessible. Any individual or organisation holding the copyright licence to a text can donate it to OPUS. The data are then vetted, processed, and uploaded by Professor Tiedemann’s team, becoming available to anyone interested. The collected texts help in creating pre-trained translation models that can be downloaded and used for free - an unprecedented solution in the largely commercialized world of online translation.

Among those who have already contributed are a number of researchers and organisations, including the Wikimedia Foundation. If you want to find out more about OPUS, you can read this research paper by the OPUS researchers. To donate a bilingual or multilingual text, please contact Jörg Tiedemann.

Interview with the OPUS creator 

To learn more about OPUS and its role in promoting Machine Translation, we asked Professor Tiedemann to tell us how the project came into being. 

How did OPUS come into being? Whose idea and realization was it? How did you find the partners & sponsors? 

OPUS started as a spontaneous idea during a summer school in Norway in 2002. Doctoral students from various Nordic countries came together in a rather remote but very beautiful place in the Norwegian mountains to talk about topics in computational linguistics and lexicography. During a farewell coffee, the two of us, Lars Nygaard and I, discussed making translation data available to everyone to push research in machine translation and cross-lingual lexicography. The idea was to mainly use localization data from open source projects and other public resources to prepare a growing resource in many languages. We started the project and released the initial data set and paper in 2003. It was very small and limited but already received a good response, as nothing like this had existed before.

Motivated by the positive feedback, we carried on and slowly improved the setup and coverage. When Lars left academia, it basically became my personal long-term project that ran on the side of my regular work. OPUS never had extensive funding or any sponsors. It has always been a side project, basically managed by myself with input from the community. OPUS first received some official funding in 2016, when the virtual Nordic Language Processing Laboratory (NLPL) was established, and it continues to this day under the umbrella of this lightweight organisation. OPUS basically continues as a non-funded project without sponsors or income, and lives from the support of the academic institution that pays my salary as well as other projects that produce data sets and components that can be integrated into OPUS. Acknowledgements for the current setup go to the University of Helsinki and CSC, the Finnish IT center for scientific computing.

Who is working on this project? 

OPUS is basically managed by myself alone. Data sets are mainly produced by the research community, and related research projects provide components and tools that we use to run the system. Recently, activities have expanded to adjacent projects such as OPUS-MT (with some partial funding from the European Language Grid), OPUS-CAT and OpusFilter. Other OPUS-specific tools have been created by various people funded by various related projects. Contributions came from PhD and master's students and post-doctoral researchers such as Sami Virpioja, Mikko Aulamo, Tommi Nieminen and Umut Sulubacak. Data contributions come from various places, and without them OPUS would not exist. Thanks to all of them!

What is the language scope of the database - if any? 

There is no limit, and OPUS would like to cover as many of the world’s languages as possible. The aim is to grow as much as possible and to constantly improve quality and language coverage. We aim for better support of under-resourced languages and would like to emphasize resource development for minority languages, language varieties and languages that are not supported well by current NLP and technology.

What is the purpose of this project and who (and how) can benefit from it? 

The purpose of the project is to make data available for many languages in order to improve NLP (and machine translation in particular) for a growing number of the world’s languages. The idea is to create a convenient data hub for research and development, with simple and straightforward access to open data, so that it is easy to train and develop tools that can benefit everyone.

OPUS wants to contribute to efforts that reduce language barriers, and it supports open source projects that make advances in language technology available to the wider public with better support for the world's linguistic diversity. The goal is to help researchers, developers and end-users of NLP technology. Furthermore, OPUS also helps researchers in general linguistics and translation studies to study cross-lingual phenomena on real-world data. All in all, OPUS tries to create a better awareness of linguistic diversity and the benefits that open data sets can bring about.

Is there a way that readers of our blog can contribute or help OPUS get bigger and better? Are you looking for any collaborators or partnerships? 

OPUS lives from data, and contributions of translated material are more than welcome. Many language pairs have very limited resources, and languages die without proper digital support. We welcome contributions of texts in under-resourced languages together with their translations, as well as anything that can help to integrate the data sets into the collection.

We already collaborate with, e.g., the Wikimedia Foundation to make machine translation a tool that can speed up the creation of additional Wikipedia pages in other languages in order to improve language coverage in this important knowledge source. We also need help with quality control and would appreciate support in reducing the noise in our data sets. Feel free to get in touch!

We would like to thank Professor Tiedemann for the interesting insight into the OPUS project. We are sure that it is exactly this kind of project that helps us to connect people and knowledge across borders.

This blog was authored by RWS's Tomáš Burkert, Senior Solutions Architect, and Inna Bell, Project Manager.