Artificial Intelligence (AI) is the biggest technology breakthrough of our time, and in its most accessible form, machine learning, it is a disruptor for many sectors across the whole spectrum of manufacturing and services. AI fundamentally changes how we humans, in our roles as consumers, citizens, patients, and passengers, interact with businesses, devices, robots, vehicles, governments, health systems, networks, machines, and… with each other. Language Data for AI (or LD4AI) is a new industry sub-sector that attracts, on the one hand, well-established language service companies and, on the other hand, newcomers with innovative data science and Natural Language Processing strategies and roots in crowdsourcing. In this presentation, Anne-Maj van der Meer and Miloš Milovanović will look into the implications of this emerging sector, discuss the trends, new skills, and job profiles, and introduce the Data Marketplace, the platform for language data monetization and acquisition.
Bryan Montpetit 0:06
What we have now is a presentation called Language Industry Transformation Now and Tomorrow, and it's going to be presented by Miloš. And I apologize in advance for his family name, I'm going to try and say it: Miloš Milovanović. I'm trying, it's not working, so it's close. He's going to correct me when he comes on. And we have Anne-Maj van der Meer, and they're both from TAUS. So Miloš is the Head of Business at TAUS, and Anne-Maj is the Training and Events Director. So help me, Miloš. Hi, Bryan. Thank you for... I completely butchered your last name. I am so sorry.
Miloš Milovanović 0:43
It was a nice try, Bryan. I really loved it.
Bryan Montpetit 0:46
I tried twice, I tried. How do you say it? I'll try next time not to get it mixed up. Almost there. All right. But thank you for being with us. I really appreciate you both coming on and spending some time with us and helping, you know, educate the industry. It's great to have you. And I know that this is taking time out of your day, it's probably your evening by now or getting into it. So thank you again. What I'm going to do is turn it over to you and come back in with about five minutes or so left, so we can field some questions. Yeah, thank you very much.
Anne-Maj van der Meer 1:24
Thanks. And I think Miloš will be sharing the screen, so that's great. Well, thank you, everyone, for joining this session, this evening for us indeed; for some of you, it's still morning. We're quite excited to tell you a bit more about the language industry transformation as we're seeing it at TAUS. My name is Anne-Maj van der Meer and I'm the Training and Events Director. And Miloš Milovanović, it also took me a couple of times to pronounce his last name, so don't worry, is here with me. So yeah, let's just dive into it. For those of you who are not familiar with TAUS, I just wanted to give you a quick overview of our company. We started in 2005 as a think tank: we organize events, publish white papers and reports, and we brought together the industry leaders to talk about the key challenges in our industry, kind of what we're doing here today, only now from home. From all of these discussions, the TAUS Data Cloud was founded in 2008, and that was our first initiative to promote language data sharing. We had about 40 founding members, ranging from IT giants to LSPs, tech providers, and startups. And within a few years the Data Cloud grew to about 60 billion words in 2,200 language pairs. At the same time, this issue of defining translation quality, and transparency around quality, became more and more important. So again, with industry leaders and researchers in the field, we worked on creating DQF, the Dynamic Quality Framework. In the meantime, it's been adopted by ASTM as a standard for translation quality. And after we created the error typology, we also introduced the DQF Dashboard, which is a platform to gain insights into productivity, efficiency, quality, and more.
And then over the past couple of years, we've done a large cleanup of our Data Cloud: we cleaned out all the duplicates and bad segments, and now we're left with about 45 billion words of quality data in 600 language pairs. We introduced a new Matching Data service, which is a clustered search technique to customize and clean corpora. And two years ago, we started the Human Language Project, which is a new platform for data creation with native speaker communities, especially in low-resource languages. And only a few months ago, in November 2020, we launched the new version of the old Data Cloud: the Data Marketplace. I won't share too much about that yet, because that's what Miloš is going to explain in more detail, especially talking about the value of the Data Marketplace for the whole industry. So now on to the industry transformation. It's 2017, and we think we're ready: we published this manifesto called Nunc Est Tempus, which in Latin means "now is the time", with the tagline "redesign your translation business". We've gone from statistical MT to neural MT, and the possibilities seem endless. Technologies are popping up everywhere. And as I said, now is the time to change our business models and make use of all these amazing new technologies. However, our big lesson was: just because it's technologically possible doesn't mean it will actually happen. So 2020, a year of transformation, not just in our industry. We were forced by this global pandemic to adopt and make use of all these wonderful technologies and innovations. The way we're doing business now has changed vastly. So we're all using technology, and we're seeing more and more that this technology is here to help us. In the past couple of months, we've seen an acceleration of this transformation; the industry is moving faster than ever before. So we feel that 2021 is the year of execution: we settle into the new reality and make the best out of all these changes.
In the words of Winston Churchill, never waste a good crisis. I think that's a quote all of us have heard many, many times these past nine months. So let me dive a little bit deeper into these happenings. For the Nunc Est Tempus ebook, we interviewed CEOs like Rory Cowan, then at Lionbridge, Smith Yewell from Welocalize, leaders at Oracle, and many more. Together with these interviews and all the extensive research that we did ourselves, we came up with six drivers of change for this reinvention of our businesses. As some of you know, at our annual conferences, in the Quantum Leap conversations with the MT leaders in our industry, we always ask them this question: what's more important, data or algorithms? And almost consistently, the answer we got back is that algorithms are all the same; they're open source and used by all the different players. And data is the differentiator. So you need the algorithms to build the engines, but the data is what defines the quality of the engine. So it's pretty clear: we need to follow the data. And now that we know that algorithms and data will solve the translation problem, our business models need to change as well. Translation is no longer an afterthought, or a pre-publishing production stage; translation will become part of the user interface and automatically adapt to the language and the communication preferences of the user. Speech is the new text: for a long time we've been very focused on text data, but more and more we're seeing new media translations like audio and video, so we're going vocal. And not only that, there has also always been a large focus on the Western part of the world, but other markets, like China, are leading the innovation now within our industry, and we need to shift our focus to these new markets. And last but not least, without reading the charts, we won't be able to make informed decisions.
So we need to collect the business intelligence, analyze, measure, and help ourselves decide on the changes to make in our businesses. While in 2017 we thought we needed to redesign our business, we quickly realized we can't do it in isolation. So in 2020 we thought: we need to work together and redesign the whole industry. We organized this very, very hands-on forum; maybe some of you were there. It was not just a listening conference, it was hard work. Around 300 attendees from all over the world joined us and brainstormed on, you know, what the future of our industry would look like, especially with the whole pandemic going on. There were many different interesting insights that came out of the event, but I just wanted to highlight a few here. Data is the new oil: we want to go massively multilingual, and in order to do that, we need to get the right data to train the engines. And with a massively multilingual solution, our current business models are no longer, or may no longer be, sustainable. So we need to rethink this cost-per-word model and find one that's more relevant to the new way of doing business. And last, but not least, all languages matter. We are no longer focused solely on the top 30 to 80 languages, but increasingly aware of the need to make all these underrepresented languages digital as well. So you've been hearing me say data a lot already, and data really is what it's all about in our industry now. Late last year, we published a new Language Data for AI report, where we highlighted a few important trends. Language is core to AI, and therefore language data, in each and every language, is at the core of the AI revolution, because it's the gateway to augmenting the key human intelligence skills of speaking and understanding. The second one is a data-first paradigm shift.
So when starting a new translation project, the focus changes from hiring translators to do the job to collecting relevant language data and pre-loading the models to do the job. There's an acceleration of change: machines are thousands of times more productive than humans in translation and language tasks. But as I said earlier, this current economic model is probably not sustainable, and the COVID-19 global crisis only made this more visible. We are seeing that language service providers are diversifying into data annotation services, for example, and freelancers and smaller agencies offer to sell their data rather than their translations. Then, the rise of the new cultural professional: the established, cascaded supply chain of global vendors contracting with local vendors, who in turn hire freelance translators, makes place for, or maybe exists alongside, a human intelligence task platform with natural talents. Hundreds of thousands of crowd workers are invited to perform small tasks, where the main criterion for success is their roots and orientation in their local culture. And then finally, new markets move faster: when everything is basically up for change, those markets with the least to replace, the least legacy systems and established roles in businesses, will typically move faster; the law of the handicap of a head start, I believe they say. So think about Southeast Asia, Africa, the Middle East. Well, now it's time to execute on all these changes we saw happening in our industry in the past few years. There's a shake-up of the ecosystem going on, and everything is being automated, but we need to keep in mind that we need to value human intelligence. There are millions of crowd workers and translators around the world who keep our data in optimal shape. So the big question is: is our economic model ready for the AI revolution? I'd like to now give the microphone to Miloš, who will tell you a bit more about our opportunities with the Data Marketplace.
Miloš Milovanović 11:41
Yes, so talking about the revolution, let's start with the Data Marketplace, which was launched in November last year. Why a data marketplace? We see from the industry leaders that more data, more high-quality domain-specific data, outperforms a smarter model. Next to it, we see big scarcity for some specific domains and some specific languages, like low-resource languages that are widely spoken in the world but still poorly available online. We see that the traditional translation industry doesn't have high NLP skills and capacity, and we also stress the high importance of clean, domain-specific, high-quality data. And we also see that technology providers do have access to the algorithms, but on the other side, sometimes they don't have enough domain-specific high-quality data or low-resource language data. What we say is: the data comes first, and we're advocating this approach. All these points led us to the development of the Data Marketplace, together with our partners, one of them being the European Union, which is financing the project. The project was launched, as I mentioned, on the first of November last year, so we've been live for three months already. What are the main objectives of the Data Marketplace? There are two main ones. The first one is to become a go-to place for high-quality language data; I will tell you a bit more about what this high quality means in the following slides. And the second one is to bridge the gap for low-resource languages, so rare languages, and low-resource domains. As I mentioned earlier, there are big communities of people whose languages are not available online, and we believe that every person should have access to content in their native language. This is one of the key objectives of the Data Marketplace. So what does the Data Marketplace look like? In a scheme, it looks like this.
So on the left side of the screen, we have the data sellers, who can be anyone owning quality data. They have the possibility to upload the data and get free cleaning and anonymization for the data that is published on the Data Marketplace. They have the possibility to market themselves through their profiles; this is a feature that is still being developed. They obviously have the possibility to sell the data multiple times, through a basic sale, selling the whole document, or through clustered sales; I will tell you a bit more on this one in a moment. And they have the possibility to get paid. On the right side of the screen, we have the data buyers, who are usually big technology companies looking for high-quality data. They have the possibility to search, to request the data, to review and score it as part of the feedback loop, to cluster the data, and eventually to acquire the data. And in the middle, we have TAUS and the Data Marketplace platform, which manages these processes and the services that are built into the Data Marketplace. Everything is protected with a strong legal framework; as TAUS has been in the data-sharing business since 2008, we take data very, very seriously, and we have to protect it very, very seriously. What's in it for data sellers? Data sellers can support the ecosystem through data sharing, and they can benefit from cleaning and anonymization of the data, free of charge, for all the data that is published on the Data Marketplace. This is, let's say, a win-win situation for data sellers and for the industry. We see that all languages matter, and we have a responsibility to promote the languages that are poorly available online. We also have a head start from some data sellers who are taking advantage of this opportunity and adopting it early, adopting it now, because now is the time. And obviously, there is the possibility of monetization.
What I want to add here is that usually, for example, for big enterprises, the translation department is seen only as a cost generator. From a financial perspective, this can turn the other way around: the big enterprises can start generating some revenue to scale up their budgets, and then these extra budgets can be used for some, let's say, classic translation services somewhere else. So not only can they spend money, but through this business model they can earn money. This applies also to LSPs and to individual data owners. Talking about the individual owners, we also see a new category, where translators are focusing on producing data and offering it to the Data Marketplace. We see a case, for low-resource languages, where one of our ambassadors started creating data to promote his local language and to earn some revenue from it. We have different profiles in our Data Marketplace: some of them are sellers, some of them are buyers, some of them are individuals. So here I just wanted to share an example of a few. How does the Data Marketplace... yes, sorry, I have two screens, so it's showing the slide on one and then it was gone on the other one. So we have, for example, a few data sellers here. And what does the Data Marketplace look like from, let's say, the end-user perspective? Everything starts with the data upload, where the data is uploaded by the data seller. In the next step, the quality cleaning is performed, where bad segments are actually removed. There is a possibility of optional anonymization; currently we perform it manually, but it will become a standard feature. The price is defined based on some algorithmic suggestions, but in the end the data seller is responsible for defining the price they want for specific data, and the data is published. The next phase concerns exploration of the data.
So, data buyers have the possibility to search for documents in a simple search and acquire the whole document or part of the document. The alternative to that is the Matching Data clustered search, where data buyers have the possibility to upload a sample of a domain-specific data set, and then the algorithm looks for similar segments across all the uploaded documents in that specific language to find the best matches. This data is then aggregated into a data set that a buyer can decide to purchase or not, based on a sample. So the data buyer has the possibility to review the data, pay, download, and rate the data seller. Regarding the clustered search, this part also allows data sellers not only to sell a document as a whole or in part, but also to sell segments or sentences from a specific document, meaning that one sentence can be sold an indefinite number of times. What does high-quality data mean? High-quality data is data that is clean, anonymized, and clustered; it requires no effort, or minimal effort, before being added to machine translation and AI systems. This is the number one goal of the Data Marketplace, and we are providing cleaning, anonymization, and clustering through the Data Marketplace. Some of the automated cleaning services that are available through the Data Marketplace are listed here. Most of them are free of charge for everybody who uploads data: basically, they can upload the data, publish it to the Data Marketplace, and also download the clean version of the data set that has been uploaded. We are constantly improving the services, so expect this list to get bigger.
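The clustered search Miloš describes, uploading a sample of domain-specific data and retrieving the most similar segments from the whole pool, can be illustrated with a toy sketch. This is only a minimal stand-in using bag-of-words cosine similarity; the actual TAUS Matching Data algorithm, the threshold value, and the function names here are all assumptions for demonstration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for one segment."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def matching_data(sample, pool, threshold=0.3):
    """Return pool segments similar to ANY sample segment,
    best matches first -- a toy stand-in for clustered search."""
    sample_vecs = [vectorize(s) for s in sample]
    matches = []
    for seg in pool:
        score = max(cosine(vectorize(seg), sv) for sv in sample_vecs)
        if score >= threshold:
            matches.append((score, seg))
    return [seg for score, seg in sorted(matches, reverse=True)]

# A medical-domain sample pulls medical segments out of a mixed pool.
sample = ["the patient was given an oral dose of aspirin"]
pool = [
    "administer the oral dose twice daily to the patient",
    "the quarterly revenue exceeded all expectations",
    "aspirin was given to the patient after surgery",
]
print(matching_data(sample, pool))
```

A real system would use embeddings or edit-distance clustering rather than raw word overlap, but the shape of the operation, sample in, ranked matching segments out, is the same.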
And to conclude, because we are almost running out of time, I invite you to check the white paper that we did with Baker McKenzie last year. Baker McKenzie is an international law firm specialized in, among other things, data and data ownership. What this report promotes is that nobody can own segments that are super generic, like "please type in your password", "this website uses cookies", or "today is a nice day". Nobody can own such sentences or claim them as copyrightable. So this is where the value of the segments lies, and this is something we want to promote in the Data Marketplace. The report also gives a few examples from European law and US law. But please feel free to download the report, as it offers quite a different perspective on data ownership. And with this, we can move to questions and answers.
Bryan Montpetit 22:14
Excellent, thank you very much. It was perfect; you guys are doing great. So I do want to mention that there is a giveaway of an ebook for the best question. So if you've got questions, now's the time to ask them. And the first question is actually going to be selected by TAUS. So again, keep the questions coming in. So Miloš and Anne-Maj, what we have is: how do you verify ownership of the data or translation memories? That is one of the questions. I'm not sure who's going to take it, but it's an open question for either of you.
Miloš Milovanović 22:49
Yes. So the ownership of the data basically lies with the data seller, and our terms and conditions protect the whole, let's say, framework from people uploading data that is freely available publicly online, or that is not owned by the data seller. So this is quite regulated in the terms and conditions, and it's the responsibility of the data seller.
Bryan Montpetit 23:24
Great, thank you very much. We have another question that's come in. How can freelance linguists provide datasets and still comply with NDAs?
Miloš Milovanović 23:36
Well, I have two answers. The first one is to check the white paper that we wrote. The second one: if they still have concerns after reviewing the white paper, they can always produce the data themselves and offer it to the Data Marketplace. This is a use case that we already have and that we already see.
Bryan Montpetit 23:56
Excellent. There are no other questions coming in that I see. However, I do want to ask: given the data that you're working with, have you experienced concern or pushback from any of the domains and clients, corporate entities, anything like that? I'm just very curious about the type of engagement you have with them.
Miloš Milovanović 24:18
Yeah, so this is quite interesting, so thank you for asking this. Obviously, there are two types of enterprises. One type is quite protective about their data and doesn't allow any data sharing. But we also have a different type of enterprise. For example, when we started the Data Cloud in 2008, the big enterprises joined together and shared some translation memories among themselves. So a lot of these big tech companies were supportive, and with the Data Marketplace now, they're also supporting us. And we also see new companies coming in and saying: hey, I'm currently only generating expenses for the company; if I generate some revenue for the company, my budget will increase. So we see the logic from a financial perspective changing the mentality. For example, some enterprises might have data that is not, let's say, highly classified, or that is super generic. They can publish it to the Data Marketplace and earn some extra budget for it, which can be spent on something else. So yeah, two types of enterprises, and we see a difference there. But we hope that more and more enterprises will adopt the concept of the Data Marketplace.
Bryan Montpetit 25:45
Great, thank you. And while we were answering that, other questions came in. So we have: what new professions, especially ones translators can naturally pivot to, will be needed to support the growing LD4AI sector?
Anne-Maj van der Meer 26:01
Yeah, shall I take that one, Miloš? Yeah, even though you're killing all the answers, you're doing great, but I want to get in there as well. So yeah, that's a great question. And of course, I touched upon it very briefly in one of my slides: the rise of the new cultural professional. So we're saying data is the new oil, data is what we need. We need good data, high-quality data, in all the different domains and the underrepresented languages. So if you're a translator and you are sitting on old translation memories, you're very welcome to become a data seller on the Data Marketplace. Apart from that, there are a lot of new tasks in the data services industry that you can get into: data annotation, NER tagging, but also creating data, voice data, video data, images. So that's what I would recommend looking into.
Bryan Montpetit 27:05
Great question, and thanks for the answer, that's fantastic. We also have a question with respect to: is the Data Marketplace primarily for training MT engines, or do you offer TMs in TMX format?
Miloš Milovanović 27:18
If I may answer this one anyway: basically, the data that is uploaded has to be in TMX format for now, so that's a mandatory requirement. If some companies have data that is in a different format, we can convert it to TMX, and then the data set can be uploaded. The same goes for download of the data: the generic, let's say, download possibility is TMX. But if there is any specific request regarding a different type of format, we can support that. Of course, the main use of the data is for training machine translation engines or AI systems. But yeah, I'm not sure if the person who asked the question had any other use case in mind, so we can answer that one directly as well.
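For readers unfamiliar with the TMX format Miloš mentions (Translation Memory eXchange), it is an XML container for aligned segment pairs: each `<tu>` (translation unit) holds one `<tuv>` per language, each wrapping a `<seg>`. A minimal sketch of writing and reading it with the standard library could look like this; the header attribute values and helper names are illustrative, not a complete TMX 1.4 implementation.

```python
import xml.etree.ElementTree as ET

def pairs_to_tmx(pairs, src_lang="en", tgt_lang="nl"):
    """Wrap (source, target) sentence pairs in a minimal TMX 1.4 skeleton."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "srclang": src_lang, "segtype": "sentence", "adminlang": "en",
        "datatype": "plaintext", "o-tmf": "demo",
        "creationtool": "demo", "creationtoolversion": "0.1",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")          # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

def tmx_to_pairs(tmx_string):
    """Read the aligned segment pairs back out of a TMX document."""
    root = ET.fromstring(tmx_string)
    return [tuple(tuv.find("seg").text for tuv in tu.iter("tuv"))
            for tu in root.iter("tu")]

pairs = [("Good morning.", "Goedemorgen."), ("Thank you.", "Dank je wel.")]
assert tmx_to_pairs(pairs_to_tmx(pairs)) == pairs  # round trip preserves pairs
```

A converter from CSV or spreadsheet exports into TMX is essentially the first function with a different input loop, which is presumably the kind of conversion the marketplace performs for sellers.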
Bryan Montpetit 28:12
All right, fantastic. Another question that came in, if my reading is correct: data is sold per word; is there a quality or relevance criterion that also comes into the picture?
Miloš Milovanović 28:27
So data is sold per sentence, and the pricing is expressed per word. Basically, for every data set we're providing a sample that a potential data buyer can review, and depending on the sample, they can make a decision whether or not to buy the data.
Bryan Montpetit 28:48
Great, and this is the last question we have, and we're gonna have to do it fairly quickly. But is there a duplicate verification? So what if two sellers provide the same data?
Miloš Milovanović 28:57
Yes. So, if the sellers provide absolutely the same data, the second data set will be declined, so they will not be able to upload it. But if there is one segment that is different, the data set will be available for acquisition in the simple search. When we apply the clustered search, the priority will be with the primary data upload. I hope the answer is clear enough.
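The duplicate check Miloš describes, decline an upload only when every segment matches an earlier data set, could be sketched with a fingerprint over the segments. The hashing scheme, the normalization (strip and sort), and the `Registry` class are assumptions for illustration, not the actual marketplace implementation.

```python
import hashlib

def fingerprint(segments):
    """Order-insensitive fingerprint of a data set's segments."""
    canonical = "\n".join(sorted(s.strip() for s in segments))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class Registry:
    """Declines an upload whose segments exactly match an earlier one;
    a single differing segment yields a new fingerprint, so it passes."""
    def __init__(self):
        self.seen = {}  # fingerprint -> first seller to upload it

    def upload(self, seller, segments):
        fp = fingerprint(segments)
        if fp in self.seen:
            return f"declined: duplicate of {self.seen[fp]}"
        self.seen[fp] = seller
        return "accepted"

reg = Registry()
print(reg.upload("seller_a", ["Hello world.", "Goodbye."]))  # accepted
print(reg.upload("seller_b", ["Hello world.", "Goodbye."]))  # declined: duplicate of seller_a
print(reg.upload("seller_c", ["Hello world.", "Goodbye!"]))  # accepted: one segment differs
```

Recording which seller uploaded first also gives the "priority with the primary upload" rule for clustered search a natural data source.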
Bryan Montpetit 29:29
It was clear to me, so thank you; hopefully it's clear to everyone, and all the answers are clear to everyone who asked the questions. Did you have a preferred question out of that set? I know they're quite difficult to field. But if so, we can definitely select the winner of the ebook.
Miloš Milovanović 29:46
Anne-Maj can pick one.
Anne-Maj van der Meer 29:48
All right. No, no, Miloš, I'll leave that up to you. You answered most of them, so which one was your favorite?
Miloš Milovanović 29:53
Oh, please, please go ahead and you decide.
Bryan Montpetit 29:56
If you'd like, we can also announce it later and you can deliberate. We're going to announce it in the community, so please stay tuned to the community; I'm sure we'll be announcing the winners there, and we'll give Miloš and Anne-Maj the chance to discuss it. And we can of course provide the names of all those people who asked the questions. So thank you very much for your time, I really appreciate you spending it with us. I apologize for butchering your surname; next time we see each other, I'll be able to say it. And thank you. We're going to have our next presentation in a moment.