In this special guest feature, Michael Goldberg, VP of Marketing at Innodata (NASDAQ: INOD), indicates that even the most technically advanced algorithm cannot address or solve a problem without the right data. We know having access to data is quite valuable, but having access to data with a learnable ‘signal’ consistently added at a massive scale is the biggest competitive advantage nowadays. That’s the power of data annotation. Innodata bridges human expertise with machine learning to help companies across the globe solve complex data challenges, improve operational efficiency and lower costs. Employing robotic process automation, Innodata seamlessly turns unstructured, raw content and data into structured, highly useable business intelligence, at scale.
In case you’ve been living under a rock, artificial intelligence (AI) is everywhere. It has infiltrated almost every aspect of our private and professional lives. From healthcare to transportation, AI aims to redefine how information is collected, integrated, and analyzed, ultimately leading to more informed insights and better outcomes. But for all its hype, the full promise of AI rarely comes to fruition because of one four-letter word: “data.”
While the AI story is all the rage, the data narrative is
not as prominently discussed. Sure, data may not be as sexy as the automated
systems that can learn and process information quicker than a human, but it is
equally important. And don’t get me wrong, we all know that AI requires vast
amounts of data to continually learn and identify patterns that humans can’t.
After all, it’s the ability to process this information and make instant
decisions that has led to AI being such a game changer for industries that rely
on massive volumes of data.
But the real story is not about the
algorithms powering the AI revolution; it’s about the quality of data
powering these systems. What enterprises really need as they develop their AI
strategy is to integrate, clean, link, and supplement their data so they have
an accurate foundation on which to build and train their machine learning
algorithms. For many organizations, this preparation work
makes AI difficult, if not impossible.
“Data-related challenges are a top reason (our) clients have halted or canceled artificial-intelligence projects,” said IBM’s senior vice president of cloud and cognitive software, Arvind Krishna, speaking at The Wall Street Journal’s Future of Everything Festival. He’s certainly not alone in his assessment. According to a report by MIT Technology Review, insufficient data quality was one of the biggest challenges to employing AI. What’s more, 85% of AI projects will “not deliver” for organizations, according to research and advisory company Gartner.
Companies need to think of AI and
machine learning as the engines that will drive the amazing things they want to
accomplish. But like every engine, they need the right fuel to run well.
Data annotation (also referred to as
data labeling) is critical to ensuring your AI and machine learning
projects can scale. It provides the initial setup for training a machine
learning model: what the model needs to understand and how to discriminate
between various inputs to come up with accurate outputs.
There are many different data
annotation modalities, depending on what form the data is in. They
range from image and video annotation to text categorization, semantic
annotation, and content categorization. Humans are needed to identify and
annotate specific data so machines can learn to identify and classify
information. Without these labels, the machine learning algorithm will have a
difficult time computing the necessary attributes.
The unfortunate reality about all of this is that it’s still a very manual, labor-intensive process. Tools for annotation are getting better, and the difference between an ill-designed tool and an intuitive one makes a significant difference in annotation productivity. According to some estimates, 80% of AI project time is currently spent on data preparation. But even small errors in the data could prove to be disastrous. In this area, humans actually have a leg up on machines. We are simply better than computers at managing subjectivity, understanding intent, and coping with ambiguity, all of which are important factors in data annotation.
Regardless of modality, the vast majority of problems that AI
models are being built to address can fit into one (or many) of the
annotation tasks below:
- Sequencing: text or time series in which there’s a start
(left boundary), an end (right boundary), and a label (e.g., recognize the name of
a person in a text, identify a paragraph discussing penalties in a contract)
- Categorization: binary classes, multiple classes, one label,
multi-labels, flat or hierarchical, ontological (e.g., categorize a book
according to the BISAC ontology, categorize an image as offensive or not)
- Segmentation: find paragraph splits, find an object in an image,
find transitions between speakers, between topics, etc. (e.g., spot objects and people in
a picture, find the transition between topics in a news broadcast)
- Mapping: language-to-language, full text to summary, question to
answer, raw data to normalized data (e.g., translate from French to English, normalize a
date from free text to standard format)
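To make the four task types more concrete, here is a minimal Python sketch of what a single annotation record might look like for each one. The class names, field names, and sample text are illustrative assumptions, not a format used by Innodata or by any particular annotation tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SpanAnnotation:
    """Sequencing: a labeled span with a left and a right boundary."""
    start: int  # left boundary (character offset, inclusive)
    end: int    # right boundary (character offset, exclusive)
    label: str


@dataclass
class CategoryAnnotation:
    """Categorization: one or more labels for a whole item."""
    labels: list[str]


@dataclass
class SegmentAnnotation:
    """Segmentation: offsets where one segment ends and the next begins."""
    boundaries: list[int]


@dataclass
class MappingAnnotation:
    """Mapping: a source value paired with its target value."""
    source: str
    target: str


# Hypothetical sample text, invented for illustration only.
text = "Ada Lovelace signed the contract on March 3, 2020."

# Sequencing: mark where a person's name starts and ends.
name = "Ada Lovelace"
name_start = text.index(name)
person = SpanAnnotation(start=name_start, end=name_start + len(name), label="PERSON")

# Categorization: assign a (hypothetical) document-level label.
category = CategoryAnnotation(labels=["contract"])

# Segmentation: record the offset of a paragraph split.
two_paragraphs = "First paragraph.\n\nSecond paragraph."
segments = SegmentAnnotation(boundaries=[two_paragraphs.index("\n\n")])

# Mapping: normalize a free-text date to ISO 8601 format.
# Note: "%B" parses English month names under the default C locale.
raw_date = "March 3, 2020"
iso_date = datetime.strptime(raw_date, "%B %d, %Y").date().isoformat()
date_mapping = MappingAnnotation(source=raw_date, target=iso_date)
```

In practice a human annotator would supply the boundaries and labels by hand; they are computed here only so the example is self-contained.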
Usually, complex problems can be solved as a sequence or a
combination of tasks. For example, when you unlock your phone with face
identification, machine learning is used to spot your nose and eyes
(segmentation) and categorize the face as you or not-you (categorization). When
you talk to Alexa or Siri, machine learning is used to map your voice to
words (mapping), recognize sequences such as an instruction or the name of a
song (sequencing), and play music, tell the weather, etc. (categorization).
At the end of the day, even the most technically advanced algorithm cannot address or solve a problem without the right data. We know having access to data is quite valuable, but having access to data with a learnable ‘signal’ consistently added at a massive scale is the biggest competitive advantage nowadays. That’s the power of data annotation.