In this multi-post series, we examine the problem of training AI models on full-text datasets and the techniques that make it succeed. First, let’s understand how hard this is, and why.
The following statement by Indrek Vainu, CEO of AlphaBlues, an enterprise chatbot company, summarizes the current situation: “Extraction of meaning — or more specifically, semantic relations between words in free text — is a complex task. The complexity is mostly due to the rich web of relations between the conceptual entities the words represent.” He goes on to say that machine learning is “largely clueless when fed unstructured data, such as free text.”
IMImobile, another chatbot company, states: “Machine learning is a powerful technology and promises an exciting future where machines can come to understand our needs and our intent, perhaps better than we do ourselves. However, at this moment in time we only recommend machine learning for scenarios where there is little scope for ambiguity, and where vectorisation (converting non-numeric input to numeric inputs) is straightforward.”
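To make the term "vectorisation" in the quote above concrete, here is a minimal sketch in Python of one simple approach, a bag-of-words count. This is our own illustration, not something IMImobile describes: the function names `build_vocabulary` and `vectorise` and the sample notes are hypothetical, and real systems typically use more sophisticated tokenisation and encoding.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorise(document, vocabulary):
    """Return a bag-of-words count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(document.lower().split())
    # Dicts preserve insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical sample notes, invented purely for illustration.
notes = [
    "patient passed screening test",
    "patient failed screening test",
    "test passed patient screening",  # same words as the first note, scrambled
]

vocabulary = build_vocabulary(notes)
vectors = [vectorise(note, vocabulary) for note in notes]

for note, vector in zip(notes, vectors):
    print(vector, "<-", note)
```

Note that the first and third notes map to the same vector even though the third is word salad: simple counts discard word order, which is one small reason why free text leaves so much scope for ambiguity.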
A recent customer engagement at Informatics4AI supports these statements. Our customer was working with a dataset composed of unstructured doctor's notes. They found that their machine learning efforts produced a model that was highly effective for straightforward diagnostic situations (e.g., a patient passing a common screening test). But when fed notes relating to complex tests and multiple patient conditions, the model did not produce predictions with the accuracy they needed.
As an illustration of the difficulty that AI has with full text (and for a bit of fun), let's take a look at the results that Janelle C. Shane got when she trained a neural network on a database of about 30,000 recipes and then asked the machine to produce a new recipe:
Recipe:
2 pkg hershey’s can be prepared in unpeeled
1 smaller
½ cup yellow onions you may
1 cup egg; chilled, coursely chopped
½ lb bacon, chopped
1 ½ cup sugar, grated
4 oz square oil
Halve the finely chopped fresh garlic salt and pepper. Break the meat into the pineapples and pat them, scraping the room off the skillet. Add ghees and beer and bring to a boil; cover and simmer, uncovered, on High for 20 to 30 minutes or until the onion thickens.
To be fair, this model was built by an AI enthusiast and not an AI professional, but I think it illustrates the issue: the machine has no clue what a recipe is really all about.
However, all is not lost when trying to apply machine learning to full text. The key is adding structure and meaning to the raw data, which enables the machine to make sense of the text and thus begin to learn. We will review these techniques in the next blog post.