
There is often a misconception about the difference between experimental data science and data science for the real world, as practiced by companies building an actual product. This video clarifies the distinction in order to set a baseline for future videos.
We do not yet have an overview of the tools to be used during this course. This video lays them out.
Some of the tools we use require extra downloads or installs. We will walk you through the various steps, so that you are fully equipped to tackle the coding examples.
Now that we have our system up and running, we don’t know yet what we intend to do with it.
This video highlights the types of data a text mining data scientist might come into contact with, and where to find some text data to start working with.
Under the motto "garbage in, garbage out", we look at some commonly used pre-processing steps.
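As a rough illustration of what such pre-processing can look like, here is a minimal sketch using only the Python standard library. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names chosen for this example, not taken from the course material; a real project would likely rely on a fuller stop-word list and a dedicated tokenizer.

```python
import re

# Illustrative stop-word list (an assumption; real projects use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = preprocess("The quality of the input is KEY, and noise hurts!")
print(cleaned)
```

Lowercasing and punctuation stripping are only two of many possible steps; which ones help depends on the task.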
A key aspect of transforming text into numeric form is correctly building your corpus of words and word features. The appropriate techniques are explained in this video.
To refine our corpus further, n-grams often play an important part. This video explains how to use them.
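To make the idea of n-grams concrete, the sketch below slides a window of size n over a list of tokens. The helper name `ngrams` is an assumption made for this example; libraries such as NLTK offer comparable utilities.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["new", "york", "is", "big"]
bigrams = ngrams(tokens, 2)
print(bigrams)
```

Bigrams such as ("new", "york") capture multi-word expressions that single-word features would split apart.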
This video gives a quick intro to the information extraction problem. First up is a very important difference between an ML approach and a text search approach.
A deeper dive into the world of named entity recognition, the machine learning approach to information extraction. It is important to know how this approach works.
For a lot of so-called entities, pre-trained models exist which one can use off the shelf. But when to use them, for which entities, and with which models often remains unclear. This is explained in this video.
Often, pre-trained entity types are just not enough. This video will demonstrate what to do in such a scenario.
State-of-the-art results can be obtained using more advanced deep learning models. This video gently explores that option, along with the opportunities and pitfalls.
This video gives an introduction to the problem of text classification, along with the first step in the process – representing your text as a mathematical vector for use in a learning algorithm.
There are many algorithms and techniques out there to tackle the problem. This video aims to guide you in selecting the right one for your problem.
A dive into a coded example, starting from a popular dataset, that applies what we have learned so far at each step and results in a full working example.
In our previous example, there are a lot of choices to be made and hyperparameters to be tuned. This video gives an overview of them.
This step is often omitted in other courses, but since this course aims to provide a real-world view on text machine learning, some attention is paid to putting classification into production.
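One simple way to represent text as a vector is a bag-of-words count, sketched below with the standard library only. The function names `build_vocabulary` and `vectorize` are hypothetical and chosen for illustration; this is not necessarily the representation used in the course.

```python
from collections import Counter

def build_vocabulary(docs):
    """Collect the sorted set of unique tokens across all documents."""
    return sorted({tok for doc in docs for tok in doc})

def vectorize(doc, vocab):
    """Count how often each vocabulary term occurs in one tokenized document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

# Two toy documents, already tokenized.
docs = [["good", "movie"], ["bad", "movie", "bad"]]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length list of counts, which is the kind of input most classical learning algorithms expect.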
State-of-the-art results can be obtained using more advanced deep learning models. This video gently explores that option, along with the opportunities and pitfalls.
This video gives an introduction to the concept of word embeddings – the history and the use cases.
Moving deeper into Word2Vec (with skip-grams and CBOW) and GloVe to better clarify and structure the main techniques used.
We now know what a word embedding is, but not yet how to use one. The first step is training a word embedding model.
To demonstrate some of the powerful aspects of word embeddings, we will try to visualize one.
The power of word embeddings is being applied more and more often in other domains as well, and to things other than just words. This video teases some of these possibilities.
We saw a lot of concepts introduced in the previous sections. To consolidate, we will try to stitch everything together in one overview.
A method we haven’t seen yet is topic modelling, though it can be a useful one for some problems. Based on what we already know (tools and methods), the introduction is easily made.
From here on, the more exotic topics will be addressed, starting with text generation. Using neural network architectures, models can be trained to predict sequences in the style of the training corpora.
One of the large areas where neural networks have revolutionized NLP is machine translation. This video quickly touches upon this topic.
From here on, there are a number of areas you can explore to get further acquainted with the field. It can be difficult to find the right paths on your own, so this video gives some pointers to get started.
Again, we saw a large number of different topics, both in this section and the previous ones. In this final video, we try to structure everything in one large overview.
Text is one of the most actively researched and widespread types of data in the Data Science field today. New advances in machine learning and deep learning techniques now make it possible to build fantastic data products on text sources. New exciting text data sources pop up all the time. You'll build your own toolbox of know-how, packages, and working code snippets so you can perform your own text mining analyses.
You'll start by understanding the fundamentals of modern text mining and then move on to some of the processes involved in it. You'll learn how machine learning is used to extract meaningful information from text, and how to read and process text features. Then you'll learn how to extract information from text and work with pre-trained models, while also delving into text classification, and entity extraction and classification. You will explore word embeddings by working with skip-grams, CBOW, and X2Vec, along with some additional important text mining processes. By the end of the course, you will understand the various aspects of text mining with ML and the important processes involved, and will have begun your journey as an effective text miner.
About the Author
Thomas Dehaene is a Data Scientist at FoodPairing, a Belgium-based Food Technology scale-up that uses advanced concepts in Machine Learning, Natural Language Processing, and AI in general to capture meaning and trends from food-related media. He obtained his Master of Science degree in Industrial Engineering and Operations Research at Ghent University, before moving his career into Data Analytics and Data Science, in which he has been active for the past 5 years. In addition to his day job, Thomas is also active in numerous Data Science-related activities such as Hackathons, Kaggle competitions, Meetups, and citizen Data Science projects.