If you want to build data science models that are ready for production, Java is a strong choice. With the aid of robust libraries such as MLlib, Weka, DL4J, and more, you can efficiently perform all the data science tasks you need. This course will teach you how to retrieve data from sources of varying complexity. You will learn how to handle big data to extract meaningful insights. Later, we will dive into visualizing data to uncover trends and hidden relationships. Finally, we will work through videos that address the problems you face when taking data science to production, writing distributed data science applications, and much more: things that will come in handy at work.
About the Author
Rushdi Shams has a Ph.D. in the application of machine learning to Natural Language Processing (NLP) problem areas from Western University, Canada. Before starting work as a machine learning and NLP specialist in the industry, he taught undergraduate and graduate courses. He maintains a YouTube channel named "Learn with Rushdi" for learning computer technologies.
Weka's native file format is called Attribute-Relation File Format (ARFF). Let’s learn in detail about this file and how to work with it.
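As a rough illustration of the format, an ARFF file has a header (an @relation line plus one @attribute line per column) followed by an @data section of comma-separated rows. The sketch below uses only the JDK, not Weka's own loaders; the class name ArffSketch and the sample dataset are made up for illustration, and the parser ignores quoted attribute names, sparse rows, and other parts of the format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: reads the basic structure of an ARFF document.
final class ArffSketch {
    // A tiny, invented ARFF document: header lines first, then the @data rows.
    static final String SAMPLE = String.join("\n",
        "% comment lines start with a percent sign",
        "@relation weather",
        "@attribute temperature numeric",
        "@attribute windy {true,false}",
        "@attribute play {yes,no}",
        "@data",
        "85,false,no",
        "70,true,yes");

    // Collects the attribute names declared in the header.
    static List<String> attributeNames(String arff) {
        List<String> names = new ArrayList<>();
        for (String line : arff.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("@attribute")) {
                // Tokens are: "@attribute", the name, then the type.
                names.add(trimmed.split("\\s+")[1]);
            }
        }
        return names;
    }

    // Collects the comma-separated rows that follow the @data marker.
    static List<String[]> dataRows(String arff) {
        List<String[]> rows = new ArrayList<>();
        boolean inData = false;
        for (String line : arff.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("%")) {
                continue; // skip blank lines and comments
            }
            if (inData) {
                rows.add(trimmed.split(","));
            } else if (trimmed.equalsIgnoreCase("@data")) {
                inData = true;
            }
        }
        return rows;
    }
}
```

In practice you would let Weka parse the file; the point here is only to show what the header and data sections look like.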
In this video, we will create four methods: the first will load an ARFF file, the second will read the data in the ARFF file and generate a machine learning model, the third will save the model using serialization, and the last will evaluate the model on the ARFF file.
The classic supervised machine-learning classification task is to train a classifier on labeled training instances and to apply the classifier to unseen test instances.
Many times, you will need to use a filter before you develop a classifier. The filter can be used to remove, transform, discretize, and add attributes; remove misclassified instances, randomize, or normalize instances; and so on. Let’s see how we can use a filter and classifier at the same time to classify unseen test examples.
Most linear regression modeling follows a general pattern: many independent variables collectively produce a result, which is the dependent variable. Let’s use Weka's linear regression classifier.
Weka has a class named Logistic, which can be used to build and use a multinomial logistic regression model with a ridge estimator. We will explore how to use Weka to generate a logistic regression model.
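To make the idea of independent and dependent variables concrete, here is a minimal sketch of ordinary least squares with a single independent variable, written in plain Java. It is not Weka's implementation, the class name SimpleLinearRegression is hypothetical, and a real model would typically have many independent variables.

```java
// Hypothetical helper: fits y = intercept + slope * x by ordinary least squares.
final class SimpleLinearRegression {
    // Returns {intercept, slope} for the given paired observations.
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two paired observations");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] { intercept, slope };
    }

    // Applies a fitted model to a new x value.
    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }
}
```

For example, fitting x = {1, 2, 3, 4} against y = {3, 5, 7, 9} recovers an intercept of 1 and a slope of 2, since the points lie exactly on y = 1 + 2x.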
In this video, we will use the K-means algorithm to cluster or group data points of a dataset together.
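The algorithm itself alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below shows that loop in plain Java rather than through Weka. The class name KMeansSketch is hypothetical, and seeding the centroids with the first k points is a simplifying assumption chosen to keep the example deterministic.

```java
import java.util.Arrays;

// Hypothetical helper: a bare-bones K-means loop, not Weka's clusterer.
final class KMeansSketch {
    // Returns the cluster index assigned to each point.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        int n = points.length;
        int dim = points[0].length;

        // Assumption: seed the centroids with the first k points.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }

        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDistance = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDistance) {
                        bestDistance = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }

            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }
}
```

Library implementations usually add smarter seeding and multiple restarts, which this sketch leaves out.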
If you have a dataset with classes, which is an unusual case for unsupervised learning, Weka has a method called clustering from classes. We will cover this method in this video.
Association rule learning is a machine learning technique to discover associations and rules between various features or variables in a dataset. Let’s see how we can use Weka to learn association rules from datasets.
In Weka, there are three ways of selecting attributes. This video will use all three: the low-level attribute selection method, attribute selection using a filter, and attribute selection using a meta-classifier.
The Java Machine Learning (Java-ML) library is a collection of standard machine learning algorithms. Unlike Weka, the library does not have a GUI because it is aimed primarily at software developers.
The Stanford classifier is a machine learning classifier developed at Stanford University by the Stanford Natural Language Processing group. The software is implemented in Java and uses maximum entropy as its classifier.
Massive Online Analysis or MOA is related to Weka, but it comes with more scalability. It is a notable Java workbench for data stream mining. With a strong community in place, MOA has implementations of classification, clustering, regression, concept drift identification, and recommender systems.
So far, we have seen multiclass classifications that aim to classify a data instance into one of the several classes. Multilabeled data instances are data instances that can have multiple classes or labels. The machine learning tools that we have used so far are not capable of handling data points that have this characteristic of having multiple target classes.
One of the most common tasks that a data scientist needs to perform on text data is to detect the tokens in it. This task is called tokenization.
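As a minimal sketch of tokenization with a JDK class, the example below splits text on whitespace and common punctuation using java.util.StringTokenizer. The class name TokenizerSketch and the delimiter set are assumptions made for illustration; the video may use a different approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper: delimiter-based tokenization with a legacy JDK class.
final class TokenizerSketch {
    // Assumed delimiter set: whitespace plus common punctuation marks.
    private static final String DELIMITERS = " \t\n\r.,;:!?()\"";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, DELIMITERS);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }
}
```

Calling TokenizerSketch.tokenize("Weka is useful. So is Mallet!") yields the six tokens Weka, is, useful, So, is, and Mallet. A fixed delimiter list like this one does not handle contractions, hyphenation, or numbers with decimal points, which is one reason dedicated NLP libraries exist.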
Sentences are a very important text unit for data scientists to work with in routine tasks such as classification. In this video, we will see how we can detect sentences so that we can use them for further analysis.
The preceding two videos in this section detected tokens and sentences using legacy Java classes and their methods. In this video, we will combine the two tasks of detecting tokens and sentences using Apache OpenNLP, an open source library.
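One way to detect sentence boundaries with the JDK alone is java.text.BreakIterator. The sketch below is an illustration under that assumption, not necessarily the method used in the video; the class name SentenceSketch is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: sentence splitting with the JDK's BreakIterator.
final class SentenceSketch {
    static List<String> sentences(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);

        List<String> result = new ArrayList<>();
        int start = iterator.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }
}
```

Rule-based boundary detection of this kind can stumble on abbreviations and unusual punctuation, so the results are worth spot-checking on your own text.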
Now that we know how to extract tokens or words from a given text, we will see how we can get different types of information from the tokens, such as their lemmas and part of speech, and whether the token is a named entity.
Data scientists often measure the distance or similarity between two data points for classification, clustering, detecting outliers, and for many other cases. When they deal with texts as data points, the traditional distance or similarity measurements cannot be used.
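One widely used option for text, which may or may not be the one this video uses, is cosine similarity over term-frequency vectors. The sketch below is plain Java; the class name CosineSimilaritySketch and the simple lowercase, split-on-non-alphanumerics tokenization are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: cosine similarity between two texts using raw term counts.
final class CosineSimilaritySketch {
    // Counts how often each lowercase token occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    // Returns a value in [0, 1]; 0 means no shared terms, 1 means identical term proportions.
    static double cosine(String a, String b) {
        Map<String, Integer> tfA = termFrequencies(a);
        Map<String, Integer> tfB = termFrequencies(b);

        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> entry : tfA.entrySet()) {
            int countA = entry.getValue();
            normA += countA * countA;
            Integer countB = tfB.get(entry.getKey());
            if (countB != null) {
                dot += countA * countB;
            }
        }
        for (int countB : tfB.values()) {
            normB += countB * countB;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // at least one text has no tokens
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two texts with no words in common score 0, and a text compared with itself scores 1 up to floating-point rounding.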
With the ever-increasing number of documents in text format, an important task for any data scientist is to get an overview of a large number of articles through abstracts, summaries, or lists of abstract topics. The goal is not merely to save the time needed to read through the articles but to support clustering, classification, semantic relatedness measurement, and sentiment analysis.
Our final two videos in this section will be on the classical machine learning classification problem, that is, the classification of documents using language modeling. In this video, we will use Mallet and its command-line interface to train a model and apply the model to unseen test data.
We used Weka to classify data points that are not in a text format. Weka is also a very useful tool for classifying text documents using machine learning models. This video will demonstrate how you can use Weka 3 to develop a document classification model.
In this video, we will use the Apache Mahout Java library to train an online logistic regression model.
This video will demonstrate how we can apply an online logistic regression model to unseen, unlabeled test data using Apache Mahout.
In this video, we will demonstrate how to use Apache Spark to solve very simple data problems. Of course, the data problems are merely dummy problems and not real-world problems, but this can be a starting point for you to understand intuitively the use of Apache Spark on a large scale.
MLlib is the machine learning component of Apache Spark and is a competitive (even better) alternative to Apache Mahout. This video will demonstrate how we can cluster data points without labels using the K-means algorithm with MLlib.
In this video, we will explore how to build a linear regression model with MLlib.
In this video, we will demonstrate how you can classify data points using the random forest algorithm with MLlib.
Word2vec can be seen as a two-layer neural net that works with natural text. In this video, we will use the Java deep learning library named Deep Learning for Java (DL4J) to apply Word2vec to raw text.
A deep-belief network can be defined as a stack of restricted Boltzmann machines, where each RBM layer communicates with both the previous and subsequent layers. In this video, we will see how to create such a network.
A deep autoencoder is a deep neural network that is composed of two deep-belief networks that are symmetrical. In this video, we will develop a deep autoencoder consisting of one input layer, four decoding layers, four encoding layers, and one output layer.
Sine graphs can be particularly useful for data scientists since they are trigonometric graphs that can be used to model fluctuations in data. In this video, we will use a free Java graph library named GRAphing Library (GRAL) to plot a 2D sine graph.
Histograms are a very popular method of discovering the frequency distribution of a set of continuous data. Let’s take a look at how to plot histograms using GRAL.
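Before anything is drawn, a histogram boils down to counting how many values fall into each bin. The following sketch computes equal-width bin counts in plain Java and leaves the plotting to a library such as GRAL. The class name HistogramSketch and the choice of equal-width bins over a caller-supplied range are assumptions made for illustration.

```java
// Hypothetical helper: counts values per equal-width bin over [min, max].
final class HistogramSketch {
    static int[] binCounts(double[] values, double min, double max, int bins) {
        if (bins < 1 || max <= min) {
            throw new IllegalArgumentException("need bins >= 1 and max > min");
        }
        int[] counts = new int[bins];
        double width = (max - min) / bins;
        for (double value : values) {
            if (value < min || value > max) {
                continue; // ignore values outside the requested range
            }
            int index = (int) ((value - min) / width);
            if (index == bins) {
                index = bins - 1; // the maximum value belongs to the last bin
            }
            counts[index]++;
        }
        return counts;
    }
}
```

With the values {1, 2, 2, 3, 5, 9, 10}, a range of 0 to 10, and 5 bins, the counts are 1, 3, 1, 0, and 2. Each bin includes its lower edge and excludes its upper edge, except the last bin, which also includes the maximum.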
Bar plots are among the most common graph types used by data scientists, and they are simple to draw with GRAL. In this video, we will use GRAL to plot one.
Box plots are another effective visualization tool for data scientists. They give important descriptive statistics of a data distribution. In this video, we will explore drawing box plots for data distributions.
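The descriptive statistics behind a box plot are usually the five-number summary: minimum, first quartile, median, third quartile, and maximum. Here is a minimal plain-Java sketch of that computation. The class name BoxPlotStats is hypothetical, and quartiles are computed by linear interpolation between sorted values, which is only one of several common conventions, so other tools may report slightly different quartiles.

```java
import java.util.Arrays;

// Hypothetical helper: five-number summary for a box plot.
final class BoxPlotStats {
    // Returns {min, q1, median, q3, max}.
    static double[] fiveNumberSummary(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no data");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return new double[] {
            sorted[0],
            quantile(sorted, 0.25),
            quantile(sorted, 0.50),
            quantile(sorted, 0.75),
            sorted[sorted.length - 1]
        };
    }

    // Assumption: linear interpolation at position p * (n - 1) in the sorted array.
    private static double quantile(double[] sorted, double p) {
        double position = p * (sorted.length - 1);
        int lower = (int) Math.floor(position);
        int upper = (int) Math.ceil(position);
        double fraction = position - lower;
        return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
    }
}
```

For the values {5, 1, 4, 2, 3}, this gives a minimum of 1, quartiles of 2 and 4, a median of 3, and a maximum of 5.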
Scatter plots use both the x and y axes to plot data points and are a good means to demonstrate the correlation between variables. This video will demonstrate how to use GRAL to draw scatter plots for 100,000 random data points.
Donut plots, a version of the pie chart, are a popular data visualization technique that shows the proportions in your data. Let’s learn to plot donut plots for 10 random variables.
Area graphs are useful tools to display how quantitative values develop over a given interval. In this video, we will use the GRAL Java library to plot area graphs.
Packt has been committed to developer learning since 2004. A lot has changed in software since then, but Packt has remained responsive to these changes, continuing to look forward at the trends and tools defining the way we work and live, and at how to put them to work.
With an extensive library of content (more than 4000 books and video courses), Packt's mission is to help developers stay relevant in a rapidly changing world. From new web frameworks and programming languages to cutting-edge data analytics and DevOps, Packt takes software professionals in every field to what's important to them now.
From skills that will help you develop and future-proof your career to immediate solutions to everyday tech challenges, Packt is a go-to resource to make you a better, smarter developer.
Packt Udemy courses continue this tradition, bringing you comprehensive yet concise video courses straight from the experts.