Data analysis is the process of inspecting, consolidating, transforming, and making sense of data in a way that guides decision-making. If you want to learn statistical data analysis techniques and implement them using popular Java APIs and libraries, this Learning Path is for you.
Packt’s Video Learning Paths are a series of individual video products put together in a logical and stepwise manner such that each video builds on the skills learned in the video before it.
The highlights of this Learning Path are:
Let’s take a quick look at your learning journey. This Learning Path starts by showing you various techniques for pre-processing your data. You will become well-versed in the basics of data analysis with Java, how data changes state, and how Java fits into the analysis. You will then learn to apply basic analysis to your business needs and create time-series predictions. You will also see how to implement statistical data analysis techniques using Java APIs. If you are looking to build data science models that are fit for production, Java is a strong choice: with libraries such as MLlib, Weka, DL4J, and more, you can efficiently perform the data science tasks you need. This Learning Path will show you how to retrieve data from data sources of varying complexity and how to handle big data to extract meaningful insights. Finally, you will dive into visualizing data to uncover trends and hidden relationships.
By the end of this Learning Path, you will be able to analyze your data, retrieve data from data sources of varying complexity, and write and modify applications that perform data analysis step by step.
Meet Your Experts:
We have combined the best works of the following esteemed authors to ensure that your learning journey is smooth:
Erik Costlow was the principal product manager for Oracle’s launch of Java 8. His background is in software security analysis, dealing with the security issues that rose to the surface within Java 6 and Java 7. While working on the JDK, Erik applied different data analysis techniques to identify and mitigate ways that threats could propagate through the overall Java platform and overlying applications.
Rushdi Shams holds a Ph.D. in the application of machine learning to Natural Language Processing (NLP) problem areas from Western University, Canada. Before starting work as a machine learning and NLP specialist in industry, he taught undergraduate and graduate courses. He maintains a YouTube channel named "Learn with Rushdi" for learning computer technologies.
The aim of this video is to survey data types and data structures.
The aim of this video is to cover how to get data out of a particular format.
The aim of this video is to generate test data.
This video talks about pre-processing data sets.
The aim of this video is to talk about the types of data analysis problems.
The aim of this video is to deal with Business Intelligence. It uses Apache POI to create and read spreadsheets, and shows what users will do in MS Excel.
This video talks about descriptive analysis, which is part of the statistical toolbox and gives you an understanding of the overall data you are looking at.
The aim of this video is to go through random sampling and analyze a subset of the data.
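As a rough illustration of random sampling, the sketch below draws a subset without replacement by shuffling a copy of the data and keeping the first k items. The class name `RandomSampleSketch`, the `sample` method, and the fixed seed are assumptions made for this example and are not taken from the video.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class RandomSampleSketch {

    // Returns k items drawn without replacement from the population.
    // A seed is passed in so that the sample is reproducible.
    static <T> List<T> sample(List<T> population, int k, long seed) {
        if (k < 0 || k > population.size()) {
            throw new IllegalArgumentException("k must be between 0 and the population size");
        }
        List<T> copy = new ArrayList<>(population); // leave the caller's list untouched
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, k));
    }

    public static void main(String[] args) {
        List<Integer> population = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            population.add(i);
        }
        List<Integer> subset = sample(population, 10, 42L);
        double mean = subset.stream().mapToInt(Integer::intValue).average().orElse(Double.NaN);
        System.out.println("sample = " + subset + ", sample mean = " + mean);
    }
}
```

Shuffling a copy keeps the code short; for very large data sets, a streaming approach such as reservoir sampling may be a better fit.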
This video talks about the MySQL database.
The aim of this video is to discuss JDBC and JPA.
The aim of this video is to talk about SQL and NoSQL database systems and to compare and contrast them.
The aim of this video is to cover data conversion in detail.
The aim of this video is to talk about the selection of NoSQL databases.
In this video, we take a look at how to retrieve the file paths and names from a complex directory structure that contains numerous directories and files inside a root directory.
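One possible way to collect file paths from a nested directory tree is sketched below, assuming Java 8's `Files.walk`. The video may use a different approach (for example, manual recursion with `java.io.File`); the class and method names here are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ListFilesSketch {

    // Walks the tree under root and returns every regular file, sorted by path.
    static List<Path> listFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) { // the stream must be closed after use
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses the first argument as the root directory, or the current directory if none is given.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : listFiles(root)) {
            System.out.println(file.toAbsolutePath());
        }
    }
}
```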
File names in hierarchical directories can be listed recursively, as demonstrated in the previous video. However, the Apache Commons IO library makes this easier and requires less code.
There are different ways to read the contents of text files. This video demonstrates how to read the contents of a text file all at once using Java 8.
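A minimal sketch of reading a whole text file in one call is shown below, assuming `Files.readAllBytes` and UTF-8 content. The exact API used in the video is not stated, so treat this as one of several Java 8 options; the names `ReadFileSketch` and `readAll` are made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ReadFileSketch {

    // Reads the entire file into memory and decodes it as UTF-8 (an assumption about the encoding).
    static String readAll(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ReadFileSketch <path-to-text-file>");
            return;
        }
        System.out.println(readAll(Paths.get(args[0])));
    }
}
```

Reading everything at once is convenient for small files; for large files, `Files.lines` streams the content line by line instead.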
PDF is among the most difficult file types to extract data from. Some PDFs cannot be parsed at all because they are password-protected. This video demonstrates how to extract text from PDF files using Apache Tika.
There are several ways to parse contents of XML files. In this video, we are using JDOM for XML parsing.
In this video, we will see how we can read or parse a JSON file. As our sample input file, we will be using the JSON file we created in the previous video.
One of the easiest and handiest ways to extract web data is to use an external Java library named Jsoup. In this video, we use Jsoup to extract web data.
A large amount of data, nowadays, can be found on the Web. This data may be structured, semi-structured, or unstructured. This video uses several of the methods offered by Jsoup to extract web data.
Indexing is the first step toward searching data fast. Internally, Lucene uses an inverted full-text index. In this video, we will demonstrate how to index a large amount of data with Apache Lucene.
Now that we have indexed our data, we will be searching the data using Apache Lucene in this video.
We can generate summary statistics for data by using the SummaryStatistics class. This is similar to the DescriptiveStatistics class used in the preceding video. The major difference is that unlike the DescriptiveStatistics class, the SummaryStatistics class does not store data in memory.
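To illustrate the idea of accumulating statistics without keeping the data in memory, the sketch below uses the JDK's own `java.util.DoubleSummaryStatistics`. This is a stand-in, not the Apache Commons Math `SummaryStatistics` class discussed in the video, and it tracks only count, sum, min, max, and average. The class name `StreamingStatsSketch` is hypothetical.

```java
import java.util.DoubleSummaryStatistics;

final class StreamingStatsSketch {

    // Feeds values one at a time into an accumulator that keeps only running totals,
    // never the individual values themselves.
    static DoubleSummaryStatistics summarize(double[] values) {
        DoubleSummaryStatistics stats = new DoubleSummaryStatistics();
        for (double v : values) {
            stats.accept(v);
        }
        return stats;
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics stats = summarize(new double[] {2.0, 4.0, 6.0, 8.0});
        System.out.println("count   = " + stats.getCount());
        System.out.println("min     = " + stats.getMin());
        System.out.println("max     = " + stats.getMax());
        System.out.println("average = " + stats.getAverage());
    }
}
```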
In this video, we will be creating an AggregateSummaryStatistics instance to accumulate the overall statistics and SummaryStatistics for the sample data.
The Frequency class has methods to count the number of data instances in a bucket, to count the number of unique data instances, and so on. Let’s explore how we compute a frequency distribution.
This video is quite different from the other ones in this section, as it deals with strings and counting word frequencies in a string. We will use both Apache Commons Math and Java 8 for this task.
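The Java 8 half of this task can be sketched with streams alone, as below. The splitting rule (lowercase the text, then split on non-word characters) and the names `WordFrequencySketch` and `wordFrequencies` are assumptions for illustration; the Apache Commons Math variant is not shown.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class WordFrequencySketch {

    // Lowercases the text, splits it on runs of non-word characters,
    // and counts how often each word occurs. A TreeMap keeps the words sorted.
    static Map<String, Long> wordFrequencies(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\W+"))
                .filter(word -> !word.isEmpty()) // a leading separator yields an empty token
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordFrequencies("The cat and the hat. The end!");
        counts.forEach((word, count) -> System.out.println(word + " -> " + count));
    }
}
```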
Unbiased covariances are given by the formula cov(X, Y) = sum[(xi - E(X))(yi - E(Y))] / (n - 1), and Pearson's correlation is defined by the formula cor(X, Y) = sum[(xi - E(X))(yi - E(Y))] / [(n - 1)s(X)s(Y)], where E(X) is the mean of the X values, E(Y) is the mean of the Y values, and s(X) and s(Y) are the sample standard deviations. Non-bias-corrected estimates use n in place of n - 1. Let’s see how we calculate them in our code.
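The two formulas can be written out directly in plain Java, as in the sketch below. It does not use the Apache Commons Math classes from the video; `CorrelationSketch` and its method names are hypothetical, and the standard deviations are taken as the square roots of the unbiased variances, cov(X, X) and cov(Y, Y).

```java
final class CorrelationSketch {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Unbiased covariance: sum[(xi - E(X))(yi - E(Y))] / (n - 1).
    static double covariance(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length with at least 2 values");
        }
        double meanX = mean(x);
        double meanY = mean(y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (x.length - 1);
    }

    // Pearson's correlation: cov(X, Y) / (s(X) s(Y)), with s(X) = sqrt(cov(X, X)).
    static double pearson(double[] x, double[] y) {
        double sx = Math.sqrt(covariance(x, x));
        double sy = Math.sqrt(covariance(y, y));
        return covariance(x, y) / (sx * sy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 4, 6, 8, 10};
        System.out.println("covariance  = " + covariance(x, y));
        System.out.println("correlation = " + pearson(x, y));
    }
}
```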
The SimpleRegression class supports ordinary least squares regression with one independent variable: y = intercept + slope * x, where the intercept is an optional parameter. In this video, the data points are added one at a time.
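To make the model y = intercept + slope * x concrete, here is a hand-rolled least-squares fit in plain Java. It is only a sketch of the underlying arithmetic, not the Commons Math `SimpleRegression` API; the class `SimpleOlsSketch` and its methods are invented for this example.

```java
final class SimpleOlsSketch {

    // Fits y = intercept + slope * x by ordinary least squares.
    // Returns {intercept, slope}.
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length with at least 2 points");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (xi - meanX)(yi - meanY)
        double sxx = 0.0; // sum of (xi - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }

    public static void main(String[] args) {
        double[] model = fit(new double[] {1, 2, 3, 4}, new double[] {3, 5, 7, 9});
        System.out.println("intercept = " + model[0] + ", slope = " + model[1]);
        System.out.println("prediction at x = 5: " + predict(model, 5));
    }
}
```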
The OLSMultipleLinearRegression class provides Ordinary Least Squares regression to fit the linear model Y = X*b + u. Here, Y is an n-vector regressand, X is an [n,k] matrix whose k columns are called regressors, b is a k-vector of regression parameters, and u is an n-vector of error terms or residuals. Let’s see how we compute ordinary least squares regression in this video.
ANOVA stands for Analysis of Variance. In this video, we will see how to use Java to perform a one-way ANOVA test to determine whether the means of three or more independent and unrelated sets of data points are significantly different.
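The core of a one-way ANOVA is the F statistic, the ratio of between-group to within-group variance. The sketch below computes only that statistic in plain Java; turning it into a p-value requires the F distribution, which a statistics library would normally supply. The class and method names are hypothetical.

```java
final class OneWayAnovaSketch {

    // F = [SS_between / (k - 1)] / [SS_within / (N - k)]
    // for k groups containing N observations in total.
    static double fStatistic(double[][] groups) {
        int k = groups.length;
        if (k < 2) {
            throw new IllegalArgumentException("need at least two groups");
        }
        int n = 0;
        double total = 0.0;
        for (double[] group : groups) {
            if (group.length == 0) {
                throw new IllegalArgumentException("groups must not be empty");
            }
            n += group.length;
            for (double v : group) {
                total += v;
            }
        }
        if (n <= k) {
            throw new IllegalArgumentException("need more observations than groups");
        }
        double grandMean = total / n;

        double ssBetween = 0.0;
        double ssWithin = 0.0;
        for (double[] group : groups) {
            double sum = 0.0;
            for (double v : group) {
                sum += v;
            }
            double groupMean = sum / group.length;
            ssBetween += group.length * (groupMean - grandMean) * (groupMean - grandMean);
            for (double v : group) {
                ssWithin += (v - groupMean) * (v - groupMean);
            }
        }
        return (ssBetween / (k - 1)) / (ssWithin / (n - k));
    }

    public static void main(String[] args) {
        double[][] groups = {{1, 2, 3}, {2, 3, 4}, {5, 6, 7}};
        System.out.println("F = " + fStatistic(groups));
    }
}
```

A large F value suggests that at least one group mean differs from the others; how large is "large" depends on the degrees of freedom k - 1 and N - k.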
The Kolmogorov-Smirnov test (or simply KS test) is a test of equality for one-dimensional probability distributions that are continuous in nature. It is one of the popular methods to determine whether two sets of data points differ significantly.
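The two-sample KS statistic is the largest vertical gap between the empirical distribution functions of the two samples. A minimal plain-Java sketch of that statistic follows; it does not compute a p-value, and the names `KsStatisticSketch` and `statistic` are assumptions for illustration rather than a library API.

```java
import java.util.Arrays;

final class KsStatisticSketch {

    // D = max over x of |F_a(x) - F_b(x)|, where F_a and F_b are the
    // empirical cumulative distribution functions of the two samples.
    static double statistic(double[] a, double[] b) {
        if (a.length == 0 || b.length == 0) {
            throw new IllegalArgumentException("samples must not be empty");
        }
        double[] x = a.clone();
        double[] y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);

        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < x.length && j < y.length) {
            double v = Math.min(x[i], y[j]);
            // Advance past every observation equal to v in both samples (handles ties).
            while (i < x.length && x[i] <= v) {
                i++;
            }
            while (j < y.length && y[j] <= v) {
                j++;
            }
            double gap = Math.abs((double) i / x.length - (double) j / y.length);
            d = Math.max(d, gap);
        }
        // Once one sample is exhausted its ECDF is 1, so the gap can only shrink from here.
        return d;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};
        double[] b = {3, 4, 5, 6};
        System.out.println("D = " + statistic(a, b));
    }
}
```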
Weka's native file format is called Attribute-Relation File Format (ARFF). Let’s learn in detail about this file and how to work with it.
In this video, we will create four methods: one method will load an ARFF file, the second method will read the data in the ARFF file and generate a machine learning model, the third method will save the model using serialization, and the last method will evaluate the model on the ARFF file.
The classic supervised machine-learning classification task is to train a classifier on the labeled training instances and to apply the classifier on unseen test instances.
Many times, you will need to use a filter before you develop a classifier. The filter can be used to remove, transform, discretize, and add attributes; remove misclassified instances, randomize, or normalize instances; and so on. Let’s see how we can use a filter and classifier at the same time to classify unseen test examples.
Weka has a class named Logistic, which can be used to build and use a multinomial logistic regression model with a ridge estimator. We will explore how to use Weka to generate a logistic regression model.
If you have a dataset with classes, which is an unusual case for unsupervised learning, Weka has a method called clustering from classes. We will cover this method in this video.
Association rule learning is a machine learning technique to discover associations and rules between various features or variables in a dataset. Let’s see how we can use Weka to learn association rules from datasets.
In Weka, there are three ways of selecting attributes. This video will use all of the three ways of attribute selection techniques available in Weka: the low-level attribute selection method, attribute selection using a filter, and attribute selection using a meta-classifier.
The Stanford classifier is a machine learning classifier developed at Stanford University by the Stanford Natural Language Processing group. The software is implemented in Java and uses maximum entropy as its classifier.
So far, we have seen multiclass classifications that aim to classify a data instance into one of the several classes. Multilabeled data instances are data instances that can have multiple classes or labels. The machine learning tools that we have used so far are not capable of handling data points that have this characteristic of having multiple target classes.
One of the most common tasks that a data scientist needs to do with text data is to detect the tokens in it. This task is called tokenization.
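As a small illustration of tokenization with nothing but the JDK, the sketch below uses `java.text.BreakIterator` to find word boundaries and keeps only the pieces that start with a letter or digit. The choice of `BreakIterator`, the English locale, and the names `TokenizerSketch` and `tokenize` are assumptions for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenizerSketch {

    // Splits text at word boundaries and drops whitespace and punctuation segments.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ENGLISH);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String candidate = text.substring(start, end);
            if (Character.isLetterOrDigit(candidate.charAt(0))) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! Java rocks."));
    }
}
```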
The preceding two videos in this section detected tokens and sentences using legacy Java classes and their methods. In this video, we will combine the two tasks of detecting tokens and sentences using Apache OpenNLP, an open source library.
Now that we know how to extract tokens or words from a given text, we will see how we can get different types of information from the tokens, such as their lemmas and part of speech, and whether the token is a named entity.
Data scientists often measure the distance or similarity between two data points for classification, clustering, detecting outliers, and for many other cases. When they deal with texts as data points, the traditional distance or similarity measurements cannot be used.
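One widely used similarity measure for text is cosine similarity over word-count vectors. The source does not say which measure the video uses, so the sketch below is only an assumed example of the general idea; the class `CosineSimilaritySketch`, its methods, and the simple lowercase-and-split tokenization are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CosineSimilaritySketch {

    // Builds a bag-of-words vector: word -> number of occurrences.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    static double norm(Map<String, Integer> counts) {
        double sumOfSquares = 0.0;
        for (int count : counts.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between the two word-count vectors:
    // 1.0 for identical word distributions, 0.0 when no words are shared.
    static double cosine(String a, String b) {
        Map<String, Integer> countsA = termCounts(a);
        Map<String, Integer> countsB = termCounts(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : countsA.entrySet()) {
            Integer other = countsB.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double normA = norm(countsA);
        double normB = norm(countsB);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        System.out.println(cosine("Java is great for data analysis", "Data analysis with Java"));
    }
}
```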
With an ever-increasing number of documents in text format nowadays, an important task for any data scientist is to get an overview of a large number of articles through abstracts, summaries, or a list of abstract topics. The goal is not to save the time needed to read through the articles but to support clustering, classification, semantic relatedness measurement, and sentiment analysis.
Our final two videos in this section will be on the classical machine learning classification problem, that is, the classification of documents using language modeling. In this video, we will use Mallet and its command-line interface to train a model and apply the model to unseen test data.
We used Weka to classify data points that are not in a text format. Weka is a very useful tool to classify text documents using machine learning models as well. This video will demonstrate how you can use Weka 3 to develop a document classification model.
In this video, we will train an online logistic regression model using the Apache Mahout Java library.
This video will demonstrate how we can apply an online logistic regression model to unseen, unlabeled test data using Apache Mahout.
A deep-belief network can be defined as a stack of restricted Boltzmann machines, where each RBM layer communicates with both the previous and subsequent layers. In this video, we will see how to create such a network.
A deep autoencoder is a deep neural network that is composed of two deep-belief networks that are symmetrical. In this video, we will develop a deep autoencoder consisting of one input layer, four decoding layers, four encoding layers, and one output layer.
Histograms are a very popular method of discovering the frequency distribution of a set of continuous data. Let’s take a look at how to plot histograms using GRAL.
Bar plots are among the most common graph types used by data scientists. It is simple to draw a bar chart using GRAL. In this video, we will use GRAL to plot one.
Area graphs are useful tools to display how quantitative values develop over a given interval. In this video, we will use the GRAL Java library to plot area graphs.
Packt has been committed to developer learning since 2004. A lot has changed in software since then, but Packt has remained responsive to these changes, continuing to look forward at the trends and tools defining the way we work and live, and at how to put them to work.
With an extensive library of content (more than 4000 books and video courses), Packt's mission is to help developers stay relevant in a rapidly changing world. From new web frameworks and programming languages to cutting-edge data analytics and DevOps, Packt takes software professionals in every field to what's important to them now.
From skills that will help you develop and future-proof your career to immediate solutions to everyday tech challenges, Packt is a go-to resource to make you a better, smarter developer.
Packt Udemy courses continue this tradition, bringing you comprehensive yet concise video courses straight from the experts.