Learning Path: Java: Big Data Analysis with Java

3 students enrolled

Handle big data and apply visualization techniques to gain deeper insights into your data

What Will I Learn?

- Familiarize yourself with various data pre-processing techniques
- Get to know the basics of data analysis and explore how data changes state
- Implement statistical data analysis techniques using Java APIs
- Work with NoSQL databases
- Find out how to clean and make datasets ready so you can acquire actual insights by removing noise and outliers
- Develop the skills to use modern machine learning techniques to retrieve information and transform data to knowledge
- Perform clustering and feature selection exercises using the Weka machine learning Workbench
- Learn to apply core Java and popular libraries, such as OpenNLP, Stanford CoreNLP, Mallet, and Weka
- Familiarize yourself with the very basics of deep learning using the deep learning for Java (DL4j) library

Requirements

- Basic programming knowledge of Java
- Basic working knowledge of MySQL

Description

Data analysis is a process of inspecting, consolidating, transforming, and making sense of data in a way that guides the decision-making process. If you're interested in learning statistical data analysis techniques and implementing them with popular Java APIs and libraries, this Learning Path is for you.

Packt’s Video Learning Paths are a series of individual video products put together in a logical and stepwise manner such that each video builds on the skills learned in the video before it.

The highlights of this Learning Path are:

- Get your basics on point to perform data analysis with Java
- Solutions to help you overcome your data science hurdles using Java

Let’s take a quick look at your learning journey. This Learning Path starts by showing you various techniques for pre-processing your data. You will get well-versed with the basics of data analysis in Java, how data changes state, and how Java fits into the analysis. You will then learn to apply basic analysis to your business needs and create time-series predictions. You will also see how to implement statistical data analysis techniques using Java APIs. If you are looking to build data science models that are good for production, Java has come to the rescue: with the aid of strong libraries such as MLlib, Weka, DL4j, and more, you can efficiently perform all the data science tasks you need to. This Learning Path will teach you how to retrieve data from data sources of varying complexity and how to handle big data to extract meaningful insights. Finally, you will dive into visualizing data to uncover trends and hidden relationships.

By the end of this Learning Path, you will be able to analyze your data, retrieve data from data sources of varying complexity, and write and modify applications that perform data analysis, all in a step-by-step manner.

**Meet Your Experts**:

We have combined the best works of the following esteemed authors to ensure that your learning journey is smooth:

**Erik Costlow** was the principal product manager for Oracle’s launch of Java 8. His background is in software security analysis, dealing with the security issues that rose to the surface within Java 6 and Java 7. While working on the JDK, Erik applied different data analysis techniques to identify and mitigate ways that threats could propagate through the overall Java platform and overlying applications.

**Rushdi Shams** has a Ph.D. in the application of machine learning to Natural Language Processing (NLP) problems from Western University, Canada. Before starting work as a machine learning and NLP specialist in industry, he taught undergraduate and graduate courses. He maintains a YouTube channel named "Learn with Rushdi" for learning computer technologies.

Who is the target audience?

- This Learning Path is for mid-level developers and architects who are familiar with Java programming, and for Java developers who know the fundamentals of data science and want to improve their skills to become pros.

Curriculum For This Course

93 Lectures

07:00:53

Basic Data Analysis with Java
24 Lectures
03:01:41

This video will give an overview of the entire course

Preview
02:03

The aim of this video is to discuss the purpose of data analysis and what problems can be solved using data analysis

The Purpose of Data Analysis

09:39

The aim of this video is to talk about surveying data types and data structures

Surveying Data Types and Data Structures

07:00

The aim of this video is to cover how to get the data out of a particular format

Data Sets and File Formats

09:14

The aim of this video is to generate test data

Generating Test Data

07:13

This video talks about the pre-processing of data sets

Pre-processing Data Sets

06:40

The aim of this video is to talk about the types of data analysis problems

Preview
07:40

The aim of this video is to learn about the Java components, specifically the lambda and Streams APIs introduced in Java 8

Java Components

06:52

The aim of this video is to deal with business intelligence. It will use Apache POI to create and read spreadsheets, as well as show what users would do in MS Excel

Business Intelligence

07:01

The aim of this video is to talk about the role of time-series predictions

Time Series Predictions

10:01

This video talks about descriptive statistics, which are part of the statistical toolbox and give you an understanding of the overall data you are looking at

Descriptive Statistics

06:06

The aim of this video is to go through random sampling and analyze a subset of the data

Random Sampling

07:26

This video talks about elementary probability and how to perform some basic statistical calculations

Elementary Probability

07:07

The aim of this video is to talk about Bayes’ theorem, the statistical way of evaluating the connection between events

Bayes’ Theorem

09:38
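As a quick illustration of the theorem this video covers, here is a minimal plain-Java sketch; the class name, method name, and the diagnostic-test numbers are illustrative, not taken from the course:

```java
// Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
// Hypothetical example: a diagnostic test with 99% sensitivity and a 5%
// false-positive rate, for a condition with 1% prevalence.
public class BayesExample {
    static double posterior(double pBGivenA, double pA, double pBGivenNotA) {
        // Law of total probability gives the denominator P(B).
        double pB = pBGivenA * pA + pBGivenNotA * (1 - pA);
        return pBGivenA * pA / pB;
    }

    public static void main(String[] args) {
        double p = posterior(0.99, 0.01, 0.05);
        System.out.printf("P(condition | positive test) = %.4f%n", p); // 0.1667
    }
}
```

Even with a 99% sensitive test, the low prevalence drags the posterior down to about 1 in 6, which is exactly the kind of connection between events the video evaluates.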

This video talks about tables and databases and how to use JDBC

Tables and Databases

09:34

This video talks about the MySQL database.

MySQL

06:10

The aim of this video is to discuss JDBC and JPA

Code Using JDBC or JPA

10:14

The aim of this video is to talk about SQL and NoSQL database systems and compare and contrast them

SQL Versus NoSQL Database Systems

05:55

The aim of this video is to discuss the XML and JSON formats and how to deal with each of them.

XML and JSON Data Formats

07:29

The aim of this video is to cover data conversion in detail

Data Conversion

08:28

The aim of this video is to talk about the selection of NoSQL databases

Selection

06:58

The aim of this video is to discuss data subsetting and how to deal with large data sets in a program

Subsetting

06:31

The aim of this video is to talk about the date and time APIs in JDK 8 and how they pertain to NoSQL

DateTime APIs in JDK 8

06:45
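For reference, a minimal sketch of the JDK 8 `java.time` API mentioned above, formatting a timestamp as the kind of ISO-8601 string commonly stored in NoSQL documents (the class and method names here are our own, not from the course):

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// JDK 8's java.time types are immutable and thread-safe, which makes them
// convenient for serializing timestamps into document stores.
public class DateTimeSketch {
    public static String toIsoString(Instant instant) {
        // ISO_INSTANT renders an Instant as e.g. "2017-09-14T00:00:00Z".
        return DateTimeFormatter.ISO_INSTANT.format(instant);
    }

    public static void main(String[] args) {
        Instant ts = LocalDate.of(2017, 9, 14)
                              .atStartOfDay()
                              .toInstant(ZoneOffset.UTC);
        System.out.println(toIsoString(ts)); // 2017-09-14T00:00:00Z
    }
}
```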

The aim of this video is to discuss resampling and how to look at smaller subsets of data

Resampling

09:57


Java Data Science Solutions - Analyzing Data
30 Lectures
01:29:26

This video will give an overview of the entire course.

Preview
01:46

In this video, we take a look at how to retrieve the file paths and names from a complex directory structure that contains numerous directories and files inside a root directory.

Retrieving All Filenames from Hierarchical Directories Using Java

02:30
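The recursive listing described above can be sketched with core Java NIO alone; this illustrative snippet (class and method names are our own) walks an entire directory tree with `Files.walk`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Files.walk traverses a directory tree recursively and yields every
// path under the root, so no hand-written recursion is needed.
public class FileLister {
    public static List<String> listFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) { // close the stream to release handles
            return paths.filter(Files::isRegularFile)
                        .map(Path::toString)
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Replace "." with any root directory you want to scan.
        for (String name : listFiles(Paths.get("."))) {
            System.out.println(name);
        }
    }
}
```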

File names in hierarchical directories can be listed recursively, as demonstrated in the previous video. However, the Apache Commons IO library lets you do this in a much easier and more convenient way, with less code.

Retrieving All Filenames from Hierarchical Directories Using Apache Commons IO

02:12

There are different ways to read text file contents. This video demonstrates how to read a text file's contents all at once using Java 8.

Reading Contents from Text Files All at Once Using Java 8

02:22
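A minimal sketch of the "all at once" read this video describes, using only core Java NIO (the class and method names are illustrative, not the course's code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Files.readAllBytes slurps the whole file in one call; for line-oriented
// processing, Files.readAllLines or Files.lines work the same way.
public class FileSlurp {
    public static String readAll(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello, data".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(tmp)); // prints: hello, data
    }
}
```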

Another way to read a text file's contents all at once is to use Apache Commons IO. Let’s see how the same functionality described in the previous video can be achieved using the Apache Commons IO API.

Reading Contents from Text Files All at Once Using Apache Commons IO

02:25

PDF is one of the most difficult file formats to extract data from; some PDFs are impossible to parse because they are password-protected. This video demonstrates how to extract text from PDF files using Apache Tika.

Extracting PDF Text Using Apache Tika

03:04

ASCII text files can contain unnecessary characters that are introduced during a conversion process. In this video, we clean several kinds of noise from ASCII text data using regular expressions.

Cleaning ASCII Text Files Using Regular Expressions

01:47
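The kind of regex cleaning this video covers can be sketched in a few lines of core Java; the specific patterns below are illustrative examples of noise removal, not the course's exact rules:

```java
// Two common cleaning passes: drop non-printable/non-ASCII characters
// left over from a conversion, then collapse runs of whitespace.
public class TextCleaner {
    public static String clean(String raw) {
        return raw
            .replaceAll("[^\\x20-\\x7E\\n\\t]", "") // keep printable ASCII, newline, tab
            .replaceAll("[ \\t]+", " ")             // collapse space/tab runs
            .trim();
    }

    public static void main(String[] args) {
        String noisy = "Hello\u00A0\u0000  world\t!";
        System.out.println(clean(noisy)); // prints: Hello world !
    }
}
```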

In this video, we will parse CSV files and handle the data points retrieved from them.

Parsing Comma-Separated and Tab-Separated Value Files Using Univocity

07:21

There are several ways to parse contents of XML files. In this video, we are using JDOM for XML parsing.

Parsing XML Files Using JDOM

03:36

Just like XML, JSON is a lightweight, human-readable data interchange format. In this video, we write JSON files.

Writing JSON Files Using JSON.Simple

03:12

In this video, we will see how we can read or parse a JSON file. As our sample input file, we will be using the JSON file we created in the previous video.

Reading JSON Files Using JSON.Simple

02:51

One of the easiest and handiest ways to extract web data is to use an external Java library named Jsoup. In this video, we use Jsoup to extract web data.

Extracting Web Data from a URL Using Jsoup

03:36

A large amount of data, nowadays, can be found on the Web. This data is sometimes structured, semi-structured, or even unstructured. This video uses a certain number of methods offered in JSoup to extract web data.

Extracting Web Data from a Website Using Selenium Web Driver

02:42

Data can be stored in database tables too. In this video, we will read data from a table in a MySQL database.

Reading Table Data from a MySQL Database

04:29

Indexing is the first step toward searching data fast. Under the hood, Lucene uses an inverted full-text index. In this video, we will demonstrate how to index a large amount of data with Apache Lucene.

Indexing Data with Apache Lucene

10:00

Now that we have indexed our data, we will be searching the data using Apache Lucene in this video.

Searching Indexed Data with Apache Lucene

04:13

Descriptive statistics are used to summarize a sample. Inferential statistics are mostly used to draw a conclusion about the population from a representative sample. In this video, we will see how we can use Java to generate descriptive statistics from small samples.

Generating Descriptive Statistics

02:58

We can generate summary statistics for data by using the SummaryStatistics class. This is similar to the DescriptiveStatistics class used in the preceding video. The major difference is that unlike the DescriptiveStatistics class, the SummaryStatistics class does not store data in memory.

Generating Summary Statistics

01:31

In this video, we will be creating an AggregateSummaryStatistics instance to accumulate the overall statistics and SummaryStatistics for the sample data.

Generating Summary Statistics from Multiple Distributions

01:47

The Frequency class has methods to count the number of data instances in a bucket, to count the number of unique data instances, and so on. Let’s explore how we compute a frequency distribution.

Computing Frequency Distribution

01:48

This video is quite different from the other ones in this section, as it deals with strings and counting word frequencies in a string. We will use both Apache Commons Math and Java 8 for this task.

Counting Word Frequency in a String

01:28

This video does not use the Apache Commons Math library to count frequencies of words in a given string; rather, it uses core libraries and mechanisms introduced in Java 8.

Counting Word Frequency in a String Using Java 8

01:49
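A minimal Java 8 streams sketch of the word-counting task this video covers (the class and method names are illustrative, not the course's code):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// groupingBy + counting builds a word -> frequency map in one pipeline,
// with no external library involved.
public class WordFrequency {
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.groupingBy(Function.identity(),
                                                    Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> freq = count("To be, or not to be.");
        System.out.println(freq.get("to")); // 2
        System.out.println(freq.get("be")); // 2
    }
}
```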

Unbiased covariances are given by the formula cov(X, Y) = sum [(xi - E(X))(yi - E(Y))] / (n - 1) and Pearson's correlation computes correlations defined by the formula cor(X, Y) = sum[(xi - E(X))(yi - E(Y))] / [(n - 1)s(X)s(Y)], where E(X) is the mean of X and E(Y) is the mean of the Y values. Non-bias-corrected estimates use n in place of n - 1. Let’s see how we calculate them in our code.

Calculating Covariance and Pearson's Correlation of Two Sets of Data Points

03:10
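The formulas quoted above translate directly into plain Java. This is an illustrative sketch under those exact definitions (class and method names are our own, not the Apache Commons Math API used in the video):

```java
// cov(X, Y) = sum[(xi - E(X))(yi - E(Y))] / (n - 1)   (unbiased)
// cor(X, Y) = cov(X, Y) / (s(X) * s(Y))
public class CovCorr {
    static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }

    static double covariance(double[] x, double[] y) {
        double mx = mean(x), my = mean(y), s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - mx) * (y[i] - my);
        return s / (x.length - 1); // bias-corrected: divide by n - 1
    }

    static double pearson(double[] x, double[] y) {
        // s(X) is the square root of the variance, i.e. cov(X, X).
        return covariance(x, y)
             / (Math.sqrt(covariance(x, x)) * Math.sqrt(covariance(y, y)));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8}; // perfectly linear, so correlation is 1 (up to rounding)
        System.out.println(pearson(x, y));
    }
}
```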

The SimpleRegression class supports ordinary least squares regression with one independent variable: y = intercept + slope * x, where the intercept is an optional parameter. In this video, the data points are added one at a time.

Computing Simple Regression

02:54
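The one-variable least squares model described above has a closed-form solution. Here is a plain-Java sketch of it; the class and field names are our own illustration, not the SimpleRegression API the video uses:

```java
// Ordinary least squares for y = intercept + slope * x:
// slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2), intercept = (Sy - slope*Sx) / n.
public class SimpleOls {
    final double slope, intercept;

    SimpleOls(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        intercept = (sy - slope * sx) / n;
    }

    public static void main(String[] args) {
        // Points lying exactly on y = 1 + 2x.
        SimpleOls fit = new SimpleOls(new double[]{1, 2, 3}, new double[]{3, 5, 7});
        System.out.println(fit.slope + " " + fit.intercept); // prints: 2.0 1.0
    }
}
```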

The OLSMultipleLinearRegression class provides ordinary least squares regression to fit the linear model Y = X*b + u. Here, Y is an n-vector regressand, X is an [n, k] matrix whose k columns are called regressors, b is a k-vector of regression parameters, and u is an n-vector of error terms or residuals. Let’s see how we compute ordinary least squares regression in this video.

Computing Ordinary Least Squares Regression

02:49

In this video, we will see another variant of least squares regression named generalized least squares regression. GLSMultipleLinearRegression implements Generalized Least Squares to fit the linear model Y=X*b+u.

Computing Generalized Least Squares Regression

02:25

Apache Commons Math supports both one-sample and two-sample t-tests. Two-sample tests can be either paired or unpaired, and the unpaired two-sample tests can be conducted with or without the assumption that the subpopulation variances are equal. We demonstrate the paired t-test in this video.

Conducting a Paired T Test

01:59

For conducting a Chi-square test on two sets of data distributions, one distribution will be called the observed distribution and the other distribution will be called the expected distribution.

Conducting a Chi-Square Test

02:03

ANOVA stands for Analysis of Variance. In this video, we will see how to use Java to perform a one-way ANOVA test to determine whether the means of three or more independent and unrelated sets of data points are significantly different.

Conducting the One-Way ANOVA Test

02:08

The Kolmogorov-Smirnov test (or simply KS test) is a test of equality for one-dimensional probability distributions that are continuous in nature. It is one of the popular methods to determine whether two sets of data points differ significantly.

Conducting a Kolmogorov-Smirnov Test

02:31


Java Data Science Solutions - Big Data and Visualization
39 Lectures
02:29:46

This video will give an overview of the entire course.

Preview
02:31

Weka's native file format is called Attribute-Relation File Format (ARFF). Let’s learn in detail about this file and how to work with it.

Creating and Saving an ARFF File

06:29

In this video, we will create four methods: one will load an ARFF file, the second will read the data in the ARFF file and generate a machine learning model, the third will save the model using serialization, and the last will evaluate the model on the ARFF file.

Cross-Validating a Machine Learning Model

03:01

The classic supervised machine-learning classification task is to train a classifier on the labeled training instances and to apply the classifier on unseen test instances.

Classifying Unseen Test Data

05:34

Many times, you will need to use a filter before you develop a classifier. The filter can be used to remove, transform, discretize, and add attributes; remove misclassified instances, randomize, or normalize instances; and so on. Let’s see how we can use a filter and classifier at the same time to classify unseen test examples.

Classifying Unseen Test data with a Filtered Classifier

02:43

Most linear regression modeling follows a general pattern: many independent variables collectively produce a result, the dependent variable. Let’s use Weka's linear regression classifier.

Generating Linear Regression Models

02:14

Weka has a class named Logistic, which can be used to build and use a multinomial logistic regression model with a ridge estimator. We will explore how to use Weka to generate a logistic regression model.

Generating Logistic Regression Models

01:50

In this video, we will use the K-means algorithm to cluster or group data points of a dataset together.

Clustering Data Points Using the K-means Algorithm

02:18

If you have a dataset with classes, which is an unusual case for unsupervised learning, Weka has a method called clustering from classes. We will cover this method in this video.

Clustering Data from Classes

02:18

Association rule learning is a machine learning technique to discover associations and rules between various features or variables in a dataset. Let’s see how we can use Weka to learn association rules from datasets.

Learning Association Rules from Data

02:07

In Weka, there are three ways of selecting attributes. This video will use all three attribute selection techniques available in Weka: the low-level attribute selection method, attribute selection using a filter, and attribute selection using a meta-classifier.

Selecting Features and Attributes

04:50

The Java Machine Learning (Java-ML) library is a collection of standard machine learning algorithms. Unlike Weka, the library does not have a GUI, because it is primarily aimed at software developers.

Applying Machine Learning on Data Using the Java-ML Library

11:54

The Stanford classifier is a machine learning classifier developed at Stanford University by the Stanford Natural Language Processing Group. The software is implemented in Java and uses maximum entropy as its classifier.

Classifying Data Points Using the Stanford Classifier

05:40

Massive Online Analysis (MOA) is related to Weka but offers greater scalability. It is a notable Java workbench for data stream mining. With a strong community in place, MOA has implementations of classification, clustering, regression, concept drift identification, and recommender systems.

Classifying Data Points Using Massive Online Analysis (MOA)

03:33

So far, we have seen multiclass classifications that aim to classify a data instance into one of the several classes. Multilabeled data instances are data instances that can have multiple classes or labels. The machine learning tools that we have used so far are not capable of handling data points that have this characteristic of having multiple target classes.

Classifying Multilabeled Data Points Using Mulan

05:33

One of the most common tasks that a data scientist needs to do using text data is to detect tokens from it. This task is called tokenization.

Detecting Tokens Using Java

04:13
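The tokenization task this video covers can be sketched with core Java's locale-aware `BreakIterator`, with no external NLP library; the class and method names below are illustrative, not the course's code:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// BreakIterator.getWordInstance finds word boundaries; we walk the
// boundary pairs and keep every non-whitespace segment as a token.
public class Tokenizer {
    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        for (int start = it.first(), end = it.next();
             end != BreakIterator.DONE;
             start = end, end = it.next()) {
            String tok = text.substring(start, end).trim();
            if (!tok.isEmpty()) out.add(tok); // skip pure-whitespace segments
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Data science, in Java!"));
    }
}
```

Note that punctuation comes out as its own token, which is usually what downstream text processing wants.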

Sentences are a very important text unit for data scientists to experiment with in routine exercises such as classification. In this video, we will see how we can detect sentences so that we can use them for further analysis.

Detecting Sentences Using Java

01:36

The preceding two videos in this section detected tokens and sentences using legacy Java classes and methods. In this video, we will combine the two tasks of detecting tokens and sentences using an open-source Apache library named OpenNLP.

Detecting Tokens (words) and Sentences Using OpenNLP

04:31

Now that we know how to extract tokens or words from a given text, we will see how we can get different types of information from the tokens, such as their lemmas and part of speech, and whether the token is a named entity.

Retrieving Lemma and Part of Speech, and Recognizing Named Entities from Tokens

03:12

Data scientists often measure the distance or similarity between two data points for classification, clustering, detecting outliers, and for many other cases. When they deal with texts as data points, the traditional distance or similarity measurements cannot be used.

Measuring Text Similarity with Cosine Similarity Measure Using Java 8

03:05
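A minimal Java 8 sketch of the cosine similarity measure this video applies, treating each text as a term-frequency vector; the class and method names are illustrative, not the course's code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// cosine(a, b) = dot(a, b) / (|a| * |b|) over term-frequency vectors.
public class CosineSimilarity {
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) tf.merge(w, 1, Integer::sum);
        }
        return tf;
    }

    static double cosine(String a, String b) {
        Map<String, Integer> ta = termFreq(a), tb = termFreq(b);
        Set<String> vocab = new HashSet<>(ta.keySet());
        vocab.addAll(tb.keySet());
        double dot = 0, na = 0, nb = 0;
        for (String w : vocab) {
            int x = ta.getOrDefault(w, 0), y = tb.getOrDefault(w, 0);
            dot += x * y; na += x * x; nb += y * y;
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Two of three terms shared, so the similarity is 2/3 ≈ 0.667.
        System.out.println(cosine("java data science", "java data analysis"));
    }
}
```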

With an ever-increasing number of documents in text format nowadays, an important task for any data scientist is to get an overview of a large number of articles through abstracts, summaries, or a list of abstract topics, not only because this saves time reading through the articles but also to do clustering, classification, semantic relatedness measurement, and sentiment analysis.

Extracting Topics from Text Documents Using Mallet

07:36

Our final two videos in this section cover the classical machine learning classification problem, that is, the classification of documents using language modeling. In this video, we will use Mallet and its command-line interface to train a model and apply it to unseen test data.

Classifying Text Documents Using Mallet

05:27

We used Weka to classify data points that are not in a text format. Weka is a very useful tool to classify text documents using machine learning models as well. This video will demonstrate how you can use Weka 3 to develop a document classification model.

Classifying Text Documents Using Weka

03:31

In this video, we will train an online logistic regression model using the Apache Mahout Java library.

Training an Online Logistic Regression Model Using Apache Mahout

04:28

This video will demonstrate how we can apply an online logistic regression model to unseen, unlabeled test data using Apache Mahout.

Applying an Online Logistic Regression Model Using Apache Mahout

03:26

In this video, we will demonstrate how to use Apache Spark to solve very simple data problems. Of course, the data problems are merely dummy problems and not real-world problems, but this can be a starting point for you to understand intuitively the use of Apache Spark on a large scale.

Solving Simple Text Mining Problems with Apache Spark

02:52

MLlib is the machine learning component of Apache Spark and a competitive alternative to Apache Mahout. This video will demonstrate how we can cluster unlabeled data points using the K-means algorithm with MLlib.

Clustering Using the K-means Algorithm with MLlib

02:07

In this video, we will explore how to create a linear regression model with MLlib.

Creating a Linear Regression Model with MLlib

02:54

In this video, we will demonstrate how you can classify data points using the random forest algorithm with MLlib.

Classifying Data Points with a Random Forest Model Using MLlib

03:18

Word2vec can be seen as a two-layer neural net that works with natural text. In this video, we will use the Deeplearning4j (DL4J) deep learning library to apply Word2vec to a raw text.

Creating a Word2vec Neural Net

05:33

A deep-belief network can be defined as a stack of restricted Boltzmann machines, where each RBM layer communicates with both the previous and subsequent layers. In this video, we will see how to create such a network.

Creating a Deep Belief Neural Net

04:26

A deep autoencoder is a deep neural network that is composed of two deep-belief networks that are symmetrical. In this video, we will develop a deep autoencoder consisting of one input layer, four decoding layers, four encoding layers, and one output layer.

Creating a Deep Autoencoder

02:59

Sine graphs can be particularly useful for data scientists, since sine curves can be used to model fluctuations in data. In this video, we will use GRAL, a free Java graphing library, to plot a 2D sine graph.

Plotting a 2D Sine Graph

03:39

Histograms are a very popular method of discovering the frequency distribution of a set of continuous data. Let’s take a look at how to plot histograms using GRAL.

Plotting Histograms

03:55

Bar plots are among the most common graph types used by data scientists, and it is simple to draw a bar chart using GRAL. In this video, we will use GRAL to plot one.

Plotting a Bar Chart

02:13

Box plots are another effective visualization tool for data scientists. They give important descriptive statistics of a data distribution. In this video, we will explore drawing box plots for data distributions.

Plotting Box Plots or Whisker Diagrams

02:28

Scatter plots use both the x and y axes to plot data points and are a good means to demonstrate the correlation between variables. This video will demonstrate how to use GRAL to draw scatter plots for 100,000 random data points.

Plotting Scatter Plots

01:50

Donut plots, a variant of the pie chart, are a popular data visualization technique for showing proportions in your data. Let’s learn to plot donut plots for 10 random variables.

Plotting Donut Plots

02:56

Area graphs are useful tools for displaying how quantitative values develop over a given interval. In this video, we will use the GRAL Java library to plot area graphs.

Plotting Area Graphs

04:56

About the Instructor