If you are looking to build data science models that are ready for production, Java has come to the rescue. This video course provides modern solutions to your common and not-so-common data science problems. We start with solutions to help you obtain, clean, index, and search data. Then you will learn a variety of techniques for analyzing data. By the end of this course, you will be able to perform the advanced operations needed to analyze complex data and to index and search it.
About The Author
Rushdi Shams has a PhD in the application of machine learning to Natural Language Processing (NLP) problems from Western University, Canada. Before starting work in industry as a machine learning and NLP specialist, he taught undergraduate and graduate courses. He maintains a YouTube channel, Learn with Rushdi, dedicated to learning computer technologies.
In this video, we take a look at how to retrieve the file paths and names from a complex directory structure that contains numerous directories and files inside a root directory.
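The recursive approach can be sketched with nothing but the standard library; the `data` directory in `main` is a placeholder, assuming a root directory of that name exists:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FileLister {
    // Recursively collect the paths of all files under the given root directory.
    public static List<String> listFiles(File root) {
        List<String> paths = new ArrayList<>();
        File[] entries = root.listFiles();
        if (entries == null) {
            return paths; // root is not a directory or cannot be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                paths.addAll(listFiles(entry)); // descend into the subdirectory
            } else {
                paths.add(entry.getPath());
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        for (String path : listFiles(new File("data"))) {
            System.out.println(path);
        }
    }
}
```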
Listing the file names in a hierarchical directory structure can be done recursively, as demonstrated in the previous video. However, it can be done in a much easier and more convenient way, with less code, using the Apache Commons IO library.
There are different ways to read a text file's contents. This video demonstrates how to read them all at once using Java 8.
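A minimal sketch of the idea, reading a whole file in one call with Java 8's `java.nio.file.Files` (the path in `main` is a placeholder):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadAllAtOnce {
    public static String readWholeFile(String path) throws IOException {
        // Files.readAllBytes loads the entire file into memory in one call
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readWholeFile("data/input.txt"));
    }
}
```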
Another way to read a text file's contents all at once is to use Apache Commons IO. Let's see how the functionality described in the previous video can be achieved using the Apache Commons IO API.
PDF is one of the most difficult file types to extract data from. Some PDFs cannot be parsed at all because they are password-protected. This video demonstrates how to extract text from PDF files using Apache Tika.
ASCII text files can contain unwanted characters that are introduced during a conversion process. In this video, we remove several kinds of noise from ASCII text data using regular expressions.
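A small sketch of regex-based cleaning; the specific noise patterns handled here (control characters and runs of whitespace) are illustrative assumptions, not necessarily the ones used in the video:

```java
public class TextCleaner {
    // Strip non-printable control characters (except newline and tab),
    // collapse runs of whitespace into single spaces, and trim the result.
    public static String clean(String text) {
        String noControls = text.replaceAll("[\\p{Cntrl}&&[^\\n\\t]]", "");
        String singleSpaced = noControls.replaceAll("\\s+", " ");
        return singleSpaced.trim();
    }

    public static void main(String[] args) {
        String noisy = "Hello,\u0000   world!\t\tThis  is \u0007noisy.";
        System.out.println(clean(noisy)); // prints "Hello, world! This is noisy."
    }
}
```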
In this video, we will see how to parse CSV files and handle the data points retrieved from them.
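The video's exact approach is not shown here; as a minimal illustration, simple CSV content (without quoted fields, which require a real CSV parser) can be split by hand:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SimpleCsvParser {
    // Parse simple CSV content (no quoted fields) into rows of string values.
    public static List<String[]> parse(String csv) {
        List<String[]> rows = new ArrayList<>();
        for (String line : csv.split("\r?\n")) {
            if (line.isEmpty()) continue;
            rows.add(line.split(",", -1)); // -1 keeps trailing empty fields
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "name,age\nalice,34\nbob,29";
        for (String[] row : parse(csv)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```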
There are several ways to parse contents of XML files. In this video, we are using JDOM for XML parsing.
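JDOM is an external library; as a rough stdlib analogue of the same idea, the DOM parser that ships with the JDK can pull elements out of an XML document (the sample XML here is made up):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XmlParseDemo {
    // Collect the text content of every <title> element in the given XML string.
    public static List<String> extractTitles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("title");
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            titles.add(nodes.item(i).getTextContent());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>Java</title></book>"
                   + "<book><title>Lucene</title></book></books>";
        System.out.println(extractTitles(xml)); // prints [Java, Lucene]
    }
}
```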
Just like XML, JSON is a lightweight, human-readable data interchange format. In this video, we write JSON files.
In this video, we will see how we can read or parse a JSON file. As our sample input file, we will be using the JSON file we created in the previous video.
One of the easiest and handiest ways to extract web data is to use an external Java library named jsoup. In this video, we use jsoup to extract web data.
A large amount of data can be found on the Web nowadays. This data may be structured, semi-structured, or even unstructured. This video uses several of the methods offered by jsoup to extract web data.
Data can be stored in database tables, too. In this video, we will read data from a table in a MySQL database.
Indexing is the first step toward searching data fast. Under the hood, Lucene uses an inverted full-text index. In this video, we will demonstrate how to index a large amount of data with Apache Lucene.
Now that we have indexed our data, we will be searching the data using Apache Lucene in this video.
We can generate summary statistics for data by using the SummaryStatistics class. This is similar to the DescriptiveStatistics class used in the preceding video. The major difference is that unlike the DescriptiveStatistics class, the SummaryStatistics class does not store data in memory.
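Java 8's own `DoubleSummaryStatistics` works the same streaming way - values are folded in one at a time rather than stored - and serves as a dependency-free sketch of the idea (the sample data is made up):

```java
import java.util.DoubleSummaryStatistics;

public class StreamingStats {
    // Fold values into the accumulator one at a time; the data is never stored.
    public static DoubleSummaryStatistics summarize(double[] data) {
        DoubleSummaryStatistics stats = new DoubleSummaryStatistics();
        for (double value : data) {
            stats.accept(value);
        }
        return stats;
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics stats = summarize(new double[]{1, 2, 3, 4, 5});
        System.out.println("n    = " + stats.getCount());   // 5
        System.out.println("mean = " + stats.getAverage()); // 3.0
        System.out.println("min  = " + stats.getMin());     // 1.0
        System.out.println("max  = " + stats.getMax());     // 5.0
    }
}
```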
In this video, we will be creating an AggregateSummaryStatistics instance to accumulate the overall statistics and SummaryStatistics for the sample data.
The Frequency class has methods to count the number of data instances in a bucket, to count the number of unique data instances, and so on. Let's explore how we compute a frequency distribution.
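A dependency-free sketch of the same idea, counting bucket frequencies with a `TreeMap` (the sample data is made up):

```java
import java.util.Map;
import java.util.TreeMap;

public class FrequencyDemo {
    // Count how many times each value occurs (the buckets of the distribution).
    public static Map<Integer, Integer> frequencies(int[] data) {
        Map<Integer, Integer> freq = new TreeMap<>();
        for (int value : data) {
            freq.merge(value, 1, Integer::sum); // increment this bucket's count
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> freq = frequencies(new int[]{1, 2, 2, 3, 3, 3});
        System.out.println("unique values: " + freq.size()); // 3
        freq.forEach((value, count) ->
                System.out.println(value + " occurs " + count + " time(s)"));
    }
}
```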
This video is quite different from the others in this section, as it deals with strings and counting word frequencies in a string. We will use both Apache Commons Math and Java 8 for this task.
This video does not use the Apache Commons Math library to count frequencies of words in a given string; rather, it uses core libraries and mechanisms introduced in Java 8.
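A minimal sketch of that Java 8 approach, splitting on non-word characters and counting with a grouping collector:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordFrequency {
    // Lowercase the text, split it into words, and count each word's occurrences.
    public static Map<String, Long> countWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                countWords("the quick brown fox jumps over the lazy dog");
        System.out.println(counts.get("the")); // prints 2
    }
}
```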
Unbiased covariances are given by the formula cov(X, Y) = sum [(xi - E(X))(yi - E(Y))] / (n - 1) and Pearson's correlation computes correlations defined by the formula cor(X, Y) = sum[(xi - E(X))(yi - E(Y))] / [(n - 1)s(X)s(Y)], where E(X) is the mean of X and E(Y) is the mean of the Y values. Non-bias-corrected estimates use n in place of n - 1. Let’s see how we calculate them in our code.
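These formulas can be written out directly in plain Java; this sketch mirrors the definitions above rather than any particular library API:

```java
public class CovarianceCorrelation {
    // Unbiased sample covariance: sum[(xi - E(X))(yi - E(Y))] / (n - 1)
    public static double covariance(double[] x, double[] y) {
        double meanX = mean(x), meanY = mean(y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (x.length - 1);
    }

    // Pearson's correlation: cov(X, Y) / (s(X) * s(Y))
    public static double correlation(double[] x, double[] y) {
        return covariance(x, y) / (stdDev(x) * stdDev(y));
    }

    private static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) sum += d;
        return sum / v.length;
    }

    private static double stdDev(double[] v) {
        return Math.sqrt(covariance(v, v)); // cov(X, X) is the sample variance
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 4, 6, 8, 10}; // perfectly linear in x
        System.out.println(covariance(x, y));  // prints 5.0
        System.out.println(correlation(x, y)); // 1.0 (up to rounding)
    }
}
```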
The SimpleRegression class supports ordinary least squares regression with one independent variable: y = intercept + slope * x, where the intercept is an optional parameter. In this video, the data points are added one at a time.
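What `SimpleRegression` fits can be sketched by hand from the least-squares formulas (the data here is made up to lie exactly on y = 1 + 2x):

```java
public class SimpleRegressionSketch {
    // Ordinary least squares with one independent variable.
    // Returns {slope, intercept} for y = intercept + slope * x.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        // slope = sum[(xi - meanX)(yi - meanY)] / sum[(xi - meanX)^2]
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[]{slope, meanY - slope * meanX};
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9}; // lies exactly on y = 1 + 2x
        double[] model = fit(x, y);
        System.out.println("slope     = " + model[0]); // 2.0
        System.out.println("intercept = " + model[1]); // 1.0
    }
}
```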
The OLSMultipleLinearRegression class provides Ordinary Least Squares regression to fit the linear model Y = X*b + u. Here, Y is an n-vector (the regressand), X is an [n, k] matrix whose k columns are called regressors, b is a k-vector of regression parameters, and u is an n-vector of error terms or residuals. Let's see how we compute ordinary least squares regression in this video.
In this video, we will see another variant of least squares regression named generalized least squares regression. GLSMultipleLinearRegression implements Generalized Least Squares to fit the linear model Y=X*b+u.
Apache Commons Math supports both one-sample and two-sample t-tests. Two-sample tests can be either paired or unpaired, and the unpaired two-sample tests can be conducted with or without the assumption that the subpopulation variances are equal. We demonstrate the paired t-test in this video.
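The paired t statistic itself is simple to compute by hand - it is the one-sample t statistic applied to the per-pair differences. This sketch (with made-up data) computes only the statistic; the library also derives p-values, which are omitted here:

```java
public class PairedTTest {
    // t = mean(d) / (sd(d) / sqrt(n)), where d are the per-pair differences
    public static double pairedT(double[] before, double[] after) {
        int n = before.length;
        double[] d = new double[n];
        double meanD = 0.0;
        for (int i = 0; i < n; i++) {
            d[i] = after[i] - before[i];
            meanD += d[i];
        }
        meanD /= n;
        double ss = 0.0;
        for (double di : d) {
            ss += (di - meanD) * (di - meanD);
        }
        double sd = Math.sqrt(ss / (n - 1)); // unbiased sample std deviation
        return meanD / (sd / Math.sqrt(n));
    }

    public static void main(String[] args) {
        double[] before = {80, 85, 78, 90, 88};
        double[] after  = {83, 86, 81, 91, 90}; // differences: 3, 1, 3, 1, 2
        System.out.println(pairedT(before, after)); // t = 2 * sqrt(5), about 4.47
    }
}
```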
To conduct a chi-square test on two data distributions, one distribution is designated the observed distribution and the other the expected distribution.
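The chi-square statistic behind the test can be sketched directly from its definition (the die-roll data is made up):

```java
public class ChiSquareSketch {
    // chi^2 = sum over buckets of (observed - expected)^2 / expected
    public static double chiSquare(double[] expected, long[] observed) {
        double stat = 0.0;
        for (int i = 0; i < expected.length; i++) {
            double diff = observed[i] - expected[i];
            stat += diff * diff / expected[i];
        }
        return stat;
    }

    public static void main(String[] args) {
        double[] expected = {25, 25, 25, 25}; // fair four-sided die, 100 rolls
        long[] observed   = {30, 20, 28, 22};
        System.out.println(chiSquare(expected, observed)); // about 2.72
    }
}
```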
ANOVA stands for Analysis of Variance. In this video, we will see how to use Java to perform a one-way ANOVA test to determine whether the means of three or more independent and unrelated sets of data points are significantly different.
The Kolmogorov-Smirnov test (or simply KS test) is a test of equality for one-dimensional probability distributions that are continuous in nature. It is one of the popular methods to determine whether two sets of data points differ significantly.
Packt has been committed to developer learning since 2004. A lot has changed in software since then - but Packt has remained responsive to these changes, continuing to look forward at the trends and tools defining the way we work and live. And how to put them to work.
With an extensive library of content - more than 4000 books and video courses - Packt's mission is to help developers stay relevant in a rapidly changing world. From new web frameworks and programming languages to cutting-edge data analytics and DevOps, Packt takes software professionals in every field to what's important to them now.
From skills that will help you to develop and future-proof your career to immediate solutions to everyday tech challenges, Packt is a go-to resource to make you a better, smarter developer.
Packt Udemy courses continue this tradition, bringing you comprehensive yet concise video courses straight from the experts.