Julia: Performing Statistical Computations
- 1.5 hours on-demand video
- 14 articles
- Get familiar with the key concepts in Julia
- Follow a comprehensive approach to learning Julia programming
- Get extensive coverage of Julia’s packages for statistical analysis
- Sharpen your skills to work more effectively with your data
- You need one of the following operating systems: Linux, Windows, or OS X
- There are no specific hardware requirements; you only need a desktop or, preferably, a laptop on which to run your code
Julia is a high-performance dynamic programming language for numerical computing. This practical guide to programming with Julia will help you to work with data more efficiently.
This course begins with the important features of Julia to help you quickly refresh your knowledge of functions, modules, and arrays. We’ll explore utilizing the Julia language to identify, retrieve, and transform datasets so you can perform efficient data analysis and data manipulation.
You will then learn the concepts of metaprogramming and statistics in Julia.
Moving on, you will learn to build data science models using techniques such as dimensionality reduction and linear discriminant analysis.
You’ll learn to optimize data science programs with parallel computing and memory allocation. You’ll get familiar with the concepts of package development and networking to solve numerical problems using the Julia platform.
This course includes sections on identifying and classifying data science problems, data modelling, data analysis, data manipulation, multidimensional arrays, and parallel computing.
By the end of this course, you will acquire the skills to work more effectively with your data.
What am I going to get from this course?
- Extract and manage your data efficiently with Julia
- Explore the metaprogramming concepts in Julia
- Perform statistical analysis with StatsBase.jl and Distributions.jl
- Build your data science models
- Find out how to visualize your data with Gadfly
- Explore big data concepts in Julia
What’s special about this course?
We've spent the last decade working to help developers stay relevant. The structure of this course is a result of deep and intensive research into what real-world developers need to know in order to be job-ready. We don't spend too long on theory, and focus on practical results so that you can see for yourself how things work in action.
We have combined the best of the following Packt products:
- Julia Cookbook by Jalem Raj Rohit
- Julia Solutions by Jalem Raj Rohit
Meet your expert instructors:
Jalem Raj Rohit is an IIT Jodhpur graduate with a keen interest in machine learning, data science, data analysis, computational statistics, and natural language processing (NLP). Rohit currently works as a senior data scientist at Zomato and previously worked as the first data scientist at Kayako. He is part of the Julia project, where he develops data science models and contributes to the codebase.
Meet your managing editor:
This course has been planned and designed for you by me, Shiny Poojary. I'm here to help you be successful every step of the way, and get maximum value out of your course purchase. If you have any questions along the way, you can reach out to me and our author group via the instructor contact feature on Udemy.
- This course is for Julia programmers who want to learn data science, from exploratory analytics through to visualization.
- Anyone who wants to work more effectively with data
In this section, we will explain ways in which you can handle files with the Comma-separated Values (CSV) file format.
In this section, we will explain how to handle Tab Separated Values (TSV) files.
The DataFrames package is needed to work with TSV files. Since it was already installed in the previous section, we can move ahead and make sure that all the packages are up to date. The accompanying video shows how to proceed.
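As a rough illustration of what reading a delimited file involves, the sketch below parses small in-memory CSV and TSV strings using only Base Julia. It is not the DataFrames-based approach the course uses; the `parse_delimited` helper and the sample data are hypothetical, and the sketch assumes there are no quoted fields or embedded delimiters.

```julia
# Hypothetical helper: split delimited text into a header and a list of rows.
# Assumes plain fields only (no quoting, no delimiters inside values).
function parse_delimited(text::AbstractString, delim::Char)
    lines = filter(!isempty, split(text, '\n'))
    header = String.(split(lines[1], delim))
    rows = [String.(split(line, delim)) for line in lines[2:end]]
    return header, rows
end

# Tab-separated sample (made-up data)
tsv = "name\tage\nAda\t36\nAlan\t41\n"
header, rows = parse_delimited(tsv, '\t')
ages = [parse(Int, row[2]) for row in rows]   # convert the second column to integers

# The same helper handles comma-separated text by changing the delimiter
csv = "name,age\nAda,36\n"
csv_header, csv_rows = parse_delimited(csv, ',')
```

The only difference between the two formats in this sketch is the delimiter character, which is why a single helper covers both.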
In this section, you will study the life of a Julia program and how it is actually represented and interpreted by Julia. You will also learn what is meant by "a language expressing its own code as a data structure of itself."
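To make the idea of code as data concrete, here is a minimal sketch using only Base Julia. It parses a piece of source text into an `Expr` object, inspects its parts, and evaluates it; the variable names are chosen for illustration only.

```julia
# Parse source text into an expression object (code represented as data)
ex = Meta.parse("1 + 2")

ex_type = typeof(ex)   # Expr
ex_head = ex.head      # :call, because this is a function call
ex_args = ex.args      # the function symbol followed by its arguments

result = eval(ex)      # evaluating the expression gives 3

# Expressions can also be built by hand, like any other data structure
ex2 = Expr(:call, :*, 4, 5)
result2 = eval(ex2)    # 20
```

Because an expression is an ordinary Julia object, a program can construct or modify it before it is evaluated.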
In this video, you will be introduced to macros, which are used to insert generated code into programs. A macro is a block of code that is expanded when the program is parsed, rather than being assembled as an expression and evaluated at run time with the eval() function. The advantage of macros is that code which would otherwise have to be hardcoded multiple times can be generated on the fly.
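A minimal sketch of the contrast described above, using only Base Julia. The `@twice` macro and the `counter` variable are hypothetical names introduced for illustration.

```julia
# Hypothetical macro: emit the given expression twice at the call site.
# esc() makes the expression refer to the caller's variables.
macro twice(ex)
    return quote
        $(esc(ex))
        $(esc(ex))
    end
end

counter = Ref(0)

# Macro route: the two increments are generated when the code is expanded
@twice counter[] += 1

# eval route: build the expression by hand and evaluate it at run time
increment = :(counter[] += 1)
eval(increment)
```

After both routes have run, `counter[]` holds 3: two increments from the macro and one from `eval`.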
Linear discriminant analysis is an algorithm used for classification tasks. It finds the linear combination of the input features that best separates the observations into classes. In this case, there are two classes; however, the same approach extends to more than two classes, in which case it is called multi-class linear discriminant analysis.
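The two-class case can be sketched with the standard library alone. The example below is an illustrative Fisher-style discriminant on made-up data, not the package-based approach shown in the course; `lda_classify` is a hypothetical helper, and the midpoint threshold assumes both classes are equally likely.

```julia
using Statistics, LinearAlgebra

# Made-up training data: rows are observations, columns are features
X1 = [1.0 2.0; 2.0 3.0; 3.0 3.0; 2.0 1.0]   # class 1
X2 = [6.0 5.0; 7.0 8.0; 8.0 7.0; 7.0 6.0]   # class 2

m1 = vec(mean(X1, dims=1))                   # class means
m2 = vec(mean(X2, dims=1))

# Within-class scatter: sum of the two (unnormalised) covariance matrices
Sw = (size(X1, 1) - 1) * cov(X1) + (size(X2, 1) - 1) * cov(X2)

# Direction of the linear combination that best separates the classes
w = Sw \ (m2 - m1)

# Threshold halfway between the projected class means (equal-prior assumption)
threshold = dot(w, (m1 + m2) / 2)

# Hypothetical helper: project an observation onto w and compare with the threshold
lda_classify(x) = dot(w, x) > threshold ? 2 : 1

predicted = lda_classify([6.5, 6.0])
```

Each observation is reduced to a single number, its projection onto `w`, and the class is decided by which side of the threshold that number falls on.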
Data preprocessing is one of the most important parts of an analytics or a data science pipeline. It involves methods and techniques to sanitize the data being used, quick hacks for making the dataset easy to handle, and the elimination of unnecessary data to make it lightweight and efficient when used in the analytics process. For this recipe, we will use the MLBase package of Julia, which is known as the Swiss Army Knife of writing machine learning code. Installation and setup instructions for the library will be explained in the Getting ready section.
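To give a flavour of the kind of steps involved, the sketch below cleans, rescales, and label-encodes some made-up data using only Base Julia and the Statistics standard library. It does not use MLBase; the variable names and the use of NaN to mark a missing reading are assumptions made for illustration.

```julia
using Statistics

# Made-up measurements; NaN marks a missing reading (an assumed convention)
raw = [12.0, 15.0, NaN, 9.0, 14.0]

clean = filter(!isnan, raw)                        # drop unusable entries
zscores = (clean .- mean(clean)) ./ std(clean)     # rescale to mean 0 and standard deviation 1

# Turn text labels into integer codes, in order of first appearance
labels = ["cat", "dog", "cat", "bird"]
classes = unique(labels)
lookup = Dict(c => i for (i, c) in enumerate(classes))
encoded = [lookup[l] for l in labels]
```

Integer-encoded labels and standardized features are common inputs for the classification and evaluation steps that follow.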
Classification is one of the core concepts of data science; it attempts to assign data to different classes or groups. A simple example is classifying a population of people as male or female, depending on the data provided. In this recipe, we will learn to perform score-based classification, where each class is assigned a score, and the class with the lowest or the highest score is selected depending on the problem and the analyst's choice.
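Score-based selection can be sketched in a few lines of Base Julia. The score matrix below is made up, and the layout (one row per class, one column per observation) is an assumption for this example.

```julia
# Made-up scores: rows are classes 1 to 3, columns are observations
scores = [0.2 0.7 0.1;
          0.5 0.1 0.3;
          0.3 0.2 0.6]

n_obs = size(scores, 2)

# Pick the class with the highest score (for example, when scores are probabilities)
highest = [argmax(scores[:, j]) for j in 1:n_obs]

# Pick the class with the lowest score (for example, when scores are distances or costs)
lowest = [argmin(scores[:, j]) for j in 1:n_obs]
```

For this matrix, `highest` is `[2, 1, 3]` and `lowest` is `[1, 2, 1]`, which shows how the same scores lead to different labels depending on the chosen convention.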
Performance analysis is very important for any analytics or machine learning process. It also helps in model selection. There are several evaluation metrics that can be applied to ML models. The choice of metric depends on the type of data problem being handled, the algorithms used in the process, and the way the analyst wants to gauge the success of the predictions or the results of the analytics process.
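As an illustration of a few common metrics for a binary classifier, the sketch below computes accuracy, precision, and recall from made-up label vectors using only Base Julia. The variable names are hypothetical.

```julia
# Made-up ground truth and predictions (1 = positive class, 0 = negative class)
truth = [1, 0, 1, 1, 0, 1, 0, 0]
pred  = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts
tp = count((truth .== 1) .& (pred .== 1))   # true positives
fp = count((truth .== 0) .& (pred .== 1))   # false positives
fn = count((truth .== 1) .& (pred .== 0))   # false negatives
tn = count((truth .== 0) .& (pred .== 0))   # true negatives

accuracy = (tp + tn) / length(truth)    # share of all predictions that are correct
precision_score = tp / (tp + fp)        # share of predicted positives that are correct
recall_score = tp / (tp + fn)           # share of actual positives that were found
```

Which of these matters most depends on the problem: recall is usually favoured when missing a positive is costly, precision when false alarms are costly.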
Cross validation is one of the most underrated processes in the domain of data science and analytics, although it is very popular among practitioners of competitive data science. It is a model evaluation method that gives the analyst an idea of how well the model would perform on new data that it has not yet seen. It is also used extensively to gauge and avoid overfitting, which occurs when an excessively precise fit to the training set leads to inaccurate or high-error predictions on the testing set.
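The mechanics of k-fold cross validation can be sketched with Base Julia and the Statistics standard library. The `kfold_indices` helper is hypothetical, the data are made up, and the "model" is deliberately trivial (it predicts the mean of the training fold) so that the splitting logic stays in focus.

```julia
using Statistics

# Hypothetical helper: split indices 1:n into k folds and
# return a (train, test) pair of index vectors for each fold.
function kfold_indices(n::Int, k::Int)
    folds = [collect(i:k:n) for i in 1:k]
    return [(setdiff(1:n, test), test) for test in folds]
end

y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]      # made-up target values
splits = kfold_indices(length(y), 3)

# Fit on the training indices, score on the held-out indices
fold_errors = [mean((y[test] .- mean(y[train])) .^ 2) for (train, test) in splits]

# Average held-out error: an estimate of performance on unseen data
cv_error = mean(fold_errors)
```

Every observation is held out exactly once, so the averaged error never rewards the model for memorising data it was trained on.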
A probability distribution assigns a probability to each outcome, or subset of outcomes, of a random experiment. Every random experiment (and, in fact, the data of every data science experiment) follows some probability distribution. The type of distribution the data follows is important both for starting the analytics process and for selecting the machine learning algorithms to implement. Note that, in a multivariate dataset, each variable might follow a different distribution, so the variables in a dataset need not all follow similar distributions. In this video, we will work with a normal distribution and use the
Time series is another very important form of data. It is widely used in stock markets, market analysis, and signal processing. The data has a time dimension, which makes it look like a signal. So, in most cases, signal analysis techniques and formulae, such as autocorrelation and cross-correlation, which we have already dealt with in the previous chapters, are applicable to time series data. In this recipe, we will deal with methods for working with datasets in time series format.
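As a reminder of what autocorrelation measures, here is a minimal sketch using Base Julia and the Statistics standard library. The `autocorrelation` helper is hypothetical and hand-written for illustration (it is not a package function), and the signal is made up.

```julia
using Statistics

# Hypothetical helper: sample autocorrelation of x at a given lag,
# normalised so that lag 0 gives exactly 1.
function autocorrelation(x::AbstractVector{<:Real}, lag::Int)
    mu = mean(x)
    n = length(x)
    num = sum((x[t] - mu) * (x[t + lag] - mu) for t in 1:(n - lag))
    den = sum((xi - mu)^2 for xi in x)
    return num / den
end

# Made-up periodic signal that repeats every 4 time steps
signal = repeat([1.0, 0.0, -1.0, 0.0], 5)

r0 = autocorrelation(signal, 0)   # a series is perfectly correlated with itself
r2 = autocorrelation(signal, 2)   # half a period apart: negative correlation
r4 = autocorrelation(signal, 4)   # a full period apart: positive correlation
```

Peaks in the autocorrelation at a particular lag are a simple way to spot periodic behaviour in a series.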
Arrays are one of the fundamental data structures used in data analysis to store various types of data. They are also a quick way to store columns or dimensions in data, for statistical analysis as well as exploratory analysis through plots and visualization. Arrays are also very easy to plot, as they are simple. When a visualization is being done with two columns of a dataset, it means that the two column values are taken in the form of separate arrays and then plotted against each other, which again makes arrays very important.
In data science and statistical modeling, there are several instances where an analyst needs to use several functions for both the transformation and exploratory analysis steps. Gadfly makes it simple to plot individual functions as well as to stack several functions in a single plot.
Exploratory data analytics is one of the most important processes in a data science workflow. It is simply a thorough exploration of the data to find any possible patterns that can be identified through basic statistics and the shape of the data. It is mostly done with the help of plots, as visual information is much easier to comprehend than complex statistical terms.
Line plots, as we have already seen in the preceding examples, are very effective when it comes to exploratory data analytics. They can be used both to understand correlations and look at data trends. So, by further making use of aesthetics, we can make them more interesting and informative.
Histograms are one of the best ways of visualizing a dataset and estimating its three main statistics: the mean, median, and mode. Histograms also give analysts a very clear understanding of the distribution of the data. The ability to plot categorical data as well as numerical data is what makes the histogram unique.
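The quantities a histogram displays can be computed directly. The sketch below, which uses Base Julia and the Statistics standard library on made-up data, counts how often each value occurs (the bar heights when there is one bin per value) and derives the mean, median, and mode. It does not draw a plot.

```julia
using Statistics

data = [2, 3, 3, 5, 7, 3, 5, 8, 2, 3]    # made-up observations

# Frequency of each distinct value: the bar heights of the histogram
counts = Dict{Int,Int}()
for v in data
    counts[v] = get(counts, v, 0) + 1
end

data_mean = mean(data)
data_median = median(data)

# Mode: the value whose bar is tallest
tallest = maximum(values(counts))
data_mode = first(k for (k, c) in counts if c == tallest)
```

For this data the mean is 4.1, the median is 3.0, and the mode is 3. The mean sits above the other two because of the few larger values on the right, which is the kind of skew a histogram makes visible at a glance.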