Julia is a high-performance dynamic programming language for numerical computing. This practical guide to programming with Julia will help you to work with data more efficiently.
This course begins with the important features of Julia to help you quickly refresh your knowledge of functions, modules, and arrays. We’ll explore utilizing the Julia language to identify, retrieve, and transform datasets so you can perform efficient data analysis and data manipulation.
You will then learn the concepts of metaprogramming and statistics in Julia.
Moving on, you will learn to build data science models by using several algorithms such as dimensionality reduction, linear discriminant analysis, and so on.
You’ll learn to optimize data science programs with parallel computing and memory allocation. You’ll get familiar with the concepts of package development and networking to solve numerical problems using the Julia platform.
This course includes sections on identifying and classifying data science problems, data modelling, data analysis, data manipulation, multidimensional arrays, and parallel computing.
By the end of this course, you will acquire the skills to work more effectively with your data.
What am I going to get from this course?
Extract and manage your data efficiently with Julia
Explore the metaprogramming concepts in Julia
Perform statistical analysis with StatsBase.jl and Distributions.jl
Build your data science models
Find out how to visualize your data with Gadfly
Explore big data concepts in Julia
What’s special about this course?
We've spent the last decade working to help developers stay relevant. The structure of this course is a result of deep and intensive research into what real-world developers need to know in order to be job-ready. We don't spend too long on theory, and focus on practical results so that you can see for yourself how things work in action.
We have combined the best of the following Packt products:
Meet your expert instructors:
Jalem Raj Rohit is an IIT Jodhpur graduate with a keen interest in machine learning, data science, data analysis, computational statistics, and natural language processing (NLP). Rohit currently works as a senior data scientist at Zomato and previously worked as the first data scientist at Kayako. He is part of the Julia project, where he develops data science models and contributes to the codebase.
Meet your managing editor:
This course has been planned and designed for you by me, Shiny Poojary. I'm here to help you be successful every step of the way, and get maximum value out of your course purchase. If you have any questions along the way, you can reach out to me and our author group via the instructor contact feature on Udemy.
In this section, we will explain ways in which you can handle files with the Comma-separated Values (CSV) file format.
In this section, we will explain how to handle Tab Separated Values (TSV) files.
The DataFrames package is needed to work with TSV files. It was already installed in the previous section, so we can move ahead and make sure that all the packages are up to date. The following video will show how to proceed:
In this section, we will explain ways to handle data stored in databases: MySQL and PostgreSQL.
In this section, you will learn how to interact with the Web through HTTP requests, both for getting data from the Web and for posting data to it. You will learn how to send requests to websites, receive their responses, and analyze those responses.
In this section, you will study the life of a Julia program and how it is actually represented and interpreted by Julia. You will also learn what is meant by "a language expressing its own code as a data structure of itself."
In this section, you will learn about symbols and expressions in detail. They are the syntactic building blocks of metaprogramming in Julia, so understanding them makes it easier to appreciate the concepts covered so far and those that follow.
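As a minimal sketch of these two building blocks, assuming plain Julia 1.x with no extra packages, the snippet below creates a symbol and builds an expression by hand. The names `total` and `ex` are illustrative only and do not come from the course material.

```julia
# A Symbol is a name treated as data, not the value bound to that name.
s = :total
println(typeof(s))               # Symbol
println(s === Symbol("total"))   # true

# An Expr stores code as a head plus a vector of arguments.
ex = Expr(:call, :+, 1, 2)       # the code `1 + 2`, held as data
println(ex.head)                 # call
println(length(ex.args))         # 3
```

The `head` field says what kind of code the expression is (here a function call), and `args` holds the function name followed by its arguments.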
The usage of a colon to represent expressions is known as quoting. The characters inside the parentheses after the colon constitute an Expression (Expr) object. In the following video, we will create an expression with a single argument:
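A small sketch of quoting, assuming plain Julia 1.x. The quoted call `sqrt(16)` is an illustrative choice and not necessarily the expression used in the video.

```julia
# Quoting: the colon keeps the code as data instead of running it.
single = :(sqrt(16))         # a call with a single argument
println(typeof(single))      # Expr
println(single.head)         # call

# Longer code can be quoted with a quote ... end block.
block = quote
    y = 2
    y * 10
end
println(block.head)          # block
```

Nothing is computed here: `sqrt` is never called and `y` is never assigned, because both pieces of code are only stored as Expr objects.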
Sometimes, constructing Expression objects is difficult, especially when you have multiple objects and/or variables. Interpolation is used for easy and readable expression construction.
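One way to keep expression construction readable is interpolation with `$`, which splices a value into a quoted expression. The following is a sketch assuming plain Julia 1.x; the variable names are hypothetical.

```julia
a = 3

# `$a` inserts the current value of a; `b` stays an unevaluated symbol.
ex = :($a + b)
println(ex)                  # 3 + b

# The same expression built by hand is noticeably harder to read.
manual = Expr(:call, :+, a, :b)
println(ex == manual)        # true
```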
The eval() function is simply used for executing or evaluating an Expression object. The evaluations and executions are done in the global scope.
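The sketch below, assuming plain Julia 1.x, evaluates a quoted expression and then illustrates the global-scope behavior. The function `try_local` and the variable `hidden` are hypothetical names used only for this illustration.

```julia
ex = :(1 + 2 * 3)
result = eval(ex)
println(result)              # 7

# eval works in global scope, so it cannot see a function's local variables.
function try_local()
    hidden = 10                   # local to this function
    return eval(:(hidden + 1))    # looks for a *global* named `hidden`
end

outcome = try
    try_local()
catch err
    err
end
println(outcome isa UndefVarError)   # true, as long as no global `hidden` exists
```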
In this video, you will be introduced to macros, which are used to insert generated code into programs. A macro takes expressions as input and returns a new expression that is compiled in place of the macro call, rather than following the conventional method of constructing expression statements and using the eval() function. The advantage of using macros is that a block of code that would otherwise have to be hardcoded multiple times can be generated on the fly by creating a macro for it.
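As an illustration, here is a minimal sketch of a macro that generates repeated code, assuming plain Julia 1.x. The macro name `@twice` and the counter `hits` are hypothetical and not taken from the course.

```julia
# Hypothetical macro: emit the given expression two times.
macro twice(ex)
    return quote
        $(esc(ex))    # esc makes the expression refer to the caller's variables
        $(esc(ex))
    end
end

hits = Ref(0)         # a mutable counter
@twice hits[] += 1    # expands to two copies of `hits[] += 1`
println(hits[])       # 2
```

The expansion happens when the code is parsed and compiled, so no call to `eval()` is needed at run time.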
In this section, you will learn how to apply the concepts of metaprogramming to DataFrames.
Metrics that help calculate the distance or similarity between two vectors are called deviation metrics. These metrics help us understand the relationship between the different vectors and the data in them.
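To make the idea concrete, the following sketch computes three common metrics using only Julia's standard library (Julia 1.x assumed). The helper names `euclidean_distance`, `cosine_distance`, and `mean_abs_deviation` are defined here for illustration and are not the API of any package used in the course.

```julia
using LinearAlgebra

# Hypothetical helpers written for this example.
euclidean_distance(a, b) = norm(a - b)
cosine_distance(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))
mean_abs_deviation(a, b) = sum(abs.(a - b)) / length(a)

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

println(euclidean_distance(a, b))    # 5.0
println(mean_abs_deviation(a, b))    # (3 + 4 + 0) / 3
println(cosine_distance(a, b))       # close to 0 when the vectors point the same way
```

Smaller values mean the two vectors are more alike; a vector compared with itself gives a distance of zero, up to floating-point rounding.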
Linear discriminant analysis is an algorithm used for classification tasks. It finds the linear combination of the input features that best separates the observations into classes. In this case, there are two classes; however, multi-class classification can also be done through a variant known as multi-class linear discriminant analysis.
Data preprocessing is one of the most important parts of an analytics or a data science pipeline. It involves methods and techniques to sanitize the data being used, quick hacks for making the dataset easy to handle, and the elimination of unnecessary data to make it lightweight and efficient when used in the analytics process. For this recipe, we will use the MLBase package of Julia, which is known as the Swiss Army Knife of writing machine learning code. Installation and setup instructions for the library will be explained in the Getting ready section.
Linear Regression is a linear model that is used to determine and predict numerical values. Linear regression is one of the most basic and important starting points in understanding linear models and predictive analytics. For this video, we will use Julia's
Classification is one of the core concepts of data science and attempts to classify data into different classes or groups. A simple example of classification can be trying to classify a particular population of people as male and female, depending on the data provided. In this recipe, we will learn to perform score-based classification, where each class is assigned a score, and the class with the lowest or the highest score is selected depending on the problem and the analyst's choice.
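The selection step of score-based classification can be sketched in plain Julia 1.x as follows. The helper `pick_class`, its keyword `highest_wins`, and the sample labels and scores are all hypothetical, and this is not the package API used in the recipe.

```julia
# Hypothetical helper: return the label whose score is best.
function pick_class(labels, scores; highest_wins=true)
    best = highest_wins ? argmax(scores) : argmin(scores)
    return labels[best]
end

labels = ["cat", "dog", "bird"]
probabilities = [0.2, 0.7, 0.1]    # higher score is better
distances = [1.8, 0.4, 2.5]        # lower score is better

println(pick_class(labels, probabilities))                    # dog
println(pick_class(labels, distances; highest_wins=false))    # dog
```

Whether the highest or the lowest score wins depends on what the score measures: a probability favors the maximum, while a distance to a class center favors the minimum.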
Analysis of performance is very important for any analytics and machine learning processes. It also helps in model selection. There are several evaluation metrics that can be leveraged on ML models. The technique depends on the type of data problem being handled, the algorithms used in the process, and also the way the analyst wants to gauge the success of the predictions or the results of the analytics process.
Cross validation is one of the most underrated processes in the domain of data science and analytics. However, it is very popular among the practitioners of competitive data science. It is a model evaluation method that gives the analyst an idea of how well the model would perform on new data that it has not yet seen. It is also extensively used to gauge and avoid the problem of overfitting, which occurs when an excessively precise fit on the training set leads to inaccurate or high-error predictions on the testing set.
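As a rough sketch of the idea, the code below runs k-fold cross validation using only Julia's standard library (Julia 1.x assumed). The helpers `kfold_indices` and `cross_validate` are hypothetical, and the "model" is deliberately trivial: it predicts the mean of the training data.

```julia
using Random, Statistics

# Hypothetical helper: split 1:n into k shuffled folds of nearly equal size.
function kfold_indices(n, k; rng=MersenneTwister(42))
    idx = shuffle(rng, 1:n)
    return [idx[i:k:n] for i in 1:k]
end

# Hold out each fold in turn, "train" on the rest, and score on the held-out fold.
function cross_validate(y, k)
    scores = Float64[]
    for test_idx in kfold_indices(length(y), k)
        train_idx = setdiff(1:length(y), test_idx)
        prediction = mean(y[train_idx])                         # toy model
        push!(scores, mean((y[test_idx] .- prediction) .^ 2))   # mean squared error
    end
    return scores
end

y = collect(1.0:10.0)
scores = cross_validate(y, 5)
println(length(scores))    # 5, one error estimate per fold
println(mean(scores))      # average held-out error
```

Because every observation is held out exactly once, the average score reflects performance on data the model did not see during fitting, which is what makes the method useful for spotting overfitting.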
A probability distribution assigns a probability to each outcome or subset of outcomes of a random experiment. So, every random experiment (and, in fact, the data of every data science experiment) follows a certain probability distribution. The type of distribution followed by the data is very important for initiating the analytics process, as well as for selecting the machine learning algorithms that are to be implemented. It should also be noted that, in a multivariate dataset, each variable might follow a separate distribution, so the variables in a dataset do not all have to follow similar distributions. In this video, we will work with a normal distribution and use the
Time series is another very important form of data. It is widely used in stock markets, market analysis, and signal processing. The data has a time dimension, which makes it look like a signal. So, in most cases, signal analysis techniques and formulae such as autocorrelation and cross-correlation, which we have already dealt with in the previous chapters, are applicable to time series data. In this recipe, we will deal with methods for working with datasets in the time series format.
Arrays are one of the fundamental data structures used in data analysis to store various types of data. They are also a quick way to store columns or dimensions in data, for statistical analysis as well as exploratory analysis through plots and visualization. Arrays are also very easy to plot, as they are simple. When a visualization is being done with two columns of a dataset, it means that the two column values are taken in the form of separate arrays and then plotted against each other, which again makes arrays very important.
In data science and statistical modeling, there are several instances where an analyst needs to use several functions for both the transformation and the exploratory analysis steps. Gadfly can plot such functions in a very simple way, either as separate plots or stacked together in a single plot.
Exploratory data analytics is one of the most important processes in a data science workflow. It is simply a thorough exploration of the data to find any possible patterns that can be identified through basic statistics and the shape of the data. It is mostly done with the help of plots, as visual information is much easier to comprehend than complex statistical terms.
Line plots, as we have already seen in the preceding examples, are very effective when it comes to exploratory data analytics. They can be used both to understand correlations and look at data trends. So, by further making use of aesthetics, we can make them more interesting and informative.
Scatter plots are the most basic plots in exploratory analytics. They help the analyst get a rough idea of the data distribution and the relationship between the corresponding columns, which in turn helps identify some prominent patterns in the data.
Histograms are one of the best ways for visualizing and finding out the three main statistics of a dataset: the mean, median, and mode. Histograms also help analysts get a very clear understanding of the distribution of data. The ability to plot categorical data as well as numerical data is what makes the histogram unique.
As we have already gone through how to plot the most important visualizations and their customizations in the Gadfly library, we will also see how to customize them even further.
Parallel computing is a way of carrying out several computations on data at the same time. This can be done by connecting multiple computers as a cluster and using their CPUs to carry out the computations.
In parallel computing, data movements between processes are quite common, and they should be minimized because of the time and network overhead they incur.
In this video, you will learn a bit about the famous Map-Reduce framework and why it is one of the most important ideas in the domains of big data and parallel computing.
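The core pattern can be sketched with Julia's built-in `mapreduce` function (plain Julia 1.x, single process). The sample documents and the helper `count_words` are hypothetical; a real big data job would distribute the map step across workers.

```julia
# Simplest form: map each number to its square, then reduce with +.
println(mapreduce(x -> x^2, +, 1:4))    # 30

# Word count: map each document to its own counts, then merge the counts.
function count_words(doc)
    counts = Dict{String,Int}()
    for w in split(doc)
        word = String(w)
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

docs = ["julia is fast", "julia is dynamic", "data is everywhere"]
total = mapreduce(count_words, (a, b) -> merge(+, a, b), docs)
println(total["julia"])    # 2
println(total["is"])       # 3
```

The map step works on each document independently, which is why it parallelizes well, and the reduce step only has to combine small partial results.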
Channels are like background plumbing for parallel computing in Julia. They are the reservoirs from which the individual processes access their data.
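A minimal sketch of this plumbing, assuming plain Julia 1.x: one task puts work into a channel and another task takes it out. The channel names `jobs` and `results` are illustrative. This example uses lightweight tasks inside a single process; for separate worker processes, the Distributed standard library provides `RemoteChannel`.

```julia
jobs = Channel{Int}(4)       # buffered channel holding up to 4 items
results = Channel{Int}(4)

# Producer task: fill the jobs channel, then close it.
@async begin
    for n in 1:4
        put!(jobs, n)
    end
    close(jobs)
end

# Consumer task: take each job, square it, and publish the result.
@async begin
    for n in jobs            # iteration ends once jobs is closed and drained
        put!(results, n^2)
    end
    close(results)
end

collected = collect(results)    # waits until results is closed
println(collected)              # [1, 4, 9, 16]
```

Neither task needs to know about the other; each only reads from or writes to a channel, which is what makes channels a convenient way to pass data between concurrent pieces of work.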
Packt has been committed to developer learning since 2004. A lot has changed in software since then, but Packt has remained responsive to these changes, continuing to look forward at the trends and tools defining the way we work and live, and at how to put them to work.
With an extensive library of content (more than 4000 books and video courses), Packt's mission is to help developers stay relevant in a rapidly changing world. From new web frameworks and programming languages to cutting-edge data analytics and DevOps, Packt takes software professionals in every field to what's important to them now.
From skills that will help you to develop and future-proof your career to immediate solutions to everyday tech challenges, Packt is a go-to resource to make you a better, smarter developer.
Packt Udemy courses continue this tradition, bringing you comprehensive yet concise video courses straight from the experts.