Understanding The Internals Of The Unix Kernel Architecture
4.2 (12 ratings)
2,761 students enrolled
The Unix Operating System
Created by Satish Venkatesh
Last updated 8/2017
English
Includes:
  • 6.5 hours on-demand video
  • 20 Supplemental Resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Understand The Working Of Three Important Subsystems of Unix Kernel (O.S) - File Management System, Process Management System, Inter process Communication
  • Learn the algorithms related to different system calls in the Unix Operating System
Requirements
  • Basics of C programming
Description

Welcome to the course 'Understanding the Internals of Unix Kernel Architecture'

This course covers the three major subsystems of the Unix operating system:

The File Management Subsystem, which deals with the internal representation of files

The Process Management Subsystem, which covers the structure of a process and the various process control calls

Inter-Process Communication, which covers signals, pipes, message queues and shared memory

The algorithms behind the important system calls are explained throughout the course.

In this course you will learn how the Unix operating system works internally. Although there are quite a few differences between Linux and Unix, knowing the internals of Unix will help you understand how the Linux kernel works, or at least give you a starting point for studying it.

Each section comes with assignments. The answers are uploaded as a zip file. The assignments include the following:

Write a C program to implement your own malloc library function
Write a C program to implement your own free library function
Write a C program to implement your own realloc library function

Write a C program to implement your own ls (list) command
Write a C program to implement your own cp (copy) command
Write a C program to implement the stat command
Write a C program to implement your own tee command
Write a C program to implement your own size command
Write a C program to implement your own touch command
Write a C program to implement your own fopen, fread and fwrite calls

Write a C program to implement a sample state machine
Write a C program to implement your own ps command
Write a C program to implement your own sleep command
Write a C program to implement your own shell
Write a C program that demonstrates the functionality of daemons

Implement a client-server program using a FIFO
Write a C program to demonstrate pipes between a parent and a child process
Write a C program to demonstrate FIFOs
Implement a client-server program with message queues using a semaphore
Write a C program to demonstrate shared memory using semaphores

Please check the course overview and, if you are interested, take up the course.

Note: This course covers the internals of the Unix operating system. It does not deal with command-line usage of Unix/Linux. Sample code flows are mapped to the system call algorithms.


Who is the target audience?
  • Anybody who is interested in learning the Internals of the Unix Operating System
Curriculum For This Course
90 Lectures
06:32:50
Course Overview
1 Lecture 02:24
Overview of Unix Operating System
11 Lectures 45:44


Users perspective
06:11

Memory map of a c program
06:23

Process control
08:03

Building block primitives
01:51

Operating system services
02:19

Process execution modes
03:06

Interrupts
03:33

Exceptions
02:08

Unix Kernel Architecture Design
01:37

Exercise 1: Write a C program to implement your own malloc library function

Exercise 2: Write a C program to implement your own free library function

Exercise 3: Write a C program to implement your own realloc library function
Internal Representation Of Files
10 Lectures 01:10:22
Directories
08:18


file: /usr/src/kernels/2.6.23.1-42.fc8-i686/include/linux/fs.h

structure of super block

struct super_block {
        struct list_head        s_list;         /* Keep this first */
        dev_t                   s_dev;          /* search index; _not_ kdev_t */
        unsigned long           s_blocksize;
        unsigned char           s_blocksize_bits;
        unsigned char           s_dirt;
        unsigned long long      s_maxbytes;     /* Max file size */
        struct file_system_type *s_type;
        const struct super_operations   *s_op;
        struct dquot_operations *dq_op;
        struct quotactl_ops     *s_qcop;
        struct export_operations *s_export_op;
        unsigned long           s_flags;
        unsigned long           s_magic;
        struct dentry           *s_root;
        struct rw_semaphore     s_umount;
        struct mutex            s_lock;
        int                     s_count;
        int                     s_syncing;
        int                     s_need_sync_fs;
        atomic_t                s_active;
#ifdef CONFIG_SECURITY
        void                    *s_security;
#endif
        struct xattr_handler    **s_xattr;

        struct list_head        s_inodes;       /* all inodes */
        struct list_head        s_dirty;        /* dirty inodes */
        struct list_head        s_io;           /* parked for writeback */
        struct hlist_head       s_anon;         /* anonymous dentries for (nfs) exporting */
        struct list_head        s_files;

        struct block_device     *s_bdev;
        struct mtd_info         *s_mtd;
        struct list_head        s_instances;
        struct quota_info       s_dquot;        /* Diskquota specific options */

        int                     s_frozen;
        wait_queue_head_t       s_wait_unfrozen;

        char s_id[32];                          /* Informational name */

        void                    *s_fs_info;     /* Filesystem private info */

        /*
         * The next field is for VFS *only*. No filesystems have any business
         * even looking at it. You had been warned.
         */
        struct mutex s_vfs_rename_mutex;        /* Kludge */

        /* Granularity of c/m/atime in ns.
           Cannot be worse than a second */
        u32                s_time_gran;

        /*
         * Filesystem subtype.  If non-empty the filesystem type field
         * in /proc/mounts will be "type.subtype"
         */
        char *s_subtype;
};

Functions used to perform operations on the super_block:

struct super_operations {
        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*destroy_inode)(struct inode *);

        void (*read_inode) (struct inode *);

        void (*dirty_inode) (struct inode *);
        int (*write_inode) (struct inode *, int);
        void (*put_inode) (struct inode *);
        void (*drop_inode) (struct inode *);
        void (*delete_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        void (*write_super) (struct super_block *);
        int (*sync_fs)(struct super_block *sb, int wait);
        void (*write_super_lockfs) (struct super_block *);
        void (*unlockfs) (struct super_block *);
        int (*statfs) (struct dentry *, struct kstatfs *);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*clear_inode) (struct inode *);
        void (*umount_begin) (struct vfsmount *, int);

        int (*show_options)(struct seq_file *, struct vfsmount *);
        int (*show_stats)(struct seq_file *, struct vfsmount *);
#ifdef CONFIG_QUOTA
        ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
        ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
};
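The superblock cannot be read directly from user space, but the POSIX statvfs() call reports values that a filesystem derives from it; on Linux such a request is typically served through the statfs entry of super_operations. The following is a minimal sketch under that assumption. The helper name fs_summary is illustrative and not part of the course material.

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Hypothetical helper: report the block size and total block count of the
 * filesystem that holds `path`.
 * Returns 0 on success, -1 on failure (errno is set by statvfs). */
int fs_summary(const char *path, unsigned long *block_size,
               unsigned long long *total_blocks)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return -1;

    *block_size = (unsigned long)sv.f_bsize;
    *total_blocks = (unsigned long long)sv.f_blocks;
    printf("%s: block size %lu, %llu blocks\n",
           path, *block_size, *total_blocks);
    return 0;
}
```

Calling fs_summary("/", &bs, &nb) fills bs and nb for the root filesystem; the values printed depend on the machine.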


Super block
06:44

Inode list and Datablock list
03:47

inode contents:

/usr/src/kernels/2.6.23.1-42.fc8-i686/include/linux/fs.h

struct inode {
        struct hlist_node       i_hash;
        struct list_head        i_list;
        struct list_head        i_sb_list;
        struct list_head        i_dentry;
        unsigned long           i_ino;
        atomic_t                i_count;
        unsigned int            i_nlink;
        uid_t                   i_uid;
        gid_t                   i_gid;
        dev_t                   i_rdev;
        unsigned long           i_version;
        loff_t                  i_size;
#ifdef __NEED_I_SIZE_ORDERED
        seqcount_t              i_size_seqcount;
#endif
        struct timespec         i_atime;
        struct timespec         i_mtime;
        struct timespec         i_ctime;
        unsigned int            i_blkbits;
        blkcnt_t                i_blocks;
        unsigned short          i_bytes;
        umode_t                 i_mode;
        spinlock_t              i_lock; /* i_blocks, i_bytes, maybe i_size */
        struct mutex            i_mutex;
        struct rw_semaphore     i_alloc_sem;
        const struct inode_operations   *i_op;
        const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
        struct super_block      *i_sb;
        struct file_lock        *i_flock;

        struct address_space    *i_mapping;
        struct address_space    i_data;
#ifdef CONFIG_QUOTA
        struct dquot            *i_dquot[MAXQUOTAS];
#endif
        struct list_head        i_devices;
        union {
                struct pipe_inode_info  *i_pipe;
                struct block_device     *i_bdev;
                struct cdev             *i_cdev;
        };
        int                     i_cindex;

        __u32                   i_generation;

#ifdef CONFIG_DNOTIFY
        unsigned long           i_dnotify_mask; /* Directory notify events */
        struct dnotify_struct   *i_dnotify; /* for directory notifications */
#endif

#ifdef CONFIG_INOTIFY
        struct list_head        inotify_watches; /* watches on this inode */
        struct mutex            inotify_mutex;  /* protects the watches list */
#endif

        unsigned long           i_state;
        unsigned long           dirtied_when;   /* jiffies of first dirtying */

        unsigned int            i_flags;

        atomic_t                i_writecount;
#ifdef CONFIG_SECURITY
        void                    *i_security;
#endif

        void                    *i_private; /* fs or device private pointer */
};

Functions that perform operations on an inode:

struct inode_operations {
        int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,int);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,int,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *,int);
        void * (*follow_link) (struct dentry *, struct nameidata *);
        void (*put_link) (struct dentry *, struct nameidata *, void *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int, struct nameidata *);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
        void (*truncate_range)(struct inode *, loff_t, loff_t);
        long (*fallocate)(struct inode *inode, int mode, loff_t offset,
                          loff_t len);
};
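Several of the inode fields listed above (i_ino, i_nlink, i_size, i_mode) are visible from user space through stat(). The sketch below prints them for a given path; print_inode is a hypothetical helper name chosen for this example.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: print a few inode-derived attributes of `path`.
 * Returns 0 on success, -1 if stat() fails. */
int print_inode(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;

    printf("%s: inode=%llu links=%lu size=%lld mode=%o\n",
           path,
           (unsigned long long)st.st_ino,   /* corresponds to i_ino   */
           (unsigned long)st.st_nlink,      /* corresponds to i_nlink */
           (long long)st.st_size,           /* corresponds to i_size  */
           (unsigned)st.st_mode & 07777);   /* permission bits of i_mode */
    return 0;
}
```

The same information is what the stat exercise in this section asks you to display.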



Inodes
09:42

Direct and Indirect blocks part 1
07:32

Direct and Indirect blocks part 2
10:27

Directory Entry Structure

/usr/include/dirent.h

          struct dirent {
              ino_t          d_ino;       /* inode number */
              off_t          d_off;       /* offset to the next dirent */
              unsigned short d_reclen;    /* length of this record */
              unsigned char  d_type;      /* type of file */
              char           d_name[256]; /* filename */
          };
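A directory is read from user space one struct dirent at a time with opendir() and readdir(). The sketch below searches a directory for a name and lists its entries; dir_has_entry and list_dir are illustrative names, not functions from the course.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if directory `dir` contains an entry
 * called `name`, 0 if it does not, -1 if the directory cannot be opened. */
int dir_has_entry(const char *dir, const char *name)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    int found = 0;

    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, name) == 0) {
            found = 1;
            break;
        }
    }
    closedir(d);
    return found;
}

/* Hypothetical helper: print inode number and name of every entry.
 * Returns the number of entries printed, or -1 on error. */
int list_dir(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    int count = 0;

    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL) {
        printf("%llu\t%s\n", (unsigned long long)e->d_ino, e->d_name);
        count++;
    }
    closedir(d);
    return count;
}
```

list_dir is essentially the core loop of a very small ls.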

Directory structure
08:14

getname_flags() in namei.c is the starting point of the namei algorithm. It copies the filename from user space into the kernel and is called on the open system call path to validate the filename.

struct filename * getname_flags(const char __user *filename, int flags, int *empty)  {

        struct filename *result;
        char *kname;
        int len;

        result = audit_reusename(filename);
        if (result)
                return result;

        result = __getname();
        if (unlikely(!result))
                return ERR_PTR(-ENOMEM);

        /*
         * First, try to embed the struct filename inside the names_cache
         * allocation
         */
        kname = (char *)result->iname;
        result->name = kname;

        len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
        if (unlikely(len < 0)) {
                __putname(result);
                return ERR_PTR(len);
        }

        /*
         * Uh-oh. We have a name that's approaching PATH_MAX. Allocate a
         * separate struct filename so we can dedicate the entire
         * names_cache allocation for the pathname, and re-do the copy from
         * userland.
         */
        if (unlikely(len == EMBEDDED_NAME_MAX)) {
                const size_t size = offsetof(struct filename, iname[1]);
                kname = (char *)result;

                /*
                 * size is chosen this way to guarantee that
                 * result->iname[0] is within the same object and that
                 * kname can't be equal to result->iname, no matter what.
                 */
                result = kzalloc(size, GFP_KERNEL);
                if (unlikely(!result)) {
                        __putname(kname);
                        return ERR_PTR(-ENOMEM);
                }
                result->name = kname;
                len = strncpy_from_user(kname, filename, PATH_MAX);
                if (unlikely(len < 0)) {
                        __putname(kname);
                        kfree(result);
                        return ERR_PTR(len);
                }
                if (unlikely(len == PATH_MAX)) {
                        __putname(kname);
                        kfree(result);
                        return ERR_PTR(-ENAMETOOLONG);
                }
        }

        result->refcnt = 1;
        /* The empty path is special. */
        if (unlikely(!len)) {
                if (empty)
                        *empty = 1;
                if (!(flags & LOOKUP_EMPTY)) {
                        putname(result);
                        return ERR_PTR(-ENOENT);
                }
        }

        result->uptr = filename;
        result->aname = NULL;
        audit_getname(result);
        return result;
}
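Once the name is in kernel memory, the classic namei algorithm walks the path one component at a time, looking each component up in the current directory. The sketch below shows only the component-splitting step in user space. It is a simplified illustration, not the Linux implementation, and next_component and count_components are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical helper: copy the next component of *path into buf
 * (buflen must be at least 1) and advance *path past it.
 * Repeated slashes are skipped. Components longer than buflen - 1
 * characters are truncated in buf. Returns the number of characters
 * stored, or 0 when no component is left. */
size_t next_component(const char **path, char *buf, size_t buflen)
{
    const char *p = *path;
    size_t n = 0;

    while (*p == '/')                   /* skip separators */
        p++;
    while (*p != '\0' && *p != '/') {   /* copy one component */
        if (n + 1 < buflen)
            buf[n++] = *p;
        p++;
    }
    buf[n] = '\0';
    *path = p;
    return n;
}

/* Count the components of a path, e.g. "/usr/src/linux" has three. */
size_t count_components(const char *path)
{
    char name[256];
    size_t count = 0;

    while (next_component(&path, name, sizeof name) > 0)
        count++;
    return count;
}
```

A real namei would look up each component in the directory reached so far and follow the resulting inode before moving on to the next component.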

#########################################################


The Namei Algorithm
04:55

Free inodes and Remembered inodes
06:10
+
File System
13 Lectures 01:41:03
File permissions
04:34

Navigation
08:51

Tables
02:32

File Table and Inode Table
10:50

User Area
03:38

Process Table
01:37

file: open.c

function: do_sys_open

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) {

        struct open_flags op;
        int fd = build_open_flags(flags, mode, &op);
        struct filename *tmp;

        if (fd)
                return fd;

        tmp = getname(filename);
        if (IS_ERR(tmp))
                return PTR_ERR(tmp);

        fd = get_unused_fd_flags(flags);
        if (fd >= 0) {
                struct file *f = do_filp_open(dfd, tmp, &op);
                if (IS_ERR(f)) {
                        put_unused_fd(fd);
                        fd = PTR_ERR(f);
                } else {
                        fsnotify_open(f);
                        fd_install(fd, f);
                }
        }
        putname(tmp);
        return fd;
}
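The call to get_unused_fd_flags() above is what gives open() its documented behavior of returning the lowest-numbered descriptor not currently in use. This can be observed from user space: closing a descriptor and opening again hands the same number back, provided no other thread opens files in between. A small sketch, with reopen_reuses_fd as a hypothetical helper name:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `path` read-only, close it, and open it again.
 * Returns 1 if both opens produced the same descriptor number,
 * 0 if they differed, -1 if either open failed. */
int reopen_reuses_fd(const char *path)
{
    int first = open(path, O_RDONLY);
    int second;

    if (first < 0)
        return -1;
    close(first);                 /* the slot becomes free again */

    second = open(path, O_RDONLY);
    if (second < 0)
        return -1;
    close(second);

    return first == second;
}
```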

The Algorithm For Open System Call
15:22

The flow from the syscall instruction to the write system call:

syscall: write(int fd, const void *buf, size_t nbytes);

Assembly language code for invoking the system call:

On x86-64, the write() system call has system call number 1.
The system call number is passed in the %rax register.
The file descriptor parameter is passed in the %rdi register.
The address of the message to be written (the buf parameter) is passed in the %rsi register.
The number of bytes to be written (the nbytes parameter) is passed in the %rdx register.
Then the syscall instruction is executed.

The code below loads buf, nbytes and the system call number into the respective registers:

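        .section .rodata
msg:    .ascii "hello\n"        # sample message (assumed; the original does not show it)
        .set len, . - msg       # length of the message in bytes

        .section .text
        .globl _start
_start:
        movq  $1, %rax          # system call number 1 = write
        movq  $1, %rdi          # file descriptor 1 = standard output
        movq  $msg, %rsi        # address of the buffer
        movq  $len, %rdx        # number of bytes to write
        syscall                 # enter the kernel

        movq  $60, %rax         # system call number 60 = exit
        xorq  %rdi, %rdi        # exit status 0
        syscall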
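The same register setup can be reached from C through the syscall() wrapper provided by glibc on Linux, which takes the system call number followed by its arguments. A minimal sketch, assuming a Linux system; raw_write is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invoke write(2) by its system call number instead
 * of through the write() library wrapper. SYS_write is 1 on x86-64; other
 * architectures use other numbers. Returns the number of bytes written,
 * or -1 with errno set. */
long raw_write(int fd, const void *buf, size_t nbytes)
{
    return syscall(SYS_write, fd, buf, nbytes);
}
```

For example, raw_write(1, "hello\n", 6) writes the six bytes to standard output, just as the assembly sequence does.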

The kernel is entered through the 'syscall' instruction above.

The kernel learns the system call number from %rax.
It uses that number as an index into the system call table to find the handler.

arch/x86/entry/syscall_64.tbl

...
...
0       common  read                    sys_read
1       common  write                   sys_write
2       common  open                    sys_open
3       common  close                   sys_close
...
...

THE NEXT STEP:

The function that gets called next is shown below.

file: read_write.c

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
                size_t, count)
{
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;

        if (f.file) {
                loff_t pos = file_pos_read(f.file);
                ret = vfs_write(f.file, buf, count, &pos);
                if (ret >= 0)
                        file_pos_write(f.file, pos);
                fdput_pos(f);
        }

        return ret;
}

How SYSCALL_DEFINE3 boils down to a definition of sys_write():

The SYSCALL_DEFINE3 macro is defined in the include/linux/syscalls.h header file and expands to the definition of the sys_name(...) function.
Let's look at this macro:

#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)

#define SYSCALL_DEFINEx(x, sname, ...)                \
        SYSCALL_METADATA(sname, x, __VA_ARGS__)       \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

Consider only the __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) part.

Within it, sys##name is the definition of the syscall handler function with the given name, that is, sys_<system call name>. The __SC_DECL macro takes the __VA_ARGS__ and pairs each parameter's type with its name, because the macro definition itself cannot determine the parameter types. The __MAP macro applies __SC_DECL to each such pair in __VA_ARGS__. Expanding SYSCALL_DEFINE3 for write therefore gives:

asmlinkage long sys_write(unsigned int fd, const char __user * buf, size_t count);
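The core trick is preprocessor token pasting: ## glues a prefix onto the name, and the type and name arguments are placed next to each other to form the parameter list. The sketch below is a heavily simplified stand-in for the kernel macros, not the kernel's own code; MY_SYSCALL_DEFINE2 and my_sys_add are hypothetical names.

```c
/* Hypothetical, simplified cousin of SYSCALL_DEFINEx for exactly two
 * arguments: pastes "my_sys_" onto the name and pairs each type with
 * its parameter name. */
#define MY_SYSCALL_DEFINE2(name, t1, a1, t2, a2) \
        long my_sys_##name(t1 a1, t2 a2)

/* Expands to: long my_sys_add(int a, int b) { ... } */
MY_SYSCALL_DEFINE2(add, int, a, int, b)
{
        return (long)a + b;
}
```

After preprocessing, callers simply see an ordinary function named my_sys_add, in the same way that the kernel ends up with sys_write.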


The Algorithm For Write System Call
18:01

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;

        if (f.file) {
                loff_t pos = file_pos_read(f.file);
                ret = vfs_read(f.file, buf, count, &pos);
                if (ret >= 0)
                        file_pos_write(f.file, pos);
                fdput_pos(f);
        }
        return ret;
}
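The file_pos_read() and file_pos_write() pair above is why consecutive read() calls continue where the previous one stopped: the offset stored in the open file is advanced by the number of bytes transferred. A user-space sketch of that effect follows; offset_after_read and the temporary file name are assumptions made for this example.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: create a temporary file containing "abcdef",
 * read `n` bytes from its start, and return the resulting file offset.
 * Returns -1 on any error (including n larger than the file). */
long offset_after_read(size_t n)
{
    char path[] = "/tmp/readdemoXXXXXX";
    char buf[8];
    long pos = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* file vanishes when fd is closed */

    if (write(fd, "abcdef", 6) == 6 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        n <= sizeof buf &&
        read(fd, buf, n) == (ssize_t)n)
        pos = (long)lseek(fd, 0, SEEK_CUR);   /* query the current offset */

    close(fd);
    return pos;
}
```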

The Algorithm For Read System Call
06:14

SYSCALL_DEFINE1(close, unsigned int, fd)
{
        int retval = __close_fd(current->files, fd);

        /* can't restart close syscall because file table entry was cleared */
        if (unlikely(retval == -ERESTARTSYS ||
                     retval == -ERESTARTNOINTR ||
                     retval == -ERESTARTNOHAND ||
                     retval == -ERESTART_RESTARTBLOCK))
                retval = -EINTR;

        return retval;
}
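As the comment in the handler says, the file table entry is cleared by close(). A second close() on the same number therefore fails with EBADF, as long as nothing else has reused the descriptor in the meantime. A small sketch, with errno_of_second_close as a hypothetical helper name:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: close the same descriptor twice.
 * Returns the errno value of the second close() (EBADF is expected),
 * 0 if the second close() unexpectedly succeeded, -1 if setup failed. */
int errno_of_second_close(void)
{
    int fd = open("/", O_RDONLY);

    if (fd < 0)
        return -1;
    if (close(fd) != 0)       /* first close releases the descriptor */
        return -1;
    if (close(fd) == 0)       /* descriptor is no longer valid */
        return 0;
    return errno;
}
```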

The Algorithm For Close System Call
05:47

SYSCALL_DEFINE1(dup, unsigned int, fildes)
{
        int ret = -EBADF;
        struct file *file = fget_raw(fildes);

        if (file) {
                ret = get_unused_fd_flags(0);
                if (ret >= 0)
                        fd_install(ret, file);
                else
                        fput(file);
        }
        return ret;
}
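Note that the handler installs the same struct file under a new descriptor number. Both descriptors therefore share one file offset, which the sketch below makes visible from user space. The helper name offset_seen_by_original and the temporary file name are assumptions for this example.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write 4 bytes through a dup()ed descriptor and
 * return the file offset as seen through the original descriptor.
 * Returns -1 on error. */
long offset_seen_by_original(void)
{
    char path[] = "/tmp/dupdemoXXXXXX";
    long pos = -1;
    int copy;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* file vanishes when closed */

    copy = dup(fd);                     /* second number, same open file */
    if (copy >= 0) {
        if (write(copy, "data", 4) == 4)
            pos = (long)lseek(fd, 0, SEEK_CUR);
        close(copy);
    }
    close(fd);
    return pos;
}
```

The returned offset is 4 even though nothing was written through fd itself.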

=================================

sample source code for do_dup()

PUBLIC int do_dup()
{
/* Perform the dup(fd) or dup2(fd,fd2) system call. These system calls are
 * obsolete.  In fact, it is not even possible to invoke them using the
 * current library because the library routines call fcntl().  They are
 * provided to permit old binary programs to continue to run.
 */

  register int rfd;
  register struct filp *f;
  struct filp *dummy;
  int r;

  /* Is the file descriptor valid? */
  rfd = fd & ~DUP_MASK;         /* kill off dup2 bit, if on */
  if ((f = get_filp(rfd)) == NIL_FILP) return(err_code);

  /* Distinguish between dup and dup2. */
  if (fd == rfd) {                      /* bit not on */
        /* dup(fd) */
        if ( (r = get_fd(0, 0, &fd2, &dummy)) != OK) return(r);
  } else {
        /* dup2(fd, fd2) */
        if (fd2 < 0 || fd2 >= OPEN_MAX) return(EBADF);
        if (rfd == fd2) return(fd2);    /* ignore the call: dup2(x, x) */
        fd = fd2;               /* prepare to close fd2 */
        (void) do_close();      /* cannot fail */
  }

  /* Success. Set up new file descriptors. */
  f->filp_count++;
  fp->fp_filp[fd2] = f;
  return(fd2);
}


Dup System Call
10:48

Sample source code for the do_link() system call handler:

PUBLIC int do_link()
{
/* Perform the link(name1, name2) system call. */

  register struct inode *ip, *rip;
  register int r;
  char string[NAME_MAX];
  struct inode *new_ip;

  /* See if 'name' (file to be linked) exists. */
  if (fetch_name(name1, name1_length, M1) != OK) return(err_code);
  if ( (rip = eat_path(user_path)) == NIL_INODE) return(err_code);

  /* Check to see if the file has maximum number of links already. */
  r = OK;
  if ( (rip->i_nlinks & BYTE) >= LINK_MAX) r = EMLINK;

  /* Only super_user may link to directories. */
  if (r == OK)
        if ( (rip->i_mode & I_TYPE) == I_DIRECTORY && !super_user) r = EPERM;

  /* If error with 'name', return the inode. */
  if (r != OK) {
        put_inode(rip);
        return(r);
  }

  /* Does the final directory of 'name2' exist? */
  if (fetch_name(name2, name2_length, M1) != OK) {
        put_inode(rip);
        return(err_code);
  }
  if ( (ip = last_dir(user_path, string)) == NIL_INODE) r = err_code;

  /* If 'name2' exists in full (even if no space) set 'r' to error. */
  if (r == OK) {
        if ( (new_ip = advance(ip, string)) == NIL_INODE) {
                r = err_code;
                if (r == ENOENT) r = OK;
        } else {
                put_inode(new_ip);
                r = EEXIST;
        }
  }

  /* Check for links across devices. */
  if (r == OK)
        if (rip->i_dev != ip->i_dev) r = EXDEV;

  /* Try to link. */
  if (r == OK)
        r = search_dir(ip, string, &rip->i_num, ENTER);

  /* If success, register the linking. */
  if (r == OK) {
        rip->i_nlinks++;
        rip->i_update |= CTIME;
        rip->i_dirt = DIRTY;
  }

  /* Done.  Release both inodes. */
  put_inode(rip);
  put_inode(ip);
  return(r);
}
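The increment of the link count in the inode can be observed from user space with link() and fstat(). The sketch below assumes /tmp is on a filesystem that supports hard links; links_after_link and the file names are illustrative.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a temporary file, give it a second name
 * with link(), and return the link count reported by fstat().
 * Returns -1 on error. Both names are removed before returning. */
long links_after_link(void)
{
    char path[] = "/tmp/linkdemoXXXXXX";
    char alias[64];
    struct stat st;
    long n = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    snprintf(alias, sizeof alias, "%s.lnk", path);

    if (link(path, alias) == 0) {       /* second directory entry */
        if (fstat(fd, &st) == 0)
            n = (long)st.st_nlink;      /* expected to be 2 */
        unlink(alias);
    }
    unlink(path);
    close(fd);
    return n;
}
```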


The Algorithm For Link System Call
09:03

Sample source code for the do_unlink() system call handler:

PUBLIC int do_unlink()
{
/* Perform the unlink(name) or rmdir(name) system call. The code for these two
 * is almost the same.  They differ only in some condition testing.  Unlink()
 * may be used by the superuser to do dangerous things; rmdir() may not.
 */

  register struct inode *rip;
  struct inode *rldirp;
  int r;
  char string[NAME_MAX];

  /* Get the last directory in the path. */
  if (fetch_name(name, name_length, M3) != OK) return(err_code);
  if ( (rldirp = last_dir(user_path, string)) == NIL_INODE)
        return(err_code);

  /* The last directory exists.  Does the file also exist? */
  r = OK;
  if ( (rip = advance(rldirp, string)) == NIL_INODE) r = err_code;

  /* If error, return inode. */
  if (r != OK) {
        put_inode(rldirp);
        return(r);
  }

  /* Do not remove a mount point. */
  if (rip->i_num == ROOT_INODE) {
        put_inode(rldirp);
        put_inode(rip);
        return(EBUSY);
  }

  /* Now test if the call is allowed, separately for unlink() and rmdir(). */
  if (fs_call == UNLINK) {
        /* Only the su may unlink directories, but the su can unlink any dir.*/
        if ( (rip->i_mode & I_TYPE) == I_DIRECTORY && !super_user) r = EPERM;

        /* Don't unlink a file if it is the root of a mounted file system. */

       if (rip->i_num == ROOT_INODE) r = EBUSY;

        /* Actually try to unlink the file; fails if parent is mode 0 etc. */
        if (r == OK) r = unlink_file(rldirp, rip, string);

  } else {
        r = remove_dir(rldirp, rip, string); /* call is RMDIR */
  }

  /* If unlink was possible, it has been done, otherwise it has not. */
  put_inode(rip);
  put_inode(rldirp);
  return(r);
}
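Unlinking removes the directory entry and decrements the link count, but the inode and its data survive while some process still holds the file open. The following sketch shows this on a local filesystem; links_after_unlink and the file name are assumptions for this example.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: unlink a temporary file while it is still open,
 * check that its data is still readable through the descriptor, and
 * return the link count reported by fstat().
 * Returns -1 on any error. */
long links_after_unlink(void)
{
    char path[] = "/tmp/unlinkdemoXXXXXX";
    struct stat st;
    char byte = 0;
    long n = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;

    if (unlink(path) == 0 &&                 /* name is gone */
        write(fd, "x", 1) == 1 &&            /* file is still writable */
        pread(fd, &byte, 1, 0) == 1 &&       /* and readable */
        byte == 'x' &&
        fstat(fd, &st) == 0)
        n = (long)st.st_nlink;               /* expected to be 0 */

    close(fd);                               /* storage is released here */
    return n;
}
```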


The Algorithm For Unlink System Call
03:46

Exercise 1: Write a C program to implement your own ls (list) command

Exercise 2: Write a C program to implement your own cp (copy) command

Exercise 3: Write a C program to implement the stat command

Exercise 4: Write a C program to implement your own tee command

Exercise 5: Write a C program to implement your own size command

Exercise 6: Write a C program to implement your own touch command

Exercise 7: Write a C program to implement your own fopen, fread and fwrite calls
Structure Of Processes
12 Lectures 52:03
Process
00:52

Process states and transitions
03:17

The kernel stores the list of processes in a circular doubly linked list called the task list. Each element in the task list is a process descriptor of the type struct task_struct, which is defined in <linux/sched.h>. The task structure contains all the information about a specific process.

The task_struct is a relatively large data structure, at around 1.7 kilobytes on a 32-bit machine. This size, however, is quite small considering that the structure contains all the information that the kernel has and needs about a process. The task structure contains the data that describes the executing program: open files, the process's address space, pending signals, the process's state, and much more.

struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
        /*
         * For reasons of header soup (see current_thread_info()), this
         * must be the first element of task_struct.
         */
        struct thread_info              thread_info;
#endif
        /* -1 unrunnable, 0 runnable, >0 stopped: */
        volatile long                   state;
        void                            *stack;
        atomic_t                        usage;
        /* Per task flags (PF_*), defined further below: */
        unsigned int                    flags;
        unsigned int                    ptrace;

        ....

        ....

}
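The circular doubly linked task list can be pictured with a small stand-alone model: a dummy head node points to itself when the list is empty, and every insertion or removal only rewires neighbouring next and prev pointers. This is a simplified sketch, not the kernel's list_head implementation; struct task and the task_list_* functions are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical, much reduced process descriptor. */
struct task {
    struct task *next;
    struct task *prev;
    int pid;
};

/* An empty circular list: the head points to itself in both directions. */
void task_list_init(struct task *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert `t` just before the head, that is, at the tail of the list. */
void task_list_add_tail(struct task *t, struct task *head)
{
    t->prev = head->prev;
    t->next = head;
    head->prev->next = t;
    head->prev = t;
}

/* Unlink `t` from whatever list it is on. */
void task_list_del(struct task *t)
{
    t->prev->next = t->next;
    t->next->prev = t->prev;
    t->next = t;
    t->prev = t;
}

/* Walk the ring once and count the entries (the head is not counted). */
size_t task_list_count(const struct task *head)
{
    const struct task *t;
    size_t n = 0;

    for (t = head->next; t != head; t = t->next)
        n++;
    return n;
}
```

Because the list is circular, a walk can start at any element and stops when it arrives back where it began.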

======================================

Another, more readable sample process table structure:

struct proc {
  struct stackframe_s p_reg;    /* process' registers saved in stack frame */

#if (CHIP == INTEL)
  reg_t p_ldt_sel;              /* selector in gdt giving ldt base and limit*/
  struct segdesc_s p_ldt[2];    /* local descriptors for code and data */
                                /* 2 is LDT_SIZE - avoid include protect.h */
#endif /* (CHIP == INTEL) */

#if (CHIP == M68000)
  reg_t p_splow;                /* lowest observed stack value */
  int p_trap;                   /* trap type (only low byte) */
#if (SHADOWING == 0)
  char *p_crp;                  /* mmu table pointer (really struct _rpr *) */
#else
  phys_clicks p_shadow;         /* set if shadowed process image */
  int align;                    /* make the struct size a multiple of 4 */
#endif
  int p_nflips;                 /* statistics */
  char p_physio;                /* cannot be (un)shadowed now if set */
#if defined(FPP)
  struct fsave p_fsave;         /* FPP state frame and registers */
  int align2;                   /* make the struct size a multiple of 4 */
#endif
#endif /* (CHIP == M68000) */

  reg_t *p_stguard;             /* stack guard word */

  int p_nr;                     /* number of this process (for fast access) */

  int p_int_blocked;            /* nonzero if int msg blocked by busy task */
  int p_int_held;               /* nonzero if int msg held by busy syscall */
  struct proc *p_nextheld;      /* next in chain of held-up int processes */

  int p_flags;                  /* P_SLOT_FREE, SENDING, RECEIVING, etc. */
  struct mem_map p_map[NR_SEGS];/* memory map */
  pid_t p_pid;                  /* process id passed in from MM */

  clock_t user_time;            /* user time in ticks */

  clock_t sys_time;             /* sys time in ticks */
  clock_t child_utime;          /* cumulative user time of children */
  clock_t child_stime;          /* cumulative sys time of children */
  clock_t p_alarm;              /* time of next alarm in ticks, or 0 */

  struct proc *p_callerq;       /* head of list of procs wishing to send */
  struct proc *p_sendlink;      /* link to next proc wishing to send */
  message *p_messbuf;           /* pointer to message buffer */
  int p_getfrom;                /* from whom does process want to receive? */
  int p_sendto;

  struct proc *p_nextready;     /* pointer to next ready process */
  sigset_t p_pending;           /* bit map for pending signals */
  unsigned p_pendcount;         /* count of pending and unfinished signals */

  char p_name[16];              /* name of the process */
};




The Process Table
04:52

The User Area
04:14

Physical Memory and Virtual Memory
03:19

Regions
06:47

The Kernel Layout
03:21

The Context Of A Process Part 1
13:05

The Context Of A Process Part 2
08:12

Context Switch
01:26

The System Call Table
01:34

The Region Table Entry
01:04

Exercise 1: Write a C program to implement a sample state machine
Process Control
6 Lectures 42:34

PRIVATE int do_fork(m_ptr)
register message *m_ptr;        /* pointer to request message */
{
/* Handle sys_fork().  m_ptr->PROC1 has forked.  The child is m_ptr->PROC2. */

#if (CHIP == INTEL)
  reg_t old_ldt_sel;
#endif
  register struct proc *rpc;
  struct proc *rpp;

  if (!isoksusern(m_ptr->PROC1) || !isoksusern(m_ptr->PROC2))
        return(E_BAD_PROC);
  rpp = proc_addr(m_ptr->PROC1);
  rpc = proc_addr(m_ptr->PROC2);

  /* Copy parent 'proc' struct to child. */
#if (CHIP == INTEL)
  old_ldt_sel = rpc->p_ldt_sel; /* stop this being obliterated by copy */
#endif

  *rpc = *rpp;                  /* copy 'proc' struct */

#if (CHIP == INTEL)
  rpc->p_ldt_sel = old_ldt_sel;
#endif
  rpc->p_nr = m_ptr->PROC2;     /* this was obliterated by copy */

#if (SHADOWING == 0)
  rpc->p_flags |= NO_MAP;       /* inhibit the process from running */
#endif

  rpc->p_flags &= ~(PENDING | SIG_PENDING | P_STOP);

  /* Only 1 in group should have PENDING, child does not inherit trace status*/
  sigemptyset(&rpc->p_pending);
  rpc->p_pendcount = 0;
  rpc->p_pid = m_ptr->PID;      /* install child's pid */
  rpc->p_reg.retreg = 0;        /* child sees pid = 0 to know it is child */

  rpc->user_time = 0;           /* set all the accounting times to 0 */
  rpc->sys_time = 0;
  rpc->child_utime = 0;
  rpc->child_stime = 0;

#if (SHADOWING == 1)
  rpc->p_nflips = 0;
  mkshadow(rpp, (phys_clicks)m_ptr->m1_p1);     /* run child first */
#endif

  return(OK);
}

=========================================================

In Linux:

#ifdef __ARCH_WANT_SYS_FORK
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
        return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
#else
        /* can not support in nommu mode */
        return -EINVAL;
#endif
}
#endif

long _do_fork(unsigned long clone_flags,
              unsigned long stack_start,
              unsigned long stack_size,
              int __user *parent_tidptr,
              int __user *child_tidptr,
              unsigned long tls)
{
        struct task_struct *p;
        int trace = 0;
        long nr;

        /*
         * Determine whether and which event to report to ptracer.  When
         * called from kernel_thread or CLONE_UNTRACED is explicitly
         * requested, no event is reported; otherwise, report if the event
         * for the type of forking is enabled.
         */
        if (!(clone_flags & CLONE_UNTRACED)) {
                if (clone_flags & CLONE_VFORK)
                        trace = PTRACE_EVENT_VFORK;
                else if ((clone_flags & CSIGNAL) != SIGCHLD)
                        trace = PTRACE_EVENT_CLONE;
                else
                        trace = PTRACE_EVENT_FORK;

                if (likely(!ptrace_event_enabled(current, trace)))
                        trace = 0;
        }

        p = copy_process(clone_flags, stack_start, stack_size,
                         child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
        add_latent_entropy();
        /*
         * Do this prior waking up the new thread - the thread pointer
         * might get invalid after that point, if the thread exits quickly.
         */
        if (!IS_ERR(p)) {
                struct completion vfork;
                struct pid *pid;

                trace_sched_process_fork(current, p);

                pid = get_task_pid(p, PIDTYPE_PID);
                nr = pid_vnr(pid);

                if (clone_flags & CLONE_PARENT_SETTID)
                        put_user(nr, parent_tidptr);

                if (clone_flags & CLONE_VFORK) {
                        p->vfork_done = &vfork;
                        init_completion(&vfork);
                        get_task_struct(p);
                }

                wake_up_new_task(p);

                /* forking complete and child started to run, tell ptracer */
                if (unlikely(trace))
                        ptrace_event_pid(trace, pid);

                if (clone_flags & CLONE_VFORK) {
                        if (!wait_for_vfork_done(p, &vfork))
                                ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
                }

                put_pid(pid);
        } else {
                nr = PTR_ERR(p);
        }
        return nr;
}




The Algorithm For Fork System Call
15:49
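
Seen from user space, the `retreg = 0` assignment in the MINIX listing and the `nr` returned by `_do_fork()` mean that fork() returns twice: 0 in the child and the child's pid in the parent. The following is a minimal sketch of that, assuming a POSIX system; `fork_and_collect` is a hypothetical helper name, not part of either kernel.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code` and return the
 * exit status collected by the parent, or -1 on error. */
int fork_and_collect(int code)
{
    assert(code >= 0 && code <= 255);   /* exit statuses are 8 bits wide */

    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed, no child created */
    if (pid == 0)
        _exit(code);                    /* child: fork() returned 0 */

    /* Parent: fork() returned the child's pid. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `fork_and_collect(7)` should hand back 7 in the parent, since the child does nothing but exit with that code.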

Inheritance of Child From Parent
05:37

In Linux:

#ifdef __ARCH_WANT_SYS_VFORK
SYSCALL_DEFINE0(vfork)
{
        return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
                        0, NULL, NULL, 0);
}
#endif





The VFork System Call
03:46
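
Because vfork() passes CLONE_VFORK | CLONE_VM, the parent sits in wait_for_vfork_done() until the child calls exec or exits, and the child runs in the parent's address space in the meantime. The sketch below shows the usual vfork-then-exec pattern under those rules. It assumes a system that still provides vfork() (it was removed from POSIX.1-2008); `run_with_vfork` is a hypothetical helper name.

```c
#define _DEFAULT_SOURCE                 /* glibc: expose vfork() */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` with `argv` through vfork() + execv().
 * Returns the program's exit status, 127 if the exec failed, -1 on error. */
int run_with_vfork(const char *path, char *const argv[])
{
    assert(path != NULL && argv != NULL);

    pid_t pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child shares the parent's memory: only exec or _exit here. */
        execv(path, argv);
        _exit(127);                     /* reached only if execv failed */
    }

    /* Parent resumes once the child has exec'ed or exited. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child must not return from the function or modify variables, because any change would be visible to the suspended parent.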

The Process Termination
08:08

sample code for wait system call:

PUBLIC int do_waitpid()
{
/* A process wants to wait for a child to terminate. If one is already waiting,
 * go clean it up and let this WAIT call terminate.  Otherwise, really wait.
 * Both WAIT and WAITPID are handled by this code.
 */

  register struct mproc *rp;
  int pidarg, options, children, res2;

  /* A process calling WAIT never gets a reply in the usual way via the
   * reply() in the main loop (unless WNOHANG is set or no qualifying child
   * exists).  If a child has already exited, the routine cleanup() sends
   * the reply to awaken the caller.
   */

  /* Set internal variables, depending on whether this is WAIT or WAITPID. */
  pidarg  = (mm_call == WAIT ? -1 : pid);       /* first param of waitpid */
  options = (mm_call == WAIT ?  0 : sig_nr);    /* third param of waitpid */
  if (pidarg == 0) pidarg = -mp->mp_procgrp;    /* pidarg < 0 ==> proc grp */

  /* Is there a child waiting to be collected? At this point, pidarg != 0:
   *    pidarg  >  0 means pidarg is pid of a specific process to wait for
   *    pidarg == -1 means wait for any child
   *    pidarg  < -1 means wait for any child whose process group = -pidarg
   */
  children = 0;
  for (rp = &mproc[0]; rp < &mproc[NR_PROCS]; rp++) {
        if ( (rp->mp_flags & IN_USE) && rp->mp_parent == who) {
                /* The value of pidarg determines which children qualify. */
                if (pidarg  > 0 && pidarg != rp->mp_pid) continue;
                if (pidarg < -1 && -pidarg != rp->mp_procgrp) continue;

                children++;             /* this child is acceptable */
                if (rp->mp_flags & HANGING) {
                        /* This child meets the pid test and has exited. */
                        cleanup(rp);    /* this child has already exited */
                        dont_reply = TRUE;
                        return(OK);
                }

                if ((rp->mp_flags & STOPPED) && rp->mp_sigstatus) {
                        /* This child meets the pid test and is being traced.*/
                        res2 =  0177 | (rp->mp_sigstatus << 8);
                        reply(who, rp->mp_pid, res2, NIL_PTR);
                        dont_reply = TRUE;
                        rp->mp_sigstatus = 0;
                        return(OK);
                }
        }
  }

  /* No qualifying child has exited.  Wait for one, unless none exists. */
  if (children > 0) {
        /* At least 1 child meets the pid test exists, but has not exited. */
        if (options & WNOHANG) return(0);    /* parent does not want to wait */
        mp->mp_flags |= WAITING;             /* parent wants to wait */
        mp->mp_wpid = (pid_t) pidarg;        /* save pid for later */
        dont_reply = TRUE;                   /* do not reply now though */
        return(OK);                          /* yes - wait for one to exit */
  } else {
        /* No child even meets the pid test.  Return error immediately. */
        return(ECHILD);                      /* no - parent has no children */
  }
}


The Algorithm For Wait System Call
02:56
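
The three outcomes in do_waitpid() (a child exists but has not exited, a child has exited, no qualifying child) can be observed from user space with waitpid(). This is a minimal sketch, assuming a POSIX system; `waitpid_branches_demo` is a hypothetical name, and the pipe is only there to keep the child alive until the parent is ready.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 on success, -1 if setup failed. */
int waitpid_branches_demo(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: block until the parent closes its write end. */
        char c;
        close(fds[1]);
        if (read(fds[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(fds[0]);

    /* Child exists but has not exited: WNOHANG returns 0 at once. */
    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);
    assert(r == 0);

    /* Release the child, then really wait: returns the child's pid. */
    close(fds[1]);
    r = waitpid(pid, &status, 0);
    assert(r == pid && WIFEXITED(status));

    /* Child already collected: no qualifying child is left, so ECHILD. */
    errno = 0;
    r = waitpid(pid, &status, 0);
    assert(r == -1 && errno == ECHILD);
    return 0;
}
```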

>> Please refer to the a.out.h file for the structure of an a.out file

----------------------------------------------------------------------------------------------------------

sample source code for exec system call:

PUBLIC int do_exec()
{
/* Perform the execve(name, argv, envp) call.  The user library builds a
 * complete stack image, including pointers, args, environ, etc.  The stack
 * is copied to a buffer inside MM, and then to the new core image.
 */

  register struct mproc *rmp;
  struct mproc *sh_mp;
  int m, r, fd, ft, sn;
  static char mbuf[ARG_MAX];    /* buffer for stack and zeroes */
  static char name_buf[PATH_MAX]; /* the name of the file to exec */
  char *new_sp, *basename;
  vir_bytes src, dst, text_bytes, data_bytes, bss_bytes, stk_bytes, vsp;
  phys_bytes tot_bytes;         /* total space for program, including gap */
  long sym_bytes;
  vir_clicks sc;
  struct stat s_buf;
  vir_bytes pc;

  /* Do some validity checks. */
  rmp = mp;
  stk_bytes = (vir_bytes) stack_bytes;
  if (stk_bytes > ARG_MAX) return(ENOMEM);      /* stack too big */
  if (exec_len <= 0 || exec_len > PATH_MAX) return(EINVAL);

  /* Get the exec file name and see if the file is executable. */
  src = (vir_bytes) exec_name;
  dst = (vir_bytes) name_buf;
  r = sys_copy(who, D, (phys_bytes) src,
                MM_PROC_NR, D, (phys_bytes) dst, (phys_bytes) exec_len);
  if (r != OK) return(r);       /* file name not in user data segment */
  tell_fs(CHDIR, who, FALSE, 0);        /* switch to the user's FS environ. */
  fd = allowed(name_buf, &s_buf, X_BIT);        /* is file executable? */
  if (fd < 0) return(fd);       /* file was not executable */

  /* Read the file header and extract the segment sizes. */
  sc = (stk_bytes + CLICK_SIZE - 1) >> CLICK_SHIFT;
  m = read_header(fd, &ft, &text_bytes, &data_bytes, &bss_bytes,
                                        &tot_bytes, &sym_bytes, sc, &pc);
  if (m < 0) {
        close(fd);              /* something wrong with header */
        return(ENOEXEC);
  }

  /* Fetch the stack from the user before destroying the old core image. */
  src = (vir_bytes) stack_ptr;
  dst = (vir_bytes) mbuf;
  r = sys_copy(who, D, (phys_bytes) src,
                        MM_PROC_NR, D, (phys_bytes) dst, (phys_bytes)stk_bytes);
  if (r != OK) {
        close(fd);              /* can't fetch stack (e.g. bad virtual addr) */
        return(EACCES);
  }

  /* Can the process' text be shared with that of one already running? */
  sh_mp = find_share(rmp, s_buf.st_ino, s_buf.st_dev, s_buf.st_ctime);

  /* Allocate new memory and release old memory.  Fix map and tell kernel. */
  r = new_mem(sh_mp, text_bytes, data_bytes, bss_bytes, stk_bytes, tot_bytes);
  if (r != OK) {
        close(fd);              /* insufficient core or program too big */
        return(r);
  }

  /* Save file identification to allow it to be shared. */
  rmp->mp_ino = s_buf.st_ino;
  rmp->mp_dev = s_buf.st_dev;
  rmp->mp_ctime = s_buf.st_ctime;

  /* Patch up stack and copy it from MM to new core image. */
  vsp = (vir_bytes) rmp->mp_seg[S].mem_vir << CLICK_SHIFT;
  vsp += (vir_bytes) rmp->mp_seg[S].mem_len << CLICK_SHIFT;
  vsp -= stk_bytes;
  patch_ptr(mbuf, vsp);
  src = (vir_bytes) mbuf;
  r = sys_copy(MM_PROC_NR, D, (phys_bytes) src,
                        who, D, (phys_bytes) vsp, (phys_bytes)stk_bytes);
  if (r != OK) panic("do_exec stack copy err", NO_NUM);

  /* Read in text and data segments. */
  if (sh_mp != NULL) {
        lseek(fd, (off_t) text_bytes, SEEK_CUR);  /* shared: skip text */
  } else {
        load_seg(fd, T, text_bytes);
  }
  load_seg(fd, D, data_bytes);

#if (SHADOWING == 1)
  if (lseek(fd, (off_t)sym_bytes, SEEK_CUR) == (off_t) -1) ;    /* error */
  if (relocate(fd, (unsigned char *)mbuf) < 0)  ;               /* error */
  pc += (vir_bytes) rp->mp_seg[T].mem_vir << CLICK_SHIFT;
#endif

  close(fd);                    /* don't need exec file any more */

  /* Take care of setuid/setgid bits. */
  if ((rmp->mp_flags & TRACED) == 0) { /* suppress if tracing */
        if (s_buf.st_mode & I_SET_UID_BIT) {
                rmp->mp_effuid = s_buf.st_uid;
                tell_fs(SETUID,who, (int)rmp->mp_realuid, (int)rmp->mp_effuid);
        }
        if (s_buf.st_mode & I_SET_GID_BIT) {
                rmp->mp_effgid = s_buf.st_gid;
                tell_fs(SETGID,who, (int)rmp->mp_realgid, (int)rmp->mp_effgid);
        }
  }

  /* Save offset to initial argc (for ps) */
  rmp->mp_procargs = vsp;

  /* Fix 'mproc' fields, tell kernel that exec is done,  reset caught sigs. */
  for (sn = 1; sn <= _NSIG; sn++) {
        if (sigismember(&rmp->mp_catch, sn)) {
                sigdelset(&rmp->mp_catch, sn);
                rmp->mp_sigact[sn].sa_handler = SIG_DFL;
                sigemptyset(&rmp->mp_sigact[sn].sa_mask);
        }
  }

  rmp->mp_flags &= ~SEPARATE;   /* turn off SEPARATE bit */
  rmp->mp_flags |= ft;          /* turn it on for separate I & D files */
  new_sp = (char *) vsp;

  tell_fs(EXEC, who, 0, 0);     /* allow FS to handle FD_CLOEXEC files */

  /* System will save command line for debugging, ps(1) output, etc. */
  basename = strrchr(name_buf, '/');
  if (basename == NULL) basename = name_buf; else basename++;
  sys_exec(who, new_sp, rmp->mp_flags & TRACED, basename, pc);
  return(OK);
}

==============================================




The Algorithm For Exec System Call
06:18
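
do_exec() returns an error code when the file cannot be executed, and on success it tells the file system to close FD_CLOEXEC descriptors. A common user-space idiom relies on both facts to tell the parent whether the child's exec worked. The sketch below assumes a POSIX system; `exec_in_child` is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: exec `path` in a child process.  Returns 0 if the
 * exec succeeded, the child's errno if it failed, -1 if setup failed. */
int exec_in_child(const char *path, char *const argv[])
{
    assert(path != NULL && argv != NULL);

    int fds[2];
    if (pipe(fds) < 0)
        return -1;
    /* Close-on-exec: a successful exec closes the write end for us. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        execv(path, argv);
        int err = errno;                /* reached only if exec failed */
        if (write(fds[1], &err, sizeof err) < 0)
            _exit(126);
        _exit(127);
    }

    close(fds[1]);
    int err = 0;
    ssize_t n = read(fds[0], &err, sizeof err);   /* 0 bytes means EOF */
    close(fds[0]);
    if (waitpid(pid, NULL, 0) != pid || n < 0)
        return -1;
    return n == (ssize_t)sizeof err ? err : 0;
}
```

An end-of-file on the pipe means the new image replaced the child, while an errno value means exec returned to the caller.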

Exercise 1: Write a C program to implement your own ps command

Exercise 2: Write a C program to implement your own sleep command

Exercise 3: Write a C program to implement your own shell

Exercise 4: Write a C program that demonstrates the functionality of daemons
Inter process Communication
17 Lectures 01:18:40
Introduction
01:38

Pipes
05:35

Named and Unnamed Pipes
00:45

sample code for pipe system call:

PUBLIC int do_pipe()
{
/* Perform the pipe(fil_des) system call. */

  register struct fproc *rfp;
  register struct inode *rip;
  int r;
  struct filp *fil_ptr0, *fil_ptr1;
  int fil_des[2];               /* reply goes here */

  /* Acquire two file descriptors. */
  rfp = fp;
  if ( (r = get_fd(0, R_BIT, &fil_des[0], &fil_ptr0)) != OK) return(r);
  rfp->fp_filp[fil_des[0]] = fil_ptr0;
  fil_ptr0->filp_count = 1;
  if ( (r = get_fd(0, W_BIT, &fil_des[1], &fil_ptr1)) != OK) {
        rfp->fp_filp[fil_des[0]] = NIL_FILP;
        fil_ptr0->filp_count = 0;
        return(r);
  }
  rfp->fp_filp[fil_des[1]] = fil_ptr1;
  fil_ptr1->filp_count = 1;

  /* Make the inode on the pipe device. */
  if ( (rip = alloc_inode(PIPE_DEV, I_REGULAR) ) == NIL_INODE) {
        rfp->fp_filp[fil_des[0]] = NIL_FILP;
        fil_ptr0->filp_count = 0;
        rfp->fp_filp[fil_des[1]] = NIL_FILP;
        fil_ptr1->filp_count = 0;
        return(err_code);
  }

  if (read_only(rip) != OK) panic("pipe device is read only", NO_NUM);

  rip->i_pipe = I_PIPE;
  rip->i_mode &= ~I_REGULAR;
  rip->i_mode |= I_NAMED_PIPE;  /* pipes and FIFOs have this bit set */
  fil_ptr0->filp_ino = rip;
  fil_ptr0->filp_flags = O_RDONLY;

  dup_inode(rip);               /* for double usage */

  fil_ptr1->filp_ino = rip;
  fil_ptr1->filp_flags = O_WRONLY;
  rw_inode(rip, WRITING);       /* mark inode as allocated */
  reply_i1 = fil_des[0];
  reply_i2 = fil_des[1];
  rip->i_update = ATIME | CTIME | MTIME;
  return(OK);
}

=========================================================

In Linux:

/*
 * sys_pipe() is the normal C calling standard for creating
 * a pipe. It's not the way Unix traditionally does this, though.
 */
SYSCALL_DEFINE2(pipe2, int __user *, fildes, int, flags)
{
        struct file *files[2];
        int fd[2];
        int error;

        error = __do_pipe_flags(fd, files, flags);
        if (!error) {
                if (unlikely(copy_to_user(fildes, fd, sizeof(fd)))) {
                        fput(files[0]);
                        fput(files[1]);
                        put_unused_fd(fd[0]);
                        put_unused_fd(fd[1]);
                        error = -EFAULT;
                } else {
                        fd_install(fd[0], files[0]);
                        fd_install(fd[1], files[1]);
                }
        }
        return error;
}

OR

SYSCALL_DEFINE1(pipe, int __user *, fildes)
{
        return sys_pipe2(fildes, 0);
}



The Algorithm For Pipe System Call
05:34
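
Both listings hand back two descriptors that refer to one pipe object: index 0 is opened read-only and index 1 write-only. A minimal user-space sketch of that pairing follows, assuming a POSIX system. `pipe_roundtrip` is a hypothetical helper name, and the 512-byte limit is an assumption chosen so that the write cannot block.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: push `msg` through a fresh pipe inside one process.
 * Returns the number of bytes read into `out`, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outsz)
{
    assert(msg != NULL && out != NULL);

    size_t len = strlen(msg);
    if (len > 512)                      /* 512 is the POSIX minimum PIPE_BUF */
        return -1;

    int fds[2];                         /* fds[0] reads, fds[1] writes */
    if (pipe(fds) < 0)
        return -1;

    ssize_t n = -1;
    if (write(fds[1], msg, len) == (ssize_t)len) {
        close(fds[1]);                  /* no writers left: EOF after data */
        fds[1] = -1;
        n = read(fds[0], out, outsz);
    }
    if (fds[1] >= 0)
        close(fds[1]);
    close(fds[0]);
    return n;
}
```

In real programs the two ends normally live in different processes, with fork() called after pipe() so that parent and child inherit both descriptors.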

Example: Pipes
07:36

Introduction to Signals
01:04

Classification of Signals
09:47

SYSCALL_DEFINE2(signal, int, sig, __sighandler_t, handler)
{
        struct k_sigaction new_sa, old_sa;
        int ret;

        new_sa.sa.sa_handler = handler;
        new_sa.sa.sa_flags = SA_ONESHOT | SA_NOMASK;
        sigemptyset(&new_sa.sa.sa_mask);

        ret = do_sigaction(sig, &new_sa, &old_sa);

        return ret ? ret : (unsigned long)old_sa.sa.sa_handler;
}

int do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact)
{
        struct task_struct *p = current, *t;
        struct k_sigaction *k;
        sigset_t mask;

        if (!valid_signal(sig) || sig < 1 || (act && sig_kernel_only(sig)))
                return -EINVAL;

        k = &p->sighand->action[sig-1];

        spin_lock_irq(&p->sighand->siglock);
        if (oact)
                *oact = *k;

        sigaction_compat_abi(act, oact);

        if (act) {
                sigdelsetmask(&act->sa.sa_mask,
                              sigmask(SIGKILL) | sigmask(SIGSTOP));
                *k = *act;
                /*
                 * POSIX 3.3.1.3:
                 *  "Setting a signal action to SIG_IGN for a signal that is
                 *   pending shall cause the pending signal to be discarded,
                 *   whether or not it is blocked."
                 *
                 *  "Setting a signal action to SIG_DFL for a signal that is
                 *   pending and whose default action is to ignore the signal
                 *   (for example, SIGCHLD), shall cause the pending signal to
                 *   be discarded, whether or not it is blocked"
                 */
                if (sig_handler_ignored(sig_handler(p, sig), sig)) {
                        sigemptyset(&mask);
                        sigaddset(&mask, sig);
                        flush_sigqueue_mask(&mask, &p->signal->shared_pending);
                        for_each_thread(p, t)
                                flush_sigqueue_mask(&mask, &t->pending);
                }
        }

        spin_unlock_irq(&p->sighand->siglock);
        return 0;
}


Signal Algorithm
12:25
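
The signal() system call above installs a handler with SA_ONESHOT | SA_NOMASK, and do_sigaction() refuses to change the action for SIGKILL and SIGSTOP. The sketch below reproduces both behaviours through sigaction(), using the POSIX flag names SA_RESETHAND and SA_NODEFER. It assumes a single-threaded process with SIGUSR1 unblocked, and `signal_semantics_demo` is a hypothetical name. A C library's signal() wrapper may choose different flags, which is why sigaction() is called directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t hits;

static void on_usr1(int sig)
{
    (void)sig;
    hits++;
}

/* Hypothetical demo: returns 0 on success, -1 if sigaction() setup failed. */
int signal_semantics_demo(void)
{
    struct sigaction sa, cur;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sa.sa_flags = SA_RESETHAND | SA_NODEFER;    /* one-shot, no masking */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) < 0)
        return -1;

    hits = 0;
    raise(SIGUSR1);                     /* handler runs before raise() returns */
    if (sigaction(SIGUSR1, NULL, &cur) < 0)
        return -1;
    /* One-shot: the handler ran once and the action fell back to SIG_DFL. */
    assert(hits == 1 && cur.sa_handler == SIG_DFL);

    /* SIGKILL cannot be caught or ignored: the kernel answers EINVAL. */
    sa.sa_handler = SIG_IGN;
    sa.sa_flags = 0;
    errno = 0;
    int r = sigaction(SIGKILL, &sa, NULL);
    assert(r == -1 && errno == EINVAL);
    return 0;
}
```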

Introduction To Message Queues
02:53

SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
{
        struct ipc_namespace *ns;
        static const struct ipc_ops msg_ops = {
                .getnew = newque,
                .associate = msg_security,
        };
        struct ipc_params msg_params;

        ns = current->nsproxy->ipc_ns;

        msg_params.key = key;
        msg_params.flg = msgflg;

        return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
}

/**
 * ipcget - Common sys_*get() code
 * @ns: namespace
 * @ids: ipc identifier set
 * @ops: operations to be called on ipc object creation, permission checks
 *       and further checks
 * @params: the parameters needed by the previous operations.
 *
 * Common routine called by sys_msgget(), sys_semget() and sys_shmget().
 */
int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
                        const struct ipc_ops *ops, struct ipc_params *params)
{
        if (params->key == IPC_PRIVATE)
                return ipcget_new(ns, ids, ops, params);
        else
                return ipcget_public(ns, ids, ops, params);
}

/**
 * ipcget_new - create a new ipc object
 * @ns: ipc namespace
 * @ids: ipc identifier set
 * @ops: the actual creation routine to call
 * @params: its parameters
 *
 * This routine is called by sys_msgget, sys_semget() and sys_shmget()
 * when the key is IPC_PRIVATE.
 */
static int ipcget_new(struct ipc_namespace *ns, struct ipc_ids *ids,
                const struct ipc_ops *ops, struct ipc_params *params)
{
        int err;

        down_write(&ids->rwsem);
        err = ops->getnew(ns, params);
        up_write(&ids->rwsem);
        return err;
}

/**
 * ipcget_public - get an ipc object or create a new one
 * @ns: ipc namespace
 * @ids: ipc identifier set
 * @ops: the actual creation routine to call
 * @params: its parameters
 *
 * This routine is called by sys_msgget, sys_semget() and sys_shmget()
 * when the key is not IPC_PRIVATE.
 * It adds a new entry if the key is not found and does some permission
 * / security checkings if the key is found.
 *
 * On success, the ipc id is returned.
 */
static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
                const struct ipc_ops *ops, struct ipc_params *params)
{
        struct kern_ipc_perm *ipcp;
        int flg = params->flg;
        int err;

        /*
         * Take the lock as a writer since we are potentially going to add
         * a new entry + read locks are not "upgradable"
         */
        down_write(&ids->rwsem);
        ipcp = ipc_findkey(ids, params->key);
        if (ipcp == NULL) {
                /* key not used */
                if (!(flg & IPC_CREAT))
                        err = -ENOENT;
                else
                        err = ops->getnew(ns, params);
        } else {
                /* ipc object has been locked by ipc_findkey() */

                if (flg & IPC_CREAT && flg & IPC_EXCL)
                        err = -EEXIST;
                else {
                        err = 0;
                        if (ops->more_checks)
                                err = ops->more_checks(ipcp, params);
                        if (!err)
                                /*
                                 * ipc_check_perms returns the IPC id on
                                 * success
                                 */
                                err = ipc_check_perms(ns, ipcp, ops, params);
                }
                ipc_unlock(ipcp);
        }
        up_write(&ids->rwsem);

        return err;
}




The Msgget System Call
03:12

SYSCALL_DEFINE4(msgsnd, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
                int, msgflg)
{
        long mtype;

        if (get_user(mtype, &msgp->mtype))
                return -EFAULT;
        return do_msgsnd(msqid, mtype, msgp->mtext, msgsz, msgflg);
}

long do_msgsnd(int msqid, long mtype, void __user *mtext,
                size_t msgsz, int msgflg)
{
        struct msg_queue *msq;
        struct msg_msg *msg;
        int err;
        struct ipc_namespace *ns;
        DEFINE_WAKE_Q(wake_q);

        ns = current->nsproxy->ipc_ns;

        if (msgsz > ns->msg_ctlmax || (long) msgsz < 0 || msqid < 0)
                return -EINVAL;
        if (mtype < 1)
                return -EINVAL;

        msg = load_msg(mtext, msgsz);
        if (IS_ERR(msg))
                return PTR_ERR(msg);

        msg->m_type = mtype;
        msg->m_ts = msgsz;

        rcu_read_lock();
        msq = msq_obtain_object_check(ns, msqid);
        if (IS_ERR(msq)) {
                err = PTR_ERR(msq);
                goto out_unlock1;
        }

        ipc_lock_object(&msq->q_perm);

        for (;;) {
                struct msg_sender s;

                err = -EACCES;
                if (ipcperms(ns, &msq->q_perm, S_IWUGO))
                        goto out_unlock0;

                /* raced with RMID? */
                if (!ipc_valid_object(&msq->q_perm)) {
                        err = -EIDRM;
                        goto out_unlock0;
                }

                err = security_msg_queue_msgsnd(msq, msg, msgflg);
                if (err)
                        goto out_unlock0;

                if (msg_fits_inqueue(msq, msgsz))
                        break;

                /* queue full, wait: */
                if (msgflg & IPC_NOWAIT) {
                        err = -EAGAIN;
                        goto out_unlock0;
                }

                /* enqueue the sender and prepare to block */
                ss_add(msq, &s, msgsz);

                if (!ipc_rcu_getref(msq)) {
                        err = -EIDRM;
                        goto out_unlock0;
                }

                ipc_unlock_object(&msq->q_perm);
                rcu_read_unlock();
                schedule();

                rcu_read_lock();
                ipc_lock_object(&msq->q_perm);

                ipc_rcu_putref(msq, msg_rcu_free);
                /* raced with RMID? */
                if (!ipc_valid_object(&msq->q_perm)) {
                        err = -EIDRM;
                        goto out_unlock0;
                }
                ss_del(&s);

                if (signal_pending(current)) {
                        err = -ERESTARTNOHAND;
                        goto out_unlock0;
                }
        }

        msq->q_lspid = task_tgid_vnr(current);
        msq->q_stime = get_seconds();

        if (!pipelined_send(msq, msg, &wake_q)) {
                /* no one is waiting for this message, enqueue it */
                list_add_tail(&msg->m_list, &msq->q_messages);
                msq->q_cbytes += msgsz;
                msq->q_qnum++;
                atomic_add(msgsz, &ns->msg_bytes);
                atomic_inc(&ns->msg_hdrs);
        }

        err = 0;
        msg = NULL;

out_unlock0:
        ipc_unlock_object(&msq->q_perm);
        wake_up_q(&wake_q);
out_unlock1:
        rcu_read_unlock();
        if (msg != NULL)
                free_msg(msg);
        return err;
}




The Algorithm for MsgSnd System Call
04:19
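
Two of the early exits in do_msgsnd() are easy to trigger from user space: a message type below 1 gives EINVAL, and a full queue combined with IPC_NOWAIT gives EAGAIN instead of blocking. The sketch below assumes a Linux system with System V IPC enabled. `struct demo_msg` and `msgsnd_checks_demo` are hypothetical names, and the queue is shrunk with IPC_SET only so that it fills after two messages.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* Caller-defined layout: a long type followed by the payload. */
struct demo_msg {
    long mtype;
    char mtext[1024];
};

/* Hypothetical demo: returns 0 on success, -1 if setup failed. */
int msgsnd_checks_demo(void)
{
    int id = msgget(IPC_PRIVATE, 0600);
    if (id < 0)
        return -1;

    struct demo_msg m;
    memset(&m, 0, sizeof m);

    /* mtype < 1 is rejected before the queue is even looked up. */
    m.mtype = 0;
    errno = 0;
    int bad_type = msgsnd(id, &m, sizeof m.mtext, IPC_NOWAIT);
    int bad_type_errno = errno;

    /* Shrink the queue to two payloads so that it fills quickly. */
    struct msqid_ds ds;
    int setup = msgctl(id, IPC_STAT, &ds);
    if (setup == 0) {
        ds.msg_qbytes = 2 * sizeof m.mtext;
        setup = msgctl(id, IPC_SET, &ds);
    }

    int first = -1, second = -1, third = 0, third_errno = 0;
    if (setup == 0) {
        m.mtype = 1;
        first = msgsnd(id, &m, sizeof m.mtext, IPC_NOWAIT);
        second = msgsnd(id, &m, sizeof m.mtext, IPC_NOWAIT);
        /* Queue is full: IPC_NOWAIT turns the sleep into EAGAIN. */
        errno = 0;
        third = msgsnd(id, &m, sizeof m.mtext, IPC_NOWAIT);
        third_errno = errno;
    }

    msgctl(id, IPC_RMID, NULL);         /* clean up before checking results */
    if (setup != 0)
        return -1;
    assert(bad_type == -1 && bad_type_errno == EINVAL);
    assert(first == 0 && second == 0);
    assert(third == -1 && third_errno == EAGAIN);
    return 0;
}
```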

SYSCALL_DEFINE5(msgrcv, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
                long, msgtyp, int, msgflg)
{
        return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill);
}


long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, int msgflg,
               long (*msg_handler)(void __user *, struct msg_msg *, size_t))
{
        int mode;
        struct msg_queue *msq;
        struct ipc_namespace *ns;
        struct msg_msg *msg, *copy = NULL;
        DEFINE_WAKE_Q(wake_q);

        ns = current->nsproxy->ipc_ns;

        if (msqid < 0 || (long) bufsz < 0)
                return -EINVAL;

        if (msgflg & MSG_COPY) {
                if ((msgflg & MSG_EXCEPT) || !(msgflg & IPC_NOWAIT))
                        return -EINVAL;
                copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax));
                if (IS_ERR(copy))
                        return PTR_ERR(copy);
        }
        mode = convert_mode(&msgtyp, msgflg);

        rcu_read_lock();
        msq = msq_obtain_object_check(ns, msqid);
        if (IS_ERR(msq)) {
                rcu_read_unlock();
                free_copy(copy);
                return PTR_ERR(msq);
        }

        for (;;) {
                struct msg_receiver msr_d;

                msg = ERR_PTR(-EACCES);
                if (ipcperms(ns, &msq->q_perm, S_IRUGO))
                        goto out_unlock1;

                ipc_lock_object(&msq->q_perm);

                /* raced with RMID? */
                if (!ipc_valid_object(&msq->q_perm)) {
                        msg = ERR_PTR(-EIDRM);
                        goto out_unlock0;
                }

                msg = find_msg(msq, &msgtyp, mode);
                if (!IS_ERR(msg)) {
                        /*
                         * Found a suitable message.
                         * Unlink it from the queue.
                         */
                        if ((bufsz < msg->m_ts) && !(msgflg & MSG_NOERROR)) {
                                msg = ERR_PTR(-E2BIG);
                                goto out_unlock0;
                        }
                        /*
                         * If we are copying, then do not unlink message and do
                         * not update queue parameters.
                         */
                        if (msgflg & MSG_COPY) {
                                msg = copy_msg(msg, copy);
                                goto out_unlock0;
                        }

                        list_del(&msg->m_list);
                        msq->q_qnum--;
                        msq->q_rtime = get_seconds();
                        msq->q_lrpid = task_tgid_vnr(current);
                        msq->q_cbytes -= msg->m_ts;
                        atomic_sub(msg->m_ts, &ns->msg_bytes);
                        atomic_dec(&ns->msg_hdrs);
                        ss_wakeup(msq, &wake_q, false);

                        goto out_unlock0;
                }

                /* No message waiting. Wait for a message */
                if (msgflg & IPC_NOWAIT) {
                        msg = ERR_PTR(-ENOMSG);
                        goto out_unlock0;
                }

                list_add_tail(&msr_d.r_list, &msq->q_receivers);
                msr_d.r_tsk = current;
                msr_d.r_msgtype = msgtyp;
                msr_d.r_mode = mode;
                if (msgflg & MSG_NOERROR)
                        msr_d.r_maxsize = INT_MAX;
                else
                        msr_d.r_maxsize = bufsz;
                msr_d.r_msg = ERR_PTR(-EAGAIN);
                __set_current_state(TASK_INTERRUPTIBLE);

                ipc_unlock_object(&msq->q_perm);
                rcu_read_unlock();
                schedule();

                /*
                 * Lockless receive, part 1:
                 * We don't hold a reference to the queue and getting a
                 * reference would defeat the idea of a lockless operation,
                 * thus the code relies on rcu to guarantee the existence of
                 * msq:
                 * Prior to destruction, expunge_all(-EIDRM) changes r_msg.
                 * Thus if r_msg is -EAGAIN, then the queue is not yet destroyed.
                 */
                rcu_read_lock();

                /*
                 * Lockless receive, part 2:
                 * The work in pipelined_send() and expunge_all():
                 * - Set pointer to message
                 * - Queue the receiver task for later wakeup
                 * - Wake up the process after the lock is dropped.
                 *
                 * Should the process wake up before this wakeup (due to a
                 * signal) it will either see the message and continue ...
                 */
                msg = READ_ONCE(msr_d.r_msg);

                if (msg != ERR_PTR(-EAGAIN))
                        goto out_unlock1;

                /*
                 * ... or see -EAGAIN, acquire the lock to check the message
                 * again.
                 */
                ipc_lock_object(&msq->q_perm);

                msg = msr_d.r_msg;
                if (msg != ERR_PTR(-EAGAIN))
                        goto out_unlock0;

                list_del(&msr_d.r_list);
                if (signal_pending(current)) {
                        msg = ERR_PTR(-ERESTARTNOHAND);
                        goto out_unlock0;
                }

                ipc_unlock_object(&msq->q_perm);
        }

out_unlock0:
        ipc_unlock_object(&msq->q_perm);
        wake_up_q(&wake_q);
out_unlock1:
        rcu_read_unlock();
        if (IS_ERR(msg)) {
                free_copy(copy);
                return PTR_ERR(msg);
        }

        bufsz = msg_handler(buf, msg, bufsz);
        free_msg(msg);

        return bufsz;
}
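
Seen from user space, two of the branches in the receive path above are easy to observe: the IPC_NOWAIT exit that returns ENOMSG when no message is waiting, and the MSG_NOERROR case where an oversized message is truncated rather than rejected. The following is a minimal sketch and not part of the course material; the struct demo_msg layout and both function names are assumptions made for illustration, and each function works on a private queue that it removes before returning.

```c
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* Hypothetical message layout: mtype must come first and be > 0. */
struct demo_msg {
        long mtype;
        char mtext[64];
};

/*
 * Non-blocking receive on a fresh, empty private queue.
 * Returns the errno that msgrcv() reported (ENOMSG is expected),
 * 0 if the call unexpectedly succeeded, or -1 if no queue could be created.
 */
int recv_on_empty_queue(void)
{
        struct demo_msg m;
        int saved = 0;
        int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

        if (qid < 0)
                return -1;
        if (msgrcv(qid, &m, sizeof(m.mtext), 0, IPC_NOWAIT) < 0)
                saved = errno;
        msgctl(qid, IPC_RMID, NULL);    /* always remove the queue */
        return saved;
}

/*
 * Sends `text`, then receives it into a buffer limited to `bufsz` bytes
 * with MSG_NOERROR set, so a longer message is truncated instead of
 * failing with E2BIG. Returns the number of bytes received, or -1.
 */
long send_then_recv_truncated(const char *text, size_t bufsz)
{
        struct demo_msg out = { .mtype = 1 };
        struct demo_msg in;
        size_t len = strlen(text);
        long n = -1;
        int qid;

        if (len > sizeof(out.mtext) || bufsz > sizeof(in.mtext))
                return -1;
        qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
        if (qid < 0)
                return -1;
        memcpy(out.mtext, text, len);
        if (msgsnd(qid, &out, len, 0) == 0)
                n = (long)msgrcv(qid, &in, bufsz, 1,
                                 IPC_NOWAIT | MSG_NOERROR);
        msgctl(qid, IPC_RMID, NULL);
        return n;
}
```

Calling recv_on_empty_queue() should report ENOMSG, and send_then_recv_truncated("hello world", 5) should return 5 because only the first five bytes fit in the receive buffer.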




The Algorithm for MsgRecv System Call
07:19

SYSCALL_DEFINE3(msgctl, int, msqid, int, cmd, struct msqid_ds __user *, buf)
{
        int version;
        struct ipc_namespace *ns;

        if (msqid < 0 || cmd < 0)
                return -EINVAL;

        version = ipc_parse_version(&cmd);
        ns = current->nsproxy->ipc_ns;

        switch (cmd) {
        case IPC_INFO:
        case MSG_INFO:
        case MSG_STAT:  /* msqid is an index rather than a msg queue id */
        case IPC_STAT:
                return msgctl_nolock(ns, msqid, cmd, version, buf);
        case IPC_SET:
        case IPC_RMID:
                return msgctl_down(ns, msqid, cmd, buf, version);
        default:
                return -EINVAL;
        }
}


The MsgCtl System Call
02:19

Introduction To Shared Memory
02:47

SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
{
        struct ipc_namespace *ns;
        static const struct ipc_ops shm_ops = {
                .getnew = newseg,
                .associate = shm_security,
                .more_checks = shm_more_checks,
        };
        struct ipc_params shm_params;

        ns = current->nsproxy->ipc_ns;

        shm_params.key = key;
        shm_params.flg = shmflg;
        shm_params.u.size = size;

        return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
}

Shared Memory Header
04:54

SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
{
        unsigned long ret;
        long err;

        err = do_shmat(shmid, shmaddr, shmflg, &ret, SHMLBA);
        if (err)
                return err;
        force_successful_syscall_return();
        return (long)ret;

}


long do_shmat(int shmid, char __user *shmaddr, int shmflg,
              ulong *raddr, unsigned long shmlba)
{
        struct shmid_kernel *shp;
        unsigned long addr = (unsigned long)shmaddr;
        unsigned long size;
        struct file *file;
        int    err;
        unsigned long flags = MAP_SHARED;
        unsigned long prot;
        int acc_mode;
        struct ipc_namespace *ns;
        struct shm_file_data *sfd;
        struct path path;
        fmode_t f_mode;
        unsigned long populate = 0;

        err = -EINVAL;
        if (shmid < 0)
                goto out;

        if (addr) {
                if (addr & (shmlba - 1)) {
                        /*
                         * Round down to the nearest multiple of shmlba.
                         * For sane do_mmap_pgoff() parameters, avoid
                         * round downs that trigger nil-page and MAP_FIXED.
                         */
                        if ((shmflg & SHM_RND) && addr >= shmlba)
                                addr &= ~(shmlba - 1);
                        else
#ifndef __ARCH_FORCE_SHMLBA
                                if (addr & ~PAGE_MASK)
#endif
                                        goto out;
                }

                flags |= MAP_FIXED;
        } else if (shmflg & SHM_REMAP)
                goto out;

        if (shmflg & SHM_RDONLY) {
                prot = PROT_READ;
                acc_mode = S_IRUGO;
                f_mode = FMODE_READ;
        } else {
                prot = PROT_READ | PROT_WRITE;
                acc_mode = S_IRUGO | S_IWUGO;
                f_mode = FMODE_READ | FMODE_WRITE;
        }
        if (shmflg & SHM_EXEC) {
                prot |= PROT_EXEC;
                acc_mode |= S_IXUGO;
        }

        /*
         * We cannot rely on the fs check since SYSV IPC does have an
         * additional creator id...
         */
        ns = current->nsproxy->ipc_ns;
        rcu_read_lock();
        shp = shm_obtain_object_check(ns, shmid);
        if (IS_ERR(shp)) {
                err = PTR_ERR(shp);
                goto out_unlock;
        }

        err = -EACCES;
        if (ipcperms(ns, &shp->shm_perm, acc_mode))
                goto out_unlock;

        err = security_shm_shmat(shp, shmaddr, shmflg);
        if (err)
                goto out_unlock;

        ipc_lock_object(&shp->shm_perm);

        /* check if shm_destroy() is tearing down shp */
        if (!ipc_valid_object(&shp->shm_perm)) {

                ipc_unlock_object(&shp->shm_perm);
                err = -EIDRM;
                goto out_unlock;
        }

        path = shp->shm_file->f_path;
        path_get(&path);
        shp->shm_nattch++;
        size = i_size_read(d_inode(path.dentry));
        ipc_unlock_object(&shp->shm_perm);
        rcu_read_unlock();

        err = -ENOMEM;
        sfd = kzalloc(sizeof(*sfd), GFP_KERNEL);
        if (!sfd) {
                path_put(&path);
                goto out_nattch;
        }

        file = alloc_file(&path, f_mode,
                          is_file_hugepages(shp->shm_file) ?
                                &shm_file_operations_huge :
                                &shm_file_operations);
        err = PTR_ERR(file);
        if (IS_ERR(file)) {
                kfree(sfd);
                path_put(&path);
                goto out_nattch;
        }

        file->private_data = sfd;
        file->f_mapping = shp->shm_file->f_mapping;
        sfd->id = shp->shm_perm.id;
        sfd->ns = get_ipc_ns(ns);
        sfd->file = shp->shm_file;
        sfd->vm_ops = NULL;

        err = security_mmap_file(file, prot, flags);
        if (err)
                goto out_fput;

        if (down_write_killable(&current->mm->mmap_sem)) {
                err = -EINTR;
                goto out_fput;
        }

        if (addr && !(shmflg & SHM_REMAP)) {
                err = -EINVAL;
                if (addr + size < addr)
                        goto invalid;

                if (find_vma_intersection(current->mm, addr, addr + size))
                        goto invalid;
        }

        addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate, NULL);
        *raddr = addr;
        err = 0;
        if (IS_ERR_VALUE(addr))
                err = (long)addr;
invalid:
        up_write(&current->mm->mmap_sem);
        if (populate)
                mm_populate(addr, populate);

out_fput:
        fput(file);

out_nattch:
        down_write(&shm_ids(ns).rwsem);
        shp = shm_lock(ns, shmid);
        shp->shm_nattch--;
        if (shm_may_destroy(ns, shp))
                shm_destroy(ns, shp);
        else
                shm_unlock(shp);
        up_write(&shm_ids(ns).rwsem);
        return err;

out_unlock:
        rcu_read_unlock();

out:
        return err;
}
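
Two effects of do_shmat above are visible to an ordinary program: each successful attach increments shm_nattch, and SHM_RDONLY yields a read-only mapping of the same memory. The following is a minimal sketch under assumptions, not the course's own exercise solution; SEG_SIZE, MAX_ATTACH and both function names are hypothetical.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

#define SEG_SIZE   4096         /* assumed segment size for the sketch */
#define MAX_ATTACH 8            /* assumed upper bound for attach_count() */

/*
 * Attaches one segment twice: read-write and read-only. Writes `text`
 * through the first mapping and compares it through the second.
 * Returns 1 if both mappings show the same bytes, 0 if not, -1 on error.
 */
int roundtrip_matches(const char *text)
{
        int result = -1;
        char *rw;
        const char *ro;
        int id;

        if (strlen(text) >= SEG_SIZE)
                return -1;
        id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
        if (id < 0)
                return -1;

        rw = shmat(id, NULL, 0);                /* kernel picks the address */
        ro = shmat(id, NULL, SHM_RDONLY);
        if (rw != (void *)-1 && ro != (void *)-1) {
                strcpy(rw, text);
                result = (strcmp(ro, text) == 0);
        }
        if (rw != (void *)-1)
                shmdt(rw);
        if (ro != (void *)-1)
                shmdt(ro);
        shmctl(id, IPC_RMID, NULL);
        return result;
}

/*
 * Attaches a fresh segment `n` times and returns the attach count the
 * kernel reports through IPC_STAT (shm_nattch), or -1 on error.
 */
long attach_count(int n)
{
        void *maps[MAX_ATTACH];
        struct shmid_ds ds;
        long count = -1;
        int made = 0;
        int id;

        if (n < 0 || n > MAX_ATTACH)
                return -1;
        id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
        if (id < 0)
                return -1;
        while (made < n) {
                void *p = shmat(id, NULL, 0);

                if (p == (void *)-1)
                        break;
                maps[made++] = p;
        }
        if (made == n && shmctl(id, IPC_STAT, &ds) == 0)
                count = (long)ds.shm_nattch;
        while (made > 0)
                shmdt(maps[--made]);            /* detach everything again */
        shmctl(id, IPC_RMID, NULL);
        return count;
}
```

Here roundtrip_matches("hello") should return 1 and attach_count(3) should return 3. Passing NULL as the address leaves the choice of mapping address to the kernel, which avoids the alignment and SHM_RND checks at the top of do_shmat.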





The Algorithm For Shmat System Call
03:55

SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf)
{
        struct shmid_kernel *shp;
        int err, version;
        struct ipc_namespace *ns;

        if (cmd < 0 || shmid < 0)
                return -EINVAL;

        version = ipc_parse_version(&cmd);
        ns = current->nsproxy->ipc_ns;

        switch (cmd) {
        case IPC_INFO:
        case SHM_INFO:
        case SHM_STAT:
        case IPC_STAT:
                return shmctl_nolock(ns, shmid, cmd, version, buf);
        case IPC_RMID:
        case IPC_SET:
                return shmctl_down(ns, shmid, cmd, buf, version);
        case SHM_LOCK:
        case SHM_UNLOCK:
        {
                struct file *shm_file;

                rcu_read_lock();
                shp = shm_obtain_object_check(ns, shmid);
                if (IS_ERR(shp)) {
                        err = PTR_ERR(shp);
                        goto out_unlock1;
                }

                audit_ipc_obj(&(shp->shm_perm));
                err = security_shm_shmctl(shp, cmd);
                if (err)
                        goto out_unlock1;

                ipc_lock_object(&shp->shm_perm);

                /* check if shm_destroy() is tearing down shp */
                if (!ipc_valid_object(&shp->shm_perm)) {
                        err = -EIDRM;
                        goto out_unlock0;
                }

                if (!ns_capable(ns->user_ns, CAP_IPC_LOCK)) {
                        kuid_t euid = current_euid();

                        if (!uid_eq(euid, shp->shm_perm.uid) &&
                            !uid_eq(euid, shp->shm_perm.cuid)) {
                                err = -EPERM;
                                goto out_unlock0;
                        }
                        if (cmd == SHM_LOCK && !rlimit(RLIMIT_MEMLOCK)) {
                                err = -EPERM;
                                goto out_unlock0;
                        }
                }

                shm_file = shp->shm_file;
                if (is_file_hugepages(shm_file))
                        goto out_unlock0;

                if (cmd == SHM_LOCK) {
                        struct user_struct *user = current_user();

                        err = shmem_lock(shm_file, 1, user);
                        if (!err && !(shp->shm_perm.mode & SHM_LOCKED)) {
                                shp->shm_perm.mode |= SHM_LOCKED;
                                shp->mlock_user = user;
                        }
                        goto out_unlock0;
                }

                /* SHM_UNLOCK */
                if (!(shp->shm_perm.mode & SHM_LOCKED))
                        goto out_unlock0;
                shmem_lock(shm_file, 0, shp->mlock_user);
                shp->shm_perm.mode &= ~SHM_LOCKED;

                shp->mlock_user = NULL;
                get_file(shm_file);
                ipc_unlock_object(&shp->shm_perm);
                rcu_read_unlock();
                shmem_unlock_mapping(shm_file->f_mapping);

                fput(shm_file);
                return err;
        }
        default:
                return -EINVAL;
        }

out_unlock0:
        ipc_unlock_object(&shp->shm_perm);
out_unlock1:
        rcu_read_unlock();
        return err;
}


The Shmctl System Call
02:38

Exercise 2: Implement client server program using FIFO

Exercise 4: Write a C program to demonstrate pipes using child and parent

Exercise 5: Write a C program to demonstrate fifos

Exercise 6: Implement client server program with message queues using semaphore

Exercise 7: Write a program to demonstrate shared memory using semaphores
About the Instructor
Satish Venkatesh
4.1 Average rating
61 Reviews
6,063 Students
5 Courses
Engineer at Udemy

Hi,  

This is Satish. I am a software developer and automation engineer based in India, with around 13 years of software development and automation experience using Python, Perl, C, C++, Java, Unix, Linux, shell scripting, and Selenium WebDriver. I have worked as a software developer, manual tester, automation tester, and automation engineer at many software companies. I am from Bangalore, and my hobbies include teaching, trekking, and reading books.

Thanks & Regards,

Satish