How to Tell a File's Format: Five Open Source Tools

A practical introduction to five software tools for identifying file formats and extracting metadata
Instructed by Gary McGath
Category: IT & Software / Other
  • Lectures: 21
  • Length: 1.5 hours
  • Skill level: Intermediate
  • Language: English

About This Course

Published 12/2015 English

Course Description

You will learn how to use five free, open-source tools to identify the format, version, and profile of document files and obtain their metadata. If you're working in library and archive technology, or if you're a student preparing for this career, the course will give you a strong start in using those tools and understanding their strengths and weaknesses. The five central sections each cover one of these tools:

file: A command line tool included in Linux and Unix for simple file identification.

DROID: A batch-oriented tool from the UK National Archives, using the PRONOM format registry.

ExifTool: A metadata extraction tool that recognizes a broad range of formats.

JHOVE: Software developed at the Harvard University Library for careful validation of certain formats. I wrote most of the code for JHOVE.

Apache Tika: Content extraction software which can identify many formats.

For each tool, there's a discussion of how to use it followed by an on-screen demonstration of installing and using it, as well as a downloadable PDF summarizing the material.

You should be comfortable with installing software on your computer. Familiarity with the Unix/Linux command line is strongly recommended. Most, but not all, of the tools described can run on Windows. All will run on a Macintosh or Linux system.

What are the requirements?

  • Students should be comfortable with downloading and installing software. Familiarity with the Linux/Unix command line is a great help.

What am I going to get from this course?

  • Install and use software tools to identify file formats and extract metadata.

Who is the target audience?

  • Students and professionals working with digital archives are the main audience for this course. Others whose work involves file identification, e.g., in digital forensics, may find value in it.

Section 1: Introduction

A general description of the course and the material it will cover.

3 pages

A review of common Linux/Unix commands for those who would like a refresher.


Concepts of file format identification: Format descriptions, versions, profiles, and metadata. The relationship of the specification to real-life files. Tools with broad coverage vs. tools that examine a few formats in detail.
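
The ideas of format, version, and specification versus real-life files can be sketched in a few lines of Python. This is a minimal illustration, not how any of the five tools is implemented: it checks a handful of well-known leading byte signatures and, for PDF, reads the version declared in the header. The function names and the tiny signature table are assumptions made for this example.

```python
import re

# A tiny, illustrative signature table (hypothetical; real tools use far
# larger registries and look at more than the first few bytes).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(data: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def pdf_version(data: bytes):
    """Return the version declared in a PDF header (e.g. '1.4'), or None."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None
```

A header that declares PDF 1.4 only tells you what the file claims to be. Whether the rest of the file conforms to that version of the specification is a separate question, which is where detailed validation comes in.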

Section 2: file

What file is, its strengths and weaknesses.


A demonstration, using screen capture, of how to use file. After completing this lecture, you should be able to use file on your own.
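
For readers who want to call file from a script, here is a hedged sketch in Python. It assumes a reasonably recent file that supports the `--brief` and `--mime-type` options; the helper names `file_command` and `mime_type` are made up for this example and are not part of the course material.

```python
import shutil
import subprocess


def file_command(path: str) -> list:
    """Build the argument list for `file`, asking for a bare MIME type."""
    # --brief leaves the file name out of the output;
    # --mime-type prints a MIME type such as "application/pdf".
    # "--" ends option parsing, so a name starting with "-" is read as a file.
    return ["file", "--brief", "--mime-type", "--", path]


def mime_type(path: str):
    """Run `file` on path and return its answer, or None if `file` is absent."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        file_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Passing the arguments as a list, without a shell, means that spaces and quotes in file names need no escaping.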

Section 3: DROID

What DROID is, its relationship to the PRONOM format registry, and how to download it.


How to install DROID on a Unix/Linux/Mac system.


How to create, add files to, and run profiles in the DROID GUI application.


How to export data and get reports from the DROID GUI application.
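
Once a profile has been exported to CSV, the export can be summarized with a short script. The sketch below is an illustration only: it assumes the export contains `PUID` and `FORMAT_NAME` columns, so check the header row of your own export before relying on it. The function name `count_formats` is hypothetical.

```python
import csv
import io
from collections import Counter


def count_formats(csv_text: str) -> Counter:
    """Count rows per (PUID, format name) in a DROID-style CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        # Assumed column names; rows without a PUID are grouped together.
        puid = row.get("PUID") or "unidentified"
        name = row.get("FORMAT_NAME") or ""
        counts[(puid, name)] += 1
    return counts
```

To use it, read the exported file as text, pass it to `count_formats`, and print `counts.most_common()` to see which formats dominate a collection.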

How to run DROID from the command line.

Section 4: ExifTool

What ExifTool is, what's required to run it, and what it's useful for.


How to download and install ExifTool.


How to use ExifTool to get information about files.
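
ExifTool can emit its results as JSON with the `-json` option, which makes them easy to consume from a script. The following Python sketch shows one way to do that; the helper names are assumptions for this example, and it assumes `exiftool` is on your PATH when you actually run the command.

```python
import json


def exiftool_command(paths):
    """Build an ExifTool command line that requests JSON output."""
    # -json produces a JSON array with one object per input file.
    return ["exiftool", "-json", *paths]


def parse_exiftool_json(output: str) -> dict:
    """Map each SourceFile to its dictionary of tags."""
    return {entry["SourceFile"]: entry for entry in json.loads(output)}
```

The output of `subprocess.run(exiftool_command(["photo.jpg"]), capture_output=True, text=True).stdout` can be passed to `parse_exiftool_json`, after which tags such as `FileType` and `MIMEType` can be looked up per file.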

Section 5: JHOVE

What JHOVE is, a bit of its history, and a discussion of its strengths and weaknesses.


How to download JHOVE, install it, and launch the command line and GUI versions.


Using JHOVE from the command line to identify and validate files, with some tricks for difficult file names.
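
As a rough sketch of scripting JHOVE, the Python below builds a command line using the `-m` (module) and `-h` (output handler) options. It assumes the launcher script is available as `jhove` on your PATH. The handling of awkward file names shown here is a general command line habit, not necessarily the tricks taught in the lecture, and `jhove_command` is a hypothetical helper.

```python
def jhove_command(path: str, module=None, handler: str = "XML") -> list:
    """Build a JHOVE command line for one file."""
    args = ["jhove", "-h", handler]
    if module:
        # For example "PDF-hul" to force the PDF module.
        args += ["-m", module]
    # A name that begins with "-" could be mistaken for an option;
    # prefixing "./" makes it unambiguously a path.
    if path.startswith("-"):
        path = "./" + path
    args.append(path)
    return args
```

Because the result is a list meant for `subprocess.run` without a shell, names containing spaces, quotes, or other shell metacharacters are passed through unchanged.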


How to use the JHOVE GUI application to identify and validate files, using all modules or selecting a module.

Section 6: Apache Tika

What Tika is, its strengths and weaknesses, and why using it in server mode is the way to go.
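
To give a feel for server mode, here is a hedged Python sketch that asks a running Tika server to detect a file's type. It assumes a tika-server instance listening locally on its default port, 9998, and uses the `/detect/stream` endpoint with an HTTP PUT. The constant and function names are made up for this example.

```python
import urllib.request

# Assumption: a tika-server instance running locally on the default port.
TIKA_SERVER = "http://localhost:9998"


def detect_request(data: bytes, base_url: str = TIKA_SERVER) -> urllib.request.Request:
    """Build a PUT request that sends file bytes to /detect/stream."""
    return urllib.request.Request(
        base_url + "/detect/stream", data=data, method="PUT"
    )


def detect(data: bytes, base_url: str = TIKA_SERVER) -> str:
    """Send the bytes to the server and return the detected MIME type."""
    with urllib.request.urlopen(detect_request(data, base_url)) as response:
        return response.read().decode("utf-8").strip()
```

One practical benefit of this arrangement is that the Java virtual machine starts once, when the server starts, instead of once per file as it would when launching the application for every identification.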


How to download Apache Tika, install it, and run the GUI application.


How to run Tika from the command line. A script is provided (see resource file) to make this easier.

Section 7: Review
5 questions

Review questions on the course.


Review and recommendations for further study.

Instructor Biography

Gary McGath, Software Engineer

I'm an experienced software developer with a strong background in Java, library software, and digital preservation. For eight years I was a software engineer for the Harvard Library. I wrote the bulk of the code for JHOVE, a file identification and analysis tool widely used by libraries and archives. My written work includes the e-book Files that Last and the blog Mad File Format Science.
