You will learn how to use five free, open-source tools to identify the format, version, and profile of document files and to extract their metadata. If you work in library and archive technology, or if you're a student preparing for a career in that field, the course will give you a strong start in using those tools and understanding their strengths and weaknesses. The five central sections each cover one of these tools:
file: A command-line tool included in Linux and Unix for simple file identification.
DROID: A batch-oriented tool from the UK National Archives, using the PRONOM format registry.
ExifTool: A metadata extraction tool that recognizes a broad range of formats.
JHOVE: Software developed at the Harvard University Library for careful validation of certain formats. I wrote most of the code for JHOVE.
Apache Tika: Content extraction software which can identify many formats.
For each tool, there's a discussion of how to use it followed by an on-screen demonstration of installing and using it, as well as a downloadable PDF summarizing the material.
You should be comfortable with installing software on your computer. Familiarity with the Unix/Linux command line is strongly recommended. Most, but not all, of the tools described can run on Windows. All will run on a Macintosh or Linux system.
A review of common Linux/Unix command lines for those who would like a refresher.
Concepts of file format identification: Format descriptions, versions, profiles, and metadata. The relationship of the specification to real-life files. Tools with broad coverage vs. tools that examine a few formats in detail.
What DROID is, its relationship to the PRONOM format registry, and how to download it.
How to install DROID on a Unix/Linux/Mac system.
How to create, add files to, and run profiles in the DROID GUI application.
How to export data and get reports from the DROID GUI application.
What ExifTool is, what's required to run it, and what it's useful for.
How to download and install ExifTool.
How to use ExifTool to get information about files.
What JHOVE is, a bit of its history, and a discussion of its strengths and weaknesses.
How to download JHOVE, install it, and launch the command-line and GUI versions.
Using JHOVE from the command line to identify and validate files, with some tricks for difficult file names.
How to use the JHOVE GUI application to identify and validate files, using all modules or selecting a module.
What Tika is, its strengths and weaknesses, and why using it in server mode is the way to go.
How to download Apache Tika, install it, and run the GUI application.
How to run Tika from the command line. A script is provided (see resource file) to make this easier.
Review questions on the course.
Review and recommendations for further study.
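To give a feel for the identification concepts covered early in the course, here is a minimal Python sketch of signature-based identification, the general idea behind tools such as file and DROID: compare the first bytes of a file against a table of known byte sequences. It is not taken from any of the five tools. The signature table is a tiny illustrative subset, and the names SIGNATURES, identify, pdf_version, and identify_file are hypothetical. The real tools use far larger signature sets and more careful matching.

```python
from typing import Optional

# Illustrative signature table (an assumption for this sketch, not a real
# registry): each entry pairs a leading byte sequence with a format label.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container (also used by DOCX, ODT, EPUB)"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format label if the data starts with a known signature."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def pdf_version(data: bytes) -> Optional[str]:
    """Return the version claimed in a PDF header such as b'%PDF-1.7'.

    This only reads what the header says; it does not check that the file
    actually conforms to that version of the specification.
    """
    if not data.startswith(b"%PDF-"):
        return None
    return data[5:8].decode("ascii", errors="replace")


def identify_file(path: str) -> Optional[str]:
    """Read the first few bytes of a file and identify it by signature."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

A matching signature only shows what a file claims to be. Deciding whether the file really follows its specification is the job of a validating tool, which is why the course looks at both broad-coverage tools and tools that examine a few formats in detail.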
I'm an experienced software developer with a strong background in Java, library software, and digital preservation. For eight years I was a software engineer for the Harvard Library. I wrote the bulk of the code for JHOVE, a file identification and analysis tool widely used by libraries and archives. My written work includes the e-book Files that Last and the blog Mad File Format Science.