Have you ever seen some ordinary text like 'ça va' go through a program or website and get totally ruined, ending up like 'ï»¿Ã§a va'?
This happens when software mishandles text encoding. If you write software that handles text, you need to understand how to use Unicode and its encodings.
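As a rough illustration (not taken from the course materials), the garbled text above can be reproduced in Python 3 by writing the text as UTF-8 with a byte order mark and then reading the bytes back as Windows-1252. The choice of Windows-1252 as the wrong encoding is an assumption; other single-byte encodings give similar results.

```python
# Minimal sketch, assuming Python 3 and that the "wrong" encoding is Windows-1252.
original = "ça va"

# Encode as UTF-8 with a leading byte order mark (BOM), as some editors do.
raw = original.encode("utf-8-sig")

# Decoding those bytes with the wrong encoding turns each byte into its own character.
garbled = raw.decode("cp1252")
print(garbled)  # ï»¿Ã§a va

# Decoding with the encoding that was actually used recovers the text.
repaired = raw.decode("utf-8-sig")
print(repaired)  # ça va
```

The bytes never change here. Only the interpretation of them does, which is why knowing the encoding matters.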
Unicode is the worldwide standard for character representation, supporting over 135 different writing systems and over 128,000 individual characters. This course provides an in-depth explanation of Unicode, encodings and how they all work together, with practical examples in both Python 2 and Python 3.
By the end of this short course you will know everything you need to write software that handles text correctly, every time!
An introduction that outlines the course. The code samples I use are available on my GitHub repository, linked in the resources section of this lecture.
I describe how characters are laid out in computer memory as raw bytes and provide an example of how you might represent a string of text.
This lecture covers the fundamentals of encoding, teaching you how to go from a character all the way through to a sequence of bytes in memory. I look at specific examples in detail, covering both ASCII and UTF-8.
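To give a flavour of the journey from character to bytes, here is a small Python 3 sketch. The helper name describe is hypothetical and the sample characters are my own picks, not necessarily the ones used in the lecture.

```python
# Minimal sketch, assuming Python 3. describe() is a hypothetical helper.
def describe(ch):
    """Return the code point of a character and its UTF-8 byte sequence."""
    code_point = ord(ch)       # character -> Unicode code point (an integer)
    utf8 = ch.encode("utf-8")  # code point -> bytes, using the UTF-8 rules
    return code_point, utf8

for ch in ["A", "ç", "€"]:
    code_point, utf8 = describe(ch)
    hex_bytes = " ".join(f"{b:02x}" for b in utf8)
    print(f"{ch!r}: U+{code_point:04X} -> {hex_bytes}")

# For characters in the ASCII range, ASCII and UTF-8 produce identical bytes.
same_in_ascii = "A".encode("ascii") == "A".encode("utf-8")
```

Characters outside the ASCII range take two or more bytes in UTF-8, so the number of characters and the number of bytes can differ.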
Building on the first part of the course, I describe the different ways of representing textual data in both Python 2 and Python 3. This is particularly important to know if you're migrating code between Python 2 and Python 3.
This lecture introduces a number of useful utility functions that can be used to investigate and analyse strings in Python, which can really help you understand what's happening with the encoding process.
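The exact utilities covered in the lecture are not listed here, so the following is only a guess at the kind of tooling involved. It is a Python 3 sketch using built-ins and the standard unicodedata module, and inspect_text is a hypothetical helper name.

```python
# Minimal sketch, assuming Python 3. inspect_text() is a hypothetical helper.
import unicodedata

def inspect_text(text):
    """Return (character, code point, official Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

sample = "ça va"
for row in inspect_text(sample):
    print(row)

# Comparing lengths shows the difference between characters and encoded bytes.
char_count = len(sample)
byte_count = len(sample.encode("utf-8"))
print(type(sample), char_count, byte_count)
```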
This lecture teaches you how to confidently move between byte strings and Unicode strings in both major versions of Python. I discuss some general strategies for converting strings between the different encoded and unencoded formats.
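One widely used strategy, which may or may not match the one taught in the lecture, is to decode bytes as soon as they enter the program and encode text only when it leaves. A Python 3 sketch with two hypothetical helpers, to_text and to_bytes, assuming UTF-8 as the default encoding:

```python
# Minimal sketch, assuming Python 3 and UTF-8 as the default encoding.
# to_text() and to_bytes() are hypothetical helper names.
def to_text(value, encoding="utf-8"):
    """Return a Unicode string, decoding only if given bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def to_bytes(value, encoding="utf-8"):
    """Return a byte string, encoding only if given text."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value

incoming = b"\xc3\xa7a va"     # bytes arriving from a file or socket
text = to_text(incoming)       # work with Unicode text inside the program
outgoing = to_bytes(text.upper())  # encode again only on the way out
```

In Python 2 the same idea applies, with unicode playing the role of str and str playing the role of bytes.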
This lecture covers reading and writing UTF-8 encoded files, and provides some general rules for dealing with encoded text to make sure you always get it right. The example code I use is available on my GitHub repository, linked in the resources section of the first lecture.
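As a sketch of the basic pattern (not the course's own example code), the snippet below writes and reads a UTF-8 file in Python 3 with the encoding stated explicitly. The file name is made up; in Python 2, io.open accepts the same encoding argument.

```python
# Minimal sketch, assuming Python 3. The file name is a placeholder.
import os
import tempfile

def write_text(path, text):
    # Name the encoding explicitly instead of relying on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

def read_text(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "greeting.txt")
    write_text(path, "ça va")
    round_tripped = read_text(path)
    # Opening in binary mode shows the bytes that were actually stored on disk.
    with open(path, "rb") as f:
        stored_bytes = f.read()

print(round_tripped, stored_bytes)
```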
I wrap up the course with some points about handling Unicode characters in Python source files and cover a few common errors such as the well-known UnicodeDecodeError.
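For reference, here is one common way a UnicodeDecodeError arises, sketched in Python 3: UTF-8 bytes are decoded with a codec that cannot represent them. The sample text is my own and the lecture may use different cases.

```python
# Minimal sketch, assuming Python 3.
data = "ça va".encode("utf-8")  # b'\xc3\xa7a va'

try:
    data.decode("ascii")        # ASCII only covers byte values 0 to 127
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_byte = exc.object[exc.start]  # the first byte that could not be decoded
    print(exc)

# The real fix is to decode with the encoding the bytes were written in.
correct = data.decode("utf-8")

# errors="replace" avoids the exception but loses information.
lossy = data.decode("ascii", errors="replace")
```

Regarding source files: Python 3 reads source code as UTF-8 by default, while Python 2 assumes ASCII unless the file starts with a declaration such as # -*- coding: utf-8 -*-.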
Jordan has 6 years of professional software development experience across a number of industries and enjoys programming in a variety of languages, including Python, Java, Lisp and C.
He recently became a Director of a tech start-up, Zumatech Ltd, and works on projects of all types and scales, from enterprise-scale web applications to small throwaway utility scripts. The most important thing for Jordan is to write high-quality code and have fun doing so. A self-confessed geek, he has recently been getting a kick out of writing his own operating system and learning x86 assembly for his own amusement.
He believes that programming is a creative and dynamic process, more akin to composing a piece of music or creating a work of art than calculating formulas in a spreadsheet. He feels that creating elegant software solutions is something most people can enjoy if given the correct instruction and guidance, and that programming is not just for the mathematically or logically inclined.
He lives in the beautiful city of Brighton, on the South Coast of England, which is also known as “Silicon Beach”, and spends most of his time there hiding from the sun and sampling the excellent beers on offer across the many pubs and bars in the city.