Thursday, October 06, 2016

How to retrieve photo metadata in Python

You'd think it would be easy to retrieve and even edit photo metadata. After all, we are living in the 21st century. But, no, some things prove more difficult than they should. A search for applications turned up quite a few that would display the metadata but none that would easily edit it.

OK, so there's always the programmatic approach. And for that I turn to Python. Let's see what the state of the art holds for us. (Hint: It's a bumpy ride.)


First, a couple of constraints. I develop on a Windows 10 machine, largely because that's the same computer all my other goodies are on. Yes, Linux might be better for development, but not for desktop use. (Cue old debate.)

Second, because I am indeed living in this century, I prefer to use Python 3, the version that broke backwards compatibility. It's been around for eight years, so it is not exactly new.

What is metadata?
Metadata is simply a set of tagged values, mostly strings and numbers, stored in an image file. These values are carried along with the image, and can identify the author, camera characteristics, copyright information, and so on. There are two main metadata standards.

Exif, the Exchangeable image file format, has been around since 1995. It works with media such as WAV sound files, TIFF images, and JPG.

IPTC, defined by the International Press Telecommunications Council, is designed to standardise data for news gathering and journalism. There are two main parts, IPTC Core and IPTC Extension.

The remainder of this article will investigate methods of reading this data, in Python.
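
To make "stored in an image file" a little more concrete, here is a minimal sketch, using only the standard library, of where these blocks sit inside a JPEG. It assumes the usual layout in which Exif travels in an APP1 segment and IPTC data in a Photoshop-style APP13 segment. The function name list_metadata_segments is made up for illustration, and the scanner ignores many corner cases (padding bytes, XMP, files that are not JPEG).

```python
import struct

def list_metadata_segments(data):
    """Return labels for the metadata segments found in JPEG bytes."""
    if data[:2] != b'\xff\xd8':
        raise ValueError('not a JPEG stream (missing SOI marker)')
    found = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost sync with the segment structure; give up
        marker = data[pos + 1]
        if marker == 0xDA:
            break  # start of scan: compressed image data follows
        # the 2-byte big-endian length counts itself plus the payload
        (length,) = struct.unpack('>H', data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b'Exif\x00\x00'):
            found.append('Exif')
        elif marker == 0xED and payload.startswith(b'Photoshop 3.0\x00'):
            found.append('Photoshop/IPTC')
        pos += 2 + length
    return found
```

Reading a file in binary mode and passing the bytes in, for example list_metadata_segments(open(fn, 'rb').read()), should show which of the two standards a given photo carries, which is handy when a library reports less than you expected.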

Take a PIL
When we think of images and Python we think first of the Python Imaging Library (PIL). Or, rather, its more current fork, named Pillow.

You can install this useful library using the simple mechanism of typing at the command line:
pip install Pillow
This works across platforms.

In fact, if pip fails, I usually give up right away. Not because there aren't other ways to install. But if pip fails, it is a good indication that the library is not well maintained. As we shall see.

In any case, here's my test code. It relies on the fact I have defined a path to a good test file.

fn = 'path/to/some/file/tester.jpg'

def test_PIL():
    # test PIL
    from PIL import Image
    from PIL.ExifTags import TAGS
    print( '\n<< Test of PIL >> \n' )

    img = Image.open(fn)
    # _getexif() returns None when the file carries no Exif block
    info = img._getexif() or {}
    for k, v in info.items():
        nice = TAGS.get(k, k)
        print( '%s (%s) = %s' % (nice, k, v) )

Interrogating the image for Exif information returns a dictionary. We can iterate over this to see all the meta-tags. In this case a useful TAGS dictionary converts the numeric keys to English equivalents. So, instead of wondering what tag 315 means, we know that it is "artist".
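
Many of the values in that dictionary come back as (numerator, denominator) pairs rather than plain numbers, at least with the Pillow version used for this test (other releases may hand back a rational type instead, so treat that as an assumption to check). A small sketch, with a hypothetical helper name, converts such a pair using the standard library's fractions module:

```python
from fractions import Fraction

def to_fraction(pair):
    """Convert an Exif (numerator, denominator) pair to a Fraction.

    Returns None for a zero denominator, which does occur in real
    files and would otherwise raise ZeroDivisionError.
    """
    numerator, denominator = pair
    if denominator == 0:
        return None
    return Fraction(numerator, denominator)

exposure = to_fraction((1, 1000))          # ExposureTime from the test file
shutter = to_fraction((9965784, 1000000))  # ShutterSpeedValue from the test file
print(exposure, float(shutter))
```

Fraction reduces the pair to lowest terms, so the shutter value prints in the same reduced form that exifread shows further down.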

Unfortunately, with my test data I noticed problems. (My programme output is at the bottom of this post, for convenience.) First, the "copyright" field contained scrambled text. Second, the "comment" field did not show up at all. This could perhaps be because Pillow reports only Exif and not IPTC. In any case, it is insufficient and unreliable.


Some dead ends

At this point I did a web search and came up with several likely candidates. But they soon proved frustrating.

The library pyexiv2 is deprecated in favour of GExiv2, which is part of the GNOME set and hence has neither a Windows installer nor any easy way to compile.

IPTCInfo is recommended in certain blog articles, like this one, which is already outdated though only four years old.

The automatic install for IPTCInfo failed, so I checked and discovered that the last code update was back in 2011. As a single module, it was easy enough to install manually. But then I discovered that it was not at all Python 3 compatible. My attempts to change the code manually ended in failure.


A piece of the Piexif

Piexif has been tested across platforms and has no dependencies. The documentation is a bit terse, but helpfully indicates that the main "load" function returns several dictionaries, plus a byte dump that forms a thumbnail. I wrote my code to avoid this.

def test_piexif():
    # test Piexif
    import piexif
    print( '\n<< Test of Piexif >>' )

    data = piexif.load(fn)
    for key in ['Exif', '0th', '1st', 'GPS', 'Interop']:
        subdata = data[key]
        print( '\n%s:' % key )
        for k, v in subdata.items():
            print( '%s = %s' % (k, v) )

I really don't know what "0th" and "1st" mean as dictionary names, but it does appear that I get out all of the meta tags I expect. In particular, the tag marked 37510 contains my comment.

Like PIL, this library has a dictionary to map the obscure codes to names. I thought I should interrogate this.

def test_piexif_inspect():
    # display all metadata names
    import piexif
    print( '\n<< Inspect piexif >>\n' )

    # skip dunder attributes such as __module__ and __doc__
    info = vars(piexif.ImageIFD)
    lines = ['%s = %s' % (v, k) for k, v in info.items() if not k.startswith('_')]
    lines.sort()
    for item in lines:
        print(item)

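The result is missing a mapping for tag 37510, the very one I want to use! It turns out that ImageIFD covers only the general image tags; Exif-specific tags such as 37510 (UserComment) are listed separately in piexif.ExifIFD.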

OK, not such a big deal in this case. But what if I start using other tags and have to decipher the codes manually? Rather annoying.
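
One way around the manual deciphering is to build a single code-to-name table from several constant namespaces at once. The sketch below uses two small stand-in classes rather than piexif itself, so it runs with no third-party install. The class names mirror piexif's ImageIFD and ExifIFD, but their contents here are a hand-picked subset, and build_tag_names is an illustrative helper, not a piexif function.

```python
class ImageIFD:   # stand-in: a few general image tags
    Artist = 315
    Copyright = 33432

class ExifIFD:    # stand-in: a few Exif-specific tags
    ExposureTime = 33434
    UserComment = 37510

def build_tag_names(*namespaces):
    """Map numeric tag codes to names across several constant classes."""
    names = {}
    for ns in namespaces:
        for name, code in vars(ns).items():
            # skip dunder attributes and anything that is not a tag number
            if not name.startswith('_') and isinstance(code, int):
                names[code] = name
    return names

tag_names = build_tag_names(ImageIFD, ExifIFD)
print(tag_names.get(37510, 'unknown'))   # UserComment
```

With the real library a merged table needs some care, since GPS and Interop tag numbers start from small values and overlap; keeping one table per group would be the safer design there.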

You will also notice an odd encoding problem. Rather than contain my comment as is, the tag reads...
b'ASCII\x00\x00\x00MY TEST COMMENT!'
The b marks the value as a bytes object rather than a text string, a distinction that Python 3 enforces. The smart thing to do is decode this to text, but then we have the prefix cruft: the first eight bytes are a character-code marker that the Exif standard places in front of the comment.

The following will do the trick, but again I dislike the arbitrary nature of this decoding.

def test_piexif_use():
    import piexif
    print( '\n<< Usage of piexif >>' )
    data = piexif.load(fn)
    exif = data['Exif']
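    # the default must be bytes, since str has no decode() in Python 3
    comment = exif.get(37510, b'')
    # drop the 8-byte character-code prefix, then decode the rest
    comment = comment[8:].decode('UTF-8')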
    print( comment )
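
A slightly less arbitrary approach is to inspect the eight-byte prefix rather than blindly discard it. What follows is a hedged sketch using only the standard library: decode_user_comment is a hypothetical helper, the choice of UTF-16 for the UNICODE marker is an assumption (writers differ on byte order), and the JIS marker is not handled specially.

```python
def decode_user_comment(raw):
    """Decode an Exif UserComment value (bytes) into text."""
    prefix, body = raw[:8], raw[8:]
    if prefix == b'ASCII\x00\x00\x00':
        return body.decode('ascii', errors='replace')
    if prefix == b'UNICODE\x00':
        # assumption: UTF-16; a BOM, if present, settles the byte order,
        # otherwise Python falls back to the platform's native order
        return body.decode('utf-16', errors='replace')
    if prefix == b'\x00' * 8:
        # undefined character code: fall back to a lenient guess
        return body.decode('utf-8', errors='replace')
    # unknown or missing marker (including JIS): treat the whole value as text
    return raw.decode('utf-8', errors='replace')

print(decode_user_comment(b'ASCII\x00\x00\x00MY TEST COMMENT!'))  # MY TEST COMMENT!
```

The errors='replace' argument keeps a malformed comment from raising an exception halfway through a batch of photos, at the cost of showing replacement characters.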


Try exifread

Finally, I stumbled upon the library exifread.

Here again is my test script. As before, I skip past some tags that are going to be long boring byte strings. And I progress in sorted order, just for convenience.

def test_exifread():
    import exifread
    print( '\n<< Test of exifread >>\n' )

    with open(fn, 'rb') as f:
        exif = exifread.process_file(f)

    for k in sorted(exif.keys()):
        if k not in ['JPEGThumbnail', 'TIFFThumbnail', 'Filename', 'EXIF MakerNote']:
            print( '%s = %s' % (k, exif[k]) )

The result? All of the tags I expect are present, in human-readable encoding. It seems that this obscure project is the winner. Some of the more popular libraries need to do some catching up!

Though, one big limitation exists even here. This library does not support editing the tags. For that, you will need to use one of the previous choices and work around the cruft.

Nonetheless, I hope this article saves you the time I unfortunately spent.


Output

Here follows my test output, for reference:

<< Test of PIL >>

ExifVersion (36864) = b'0230'
ShutterSpeedValue (37377) = (9965784, 1000000)
ExifImageWidth (40962) = 600
DateTimeOriginal (36867) = 2011:06:09 01:20:59
DateTimeDigitized (36868) = 2011:06:09 01:20:59
MaxApertureValue (37381) = (0, 256)
SceneCaptureType (41990) = 0
MeteringMode (37383) = 5
LightSource (37384) = 0
Flash (37385) = 24
FocalLength (37386) = (77, 1)
CFAPattern (41730) = b'\x02\x00\x02\x00\x00\x01\x01\x02'
Make (271) = OLYMPUS IMAGING CORP.
Model (272) = E-P1
Orientation (274) = 1
ExifImageHeight (40963) = 600
Contrast (41992) = 0
Copyright (33432) = Robin Parmar  mar
ExposureBiasValue (37380) = (-3, 10)
XResolution (282) = (720000, 10000)
YResolution (283) = (720000, 10000)
ExposureTime (33434) = (1, 1000)
DigitalZoomRatio (41988) = (100, 100)
FocalLengthIn35mmFilm (41989) = 116
ExposureProgram (34850) = 3
ColorSpace (40961) = 65535
BodySerialNumber (42033) = H52502123
ResolutionUnit (296) = 2
WhiteBalance (41987) = 0
GainControl (41991) = 1
Software (305) = Adobe Photoshop CS5 Windows
DateTime (306) = 2011:08:22 21:39:05
LensMake (42035) = Pentax
LensModel (42036) = smc Pentax F A77 Limited
Saturation (41993) = 0
Artist (315) = Robin Parmar
Sharpness (41994) = 0
FileSource (41728) = b'\x03'
CustomRendered (41985) = 0
ExposureMode (41986) = 1
ExifOffset (34665) = 268
ISOSpeedRatings (34855) = 200

<< Test of Piexif >>

Exif:
36864 = b'0230'
37377 = (9965784, 1000000)
40962 = 600
36867 = b'2011:06:09 01:20:59'
36868 = b'2011:06:09 01:20:59'
37381 = (0, 256)
37510 = b'ASCII\x00\x00\x00MY TEST COMMENT!'
37383 = 5
37384 = 0
37385 = 24
37386 = (77, 1)
41988 = (100, 100)
41986 = 1
40963 = 600
37380 = (-3, 10)
41730 = b'\x02\x00\x02\x00\x00\x01\x01\x02'
33434 = (1, 1000)
41728 = b'\x03'
41989 = 116
34850 = 3
42033 = b'H52502123'
40961 = 65535
41990 = 0
34855 = 200
41987 = 0
41991 = 1
41992 = 0
42035 = b'Pentax'
42036 = b'smc Pentax F A77 Limited'
41993 = 0
41994 = 0
41985 = 0

0th:
283 = (720000, 10000)
296 = 2
34665 = 11444
306 = b'2011:08:22 21:39:05'
270 = b''
271 = b'OLYMPUS IMAGING CORP.'
272 = b'E-P1'
305 = b'Adobe Photoshop CS5 Windows'
274 = 1
33432 = b'Robin Parmar'
282 = (720000, 10000)
315 = b'Robin Parmar'

1st:
513 = 878
514 = 10416
259 = 6
296 = 2
282 = (72, 1)
283 = (72, 1)

GPS:

Interop:

<< Test of exifread >>

EXIF BodySerialNumber = H52502123
EXIF CVAPattern = [2, 0, 2, 0, 0, 1, 1, 2]
EXIF ColorSpace = Uncalibrated
EXIF Contrast = Normal
EXIF CustomRendered = Normal
EXIF DateTimeDigitized = 2011:06:09 01:20:59
EXIF DateTimeOriginal = 2011:06:09 01:20:59
EXIF DigitalZoomRatio = 1
EXIF ExifImageLength = 600
EXIF ExifImageWidth = 600
EXIF ExifVersion = 0230
EXIF ExposureBiasValue = -3/10
EXIF ExposureMode = Manual Exposure
EXIF ExposureProgram = Aperture Priority
EXIF ExposureTime = 1/1000
EXIF FileSource = Digital Camera
EXIF Flash = Flash did not fire, auto mode
EXIF FocalLength = 77
EXIF FocalLengthIn35mmFilm = 116
EXIF GainControl = Low gain up
EXIF ISOSpeedRatings = 200
EXIF LensMake = Pentax
EXIF LensModel = smc Pentax F A77 Limited
EXIF LightSource = Unknown
EXIF MaxApertureValue = 0
EXIF MeteringMode = Pattern
EXIF Saturation = Normal
EXIF SceneCaptureType = Standard
EXIF Sharpness = Normal
EXIF ShutterSpeedValue = 1245723/125000
EXIF UserComment = MY TEST COMMENT!
EXIF WhiteBalance = Auto
Image Artist = Robin Parmar
Image Copyright = Robin Parmar
Image DateTime = 2011:08:22 21:39:05
Image ExifOffset = 11444
Image ImageDescription =
Image Make = OLYMPUS IMAGING CORP.
Image Model = E-P1
Image Orientation = Horizontal (normal)
Image ResolutionUnit = Pixels/Inch
Image Software = Adobe Photoshop CS5 Windows
Image XResolution = 72
Image YResolution = 72
Thumbnail Compression = JPEG (old-style)
Thumbnail JPEGInterchangeFormat = 878
Thumbnail JPEGInterchangeFormatLength = 10416
Thumbnail ResolutionUnit = Pixels/Inch
Thumbnail XResolution = 72
Thumbnail YResolution = 72


5 comments:

A. Jesse Jiryu Davis said...

Thanks for writing this! When I needed this I relied on ImageMagick's "identify" command-line tool - I call it with subprocess from Python and parse its output. It's definitely surprising how painful basic image information is with Python in the modern era.

robin said...

Ah yes, a good method as well. I was going to add an example like that using ExifTool, the Perl library.

My next article will be quite positive. I think that the ecosystem around Python leaves me rather spoiled, so when I find a gap, I am amazed.

robin said...

Code has been added here for your convenience:

https://gist.github.com/robinparmar/2e19037e728b6783769598c9e62f4f3b

Anonymous said...

You can find GExiv2 for Windows in this package:
https://wiki.gnome.org/action/show/Projects/PyGObject
I have not tried it, only the old version (pyexiv2).

robin said...

Thanks for that info, which I am sure will help some readers.
