Discrepancies in extracted photo-id images from dumps

Stefan Claas sac at 300baud.de
Sat Jan 19 17:47:42 CET 2019


On Sat, 19 Jan 2019 11:23:33 -0500, Daniel Kahn Gillmor wrote:
> On Sat 2019-01-19 17:10:38 +0100, Stefan Claas wrote:
> > Now i wonder why i have such high discrepancies in the numbers?  
> 
> jpegextractor looks like it uses a simple heuristic to find jpegs.
> 
> in particular (quoting from
> https://www.digiater.nl/openvms/decus/vmslt02a/net/jpeg-extractor.html):
> 
>      jpegextractor uses the fact that valid binary JPEG streams start
>      with the byte sequence ff d8 ff and end with the byte sequence ff
>      d9. It copies all of those streams to new files. As jpegextractor
>      simply looks for the two sequences it does not have to know the
>      format of the encapsulating file and thus works with all formats
>      that embed JPEG streams.
> 
> consider that a lot of OpenPGP key material is high-entropy -- public
> keys, cryptographic signatures, etc are all essentially random bytes.
> hand-wavy approximations follow, i'd be happy if someone wants to make
> them more rigorous.
> 
> If we look at triplets of consecutive octets, any specific triplet
> should appear roughly once every 2^(8*3) == 16777216 triplets, and
> any specific pair of octets will appear roughly once every 2^(8*2) ==
> 65536 pairs.
> 
> 
> So about every 16 million octets of high-entropy data, you'll find that
> starting "ff d8 ff" triplet, and much more frequently you'll find the
> ending "ff d9" pair.
> 
> So assuming that the bulk of a 1GiB dump is high-entropy data *with no
> actual JPEGs in it at all*, you should expect to see jpegextractor have
> roughly 1G/16M == 64 false positive matches.
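
The arithmetic in the quoted estimate can be checked with a few lines of
Python. This is only a sketch under the same hand-wavy assumption as above
(the dump is treated as uniformly random bytes, which a real keyserver dump
is not); the helper name expected_matches is made up for illustration.

```python
# Rough estimate only: assumes every byte is uniformly random and
# independent, which real key dumps are not.

def expected_matches(n_bytes: int, pattern_len: int) -> float:
    """Expected occurrences of one fixed byte pattern in n_bytes of random data."""
    # Number of offsets at which a pattern of this length could start.
    positions = max(n_bytes - pattern_len + 1, 0)
    # Each offset matches with probability 1 / 256**pattern_len.
    return positions / 256 ** pattern_len

GIB = 2 ** 30
starts = expected_matches(GIB, 3)  # start sequence "ff d8 ff"
ends = expected_matches(GIB, 2)    # end sequence "ff d9"

print(f"expected start sequences per GiB: {starts:.1f}")  # about 64
print(f"expected end sequences per GiB:   {ends:.1f}")    # about 16384
```

Under that assumption the number of start sequences is what limits the
count of false positives, since an end sequence turns up far more often.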

Thank you very much for your explanation, much appreciated!

> that doesn't quite add up to the number of extras that you're seeing
> from jpegextractor, but it suggests that there will be a large number of
> false positives by that mechanism at any rate.
> 
> have you tried looking at the jpegs that jpegextractor produces?

Yes, i scrolled through half of them and you are correct, there are
some false positives among them. But not so many that they would
explain the difference in the numbers.
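
For reference, the heuristic as the quoted description states it (find
"ff d8 ff", then the next "ff d9") can be sketched in Python. This is not
jpegextractor's actual code, only an illustration of the described
matching; the function name find_candidates and the sample bytes are made
up for the example.

```python
from typing import List, Tuple

SOI = b"\xff\xd8\xff"  # start sequence from the quoted description
EOI = b"\xff\xd9"      # end sequence from the quoted description

def find_candidates(data: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of spans that begin with SOI and
    run up to and including the next EOI."""
    spans = []
    pos = 0
    while True:
        start = data.find(SOI, pos)
        if start == -1:
            break
        end = data.find(EOI, start + len(SOI))
        if end == -1:
            break  # start sequence with no end sequence: not a candidate
        end += len(EOI)
        spans.append((start, end))
        pos = end  # continue searching after this span
    return spans

# Made-up sample: one complete span, then a start sequence with no end.
sample = b"junk" + SOI + b"abc" + EOI + b"more" + SOI + b"no end"
print(find_candidates(sample))  # [(4, 12)]
```

A matcher like this counts every such span, whether or not the bytes in
between decode as an image, so its total can differ from the number of
photo IDs that a tool parsing the OpenPGP packets reports.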

Could it be that my old iMac (ten years old) is too slow for this
task with GnuPG and cat, so that my rig can't keep up with parsing
and then writing to disk etc.?

To be sure, i also tried parsing only a single dump (keydump-sks-0000),
and there too another tool found more images than GnuPG showed me.

Regards
Stefan




More information about the Gnupg-users mailing list