Proofreadable base64 (was Re: Printing Keys and using OCR.)

Peter S. May me at psmay.com
Tue May 29 16:24:26 CEST 2007


Casey Jones wrote:
> That's a clever way of dramatically increasing the "uniqueness" of each 
> character to reduce the ambiguity of the OCR. It would be useful for 
> both error detection and error correction. If it could be integrated 
> into the OCR engine itself, it would be even more effective. Although 
> Gallager or Turbo Codes would give much better error correction for a 
> given storage space, your method would be way easier to implement.
> 
> I'm leaning strongly against base64. There are just too many characters 
> that can be easily confused. Base32 would be nearly as dense (5 bits 
> instead of 6, per char) and would allow many tough characters to be left 
> out. A simple conversion chart for base32 chars could take up just one 
> line at the bottom of the page. The conversion to base32 and back would 
> be very easy. Selecting the unambiguous 32 characters to use as the 
> symbol set would require some care. Maybe some testing to find out which 
> symbols the OCR programs get wrong most often.

Information density isn't the goal here.  My general strategy, to lay
out my context, is to encrypt my big .tar nightlies and offsite
them--the survivability of the media the big stuff is on is effectively
someone else's problem.  (Not perfect, but good enough, and if you keep
everything redundant, there's no real issue.)  But you can't reasonably
offsite the private key in the same way...otherwise, how do you open
everything when the time comes?  Via the system I've concocted,
secring.gpg can be printed in under 300 lines.  I peg that at around 4
one-sided pages of recoverable text--a small price to pay to maintain
control of a key.

Actually, the draw of this idea as far as I'm concerned is that it's
highly transparent:  I'm very interested in ideas like PDF417 and QR,
but there's a lot of support software involved that might not be so
readily available--or compilable--in a pinch.  Base64, on the other
hand, fits in my head with very little effort.  This means that, even in
the outright absence of software that will actually handle base64, I
could MacGyver something up without too much trouble in nearly any
programming language that makes sense (I'm generally YAPH, but I've been
messing with awk a lot lately, considering that it's ubiquitous on any
platform with an X in its name.  But b64 is simple enough to do in C, or
even VB if you must, or perhaps INTERCAL/brainf*ck/... if you enjoy an
insane challenge).
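
To illustrate how little there is to remember, here is a minimal sketch
of a from-scratch base64 encoder and decoder in Python.  It is not the
awk or Perl code discussed in this thread; the names ALPHABET,
b64_encode and b64_decode are made up for the example, and the alphabet
is the standard one (A-Z, a-z, 0-9, +, /) with = as padding.

```python
# Standard base64 alphabet: index 0..63 maps to one printable symbol.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def b64_encode(data: bytes) -> str:
    """Encode bytes as base64 text (hypothetical helper, no line wrapping)."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)
        # Pack up to three bytes into one 24-bit number, zero-filling the tail.
        n = int.from_bytes(chunk + b"\x00" * pad, "big")
        # Split into four 6-bit indices, most significant first.
        quad = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Symbols that would encode only filler bytes become '='.
        if pad:
            quad[4 - pad:] = "=" * pad
        out.append("".join(quad))
    return "".join(out)


def b64_decode(text: str) -> bytes:
    """Decode base64 text, ignoring whitespace and line breaks."""
    text = "".join(text.split())
    out = bytearray()
    for i in range(0, len(text), 4):
        quad = text[i:i + 4]
        pad = quad.count("=")
        n = 0
        for ch in quad:
            # Each symbol contributes six bits; '=' contributes zeros.
            n = (n << 6) | (0 if ch == "=" else ALPHABET.index(ch))
        # Drop the filler bytes that the padding stood in for.
        out += n.to_bytes(3, "big")[:3 - pad]
    return bytes(out)
```

The whole scheme is "three bytes in, four symbols out", which is why it
ports so easily to whatever language happens to be at hand.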

It must be noted that there's often a much easier way, though--base64
can be jimmied into a .eml-format file by using a mail client to create
an e-mail with a dummy attachment, then changing the contents with a
text editor and re-opening.  (This trick has actually gotten me through
some jams before!)  It also helps that base64 is ubiquitous; there's
almost certainly an implementation already on your machine.
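
The same trick can be scripted.  The sketch below wraps base64 text in a
minimal MIME message and lets Python's standard email package do the
decoding, much as a mail client would when opening the edited .eml file.
The function name decode_via_eml and the default filename are
assumptions made up for the example.

```python
from email import message_from_string


def decode_via_eml(b64_text: str, filename: str = "recovered.bin") -> bytes:
    """Decode base64 by presenting it as a single-part MIME attachment."""
    eml = (
        "MIME-Version: 1.0\n"
        "Content-Type: application/octet-stream\n"
        "Content-Transfer-Encoding: base64\n"
        f'Content-Disposition: attachment; filename="{filename}"\n'
        "\n"                      # blank line separates headers from body
        + b64_text + "\n"
    )
    msg = message_from_string(eml)
    # decode=True undoes the Content-Transfer-Encoding and returns bytes.
    return msg.get_payload(decode=True)
```

Saving the eml string to a file instead of parsing it would give
something a mail client can open directly.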

Getting base64 data into a machine isn't trivial, but it can again be
argued that you have most or all of what you need at any workstation
(unless you're blind, but even then it's not out of the question).
Barcodes and data matrix standards may wax and wane, but we can
hopefully agree that OCR isn't leaving anytime soon.  Besides, even if
by some freak accident OCR were to drop off the face of the Earth, there
are still human eyes, human minds, and possibly even administrative
assistants willing to take dictation. ;-)

The translated digraph base64 in the third column would probably be easy
enough to figure out even without the translation key via some simple
"cryptanalysis" (I'm not suggesting the tr step is a cipher, but it does
act like one); if the message is clear enough to be human readable, it
itself would provide a more or less complete ciphertext-to-plaintext
mapping.

I haven't done a great deal of research into how valuable the mapping I
chose (tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z') actually is, but it's not an
entirely random choice.  In particular, it makes sure that A-Z and a-z
aren't adjacent, so that, for example, S and s don't map to an
equivalently similar upper/lower-case pair.  It probably merits more
investigation, but I want to implement the original thing first and do
some live testing to verify that there's even any problem to correct.
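
For reference, the tr mapping above can be spelled out in Python as a
sketch.  The ranges are expanded exactly as tr would expand them; the
names SRC, DST, FORWARD, INVERSE, translate and untranslate are
hypothetical.

```python
import string

# Expanded form of: tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
SRC = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/="
DST = string.digits + string.ascii_uppercase + "+/=" + string.ascii_lowercase

FORWARD = str.maketrans(SRC, DST)   # base64 symbol -> translated symbol
INVERSE = str.maketrans(DST, SRC)   # translated symbol -> base64 symbol


def translate(b64_text: str) -> str:
    """Apply the tr step to a string of base64 symbols."""
    return b64_text.translate(FORWARD)


def untranslate(text: str) -> str:
    """Undo the tr step."""
    return text.translate(INVERSE)
```

With this table S becomes I while s becomes f, so a pair that differs
only by case on the base64 side ends up as two unrelated glyphs on the
translated side.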

Probably the only complicated part is the CRC-24; you might have to be
slightly hardcore to memorize the XORing polynomial involved (though the
rest isn't that hard; I'm just not the "digits of pi" type).  But that's
mostly a tool for auto-correction anyway; you could get a long way with
just the first and third columns.
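
As a sketch of what has to be memorized, here is a bitwise CRC-24 in
Python.  It assumes the CRC-24 meant here is the one used by OpenPGP
ASCII armor (RFC 4880, section 6.1), with initial value 0xB704CE and
polynomial 0x1864CFB; the function name crc24 is made up for the
example.

```python
CRC24_INIT = 0xB704CE    # starting value from RFC 4880, section 6.1
CRC24_POLY = 0x1864CFB   # generator polynomial, including the x^24 term


def crc24(data: bytes) -> int:
    """Return the 24-bit OpenPGP-style CRC of data (assumed variant)."""
    crc = CRC24_INIT
    for byte in data:
        # Feed the next byte into the top eight bits of the register.
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            # If a bit fell off the 24-bit register, fold in the polynomial.
            if crc & 0x1000000:
                crc ^= CRC24_POLY
    return crc & 0xFFFFFF
```

Apart from the two constants, the loop is only shift, test and XOR.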

By the way, last night I decided to try to implement CRC-24 in awk.  It
seems to have worked.  It's not terribly efficient; I tried to stick to
POSIX rules for portability, and POSIX awk has no XOR operator.
Implementing XOR using substr is a rather humorous farce, I must say...
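
For comparison, XOR can also be built from nothing but division and
remainder, which are available in POSIX awk.  The sketch below is in
Python for readability and is not the substr approach described above;
xor_arith is a hypothetical name, and it assumes non-negative integers.

```python
def xor_arith(a: int, b: int) -> int:
    """XOR two non-negative integers using only arithmetic operators."""
    result = 0
    place = 1                      # value of the bit position being examined
    while a > 0 or b > 0:
        # The lowest bit of each operand is its remainder modulo 2.
        if a % 2 != b % 2:
            result += place        # bits differ, so the XOR bit is 1
        a //= 2                    # shift both operands right by one bit
        b //= 2
        place *= 2
    return result
```

It costs one loop iteration per bit, so it is slow in the same way the
awk version is, but it needs no bitwise operators at all.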

So, long and short, stay tuned.  I'm close to a first implementation and
test messages will be passed. :-)

Thanks
PSM


More information about the Gnupg-users mailing list