24. Practical: Parsing Binary Files


Nine developers out of ten, when faced with the need for a structured data format, will form the basis of the protocol around XML. Very few developers will actually stop and consider alternative solutions.

However, unlike XML, the design focus of bencode was not to produce verbose, human-readable documents, but rather to encode data in the most concise manner possible. To that end, the core bencode specification only includes four data types: two simple and two composite structures. Unfortunately, outside of applications like BitTorrent, this elegant binary format has seen remarkably little adoption. Because of this state of affairs, it can be extremely difficult to find libraries to actually process bencode data.
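For reference, bencode's two simple types and their on-disk forms can be sketched in a few lines of Java; the helper names here are illustrative, not part of any particular library:

```java
// Illustrative encoders for bencode's two simple types. The composite
// types just wrap encoded children: lists are l<items>e and
// dictionaries are d<key-value pairs>e with keys in sorted order.
public class BencodeSketch {
    // Integers are written as i<digits>e, e.g. 42 -> "i42e"
    public static String encodeInteger(long value) {
        return "i" + value + "e";
    }

    // Strings are length-prefixed: <length>:<bytes>, e.g. "spam" -> "4:spam".
    // (Real bencode counts bytes; this sketch assumes ASCII input.)
    public static String encodeString(String value) {
        return value.length() + ":" + value;
    }
}
```

So the list ["spam", 42] would come out as `l4:spami42e` followed by the closing `e`, with no whitespace anywhere.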

Not too long ago, I ran into a production use-case which required both parsing and generation of bencode-formatted files. The second hang-up was really a more significant motivator than the first, since I knew I would be dealing with bencode streams potentially gigabytes in size. The first thing I needed to do was build the generation half of the library. I decided that it would be easier if I avoided trying to use the same backend framework classes for both the generator and the parser.

For example, there are actually two classes in the framework which contain the logic for handling an integer: IntegerValue and IntegerType. The former is for use in the parser, while the latter is for use in the generator. This separation of logic may seem a little strange, but it actually simplifies things tremendously. Since I needed the functionality of bencode generation before I needed parsing, I started with that aspect of the framework.

When we think of generating output in a structured format programmatically, we naturally imagine a DOM-like tree representation (preferably framework-agnostic) which is then walked by the framework to produce the output. The major disadvantage to this approach is that it requires paging everything into memory. This works for smaller applications or situations where the data is already in memory, but for my particular use-case, it would have been disastrous. Instead, the data itself has to be lazy-loaded, using callbacks to grab the data as-needed and hold it in memory only as long as is absolutely necessary.

In a functional language, this would be done with closures (or even normal data types, in a pure-functional language). However, as we all know, Java does not support such time-saving features.

The only recourse is to use abstract classes and interfaces which can be overridden in anonymous inner-classes (as well as top-level classes) as necessary. After a bit of experimentation, the finalized hierarchy looks something like this. Logically, every type must be able to query its abstract method for data of a certain Java type (long for IntegerType, InputStream for StringType, etc.), convert this data into bencode with the appropriate meta-characters, and then write the result to a given OutputStream.
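A minimal sketch of that idea in Java might look like the following; the class names echo the ones mentioned above, but the exact signatures are my assumptions, not the library's actual API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Each generator type pulls its data lazily through an abstract method,
// converts it to bencode with the appropriate delimiters, and writes
// the result to the given OutputStream.
abstract class Type<T> {
    // Overridden (often in an anonymous inner class) to supply the data
    // only at the moment it is actually needed.
    protected abstract T getValue();

    public abstract void write(OutputStream out) throws IOException;

    // Convenience for examples: render to an ASCII string.
    public String render() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            write(bos);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return new String(bos.toByteArray(), StandardCharsets.US_ASCII);
    }
}

abstract class IntegerType extends Type<Long> {
    @Override
    public void write(OutputStream out) throws IOException {
        // getValue() is only called here, so the data stays lazy until
        // the integer is actually serialized.
        out.write(("i" + getValue() + "e").getBytes(StandardCharsets.US_ASCII));
    }
}
```

An anonymous subclass then stands in for a closure: `new IntegerType() { protected Long getValue() { return 42L; } }.render()` produces `i42e` without holding the value any longer than the write itself.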

Since this is in compliance with the bencode specification, we are safe in extracting this functionality into a superclass. This is where you should begin to notice that I have over-engineered this library to some degree. This comes back to my initial requirements. Priority one was to create a framework which was blazingly fast, but priority two was to ensure that it was extensible at the type level.

For the particular application I was interested in, I required more than just the core bencode types. Also on the agenda were proper UTF-8 strings, dates, and support for null. To accommodate all of this without too much code duplication, I knew I would have to extract a lot of the functionality into generic superclasses.

Anyway, back to the problem at hand. For one thing, the specification requires that dictionary entries be sorted by key, implying some sort of Comparable implementation for the keys.

Are we starting to see the problems with space-efficient implementations? The API may seem a little clumsy, but most of that is caused by the conniptions required to make the generator lazily pull the data, rather than paging it all into memory ahead of time. Throwing that aside, the rest of the verbosity seems to come from the need for LiteralStringType, rather than just having a StringType which could handle this for us.

Here again, there was a need for the parser to be extremely efficient, especially in terms of memory. The parser only does exactly what work you ask of it, nothing more. My initial designs for the parser attempted to follow the example set by the generator; as it turns out, this can be difficult to accomplish. I could have expanded slightly on the parser combinator concept, but monads are very clumsy to achieve in Java, which led me to rule out that option. In the end, I took a middle ground.

As before, a common superinterface sits above the entire representative hierarchy. To understand this hierarchy a little better, perhaps it would be helpful to look at the full source for Value. The resolve method is really the core of the entire parser.

The concept is that each value will be able to consume the bytes necessary to determine its own value, which is converted and returned. This is extremely convenient because it enables variant Values (such as string) to carry the logic for parsing to a specific length, rather than to the conventional e terminator.

In order to avoid clogging up memory, the return value of resolve should not be memoized (though there is nothing in the framework to prevent it). Conventionally, values which are already resolved should throw an exception if they are resolved a second time.

This prevents the framework from holding onto values which are no longer needed. Logically, a composite value is a linear collection of values, consumed one at a time. To me, that sounds a lot like a unidirectional iterator. Returning to primitive values, the resolve method for IntegerValue is worthy of note, not so much for its uniqueness, but because it is very similar to the parsing technique used in all the other values.

The i prefix itself is consumed before control flow even enters this method. This is because the prefix is required to determine the appropriate value implementation to use. Specifically, the logic to perform this determination is contained within the Parser class, which maintains a map of Value implementations and their associated prefixes.

String values have special logic associated with them, as they do not have a prefix. We start out by assuming that the integer value extends to the end of the stream, then we set about finding a premature end to the integer, at which point we break out and call it a day. Since we are moving from left to right through a base-10 integer, we must multiply the current accumulator by 10 prior to adding each new digit.
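That left-to-right accumulation can be sketched in Java as follows. Per the discussion above, the i prefix is assumed to be consumed before this method is entered; the names here are illustrative, not the library's own:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Scans digits until the 'e' terminator, multiplying the accumulator by
// 10 before adding each new digit.
class IntegerScanner {
    static long resolve(InputStream in) throws IOException {
        long acc = 0;
        boolean negative = false;
        int b = in.read();
        if (b == '-') {          // bencode allows negative integers
            negative = true;
            b = in.read();
        }
        while (b != 'e' && b != -1) {
            acc = acc * 10 + (b - '0');
            b = in.read();
        }
        return negative ? -acc : acc;
    }

    // Convenience overload for in-memory data.
    static long resolve(byte[] data) {
        try {
            return resolve(new ByteArrayInputStream(data));
        } catch (IOException impossible) {
            throw new AssertionError(impossible);
        }
    }
}
```

For instance, given the remaining bytes `42e`, the scanner accumulates 4, then 4 * 10 + 2 = 42, then stops at the terminator.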

Actually, the real heart of the parser framework is CompositeValue. This class is inherited by Parser to define a special value encompassing the stream itself, which is viewed as a composite value with no delimiters and only a single child. This unification allows us to keep the code for parsing a bencode stream in a single location.

This implementation is a little less concise than the code for parsing an integer, but it follows the same pattern and is fairly instructive. It seems a bit imposing, but really this is more of the same logic we saw previously when dealing with integers.

The only value type which really gives us trouble here is string. For this reason, we must assume that any unbound integer is an inclusive prefix for a string. In most parser implementations, this would require backtracking, but because we are doing this by hand, we can condense the backtrack into an inherited parameter (borrowing terminology from attribute grammars), avoiding the performance hit. Intuitively, a dictionary value should be parsed into a Java Map, or some other sort of associative data structure.

Unfortunately, a map is by definition a random-access data structure. Since we are dealing with a sequential bencode stream, the only recourse to satisfy this property would be to page the entire dictionary into memory. This of course violates one of the primary requirements, which is to avoid using more memory than necessary.

The solution I eventually chose for this problem was to limit dictionary access to sequential order, which translates to alphabetical order given the nature of bencode dictionaries. Thus, a dictionary can be parsed in the same way as a list, where each element is a sequential key and value, jointly represented by EntryValue.
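The entry-at-a-time idea can be sketched as a one-way iterator in Java. This is a deliberately simplified illustration (string values only, minimal error handling), and none of these names come from the actual library:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.AbstractMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Sequential dictionary access: each call to next() consumes exactly one
// key-value pair from the stream, so only the current entry is ever
// held in memory, never the whole dictionary.
class DictScanner implements Iterator<Map.Entry<String, String>> {
    private final InputStream in;
    private int lookahead;

    DictScanner(InputStream in) throws IOException {
        this.in = in;
        if (in.read() != 'd') throw new IOException("expected dictionary");
        lookahead = in.read();
    }

    public boolean hasNext() { return lookahead != 'e' && lookahead != -1; }

    public Map.Entry<String, String> next() {
        if (!hasNext()) throw new NoSuchElementException();
        try {
            String key = readString();
            String value = readString();
            return new AbstractMap.SimpleEntry<>(key, value);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Strings are <length>:<bytes>; lookahead already holds the first digit.
    private String readString() throws IOException {
        int len = lookahead - '0';
        int b;
        while ((b = in.read()) != ':') len = len * 10 + (b - '0');
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) sb.append((char) in.read());
        lookahead = in.read();
        return sb.toString();
    }

    // Convenience factory for in-memory data.
    static DictScanner of(byte[] data) {
        try {
            return new DictScanner(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Scanning `d3:cow3:moo4:spam4:eggse` yields the pairs (cow, moo) and (spam, eggs) in order, in constant memory.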

To make common access patterns slightly easier, EntryValue memoizes the key and value. Because both of these objects are themselves Values, this does not lead to inadvertent memory bloat. Hopefully the parser and generator presented here will be of some utility in situations where you have to parse large volumes of bencoded data. The API is admittedly bizarre and difficult to deal with, but the performance results are difficult to deny.

This framework is currently deployed in production, where benchmarks have shown that it imposes little-to-no runtime overhead, and practically zero memory overhead despite the sizeable amounts of data being processed. For convenience, I actually created a Google Code project for this framework so as to facilitate its development internally to the project I was working on.

The end result of this is that, unlike most of my experiments, there is actually a proper SVN repository from which the source may be obtained! A packaged JAR may be downloaded from the downloads section. Why not Google protocol buffers? For one thing, for this particular application I needed an implementation of bencode. For another, bencode is slightly easier to read and debug. Apache Mina is a Java library for building network protocol handlers. It might make a good framework for implementing a bencode-based server process, but when I built this, I needed to work at the file level.


Notify me of followup comments via e-mail. He currently spends most of his free java binary parser library researching parser theory and methodologies, particularly areas where the field intersects with functional language design, domain-specific languages and type theory.

He can be reached by email. If you're feeling particularly masochistic, you can follow Daniel on Twitter (djspiewak).


In this chapter I'll show you how to build a library that you can use to write code for reading and writing binary files. You'll use this library in Chapter 25 to write a parser for ID3 tags, the mechanism used to store metadata such as artist and album names in MP3 files.

This library is also an example of how to use macros to extend the language with new constructs, turning it into a special-purpose language for solving a particular problem, in this case reading and writing binary data. Because you'll develop the library a bit at a time, including several partial versions, it may seem you're writing a lot of code.

But when all is said and done, the whole library is quite short, and the longest macro is only 20 lines long. At a sufficiently low level of abstraction, all files are "binary" in the sense that they just contain a bunch of numbers encoded in binary form. However, it's customary to distinguish between text files, where all the numbers can be interpreted as characters representing human-readable text, and binary files, which contain data that, if interpreted as characters, yields nonprintable characters.

Binary file formats are usually designed to be both compact and efficient to parse--that's their main advantage over text-based formats. To meet both those criteria, they're usually composed of on-disk structures that are easily mapped to data structures that a program might use to represent the same data in memory. The library will give you an easy way to define the mapping between the on-disk structures defined by a binary file format and in-memory Lisp objects.

Using the library, it should be easy to write a program that can read a binary file, translating it into Lisp objects that you can manipulate, and then write back out to another properly formatted binary file. The starting point for reading and writing binary files is to open the file for reading or writing individual bytes; when you're dealing with binary files, you'll specify an :element-type of (unsigned-byte 8). An input stream opened with such an :element-type will return an integer between 0 and 255 each time you read a byte. Above the level of individual bytes, most binary formats use a smallish number of primitive data types--numbers encoded in various ways, textual strings, bit fields, and so on--which are then composed into more complex structures.

So your first task is to define a framework for writing code to read and write the primitive data types used by a given binary format. To take a simple example, suppose you're dealing with a binary format that uses an unsigned 16-bit integer as a primitive data type. To read such an integer, you need to read the two bytes and then combine them into a single number by multiplying one byte by 256, a.k.a. 2^8, and adding the other. For instance, assuming the binary format specifies that such 16-bit quantities are stored in big-endian form, with the most significant byte first, you can read such a number by reading the two bytes in order and doing that arithmetic as you go.
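The book's reader is written in Lisp; as a language-neutral illustration of the same arithmetic, here is the equivalent sketch in Java (the helper names are mine, not from the text):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Reads an unsigned 16-bit big-endian integer: the first byte is the
// most significant, so it is multiplied by 256 (shifted left 8 bits)
// before the second byte is added.
class U16Reader {
    static int readU16(InputStream in) throws IOException {
        int high = in.read();   // most significant byte
        int low = in.read();    // least significant byte
        return (high << 8) | low;
    }

    // Convenience overload for in-memory data.
    static int readU16(byte[] bytes) {
        try {
            return readU16(new ByteArrayInputStream(bytes));
        } catch (IOException impossible) {
            throw new AssertionError(impossible);
        }
    }
}
```

Reading the bytes 0xAB, 0xCD this way yields 0xAB * 256 + 0xCD = 0xABCD.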

However, Common Lisp provides a more convenient way to perform this kind of bit twiddling. The function LDB, whose name stands for load byte, can be used to extract and set (with SETF) any number of contiguous bits from an integer. Its first argument is a byte specifier, created with the function BYTE. BYTE takes two arguments, the number of bits to extract or set and the position of the rightmost bit, where the least significant bit is at position zero.

LDB takes a byte specifier and the integer from which to extract the bits and returns the positive integer represented by the extracted bits. Thus, you can extract the least significant octet of an integer with a byte specifier of (byte 8 0).

To write a number out as a 16-bit integer, you need to extract the individual 8-bit bytes and write them one at a time. To extract the individual bytes, you just need to use LDB with the same byte specifiers. Of course, you can also encode integers in many other ways--with different numbers of bytes, with different endianness, and in signed and unsigned format.
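The same idea translates directly to Java, where LDB's extract-size-bits-at-position behavior is just a shift and a mask. This is an illustration of the concept, not the book's code:

```java
// ldb(size, position, n) mimics Common Lisp's (ldb (byte size position) n):
// shift the bits of interest down to position zero, then mask off
// everything above the requested width.
class BitTwiddling {
    static int ldb(int size, int position, int n) {
        return (n >>> position) & ((1 << size) - 1);
    }

    // Writing a 16-bit value big-endian means extracting each octet
    // with the same byte specifiers and emitting them in order.
    static byte[] writeU16(int n) {
        return new byte[] {
            (byte) ldb(8, 8, n),   // most significant octet first
            (byte) ldb(8, 0, n)    // then the least significant octet
        };
    }
}
```

For example, `ldb(8, 0, 0xABCD)` yields 0xCD, and `writeU16(0xABCD)` produces the bytes 0xAB, 0xCD.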

Textual strings are another kind of primitive data type you'll find in many binary formats. When you read files one byte at a time, you can't read and write strings directly--you need to decode and encode them one byte at a time, just as you do with binary-encoded numbers. And just as you can encode an integer in several ways, you can encode a string in many ways.

To start with, the binary format must specify how individual characters are encoded. To translate bytes to characters, you need to know both what character code and what character encoding you're using. A character code defines a mapping from positive integers to characters.

Each number in the mapping is called a code point. For instance, ASCII is a character code that maps the numbers from 0 to 127 to the particular characters used in the Latin alphabet. A character encoding, on the other hand, defines how the code points are represented as a sequence of bytes in a byte-oriented medium such as a file.

Nearly as straightforward are pure double-byte encodings, such as UCS-2, which map between 16-bit values and characters. The only reason double-byte encodings can be more complex than single-byte encodings is that you may also need to know whether the 16-bit values are supposed to be encoded in big-endian or little-endian format.

Variable-width encodings use different numbers of octets for different numeric values, making them more complex but allowing them to be more compact in many cases. For instance, UTF-8, an encoding designed for use with the Unicode character code, uses a single octet to encode the values 0-127 while using up to four octets to encode values up to 1,114,111. On the other hand, texts consisting mostly of characters requiring four bytes in UTF-8 could be more compactly encoded in a straight double-byte encoding.

Common Lisp provides two functions for translating between numeric character codes and character objects: CODE-CHAR and CHAR-CODE. The language standard doesn't specify what character encoding an implementation must use, so there's no guarantee you can represent every character that can possibly be encoded in a given file format as a Lisp character. In addition to specifying a character encoding, a string encoding must also specify how to encode the length of the string.

Three techniques are typically used in binary file formats. The simplest is to not encode it but to let it be implicit in the position of the string in some larger structure: a particular element of a file may always be a string of a certain fixed size. Both these techniques are used in ID3 tags, as you'll see in the next chapter.

The other two techniques can be used to encode variable-length strings without relying on context. One is to encode the length of the string followed by the character data--the parser reads an integer value in some specified integer format and then reads that number of characters. Another is to write the character data followed by a delimiter that can't appear in the string, such as a null character. The different representations have different advantages and disadvantages, but when you're dealing with already specified binary formats, you won't have any control over which encoding is used.

However, none of the encodings is particularly more difficult to read and write than any other. To write a string back out, you just need to translate the characters back to numeric values that can be written with WRITE-BYTE and then write the null terminator after the string contents. As these examples show, the main intellectual challenge--such as it is--of reading and writing primitive elements of binary files is understanding how exactly to interpret the bytes that appear in a file and to map them to Lisp data types.
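As a sketch of the null-terminated variant, again in Java rather than Lisp, and assuming a single-byte encoding such as ASCII (the class and method names are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Reads characters until the 0 terminator (or end of stream); writes
// the characters back out followed by the terminator.
class NullTerminatedString {
    static String read(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) > 0) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    static void write(String s, OutputStream out) throws IOException {
        for (int i = 0; i < s.length(); i++) {
            out.write(s.charAt(i));   // one octet per character
        }
        out.write(0);                 // the null terminator
    }

    // Convenience overload for in-memory data.
    static String read(byte[] data) {
        try {
            return read(new ByteArrayInputStream(data));
        } catch (IOException impossible) {
            throw new AssertionError(impossible);
        }
    }
}
```

A length-prefixed reader would look the same except that it would read the count first and then loop exactly that many times, with no terminator.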

If a binary file format is well specified, this should be a straightforward proposition. Actually writing functions to read and write a particular encoding is, as they say, a simple matter of programming.

Now you can turn to the issue of reading and writing more complex on-disk structures and how to map them to Lisp objects. Since binary formats are usually used to represent data in a way that makes it easy to map to in-memory data structures, it should come as no surprise that composite on-disk structures are usually defined in ways similar to the way programming languages define in-memory structures. Usually a composite on-disk structure will consist of a number of named parts, each of which is itself either a primitive type such as a number or a string, another composite structure, or possibly a collection of such values.

For instance, an ID3 tag defined in the 2.2 version of the specification consists of a header followed by a list of frames, each of which has its own internal structure. After the frames come as many null bytes as are necessary to pad the tag out to the size specified in the header.

If you look at the world through the lens of object orientation, composite structures look a lot like classes. For instance, you could write a class to represent an ID3 tag. An instance of this class would make a perfect repository to hold the data needed to represent an ID3 tag. You could then write functions to read and write instances of this class.

For example, assuming the existence of certain other functions for reading the appropriate primitive data types, a read-id3-tag function simply reads each field in order and stores the result in the corresponding slot of a new id3-tag instance. It's not hard to see how you could write the appropriate classes to represent all the composite data structures in a specification along with read-foo and write-foo functions for each class and for necessary primitive types.
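As an illustration of that field-by-field pattern in Java rather than Lisp, a hand-written reader for the ID3v2.2 header fields might look like this; the helper names are mine, and only the header (not the frames) is handled:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Reads each header field in order, one reader call per slot: exactly
// the shape a read-id3-tag function takes.
class Id3Header {
    String identifier;   // the three characters "ID3"
    int majorVersion;    // one byte; 2 for this version of the spec
    int revision;        // one byte; 0 for this version of the spec
    int flags;           // one byte
    int size;            // four bytes, 7 significant bits each

    static Id3Header read(InputStream in) throws IOException {
        Id3Header tag = new Id3Header();
        byte[] id = new byte[3];
        if (in.read(id) != 3) throw new IOException("truncated header");
        tag.identifier = new String(id, StandardCharsets.ISO_8859_1);
        tag.majorVersion = in.read();
        tag.revision = in.read();
        tag.flags = in.read();
        // The size bytes each have a 0 in the most significant bit, so
        // each octet contributes only 7 bits to the total.
        tag.size = (in.read() << 21) | (in.read() << 14) | (in.read() << 7) | in.read();
        return tag;
    }

    // Convenience overload for in-memory data.
    static Id3Header read(byte[] data) {
        try {
            return read(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note how mechanical this is: every slot is one read call, which is precisely the repetition the macro in the rest of the chapter is designed to eliminate.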

But it's also easy to tell that all the reading and writing functions are going to be pretty similar, differing only in the specifics of what types they read and the names of the slots they store them in. It's particularly irksome when you consider that in the ID3 specification it takes about four lines of text to specify the structure of an ID3 tag, while you've already written eighteen lines of code and haven't even written write-id3-tag yet.

What you'd really like is a way to describe the structure of something like an ID3 tag in a form that's as compressed as the specification's pseudocode yet that can also be expanded into code that defines the id3-tag class and the functions that translate between bytes on disk and instances of the class. Sounds like a job for a macro. Since you already have a rough idea what code your macros will need to generate, the next step, according to the process for writing a macro I outlined in Chapter 8, is to switch perspectives and think about what a call to the macro should look like.

Since the goal is to be able to write something as compressed as the pseudocode in the ID3 specification, you can start there. The header of an ID3 tag is specified by a few lines of pseudocode, one per field. The tag starts with the three-character ISO-8859-1 string "ID3". The version consists of two bytes, the first of which--for this version of the specification--has the value 2 and the second of which--again for this version of the specification--is 0. The flags slot is eight bits, of which all but the first two are 0, and the size consists of four bytes, each of which has a 0 in the most significant bit.

Some information isn't captured by this pseudocode. For instance, exactly how the four bytes that encode the size are to be interpreted is described in a few lines of prose. Likewise, the spec describes in prose how the frame and subsequent padding is stored after this header.

But most of what you need to know to be able to write code to read and write an ID3 tag is specified by this pseudocode. Thus, you ought to be able to write an s-expression version of this pseudocode and have it expanded into the class and function definitions you'd otherwise have to write by hand. Since this is just a bit of fantasizing, you don't have to worry about exactly how the macro define-binary-class will know what to do with expressions such as iso-8859-1-string. Okay, enough fantasizing about good-looking code; now you need to get to work writing define-binary-class--writing the code that will turn that concise expression of what an ID3 tag looks like into code that can represent one in memory, read one off disk, and write it back out.

To start with, you should define a package for this library; the version you can download from the book's Web site comes with a package file. Since you already have a handwritten version of the code you want to generate, it shouldn't be too hard to write such a macro. If you look back at the define-binary-class form, you'll see that it takes two arguments, the name id3-tag and a list of slot specifiers, each of which is itself a two-item list.

A single slot specifier from define-binary-class looks something like this: (major-version u1). That's not a legal DEFCLASS slot specifier, however. Instead, you need something like this: (major-version :initarg :major-version :accessor major-version).

First define a simple function to translate a symbol to the corresponding keyword symbol. The result, slightly reformatted here for better readability, should look familiar since it's exactly the class definition you wrote by hand earlier.

Next you need to make define-binary-class also generate a function that can read an instance of the new class. Looking back at the read-id3-tag function you wrote before, this seems a bit trickier, as the read-id3-tag wasn't quite so regular--to read each slot's value, you had to call a different function.

Not to mention, the name of the function, read-id3-tag, while derived from the name of the class you're defining, isn't one of the arguments to define-binary-class and thus isn't available to be interpolated into a template the way the class name was. You could deal with both of those problems by devising and following a naming convention so the macro can figure out the name of the function to call based on the name of the type in the slot specifier.

However, this would require define-binary-class to generate the name read-id3-tag , which is possible but a bad idea. Macros that create global definitions should generally use only names passed to them by their callers; macros that generate names under the covers can cause hard-to-predict--and hard-to-debug--name conflicts when the generated names happen to be the same as names used elsewhere.

You can avoid both these inconveniences by noticing that all the functions that read a particular type of value have the same fundamental purpose, to read a value of a specific type from a stream.