Tuesday, October 28, 2014

The Internet is running in debug mode

With the rise of the Web, textual encodings like XML/JSON have become very popular. Of course, textual message encoding has many advantages from a developer's perspective. What most developers are not aware of is how expensive encoding and decoding textual messages really is compared to a well-defined binary protocol.

It's common to define a system's behaviour by its protocol. Actually, a protocol conflates two distinct aspects of communication, namely:

  • Encoding of messages
  • Semantics and behavior (request/response, signals, state transitions of the communicating parties ..)

Frequently (not always), these two very distinct aspects are mixed up without need. So we are forced to run the whole Internet in "debug mode", as 99% of webservice and webapp communication is done using textual protocols.

The overhead in CPU consumption compared to a well-defined binary encoding is a factor of ~3-5 (JSON) up to >10-20 (XML). The unnecessary waste of bandwidth also adds to that greatly (yes, you can zip, but that in turn wastes even more CPU).
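To make the size difference concrete, here is a minimal sketch using only the JDK. The message fields (`sessionId`, `sequence`, `flag`) are invented for illustration; it encodes the same record once as JSON text and once as a fixed binary layout:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    // hypothetical message: sessionId (long), sequence (int), flag (boolean)
    static byte[] asJson(long sessionId, int sequence, boolean flag) {
        String json = "{\"sessionId\":" + sessionId
                + ",\"sequence\":" + sequence
                + ",\"flag\":" + flag + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] asBinary(long sessionId, int sequence, boolean flag) throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bout);
        out.writeLong(sessionId);   // 8 bytes
        out.writeInt(sequence);     // 4 bytes
        out.writeBoolean(flag);     // 1 byte
        return bout.toByteArray();  // 13 bytes total, decoded without any parsing
    }

    public static void main(String[] args) throws IOException {
        byte[] json = asJson(123456789L, 42, true);
        byte[] bin  = asBinary(123456789L, 42, true);
        System.out.println("json: " + json.length + " bytes, binary: " + bin.length + " bytes");
    }
}
```

The binary form is a few times smaller before any compression, and the decoder does fixed-offset reads instead of character-by-character parsing, which is where the CPU factor comes from.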

I haven't calculated the numbers, but this is environmental pollution at a big scale. Unnecessary CPU consumption to this extent wastes a lot of energy (global warming, anyone?).

The solution is easy:
  • Standardize on a few simple encodings (pure binary, self-describing binary ("binary JSON"), textual)
  • Define the behavioral part of a protocol separately (exclude encoding)
  • Use textual encoding during development, binary in production.
Man, we could save a lot of webserver hardware if only HTTP headers could be binary encoded ..
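The separation the bullet points ask for can be sketched as a pluggable codec behind the behavioral layer. All interface and class names below are invented for illustration: the request/response logic only talks to a `Codec`, so you can run the textual one during development and swap in the binary one for production without touching behavior:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// the behavioral layer only sees this abstraction
interface Codec {
    byte[] encode(int sequence);
    int decode(byte[] bytes);
}

// textual codec: human readable, handy for curl/log debugging
class TextCodec implements Codec {
    public byte[] encode(int sequence) {
        return ("{\"seq\":" + sequence + "}").getBytes(StandardCharsets.UTF_8);
    }
    public int decode(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return Integer.parseInt(s.replaceAll("[^0-9-]", ""));
    }
}

// binary codec: fixed 4-byte layout, no parsing at all
class BinaryCodec implements Codec {
    public byte[] encode(int sequence) {
        return ByteBuffer.allocate(4).putInt(sequence).array();
    }
    public int decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }
}

public class CodecDemo {
    // behavior (here: acknowledge with the next sequence number) is codec-agnostic
    static byte[] handleRequest(Codec codec, byte[] request) {
        return codec.encode(codec.decode(request) + 1);
    }

    public static void main(String[] args) {
        Codec dev = new TextCodec();     // development: readable on the wire
        Codec prod = new BinaryCodec();  // production: compact and cheap to decode
        System.out.println(dev.decode(handleRequest(dev, dev.encode(41))));    // 42
        System.out.println(prod.decode(handleRequest(prod, prod.encode(41)))); // 42
    }
}
```

Same behavior, two wire formats; the protocol definition never needs to mention which one is in use.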

22 comments:

  1. There go all our pretty logs and quick curl tests and such. Text protocols allow a quick look into what's going on; binary protocols don't. As far as I last checked, services don't self-heal - a human has to jump in. Which might be a reflection of other things going wrong, but that is how it is :)

    1. I think that's a narrow view. In fact, text is also binary encoded. It's because of standardization on how to read e.g. UTF-8 that tools "know" how to decode it. One could standardize on a self-describing binary format and provide a similar tool chain of "pretty printers" without problems (think of it like a "special" charset encoding). However, in the first place, protocol behaviour must not be mixed up with message encoding.

    2. """ and provide a similar tool chain of "pretty printers" without problems (think of it like a "special" charset encoding)"""

      This is an even more narrow view. Text formats work with what we already have, here and now. What you propose is replacing that with something that's not out there, is not interoperable, and that most tools don't know about.

      First get the tooling you say is easy to provide "without problems", then ask for the change to binary formats.

    3. I'd like to remind you the Internet should serve users, not developers :). Anyway, HTTP/2 will solve the issues. Took only like 15 years, and it will take like 10 more until it's widely adopted :)

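The "pretty printer" tool chain for a self-describing binary format suggested in the replies above can indeed be tiny. A minimal sketch for an invented tag-length-value scheme (the tags, layout, and class name are all made up here, not any standardized format):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// invented scheme: 1-byte tag (0 = long, 1 = UTF-8 string), then the value
public class TlvPretty {
    static byte[] sample() throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bout);
        out.writeByte(0); out.writeLong(42L);     // tag 0: long value
        out.writeByte(1); out.writeUTF("hello");  // tag 1: string (writeUTF prefixes a length)
        return bout.toByteArray();
    }

    // generic decoder: any tool that knows the scheme can render the bytes readably,
    // just like any tool that knows UTF-8 can render text
    static String prettyPrint(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        StringBuilder sb = new StringBuilder();
        while (in.available() > 0) {
            int tag = in.readUnsignedByte();
            switch (tag) {
                case 0: sb.append("long: ").append(in.readLong()).append('\n'); break;
                case 1: sb.append("string: ").append(in.readUTF()).append('\n'); break;
                default: throw new IOException("unknown tag " + tag);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(prettyPrint(sample()));
    }
}
```

Once such a scheme is standardized, the same decoder works in logs, debuggers, and curl-style tools, which is the crux of the argument above.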
  2. I fully agree with your desire to separate protocol handling from encoding (format). Doing this is an under-appreciated way both to decouple handling and to allow for more efficient handling, by choosing the most suitable encoding for the context. Sometimes this should be a well-known textual format (for interoperability, flexibility and diagnostics); other times a more compact binary representation makes more sense. Protocols should not unnecessarily limit the choice.

    I am not sure I share your concern on the performance aspects: without disputing the specific encoding numbers, I just question their significance in the big picture. In the case of HTTP, for example, decoding of header values is a minuscule part of handling; even if it were changed to binary encoding, the benefits in many cases would not be significant at all. The payload remains as-is, and the real complexity in HTTP comes from managing connections, state, liveness checks and other protocol-level aspects.

    1. Regarding the HTTP header, I left out context. E.g. for a low-latency, many-client long-polling HTTP server, the payload consists of a single sequence number (of the last received message), so header parsing indeed adds significant overhead in case there are no pending messages (processing is session lookup + sequence number comparison) for this special case. Another example is DoS protection. It eats significant processing time to weed out DoS requests from application requests.

    2. Ofc in general you are right in that "header parsing" is not a good example regarding the big picture...

    3. Any in-between component (proxy, reverse proxy) needs to encode and decode (parse) headers; it's probably of relevance especially for the tiny payloads typical of e.g. webservice remote calls.

  3. My best citation on the mec.symp.group ... The next time I have to write another JSON parser/serializer (brrrr, Jackson... :( ) I'll pray that someone listens to you..
    :) thanks Rüdiger

    1. You are doing Jackson wrong; it helps dealing with JSON and afaik is pretty fast. It did not invent JSON ..

    2. The Java binding and the streaming API are not free.. in any way. Although Jackson is "pretty" fast, it does not deliver any zero-copy (AFAIK) ability in the serialisation stage... and it produces TONS of garbage. Comparing it with a serious serializer (hand-made?) that is really GC-free is simple. But that's another story... so far, if I have to send a long, why on earth do I have to send more than 8 bytes? ^^ P.s. Jackson is not a bad tool per se, and I agree with you that it is a great help if you don't want to deal directly with JSON...

    3. Hm .. do you have some kind of reusable open-source variant of a zero-copy, low-garbage JSON parser? I'd be interested in something like that, as Jackson duplicates some work I already do in the serialization layer, such that the JSON codec is well below what would be possible.

    4. Hi Rüdiger,
      I've only custom own-rolled libs that I developed for my own needs... undocumented too :P
      But TextWire of https://github.com/OpenHFT/Chronicle-Bytes looks very promising... if you wait a few weeks - I've contacted Peter Lawrey about contributing to this repo, and maybe there will be a little more docs and examples for it :)

    5. Wrong repo, sorry :P
      I really need a coffee this morning... https://github.com/OpenHFT/Chronicle-Wire

    6. Interesting bottom-up approach (many "planned" features though ;) ). In contrast, fast-serialization goes top-down, providing different wire formats to represent serialized object graphs (binary, JSON). Maybe I could add a Chronicle-Wire codec to fst once C-Wire is in a more mature state.

  4. MsgPack is efficient, schema-less, and has a 1:1 mapping with JSON (unlike BSON, despite its name).

    1. I am aware of msgpack. I even tried to build a codec for fast-serialization based on msgpack but somehow lost track. It might be a better alternative to JSON for actor <=> JavaScript remoting. Are there any Java benchmarks?

  5. I used to push Sun's XDR [1] for this very reason. Client-server apps with binary protocols, too. That's because the web was more inefficient, complex, and insecure in about every area. It was crap. Its main advantages were the networking effect, instant distribution, and the widespread compatibility of HTML/JS. We could've just fixed the problems in our native C/S and P2P models, but we adopted the web instead.

    Two other good alternatives from long ago were Juice [2] for applets and Globe [3] for WAN architecture.

    [1] https://en.wikipedia.org/wiki/External_Data_Representation

    [2] http://www.modulaware.com/mdlt69.htm

    [3] http://www.cs.vu.nl/~philip/globe/

    Nick P

    1. Wow, the Globe project looks interesting. Annoyed by the lack of abstraction and poor performance of existing distributed application products, I am active in a somewhat similar direction: http://ruedigermoeller.github.io/kontraktor/ . Well, it's actually mostly JVM-bound, erm, and not that global ;).

  6. We use binary protocols from day 1 at Aerospike - that's one of the "small" reasons we massively reduce server counts compared to other databases. You might be surprised how hard it is to fight against an entire industry - hardware companies don't like getting cut out, cloud companies don't like getting cut out, open source companies that charge by node count don't like getting cut out. All of these guys pay lip service to efficiency, then bury technology that is actually more efficient. Flash (SSD) storage is similar - you end up paying a lot less for most use cases compared to DRAM and compared to Rotational, but only a handful of database companies have optimized for Flash.

    1. Agree. In addition, there is a widespread lack of knowledge of what is technically possible. People's gut has adapted to crappy tech. Premature scale-out dominates.

  7. It's better to use a binary protocol for service-to-service communication; it's much more efficient than any text protocol and takes less bandwidth too.
