Monday, October 28, 2013

Beware the void* trap

I'd like to take a little bit of time today to share a problem that took me numerous days to track down and solve because it was so obscure and stealthy.  At work, we use ZeroMQ to handle any inter-process communication, and our primary language is C++.  I had begun a fairly intense project to remove ZeroMQ from the intra-process communications of a particular daemon, leaving it only for when we need to cross process boundaries.  Basically, once the messages that we want get into our daemon, there is no reason to pass them off to threads and such by serializing them to byte arrays (since that what a ZeroMQ message is) when they could be added (as objects) the lists, passed around to thread pools, etc.

As I was nearing the completion of the project, I found that all outbound communication from one of the sections of code seemed to be lost.  This was strange because the inbound communication was handled properly.  I had refactored both directions, so it was quite possible that I had messed something up.  However, no matter how many times I checked the logic, no matter how many log statements I made at each line, nothing looked amiss.  I struggled for days trying to understand why the "send()" calls seemed to be going nowhere, and the answer shocked me.

Here is a simplistic version of the code that I had.  Basically, this code receives (from somewhere else in the program) a list of messages to send to a ZeroMQ socket as a single atomic message.  This is accomplished by adding the "ZMQ_SNDMORE" ("send more") flag to the call.
void sendMessageToEndpoint1( std::list<zmq::message_t*>& messages ) {
   while( messages.size() > 0 ) {
      //! This is the ZeroMQ message that we're going to send.
      //! It has been prepared for us elsewhere.
      zmq::message_t* message = messages.front();
      // Remove the message from the list.
      // The size of this list is now the number of remaining messages to send.
      message.pop_front();
      
      //! This contains the flags for the "send" operation.  The only flag that
      //! we're actually going to set is whether or not we have more messages
      //! coming, and that's for every message except the last one.
      int flags = messages.size() > 0 ? ZMQ_SNDMORE : 0;
      // Send the message to the endpoint.
      // Note that "endpoint1" is of type "zmq::socket_t*".
      endpoint1->send( message, flags );
      
      // We can now delete the message.
      delete message;
   }
}

I promise you that I had logged something before and after every statement, and everything was exactly as I expected it to be.  No exceptions were thrown.  There were no compiler warnings.  But no client on the other side of "endpoint1" ever got any of the messages that were being sent.  This drove me crazy.

The answer is that I was passing the wrong thing to "zmq::socket_t::send()".  Unlike the "recv()" ("receive") call, which takes a pointer to a "zmq::message_t", the "send()" call merely takes a reference to a "zmq::message_t".  Clearly this is a type error, so the compiler should have caught it.  But it didn't.

Here's the signature of the "send()" function.  Basically, it sends a message with some optional flags.
bool send( zmq::message_t& message, int flags = 0 );

I was sending it a "zmq::message_t*" and "int", so the compiler should have reported an error, since the type of the first argument was incorrect.  However, no error (or warning) was printed, and it compiled fine.  Even stranger, nothing bad happened when I called "send()".  Nothing good happened, either, but the code ran with the only strange symptom being that my "send()" calls seemed to do nothing.  The client on the other side of the ZeroMQ socket simply never received the message.

So, what's up with that?

It turns out that there is another "send()" function, one that takes three parameters.  It sends an arbitrary number of bytes with some optional flags.
size_t send( void* buffer, size_t length, int flags = 0 );

And there's the rub.

We've already established that the first "send()" function shouldn't work.  But here's a second "send()" function that does meet our signature.  As for the first parameter, a "zmq::message_t*" will be implicitly cast to "void*" in C++.  As for the second parameter, "int" will be implicitly cast to "size_t", which is just an unsigned integral type.  As for the third parameter, it is not specified, so it'll be set to zero.

This second "send()" is clearly not what I wanted to use, but the compiler doesn't know that I thought that the function required a pointer to a message, not a reference to it.  Since "ZMQ_SNDMORE" is defined to be the number 2, this call to "send()" only attempts to transmit two bytes.  And because a "zmq::message_t" is certainly larger than two bytes (it is actually at least 32 bytes), the data to copy, from the second "send()" function's perspective, is always present.  This means that in addition to not getting any warnings or errors, I am also guaranteed to have this code never crash, since it will always send the first two bytes of the "zmq::message_t" structure.

Naturally, the fix was to send the dereferenced version of the message, and everything worked fine after that.  The moral of the story here is to watch out for implicit "void*" conversion.  And if you are making a library that accepts a byte buffer for reading/writing purposes, please set the type of that buffer to some byte-oriented type, such as "char*" or "uint8_t*".  These would require explicit casts, thus preventing accidental use as in my case.