Why is at-most-once delivery the default for actor systems?

I am working with actor systems for an event sourced service and I'd like to understand why the default is at-most-once delivery, i.e., a message from one actor to another may arrive at most once, but potentially not at all. I am working with Akka, but I understand this is in general the default for actor model implementations.

It seems to me that at-most-once delivery can easily result in silent failures/data corruption, and that the issues from at-least-once delivery can be easily fixed by versioning and tagging messages. Quoted from the How Akka Works manual:

(at-most-once works) especially in situations where the occasional loss of a message does not leave a system in an inconsistent state

I don't understand how to make the logical jump from "occasionally losing a message is no big deal" to "losing this particular message is no big deal". It seems to me: if I don't actually care about receiving this message, why send it at all?

Example: We are building a system to count the number of words in a document. We have an actor (M) that receives a document, splits it into lines, and then sends each line to a child actor (C) to count the words in the line. The state of M is how many words are in the document; whenever a child C reports the number of words in a line, M adds that number to the total.

With at-most-once delivery: if a message to or from a child actor is lost due to a network error, the parent wouldn't know (had the failure been in the actor itself, the parent could handle it through supervision). The parent could track progress by keeping a map of each child actor and whether it has finished, updating the map as replies come in. But if the computation becomes slightly more complex and each child sends a variable number of replies, this gets out of hand quickly. Also, at that point we have built a mechanism to ensure that messages are received, so we have already started to leave the world of at-most-once delivery, especially if we rely on the map to re-deliver the compute message to C rather than just to tell us that the current count is corrupt.
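To make the tracking idea concrete, here is a minimal sketch in plain Python (not Akka). The class `WordCountParent` and the helper `count_words` are hypothetical names invented for this illustration, and message passing is simulated with ordinary method calls. It only shows how a parent could notice that a reply never arrived; it does not re-deliver anything.

```python
class WordCountParent:
    """Hypothetical parent M: remembers which lines have not reported back yet."""

    def __init__(self, document: str):
        self.lines = document.splitlines()
        self.pending = set(range(len(self.lines)))  # line indices with no reply yet
        self.total = 0

    def on_reply(self, line_index: int, word_count: int) -> None:
        # Only count a line the first time its reply shows up.
        if line_index in self.pending:
            self.pending.discard(line_index)
            self.total += word_count

    def is_complete(self) -> bool:
        return not self.pending


def count_words(line: str) -> int:
    """Stand-in for the work done by a child actor C."""
    return len(line.split())


parent = WordCountParent("a b c\nd e\nf")

# Simulate at-most-once delivery where the reply for line 1 is lost in transit.
for i, line in enumerate(parent.lines):
    if i != 1:
        parent.on_reply(i, count_words(line))

print(parent.total, parent.pending, parent.is_complete())
```

Without the `pending` set, the parent would report a total of 4 with no indication that it is wrong. With it, the parent at least knows that line 1 is still outstanding.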

With at-least-once delivery: we can give each message sent to M a serial ID and keep a log (C -> ID) in the parent M. If a message arrives twice, the log tells us to discard the duplicate. To me this seems simpler, and it also scales better as the children's task becomes more complex.
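A rough sketch of the serial-ID log described here, again in plain Python rather than Akka. `DedupingParent` and the `(child, serial_id, word_count)` tuple layout are assumptions made up for the example.

```python
from collections import defaultdict


class DedupingParent:
    """Hypothetical parent M that drops replies it has already applied."""

    def __init__(self):
        self.seen = defaultdict(set)  # child name -> serial IDs already applied
        self.total = 0

    def on_reply(self, child: str, serial_id: int, word_count: int) -> bool:
        """Apply the reply once; return False if it was a duplicate."""
        if serial_id in self.seen[child]:
            return False
        self.seen[child].add(serial_id)
        self.total += word_count
        return True


parent = DedupingParent()

# At-least-once delivery: the reply ("C1", 0, 3) arrives twice.
deliveries = [("C1", 0, 3), ("C2", 0, 2), ("C1", 0, 3), ("C1", 1, 1)]
applied = [parent.on_reply(*d) for d in deliveries]

print(applied, parent.total)
```

Note that `seen` grows with every message ever received, which is fine for one document but becomes a design question for a long-running service.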

Answer:

The short answer for why it's the default is that:

- It's not difficult to implement at-least-once on top of at-most-once (the ask pattern of pairing requests and replies is most of the way there), but there's no great way to get at-most-once out of at-least-once (i.e. without still incurring the cost of at-least-once).
- At-most-once means more or less one thing regardless of context, while at-least-once can mean an effectively unbounded number of things (e.g. after how much time without a reply can a process decide that the send failed, and how many times should it retry?). There's no reasonable universal default, and quite likely not even a single reasonable default within a given application or service.
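As an illustration of the first point, here is a small sketch of at-least-once layered on an at-most-once channel by pairing each request with a reply and retrying. This is plain Python, not the Akka ask pattern itself. `make_lossy_channel`, `send_at_least_once`, and the scripted outcomes are all hypothetical, and a `None` reply stands in for an ack timeout. The `max_attempts` argument is one of the policy knobs the second point is about.

```python
from itertools import chain, repeat


def make_lossy_channel(handler, outcomes):
    """At-most-once channel. Each send consumes one scripted outcome:
    'lost' drops the request, 'ack_lost' delivers it but drops the reply,
    'ok' delivers both. Scripted rather than random so the run is reproducible."""
    script = chain(outcomes, repeat("ok"))

    def send(message):
        outcome = next(script)
        if outcome == "lost":
            return None
        reply = handler(message)
        return None if outcome == "ack_lost" else reply

    return send


def send_at_least_once(send, message, max_attempts):
    """Retry until a reply arrives or the attempt budget runs out."""
    for attempt in range(1, max_attempts + 1):
        reply = send(message)  # None stands in for "no ack before the timeout"
        if reply is not None:
            return reply, attempt
    raise TimeoutError(f"no ack after {max_attempts} attempts")


received = []


def handler(message):
    received.append(message)
    return f"ack:{message}"


send = make_lossy_channel(handler, ["lost", "ack_lost"])
reply, attempts = send_at_least_once(send, "count line 7", max_attempts=5)

print(reply, attempts, len(received))
```

The sender needed three attempts, and the receiver handled the message twice, because the second attempt was delivered but its ack was lost. That duplicate is the price of at-least-once, and it is why the receiver then needs some form of idempotency.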

Note that at-least-once can lead to at least as much inconsistency as at-most-once, especially when messages are not idempotent. The generic ways of adding idempotency are really heavyweight; conversely, idempotency designed at a higher level (and only where needed) can typically be far lighter weight.
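One possible reading of "idempotency at a higher level", applied to the word-count example from the question: instead of logging message IDs, shape the state so that applying the same reply twice is harmless. The sketch below is plain Python with hypothetical class names, and it assumes each reply carries the index of the line it refers to.

```python
class IncrementingParent:
    """Non-idempotent handler: every delivery adds to the total."""

    def __init__(self):
        self.total = 0

    def on_reply(self, line_index: int, word_count: int) -> None:
        self.total += word_count


class IdempotentParent:
    """Idempotent handler: a reply overwrites the entry for its line."""

    def __init__(self):
        self.count_by_line = {}

    def on_reply(self, line_index: int, word_count: int) -> None:
        self.count_by_line[line_index] = word_count  # overwrite, not increment

    @property
    def total(self) -> int:
        return sum(self.count_by_line.values())


# Replies for lines 0 and 1 are each delivered twice.
deliveries = [(0, 3), (1, 2), (1, 2), (0, 3), (2, 1)]

naive, idempotent = IncrementingParent(), IdempotentParent()
for line_index, word_count in deliveries:
    naive.on_reply(line_index, word_count)
    idempotent.on_reply(line_index, word_count)

print(naive.total, idempotent.total)
```

The incrementing version overcounts under duplicates, while the keyed version does not and needs no separate message log. This only works because the domain offers a natural key (the line index), which is the sense in which it is application-specific rather than generic.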

Answer:

The most common use case for at-most-once delivery is counting (or collecting other statistics of) high volumes of events (e.g., page views).

An occasional loss of a message is indeed not that big of a deal to these applications: if you end up with 999999 events for a given minute rather than a million, no one will notice, and you can also restate your stats later without much of an impact on anything.

Overcounting would be a bigger problem: when the data is restated, the numbers would go down, and usually not by just 1, since lost or timed-out acks can easily create massive discrepancies (see below). And it is not as easy to avoid as you imagine. How many IDs would you keep in that "log" you suggest, and for how long? How much additional memory would that require? How slow would lookups be? How would component restarts be handled? How would you ensure that duplicate events are always sent to the same component instance? You would also have to implement this record-keeping in every single component of your system.
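To show why the "how many IDs, and for how long" question matters, here is a small sketch of a dedup log with a fixed capacity. `BoundedDedupLog` is a hypothetical helper, and the capacity of 3 is chosen only to make the effect visible in a short stream.

```python
from collections import OrderedDict


class BoundedDedupLog:
    """Remembers only the most recent `capacity` event IDs."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.ids = OrderedDict()

    def first_time(self, event_id) -> bool:
        """Return True if the ID is not in the log, and record it."""
        if event_id in self.ids:
            return False
        self.ids[event_id] = True
        if len(self.ids) > self.capacity:
            self.ids.popitem(last=False)  # evict the oldest remembered ID
        return True


log = BoundedDedupLog(capacity=3)

# Event 1 is resent twice: once soon after the original, once much later.
stream = [1, 2, 1, 3, 4, 1]
counted = [event_id for event_id in stream if log.first_time(event_id)]

print(counted)
```

The early duplicate of event 1 is caught, but the late one arrives after the ID has been evicted and is counted again. Making the log larger or unbounded trades that overcount for memory, and the log is also lost on restart unless it is persisted.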

But more importantly, enforcing an at-least-once contract makes for a much more complex, and consequently much less reliable, system.

In such a system, a component cannot just send a message into a pipe and forget about it. It has to wait for an ack from every component downstream of it. If an ack does not arrive in time, it must send the same message again, and keep doing so until the ack is received. Now imagine that one of the downstream components is slow for some reason (perhaps high memory utilization). The acks time out, and upstream starts resending the same events again and again. Meanwhile, new messages keep piling up unacknowledged, so they get resent too. The backlog compounds quickly. This increases the load on the component that is already having trouble (making it even slower) as well as on all the others, making an already bad situation much worse and eventually bringing the entire system down.
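The feedback loop can be seen in a toy tick-based simulation. This is a deliberately simplified model, not a measurement of any real system: the function `simulate`, its parameters, and the chosen rates are all assumptions for illustration. Upstream produces 3 messages per tick, the slow consumer handles 2 per tick, and in retry mode any message unacked for 2 ticks is sent again.

```python
from collections import deque


def simulate(ticks, arrival_rate, service_rate, ack_timeout, retry):
    """Return (total sends, final backlog) for a slow consumer.

    With retry=False the sender is fire-and-forget. With retry=True it
    resends every message whose ack has not arrived within ack_timeout ticks."""
    queue = deque()  # message IDs waiting at the consumer
    unacked = {}     # message ID -> tick of the most recent send
    next_id = 0
    sends = 0
    for tick in range(ticks):
        # New messages from upstream.
        for _ in range(arrival_rate):
            queue.append(next_id)
            if retry:
                unacked[next_id] = tick
            next_id += 1
            sends += 1
        # Resend anything whose ack has timed out.
        if retry:
            for msg_id, sent_at in list(unacked.items()):
                if tick - sent_at >= ack_timeout:
                    queue.append(msg_id)  # a duplicate copy joins the backlog
                    unacked[msg_id] = tick
                    sends += 1
        # The consumer handles what it can and acks each handled message.
        for _ in range(min(service_rate, len(queue))):
            unacked.pop(queue.popleft(), None)
    return sends, len(queue)


plain_sends, plain_backlog = simulate(20, 3, 2, ack_timeout=2, retry=False)
retry_sends, retry_backlog = simulate(20, 3, 2, ack_timeout=2, retry=True)

print("fire-and-forget:", plain_sends, plain_backlog)
print("with retries:   ", retry_sends, retry_backlog)
```

In both runs the consumer does the same amount of work per tick, but with retries part of that work is spent on duplicates, so the backlog ends up larger than in the fire-and-forget run. Real systems usually mitigate this with backoff, retry limits, or backpressure, which are further policy decisions on top of "at-least-once".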