In our lecture we were discussing the inner possible implementation of std::list. The lecturer showed the approach where a dummy node is created to indicate the end of the list:
struct Node { Node* prev; ... }
Node* dummy = reinterpret_cast<Node*>(new int8_t[sizeof(Node)]);
dummy->prev = ... /* last node */;
They claimed that the last line might cause undefined behaviour. I cannot see why this could be unsafe. At the end of the day, we are just overwriting several bits without dereferencing the preseting. If this really was problem, wouldn't it have occured at the point of reinterpret_casting?
So, does udefined behaviour actually take place here, and, if so, why?
CodePudding user response:
First, for your second line
Node* dummy = reinterpret_cast<Node*>(new int8_t[sizeof(Node)]);
by itself.
new returns a pointer to the first int8_t object in the array of int8_t objects it created.
reinterpret_cast's behavior depends on the alignment of the address represented by the pointer. If it is suitably aligned for an object of type Node, then it will leave the pointer value unchanged (since there is definitively no Node object at the location which is pointer-interconvertible with the int8_t object). If it is not suitably aligned, the returned pointer value will be unspecified.
Unspecified means that we won't know what the value will be, but it wont cause undefined behavior.
Therefore, in any case, the second line and the cast by itself do not have undefined behavior.
The line
dummy->prev = ... /* last node */;
requires that the object dummy points to is actually a Node object. Otherwise it has undefined behavior. As mentioned above, reinterpret_cast gives us either an unspecified value or a pointer to the int8_t object. This already is an issue, that I think at least requires a std::launder call.
Even if the pointer returned from new is correctly aligned, then we still need to check whether a Node object is present. We certainly did not create any such object in any of the shown operations explicitly, but there is implicit object creation which may help out (at least since C 20, but I suppose this was supposed to be a defect report against older standard versions).
Specifically, objects may be created implicitly inside an array of types unsigned char, std::byte and, with some limitations, char (CWG 2489) when the lifetime of the array is started. int8_t is usually signed char and I think is not allowed to be either of the three previously mentioned types (see e.g. this question). This removes the only possible way out of UB.
So your third code line does have undefined behavior.
Even if you remedy this by changing the type form int8_t to std::byte, there are other constraints on the details of Node to make the implicit object creation possible. It may also be necessary to add a std::launder call.
All of this doesn't consider the alignment yet, because although new[] obtains memory with some alignment requirements, I think the standard mandates new[] itself to return a pointer with stronger alignment than required for the element type only for char, unsigned char and std::byte array new.
Many of these issues can probably be avoided by using e.g. operator new directly, possibly with provided alignment request, and making sure that Node is an aggregate.
In any case writing code like this is very risky because it is difficult to be sure that it isn't UB. It should be avoided when ever possible.
CodePudding user response:
This question is academical in the sense that there is no practical use case for the above example. Instead of writing
Node* dummy = reinterpret_cast<Node*>(new int8_t[sizeof(Node)]);
people would always use the correct approach and write:
Node* dummy = new Node;
Then there would be nothing to discuss. But here we have a academical or theoretical or language specific question.
Let us first evaluate the given source code piece by piece. struct Node { Node* prev; ... }; with whatever further definitions will of course never create undefined behaviour. It will declare a struct, which may contain padding, and has a clear defined size. There will be no discussion about alignment or whatever, because we do not yet have an instance of the struct.
Then, let us split the next line into 3 parts. First, new int8_t[sizeof(Node)]);. This will allocate a number of bytes in memory and return a pointer to it. The size of the allocated memory, which may include padding, is given by sizeof Node.
Now, with the new in general, we may have alignment. This depends on the used allocation function.
- Called by non-array new-expressions to allocate storage required for a single object. The standard library implementation allocates count bytes from free store. In case of failure, the standard library implementation calls the function pointer returned by std::get_new_handler and repeats allocation attempts until new handler does not return or becomes a null pointer, at which time it throws std::bad_alloc. This function is required to return a pointer suitably aligned to point to an object of the requested size.
- Called by the array form of new[]-expressions to allocate all storage required for an array (including possible new-expression overhead). The standard library implementation calls version (1)
New will return a
1-4) non-null pointer to suitably aligned memory of size at least size
So, now we have a pointer to a piece of memory (maybe aligned) with at least the size of the (maybe padded) struct.
Ok, we have pointer now.
Next, the reinterpret_cast.
This will do nothing.
Please read here.
Unlike static_cast, but like const_cast, the reinterpret_cast expression does not compile to any CPU instructions (except when converting between integers and pointers or on obscure architectures where pointer representation depends on its type). It is purely a compile-time directive which instructs the compiler to treat expression as if it had the type new-type.
An that is very obvious. If we reinterpret-cast a pointer, let us assume a 32 bit prvalue, to another pointer, which is also a 32 bit prvalue with the same content, what else should happen?
Looking at the Node *, then this is also a 32 bit prvalue. And if we assign the address value of the memory region returned by new, of course the same value will be assigned. So, the whole reinterprete task is a noOp and the "Node" pointer, points to some (maybe aligned and padded) memory, definitely big enough to hold a "Node" struct.
The next step is dereferencing and assignment: dummy->prev = ... /* last node */;
Here, we dereference a pointer to a valid memory area than can be interpreted as Node and assign a value, again a pointer to a pointer. This line of code will of course definitely work. There cannot be UB.
If you read again about reinterpret_cast in the CPP Reference, then you will find out the the cast itself will never be a problem. But there may be a problem with dereferencing a casted pointer.
- Any pointer to function can be converted to a pointer to a different function type. Calling the function through a pointer to a different function type is undefined, but converting such pointer back to pointer to the original function type yields the pointer to the original function.
And others.
But for the above code there will be no UB.
So, the anser to the question
Does reassignment of pointers acquired by reinterpret_cast from raw memory cause UB
is No. Definitely not for the shown piece of code.
The interpretation of the data, so, after dereferencing a pointer, maybe a problem as shown in the function pointer example. But not for the shown case.
The second question in the post:
If this really was problem, wouldn't it have occured at the point of reinterpret_casting?
the answer is: No, definitely not.
And the third question
So, does udefined behaviour actually take place here, and, if so, why?
For the shown code there will definitely be no UB.
