The Secret Life of Cows

Pascal’s scribbles blog June 2, 2018
Source

A lot of people at RustFest Paris mentioned Cows -- which may be surprising if you've never seen std::borrow::Cow!

Cow in this context stands for "Clone on Write" and is a type that allows you to reuse data if it is not modified. Somehow, these bovine super powers of Rust's standard library appear to be a well-kept secret even though they are not new. This post will dig into this very useful pointer type by explaining why in systems programming languages you need such fine control, explain Cows in detail, and compare them to other ways of organizing your data.

Organizing Data

This is what it all comes down to: People want to have a good, precise way to organize their data. And they want their programming language to support them. That's why a lot of newer languages include a bunch of data structures optimized for different use cases, and that is also why software developers are dealing with API documentation so often. To ensure that your code has the performance characteristics you expect, it is essential to know which piece of data is represented in which way.

In systems programming languages, this is in some regards even more important. You want to know:

  1. exactly where your data lives,
  2. that it is efficiently stored,
  3. that it is removed as soon as you stop using it,
  4. and that you don't copy it around needlessly.

Ensuring all these properties is a great way to write fast programs. Let's look at how we can do this in Rust.

Where Does Our Data Live

It is quite explicit where your data lives. By default, primitive types and structs containing primitive types are allocated on the stack, without any dynamic memory allocation. If you want to store data of a size only known at runtime (say the text content of a file), you need to use a type that dynamically allocates memory (on the heap), for example String, or Vec. You can explicitly allocate a data type on the heap by wrapping it in a Box.

(If you're unfamiliar with the notion of "Stack and Heap", you can find a good explanation in this chapter of the official Rust book.)

Note: Creating a new (not-empty) String means allocating memory, which is a somewhat costly operation. A language like Rust gives you quite a few options to skip some allocations, and doing so can speed up performance-critical parts of your code significantly. (Spoiler: Cow is one of these options.)

Structuring Data

If you know what you will do with your data, you can probably figure out how to best store it. If you for example always iterate through a known list of values, an array (or a Vec) is the way to go. If you need to look up values by known keys, and don't care about the order they are stored in, a hash map sounds good. If you need a stack to put data onto from different threads, you can use crossbeam-deque. This is just to give you a few examples -- there are books on this topic and you should read them. A Cow doesn't really help you here per-se, but you can use it inside your data structures.

Dropping Data

Luckily, in Rust it is easy to make sure our data gets removed from memory as soon as possible (so we don't use up too much memory and slow down the system). Rust uses the ownership model of automatically dropping resources when they go out of scope, so it doesn't need to periodically run a garbage collector to free memory. You can still waste memory, of course, by allocating too much of it manually, or by building reference cycles and leaking it.

No Needless Copying

One important step towards being a responsible citizen in regard to memory usage is to not copy data more than necessary. If you for example have a function that removes whitespace at the beginning of a string, you could create a new string that just contains the characters after the leading whitespace. (Remember: A new string means a new memory allocation.) Or, you could return a slice of the original string, that starts after the leading whitespace. The second options requires that we keep the original data around, because our new slice is just referencing it internally. This means that instead of copying however many bytes your string contains, we just write two numbers: A pointer to the point in the original string after the leading whitespace, and the length of the remaining string that we care about. (Carrying the length with us is a convention in Rust.)

But what about a more complicated function? Let's imagine we want to replace some characters in a string. Do we always need to copy it over with the characters swapped out? Or can we be clever and return some pointer to the original string if there was no replacement needed? Indeed, in Rust we can! This is what Cow is all about.

What is a Cow Anyway

In Rust, the abbreviation "Cow" stands for "clone on write"[^clone]. It is an enum with two states: Borrowed and Owned. This means you can use it to abstract over whether you own the data or just have a reference to it. This is especially useful when you want to return a type from a function that may or may not need to allocate.

[^clone]: Yes, that's right: Clone on write, not copy on write. That's because in Rust, the Copy trait is guaranteed to be a simple memcpy operation, while Clone can also do custom logic (like recursively clone a HashMap<String, String>.

A std Example

Let's look at an example. Say you have a Path and want to convert it to a string. Sadly, not every filesystem path is valid UTF-8 (Rust strings are guaranteed to be UTF-8 encoded). Rust has a handy function to get a string regardless: Path::to_string_lossy. When the path is valid UTF-8 already, it will return a reference to the original data, otherwise it will create a new string where invalid characters are replaced with the � character.

A Beefy Definition

With that in mind, let's look at the actual definition of Cow:

As you can see, it takes some convincing to have Rust accept this type in a way we can work with it. Let's go through it one by one.

  • 'a is the lifetime that we need our data to be valid for. For the Owned case it's not very interesting (to Cow own the data -- it's valid until the Cow goes out of scope), but in case the Cow contains Borrowed data, this lifetime is a restriction set by the data we refer to. We cannot have a Cow that refers to already freed memory, and rustc will let us know when that is possible by mentioning that the Cow outlives its 'a.

  • ToOwned is a trait that defines a method to convert borrowed data into owned data (by cloning it and giving us ownership of the new allocation, most likely). The type we receive from this method is an associated type on the trait, and its name is Owned (yep, the same name as the Cow variant, sorry). This allows us to refer to it in Owned(::Owned).

    To make this a bit more concrete, let's assume we have a Cow that's storing a &str (in the Borrowed case). The ToOwned implementation of str has type Owned = String, so <&str as ToOwned>::Owned == String.

  • ?Sized is a funny one. By default, Rust expects all types to be of a known size, which it expresses by having an implicit constraint on the Sized marker trait. You can explicitly opt-out of this by adding a "constraint" on ?Sized.

    The thing is: Not all possible types have a known size. For example, [u8] is an array of bytes somewhere in memory, but we don't know its length. In your application code you won't see a type like this directly, you'll see it behind references instead. And note: In Rust, the reference itself can contain the length. (See what I wrote above about slices!)

    But how does that relate to Cow again? You see, the B in Cow's definition is behind a reference: Once directly visible in the Borrowed variant, and the second type hidden in the ToOwned::Owned (which is of type Borrow). Since a Cow should be able to contain a &[u8], its definition needs to work for &'a B where B = [u8]. That in turn means need to say: "we don't require this to be Sized, we know it's behind a reference anyway" -- which is exactly what the ?Sized syntax does.

Discussion in the ATmosphere

Loading comments...