While working with iCloud recently I ran into a situation where I wanted to share a link to an item stored inside the application’s sandbox via iCloud.

Luckily, since OS X 10.7 NSFileManager has a nice API to create a public URL to an item inside the application’s ubiquity container suitable for sharing:

- (NSURL *)URLForPublishingUbiquitousItemAtURL:(NSURL *)url expirationDate:(NSDate **)outDate error:(NSError **)error;

The key point here is that the item has to be in the ubiquity container in order for this method to succeed. In our case, the item would be in the application sandbox. Luckily again, OS X 10.7 introduced another NSFileManager API that makes it easy to move an item to the ubiquity container:

- (BOOL)setUbiquitous:(BOOL)flag itemAtURL:(NSURL *)url destinationURL:(NSURL *)destinationURL error:(NSError **)error;

Both methods are documented as synchronous, which means that they will 1) block until completed and 2) make sure that the result and error are available upon return. This is not an API style I particularly like myself but it has the advantage of making things easy to follow, assuming that the calls are moved off the main thread to keep the UI responsive.

Keeping this in mind, a naive implementation could be an NSOperation subclass whose main method would look as follows:

- (void)main
{
	NSURL *mediaLocationInSandbox = ...

	NSString *filename = [[[NSUUID UUID] UUIDString] stringByAppendingPathExtension:@"png"];
	NSURL *mediaLocationInUbiquityContainer = [[[NSFileManager defaultManager] URLForUbiquityContainerIdentifier:nil] URLByAppendingPathComponent:filename];

	NSError *setUbiquitousError = nil;
	BOOL setUbiquitous = [[NSFileManager defaultManager] setUbiquitous:YES itemAtURL:mediaLocationInSandbox destinationURL:mediaLocationInUbiquityContainer error:&setUbiquitousError];

	NSError *mediaLocationInCloudError = nil;
	NSDate *expirationDate = nil;
	NSURL *mediaLocationInCloud = [[NSFileManager defaultManager] URLForPublishingUbiquitousItemAtURL:mediaLocationInUbiquityContainer expirationDate:&expirationDate error:&mediaLocationInCloudError];

	NSParameterAssert(mediaLocationInCloud != nil);
}

However, such an implementation would fail with mediaLocationInCloudError being set to:

Error Domain=NSCocoaErrorDomain Code=256 "The file “20ECBE26-41A5-4D79-81E3-63F5C13D1A02.png” couldn’t be opened." UserInfo=0x60000007ecc0 {NSURL=/Users/damien/Library/Mobile Documents/P97H7FTHWN~com~realmacsoftware~CloudSharing/20ECBE26-41A5-4D79-81E3-63F5C13D1A02.png, NSUnderlyingError=0x610000059170 "The operation couldn’t be completed. (LibrarianErrorDomain error 4 - The connection to the server was interrupted.)"}

This doesn’t look good at all. The error code 256 in the NSCocoaErrorDomain is NSFileReadUnknownError which is far from helpful in this case and the underlying error implies that things have gotten even worse underneath.
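Since the interesting part of such an error is usually buried under NSUnderlyingErrorKey, a small helper that flattens the chain makes logs easier to read. This is only a sketch: RMSErrorChainDescriptions is a hypothetical name of mine, not an API from Foundation, and it assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: flattens an error and its NSUnderlyingErrorKey chain
// into "Domain (code)" strings, outermost error first.
static NSArray *RMSErrorChainDescriptions(NSError *error)
{
	NSMutableArray *descriptions = [NSMutableArray array];
	while (error != nil) {
		[descriptions addObject:[NSString stringWithFormat:@"%@ (%ld)", [error domain], (long)[error code]]];
		// Follow the chain only if the underlying object really is an NSError.
		id underlying = [[error userInfo] objectForKey:NSUnderlyingErrorKey];
		error = [underlying isKindOfClass:[NSError class]] ? underlying : nil;
	}
	return descriptions;
}
```

Applied to the error above, this would list the NSCocoaErrorDomain error first and the LibrarianErrorDomain one second.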

My first idea was to use file coordination with the NSFileCoordinator API to make sure that every read and write were coordinated with other readers and writers. However it is worth noting two points:

  • NSFileManager seems to provide an NSFileCoordinator if it can detect that the calls are made outside of a coordinated environment. This is not technically documented but if a custom NSFilePresenter is not needed, file coordination should be taken care of by the file manager.
  • Using coordination doesn’t improve things and the API fails in the same way with the same error.

It is however important to note that after waiting an arbitrary amount of time (usually between 2 and 6 seconds) the second API call succeeds. The second important point is that the second API call can arbitrarily succeed no matter the ubiquitous state of the item (i.e. whether it has been uploaded or not, which is easy to check by reading the value of the NSURLUbiquitousItemIsUploadedKey property on NSURL).
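For reference, reading that property could look like the sketch below. RMSItemIsUploaded is a hypothetical helper name, and treating a failed lookup as "not uploaded" is my own choice rather than anything the API mandates.

```objc
#import <Foundation/Foundation.h>

// Returns YES only when the resource value could be read and is true.
// A failed lookup is reported through `outError` and treated as "not uploaded".
static BOOL RMSItemIsUploaded(NSURL *itemURL, NSError **outError)
{
	NSNumber *isUploaded = nil;
	BOOL didRead = [itemURL getResourceValue:&isUploaded forKey:NSURLUbiquitousItemIsUploadedKey error:outError];
	return (didRead && [isUploaded boolValue]);
}
```

Keep in mind that, as noted above, this value alone does not predict whether the publishing call will succeed.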

This smells like a race condition so let’s dive into the debugger and find out!


tl;dr -[NSFileManager setUbiquitous:itemAtURL:destinationURL:error:] is racy (or should I say “synchronous but not really”) and you should not assume the entirety of the system to be fully aware upon return that the item at the destination URL is actually a ubiquitous item. Things are even worse since making such an assumption leads a critical system daemon (librariand) to crash 100% of the time.

Our first step will be to set a breakpoint in -[NSFileManager URLForPublishingUbiquitousItemAtURL:expirationDate:error:] and see where this leads us.

-[NSFileManager URLForPublishingUbiquitousItemAtURL:expirationDate:error:]

After stepping over a few instructions we soon realize that there is not much going on here. A semaphore is created before calling the LBGetURLForPublishedItem function. This function takes a URL, a queue and a block, which implies that it performs work on a background queue and invokes the block on the given queue upon completion. Since a synchronous behavior is required here, wait is called on the semaphore after calling this function and the semaphore will likely be signaled in the completion block. This is a usual pattern used to make an asynchronous API behave like a synchronous one. The rest of the method deals with unpacking the results from the called function and returning them to the caller.
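The pattern itself can be sketched in a few lines. RMSFetchValueAsync below is a made-up stand-in for the private LBGetURLForPublishedItem (whose real signature I am not reproducing here), and the code assumes ARC.

```objc
#import <Foundation/Foundation.h>
#import <dispatch/dispatch.h>

// Hypothetical asynchronous API standing in for the private function:
// it does its work elsewhere and invokes `completion` on `queue`.
static void RMSFetchValueAsync(dispatch_queue_t queue, void (^completion)(NSString *value))
{
	dispatch_async(queue, ^{
		completion(@"result");
	});
}

// Synchronous wrapper: blocks the calling thread until the completion has run.
static NSString *RMSFetchValueSync(void)
{
	dispatch_semaphore_t semaphore = dispatch_semaphore_create(0);
	__block NSString *result = nil;

	dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
	RMSFetchValueAsync(queue, ^(NSString *value) {
		result = value;
		dispatch_semaphore_signal(semaphore);
	});

	// Waiting on the same serial queue the completion is delivered to would deadlock.
	dispatch_semaphore_wait(semaphore, DISPATCH_TIME_FOREVER);
	return result;
}
```

The wrapper is only as synchronous as the work the completion actually waits for, which is precisely what bites us later in this post.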

LBGetURLForPublishedItem

As said before, nothing exciting so let’s set a breakpoint in LBGetURLForPublishedItem. This function lives in the Librarian.framework (a system private framework) and by stepping through a few instructions we can notice that it quickly sets up an xpc connection with a daemon: librariand, a system daemon that lives under /usr/libexec/librariand.

Once the xpc connection with com.apple.librariand has been set up a message dictionary is created and two objects are inserted:

  • An integer of value 11 under the key LIBRARIAND_KEY_REQUEST (“Request Type”)
  • A string representing the path to the item in the ubiquity container that we want to publish under the key LIBRARIAND_KEY_PATH (“Path”)

Finally, the message is sent through the xpc connection by calling xpc_connection_send_message_with_reply. The reply parameter is a block of type xpc_handler_t. By using some debugging tips from a previous post we can read the memory at the block address, find the invoke function pointer in the block, disassemble it and set a breakpoint at the first instruction.

__LBGetURLForPublishedItem_block_invoke

Being typed as xpc_handler_t this function will take an xpc_object_t as second argument (remember that the first is the block itself). This object is the message reply and will usually be either a dictionary (xpc_dictionary_t) or an error.

Let’s go through the working case and the non-working one.

In the working case the object we get back is a dictionary as follows:

<OS_xpc_dictionary: dictionary[0x600000194290]: { refcnt = 1, xrefcnt = 1, count = 3, dest port = 0x0 } <dictionary: 0x600000194290> { count = 3, contents =
	"URL" => <string: 0x61000007acc0> { length = 86, contents = "https://www.icloud.com/download/documents/?p=01&t=BAJKBzUxo_qdYOZe8TEBzLJ38Ttb64dorQcE" }
	"Date" => <date: 0x610000058780> Sat Aug 31 12:13:16 2013 BST (approx)
	"Success" => <bool: 0x7fff7dfc5920>: true
}>

This looks like what we were expecting.

In the non-working case we get an xpc object that we cannot trivially print in the debugger but that looks a lot like an error. We can confirm this by printing its type:

(lldb) p (struct xpc_type_s *)xpc_get_type($rsi)
(struct xpc_type_s *) $0 = 0x00007fff72212e28

(lldb) p &_xpc_type_error
(void **) $1 = 0x00007fff72212e28

So yeah, we are getting an error. In order to figure out the error description (XPC_ERROR_KEY_DESCRIPTION) we could use an applier block to iterate through its keys and values but we can safely assume that it’s either XPC_ERROR_CONNECTION_INTERRUPTED or XPC_ERROR_CONNECTION_INVALID. We can probably trust the connection to be valid so it is likely to be an interruption.
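Outside of the debugger, the same check is straightforward to write against the public XPC API. The helper name RMSDescriptionForXPCReply is hypothetical; the XPC calls and constants are the documented ones.

```objc
#import <xpc/xpc.h>

// Returns a human-readable description for an XPC error reply,
// or NULL if the reply is not an error object.
static const char *RMSDescriptionForXPCReply(xpc_object_t reply)
{
	if (xpc_get_type(reply) != XPC_TYPE_ERROR) {
		return NULL;
	}
	// Error objects are dictionaries that expose a description string.
	// The well-known errors can also be compared by identity, for example
	// reply == XPC_ERROR_CONNECTION_INTERRUPTED.
	return xpc_dictionary_get_string(reply, XPC_ERROR_KEY_DESCRIPTION);
}
```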

_LBHandleServerReply

The reply block calls immediately into another function _LBHandleServerReply, passing the xpc object as first argument and a reply block as second.

This function simply checks for the LIBRARIAND_KEY_SUCCESS key in the dictionary and calls into the reply block with it if successful. If not, it unpacks the XPC error into a CF error via the function _LBCreateCFErrorFromXPCError:

Error Domain=LibrarianErrorDomain Code=4 "The operation couldn’t be completed. (LibrarianErrorDomain error 4 - The connection to the server was interrupted.)" UserInfo=0x61000007b0c0 {NSDescription=The connection to the server was interrupted.}

This is exactly the underlying error we are getting from NSFileManager.

By quickly inspecting the body of the reply block’s invoke function we see that it checks for a possible error, then retrieves the URL (LIBRARIAND_KEY_URL) and expiration date (LIBRARIAND_KEY_DATE) and calls back into NSFileManager, which returns to the caller.

This is all there is client side. In order to find out why the connection is interrupted we will have to attach to the librariand process in the debugger and see what’s going on over there.

librariand

Having attached to librariand, we can simply run the client side code and see if anything obvious could explain the connection interruption.

And yes, there is an obvious explanation to the interruption: librariand crashes!

We get an EXC_BAD_ACCESS at address 0x0 inside CFStringGetCString. If this wasn’t clear enough, the crashing instruction shows that we are trying to dereference NULL:

movq   (%r13), %rax

It is worth remembering the declaration of CFStringGetCString:

Boolean CFStringGetCString(CFStringRef string, char *buffer, CFIndex bufferSize, CFStringEncoding encoding);

Now, remember that CoreFoundation is open-source and we can have a peek at the implementation of CFStringGetCString in CFString.c.

After a check for a buffer size of 0 the function asserts that the first argument is indeed a string. This assertion macro expands into the following function, implemented in CFRuntime.c

__CFGenericValidateType(cf, __kCFStringTypeID)

The first step of this function is to validate that the type is not NULL. However, by inspecting the value of %rdi (the first argument) in the prolog of the offending function call we can clearly see that the CFStringRef is NULL.

Even though this explains the crash we still need to figure out how this function ended up being called with a NULL string. We can have a look at the stack trace at the moment of the crash in order to know where it has occurred.

Librariand stack trace at the moment of the crash

We are clearly missing function names here but thankfully LLDB is kind enough to let us set up symbolic breakpoints for these unnamed functions.

___lldb_unnamed_function50$$librariand

Stepping through ___lldb_unnamed_function50$$librariand, we can’t note anything very exciting: a few register values are moved around up to the point where it gets to that CFStringGetCString call. Given the function declaration the string is expected to be in %rdi. Following %rdi back to the start of the function we can see that its value originates in %rsi in the function prolog, that is the second argument of the ___lldb_unnamed_function50$$librariand function.

Let’s go to the stack frame above, into ___lldb_unnamed_function34$$librariand.

___lldb_unnamed_function34$$librariand

As per the x86-64 calling convention, observing the registers in the prolog we can see that the first argument is an xpc connection and the second one an xpc dictionary.

By inspecting the xpc connection we can find out the name of the receiver on the other side: com.apple.librariand.peer.0x7fe84be0bad0, likely to be our own process. The xpc dictionary will be the request that originated client side.

By stepping over a few instructions we can find the call to ___lldb_unnamed_function50$$librariand. Tracing %rsi up through the instructions we can find where its value originated. In this case it is the return value of another function call to UBItemCreatePublicURLForPath.

When UBItemCreatePublicURLForPath returns nil, the domain, code and localized description of an error (that has probably been returned by reference) are retrieved and ___lldb_unnamed_function50$$librariand ends up being called with "handle_publish_item_request" passed as first argument, the error domain as second, the error code as third and the error localized description as fourth.

The issue here is that the error reference is not populated by UBItemCreatePublicURLForPath when returning nil, so we end up passing a nil error domain to ___lldb_unnamed_function50$$librariand. This is not handled and leads to the crash in CFStringGetCString that we have previously observed.
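I obviously can’t see librariand’s source, but the kind of guard that would have avoided the crash is easy to picture. The sketch below is my own illustration, with a hypothetical RMSCopyCString wrapper; it is not what the daemon does.

```objc
#import <CoreFoundation/CoreFoundation.h>
#include <string.h>

// Copies `string` into `buffer` as UTF-8, falling back to a placeholder
// when the string is NULL instead of handing NULL to CFStringGetCString.
static Boolean RMSCopyCString(CFStringRef string, char *buffer, CFIndex bufferSize)
{
	if (buffer == NULL || bufferSize <= 0) {
		return false;
	}
	if (string == NULL) {
		// Leave something printable in the buffer for logging purposes.
		strlcpy(buffer, "(null)", (size_t)bufferSize);
		return false;
	}
	return CFStringGetCString(string, buffer, bufferSize, kCFStringEncodingUTF8);
}
```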

We need to find out the reason why UBItemCreatePublicURLForPath does not populate the error reference when failing to create a public URL in our particular use case.

UBItemCreatePublicURLForPath

UBItemCreatePublicURLForPath, in the Ubiquity.framework, only takes one argument: a path as an NSString.

It very simply calls into IPCCopyItemPublicURL. This function itself returns a dictionary, containing the url and expiration if it succeeds or the error if it fails. UBItemCreatePublicURLForPath simply unpacks these objects and returns them.

In our case, IPCCopyItemPublicURL fails to return a dictionary altogether (not even one containing an error), leading UBItemCreatePublicURLForPath itself to never populate its error by reference.

We need to figure out why IPCCopyItemPublicURL returns nil.

IPCCopyItemPublicURL

IPCCopyItemPublicURL takes a single argument: a path as an NSString. It then jumps straight to IPCSendCFMessageDictionarySimple.

IPCSendCFMessageDictionarySimple calls into IPCSendCFMessage, takes its return value, makes sure it’s a dictionary and returns.

IPCSendCFMessage

This function first retrieves an SRConnection and calls SRSendCFMessage with the path string as parameter. SRSendCFMessage itself jumps straight to SRSendCFMessageTimed where the CF object is packed into an xpc object via the _CFXPCCreateXPCMessageWithCFObject function.

Eventually, _SRSendMessage is called with the xpc object. In it, an xpc connection is created and a semaphore is waited and signaled around the sending of the message to simulate synchronicity.

<OS_xpc_connection: connection[0x7fbd43804b70]: { refcnt = 2, xrefcnt = e, name = com.apple.ubd, type = named, state = checked in, error = 0x0 mach = true, privileged = false, bssend = 0x5903, recv = 0x5703, send = 0x5a03, pid = 193, euid = 501, egid = 20, asid = 100004 } <connection: 0x7fbd43804b70> { name = com.apple.ubd, listener = false, pid = 193, euid = 501, egid = 20, asid = 100004 }>

From the name of the connection, we can find out the process on the other side of the connection: in this case it is ubd, a daemon that lives under /System/Library/PrivateFrameworks/Ubiquity.framework/Versions/A/Support/ubd.

The message dictionary sent through the connection is as follows:

<OS_xpc_dictionary: dictionary[0x7fbd43902bc0]: { refcnt = 1, xrefcnt = 2, count = 2, dest port = 0x0 } <dictionary: 0x7fbd43902bc0> { count = 2, contents =
	"ECF19A18-7AA6-4141-B4DC-A2E5123B2B5C" => <data: 0x7fbd43901210>: { length = 4096 bytes, contents = 0x62706c697374313513940000000000008012000000007f10... }
	"type" => <uint64: 0x7fbd41d09800>: 1073741866
}>

The data contains the path string and the type will be of interest further on.

Next, we can set a breakpoint in the connection reply handler block and observe the reply message content.

The reply dictionary in the working case looks as follows:

<OS_xpc_dictionary: dictionary[0x7fd900610910]: { refcnt = 1, xrefcnt = 1, count = 2, dest port = 0x0 } <dictionary: 0x7fd900610910> { count = 2, contents =
	"contents" => <dictionary: 0x7fd900513070> { count = 1, contents =
		"ECF19A18-7AA6-4141-B4DC-A2E5123B2B5C" => <data: 0x7fd900516980>: { length = 4096 bytes, contents = 0x62706c69737431351388000000000000801200000000d273... }
	}
	"result" => <int64: 0x7fd900513520>: 0
}>

whereas in the case that leads to a crash of librariand:

<OS_xpc_dictionary: dictionary[0x7fd9005168d0]: { refcnt = 1, xrefcnt = 1, count = 1, dest port = 0x0 } <dictionary: 0x7fd9005168d0> { count = 1, contents =
	"result" => <int64: 0x7fd90040c050>: 0
}>

Clearly, the reply returns the same result in both cases but the contents are missing in the second one.

Next, the semaphore is signaled at the end of the reply handler block, giving execution back to _SRSendMessage.

The rest of the instructions dispose of a few objects and the function eventually returns. We are then back in SRSendCFMessage and eventually in IPCSendCFMessage without any nil check on the way.

Whatever happens on the ubd side, it is expected to always populate the reply dictionary with an object for the contents key.

We will have to attach the debugger to ubd and find out what is going wrong over there.

ubd

ubd itself doesn’t crash so it’s not as easy to find an instruction to break on as it was with librariand.

We will then have to apply the good old “guess and hope for some luck” approach. We do know that we are messaging ubd through an xpc connection so whatever it does, it will have to use some xpc functions to message us back. A good start would be to set a breakpoint on xpc_create_dictionary and look at the name of the functions in the backtrace when the breakpoint is hit.

Quite luckily, one function that could be a good candidate stands out: _SRTransportHandleRequest. We can then set a breakpoint on it and inspect the requests coming through. Not surprisingly there are a lot of them! Thus, we need a way to filter out the requests we are not interested in. Remembering the dictionary that was sent through the connection from librariand, we can update the breakpoint with a condition that the type in the request dictionary be 1073741866.

SRTransportHandleRequest

With ubd attached and our conditional breakpoint set up we will only break in the debugger when the appropriate message from librariand is received.

Upon receiving the request from librariand, the xpc message is unpacked and a call to yet another unnamed function is made: ___lldb_unnamed_function149$$ubd. This function is called with an SRConnection and the path string. This function returns a dictionary on the stack.

In order to better understand the return value of this function we will consider 3 cases:

  • A URL is successfully created for the ubiquitous item and replied back to librariand.
  • A URL cannot be created and an error is sent back to librariand. We will simulate this by turning off any network connectivity making sure a URL cannot be created for a legit reason.
  • A URL cannot be created and no error is sent back to librariand. This is the race we’ve been observing.

In order to test the first case we will simply stick a sleep(10) between the calls to -[NSFileManager setUbiquitous:itemAtURL:destinationURL:error:] and -[NSFileManager URLForPublishingUbiquitousItemAtURL:expirationDate:error:].

In the first case, the returned dictionary is as expected:

{
    expiration = 399743378;
    url = "https://www.icloud.com/download/documents/?p=01&t=BAJ_K9mgDi2YrwWRjFAB5dQwLqaZAKSuPwvE";
}

Next, an xpc object is created from the dictionary and added to the reply dictionary as contents. Finally, the message is sent back through the xpc connection and everyone’s happy.

In the second case, the returned dictionary is also as expected:

{
    error =     {
        code = 2;
        domain = kCFErrorDomainCFNetwork;
        kCFGetAddrInfoFailureKey = 8;
    };
}

Just as in the first case, the dictionary containing the error is packed up as contents in the reply dictionary and sent to librariand through the xpc connection. Note that the useful network error ends up being wrapped by Foundation into a less than useful one. This is quite unfortunate.

Error Domain=NSCocoaErrorDomain Code=256 "The file “4AF039A7-FD55-496C-BD80-04C865361AC1.png” couldn’t be opened." UserInfo=0x60000007ac40 {NSURL=/Users/damien/Library/Mobile Documents/P97H7FTHWN~com~realmacsoftware~CloudSharing/4AF039A7-FD55-496C-BD80-04C865361AC1.png, NSUnderlyingError=0x618000245be0 "The operation couldn’t be completed. (kCFErrorDomainCFNetwork error 2 - The operation couldn’t be completed. (kCFErrorDomainCFNetwork error 2.))"}

In the third case however, no dictionary is returned at all. This leads to the reply dictionary not containing any contents and librariand receiving the unexpected response we observed earlier.

Since we would expect ___lldb_unnamed_function149$$ubd to return a dictionary containing an error in any failing case, let’s set a breakpoint on it and try to follow back to the source of the problem.

___lldb_unnamed_function149$$ubd

___lldb_unnamed_function149$$ubd also takes an SRConnection and a path string and quickly calls into ___lldb_unnamed_function1010$$ubd which itself returns the url, expiration and error.

In the first case described before, a url and expiration are returned. In the second case an error is returned, whereas the third case returns neither a url nor an error.

___lldb_unnamed_function1010$$ubd

This function seems to do much of the actual work in publishing the item URL. A bunch of properties, such as the server URL and an AuthToken, are first retrieved in order to create an HTTP request. Assuming all goes smoothly it will end up creating a POST HTTP request, creating a read stream and scheduling the request on a runloop. The response is then read, parsed, checked for errors and the content is returned to the caller.

The creation of the HTTP request is conditioned by the return value of yet another function: ___lldb_unnamed_function186$$ubd. This function is expected to retrieve a dictionary of properties for an item at a given ubiquity path.

{
    Checksum = 011c490058dcc81d6fef74f5dc1f3a4dcbeb73f39e;
    IsPackageRoot = 0;
    ItemName = "30610207-C807-4BB9-86F7-E0D048EF033D.png";
    ItemSize = 992318;
    ItemStatus = 1051648;
    LastEditor = "MacBook Pro Damien";
    ModTime = 1376642747;
    ModeBits = 33188;
    TotalBytesToDownload = 992318;
}

Based on the state of the network connection, this dictionary might contain some additional upload/download error.

In the broken third case described above, no dictionary is returned at all. We can quickly examine ___lldb_unnamed_function186$$ubd and find out why this is happening.

___lldb_unnamed_function186$$ubd

This function seems to perform a few sanity checks on the given path. The code paths of the working and broken cases seem to diverge based on the return value of yet another function: ___lldb_unnamed_function371$$ubd.

I can’t say for sure what is going on in this function but it seems that it attempts to retrieve an item from a sqlite database. Two SQL statements stand out in particular:

select item_id from shadow_table where parent_id = ? and (local_filename = ? or filename = ?) and ((state & ?) == 0);

select item_id from item_table where parent_id = ? and (local_filename = ? or filename = ?) and ((state & ?) == 0);

Again, I’m not sure what precisely these statements are attempting to retrieve nor the identity of the database they are querying. However it is obvious that, in the broken case, a ubiquity item cannot be found for the given path.

I think we can conclude there is a race condition between -[NSFileManager setUbiquitous:itemAtURL:destinationURL:error:] and -[NSFileManager URLForPublishingUbiquitousItemAtURL:expirationDate:error:]. The first method does make the item ubiquitous returning YES synchronously when succeeding. However upon return we cannot assume that the item at this path has been fully registered by ubd.

Again, these are guesses but I believe that the item will be fully registered by ubd only after attempting (not necessarily succeeding) to upload it. We have no idea about the process ubd follows when attempting to upload items nor the delay between the registration of an item as ubiquitous and the attempt to upload it to the iCloud server. With that in mind, the best advice would be to track the upload status of the ubiquity item by means of an NSMetadataQuery and retrieve a URL for publishing only after the item has been fully uploaded to the iCloud server (by tracking the NSMetadataUbiquitousItemIsUploadedKey and NSMetadataUbiquitousItemIsUploadingKey attributes).
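A minimal sketch of that approach could look like the following. RMSUploadObserver is a hypothetical class of mine, the code assumes ARC, and it searches NSMetadataQueryUbiquitousDataScope on the assumption that the file sits at the root of the ubiquity container as in the naive example (a file under Documents would need NSMetadataQueryUbiquitousDocumentsScope instead).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: watches one file in the ubiquity container and calls
// `handler` once when the item reports as uploaded.
@interface RMSUploadObserver : NSObject
- (id)initWithFilename:(NSString *)filename handler:(void (^)(NSURL *itemURL))handler;
- (void)start; // Call on the main thread; NSMetadataQuery needs a run loop.
@end

@implementation RMSUploadObserver
{
	NSMetadataQuery *_query;
	void (^_handler)(NSURL *);
}

- (id)initWithFilename:(NSString *)filename handler:(void (^)(NSURL *itemURL))handler
{
	self = [super init];
	if (self != nil) {
		_handler = [handler copy];
		_query = [[NSMetadataQuery alloc] init];
		[_query setSearchScopes:@[NSMetadataQueryUbiquitousDataScope]];
		[_query setPredicate:[NSPredicate predicateWithFormat:@"%K == %@", NSMetadataItemFSNameKey, filename]];
	}
	return self;
}

- (void)start
{
	NSNotificationCenter *center = [NSNotificationCenter defaultCenter];
	[center addObserver:self selector:@selector(queryDidChange:) name:NSMetadataQueryDidFinishGatheringNotification object:_query];
	[center addObserver:self selector:@selector(queryDidChange:) name:NSMetadataQueryDidUpdateNotification object:_query];
	[_query startQuery];
}

- (void)queryDidChange:(NSNotification *)notification
{
	// Freeze the results while we iterate over them.
	[_query disableUpdates];
	for (NSMetadataItem *item in [_query results]) {
		NSNumber *isUploaded = [item valueForAttribute:NSMetadataUbiquitousItemIsUploadedKey];
		if ([isUploaded boolValue] && _handler != nil) {
			void (^handler)(NSURL *) = _handler;
			_handler = nil; // Make sure the handler only ever fires once.
			[_query stopQuery];
			[[NSNotificationCenter defaultCenter] removeObserver:self];
			handler([item valueForAttribute:NSMetadataItemURLKey]);
			return;
		}
	}
	[_query enableUpdates];
}
@end
```

The handler would then be the place to hop onto a background queue and call -[NSFileManager URLForPublishingUbiquitousItemAtURL:expirationDate:error:] with the URL it receives. The observer has to be kept alive by its owner until the handler has fired.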


So, to summarize, we set a breakpoint in -[NSFileManager URLForPublishingUbiquitousItemAtURL:expirationDate:error:] and stepped through the Librarian and Ubiquity frameworks. This led us to attach to the librariand process, which we found was crashing because of a non-populated error for an API that failed to return a result. Tracing the issue further led us to attach to the ubd process and figure out that a registered item couldn’t be retrieved by ubd for the ubiquitous path.

Here’s the list of breakpoints that were added in order to debug this.

Breakpoints

This was a fun debugging journey, just a bit longer than I expected!

I have obviously filed a radar about this: rdar://problem/14772373.