Posted: 23rd October 2019
Revised: 13th March 2021
***Note that SnapChat have now moved onto a different method of storing data. Arroyo.db has been in use since late 2020. It is a SQL database housing Protobuf blobs. It may well be the focus of a blog post soon***
I'm going to start this blog entry by admitting that I don't have all the answers on this file. I have figured out most of it, enough to write my own parser at least, but work is still ongoing to fill in the remaining blanks.
What follows is all that I know about this file format at this time. However, this may all be moot now: although the same file format is still in use, the retention period for messages now appears to be virtually nil.
The File Itself
iOS extractions over the last year or so may include the 'ChatConversationStore.plist' file, which SnapChat uses to store all message data. Although it's called a PLIST, it isn't formatted like either a BPList or an XML PList, and I couldn't find any tool to parse the data. The exception was UFED PA, which presented me with a very user-unfriendly tree structure, certainly nothing that I could give to my investigators. (I believe that Axiom is now capable of parsing this file too, but I was so invested in this myself that I decided to continue.)
And so I began deconstructing the file, simply trying to work out how the data was structured. I knew it wasn't encrypted as I could read clear text within the file, but the way the data was organised seemed very confusing.
What I found was a way of organising data that I'd not come across before. That's not to say it is unique to SnapChat, but if it is used elsewhere, I haven't seen it.
It's also worth mentioning that during my research, I found several variations between versions of SnapChat, but the general premise of how the data is stored is the same. More about this though at the end.
The header to the file in this case is TSAF, which I couldn't find any references for except for other people with the same questions as me. And I honestly still don't understand the X bytes that follow it. So I'm going to start this journey further along the file and start by explaining some of the easier things I worked out, and work my way up to the more complicated parts.
The first thing you may notice is that most clear text strings are preceded by a 0x08 and are followed/terminated by null (0x00). This is fairly consistent but with a few exceptions which I will explain later. You can see in the image below how SCChatV3ConversationStore and conversations both have the 08 and 00 header and termination. But you may also see that SCChatConversationV3 is preceded by a 1E even though it is still terminated with a 00.
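As a minimal sketch of that observation, the Python below reads one marked, null-terminated string. The function name and the choice of UTF-8 for decoding are my own assumptions for illustration; the exceptions mentioned above are not handled.

```python
STRING_MARKER = 0x08      # plain text string
CLASS_NAME_MARKER = 0x1E  # class name (a class definition follows it)


def read_marked_string(buf: bytes, pos: int):
    """Read a marker byte plus a null-terminated string starting at pos.

    Returns (marker, text, offset of the byte after the terminating 0x00).
    """
    marker = buf[pos]
    if marker not in (STRING_MARKER, CLASS_NAME_MARKER):
        raise ValueError(f"unexpected marker 0x{marker:02X} at offset {pos}")
    end = buf.index(0x00, pos + 1)  # position of the terminating null
    # Assumption: the text is UTF-8; undecodable bytes are replaced.
    text = buf[pos + 1:end].decode("utf-8", errors="replace")
    return marker, text, end + 1
```

Calling it on the bytes `08 63 6F 6E 76 65 72 73 61 74 69 6F 6E 73 00` gives the marker 0x08, the text "conversations" and the offset of the next unread byte.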
I went through the entire document and found that this was the same throughout. I also noticed that the strings that started with a 1E were class names and were followed by numerous seemingly meaningless characters. These "meaningless" characters are actually the definition of the objects in the class and are separated into 8 byte segments.
Immediately following the name of the class, there is an int16 (2 bytes) which defines the number of objects within the class. I then ignore the following 9 bytes, as I do not currently know what they are (and it doesn't seem to matter), and move on to sectioning off the 8 byte structures.
Of these 8 bytes, I can only find the importance of the first byte, and you can see that in most cases, the first byte is 00, with the occasional 01, 04 or 09 thrown in.
This image shows the Class Name in red at the top of the image. The blue highlight is the number of objects in the class (bytes 23 00, read as a Little Endian int16 = 0x0023 = 35 Objects). The green lines separate the 8 byte structures (you can see that there are 35 of them) and the cyan highlights the first byte of the structure.
The first byte of each of these 8 byte structures is a header: a definition of the type of data to expect. Imagine that each of the 8 byte structures is a variable; the first byte then tells you what datatype that variable is and, by extension, how many bytes are required.
00 = To Be Announced (for want of a better term)
01 = Boolean which requires 1 byte.
04 = Int32 (Little Endian) which requires 4 bytes.
09 = Int64 which requires 8 bytes.
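To make the class definition layout concrete, here is a rough Python sketch that takes the 2 count bytes and the run of 8 byte structures and returns the declared type of each variable. It assumes the caller has already located both pieces in the file; the function and table names are mine, and only the four header values listed above are known.

```python
# First byte of each 8 byte structure -> (type name, size in bytes).
# A size of None means the size is only known once the value is read.
FIELD_TYPES = {
    0x00: ("TBA", None),
    0x01: ("Boolean", 1),
    0x04: ("Int32", 4),
    0x09: ("Int64", 8),
}


def decode_field_descriptors(count_bytes: bytes, descriptors: bytes) -> list:
    """Return one (type name, size) tuple per 8 byte structure."""
    count = int.from_bytes(count_bytes, "little")  # 23 00 -> 35
    if len(descriptors) < count * 8:
        raise ValueError("not enough descriptor bytes for the stated count")
    fields = []
    for i in range(count):
        chunk = descriptors[i * 8:(i + 1) * 8]
        # Only the first byte is interpreted; the other 7 are ignored here.
        fields.append(FIELD_TYPES.get(chunk[0], (f"unknown 0x{chunk[0]:02X}", None)))
    return fields
```

Unknown header bytes are reported rather than rejected, which seems the safer choice while parts of the format are still unexplained.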
So looking at the 35 objects that make up SCChatConversationV3 object:
This newly created object needs to be saved in a list of class objects which we'll call "CreatedClasses". As this is the first object we have created, it is saved at Index 0.
*This is important because 2 similar PLists may create Classes in different order and therefore need parsing differently*
Once the object has been defined, you will see the byte 1F. This indicates that we need to use a Class Object that has already been defined and the following byte tells us which object to create, as per the index in the CreatedClasses list.
This image shows that we are using a previously defined class (0x1F highlighted in orange) and that the object is at index 0 in the CreatedClass list (0x00 highlighted in purple).
So we now know that we have an instance of SCChatConversationV3. The next X many bytes will all be values within the class instance. How many bytes depends on the datatypes that have been/will be defined.
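A small Python sketch of the "CreatedClasses" idea follows. The dictionary layout and the function names are illustrative assumptions; the only facts taken from the file are that definitions are stored in order of appearance and that 0x1F is followed by a one byte index into that list.

```python
def register_class(created_classes: list, name: str, field_types: list) -> int:
    """Store a freshly parsed class definition and return its index."""
    created_classes.append({"name": name, "fields": field_types})
    return len(created_classes) - 1


def read_instance_header(buf: bytes, pos: int, created_classes: list):
    """Read 0x1F plus the index byte; return (class definition, next offset)."""
    if buf[pos] != 0x1F:
        raise ValueError(f"expected 0x1F at offset {pos}")
    index = buf[pos + 1]  # index into the list of created classes
    return created_classes[index], pos + 2
```

Because the index depends purely on the order in which classes were defined, the list has to be rebuilt for every file rather than hard coded.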
Filling in Class Instances
We now know that we have a bunch of empty variables that need to be actual values, but all we know about variable one is that the datatype is TBA ("To Be Announced" - I should say that that is a term that I have opted to use for simplicity's sake; I'm sure there is a real name for it that I don't know).
What I mean by "To Be Announced" is that the Class definition is unaware of what the datatype is going to be; this may be for example because the variable can hold different types of data or it may be any number of other reasons. The way we deal with it is always the same though.
In the case of TBA Variables, we use the first byte of the actual value to determine the datatype.
This image shows the class definition for SCChatConversationV3 (outlined in dark blue) and the creation of Class at index 1 (outlined in light blue).
The first value is highlighted in red and is found to have a TBA value of x00 in the class definition. In this case, we take the first byte 08 as the header. We have already learned that x08 is a text string and is terminated by 00. Therefore our first variable is everything between the 08 and 00.
Once we have completed value 1, we refer to the next object in the class definition (pink) and find it is also a x00. Again we find the header 08 so variable 2 is the string that follows right up until the next 00.
Once we have completed value 2, we refer back to the class definition (blue) which is again x00. This time, we find that the value is also x00. This literally means there is no value, it's simply not being used in this case. For example, this location may be where the attachment filename is stored but there is no attachment in this case.
The fourth value (green) is back to x00 and is found to be a string value again.
The fifth value in the definition (yellow) is x01 which is a boolean value, requiring only a single byte. We find that the value here is x0D which equates to TRUE.
And so it goes on.
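The walk-through above can be sketched in Python roughly as follows. This is a deliberately partial reader: it only covers the cases just described (a TBA variable holding a string or nothing, and a Boolean), it ignores padding, and the function names are my own.

```python
def read_value(buf: bytes, pos: int, declared: int):
    """Read one value; declared is the first byte of its 8 byte structure.

    Returns (value, offset of the next unread byte).
    """
    if declared == 0x00:            # TBA: the value announces its own type
        header = buf[pos]
        pos += 1
        if header == 0x00:          # no value stored in this slot
            return None, pos
        if header == 0x08:          # null-terminated text string
            end = buf.index(0x00, pos)
            return buf[pos:end].decode("utf-8", errors="replace"), end + 1
        raise NotImplementedError(f"TBA header 0x{header:02X} not handled in this sketch")
    if declared == 0x01:            # Boolean: 0x0D is TRUE, 0x0E is FALSE
        flag = buf[pos]
        if flag not in (0x0D, 0x0E):
            raise ValueError(f"unexpected Boolean byte 0x{flag:02X}")
        return flag == 0x0D, pos + 1
    raise NotImplementedError(f"declared type 0x{declared:02X} not handled in this sketch")


def read_instance(buf: bytes, pos: int, declared_types: list):
    """Fill in every variable of a class instance, in definition order."""
    values = []
    for declared in declared_types:
        value, pos = read_value(buf, pos, declared)
        values.append(value)
    return values, pos
```

Fed the declared types 00, 00, 00, 00, 01 and a matching run of bytes, it returns two strings, an empty slot, another string and a Boolean, in the same order as the five values described above.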
The Data Types defined in the Class Definition are:
00 = To Be Announced: see below chart
01 = Boolean: 0D or 0E (True | False)
04 = Int32: 4 bytes (Little Endian), e.g. 05 00 00 00
09 = Int64: 8 bytes (Little Endian), e.g. A0 05 00 00 00 00 00 00
"To Be Announced" DataTypes
Object pointer: <Any object type> stored at location 2 in the Object List
Object pointer: 02 01 20 = <Any object type> stored at location 288 in the Object List
Int32 (Little Endian): 05 00 00 00
String pointer: String object at location 8 of the String List
String pointer: 05 02 30 = String object at location 560 of the String List
String: 08 44 45 46 41 55 4C 54 00 = DEFAULT
Dictionary: 09 00 03 00 00 00. Note that the number that follows the Dictionary definition is the number of objects in the dictionary. In this example, 3 objects.
Array: 0A 00 02 00 00 00. Note that the number that follows the Array definition is the number of objects in the array. In this example, 2 objects.
Int16 (Little Endian)
Int64 (Little Endian): A3 C2 3A E9 69 01 00 00
D9 D8 58 3F
00 00 00 00 00 00 08 40
30 2E 39 34 39 20 30 2E 32 33 35 20 30 2E 33 34 31 = 0.949 0.235 0.341
FC A9 51 FE 53 13 D7 41
*No example to show*: This is basically a Dictionary with a name. So think KeyValuePair<String,Dictionary>
Class definition: 1E 53 43 43 68 61 74 43 6F 6E 76 65 72 73 61 74 69 6F 6E 56 33 00 00 00 23 00 00 80 = 0x23 Objects (35 Objects)
Class instance: New SCChatConversationV3 instance. Note that byte two refers to the index of the created class.
Another important thing to know about parsing these files is the developers' penchant for rounding up bytes. I was confused for several days by the number of null bytes sprinkled seemingly at random throughout the file. For example:
These two example values are both Doubles. Both are defined using the first byte 0x16 and both require 8 bytes to be read (Little Endian).
But you will notice that one requires 16 bytes to display the value whereas the other only requires 14 bytes. This is because there are 2 fewer null values.
So how do we know that value one doesn't end at 14 bytes and D7 and 41 are part of the next value? Viewing it like we are above is actually blinding us to what we need to see.
These are the same two Double values when viewed in a standard Hex Viewer. We can see that the 8 bytes that make up the values both start on the first byte of their respective rows and then go on for their 8 bytes. The varying number of nulls is used to pad out the values so that the value will fit neatly on a row.
This can be observed consistently throughout the file for many different datatypes. Basically any type that has a fixed number of bytes will be made to display neatly by applying padding between the header and the actual value. This doesn't always mean starting a new row. But it does appear to mean that there has to be enough room for the full value to fit on a single row AND that the value can start on an even byte.
Static length values may never start on an odd (red) byte unless they fit wholly into that single byte (such as a NULL, BOOL or Int8).
2 Byte values such as an Int16 must start on an even (green) byte and have room to fit both bytes on the same row.
4 Byte values such as an Int32 must start on an even (green) byte and have room to fit all 4 bytes on the same row.
8 Byte values such as an Int64 must start on an even (green) byte and have room to fit all 8 bytes on the same row.
If a value would naturally fall on an odd byte, or cannot fit wholly on the same row, it is padded with nulls so that it will start on an even byte or on a new row.
You can see how each of the values highlighted above has a header of either x12 or x16, both of which introduce 8 byte values.
The first value (pink) only has 3 bytes remaining after the header before a new line. So the 3 bytes are left blank and the 8 byte integer starts on the new line.
The second, third and fourth values (red, blue and green) only have 7 bytes remaining after the header before a new line. So all 7 are left blank and the 8 byte integer starts on the new line.
The fifth value (cyan) naturally falls at the start of a new line but if it was immediately after the header byte, it would overlap the second half of the row by one byte, so instead, the value starts at the mid-way point.
Finally, the sixth value (yellow) only has 5 bytes available until the mid-point and so it begins at the mid-way point instead.
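One rule that reproduces all of the cases just described is ordinary alignment: a fixed-size value starts at the next offset that is a multiple of its own size. That is my reading of the pattern rather than anything confirmed, so treat the Python sketch below as an assumption. The function name is hypothetical, and the row offsets in the comments are inferred from the descriptions above (a 16 byte row, with the header at offset 12, 8, 0 and 2 respectively).

```python
def value_start(header_offset: int, size: int) -> int:
    """Offset at which a fixed-size value begins, given its header's offset.

    Assumption: values are aligned to a multiple of their own size, and the
    gap between header and value is filled with null bytes.
    """
    first_free = header_offset + 1  # byte immediately after the header
    return (first_free + size - 1) // size * size  # round up to multiple of size


def padding(header_offset: int, size: int) -> int:
    """Number of null bytes between the header and the value."""
    return value_start(header_offset, size) - (header_offset + 1)


# Header at row offset 12: 3 bytes left on the row, value starts on the next row.
# Header at row offset 8:  7 bytes left on the row, value starts on the next row.
# Header at row offset 0:  value starts at the mid-way point (offset 8).
# Header at row offset 2:  5 bytes until the mid-way point, value starts there.
```

Single byte values get no padding under this rule, which matches the note that a NULL, BOOL or Int8 may sit on any byte.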
If this seems like an odd and wasteful way to do things, I would have to agree. Maybe there is some logic there that I am not seeing, but there seems to be an abundance of zeros which are there for no reason except formatting.
There are two (well, four) types of Pointers, and these took a while to figure out.
Basically, every time a string or an object is created, it is added to a list which can be referenced later on.
So in practical terms, when this section is parsed:
The strings SCChatV3ConversationStore, conversations and SCChatConversationV3 are processed as described earlier but are also added to a list of strings, i.e. SCChatV3ConversationStore at index 0, conversations at index 1 and SCChatConversationV3 at index 2.
Later on in the file, if any of those strings is required again, instead of including the entire string, which takes x many bytes, we can instead use a pointer which takes considerably less.
Pointers to STRING objects are defined using either x05, followed by a single byte (int8) to define the index, or x06 which uses two bytes (int16) to define the index.
So if the string "conversations" appeared again later in the file, instead of taking 15 bytes to show 08 63 6F 6E 76 65 72 73 61 74 69 6F 6E 73 00, the file would instead use just 2 bytes to define the object as a Pointer (x05) and the index position (x01). To point to the string "SCChatConversationV3" the file would contain 05 02.
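In Python, the string list and the one byte pointer form might look like the sketch below. The function names are mine, and only the x05 form with a single index byte is covered; the two byte x06 form is left out because I would have to guess at its byte order.

```python
def remember_string(strings: list, text: str) -> int:
    """Add a parsed string to the string list and return its index."""
    strings.append(text)
    return len(strings) - 1


def resolve_string_pointer(buf: bytes, pos: int, strings: list):
    """Resolve an x05 pointer; return (string, offset of next unread byte)."""
    if buf[pos] != 0x05:
        raise ValueError(f"expected 0x05 at offset {pos}")
    index = buf[pos + 1]  # single byte index into the string list
    return strings[index], pos + 2
```

With the three strings from the start of the file remembered in order, the bytes `05 01` resolve to "conversations" and `05 02` to "SCChatConversationV3".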
As well as Strings, Objects can also be pointed to and are stored in a separate list. Objects cover a much wider range of data types, so Dictionaries, Arrays, Doubles*, Singles and Int64s are all added to the object list. Smaller objects such as Int8, Int16, Boolean etc. are so small that there is no benefit to using pointers.
*There are actually two types of Doubles identified. One that IS added to the object list and one that IS NOT added. 0x14 and 0x16 respectively.
Object Pointers have a header of x02 for int16 indexes or x03 for int32 indexes.
Dictionaries are defined using the marker 09 and are followed by an int32 (after rounding) that defines the number of objects in the dictionary. This is then followed by the first object's Value, then the first object's Key. Then the second Value and second Key and so on.
This image shows the definition of a Dictionary (red) followed by 2 empty bytes (the result of rounding) followed by 4 bytes which define the number of objects in the dictionary (Blue).
The values (green and light green) both come before their respective keys (shown in pink and purple). In both cases, the Values have a header of x0F, which is an int8 (and therefore only requires the next byte).
The keys are both pointers to the string list at positions 8 and 26.
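Here is a hedged Python sketch of reading a dictionary like the one in that image. It only handles the two value kinds that appear there (x0F int8 values and x05 string pointers), it treats the int8 as unsigned, and it assumes the count is padded to a 4 byte boundary; all of the function names are hypothetical.

```python
import struct


def align(pos: int, size: int) -> int:
    """Round pos up to the next multiple of size (assumed padding rule)."""
    return (pos + size - 1) // size * size


def read_small_value(buf: bytes, pos: int, strings: list):
    """Read an x0F int8 or an x05 string pointer; return (value, next offset)."""
    header = buf[pos]
    if header == 0x0F:                 # int8, read here as unsigned
        return buf[pos + 1], pos + 2
    if header == 0x05:                 # pointer into the string list
        return strings[buf[pos + 1]], pos + 2
    raise NotImplementedError(f"header 0x{header:02X} not handled in this sketch")


def read_dictionary(buf: bytes, pos: int, strings: list):
    """Read an x09 dictionary; return (dict, offset of next unread byte)."""
    if buf[pos] != 0x09:
        raise ValueError(f"expected 0x09 at offset {pos}")
    pos = align(pos + 1, 4)            # skip the rounding bytes
    (count,) = struct.unpack_from("<i", buf, pos)  # Little Endian int32
    pos += 4
    result = {}
    for _ in range(count):
        value, pos = read_small_value(buf, pos, strings)  # Value comes first
        key, pos = read_small_value(buf, pos, strings)    # then its Key
        result[key] = value
    return result, pos
```

The detail that is easy to get wrong is the order: each Value is read before its Key, so the pair has to be swapped when it is stored.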
Arrays are basically the same as dictionaries but do not have keys so are just a list of values.
They are defined using the marker x0A and followed by an int32 that defines the number of objects. That is then followed by each array object until all expected objects are accounted for.
Finalising The Parse
Once the entire file was parsed, I had a list of objects similar to how UFED is able to display this file.
Using known test data, I was able to identify which node was the sender, which node was the timestamp etc. Great! That means that I can simply get my program to grab the appropriate nodes and insert them into a table. And it worked perfectly.
...Until I tried to parse a different file.
The version of the file is listed near the bottom of the file and I have files which are both version 3 and 6. Both parse perfectly using the information found above, but where the sender may be node 20 in one file, sender is node 22 in the next. This is true not only on files shown as different versions, but also on files shown as the same version(?). So out of the 5 files I had that were all version 6, I identified 4 slightly different schema...
Some of the nodes can easily be identified programmatically. Others are more difficult and so for now, I have addressed it by having a list of known schema which are automatically tested to see if any work. Should it fail, I have built in the option to allow the user to create their own schema. I may or may not put even more work into this. It seems kind of pointless at this moment in time if the results that are recoverable are minimal.
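The "list of known schema" approach could be sketched in Python along these lines. Everything specific here is hypothetical: the timestamp node numbers, the shape of a schema and the plausibility check are invented for illustration and are not what my parser actually uses.

```python
# Hypothetical schemas: each maps a field name to a node index.
KNOWN_SCHEMAS = [
    {"sender": 20, "timestamp": 21},
    {"sender": 22, "timestamp": 23},
]


def schema_fits(nodes: list, schema: dict) -> bool:
    """Very rough plausibility check (an assumption, not a real rule)."""
    try:
        sender = nodes[schema["sender"]]
        timestamp = nodes[schema["timestamp"]]
    except IndexError:
        return False
    sender_ok = isinstance(sender, str) and sender != ""
    timestamp_ok = isinstance(timestamp, int) and not isinstance(timestamp, bool)
    return sender_ok and timestamp_ok


def pick_schema(nodes: list, schemas: list = KNOWN_SCHEMAS):
    """Return the first schema that fits, or None so the user can supply one."""
    for schema in schemas:
        if schema_fits(nodes, schema):
            return schema
    return None
```

Returning None rather than raising leaves room for the fallback described above, where the user builds their own schema.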
Finally, once the results are in a table, they can be filtered, reordered and selected/deselected before being output to an HTML report.
SnapChat made the odd choice to store data on iOS devices in a PList file instead of SQLite as they do on Android. What's more bizarre is their choice to make it nothing like any other PList in use on the device (that I have found anyway).
Deconstructing the file was a challenge that started way back in May 2019. After a few weeks of not getting very far I turned my attention to ArtEx, only to come back to SnapChat again a few weeks ago. There was lots of trial and error for some of the more unusual points of the file and while I may still not understand the choice by the coders to do things a certain way, I'm fairly confident that my interpretation is correct, based solely on the fact that it works. Consistently.
Hopefully you can find some use in this tool and it isn't made redundant too soon!
You can find my FREE SnapChat Message parser "Spoopy" in the Software section of the site.