Serialized data structures
Serialization is the process of converting data objects into a format that can be stored on a computer system and reconstructed later. Crucially, serializing data preserves the original type of each object. That's to say, we can serialize dictionaries, lists, integers, or strings into a file, and when we later deserialize that file, those objects will still maintain their original data types. This matters because if we simply wrote script objects to a plain text file, we couldn't feasibly reconstruct them into their appropriate data types. As we know, reading a text file reads in all data as a string.
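To make the type-preservation point concrete, here is a minimal sketch using the json module (covered in detail later in this section); the evidence dictionary is a hypothetical example, not data from the book:

```python
import json

# A dictionary containing several native Python types
evidence = {"case": 1024, "examiner": "jdoe", "hashes": ["abc123", "def456"]}

# Serialize to a JSON string, then deserialize it back
restored = json.loads(json.dumps(evidence))

# The original types survive the round trip: the integer is still
# an integer and the list is still a list, unlike a plain text dump
print(type(restored["case"]))
print(type(restored["hashes"]))
```

Had we instead written `str(evidence)` to a text file, reading it back would yield only a string that we'd have to parse ourselves.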
XML and JSON are the two common examples of plain text-encoded serialization formats. You may already be accustomed to analyzing these files in forensic investigations. Analysts familiar with mobile device forensics will likely recognize application-specific XML files containing account or configuration details. Let's look at how we can leverage Python to parse XML and JSON files.
We can use the xml module to parse markup languages, including XML and HTML. The following book.xml file contains the details about this book. If you've never seen XML data before, the first thing you may note is that it's similar in structure to HTML, another markup language, where content is surrounded by opening and closing tags, as follows:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<authors>Preston Miller &amp; Chapin Bryce</authors>
<chapters>
<element>
<chapterNumber>1</chapterNumber>
<chapterTitle>Now for Something Completely Different</chapterTitle>
<pageCount>30</pageCount>
</element>
<element>
<chapterNumber>2</chapterNumber>
<chapterTitle>Python Fundamentals</chapterTitle>
<pageCount>25</pageCount>
</element>
</chapters>
<numberOfChapters>13</numberOfChapters>
<pageCount>500</pageCount>
<publisher>Packt Publishing</publisher>
<title>Learning Python for Forensics</title>
</root>
For analysts, XML and JSON files are easy to read because they're in plain text. However, a manual review becomes impractical when working with files containing thousands of lines. Fortunately, these files are highly structured, and even better, they're meant to be used by programs.
To explore XML, we'll use the xml.etree.ElementTree module, which will parse the data and allow us to iterate through the children of the root node. In order to parse the data, we must specify the file being parsed. In this case, our book.xml file is located in the same working directory as the Python interactive prompt. If this weren't the case, we would need to specify the file path in addition to the filename. If you're using Python 2, please make sure to import print_function from __future__. We use the getroot() function to access the root-level node, as follows:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('book.xml')
>>> root = tree.getroot()
With the root element, let's use the find() function to search for the first instance of the authors tag in the XML file. Each element has different properties, such as tag, attrib, and text. The tag property is a string that describes the data, which in this case is authors. Attributes, if present, are stored in the attrib dictionary; these are the name-value pairs assigned within a tag. For example, we could have created a chapter tag with attributes:
<chapter number="2" title="Python Fundamentals" count="20" />
The attributes for this object would be a dictionary with the keys number, title, and count and their respective values, which are always read in as strings. To access the content between the tags (for example, chapterNumber), we would need to use the text attribute.
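A brief sketch of these properties in action, using the ElementTree fromstring() function to parse the hypothetical chapter tag inline rather than from a file:

```python
import xml.etree.ElementTree as ET

# A hypothetical chapter tag with attributes; note that XML attributes
# are space-separated and their values must be quoted
data = '<chapter number="2" title="Python Fundamentals" count="20" />'
chapter = ET.fromstring(data)

# tag holds the element name; attrib holds the attributes as a dictionary
print(chapter.tag)
print(chapter.attrib)
```

Because this tag is self-closing, there is no content between opening and closing tags, so its text attribute is None.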
We can use the findall() function to find all occurrences of a specified child tag. In the following example, we're looking for every instance of chapters/element found in the dataset. Once found, we can use list indices to access specific tags within the element parent tag. In this case, we only want to access the chapter number and title in the first two positions of the element. Look at the following example:
>>> print(root.find('authors').text)
Preston Miller & Chapin Bryce
>>> for element in root.findall('chapters/element'):
... print('Chapter #{}'.format(element[0].text))
... print('Chapter Title: {}'.format(element[1].text))
...
Chapter #1
Chapter Title: Now for Something Completely Different
Chapter #2
Chapter Title: Python Fundamentals
There are a number of other methods we can use to process markup language files using the xml module. For the full documentation, please see https://docs.python.org/3/library/xml.etree.elementtree.html.
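One such method worth highlighting is iter(), which recursively matches a tag at any nesting depth rather than requiring an explicit path. A small sketch, using an abbreviated inline version of the book data:

```python
import xml.etree.ElementTree as ET

# An abbreviated version of book.xml, parsed from a string
xml_data = '''<root>
<chapters>
<element><pageCount>30</pageCount></element>
<element><pageCount>25</pageCount></element>
</chapters>
</root>'''
root = ET.fromstring(xml_data)

# iter() walks every descendant with the given tag, however deeply nested,
# so we can total the per-chapter page counts without specifying a path
total = sum(int(pc.text) for pc in root.iter('pageCount'))
print(total)
```

This is handy when the same tag appears at multiple levels of a document and we want every occurrence.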
With XML covered, let's look at that same example stored as JSON data and, more importantly, how we use Python to interpret that data. In the book.json file that follows, note that each key, such as title, authors, or publisher, is separated from its associated value by a colon. This is similar to how a dictionary is structured in Python. In addition, note the square brackets for the chapters key and how the embedded dictionary-like structures within them are separated by commas. In Python, this chapters structure is interpreted as a list containing dictionaries once it's loaded with the json module:
{
"title": "Learning Python Forensics",
"authors": "Preston Miller & Chapin Bryce",
"publisher": "Packt Publishing",
"pageCount": 500,
"numberOfChapters": 13,
"chapters":
[
{
"chapterNumber": 1,
"chapterTitle": "Now for Something Completely Different",
"pageCount": 30
},
{
"chapterNumber": 2,
"chapterTitle": "Python Fundamentals",
"pageCount": 25
}
]
}
To parse this data structure using the json module, we use the loads() function. Unlike our XML example, we need to first open a file object and read its contents before passing them to loads(). In the next code block, the book.json file, which is located in the same working directory as the interactive prompt, is opened and its contents are read and passed to the loads() function. As an aside, we can use the dump() function to perform the reverse operation and convert Python objects into the JSON format for storage. As with the XML code block, if you're using Python 2, please import print_function from __future__:
>>> import json
>>> jsonfile = open('book.json', 'r')
>>> decoded_data = json.loads(jsonfile.read())
>>> print(type(decoded_data))
<class 'dict'>
>>> print(decoded_data.keys())
dict_keys(['title', 'authors', 'publisher', 'pageCount', 'numberOfChapters', 'chapters'])
The module's loads() method reads the JSON file's string content and rebuilds the data into Python objects. As you can see in the preceding code, the overall structure is stored in a dictionary with key and value pairs. JSON is capable of storing the original data type of the objects. For example, pageCount is deserialized as an integer and title as a string object.
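The reverse operation mentioned earlier is just as simple. A minimal sketch using dumps(), which returns a JSON-formatted string (the related dump() function writes directly to an open file object instead); the book dictionary here is an abbreviated stand-in for the full data:

```python
import json

# An abbreviated stand-in for the book data
book = {"title": "Learning Python for Forensics", "pageCount": 500}

# dumps() serializes the Python object to a JSON-formatted string;
# indent makes the output human-readable
encoded = json.dumps(book, indent=4)
print(encoded)

# Deserializing restores the original objects and their types
decoded = json.loads(encoded)
print(type(decoded["pageCount"]))
```

Note that pageCount comes back as an integer, confirming the round trip preserves types.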
Not all the data is stored in the form of dictionaries. The chapters key is rebuilt as a list. We can use a for loop to iterate through chapters and print out any pertinent details:
>>> for chapter in decoded_data['chapters']:
... number = chapter['chapterNumber']
... title = chapter['chapterTitle']
... pages = chapter['pageCount']
... print('Chapter {}, {}, is {} pages.'.format(number, title, pages))
...
Chapter 1, Now for Something Completely Different, is 30 pages.
Chapter 2, Python Fundamentals, is 25 pages.
To be clear, the chapters key was stored as a list in the JSON file and contained nested dictionaries for each chapter element. When iterating through the list of dictionaries, we stored and then printed values associated with the dictionary keys to the user. We'll be using this exact technique on a larger scale to parse our Bitcoin JSON data. More details regarding the json module can be found at https://docs.python.org/3/library/json.html. Both the XML and JSON example files used in this section are available in the code bundle for this chapter. Other modules exist, such as pickle and shelve, which can be used for data serialization. However, they won't be covered in this book.