What is the deal with structured data and why should firms care?

The ‘legaltech boom’ in recent years has, for the most part, been driven by a new approach towards data and its management. Various AI initiatives and products in the legal sector attempt to work with labeled and structured data to assist lawyers with document drafting and due diligence. However, one of the issues that law firms are still attempting to solve is how to ensure that the data they receive and produce is properly structured.

What does that mean? What is the difference between ‘structured’ and ‘unstructured’ data, and what are examples of each within the legal context? What problem is structured data solving?

What is structured and unstructured data?

In simple terms, structured data is data that follows a certain data model. Such a model has a predefined structure and order, and every piece of data is precisely labeled, making the data easy to access in the future. Such data is typically stored in databases. Unstructured data, on the other hand, does not follow any predefined structure and its values are not labeled, so no relationship between the data values is identified (e.g. pure text, audio recordings, or scanned files). Hence, MS Word and PDF files are examples of unstructured data, yet they are still the most common ways of producing textual data. In the graph below, we can compare the linear growth of structured data with the exponential growth of unstructured data in recent years.

[Figure: structured data vs. unstructured data. Source: IDC]

Some of the most common types of unstructured data:

  • Text files: word processing files, spreadsheets, presentations, emails.
  • Email: although email does contain metadata (e.g. “to”, “from”, “date / time” and “subject”), the plain text field, containing the most substantial information, is largely unstructured.
  • Websites: similar to email, social network sites contain metadata such as “date”, “number of likes”, “location” etc., but the text fields remain unstructured.
  • Business applications: MS Office documents, PDFs and similar. Again, besides date and title, the text fields remain unstructured.

As another example, consider the following snippet of a contract drafted purely in a text editor (e.g. MS Word):

Although a rather short piece of text, it contains various pieces of data (as highlighted), and although it may not seem like it, all of it is a prime example of unstructured data. As mentioned earlier in this article, computers do not natively understand natural languages (e.g. English), as all characters are eventually transformed into a sequence of 1’s and 0’s (binary code). Therefore, although we might be able to find the word ‘date’ in the above document through the ‘ctrl+F’ (or any other) search function, the same would not work if we searched for ‘company’, as the computer cannot work out on its own that Fresh Produce Delivery is a company name, or that Ltd. is a company type. For this reason, we can call the above snippet ‘unstructured’: the data within it is neither labeled nor sorted, which makes it hard to search for particular entities that the computer itself does not recognize.
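The limitation described above can be sketched in a few lines of Python. This is only an illustration: the sentence in `contract_text` is a hypothetical stand-in for the contract snippet (its wording is an assumption), and the date pattern is one hand-written rule, not a general solution.

```python
import re

# Hypothetical contract sentence, invented for illustration only.
contract_text = (
    "This Agreement is dated August 17, 2020 and is made by "
    "Fresh Produce Delivery Ltd., registered in England and Wales."
)

# A literal search (what ctrl+F does) only matches the characters typed in.
literal_hit_date = "date" in contract_text.lower()        # matches inside "dated"
literal_hit_company = "company" in contract_text.lower()  # the word never appears

# To pull out the date itself, a human has to describe its shape by hand.
date_pattern = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")
match = date_pattern.search(contract_text)
date_match = match.group(0) if match else None

print(literal_hit_date, literal_hit_company, date_match)
```

The text contains a company name, yet a search for ‘company’ finds nothing, because nothing in the file labels "Fresh Produce Delivery" as one. The pattern for the date only works because a person supplied the structure that the document itself lacks.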

So how can we transform the above data into structured data?

To store data systematically, following a clear predefined order, we should start with relational databases. Relational databases use a structure that allows us to identify and access data in relation to other data in the database. Data is usually stored in tables, with every value held in a column under its respective label. For those who wish to familiarise themselves with relational databases, knowledge of SQL (Structured Query Language), the standard language for defining and querying relational databases, will be highly beneficial (tutorials on SQL coming soon).

Consider the following example:

Contract ID | Date            | Company Name           | Company Type | Location | Company No. | Address
0           | August 17, 2020 | Fresh Produce Delivery | Ltd.         | E&W      | 044160753   |

Compared to the plain text version above, we now have our values entered into a table. When dealing with thousands of contracts, we are able to sort our data in all kinds of ways (e.g. by date or company no.) or find all companies containing ‘Delivery’ in their name. The main advantage of structured data is therefore the easy access to data that it provides.
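As a rough sketch of what such a table looks like in practice, the example below uses Python's built-in sqlite3 module with an in-memory database. The table and column names are assumptions, rows 1 and 2 are invented for illustration, and the company number and address columns are left out for brevity. Dates are stored in ISO format (YYYY-MM-DD) so that sorting them as text also sorts them chronologically.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Table and column names are assumptions chosen for this sketch.
conn.execute(
    """
    CREATE TABLE contracts (
        contract_id   INTEGER PRIMARY KEY,
        contract_date TEXT,
        company_name  TEXT,
        company_type  TEXT,
        location      TEXT
    )
    """
)

rows = [
    (0, "2020-08-17", "Fresh Produce Delivery", "Ltd.", "E&W"),
    # The two rows below are hypothetical and exist only to make the query interesting.
    (1, "2019-03-02", "Hypothetical Logistics", "LLP", "E&W"),
    (2, "2021-01-11", "Sample Delivery Services", "Ltd.", "E&W"),
]
conn.executemany("INSERT INTO contracts VALUES (?, ?, ?, ?, ?)", rows)

# Find every company with 'Delivery' in its name, oldest contract first.
query = """
    SELECT company_name
    FROM contracts
    WHERE company_name LIKE '%Delivery%'
    ORDER BY contract_date
"""
delivery_companies = [row[0] for row in conn.execute(query)]
print(delivery_companies)
```

Because every value sits in a labeled column, the question "which companies have ‘Delivery’ in their name?" becomes a one-line filter on `company_name` instead of a search through free text.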


Another example of structured data is JSON (JavaScript Object Notation), a format commonly used by websites to structure data so that it can be exchanged easily between the server and the browser. Consider the below example of the above contract expressed in JSON:

JSON comes in especially handy when trying to sort or retrieve information from a website holding an enormous amount of information. In our example, we could simply look for the word “delivery” within [response][contents][companyName]. Keep in mind that while SQL is a language for defining and querying relational databases, whose engines handle searching, sorting and filtering for us, JSON is simply a format for storing and exchanging data and cannot run any search on its own. However, JSON is one of the formats in which database data can be stored or exported.
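The lookup described above can be sketched with Python's built-in json module. The path response → contents → companyName comes from the text; the remaining field names and the overall shape of the document are assumptions made for illustration.

```python
import json

# Hypothetical JSON document; only the response/contents/companyName path
# is taken from the article, the other keys are assumed.
raw = """
{
  "response": {
    "contents": {
      "contractId": 0,
      "date": "August 17, 2020",
      "companyName": "Fresh Produce Delivery",
      "companyType": "Ltd.",
      "location": "E&W"
    }
  }
}
"""

data = json.loads(raw)  # parse the text into nested Python dictionaries

# Go straight to the labeled field instead of scanning the whole document.
company_name = data["response"]["contents"]["companyName"]
mentions_delivery = "delivery" in company_name.lower()

print(company_name, mentions_delivery)
```

Note that the searching is done by the Python program, not by JSON itself: the format only guarantees that the company name can be found under a known label.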

A structured approach to data storage not only makes data easier to access but also enables the machine to search through it quickly: because the structure is clear, the machine can be directed where to look for a given query and does not need to go over all the available data.
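One way to see this "directed" search is to ask a database how it plans to answer a query. The sketch below uses SQLite's EXPLAIN QUERY PLAN statement through Python's sqlite3 module; the table, the index name `idx_company_name` and the generated company names are all assumptions for illustration, and the exact wording of the plan differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (contract_id INTEGER PRIMARY KEY, company_name TEXT)"
)
# 1,000 made-up rows so there is something to search through.
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?)",
    [(i, f"Company {i}") for i in range(1000)],
)


def query_plan(sql, params=()):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    # The last column of each row holds the human-readable plan step.
    return " ".join(row[-1] for row in rows)


lookup = "SELECT * FROM contracts WHERE company_name = ?"

# Without an index, the engine has to scan every row of the table.
plan_without_index = query_plan(lookup, ("Company 500",))

# An index on the labeled column tells the engine where to look.
conn.execute("CREATE INDEX idx_company_name ON contracts (company_name)")
plan_with_index = query_plan(lookup, ("Company 500",))

print(plan_without_index)
print(plan_with_index)
```

The first plan reports a scan of the whole table, while the second refers to the index, which is the database equivalent of being told which shelf the file is on rather than checking every folder.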

How does this transformation affect the legal sector?

Nowadays, when talking about structured data within the legal context, the term eDiscovery is often used to refer to ‘identifying, collecting and producing electronically stored information (ESI) in response to a request for production in a lawsuit or investigation’. At the moment, most of the data within the legal sector has to be collected and identified manually. Although AI models that search for and label data automatically exist, they are never 100% accurate and, more importantly, creating and training such models usually takes much longer than manually curating the documents. Lawyers, therefore, often work with some kind of database software that allows data to be entered manually, while seeking assistance from AI models that speed the drafting process up a little (e.g. clause prediction in contracts). Paradoxically, for AI to label our data easily, we must first, in most cases, label and structure the data ourselves to ensure the highest precision (more on AI and how it learns here).

However, although law firms are paying slightly more attention to data structures, a large share of documents is still unstructured, with many following no structural rules at all. Besides the use of MS Word, many files are still being scanned, which removes any machine-readable text layer from the document. Although software that converts image text to machine-readable text exists, the recovered data typically shows some loss of quality or accuracy. The ultimate goal of structuring all data is to find any information quickly, without having to navigate through multiple folders and files (whether digital or physical), while being able to manage (e.g. sort) data in multiple ways. Needless to say, this is extremely important in the legal context, where retrieving data quickly and efficiently to assist clients is at the core of firms’ business practice.

Nowadays, trainees are often handed the task of manually curating unstructured data files. However daunting that task might seem, it not only leaves firms with better-structured data, but also teaches future lawyers about proper data management, a skill that will be greatly appreciated in the future.

Structured vs Unstructured Data comparison
