In a previous article, we talked about the issues the legal industry is facing regarding the way data is structured and organized. In this article, we will be taking a detailed look at databases and how data is organized within them. As law firms are continuously aiming to build database infrastructures to leverage legal data and data analysis to its fullest potential, the knowledge of how databases operate is increasingly becoming a valuable skill.
What is a database?
A database is a structured collection of data stored electronically in a computer system. Data within most databases are organized in tables (rows and columns), just like it would be organized in an Excel spreadsheet. So what is the difference between an excel and an actual database? Although storing data in the same structural fashion (tables), there are multiple differences between the two:
- 1. Databases can store larger quantities of data compared to Excel (or other spreadsheet software). Excel is currently limited to storing a maximum of one million rows, whereas a database can store much more than that.
- 2. Excel is considerably slower. Excel first has to load all the data into the computer’s memory before it can process the data. The memory of a computer (RAM) is all the working processes of a computer, making it quicker to access software that is idle (i.e. Skype in windows taskbar), rather than turning it on and off. In most cases, the RAM is full of processes that we cannot see or do not use, and hence, the memory is not exclusive to just the spreadsheet software. Therefore, an Excel formula in a sheet that contains large quantities of data could take seconds to calculate the output or even result in an error message like this:
- 3. In databases, all we care about is the information being stored; you don’t care about formatting.
- 4. Databases provide a stable structure and are generally much more secure.
- 5. Databases can store a wide range of data types, whereas Excel usually stores just textual data and images.
Why ‘relational’ ?
A relational database organizes data into tables that can be linked—or related—based on data common to each table. Although multiple spreadsheets can be linked together in Excel, the options are logically limited.
The above example diagram shows a snippet of a client database in a law firm and each of the four tables focuses on a different theme of the clients’ contracts. Other tables might include information about the client, disputes, invoices, and so on. The above four tables could also be created in Excel, but the linkage between them, as mentioned before, would be limited. The main idea behind the architecture of relational databases is to be able to look for specific themes/topics rather than to always scrape data from a single unorganized table. More tables, although more difficult to be written and constructed, usually mean quicker access to data.
Notice the PK and FK icons within the tables – those indicate Primary Keys (PK) and Foreign Keys (FK). The PK is a unique data value, such as client ID or contract ID. The PK value can be indexed, meaning that another table accessing it does not need to scan through the whole table, but simply reads the respective row to find related data in the foreign table quickly. The FK is simply a reference to the PK in another table, linking the two tables together. If an FK value exists, a PK value linked to it cannot be deleted, ensuring consistency.
Where are databases written?
The most common way to create databases is through an RDBMS or Relational Database Management System such as Oracle, Microsoft SQL Server, and MySQL. These database systems use Structured Query Language (SQL) to create and manage database structures. It is important to remember that data stored properly according to a pre-defined database structure allows law firms to use data for other purposes like machine learning. Most machines or deep learning models learn from hundreds of thousands of data and with a proper database, we can simply create a .csv (table format) file rather than entering the data manually; tabular data is extremely important in AI and Data Science.
In next week’s tutorial, we will cover the basics of database management in PostgreSQL, a free and open-source relational database management system.