Loading DrugBank SQL Files
What is SQL?
Structured Query Language (SQL, sometimes pronounced “sequel”) is a domain-specific language (DSL) for working with relational databases. Different database vendors offer their implementation of the SQL standard; as such, there is usually non-overlapping functionality between these various implementations.
If you are unfamiliar with SQL or just need a refresher, there is a good online tutorial site that also allows you to run queries within your browser against a test dataset. For specific documentation regarding database setup and use, consult your database vendor's documentation.
DrugBank SQL support
DrugBank offers CSV-formatted data downloads with specific `.sql` files for loading the data into three SQL implementations: MySQL, PostgreSQL, and Microsoft SQL Server (MS SQL). If you use a different implementation, you may have to write custom logic to load the included CSV files.
Loading the Data
Loading the data manually
For each supported implementation, there will be a corresponding zip file available from a user's downloads portal. This zip file includes several CSV files (along with corresponding `.fmt` files for MS SQL) and four `.sql` files. These latter files include all the commands required to perform various operations and can be executed as SQL scripts from the database management system itself.
To load the data, the 'create_schema,' 'load_tables,' and 'add_constraints' files should be executed (in order). The final script, 'drop_tables,' removes all DrugBank tables from the database and can be used to clean the database before loading newer data. As such, the data can be added to any supported database type by executing each script. However, it is often desirable to automate this process; this is covered in the following sections.
A simple loading script
In this part of the guide, we focus on the most popular of our supported SQL implementations: MySQL and PostgreSQL. We provide a simple script in python 3 (written in version 3.10.2) that will accept either the starting zip file or a previously unzipped file directory. It optionally drops the database (for subsequent loads) and executes the scripts to create the schema, load the tables, and add all the key constraints:
For example, to load a MySQL database on your local machine with port 3306, you would run the following:
python3 </full/path/to/the/script.py> -z </path/to/zipfile> -t 3306 -u <user name> -p <user password> –-type MySQL
To see all of the supported command line options, along with their possible and default values, simply call the script with the help option (either '-h' or '--help'). If you are unsure of the port to specify, start with the default (3306 for MySQL and 5432 for PostgreSQL).
We will explore a few notable points for each database implementation below, which will be most valuable for users looking to implement custom scripts. If you plan to modify the provided script or implement your own custom script, it is recommended to read the implementation notes below. Otherwise, feel free to skip to the section on considerations.
A note on connections
Regardless of where the SQL server you are using lives (on the same machine or over a network connection), you must specify some parameters to connect. At a minimum, this will be the host address and exposed port, but it should also include a user account name and password. The provided script assumes a nonlocal server connection but should work if your SQL server is hosted locally.
Notable points - MySQL
While the MySQL loader is relatively straightforward, it does have some important aspects that are worth mentioning. The first is that, as opposed to other command line arguments, the password must be specified either:
- using the '-p' command, in which case the password must follow immediately after with no space in between (i.e., '-p<password>')
- using the '--password' command, in which case the flag must be followed by an equals sign and the password, again with no spaces (i.e., '--password=<password>')
The other point is that the script toggles two environment variables, 'local_infile' and 'FOREIGN_KEY_CHECKS.' Setting 'local_infile' to True allows MySQL to accept data from files not hosted on the same server as the SQL server. Setting 'FOREIGN_KEY_CHECKS' to False allows for missing foreign keys, which is required to load the data. The script resets each of these to their defaults following execution (regardless of whether an error occurred) and should be considered a best practice.
A MySQL recommendation
Although MySQL is common and freely available, we recommend using MariaDB as a MySQL-like implementation. The script will work with MariaDB and we have seen that certain functionality, such as adding the '--drop' flag, works better with MariaDB containers than MySQL containers (see the below section on Docker).
Notable points - PostgreSQL
The PostgreSQL loader contains some additional logic compared to the MySQL loader. Similar to MySQL, specifying a password in the command line interface differs from most other options. In PostgreSQL, the easiest way to do this is to set the 'PGPASSWORD' environment variable. Although it would be possible to set this globally, the script sets it and passes it as an environment variable for each new process spawned by the 'subprocess' module.
In addition, PostgreSQL does not support the 'EXISTS' keyword in the 'CREATE DATABASE' statement (as of version 15.1). Therefore, the more complicated statement 'SELECT 'CREATE DATABASE' ”<database>”' WHERE NOT EXISTS (SELECT FROM pg_database WHERE datname = '<database>')\gexec' must be used instead. This query uses a WHERE clause on the list of existing databases to filter a SELECT statement that would create the database. Also, note the use of double quotes around the database name in the initial CREATE statement, which is required for case sensitivity.
The largest difference is how PostgreSQL handles non-local input files. Rather than copying data using normal SQL commands (COPY FROM), it is required to use the `\copy` meta-command. There is a large function ('_fix_postgresql_load_file') that converts the supplied 'load_tables' script into a script that can run over a nonlocal database connection. The most notable point about the `\copy` meta-command is that the whole command should appear on a single line within the file as it is terminated by the newline character rather than a semicolon.
Other SQL implementations
Although we have covered only MySQL and PostgreSQL here, the pattern of creating the schema, loading tables, and then adding constraints should apply to other SQL implementations. Furthermore, the provided python code can be repurposed or modified to fit specific requirements.
The DrugBank database is fairly large (on the order of a few gigabytes, depending on the exact data included). As such, it is recommended to remove any local files that are no longer required after having been loaded. It is also the case that the large volume of data can lead to long execution times for the 'load_tables' SQL file.
For these reasons, you should consider how often to download/load new database versions from your download portal based on your specific use case. For applications that require the most up-to-date data possible, a daily cadence is reasonable. In contrast, for less sensitive applications, it might be possible to perform these operations on a weekly, monthly, or similar cadence.
A note on passwords
For simplicity, the provided script takes the user password as a command line input. In general, this is not a best practice. Different SQL implementations have mechanisms to use configuration files or environment variables in place of this. Depending on your setup, you may wish to alter the script to remove command line password passing and rely on one of these other mechanisms instead.
A note on permissions
When running the scripts, you may run into an error along the lines of “Access denied; you need (at least one of) the SUPER privilege(s) for this operation.” An error such as this indicates that the account specified in the script has insufficient privileges to create or modify tables within the database. If you run the script to load data into a local database, it should be sufficient to run using the 'root' user (be sure to use the right password). Alternatively, you could also grant your account super privileges (this should be covered in the documentation for your specific SQL implementation). If, instead, you are attempting to load data into a shared database, contact your administrator to discuss the best approach.
Using Docker to run your database
One common mechanism for setting up database instances is Docker, a popular application for deploying containerized applications. As an example, it is quite easy to get a version of MariaDB up and running from its official docker image. Adding a file named 'docker-compose.yml' to a directory with the following content:
version: '3.8' services: db: image: mariadb:10.9.5 ports: - 3306:3306 volumes: - ./data:/var/lib/mysql env_file: - .maria_env
and a '.maria_env' file to the same directory with the following content:
MARIADB_ROOT_PASSWORD: <your root password here> MARIADB_USER: <your user name here> MARIADB_PASSWORD: <your user password here>
where the username and password are whatever you wish them to be. These variables should be sufficient to spin up the container, but should you want to provide more customization, check this page. Once these files are created, you can create a container running MariaDB by running:
docker-compose up -d
This command will pull the image (if you do not already have it locally) and then build a container using the 'docker-compose.yml' file as instructions. You can also spin up a container without needing a compose file (see the MariaDB documentation on DockerHub); the compose file is useful when connecting your database to other services within the container.
As a quick note, the compose file above will map the container's MySQL file directory to a local subdirectory named 'data' (this is the 'volumes' declaration; the specified directory on your machine will be created if it does not already exist). This is useful for facilitating communication between your local file system and that of the container, and is useful if you need to load the files from within the container itself (see below). However, this line is not required and can be omitted if desired.
The provided script will work with Dockerized versions of MySQL and PostgreSQL - just use the correct host and exposed port (127.0.0.1 and 3306 if you copied the compose file above). If, however, you wish to execute the provided SQL scripts within the container, you must include all the files in the shared volume and then connect to the container itself (e.g., by using `docker exec -it`). This is useful if, for example, you do not have the corresponding command line client installed locally.
SQL is a powerful query language allowing users to query and analyze DrugBank data. This article provided an overview of the SQL support currently available within DrugBank, along with a sample script to load these data in MySQL and PostgreSQL.