Most finance, stock market, and signal analysis software projects deal with the time-series domain extensively. There is a lot of specialized time-series databases that make an engineer's life easier when dealing with data whose primary feature is being timestamped. If you haven't ever dealt with these, trying to decide which solution out of many to choose can be daunting.
When I had this problem choosing a time-series database for my algorithmic trader project, I hated to see all the effort go to waste (yeah, I'm really good at doing that donkey with two haystacks thing) and decided to document the process in this post. This should be helpful for you if you're just starting with specialized time series databases and would like input from people who had the same problem.
Why and What
Let's define the use case so that you can compare it to yours and decide if the results are going to be useful for you.
I need to be able to use the database with an asynchronous Python backend, be able to deploy a containerized version to a VM running Dokku, write many datapoints from multiple clients and select data by datetime intervals and various custom tags (for example a session ID that the data belongs to), or currency tags since I will be storing datapoints like stock exchange trades.
Considering the above, my database shortlist boiled down to TimescaleDB vs InfluxDB, both excellent products with wide adoption and corporate backing.
This is by no means an expert-level tutoring material for someone who wants comprehensive insight into TimescaleDB or InfluxDB. It's more of an insight into a thought process of how you eyeball a solution without any previous experience with the tech behind it, and probably a starting point of sorts if you want to get going with a proper time-series database for an early-stage project and don't have a lot of time to spend analyzing all the alternatives.
The source of my data is a generator function that produces "ticks". A tick is a data point representing a trade of a financial asset that happened at some point in time and that has some data associated to it, usually the price at which the trade was executed and the volume of the asset. For example, if someone bought two bitcoins at 6500 US dollars per bitcoin at January 1 05:00 2020, the asset of this rather hipster trade would be BTCUSD, the volume would be 2, and the price point 6500. Such data is a typical return value of many live APIs that provide access to various markets worldwide. My generator for these ticks looks like this:
Let's point out some important properties of the datapoints yielded by this generator:
- Data is emitted at constant intervals of 0.1 seconds to infinity.
- The price oscillates around the initial value of 10.0, between 7.0 and 13.0, in a sinusoid. This is made so that aggregation results make sense and show that aggregations are working right.
- At T0+7:00 to T0+20:00, data points are omitted to emulate incomplete data which is often the result of unreliable networks and other hiccups in production systems.
- Volume of each data point is constant and there are other metadata about the trade, like asset label and session ID.
Part 1: Install and Deployment
It's no use trying out a system that you can't deploy, so let's start with how to connect a DB container to an app in production context.
TimescaleDB is deployed to Dokku just like any Postgres database: with the
dokku-postgres plugin. There are two tweaks to be made. First is specifying a custom image that contains TimescaleDB:
The second tweak is fixing a certificate issue mentioned on Github. You need to copy certificates from another, standard Postgres container created with
dokku postgres:create as described in the issue link above. After this, you can use the DSN string to connect to the database as per standard Dokku Postgres setup process.
There is no actively developed plugin for InfluxDB for Dokku. What I did was to use a ready-made image for InfluxDB 2.0 and deploy it as a separate Dokku application. Then, Docker options can be specified for Dokku to set up a docker network link for any app that wants to talk to the influxDB container.
One way to do this is to use the
--link Docker option that will pass a link argument whenever the service named (this is basically what Dokku service plugins like
Since linking containers is a deprecated flag in Docker, another way to do this is using bridge networks, a Dokku command set updated in version 0.20+, to connect containers to a separate network along the lines of:
I didn't really like the TimescaleDB certificates hack; someone researching this from scratch might lose a lot of time figuring out why their TimescaleDB setup doesn't work before discovering the issue. Having a ready-to-go Dokku plugin for TimescaleDB beats the need to deploy and setup a separate application for InfluxDB, but this is a point for TimescaleDB only if you're using something in the ballpark of Dokku. Production deployment category ends in a tie for these two.
Part 2: Local Development
docker-compose to run local development. Service definitions turn out to be simple for both contenders:
Getting information on having InfluxDB to persist data via Docker volumes was a bit tricky and included for posterity; see Community link. For TimescalDB this is trivial as it's an extension over Postgres.
Local Development: Results
This is a clean-cut and expected TimescalDB win in my book because it needed much less time to figure out, mostly simply due to being Postgres based.
Part 3: Writing Many Datapoints
Both databases contain extensive tooling for batch writes. Batch writing is a complex and interesting use case for a multitude of online and offline workflows like ETL (extract-transform-load), with a lot of stuff to go wrong and lots on configuration to get right. What I have in mind as a goal of this comparison isn't very conducting to batch writes since I want data to be available immediately upon acquisition from a third-party API, and unpredictable network waits can negate any speed boosts from writing data in bulk. Therefore, I'm going to only focus on getting many singular write events to work concurrently and on developer experience, i.e. how much effort it is to get a working solution.
Writing Many Datapoints: TimescaleDB
Getting writes to work with TimescaleDB consists of two steps, with the Python quickstart as primary source:
- Figuring out how to create and use a hypertable (which is basically a time-partitioned SQL relation, the core data structure of TimescaleDB);
- Translating the
psycopg2based solution in the Quickstart into async implementation.
Creating a hypertable that contains tick datapoints goes like this:
This is adapted from the TimescaleDB Quickstart and was relatively straightforward to figure out, with two extra steps beyond typical Postgres initialization where you don't have any migrations and just recreate raw database (by connecting to
template1, a service DB) and its tables:
- First, I need to perform TimescaleDB-specific initialization by calling
create_hypertableand providing it with table name and the column that contains the time-series index, the timestamp of your data points.
- Next, I create an UNIQUE index constraint to prevent duplicates.
The asynchronous test for write is more of a concurrency scaffolding than actual code that writes to database:
What happens here is:
- I create a producer coroutine that uses a generator (not shown here for brevity) to produce fake ticks timestamped every 0.1 second, with datapoint value (which is the "price" column) gradually changing in a sinusoidal movement, and a gap in timestamps between T+7 minutes and T+20 minutes, to simulate missing data.
- I create a consumer coroutine that will receive the fake tick value (of
dicttype) and write it to database using its own connection to prevent concurrency clashes. The consumer will do an upsert operation, courtesy of the UNIQUE index I created earlier, with LWW (last write wins) rules about what to do with conflicting data.
- I then use an async queue to tie one producer and 90 consumer coroutines into a workflow that simulates concurrent load on the database. I can hope to achieve a "good enough" throughput without trying too hard here.
The amount of coroutines executed concurrently needs to stay below 100 because that is Postres' default connections limit. Optimizations like changing the database defaults are beyond our scope here, so we'll have to make do with this.
Writing Many Datapoints: InfluxDB
Instead of the run-of-the-mill SQL database and table creation of TimescaleDB, InfluxDB is normally set up via a web interface onboarding that creates initial organization, user, and bucket. The onboarding process creates a token to authorize requests to the DB with. It was a bit of a mind twister to get it to execute automatically; my initial attempt was to run
influx setup while
influxd was running in the background, and then bring
influxd to foreground. This works but you need to insert something like
sleep 25 in order to let influxd initialize properly. This isn't remotely how things should be done when working with dockerized stack.
A better solution is to call
api/v2/setup and provide it with a predefined token, which isn't really shown in the docs:
From here on, implementation is a relatively smooth sailing with an occasional hiccup like having to init
write_api inside the thread to avoid concurrent write problems:
The main thing to know about concurrent writes to InfluxDB are:
- There is no actively developed async library to talk to it so we need to resort to running the coroutines as threads via
.write_api()without arguments would result in batch writes; we use the
ASYNCHRONOUSoption here to preclude waiting for write result (since we won't be dealing with errors anyway).
Writing Many Datapoints: Results
This is weak InfluxDB win on execution time and a tie on developer experience. Switching to batch execution for InfluxDB is as easy as changing one identifier; if you're not familiar with async execution in Python, you might want to brush up on that, but otherwise, for TimescaleDB, it was nice to be able to just use tested and tried asyncpg, and for InfluxDB, the line protocol and clean implementation of tags seem very natural to use. Or maybe it's just me, I love tags.
Part 4: Reading Aggregated Data
To test reading the timestamped datapoints, I wanted to implement the basic function that a lot of financial software might have: aggregating the tick data in our datapoints into OHLCV candles. A candle is a data structure that contains, at a predefined time step, the price of an asset at the start of an interval between two step points, prices at the highest and lowest points within the interval, closing price, and sum of sizes of all ticks that belong to the interval. I gave myself around half a day's worth of time to research and implement this for each database.
Reading Aggregated Data: TimescaleDB
This is the all the code that's needed to construct an OHLCV aggregation:
I have included some typical constraints that come in handy in real code, like filtering by session ID. The aggregation results can be read directly from rows, which is, again, basically how you would use asyncpg with a normal
Postgres backend. Research and implmentation took a fraction of the time alloted to the task, which bodes well for DX going forward with the product in a larger project.
This printout shows several important points about our implementation:
- aggregation with the most amount of buckets (the 1-minute intervals) takes the most time to compute;
- the printed results for 30-minute intervals correctly show OHLC results oscillating in a sinusoidal manner, just the way the data was generated;
- the V column shows identical volumes for buckets except for the first bucket, where the volume is smaller because the ticks generating function emulates missing datapoints.
Reading Aggregated Data: InfluxDB
In this part I ran into an unpleasant surprise. In the time alloted for the task, I couldn't figure out a proper custom aggregation expression written in InfluxDB's query language, Flux. The non-finished query looks something like this:
The idea seems to be to either union separate aggregations or write your own custom aggregation but both getting the logic right and figuring out whatever is going on with the Flux language implementation examples proved to be not possible within the given time bounds of half a day's work.
I'm sure there is rational reasoning behind needing to learn a new language in order to be able to use InfluxDB 2.0 but that's beyond the scope of this comparison. It's certainly possible to just use the older versions for current projects in production.
Note: for versions 1.8 and lower, this read task would probably result in an implementation similar to TimescaleDB since InfluxDB <1.8 supports an SQL-like query language.
Reading Aggregated Data: Results
In the context of our target of a quick and good enough implementation, this results in TimescaleDB win due to significantly more complexity involved in doing the same task with InfluxDB.
Disclaimer: this is not a rigorous benchmark, see considerations in the "Why and What" section above.
My research for my specific needs resulted in a win for TimescaleDB from the developer experience (DX) point of view. I would argue that DX accounts for what matters the most at the early stages of project development. It's pretty old to say that you want to iterate quickly and be productive from the get go (who doesn't, regardless of project stage or size?), but what really makes the difference here is a developer's luxury to afford to be sure that the new library or product won't throw them curveballs as they start using it in more advanced ways. In my eyes, TimescaleDB, empowered by Postgres' maturity, does exactly that. This little test drive allowed me to put my mind at peace about which DB backend to go forward with, and provided a fun experience in writing practical reports about tooling research. Hope it was useful for you!