Just moozing

Before you can check your notes, you must make them…

Compressing time series


The other day it struck me as odd that I didn’t know of a data format for time series. We collect a lot of timestamped data, and most of the time it just gets dumped in a .csv file whenever we want to share it. What do the pros do?

Lossless compression

For generic use, it must be lossless and compressed.

We have a lot of compressed formats, like jpeg for images and mp3 for music. They are lossy: since we know which frequencies can be heard, it is ok to skip some data in the high frequency range. The same goes for images, where crystal clear images might not be needed.

My first contact with lossless compression was .png images. We were working with image recognition, so it was important for us to be able to extract the exact pixel values from the images as seen by the camera. This could also be done with ppm, except that ppm is not compressed.

For comparison, I used GIMP to create a homogeneous and an inhomogeneous image. The latter was made using the RGB noise filter on its highest settings.

WordPress.com doesn’t allow me to upload .ppm files for comparison.

The size matrix looks like this:

             ppm            png            ppm to png ratio
Just green   768054 bytes   1411 bytes     544
Noise        768054 bytes   497834 bytes   1.5

So the compression works very, very well for uniform images, and less well for noisy images. The obvious conclusion is to use compression whenever possible.
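The same effect can be reproduced without any image files. PNG compresses with DEFLATE, which Python exposes through the zlib module, so here is a minimal sketch that compresses a uniform buffer and a random buffer of about the same size as the test images. The buffer size and the use of os.urandom are my assumptions for illustration; truly random bytes are a harsher case than the GIMP noise image, so the ratio lands close to 1 rather than 1.5.

```python
import os
import zlib

# Assumed payload size, close to the 768054-byte ppm files above.
SIZE = 768_000

uniform = bytes([0, 255, 0]) * (SIZE // 3)  # "just green": one RGB triple repeated
noise = os.urandom(SIZE)                    # random bytes standing in for RGB noise


def ratio(raw: bytes) -> float:
    """Return uncompressed size divided by DEFLATE-compressed size."""
    return len(raw) / len(zlib.compress(raw, 9))


uniform_ratio = ratio(uniform)
noise_ratio = ratio(noise)
print(f"uniform: {uniform_ratio:.0f}x, noise: {noise_ratio:.2f}x")
```

The uniform buffer shrinks by a factor of several hundred, while the random buffer does not shrink at all.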

There is a cost in terms of CPU cycles and complexity, but for most use cases that is not an issue.

Don’t we have something similar for time series? We could write a csv file and then compress it (.csv.zip? .csv.gz?), but that is not a format I have ever seen used. Still, we must have a lot of time series where the value stays the same for prolonged periods (like a sensor sitting at its maximum or minimum value). Solar panels, for instance, produce 0 kW at night.
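As a rough sketch of what a .csv.gz round trip could look like, the snippet below writes the same rows once as plain text and once through gzip, then reads the compressed file back. The readings are synthetic (zero at night, a made-up value during the day), and the file and column names are hypothetical. It uses only the standard library; pandas can also read a .csv.gz file directly, since read_csv infers the compression from the file extension by default.

```python
import csv
import gzip
import os
import tempfile
from datetime import datetime, timedelta


def make_rows(n=2000):
    """Synthetic 5-minute readings: zero at night, a made-up value in daytime."""
    start = datetime(2017, 1, 1)
    rows = []
    for i in range(n):
        t = start + timedelta(minutes=5 * i)
        night = t.hour < 8 or t.hour >= 17
        power = 0.0 if night else round(1.5 + 0.01 * (i % 50), 2)
        rows.append((t.isoformat(), power))
    return rows


rows = make_rows()
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "solar.csv")     # hypothetical file names
gz_path = os.path.join(tmpdir, "solar.csv.gz")

# Write the same rows twice: once as plain text, once through gzip.
for path, opener in ((plain_path, open), (gz_path, gzip.open)):
    with opener(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "power_kw"])
        writer.writerows(rows)

# Reading back works the same way, with gzip.open in text mode.
with gzip.open(gz_path, "rt", newline="") as f:
    restored = [(t, float(p)) for t, p in list(csv.reader(f))[1:]]

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)
print(plain_size, gz_size, restored == rows)
```

The long runs of identical night-time rows are exactly the kind of repetition gzip handles well, and the data comes back unchanged.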

Time series data

I will compare time series data from my solar panels. I extract the data in some obscurely formatted text file, which is not useful – but all the data is there.

The data dumps contain 3 months’ worth of readings at 5-minute intervals. This gives 18000+ lines, and the file takes up 1.5 MB.

It is a problem for my spreadsheet, so I turned to pandas and matplotlib.

[Figure: current_6_days.png]

As expected, there are long stretches where the current production is 0.

CSV file

The case for .csv files:

  • Everybody knows it, every (relevant) program knows it, and all programming languages have modules to read it.
  • It is human readable and writable
  • It is the expected data transport format

The case against it:

  • It is just text, so interpretation is problematic
    • is the decimal delimiter ‘.’ or ‘,’?
    • is 2/3/4 a date, and is it read the American way or some other way?
  • Which delimiter to use? tabs? semicolons?
  • What about headers?
  • Yes it is human readable – when you don’t have too many columns.
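To make the interpretation problem concrete, here is a small sketch that parses the same text under two different date conventions. The sample data, the column names and the parse helper are all hypothetical; the point is only that the reader has to be told the delimiter, the decimal mark and the date order, because the file itself does not say.

```python
import csv
import io
from datetime import datetime

# Hypothetical export: semicolon-delimited, decimal comma, ambiguous dates.
raw = "date;power_kw\n2/3/2017;1,5\n3/3/2017;0,0\n"


def parse(text, delimiter=";", decimal=",", date_format="%d/%m/%Y"):
    """Parse CSV text with every convention spelled out instead of guessed."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    records = []
    for row in reader:
        day = datetime.strptime(row["date"], date_format).date()
        value = float(row["power_kw"].replace(decimal, "."))
        records.append((day, value))
    return records


european = parse(raw)                          # 2/3/2017 read as 2 March
american = parse(raw, date_format="%m/%d/%Y")  # 2/3/2017 read as 3 February
print(european[0], american[0])
```

Both calls succeed without any error, which is the real problem: a wrong guess produces plausible but wrong data.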

The csv file with its 18143 records takes up 907KB. It is generated from the raw data file.

 

HDF5 file

The HDF5 file format is the only one I found that is designed for data like time series. Recommendations are that it is used for large data sets.

Even though the resulting Python programs to read/write hdf5 files are very simple (using pandas), I did not find it trivial to get it working. There were issues with libraries.

Resulting sizes

The table below shows the file sizes from my test with roughly 18000 records:

       Raw      csv      csv + gzip   hdf5     hdf5 w. comp
size   1.5 MB   907 kB   163 kB       2.3 MB   1.6 MB

The goal with HDF5 is to handle complex data and lots of it – like millions of entries. So that is probably the reason for the (way) larger file size.

I’ll probably stick with CSV – it is just simpler and it works for me.

Someone should make a read-only data format for simple time series data.

 

Source code for the scripts used is available on GitHub.


Written by moozing

May 7, 2017 at 12:00

Posted in Tech


One Response


  1. Cool post! I never thought about compressing csv files.

    archer920gmailcom

    May 12, 2017 at 19:25

