TarManager: combine many pickle files into some tar files using Python


This page shows how to combine many pickle files into a few tar, tar.gz, or tar.xz files using Python. In some simulations, we generate thousands or even millions of output files by sweeping several parameters over many values. The problem with such a huge number of small files is that they are difficult to back up to a NAS, an external HDD, or cloud storage services.
One efficient workaround is to reduce the number of files by combining them into tar archives, as follows. This page is not specialized to Matplotlib, but this tool is useful for matplotlib users, especially in science, engineering, research, and development.
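
If you are not familiar with the tarfile library, here is a minimal sketch of combining files into a single archive (the file names are hypothetical):

import tarfile

# Combine two pickle files into one uncompressed tar archive.
with tarfile.open('combined.tar', 'w') as tar:
    tar.add('dat0000.pkl')
    tar.add('dat0001.pkl')

The TarManager class below automates this for thousands of files and splits the output into multiple archives of roughly equal size.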
See also:

Python Matplotlib Tips: Speed up generating figures by running external python script parallelly using Python and matplotlib.pyplot

This page shows my suggestion for processing data and generating figures in parallel by running external python scripts.


Python Matplotlib Tips: Extract data from tar.gz and expand on RAM using tarfile library, then plot using matplotlib

This page shows an example of how to extract data from a tar.gz file and expand it not on HDD/SSD but in RAM, then plot the data using matplotlib. In some cases, such as simulation, data logging, and image processing, you may have to deal with a great many small files. In such a situation, as you know, a tar.gz file is an efficient way to increase data transfer speed and decrease the number of files. You have to extract data from the tar.gz file in order to generate figures from it. You don't have to deal with annoying intermediate files if you expand the data not on HDD/SSD but directly in RAM, as shown on that page (a minimal sketch follows this list).


Python Matplotlib Tips: Speed up plotting magnified waveforms using Python & Matplotlib.pyplot

Tips for drawing efficient figures using python matplotlib pyplot. You can brush them up by adding some additional options and settings.

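As referenced above, here is a minimal sketch of reading a pickle straight out of a tar.gz archive without writing intermediate files to HDD/SSD (the archive and member names are hypothetical):

import pickle
import tarfile

# Get a file-like object for one archive member and unpickle it;
# nothing is written to HDD/SSD.
with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    member = tar.extractfile('dat0000.pkl')
    data = pickle.load(member)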


In [1]:
import numpy as np
import random
import pickle
import tarfile
import os
import time
import platform
print('python: '+platform.python_version())
print('numpy: '+np.__version__)
python: 3.7.6
numpy: 1.18.1

First of all, define a Dummy class to store some values and numpy arrays to be saved as pickle files.
Then define the TarManager class to handle many pickle files (sorry for no docstring).

In [2]:
class Dummy():
    def __init__(self, n, b, x, y):
        self.n = n
        self.b = b
        self.x = x
        self.y = y


class TarManager():

    
    def __init__(self, n, indir, infmt, outdir, outfmt, inext='pkl',
                 outext='tar', targetsize=104857600):
        self.n = n
        self.targetsize = targetsize
        self.indir = indir
        self.infmt = infmt
        self.outdir = outdir
        self.outfmt = outfmt
        self.inext = inext
        self.outext = outext
        self.i2p = {} # dictionary mapping id number to pickle file name
        self.i2t = {} # dictionary mapping id number to compressed tar file name
        if self.outext == 'tar':
            self.twm = 'w' # tar write mode
            self.trm = 'r' # tar read mode
        elif self.outext == 'tar.gz':
            self.twm = 'w:gz'
            self.trm = 'r:gz'
        elif self.outext == 'tar.bz2':
            self.twm = 'w:bz2'
            self.trm = 'r:bz2'
        elif self.outext == 'tar.xz':
            self.twm = 'w:xz'
            self.trm = 'r:xz'
        else:
            raise ValueError('unsupported outext: %s' % self.outext)

    def chkpklsize(self):
        # Record the path and size of every input pickle file.
        self.pklsize = {}
        for idx in range(self.n):
            infn = self.infmt % idx + '.' + self.inext
            ipath = os.path.join(self.indir, infn)
            self.i2p[idx] = ipath
            self.pklsize[idx] = os.path.getsize(ipath)

    def getchunk(self):
        # Greedily group files: yield a chunk as soon as the accumulated
        # size exceeds the target, then start the next chunk.
        chunksize = 0
        chunksidx = 0 # chunk start index
        for idx in range(self.n):
            chunksize += self.pklsize[idx]
            if chunksize > self.targetsize:
                yield chunksidx, idx, chunksize
                chunksize = 0
                chunksidx = idx + 1
        if chunksidx <= idx:
            # yield the (smaller) last chunk, if any files remain
            yield chunksidx, idx, chunksize

    def compress(self):
        if not hasattr(self, 'pklsize'):
            self.chkpklsize()
        for sidx, eidx, _ in self.getchunk():
            outfn = self.outfmt % (sidx, eidx) + '.' + self.outext
            opath = os.path.join(self.outdir, outfn)
            with tarfile.open(opath, self.twm) as tar:
                for i in range(sidx, eidx + 1):
                    infn = self.infmt % i + '.' + self.inext
                    ipath = os.path.join(self.indir, infn)
                    tar.add(ipath)
                    self.i2t[i] = opath

    def del_pkls(self):
        # Remove the original pickle files once they are archived.
        for idx in range(self.n):
            fn = self.infmt % idx + '.' + self.inext
            os.remove(os.path.join(self.indir, fn))

    def loaddata(self, idx):
        # Read one pickle directly from its tar archive, without
        # extracting intermediate files to HDD/SSD.
        with tarfile.open(self.i2t[idx], self.trm) as tar:
            return pickle.load(tar.extractfile(self.i2p[idx]))

Define the number of sample data files.

In [3]:
Nfile = 2000

Generate the sample data.
Here, the length of each sample array is assumed to vary with the conditions.
The size of the pickle files therefore varies from several hundred bytes to several MB.

In [4]:
os.makedirs('sub', exist_ok=True)  # output directory for pickle and tar files
smin, smax = np.infty, 0
for idx in range(Nfile):
    n = int(10**(1 + random.random() * 4.5))  # array length: ~10 to ~300,000
    b = random.random()
    x = np.linspace(0, 1, n)
    y = x**b
    d = Dummy(n, b, x, y)
    fn = 'sub/dat%04d.pkl' % idx
    with open(fn, 'wb') as f:
        pickle.dump(d, f)
    s = os.path.getsize(fn)
    smin, smax = min(smin, s), max(smax, s)
print('minimum file size: %d B' % smin)
print('maximum file size: %d MB' % (smax / 1024**2))
minimum file size: 422 B
maximum file size: 4 MB
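
As a side note for matplotlib users, here is a quick sketch for visualizing the file size distribution (it assumes the pickle files generated above and reuses Nfile):

import matplotlib.pyplot as plt

# Histogram of the generated pickle file sizes on a logarithmic axis.
sizes = [os.path.getsize('sub/dat%04d.pkl' % idx) for idx in range(Nfile)]
plt.hist(sizes, bins=np.logspace(2, 7, 26))
plt.xscale('log')
plt.xlabel('file size [B]')
plt.ylabel('count')
plt.show()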

In this situation, tar files of roughly constant size are preferred because the pickle file sizes vary by orders of magnitude.
The Nfile files are therefore split by the resulting size of the tar files.
Here, 200 MB is set as the minimum size of each tar file (except the last one). A small worked example of the chunking logic follows.
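
To illustrate how getchunk splits the files, here is the same greedy logic applied to a plain list of hypothetical sizes:

def chunks(sizes, target):
    # Same greedy logic as TarManager.getchunk: yield a chunk as soon as
    # the running total exceeds the target, then start the next chunk.
    total, start = 0, 0
    for idx, s in enumerate(sizes):
        total += s
        if total > target:
            yield start, idx, total
            total, start = 0, idx + 1
    if start <= idx:
        # yield the (smaller) last chunk, if any files remain
        yield start, idx, total

print(list(chunks([60, 60, 30, 90, 40], 100)))
# [(0, 1, 120), (2, 3, 120), (4, 4, 40)]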

In [5]:
fsize = 200 * 1024 * 1024 # 200 MB

Then run the compression. First, combine into .tar files (without compression).

In [6]:
tm = TarManager(Nfile, targetsize=fsize,
                indir='sub', infmt='dat%04d',
                outdir='sub', outfmt='t_%04d-%04d',
                outext='tar')
stime = time.time()
tm.compress()
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
with open('tmtar.pkl', 'wb') as f:
    pickle.dump(tm, f)
0 [min] 2.9 [sec]
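
As a quick check (a sketch, assuming the output naming used above), you can list the resulting tar files and their sizes; each chunk should be slightly above the 200 MB target, except the last one:

import glob

for path in sorted(glob.glob('sub/t_*.tar')):
    print('%s: %.1f MB' % (path, os.path.getsize(path) / 1024**2))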

The tar.gz format adds compression, but it is not time-efficient: compression takes much longer than plain tar.

In [7]:
tm = TarManager(Nfile, targetsize=fsize,
                indir='sub', infmt='dat%04d',
                outdir='sub', outfmt='t_%04d-%04d',
                outext='tar.gz')
stime = time.time()
tm.compress()
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
with open('tmtargz.pkl', 'wb') as f:
    pickle.dump(tm, f)
1 [min] 4.2 [sec]

The tar.xz format compresses more efficiently but takes a really long time.

In [8]:
tm = TarManager(Nfile, targetsize=fsize,
                indir='sub', infmt='dat%04d',
                outdir='sub', outfmt='t_%04d-%04d',
                outext='tar.xz')
stime = time.time()
tm.compress()
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
with open('tmtarxz.pkl', 'wb') as f:
    pickle.dump(tm, f)
10 [min] 56.0 [sec]

Delete the original pickle files.

In [9]:
tm.del_pkls()

Load data from the tar files and compare the loading times.

In [10]:
with open('tmtar.pkl', 'rb') as f:
    tm = pickle.load(f)

stime = time.time()
for i in range(0, Nfile, 100):
    tmp = tm.loaddata(i)
    tmp.b, tmp.n, tmp.x, tmp.y
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
0 [min] 1.4 [sec]
In [11]:
with open('tmtargz.pkl', 'rb') as f:
    tm = pickle.load(f)

stime = time.time()
for i in range(0, Nfile, 100):
    tmp = tm.loaddata(i)
    tmp.b, tmp.n, tmp.x, tmp.y
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
0 [min] 54.7 [sec]
In [12]:
with open('tmtarxz.pkl', 'rb') as f:
    tm = pickle.load(f)

stime = time.time()
for i in range(0, Nfile, 100):
    tmp = tm.loaddata(i)
    tmp.b, tmp.n, tmp.x, tmp.y
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
4 [min] 58.5 [sec]

Although the tar.xz file size is about half of the original, the compression/decompression time is terrible (about 11 minutes to compress and 5 minutes to read back every 100th file, versus a few seconds for plain tar).
tar.gz is moderately good at both compression and decompression.
So I recommend plain tar, because it is simple and fast for both compression and decompression.