The concept is:
This page shows how to combine many pickle files into a single tar, tar.gz, or tar.xz archive using Python. In some simulations we generate thousands or even millions of output files by sweeping several parameters over many values each. The problem with such a huge number of small files is that they are hard to back up to a NAS, an external HDD, or a cloud storage service.
One efficient workaround is to reduce the number of files by bundling them into tar archives, as follows. This page is not specialized to Matplotlib, but the tool is useful for matplotlib users, especially in science, engineering, research, and development departments.
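The core of the trick is just the standard-library tarfile module. As a minimal sketch of the idea (the archive name and the dat*.pkl names here are placeholders, not the ones used below):
import tarfile

# Bundle a few small pickle files into one gzip-compressed archive.
# 'out.tar.gz' and the dat*.pkl names are hypothetical placeholders.
with tarfile.open('out.tar.gz', 'w:gz') as tar:
    for i in range(3):
        tar.add('dat%04d.pkl' % i)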
See also:
This page shows my suggestion for processing data and generating figures in parallel by running an external python script.
This page shows an example of how to extract data from a tar.gz file and expand it not on HDD/SSD but in RAM, then plot the data using matplotlib. In some cases such as simulation, data logging, and image processing, you may have to deal with a great many small files. In such a situation, as you know, a tar.gz file is an efficient way to increase data transfer speed and decrease the number of files. You have to extract the data from the tar.gz file in order to generate figures from it. You don't have to deal with annoying intermediate files if you expand the data not on HDD/SSD but directly in RAM, as shown in that page (a minimal sketch follows this list).
Python Matplotlib Tips: Speed up plotting magnified waveforms using Python & Matplotlib.pyplot
Tips for drawing efficient figures using python matplotlib pyplot. You can brush them up by adding some additional options and settings.
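For reference, here is a minimal sketch of that in-memory extraction (the archive and member names are hypothetical):
import pickle
import tarfile

# Read one member of a tar.gz archive through a file-like object,
# without writing an intermediate file to HDD/SSD.
with tarfile.open('data.tar.gz', 'r:gz') as tar:
    obj = pickle.load(tar.extractfile('dat0000.pkl'))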
import os
import pickle
import platform
import random
import tarfile
import time

import numpy as np

print('python: ' + platform.python_version())
print('numpy: ' + np.__version__)
First of all, define a Dummy class to store some values and numpy arrays, to be saved as pickle files.
Then define a TarManager class to handle the many pickle files (sorry for no docstrings).
class Dummy():
    def __init__(self, n, b, x, y):
        self.n = n
        self.b = b
        self.x = x
        self.y = y

class TarManager():
    def __init__(self, n, indir, infmt, outdir, outfmt, inext='pkl',
                 outext='tar', targetsize=104857600):
        self.n = n
        self.targetsize = targetsize
        self.indir = indir
        self.infmt = infmt
        self.outdir = outdir
        self.outfmt = outfmt
        self.inext = inext
        self.outext = outext
        self.i2p = {}  # dictionary converting id num to pickle file name
        self.i2t = {}  # dictionary converting id num to compressed tar file name
        if self.outext == 'tar':
            self.twm = 'w'  # tar write mode
            self.trm = 'r'  # tar read mode
        elif self.outext == 'tar.gz':
            self.twm = 'w:gz'
            self.trm = 'r:gz'
        elif self.outext == 'tar.bz2':
            self.twm = 'w:bz2'
            self.trm = 'r:bz2'
        elif self.outext == 'tar.xz':
            self.twm = 'w:xz'
            self.trm = 'r:xz'
        else:
            raise ValueError('unsupported output extension: %s' % self.outext)

    def chkpklsize(self):
        # record path and size of every input pickle file
        self.pklsize = {}
        for idx in range(self.n):
            infn = self.infmt % idx + '.' + self.inext
            ipath = os.path.join(self.indir, infn)
            self.i2p[idx] = ipath
            self.pklsize[idx] = os.path.getsize(ipath)

    def getchunk(self):
        # yield (start index, end index, size) of each chunk, closing a
        # chunk as soon as its accumulated size exceeds targetsize
        chunksize = 0
        chunksidx = 0  # chunk start index
        for idx in range(self.n):
            chunksize += self.pklsize[idx]
            if chunksize > self.targetsize:
                yield chunksidx, idx, chunksize
                chunksize = 0
                chunksidx = idx + 1
        if chunksidx < self.n:
            # last chunk, possibly smaller than targetsize
            yield chunksidx, self.n - 1, chunksize

    def compress(self):
        if not hasattr(self, 'pklsize'):
            self.chkpklsize()
        for sidx, eidx, _ in self.getchunk():
            outfn = self.outfmt % (sidx, eidx) + '.' + self.outext
            opath = os.path.join(self.outdir, outfn)
            with tarfile.open(opath, self.twm) as tar:
                for i in range(sidx, eidx + 1):
                    tar.add(self.i2p[i])
                    self.i2t[i] = opath

    def del_pkls(self):
        # remove the original pickle files once they are archived
        for idx in range(self.n):
            os.remove(self.i2p[idx])

    def loaddata(self, idx):
        # extract one pickle file from its tar archive without
        # writing an intermediate file to disk
        with tarfile.open(self.i2t[idx], self.trm) as tar:
            return pickle.load(tar.extractfile(self.i2p[idx]))
Define the number of sample data files.
Nfile = 2000
Generate sample data.
Here, the length of each sample array is assumed to vary with the condition.
The size of the pickle files therefore varies from several hundred bytes to several MB.
os.makedirs('sub', exist_ok=True)  # output directory for the pickle files
smin, smax = np.inf, 0
for idx in range(Nfile):
    n = int(10**(1 + random.random() * 4.5))
    b = random.random()
    x = np.linspace(0, 1, n)
    y = x**b
    d = Dummy(n, b, x, y)
    with open('sub/dat%04d.pkl' % idx, 'wb') as f:
        pickle.dump(d, f)
    s = os.path.getsize('sub/dat%04d.pkl' % idx)
    smin, smax = min(smin, s), max(smax, s)
print('minimum file size: %d B' % smin)
print('maximum file size: %.1f MB' % (smax / 1024**2))
In this situation, tar files of roughly constant size are preferred because the pickle file sizes vary over orders of magnitude.
The Nfile files are therefore split according to the resulting size of the tar files.
Here, 200 MB is set as the minimum size of each tar file (except the last one).
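As a side note, the size-based chunking works like this toy example (all sizes and the targetsize are made up for illustration):
# Group hypothetical file sizes into chunks of at least targetsize bytes,
# mirroring the logic of TarManager.getchunk().
sizes = [60, 50, 30, 80, 10]
targetsize, chunksize, start = 100, 0, 0
for idx, s in enumerate(sizes):
    chunksize += s
    if chunksize > targetsize:
        print('chunk: files %d-%d, %d B' % (start, idx, chunksize))
        chunksize, start = 0, idx + 1
if start < len(sizes):
    print('chunk: files %d-%d, %d B' % (start, len(sizes) - 1, chunksize))
# -> chunk: files 0-1, 110 B
#    chunk: files 2-3, 110 B
#    chunk: files 4-4, 10 B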
fsize = 200 * 1024 * 1024  # 200 MB
Then run the compression. First, combine the files into a .tar file (without compression).
tm = TarManager(Nfile, targetsize=fsize,
indir='sub', infmt='dat%04d',
outdir='sub', outfmt='t_%04d-%04d',
outext='tar')
stime = time.time()
tm.compress()
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
with open('tmtar.pkl', 'wb') as f:
    pickle.dump(tm, f)
The tar.gz extension can compress the data, but it is not so efficient.
tm = TarManager(Nfile, targetsize=fsize,
indir='sub', infmt='dat%04d',
outdir='sub', outfmt='t_%04d-%04d',
outext='tar.gz')
stime = time.time()
tm.compress()
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
with open('tmtargz.pkl', 'wb') as f:
    pickle.dump(tm, f)
The tar.xz extension compresses more efficiently but takes a really long time.
tm = TarManager(Nfile, targetsize=fsize,
indir='sub', infmt='dat%04d',
outdir='sub', outfmt='t_%04d-%04d',
outext='tar.xz')
stime = time.time()
tm.compress()
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
with open('tmtarxz.pkl', 'wb') as f:
    pickle.dump(tm, f)
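If you want to compare the resulting archive sizes on disk, something like this works (assuming the 'sub/t_*' glob pattern matches the outfmt used above):
import glob

# Total on-disk size of the generated archives for each extension.
for ext in ['tar', 'tar.gz', 'tar.xz']:
    total = sum(os.path.getsize(p) for p in glob.glob('sub/t_*.' + ext))
    print('%s: %.1f MB' % (ext, total / 1024**2))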
Delete the original pickle files.
tm.del_pkls()
Load data from the tar files and compare the loading times. Here, every 100th file is read back.
with open('tmtar.pkl', 'rb') as f:
    tm = pickle.load(f)
stime = time.time()
for i in range(0, Nfile, 100):
    tmp = tm.loaddata(i)
    tmp.b, tmp.n, tmp.x, tmp.y  # touch the attributes to force the load
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
with open('tmtargz.pkl', 'rb') as f:
    tm = pickle.load(f)
stime = time.time()
for i in range(0, Nfile, 100):
    tmp = tm.loaddata(i)
    tmp.b, tmp.n, tmp.x, tmp.y
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
with open('tmtarxz.pkl', 'rb') as f:
    tm = pickle.load(f)
stime = time.time()
for i in range(0, Nfile, 100):
    tmp = tm.loaddata(i)
    tmp.b, tmp.n, tmp.x, tmp.y
etime = time.time()
print('%0d [min] %0.1f [sec]' % ((etime - stime) // 60, (etime - stime) % 60))
Although the tar.xz files are about half the size of the originals, the compression/decompression time is terrible. tar.gz is moderately good for both compression and decompression. So I recommend plain tar, because it is simple and fast to compress and decompress.