Extract data from tar.gz and expand on RAM using tarfile library, then plot using matplotlib


This page shows an example of how to extract data from a tar.gz file, expanding it in RAM rather than on the HDD/SSD, and then plot the data using matplotlib.
In fields such as simulation, data logging, and image processing, you may have to deal with a great many small files. In such situations, bundling them into a tar.gz file is an efficient way to speed up data transfer and reduce the number of files. To generate figures, you then have to extract the data from the tar.gz file. If you expand the data directly into RAM rather than onto the HDD/SSD, as shown on this page, you don't have to deal with annoying intermediate files.
See also:

Python Matplotlib Tips: Speed up generating figures by running external python script parallelly using Python and matplotlib.pyplot

That page shows my suggestion for processing data and generating figures in parallel by running an external Python script.


In [1]:
import platform
print('python: '+platform.python_version())
import matplotlib.pyplot as plt
from matplotlib import __version__ as matplotlibversion
print('matplotlib: '+matplotlibversion)
import numpy as np
print('numpy: '+np.__version__)
%matplotlib inline
python: 3.6.3
matplotlib: 2.1.1
numpy: 1.13.3
In [2]:
from os import makedirs
from os.path import join

subdir = "tardir"
# Create the working directory; do nothing if it already exists
makedirs(subdir, exist_ok=True)

ts = np.linspace(0,2,11)
shift = np.linspace(0,5,11)

for num,phi in enumerate(shift):
    # Row 0 holds t, row 1 holds t**2 shifted by phi
    wave = np.array([ts,ts**2+phi])
    with open(join(subdir,"%d.dat"%num), "wb") as f:
        np.savetxt(f,wave,delimiter=",")

Check the contents of one output .dat file

In [3]:
with open(join(subdir,"0.dat"), "r") as f:
    print(f.readlines())
['0.000000000000000000e+00,2.000000000000000111e-01,4.000000000000000222e-01,6.000000000000000888e-01,8.000000000000000444e-01,1.000000000000000000e+00,1.200000000000000178e+00,1.400000000000000133e+00,1.600000000000000089e+00,1.800000000000000044e+00,2.000000000000000000e+00\n', '0.000000000000000000e+00,4.000000000000000777e-02,1.600000000000000311e-01,3.600000000000000977e-01,6.400000000000001243e-01,1.000000000000000000e+00,1.440000000000000391e+00,1.960000000000000409e+00,2.560000000000000497e+00,3.240000000000000213e+00,4.000000000000000000e+00\n']

Compress the files into a single tar.gz file

In [4]:
import subprocess
# Pass the command as a list so it also runs without a shell (e.g. on Linux/macOS)
subprocess.call(["tar", "cvzf", "tardir.tar.gz", "tardir"])
Out[4]:
0
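
The `tar` command above assumes that a `tar` executable is available on the PATH. As a portable alternative, the archive can also be written with the standard-library `tarfile` module. The sketch below is self-contained: it builds its own small `tardir` inside a temporary directory (the `demo_root` name and the file contents are illustrative only) rather than touching the files created above.

```python
import os
import tarfile
import tempfile

# Work inside a throwaway directory so the notebook's own files are untouched.
demo_root = tempfile.mkdtemp()
src_dir = os.path.join(demo_root, "tardir")
os.makedirs(src_dir, exist_ok=True)

# Create a few small placeholder files (illustrative content only).
for num in range(3):
    with open(os.path.join(src_dir, "%d.dat" % num), "w") as f:
        f.write("0.0,1.0,2.0\n0.0,1.0,4.0\n")

# Write the directory into a gzip-compressed tar archive.
archive_path = os.path.join(demo_root, "tardir.tar.gz")
with tarfile.open(archive_path, mode="w:gz") as tar:
    # arcname keeps member names relative ("tardir/0.dat"), like the tar command.
    tar.add(src_dir, arcname="tardir")

# List the member names to confirm what was stored.
with tarfile.open(archive_path, mode="r:gz") as tar:
    member_names = sorted(tar.getnames())
print(member_names)
```

In the notebook itself, the equivalent call would presumably be `tar.add(subdir)` on an archive opened as `tardir.tar.gz`.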

Remove the original files

In [5]:
import shutil
shutil.rmtree(subdir)

Check filenames in tardir.tar.gz

In [6]:
import tarfile
with tarfile.open("tardir.tar.gz", mode="r:gz") as tar:
    print(tar.getnames())
['tardir', 'tardir/0.dat', 'tardir/1.dat', 'tardir/10.dat', 'tardir/2.dat', 'tardir/3.dat', 'tardir/4.dat', 'tardir/5.dat', 'tardir/6.dat', 'tardir/7.dat', 'tardir/8.dat', 'tardir/9.dat']
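
Note that the names come back in the order they are stored in the archive, which here puts `10.dat` before `2.dat`. If the processing order matters, the names can be sorted by their numeric part. A minimal sketch, where `numeric_key` is a hypothetical helper and the `all_names` list is a shortened copy of the output above:

```python
import posixpath

def numeric_key(name):
    """Sort key: the integer stem of a member name such as 'tardir/10.dat'."""
    stem = posixpath.splitext(posixpath.basename(name))[0]
    return int(stem)

# Shortened copy of the getnames() output shown above.
all_names = ['tardir', 'tardir/0.dat', 'tardir/1.dat', 'tardir/10.dat', 'tardir/2.dat']

# Keep only the data files (the bare directory entry has no numeric stem).
data_names = sorted((n for n in all_names if n.endswith(".dat")), key=numeric_key)
print(data_names)
```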

Plot each dataset, extracting the files into RAM rather than onto the HDD/SSD

In [7]:
import csv
from io import StringIO
In [8]:
with tarfile.open("tardir.tar.gz", mode="r:gz") as tar:
    for tarinfo in tar:
        if not tarinfo.isfile(): continue
        
        # Read each file in the tar.gz into memory as bytes (nothing is written to disk)
        binary = tar.extractfile(tarinfo).read()
        
        # Convert binary to string
        strdata = binary.decode("utf-8")
        
        # Convert string to np.array
        arr = np.array(list(csv.reader(StringIO(strdata),delimiter=',')),dtype="float32")
        
        # Draw figure
        plt.plot(arr[0],arr[1])
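
As a variation on the loop above, the file object returned by `extractfile` can be wrapped in `io.TextIOWrapper`, so that `csv.reader` consumes it directly without building intermediate `bytes` and `str` copies by hand. The sketch below is self-contained and uses only the standard library: it builds a tiny archive in memory (the member names and values are made up for illustration) and parses each member into rows of floats, which could then be passed to `np.array` and `plt.plot` as in the cell above.

```python
import csv
import io
import tarfile

# Build a tiny tar.gz archive entirely in memory (illustrative data).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for num in range(2):
        line2 = "%d.0,%d.0,%d.0\n" % (num, num + 1, num + 4)
        payload = ("0.0,1.0,2.0\n" + line2).encode("utf-8")
        info = tarfile.TarInfo(name="tardir/%d.dat" % num)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Read the archive back without writing anything to disk.
buffer.seek(0)
parsed = {}
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    for member in tar:
        if not member.isfile():
            continue
        # Wrap the binary member stream so csv.reader receives text lines.
        raw = tar.extractfile(member)
        with io.TextIOWrapper(raw, encoding="utf-8", newline="") as text:
            parsed[member.name] = [[float(v) for v in row]
                                   for row in csv.reader(text)]

print(parsed["tardir/1.dat"])
```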

If you know the names of the files to be plotted, you can extract their data from the tar file directly, as follows:

In [9]:
toplotfile1 = subdir+"/"+"0.dat"
toplotfile2 = subdir+"/"+"5.dat"
with tarfile.open("tardir.tar.gz", mode="r:gz") as tar:
    
    # Plot toplotfile1
    binary = tar.extractfile(toplotfile1).read()
    strdata = binary.decode("utf-8")
    arr = np.array(list(csv.reader(StringIO(strdata),delimiter=',')),dtype="float32")
    plt.plot(arr[0],arr[1])
    
    # Plot toplotfile2
    binary = tar.extractfile(toplotfile2).read()
    strdata = binary.decode("utf-8")
    arr = np.array(list(csv.reader(StringIO(strdata),delimiter=',')),dtype="float32")
    plt.plot(arr[0],arr[1])
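
The two plotting blocks above repeat the same steps, so they can be folded into a small helper. The sketch below uses a hypothetical `read_member_rows` function (not part of `tarfile`) and an in-memory archive with made-up values so that it runs on its own. It relies on `extractfile` raising `KeyError` for a name that is not in the archive and returning `None` for members that are not regular files.

```python
import csv
import io
import tarfile

def read_member_rows(tar, name):
    """Return the CSV rows of one archive member as lists of floats."""
    extracted = tar.extractfile(name)  # raises KeyError if name is not in the archive
    if extracted is None:
        raise ValueError("%r is not a regular file" % name)
    with extracted:
        text = extracted.read().decode("utf-8")
    return [[float(v) for v in row] for row in csv.reader(io.StringIO(text))]

# Build a small in-memory archive to exercise the helper (illustrative data).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    payload = b"0.0,1.0,2.0\n0.0,1.0,4.0\n"
    info = tarfile.TarInfo(name="tardir/0.dat")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    rows = read_member_rows(tar, "tardir/0.dat")
    try:
        read_member_rows(tar, "tardir/missing.dat")
        missing_raises = False
    except KeyError:
        missing_raises = True

print(rows)
print(missing_raises)
```

With such a helper, each plot would reduce to something like `x, y = read_member_rows(tar, toplotfile1)` followed by `plt.plot(x, y)`.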