Can I speed up YAML?

python json yaml

15,922

Solution 1

You've probably noticed that Python's syntax for data structures is very similar to JSON's syntax.

What's happening is that Python's json library encodes Python's built-in datatypes directly into chunks of text, replacing ' with " and dropping a , here and there (to oversimplify a bit).

On the other hand, pyyaml has to construct a whole representation graph before serialising it into a string.

The same kind of stuff has to happen backwards when loading.

The only way to speed up yaml.load() would be to write a new Loader, but I doubt it would be a huge leap in performance, unless you're willing to write your own single-purpose, sort-of-YAML parser, taking the following comment into consideration:

YAML builds a graph because it is a general-purpose serialisation format that is able to represent multiple references to the same object. If you know that no object is repeated and only basic types appear, you can use a JSON serialiser; the output will still be valid YAML.
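To make the quoted comment concrete, here is a minimal sketch (assuming PyYAML is installed; the names `shared` and `doc` are made up for illustration). It shows how a shared object is written by YAML versus JSON, and that JSON output made of basic types can be read back by a YAML loader:

```python
import json
import yaml  # PyYAML

# One list object referenced from two keys (illustrative data).
shared = [1, 2]
doc = {"a": shared, "b": shared}

# YAML keeps the sharing: the dump contains an anchor (&id001) and an alias (*id001).
yaml_text = yaml.safe_dump(doc)
reloaded = yaml.safe_load(yaml_text)
print(yaml_text)
print(reloaded["a"] is reloaded["b"])  # the alias resolves to the same list object

# JSON has no notion of references, so the list is simply written out twice.
json_text = json.dumps(doc)
print(json_text)

# For basic types like these, the JSON text is itself loadable as YAML (flow style).
from_yaml = yaml.safe_load(json_text)
print(from_yaml == json.loads(json_text))
```

If your data never relies on shared references or YAML-only types, dumping with json and loading with either library is one way to sidestep most of the cost.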

-- UPDATE

What I said before remains true, but if you're running Linux there's a way to speed up YAML parsing. By default, Python's yaml module uses the pure-Python parser. You have to tell it explicitly that you want to use PyYAML's C parser.

You can do it this way:

import yaml
from yaml import CLoader as Loader, CDumper as Dumper

dump = yaml.dump(dummy_data, fh, encoding='utf-8', default_flow_style=False, Dumper=Dumper)
data = yaml.load(fh, Loader=Loader)

In order to do so, you need the LibYAML development files installed (PyYAML's C bindings are built against LibYAML; on Debian and Ubuntu the package is libyaml-dev), for instance with apt-get:

$ apt-get install libyaml-dev

You also need PyYAML built with LibYAML support, but that's already the case based on your output.

I can't test it right now because I'm running OS X and brew has some trouble installing the library, but the PyYAML documentation is pretty clear that performance will be much better.
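Since the C classes only exist when PyYAML was built against LibYAML, a common pattern is to try them first and fall back to the pure-Python ones. The sketch below uses the safe variants; `load_yaml` and `dump_yaml` are hypothetical helper names, not part of PyYAML:

```python
import yaml

# Prefer the LibYAML-backed safe classes; fall back to pure Python if the
# C extension was not built.
try:
    from yaml import CSafeLoader as SafeLoader, CSafeDumper as SafeDumper
except ImportError:
    from yaml import SafeLoader, SafeDumper


def load_yaml(text):
    """Parse YAML text with the fastest safe loader available (hypothetical helper)."""
    return yaml.load(text, Loader=SafeLoader)


def dump_yaml(data):
    """Serialise data to a YAML string with the matching dumper (hypothetical helper)."""
    return yaml.dump(data, Dumper=SafeDumper, default_flow_style=False)


sample = [{"dummy_key_A_0": 0, "dummy_key_B_0": 0}]
round_tripped = load_yaml(dump_yaml(sample))
print(round_tripped == sample)
print(SafeLoader.__name__)  # shows which implementation was picked up
```

With this in place the rest of the code does not need to know whether the C extension is available.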

Solution 2

For reference, I compared a couple of human-readable formats, and indeed Python's yaml reader is by far the slowest. (Note the log scaling in the plot.) If you're looking for speed, you want one of the JSON loaders, e.g., orjson.

Code to reproduce the plot:

import numpy
import perfplot

import json
import ujson
import orjson
import toml
import yaml
from yaml import Loader, CLoader
import pandas


def setup(n):
    numpy.random.seed(0)
    data = numpy.random.rand(n, 3)

    with open("out.yml", "w") as f:
        yaml.dump(data.tolist(), f)

    with open("out.json", "w") as f:
        json.dump(data.tolist(), f, indent=4)

    with open("out.dat", "w") as f:
        numpy.savetxt(f, data)

    with open("out.toml", "w") as f:
        toml.dump({"data": data.tolist()}, f)


def yaml_python(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=Loader)
    return out


def yaml_c(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=CLoader)
    return out


def json_load(arr):
    with open("out.json", "r") as f:
        out = json.load(f)
    return out


def ujson_load(arr):
    with open("out.json", "r") as f:
        out = ujson.load(f)
    return out


def orjson_load(arr):
    with open("out.json", "rb") as f:
        out = orjson.loads(f.read())
    return out


def loadtxt(arr):
    with open("out.dat", "r") as f:
        out = numpy.loadtxt(f)
    return out


def pandas_read(arr):
    out = pandas.read_csv("out.dat", header=None, sep=" ")
    return out.values


def toml_load(arr):
    with open("out.toml", "r") as f:
        out = toml.load(f)
    return out["data"]


perfplot.save(
    "out.png",
    setup=setup,
    kernels=[
        yaml_python,
        yaml_c,
        json_load,
        loadtxt,
        pandas_read,
        toml_load,
        ujson_load,
        orjson_load,
    ],
    n_range=[2 ** k for k in range(18)],
)

Eric

Python addicted guy

Updated on June 25, 2022

Comments

Eric almost 2 years

I made a little test case to compare YAML and JSON speed:

import json
import yaml
from datetime import datetime
from random import randint

NB_ROW=1024

print 'Does yaml is using libyaml ? ',yaml.__with_libyaml__ and 'yes' or 'no'

dummy_data = [ { 'dummy_key_A_%s' % i: i, 'dummy_key_B_%s' % i: i } for i in xrange(NB_ROW) ]


with open('perf_json_yaml.yaml','w') as fh:
    t1 = datetime.now()
    yaml.safe_dump(dummy_data, fh, encoding='utf-8', default_flow_style=False)
    t2 = datetime.now()
    dty = (t2 - t1).total_seconds()
    print 'Dumping %s row into a yaml file : %s' % (NB_ROW,dty)

with open('perf_json_yaml.json','w') as fh:
    t1 = datetime.now()
    json.dump(dummy_data,fh)
    t2 = datetime.now()
    dtj = (t2 - t1).total_seconds()
    print 'Dumping %s row into a json file : %s' % (NB_ROW,dtj)

print "json is %dx faster for dumping" % (dty/dtj)

with open('perf_json_yaml.yaml') as fh:
    t1 = datetime.now()
    data = yaml.safe_load(fh)
    t2 = datetime.now()
    dty = (t2 - t1).total_seconds()
    print 'Loading %s row from a yaml file : %s' % (NB_ROW,dty)

with open('perf_json_yaml.json') as fh:
    t1 = datetime.now()
    data = json.load(fh)
    t2 = datetime.now()
    dtj = (t2 - t1).total_seconds()
    print 'Loading %s row into from json file : %s' % (NB_ROW,dtj)

print "json is %dx faster for loading" % (dty/dtj)

And the result is:

Does yaml is using libyaml ?  yes
Dumping 1024 row into a yaml file : 0.251139
Dumping 1024 row into a json file : 0.007725
json is 32x faster for dumping
Loading 1024 row from a yaml file : 0.401224
Loading 1024 row into from json file : 0.001793
json is 223x faster for loading

I am using PyYAML 3.11 with the libyaml C library on Ubuntu 12.04. I know that JSON is much simpler than YAML, but with a 223x ratio between JSON and YAML I am wondering whether my configuration is correct.

Do you see the same speed ratio?
How can I speed up yaml.load()?
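One thing worth checking in a setup like this: yaml.__with_libyaml__ only reports that the C bindings are available, while yaml.safe_load() and yaml.safe_dump() still use the pure-Python classes unless a C loader or dumper is passed explicitly. The following is a rough sketch for comparing the two on the same kind of dummy data (the repeat count of 3 is an arbitrary choice, and timings will vary by machine):

```python
import timeit
import yaml

NB_ROW = 1024
dummy_data = [{"dummy_key_A_%s" % i: i, "dummy_key_B_%s" % i: i} for i in range(NB_ROW)]
text = yaml.safe_dump(dummy_data, default_flow_style=False)

# Pure-Python loader: this is what yaml.safe_load() uses, even when
# yaml.__with_libyaml__ is True.
loaders = {"SafeLoader": yaml.SafeLoader}

# The C-backed loader exists only if PyYAML was built against LibYAML.
if yaml.__with_libyaml__:
    loaders["CSafeLoader"] = yaml.CSafeLoader

timings = {}
for name, loader in loaders.items():
    # Time three full loads of the same document with this loader class.
    timings[name] = timeit.timeit(lambda: yaml.load(text, Loader=loader), number=3)
    print("%s: %.4f s for 3 loads" % (name, timings[name]))
```

If only SafeLoader is printed, the C extension is not available in that environment.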

codeshot almost 9 years

Loading is still 12x slower with YAML. My sample is a list of 600,000 empty dictionaries. YAML doesn't need to do anything extra except slightly cleverer syntax analysis, which should take almost no extra time.

Hans Nelsen over 7 years

On Mac: brew install yaml-cpp libyaml

nevelis over 7 years

Jivan, you're a bloody legend. I was going to rewrite some Python code in C++ to speed things up. My 6MB YAML file took 53 seconds to load using the standard yaml loader, and only 3 seconds with CLoader.

Mike Nakis over 7 years

I am not sure why you are saying that the CLoader speedup is only of interest if you are running under Linux; I just tried this under Windows and it works, giving me a huge speedup.

Anthon almost 6 years

I don't mind Ruby, but I do mind bogus answers. 1) You're not really using Ruby; in your code you are using a thin layer around the libyaml C library: "The underlying implementation is the libyaml wrapper Psych". 2) You compare that with PyYAML without the libyaml C library. If you had used it, you would see that Python wrapping libyaml is not 7 times slower but only a few percent. 3) The announcement of the deprecation of the commands module was made in PEP 0361 in 2006; you still propose to use it more than eleven years later.

Anthon over 5 years

The comment you link to is incorrect. PyYAML doesn't build a graph. There are no connections between the Nodes that the representer emits, not even in the case of a single object occurring multiple times in a data structure.

Niels-Ole over 5 years

If you cannot import name 'CLoader' from 'yaml', try installing libyaml-dev and then reinstalling pyyaml: pip --no-cache-dir install --verbose --force-reinstall -I pyyaml (see github.com/yaml/pyyaml/issues/108)

Ezra Steinmetz about 2 years

But make sure you're using safe loaders: from yaml import CSafeLoader as Loader, CSafeDumper as Dumper