Can I speed up YAML?
Solution 1
You've probably noticed that Python's syntax for data structures is very similar to JSON's syntax.
What's happening is that Python's json library encodes Python's builtin datatypes directly into text chunks, replacing ' with " and deleting , here and there (to oversimplify a bit).
On the other hand, pyyaml has to construct a whole representation graph before serialising it into a string.
The same kind of stuff has to happen backwards when loading.
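Part of that extra work is bookkeeping that json never does: PyYAML keeps track of repeated objects so it can write them once and refer back to them. A minimal sketch of the difference, assuming PyYAML is installed (the data and variable names are made up for illustration):

```python
import json
import yaml  # PyYAML, assumed to be installed

# Hypothetical data: the same list object is referenced under two keys.
shared = [1, 2]
data = {"a": shared, "b": shared}

# YAML writes the list once with an anchor and refers back to it with an alias.
yaml_text = yaml.safe_dump(data)
yaml_loaded = yaml.safe_load(yaml_text)

# JSON simply writes the list out twice.
json_text = json.dumps(data)
json_loaded = json.loads(json_text)

print(yaml_text)
print(yaml_loaded["a"] is yaml_loaded["b"])  # identity survives the YAML round trip
print(json_loaded["a"] is json_loaded["b"])  # JSON yields two independent lists
```

If your data never relies on shared references like this, that bookkeeping buys you nothing, which is one reason the json module can afford to be simpler.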
The only way to speed up yaml.load() would be to write a new Loader, but I doubt it would be a huge leap in performance, unless you're willing to write your own single-purpose, sort-of-YAML parser, taking the following comment into consideration:
YAML builds a graph because it is a general-purpose serialisation format that is able to represent multiple references to the same object. If you know no object is repeated and only basic types appear, you can use a json serialiser, it will still be valid YAML.
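The suggestion in that comment can be sketched as follows: write the data with the json module and read the same text back with either json or PyYAML. This is only an illustration, assuming PyYAML is installed; the payload is hypothetical and deliberately limited to basic types with no repeated objects.

```python
import json
import yaml  # PyYAML, assumed to be installed

# Hypothetical payload: basic types only, no object referenced twice.
data = {"name": "example", "values": [1, 2.5, True, None], "nested": {"k": "v"}}

text = json.dumps(data)           # write with the (fast) json module
from_json = json.loads(text)      # read it back with json
from_yaml = yaml.safe_load(text)  # the same text also parses as YAML

print(from_json == data, from_yaml == data)
```

Some edge cases may still be read differently by PyYAML (for instance floats written in exponent form without a decimal point), so it is worth checking the round trip with your own data before relying on it.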
-- UPDATE
What I said before remains true, but there's a way to speed up YAML parsing. I describe it for Linux, though it is not Linux-specific. By default, PyYAML uses its pure-Python parser. You have to tell it that you want to use PyYAML's C parser.
You can do it this way:
    import yaml
    from yaml import CLoader as Loader, CDumper as Dumper

    dump = yaml.dump(dummy_data, fh, encoding='utf-8', default_flow_style=False, Dumper=Dumper)
    data = yaml.load(fh, Loader=Loader)
In order to do so, you need the LibYAML development files installed (PyYAML's C parser is a binding to LibYAML; on Debian/Ubuntu the package is libyaml-dev), for instance with apt-get:
    $ apt-get install libyaml-dev
You also need PyYAML built with LibYAML support, but that's already the case based on your output.
I can't test it right now because I'm running OS X and brew has some trouble installing the library, but the PyYAML documentation is pretty clear that performance will be much better.
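Since the C classes only exist when PyYAML was built against LibYAML, a common pattern is to try importing them and fall back to the pure-Python ones. The sketch below assumes PyYAML is installed and uses the safe variants of the loader and dumper; the USING_LIBYAML flag and the sample data are made up for illustration.

```python
import yaml  # PyYAML, assumed to be installed

try:
    # Only importable when PyYAML was built with LibYAML support.
    from yaml import CSafeLoader as SafeLoader, CSafeDumper as SafeDumper
    USING_LIBYAML = True
except ImportError:
    # Pure-Python fallback: same behaviour, just slower.
    from yaml import SafeLoader, SafeDumper
    USING_LIBYAML = False

# Hypothetical sample data for a quick round trip.
sample = {"a": [1, 2, 3], "b": "text"}

doc = yaml.dump(sample, Dumper=SafeDumper, default_flow_style=False)
data = yaml.load(doc, Loader=SafeLoader)

print("LibYAML bindings in use:", USING_LIBYAML)
print(data)
```

Written this way, the same script runs on machines with and without the C extension, and the flag tells you which path you actually got.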
Solution 2
For reference, I compared a couple of human-readable formats and indeed Python's yaml reader is by far the slowest. (Note the log-scaling in the below plot.) If you're looking for speed, you want one of the JSON loaders, e.g., orjson:
Code to reproduce the plot:
    import numpy
    import perfplot

    import json
    import ujson
    import orjson
    import toml
    import yaml
    from yaml import Loader, CLoader
    import pandas


    def setup(n):
        numpy.random.seed(0)
        data = numpy.random.rand(n, 3)

        with open("out.yml", "w") as f:
            yaml.dump(data.tolist(), f)

        with open("out.json", "w") as f:
            json.dump(data.tolist(), f, indent=4)

        with open("out.dat", "w") as f:
            numpy.savetxt(f, data)

        with open("out.toml", "w") as f:
            toml.dump({"data": data.tolist()}, f)


    def yaml_python(arr):
        with open("out.yml", "r") as f:
            out = yaml.load(f, Loader=Loader)
        return out


    def yaml_c(arr):
        with open("out.yml", "r") as f:
            out = yaml.load(f, Loader=CLoader)
        return out


    def json_load(arr):
        with open("out.json", "r") as f:
            out = json.load(f)
        return out


    def ujson_load(arr):
        with open("out.json", "r") as f:
            out = ujson.load(f)
        return out


    def orjson_load(arr):
        with open("out.json", "rb") as f:
            out = orjson.loads(f.read())
        return out


    def loadtxt(arr):
        with open("out.dat", "r") as f:
            out = numpy.loadtxt(f)
        return out


    def pandas_read(arr):
        out = pandas.read_csv("out.dat", header=None, sep=" ")
        return out.values


    def toml_load(arr):
        with open("out.toml", "r") as f:
            out = toml.load(f)
        return out["data"]


    perfplot.save(
        "out.png",
        setup=setup,
        kernels=[
            yaml_python,
            yaml_c,
            json_load,
            loadtxt,
            pandas_read,
            toml_load,
            ujson_load,
            orjson_load,
        ],
        n_range=[2 ** k for k in range(18)],
    )
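If you only want a rough number and don't have perfplot, numpy and the other libraries at hand, a much smaller comparison can be put together with timeit. This is a sketch rather than a careful benchmark: it assumes PyYAML is installed, and the payload and the time_it helper are made up for illustration.

```python
import json
import timeit

import yaml  # PyYAML, assumed to be installed

try:
    from yaml import CSafeLoader  # only present when PyYAML was built with LibYAML
except ImportError:
    CSafeLoader = None

# Small hypothetical payload; this json.dumps output is also loadable as YAML.
rows = [{"id": i, "value": i * 0.5} for i in range(200)]
text = json.dumps(rows)


def time_it(fn, repeat=3):
    """Best-of-`repeat` wall time for a single call, in seconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))


timings = {
    "json.loads": time_it(lambda: json.loads(text)),
    "yaml SafeLoader": time_it(lambda: yaml.load(text, Loader=yaml.SafeLoader)),
}
if CSafeLoader is not None:
    timings["yaml CSafeLoader"] = time_it(lambda: yaml.load(text, Loader=CSafeLoader))

# Fastest first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} {seconds:.6f} s")
```

The absolute numbers depend heavily on the machine and the shape of the data, so treat the output as an order-of-magnitude check only.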
Comments
-
Eric almost 2 years
I made a little test case to compare YAML and JSON speed:
    import json
    import yaml
    from datetime import datetime
    from random import randint

    NB_ROW=1024

    print 'Does yaml is using libyaml ? ',yaml.__with_libyaml__ and 'yes' or 'no'

    dummy_data = [ { 'dummy_key_A_%s' % i: i, 'dummy_key_B_%s' % i: i } for i in xrange(NB_ROW) ]

    with open('perf_json_yaml.yaml','w') as fh:
        t1 = datetime.now()
        yaml.safe_dump(dummy_data, fh, encoding='utf-8', default_flow_style=False)
        t2 = datetime.now()
        dty = (t2 - t1).total_seconds()
        print 'Dumping %s row into a yaml file : %s' % (NB_ROW,dty)

    with open('perf_json_yaml.json','w') as fh:
        t1 = datetime.now()
        json.dump(dummy_data,fh)
        t2 = datetime.now()
        dtj = (t2 - t1).total_seconds()
        print 'Dumping %s row into a json file : %s' % (NB_ROW,dtj)

    print "json is %dx faster for dumping" % (dty/dtj)

    with open('perf_json_yaml.yaml') as fh:
        t1 = datetime.now()
        data = yaml.safe_load(fh)
        t2 = datetime.now()
        dty = (t2 - t1).total_seconds()
        print 'Loading %s row from a yaml file : %s' % (NB_ROW,dty)

    with open('perf_json_yaml.json') as fh:
        t1 = datetime.now()
        data = json.load(fh)
        t2 = datetime.now()
        dtj = (t2 - t1).total_seconds()
        print 'Loading %s row into from json file : %s' % (NB_ROW,dtj)

    print "json is %dx faster for loading" % (dty/dtj)
And the result is:
    Does yaml is using libyaml ?  yes
    Dumping 1024 row into a yaml file : 0.251139
    Dumping 1024 row into a json file : 0.007725
    json is 32x faster for dumping
    Loading 1024 row from a yaml file : 0.401224
    Loading 1024 row into from json file : 0.001793
    json is 223x faster for loading
I am using PyYAML 3.11 with the libyaml C library on Ubuntu 12.04. I know that JSON is much simpler than YAML, but with a 223x ratio between JSON and YAML I am wondering whether my configuration is correct or not.
Do you see the same speed ratio?
How can I speed up yaml.load()?
-
codeshot almost 9 years
Loading is still 12x slower with YAML. My sample is a list of 600,000 empty dictionaries. YAML doesn't need to do anything extra except slightly cleverer syntax analysis, which should take almost no extra time.
-
Hans Nelsen over 7 years
On Mac: brew install yaml-cpp libyaml
-
nevelis over 7 years
Jivan you're a bloody legend. I was going to rewrite some Python code in C++ to speed things up. My 6MB YAML file took 53 seconds to load using the standard yaml loader, and only 3 seconds with CLoader.
-
Mike Nakis over 7 years
I am not sure why you are saying that the CLoader speedup is only of interest if you are running under Linux; I just tried this under Windows and it works, giving me a huge speedup.
-
Anthon almost 6 years
I don't mind ruby, but I do mind bogus answers. 1) You're not really using ruby; in your code you are using a thin layer around the libyaml C library: "The underlying implementation is the libyaml wrapper Psych". 2) You compare that with PyYAML without the libyaml C library. If you had used it, you would see that Python wrapping libyaml is not 7 times slower but only a few percent. 3) The announcement for the deprecation of the commands module was made in PEP 0361 in 2006; you still propose to use that more than eleven years later.
-
Anthon over 5 years
The comment you link to is incorrect. PyYAML doesn't build a graph. There are no connections between the Nodes that the representer emits, not even in the case of a single object occurring multiple times in a data structure.
-
Niels-Ole over 5 years
If you get cannot import name 'CLoader' from 'yaml', try installing libyaml-dev and then reinstall pyyaml: pip --no-cache-dir install --verbose --force-reinstall -I pyyaml (see github.com/yaml/pyyaml/issues/108)
-
Ezra Steinmetz about 2 years
But make sure you're using safe loaders: from yaml import CSafeLoader as Loader, CSafeDumper as Dumper