R: Generic flattening of JSON to data.frame


Solution 1

Hi @Sim, I had cause to reflect on your problem yesterday. Define:

flatten <- function(x) {
    # build dotted path names for every leaf, then drop the spurious
    # trailing ".1" that getnames appends to scalar leaves
    dumnames <- unlist(getnames(x, T))
    dumnames <- gsub("(*.)\\.1", "\\1", dumnames)
    repeat {
        # peel off one level of nesting per pass
        x <- do.call(.Primitive("c"), x)
        if (!any(vapply(x, is.list, logical(1)))) {
            names(x) <- dumnames
            return(x)
        }
    }
}
getnames <- function(x, recursive) {
    nametree <- function(x, parent_name, depth) {
        if (length(x) == 0) 
            return(character(0))
        x_names <- names(x)
        if (is.null(x_names)){ 
            x_names <- seq_along(x)
            x_names <- paste(parent_name, x_names, sep = "")
        }else{ 
            x_names[x_names==""] <- seq_along(x)[x_names==""]
            x_names <- paste(parent_name, x_names, sep = "")
        }
        if (!is.list(x) || (!recursive && depth >= 1L)) 
            return(x_names)
        x_names <- paste(x_names, ".", sep = "")
        lapply(seq_len(length(x)), function(i) nametree(x[[i]], 
            x_names[i], depth + 1L))
    }
    nametree(x, "", 0L)
}

(getnames is adapted from AnnotationDbi:::make.name.tree)

(flatten is adapted from the discussion in the Stack Overflow question "How to flatten a list to a list without coercion?")

As a simple example:

my_data<-list(x=list(1,list(1,2,y='e'),3))

> my_data[['x']][[2]][['y']]
[1] "e"

> out<-flatten(my_data)
> out
$x.1
[1] 1

$x.2.1
[1] 1

$x.2.2
[1] 2

$x.2.y
[1] "e"

$x.3
[1] 3

> out[['x.2.y']]
[1] "e"

So the result is a flattened list with roughly the naming structure you suggest. Coercion is also avoided, which is a plus.
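
If the end goal is a data.frame rather than a list, a one-row frame can be built directly from the flattened list. A minimal sketch using base as.data.frame, assuming every leaf has length one:

as.data.frame(out, stringsAsFactors = FALSE)
# a one-row data.frame with columns x.1, x.2.1, x.2.2, x.2.y and x.3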

A more complicated example

library(RJSONIO)
library(RCurl)
json.data<-getURL("http://www.reddit.com/r/leagueoflegends/.json")
dumdata<-fromJSON(json.data)
out<-flatten(dumdata)

UPDATE

A naive way to remove the trailing .1s:

my_data<-list(x=list(1,list(1,2,y='e'),3))
gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))

> gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))
[1] "x.1"   "x.2.1" "x.2.2" "x.2.y" "x.3"  

Solution 2

R has two packages for dealing with JSON input: rjson and RJSONIO. If I understand correctly what you mean by "collection of non-cyclical homogeneous or heterogeneous data structures", I think either of these packages will import that sort of structure as a list.

You can then flatten that list (into a vector) using the unlist function.

If the list is suitably structured (a non-nested list where each element has the same length), then as.data.frame provides an alternative way to convert the list to a data frame.

An example:

(my_data <- list(x = list('1' = 1, '2' = list(y = 2))))
unlist(my_data)
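
And to illustrate the as.data.frame route mentioned above, a small sketch with a made-up non-nested list whose elements all have the same length:

my_flat_list <- list(a = 1:3, b = c("x", "y", "z"))
as.data.frame(my_flat_list, stringsAsFactors = FALSE)
# a 3-row data frame with columns a and b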

Solution 3

The jsonlite package is a fork of RJSONIO specifically designed to make conversion between JSON and data frames easier. You don't provide any example JSON data, but I think this might be what you are looking for. Have a look at this blog post or the vignette.
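
For instance, a minimal sketch (the JSON string is made up for illustration): jsonlite reads an array of objects straight into a data frame, and flatten = TRUE expands nested objects into dotted column names.

library(jsonlite)
json <- '[{"id": 1, "user": {"name": "a"}}, {"id": 2, "user": {"name": "b"}}]'
# gives a two-row data frame with columns id and user.name
fromJSON(json, flatten = TRUE)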

Solution 4

Great answer with the flatten and getnames functions. It took me a few minutes to figure out all the steps needed to get from a vector of JSON strings to a data.frame, so I thought I'd record them here. Suppose jsonvec is a vector of JSON strings. The following builds a data.frame (data.table) with one row per string, where each column corresponds to a different possible leaf node of the JSON tree. Any string missing a particular leaf node is filled with NA.

library(data.table)
library(jsonlite)
parsed = lapply(jsonvec, fromJSON, simplifyVector=FALSE)
flattened = lapply(parsed, flatten) #using flatten from accepted answer
d = rbindlist(flattened, fill=TRUE)
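
A tiny usage example (hypothetical input; it assumes the flatten function from Solution 1 is defined in the workspace, where it masks jsonlite's own flatten):

jsonvec <- c('{"a": 1, "b": {"c": 2}}', '{"a": 3, "b": {"d": 4}}')
parsed <- lapply(jsonvec, fromJSON, simplifyVector = FALSE)
flattened <- lapply(parsed, flatten)
d <- rbindlist(flattened, fill = TRUE)
# one row per JSON string; the columns are the union of the leaf paths
# (here a, b.c and b.d), with NA wherever a document lacks that leaf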

Comments

  • Sim almost 2 years

    This question is about a generic mechanism for converting any collection of non-cyclical homogeneous or heterogeneous data structures into a dataframe. This can be particularly useful when dealing with the ingestion of many JSON documents or with a large JSON document that is an array of dictionaries.

    There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into dataframes using functionality such as plyr, lapply, etc. All the questions and answers I have found are about specific cases as opposed to offering a general approach for dealing with collections of complex JSON data structures.

    In Python and Ruby I've been well-served by implementing a generic data structure flattening utility that uses the path to a leaf node in a data structure as the name of the value at that node in the flattened data structure. For example, the value my_data[['x']][[2]][['y']] would appear as result[['x.2.y']].

    If one has a collection of these data structures that may not be entirely homogeneous the key to doing a successful flattening to a dataframe would be to discover the names of all possible dataframe columns, e.g., by taking the union of all keys/names of the values in the individually flattened data structures.

    This seems like a common pattern and so I'm wondering whether someone has already built this for R. If not, I'll build it but, given R's unique promise-based data structures, I'd appreciate advice on an implementation approach that minimizes heap thrashing.

  • Richie Cotton almost 12 years
    What's with the downvote? unlist seems exactly like the "generic data structure flattening utility" that @Sim wants. In fact, the similar question linked to by @ttmaccer includes answers that make extensive use of unlist.
  • Richie Cotton almost 12 years
    @ttmaccer: Yeah, you can't have it both ways in R. You either have a flat (vector) data structure with a single data type or a nested (list) structure with mixed types. I think that there are enough tools in R that any JSON is transformable into whatever you want, though.
  • Sim almost 12 years
    @RichieCotton @ttmaccer I agree that unlist won't work in a generic way. If this is the best R has out of the box, I'll go ahead and write the recursive descent flattener I've used in other languages.
  • Sim almost 12 years
    Looks promising. How would you suggest we get rid of the trailing .1s?
  • Sim almost 12 years
    You should be able to reassign names(flattened_structure), right?
  • Sim almost 12 years
    I agree. Cleaner now. My question was specifically about converting a large JSON document that is an array of dictionaries/hashes to data.frame. For that you'd have to build the column set as the union of all flattened list names, right?
  • EarlyPoster almost 11 years
    Thanks for the example code. However, on a large nested set I get: 'names' attribute [71556] must be the same length as the vector [66648].
  • DavidR almost 7 years
    Great answer - I just want to suggest one change in flatten. Currently you do the flattening (.Primitive("c")) before checking whether the list is already flat. If you provide a list that is already flat, it seems to get converted to an atomic vector, which loses all type information. I suggest moving the flattening to the end of the repeat loop.
  • Ufos almost 5 years
    Try fromJSON(path_to_file, flatten=F) and then examine the output. It's usually quite logical, although you might need a few rbindlist() calls and other tidying steps to fully unwrap it.