Extract top-level key and contents from large JSON using stream

Time:01-08

One step in our system extracts a single top-level key and its value (an array of objects) into a dedicated file, which is then processed by a separate script that is not relevant here.

A representative subset of the original JSON file looks like:

{
  "version" : null,
  "produced" : "2021-01-01T00:00:00+0000",
  "other": "content here",
  "items" : [
    {
      "code" : "AA",
      "name" : "Example 1",
      "prices" : [ "other", "content", "here" ]
    }, 
    {
      "code" : "BB",
      "name" : "Example 2",
      "prices" : [ "other", "content", "here" ]
    }
  ]
}

The current output, given that subset as input, is simply:

[
    {
      "code" : "AA",
      "name" : "Example 1",
      "prices" : [ "other", "content", "here" ]
    },
    {
      "code" : "BB",
      "name" : "Example 2",
      "prices" : [ "other", "content", "here" ]
    },
    ...
]

Previously, we extracted the whole "items" portion using jq with a very straightforward command (which worked fine):

cat file.json | jq '.items' > file.items.json

However, the original JSON file has recently grown drastically, causing the script to fail with an out-of-memory error. One obvious solution is to use jq's --stream option, but I am stuck on how to convert the above command into a valid filter in jq's streaming syntax.

cat file.json | jq --stream '...' > file.items.json

Any advice on what to use as a filter for this command would be greatly appreciated. Thanks in advance!

CodePudding user response:

You should use the --stream flag in combination with the fromstream builtin:

jq --stream --null-input '
  fromstream(inputs | select(.[0][0] == "items"))[]
' file.json 
[
  {
    "code": "AA",
    "name": "Example 1",
    "prices": [
      "other",
      "content",
      "here"
    ]
  },
  {
    "code": "BB",
    "name": "Example 2",
    "prices": [
      "other",
      "content",
      "here"
    ]
  }
]

Demo (it illustrates the syntax only, not the efficiency or memory consumption, because jqplay.org lacks the --stream option and I had to convert your original input to a stream with tostream instead)
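
To see why the filter selects on .[0][0], it can help to look at what --stream actually emits. The sketch below runs jq over a tiny stand-in document (the file name and contents are made up for illustration). Each event is either a [path, leaf] pair or a one-element [path] closing event, and .[0][0] is the first path component, i.e. the top-level key.

```shell
# Hypothetical miniature input, written to a temp file for illustration only.
tmp=$(mktemp -d)
printf '%s\n' '{"other":"x","items":[{"code":"AA"}]}' > "$tmp/mini.json"

# -c prints one event per line; "." passes every event through unchanged.
events=$(jq -c --stream '.' "$tmp/mini.json")
printf '%s\n' "$events"
# [["other"],"x"]
# [["items",0,"code"],"AA"]
# [["items",0,"code"]]
# [["items",0]]
# [["items"]]

# Keep only the events whose first path component is "items".
items_events=$(jq -c --stream 'select(.[0][0] == "items")' "$tmp/mini.json")
printf '%s\n' "$items_events"
```

The second command drops only the [["other"],"x"] event; everything that describes "items", including its closing events, is kept.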


Note: Although it works for the sample data, do not try to shortcut using

jq --stream --null-input 'fromstream(inputs).items' file.json

directly on your large JSON file: it reconstructs the entire input JSON document before .items is applied, thus defeating the purpose of using --stream

(clarified by @peak)
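
A related caveat, as far as I can tell from how fromstream decides when to emit: the first filter in this answer only produces output once it sees the top-level closing event [["items"]], which --stream emits only when "items" happens to be the last key of the document (as it is in the sample). A variant that should not depend on key order strips the leading "items" path component with truncate_stream before rebuilding. This is a sketch on a made-up miniature file, and it still holds the whole items array in memory.

```shell
# Hypothetical input in which "items" is NOT the last top-level key.
tmp=$(mktemp -d)
printf '%s\n' '{"items":[{"code":"AA"},{"code":"BB"}],"other":"x"}' > "$tmp/mini.json"

# 1 | truncate_stream(...) drops the first path component ("items") and
# discards events whose path is not longer than 1, so fromstream sees a
# stream describing just the array and emits it when the array closes.
result=$(jq -c --stream --null-input '
  fromstream(1 | truncate_stream(inputs | select(.[0][0] == "items")))
' "$tmp/mini.json")
printf '%s\n' "$result"
# [{"code":"AA"},{"code":"BB"}]
```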

CodePudding user response:

If a stream of the {code, name, prices} objects is acceptable, then you could go with:

< input.json jq --stream -n '
   fromstream( 2 | truncate_stream(inputs | select(.[0][0] == "items")) )'

This would have minimal memory requirements; whether the saving is significant depends on the value of .items|length.
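
One way to use such a stream, sketched under the assumption that the downstream script can read one JSON object per line: write the objects compactly with -c and process them line by line. The file names are placeholders, and the input is a cut-down copy of the sample from the question.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file.json" <<'EOF'
{
  "version": null,
  "items": [
    { "code": "AA", "name": "Example 1" },
    { "code": "BB", "name": "Example 2" }
  ]
}
EOF

# -c writes each reconstructed object on its own line (newline-delimited JSON).
jq -c --stream -n '
  fromstream(2 | truncate_stream(inputs | select(.[0][0] == "items")))
' "$tmp/file.json" > "$tmp/file.items.ndjson"

cat "$tmp/file.items.ndjson"
# {"code":"AA","name":"Example 1"}
# {"code":"BB","name":"Example 2"}

# A later jq call over that file handles one object at a time.
codes=$(jq -r '.code' "$tmp/file.items.ndjson")
printf '%s\n' "$codes"
```

Note that the result is a sequence of objects rather than a single JSON array, so a consumer that expects the original [ ... ] wrapper would need adjusting.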
