Extracting text from a JSON file into specified output using Python

Time:01-05

I have a JSON text file from Qualtrics that looks like this (as an example, this is one variable I pulled from the file):

{
      "SurveyID": "SV_8v79iA9BlgTnAnH",
      "Element": "SQ",
      "PrimaryAttribute": "QID7",
      "SecondaryAttribute": "Do you use similar websites or resources to accomplish the objectives you have in using Open Data...",
      "TertiaryAttribute": null,
      "Payload": {
        "QuestionText": "Do you use similar websites or resources to accomplish the objectives you have in using Open Data Flint?",
        "DefaultChoices": false,
        "DataExportTag": "Similar",
        "QuestionID": "QID7",
        "QuestionType": "MC",
        "Selector": "SAVR",
        "SubSelector": "TX",
        "DataVisibility": {
          "Private": false,
          "Hidden": false
        },
        "Configuration": {
          "QuestionDescriptionOption": "UseText"
        },
        "QuestionDescription": "Do you use similar websites or resources to accomplish the objectives you have in using Open Data...",
        "Choices": {
          "1": {
            "Display": "Yes"
          },
          "2": {
            "Display": "No"
          }
        },
        "ChoiceOrder": [
          1,
          2
        ],
        "Validation": {
          "Settings": {
            "ForceResponse": "ON",
            "ForceResponseType": "ON",
            "Type": "None"
          }
        },
        "GradingData": [],
        "Language": [],
        "NextChoiceId": 3,
        "NextAnswerId": 1
      }
    },

I want to extract only the text of the QuestionText and QuestionID fields so that the output looks exactly like this:

*
name = QID7
text = 
Do you use similar websites or resources to accomplish the objectives you have in using Open Data Flint?                          
*

Here is my code so far but I'm getting an error that the list indices must be integers or slices, not str:

import json

with open('flint.json', 'r') as myfile:
    data=myfile.read()

# parse file
obj = json.loads(data)

print("name = " + str(obj["SurveyElements"]["Payload"]["QuestionID"]), "text = " + str(obj["SurveyElements"]["Payload"]["QuestionText"]))

How can I create a Python script that will extract the information I want and output the results in the format I need so that the asterisks, 'name =', 'text =', line breaks, and clean text replicate the above output? Will I need to use regex to get what I need? Or apply multiple conditions per line until the conditions are satisfied?

CodePudding user response:

It definitely helped that you eventually added the structure of the JSON file.

After looking at your approach, I would just note that you need to account for the survey elements being in a list of dictionaries.
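As a minimal sketch of why the original lookup fails, the snippet below builds a tiny stand-in for the parsed file. The inline `sample` dictionary is made up for illustration and is far smaller than a real export. Indexing the list with a string key raises the TypeError from the question, while indexing by position first reaches the dictionary that holds "Payload".

```python
# Hypothetical stand-in for the parsed file: "SurveyElements" holds a LIST of dicts.
sample = {
    "SurveyElements": [
        {"Element": "SQ", "Payload": {"QuestionID": "QID7", "QuestionText": "Example question?"}},
    ]
}

elements = sample["SurveyElements"]

try:
    # Mirrors the lookup in the question: a string key applied to a list.
    elements["Payload"]
except TypeError as exc:
    error_message = str(exc)
    print(f"TypeError: {error_message}")

# Indexing by position first reaches the dictionary that contains "Payload".
first_id = elements[0]["Payload"]["QuestionID"]
print(first_id)
```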

Below is an example that gets what you want. I assumed you are only interested in survey questions, which is why I included the check if element["Element"] == "SQ" (only these elements include QuestionID and QuestionText).

import json

# open json file
with open('flint.json', 'r') as myfile:
    data = myfile.read()

# load json
obj = json.loads(data)

# create a list of dictionaries,
# that contains only the survey elements
survey_elements_list = obj["SurveyElements"]

# iterate through the list
# and only look at survey questions
# checking if element["Element"] == "SQ"
for element in survey_elements_list:
    if element["Element"] == "SQ":
        question_id = element["Payload"]["QuestionID"]
        question_text = element["Payload"]["QuestionText"]
        print("*")
        print(f"name = {question_id}")
        print(f"text =\n{question_text}")
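The requested output also has an asterisk after the question text, which the loop above does not print. Here is one possible sketch that builds each block as a string with both asterisks, so the result can be printed or written to a file. The function names `format_question` and `format_survey` and the sample data are made up for illustration, and wrapping every question in its own pair of asterisks is an assumption, since the question shows only a single block. No regex is needed, because the json module has already parsed the structure.

```python
def format_question(question_id, question_text):
    """Return one block in the layout shown in the question."""
    # strip() drops leading/trailing whitespace around the question text.
    return f"*\nname = {question_id}\ntext =\n{question_text.strip()}\n*"


def format_survey(survey_elements):
    """Format every survey question ("SQ") element, one block per question."""
    blocks = []
    for element in survey_elements:
        if element.get("Element") == "SQ":
            payload = element["Payload"]
            blocks.append(format_question(payload["QuestionID"], payload["QuestionText"]))
    return "\n".join(blocks)


# Made-up sample data in the same shape as the file from the question.
sample_elements = [
    {"Element": "SQ", "Payload": {"QuestionID": "QID7", "QuestionText": "Do you use similar websites?"}},
    {"Element": "FL", "Payload": {"Type": "Root"}},
]
result = format_survey(sample_elements)
print(result)
```

With the real file, the same call would be format_survey(obj["SurveyElements"]), and the returned string could be written out with an ordinary open(..., "w") block.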

CodePudding user response:

The QuestionText and QuestionID fields are nested within the Payload dictionary. You need to index into that dictionary before accessing those fields.

Your print should look like the following:

print(f"name = {obj['Payload']['QuestionID']}, text = {obj['Payload']['QuestionText']}")

Edit: The full JSON file shows more layers of nesting that we need to go through to access a specific field. Accessing these fields uses roughly the same format as above, but with a couple extra index accesses (I've also edited the response to use format-strings, rather than string concatenation):

for survey_element in obj["SurveyElements"]:
    survey_element_payload = survey_element["Payload"]
    if "QuestionID" in survey_element_payload and "QuestionText" in survey_element_payload:
        print(f"name = {survey_element_payload['QuestionID']}, text = {survey_element_payload['QuestionText']}")
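The membership test in this loop raises a TypeError if a payload is None, and it silently finds nothing if a payload is a list. Whether the real file contains such elements is an assumption, since the question shows only one element. A defensive sketch, using a hypothetical helper named `iter_questions` and made-up sample data, could look like this:

```python
def iter_questions(survey_elements):
    """Yield (QuestionID, QuestionText) pairs from elements that have both keys."""
    for element in survey_elements:
        payload = element.get("Payload")
        # Skip payloads that are missing, None, or a list rather than a dict.
        if not isinstance(payload, dict):
            continue
        if "QuestionID" in payload and "QuestionText" in payload:
            yield payload["QuestionID"], payload["QuestionText"]


# Made-up elements covering the cases the guard is meant to handle.
sample_elements = [
    {"Element": "SQ", "Payload": {"QuestionID": "QID7", "QuestionText": "Example question?"}},
    {"Element": "XX", "Payload": None},
    {"Element": "YY", "Payload": [{"Type": "Root"}]},
    {"Element": "ZZ"},
]
pairs = list(iter_questions(sample_elements))
for question_id, question_text in pairs:
    print(f"name = {question_id}, text = {question_text}")
```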