Home > Enterprise >  Why does mongoimport work for CSV file but not JSON file with the exact same data?
Why does mongoimport work for CSV file but not JSON file with the exact same data?

Time:01-12

I have a very large data set that needs a database, and I am using Windows10 PowerShell as part of my development environment for the CLI and shell interface with both MongoDB and Python;

I have chosen MongoDB as the best solution for the needs of this data and future analysis with Python;

Within the data, I have Python tuples to contain units of 3 numbers with each separated by a comma - (x, y, z) - and therefore it was necessary to choose semicolons ; as the delimiter for the CSV file; Furthermore, MongoDB also has issues with parentheses, so I refactored to be Python Lists/Arrays [x, y, z]; However, MongoDB apparently has an issue with this, so I simplified the data to be a simple text string representation "[x, y, z]" in order to first isolate the issues described here below:

Evidently, MongoDB has an issue with importing CSV files with semicolons as delimiters, so I have had to recalculate the data to a new CSV file using tabs "\t" as the delimiter;

Using mongosh shell commands below (together with the status code messages), I am able to successfully mongoimport the CSV file to create the database with the collection;

./mongoimport --type tsv -d testdbcsv -c testcollection --headerline --drop dataCSV.csv
PS C:\mongodbdbtools\bin> ./mongoimport --type tsv -d testdbcsv -c testcollection --headerline --drop dataCSV.csv
2022-01-10T11:24:01.487 0200    connected to: mongodb://localhost/
2022-01-10T11:24:01.535 0200    dropping: testdbcsv.testcollection
2022-01-10T11:24:01.611 0200    30 document(s) imported successfully. 0 document(s) failed to import.
PS C:\mongodbdbtools\bin>

For the challenge - and to test and learn new aspects of MongoDB - I also created a JSON file of the exact same data in the CSV in order to see and test if the two MongoDB databases that I create from two separate files (with the same data) would produce identical MongoDB databases;

I have already tested and validated the JSONObjectForExport as a valid JSON file which is a JSON text string containing a List/Array of Dictionaries/Objects;

However, using the two Windows PowerShell commands to mongoimport the JSON file, I am getting the following messages depending upon if I add the --jsonArray argument;

./mongoimport -d testdbjson -c testcollection --drop dataJSON.json --jsonArray
PS C:\mongodbdbtools\bin> ./mongoimport -d testdbjson -c testcollection --drop dataJSON.json --jsonArray
2022-01-10T11:25:05.513 0200    connected to: mongodb://localhost/
2022-01-10T11:25:05.562 0200    dropping: testdbjson.testcollection
2022-01-10T11:25:05.563 0200    Failed: error reading separator after document #1: bad JSON array format - found no opening bracket '[' in input source
2022-01-10T11:25:05.563 0200    0 document(s) imported successfully. 0 document(s) failed to import.
PS C:\mongodbdbtools\bin> 
./mongoimport -d testdbjson -c testcollection --drop dataJSON.json
PS C:\mongodbdbtools\bin> ./mongoimport -d testdbjson -c testcollection --drop dataJSON.json
2022-01-10T11:25:24.816 0200    connected to: mongodb://localhost/
2022-01-10T11:25:24.864 0200    dropping: testdbjson.testcollection
2022-01-10T11:25:24.867 0200    Failed: cannot decode string into a primitive.D
2022-01-10T11:25:24.867 0200    0 document(s) imported successfully. 0 document(s) failed to import.
PS C:\mongodbdbtools\bin>

Here is an example of a simplified JSON file with the JSON text string format:

"[{\"KeyField1\": 0, \"KeyField2\": \"[0, 1, 2]\"}]"

So the CSV file imports fine; the JSON file does not import even though it is valid JSON of the exact same data that is being written to both CSV and JSON files;

What could be the issue(s) here?

Thanks for the help;

EDIT: More details of the Python code that produces the JSON File per request to clarify:

## CONVERT/SERIALIZE DICTIONARY TO JSON:
JSONObjectForExport = json.dumps(ListOfDictionaries)

## BEGIN VALIDATE JSON
## BEGIN DEFINE FUNCTION
def validateJSON(jsonData):
    try:
        json.loads(jsonData)
    except ValueError as err:
        return(False)
    return(True)

## END DEFINE FUNCTION

## CALL FUNCTION
isValid = validateJSON(JSONObjectForExport)

## TEST PRINT OUTPUT
print("Given JSON string is Valid", isValid)

## END VALIDATE JSON

## WRITE DATA AS JSON FILE - 
with open("dataJSON.json", "w") as f_json:
    json.dump(JSONObjectForExport, f_json) 

EDIT: Summary of facts and issues:

1.) Python calculates data and writes the same data to valid CSV and valid JSON format files.

2.) MongoDB successfully imports data from CSV to create db.

3.) MongoDB fails in two ways with two different error messages because neither code works to import same data in JSON format.

CodePudding user response:

So I tested a few things, and even though the above JSONObjectForExport is valid JSON produced by Python, for whatever reason, MongoDB does not like the backslash escaped quotes, so I will investigate that next and update here asap:

The following work:

FOR SINGLE JSON OBJECTS:

./mongoimport -d testdbjson -c testcollection --drop dataJSON.json

This code above works for the JSON Object below:

{"KeyField1": 0, "KeyField2": [0, 1, 2]}

OR

{"KeyField1": 0, "KeyField2": "[0, 1, 2]"}

FOR ARRAYS OF JSON OBJECTS:

./mongoimport -d testdbjson -c testcollection --drop dataJSON.json --jsonArray

This code above works for the JSON Object as array below:

[{"KeyField1": 0, "KeyField2": [0, 1, 2]}]

OR

[{"KeyField1": 0, "KeyField2": "[0, 1, 2]"}]

CodePudding user response:

Okay, so after testing a few things this morning, the issue was double encoding of the data to JSON as per issue here with an alternate solution:

Dump to JSON adds additional double quotes and escaping of quotes

Simply put: it is unnecessary to use both json.dumps() and json.dump().

The following code showing that I commented out the unnecessary json.dumps() step (and replaced it with just dumping the ListOfDictionaries) solves the problem, and now MongoDB imports the JSON Object fine.

CONCLUSION / THEORY to be tested and confirmed: MongoDB does not like the double-encoding with backslash-escaped characters:

## CONVERT/SERIALIZE DICTIONARY TO JSON:
## JSONObjectForExport = json.dumps(ListOfDictionaries) ## UNNECESSARY TO ENCODE JSON HERE
JSONObjectForExport = ListOfDictionaries

## WRITE DATA AS JSON FILE - 
with open("dataJSON.json", "w") as f_json:
    json.dump(JSONObjectForExport, f_json)
  •  Tags:  
  • Related