I have a very large data set that needs a database, and I am using Windows10 PowerShell as part of my development environment for the CLI and shell interface with both MongoDB and Python;
I have chosen MongoDB as the best solution for the needs of this data and future analysis with Python;
Within the data, I have Python tuples to contain units of 3 numbers with each separated by a comma - (x, y, z) - and therefore it was necessary to choose semicolons ; as the delimiter for the CSV file; Furthermore, MongoDB also has issues with parentheses, so I refactored to be Python Lists/Arrays [x, y, z]; However, MongoDB apparently has an issue with this, so I simplified the data to be a simple text string representation "[x, y, z]" in order to first isolate the issues described here below:
Evidently, MongoDB has an issue with importing CSV files with semicolons as delimiters, so I have had to recalculate the data to a new CSV file using tabs "\t" as the delimiter;
Using mongosh shell commands below (together with the status code messages), I am able to successfully mongoimport the CSV file to create the database with the collection;
./mongoimport --type tsv -d testdbcsv -c testcollection --headerline --drop dataCSV.csv
PS C:\mongodbdbtools\bin> ./mongoimport --type tsv -d testdbcsv -c testcollection --headerline --drop dataCSV.csv
2022-01-10T11:24:01.487 0200 connected to: mongodb://localhost/
2022-01-10T11:24:01.535 0200 dropping: testdbcsv.testcollection
2022-01-10T11:24:01.611 0200 30 document(s) imported successfully. 0 document(s) failed to import.
PS C:\mongodbdbtools\bin>
For the challenge - and to test and learn new aspects of MongoDB - I also created a JSON file of the exact same data in the CSV in order to see and test if the two MongoDB databases that I create from two separate files (with the same data) would produce identical MongoDB databases;
I have already tested and validated the JSONObjectForExport as a valid JSON file which is a JSON text string containing a List/Array of Dictionaries/Objects;
However, using the two Windows PowerShell commands to mongoimport the JSON file, I am getting the following messages depending upon if I add the --jsonArray argument;
./mongoimport -d testdbjson -c testcollection --drop dataJSON.json --jsonArray
PS C:\mongodbdbtools\bin> ./mongoimport -d testdbjson -c testcollection --drop dataJSON.json --jsonArray
2022-01-10T11:25:05.513 0200 connected to: mongodb://localhost/
2022-01-10T11:25:05.562 0200 dropping: testdbjson.testcollection
2022-01-10T11:25:05.563 0200 Failed: error reading separator after document #1: bad JSON array format - found no opening bracket '[' in input source
2022-01-10T11:25:05.563 0200 0 document(s) imported successfully. 0 document(s) failed to import.
PS C:\mongodbdbtools\bin>
./mongoimport -d testdbjson -c testcollection --drop dataJSON.json
PS C:\mongodbdbtools\bin> ./mongoimport -d testdbjson -c testcollection --drop dataJSON.json
2022-01-10T11:25:24.816 0200 connected to: mongodb://localhost/
2022-01-10T11:25:24.864 0200 dropping: testdbjson.testcollection
2022-01-10T11:25:24.867 0200 Failed: cannot decode string into a primitive.D
2022-01-10T11:25:24.867 0200 0 document(s) imported successfully. 0 document(s) failed to import.
PS C:\mongodbdbtools\bin>
Here is an example of a simplified JSON file with the JSON text string format:
"[{\"KeyField1\": 0, \"KeyField2\": \"[0, 1, 2]\"}]"
So the CSV file imports fine; the JSON file does not import even though it is valid JSON of the exact same data that is being written to both CSV and JSON files;
What could be the issue(s) here?
Thanks for the help;
EDIT: More details of the Python code that produces the JSON File per request to clarify:
## CONVERT/SERIALIZE DICTIONARY TO JSON:
JSONObjectForExport = json.dumps(ListOfDictionaries)
## BEGIN VALIDATE JSON
## BEGIN DEFINE FUNCTION
def validateJSON(jsonData):
try:
json.loads(jsonData)
except ValueError as err:
return(False)
return(True)
## END DEFINE FUNCTION
## CALL FUNCTION
isValid = validateJSON(JSONObjectForExport)
## TEST PRINT OUTPUT
print("Given JSON string is Valid", isValid)
## END VALIDATE JSON
## WRITE DATA AS JSON FILE -
with open("dataJSON.json", "w") as f_json:
json.dump(JSONObjectForExport, f_json)
EDIT: Summary of facts and issues:
1.) Python calculates data and writes the same data to valid CSV and valid JSON format files.
2.) MongoDB successfully imports data from CSV to create db.
3.) MongoDB fails in two ways with two different error messages because neither code works to import same data in JSON format.
CodePudding user response:
So I tested a few things, and even though the above JSONObjectForExport is valid JSON produced by Python, for whatever reason, MongoDB does not like the backslash escaped quotes, so I will investigate that next and update here asap:
The following work:
FOR SINGLE JSON OBJECTS:
./mongoimport -d testdbjson -c testcollection --drop dataJSON.json
This code above works for the JSON Object below:
{"KeyField1": 0, "KeyField2": [0, 1, 2]}
OR
{"KeyField1": 0, "KeyField2": "[0, 1, 2]"}
FOR ARRAYS OF JSON OBJECTS:
./mongoimport -d testdbjson -c testcollection --drop dataJSON.json --jsonArray
This code above works for the JSON Object as array below:
[{"KeyField1": 0, "KeyField2": [0, 1, 2]}]
OR
[{"KeyField1": 0, "KeyField2": "[0, 1, 2]"}]
CodePudding user response:
Okay, so after testing a few things this morning, the issue was double encoding of the data to JSON as per issue here with an alternate solution:
Dump to JSON adds additional double quotes and escaping of quotes
Simply put: it is unnecessary to use both json.dumps() and json.dump().
The following code showing that I commented out the unnecessary json.dumps() step (and replaced it with just dumping the ListOfDictionaries) solves the problem, and now MongoDB imports the JSON Object fine.
CONCLUSION / THEORY to be tested and confirmed: MongoDB does not like the double-encoding with backslash-escaped characters:
## CONVERT/SERIALIZE DICTIONARY TO JSON:
## JSONObjectForExport = json.dumps(ListOfDictionaries) ## UNNECESSARY TO ENCODE JSON HERE
JSONObjectForExport = ListOfDictionaries
## WRITE DATA AS JSON FILE -
with open("dataJSON.json", "w") as f_json:
json.dump(JSONObjectForExport, f_json)
