I have a JSON file that I read in pandas and converted to a dataframe. I then exported this file as a CSV so I could edit it easier. Once finished, I read the CSV file back into a dataframe and then wanted to convert it back to a JSON file. However, in that process a whole lot of extra data was automatically added to my original list of dictionaries (the JSON file).
I'm sure I could hack together a fix, but wanted to know if anyone knows an efficient way to handle this process so that NO new data or columns are added to my original JSON data?
Original JSON (snippet):
[
{
"tag": "!= (not-equal-to operator)",
"definition": "",
"source": [
{
"title": "Compare Dictionaries",
"URL": "https://learning.oreilly.com/library/view/introducing-python-2nd/9781492051374/ch08.html#idm45795007002280"
}
]
},
{
"tag": "\"intelligent\" applications",
"definition": "",
"source": [
{
"title": "Why Machine Learning?",
"URL": "https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/ch01.html#idm45613685872600"
}
]
},
{
"tag": "# (pound sign)",
"definition": "",
"source": [
{
"title": "Comment with #",
"URL": "https://learning.oreilly.com/library/view/introducing-python-2nd/9781492051374/ch04.html#idm45795038172984"
}
]
},
CSV as a dataframe (an index was automatically added):
tag definition source
0 != (not-equal-to operator) [{'title': 'Compare Dictionaries', 'URL': 'htt...
1 "intelligent" applications [{'title': 'Why Machine Learning?', 'URL': 'ht...
2 # (pound sign) [{'title': 'Comment with #', 'URL': 'https://l...
3 $ (Mac/Linux prompt) [{'title': 'Test Driving Python', 'URL': 'http...
4 $ (anchor) [{'title': 'Patterns: Using Specifiers', 'URL'...
... ... ... ...
11375 { } (curly brackets) []
11376 | (vertical bar) [{'title': 'Combinations and Operators', 'URL'...
11377 || (concatenation) function (DB2/Oracle/Postgr... [{'title': 'Discussion', 'URL': 'https://learn...
11378 || (for Oracle Database) [{'title': 'Including special characters', 'UR...
11379 || (vertical bar, double), concatenation opera... [{'title': 'Including special characters', 'UR...
7009 rows × 3 columns
JSON file after converting from CSV (all sorts of awful):
{
"0":{
"Unnamed: 0":0,
"tag":"!= (not-equal-to operator)",
"definition":null,
"source":"[{'title': 'Compare Dictionaries', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch08.html#idm45795007002280'}]"
},
"1":{
"Unnamed: 0":1,
"tag":"\"intelligent\" applications",
"definition":null,
"source":"[{'title': 'Why Machine Learning?', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/ch01.html#idm45613685872600'}]"
},
"2":{
"Unnamed: 0":2,
"tag":"# (pound sign)",
"definition":null,
"source":"[{'title': 'Comment with #', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch04.html#idm45795038172984'}]"
},
Here is my code:
import pandas as pd
import json
# to dataframe
tags_df = pd.read_json('dsa_tags_flat.json')
# csv file was manually cleaned then reloaded here
cleaned_csv_df = pd.read_csv('dsa-parser-flat.csv')
# write to JSON
cleaned_csv_df.to_json(r'dsa-tags.json', orient='index', indent=2)
EDIT: I added an index=false to the code when going from dataframe to CSV, which helped, but still have the index of keys there that were not in the original JSON. I wonder if a library function out somewhere would prevent this? Or do I have to just write some loops and remove them myself?
Also, as you can see, the URL forward-slashes were escaped. Not what I wanted.
{
"0":{
"tag":"!= (not-equal-to operator)",
"definition":null,
"source":"[{'title': 'Compare Dictionaries', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch08.html#idm45795007002280'}]"
},
"1":{
"tag":"\"intelligent\" applications",
"definition":null,
"source":"[{'title': 'Why Machine Learning?', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/ch01.html#idm45613685872600'}]"
},
"2":{
"tag":"# (pound sign)",
"definition":null,
"source":"[{'title': 'Comment with #', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch04.html#idm45795038172984'}]"
},
"3":{
"tag":"$ (Mac\/Linux prompt)",
"definition":null,
"source":"[{'title': 'Test Driving Python', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/data-wrangling-with\/9781491948804\/ch01.html#idm140080973230480'}]"
},
CodePudding user response:
The issue is that you are adding an index at two places.
Once while writing your file to csv. This adds the "Unnamed: 0" fields in the final JSON files. You can use index = False in the to_csv method while writing CSV to disk or specify the index_col parameter while reading the saved CSV in read_csv.
Secondly you are adding an index while writing the df to json with orient="index". This adds the outermost indices such as "0", "1" in the final JSON file. You should use orient="records" if you intend to save the json in a similar format to it was loaded in.
To understand how the orient parameter works, refer to pandas.DataFrame.to_json
