The ArticlesDataset.txt file contains the metadata for all documents. The unigramCount field holds every unique word in a document together with its number of occurrences. There are 1500 publications recorded in the file. Here is an example entry for a document:
{"creator":["Romain Allais","Julie Gobert"],
"datePublished":"2018-05-30",
"docType":"article",
"doi":"10.1051\/mattech\/2018010",
"id":"ark:\/\/27927\/phz10hn2bh3",
"isPartOf":"Mat\u00e9riaux & Techniques",
"issueNumber":"5-6",
"language":["eng"],
"outputFormat":["unigram","bigram","trigram"],
"pageCount":7,
"pagination":"pp. null-null",
"provider":"portico",
"publicationYear":2018,
"publisher":"EDP Sciences",
"sequence":3.0,
"tdmCategory":["Applied sciences -Engineering"],
"title":"Environmental assessment of PSS",
"url":"http:\/\/doi.org\/10.1051\/mattech\/2018010",
"volumeNumber":"105",
"wordCount":4446,
"unigramCount":{"others":1,"air":1,"networks,":1,"conventional":1,"IEEE":1}}
My goal is to extract the unigram counts for each document and store them in a suitable array. How can I do this using the fstream library?
How can I improve the code below to reach my goal?
std::string dummy;
std::ifstream data("PublicationsDataSet.txt");
while (data.good())
{
    getline(data, dummy, ',');
}
CodePudding user response:
Your question covers two different topics: parsing the data and storing it in memory.
For the first, you need a parser. You can write one yourself, which involves a tokenizer that converts each "keyword" into a token, followed by an interpreter that builds a data object from those tokens based on the token that precedes or follows the data, e.g.:
- '[' = start of an array, every value after this is part of the array
- ']' = end of the array, return to the previous parsing state
- ':' = separate key and values, left hand side is key, right hand side is value
- ...
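The first of those two stages, the tokenizer, could be sketched roughly like this. The names TokenKind, Token and tokenize are made up for this illustration, and the sketch makes simplifying assumptions: numbers are plain digit runs, escape sequences are not decoded (the backslash is dropped and the following character is kept), and anything unrecognised is skipped.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch. A complete JSON tokenizer would
// also need signed/fractional numbers, true/false/null and \uXXXX decoding.
enum class TokenKind { LBrace, RBrace, LBracket, RBracket, Colon, Comma, String, Number };

struct Token {
    TokenKind kind;
    std::string text;  // payload for String and Number tokens, empty otherwise
};

// Split the input into tokens, one per structural character, string or number.
std::vector<Token> tokenize(const std::string& input)
{
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        const char c = input[i];
        if (c == '"') {
            ++i;  // skip the opening quote
            std::string text;
            while (i < input.size() && input[i] != '"') {
                // Assumption: drop the backslash and keep the next character as-is.
                if (input[i] == '\\' && i + 1 < input.size()) ++i;
                text += input[i++];
            }
            ++i;  // skip the closing quote
            tokens.push_back({TokenKind::String, text});
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            std::string text;
            while (i < input.size() && std::isdigit(static_cast<unsigned char>(input[i]))) {
                text += input[i++];
            }
            tokens.push_back({TokenKind::Number, text});
        } else {
            switch (c) {
            case '{': tokens.push_back({TokenKind::LBrace, ""}); break;
            case '}': tokens.push_back({TokenKind::RBrace, ""}); break;
            case '[': tokens.push_back({TokenKind::LBracket, ""}); break;
            case ']': tokens.push_back({TokenKind::RBracket, ""}); break;
            case ':': tokens.push_back({TokenKind::Colon, ""}); break;
            case ',': tokens.push_back({TokenKind::Comma, ""}); break;
            default: break;  // whitespace and anything unrecognised is skipped
            }
            ++i;
        }
    }
    return tokens;
}
```

For example, tokenize("{\"air\":1}") yields five tokens: LBrace, String "air", Colon, Number "1", RBrace. The interpreter stage would then walk that token list and build the data object.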
This is a fine exercise to sharpen one's skills, but it is arduous and can turn into a never-ending road of bug fixes. As other comments also recommend, using an existing library is probably the easier route when time is short.
Another thing to point out: plain arrays in C are fixed in size, so since you are parsing the values you will most likely use std::vector, which allows insertion. Once you are done processing the file, if you really need to hand the data back as an array, you can do that directly from the vector:
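std::vector<YourObjectType> parsedObject;
// ... fill parsedObject while parsing the file ...
YourObjectType* arr = new YourObjectType[parsedObject.size()];
std::copy(parsedObject.begin(), parsedObject.end(), arr); // needs <algorithm>
// the caller owns arr and must release it with delete[] arr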
This is pseudocode; a lot depends on the implementation, but it gives the idea.
A starting point for writing a parser is this article, which goes into great detail on how a parser works and on its components. Mind you, every parser implements its own language (yes, C and other languages are all parsed too), so you will need to extend the concept with your own commands.
CodePudding user response:
Here's a simplified solution of what you could do using std::regex:
- Read the lines of a stream (std::cin in this case) one by one.
- Check if the line contains a unigramCount element.
- If that's the case, walk the different entries within the unigramCount element.
About the regular expressions used:
- "unigramCount":{}, allowing:
  - zero or more whitespaces basically everywhere, and
  - zero or more characters within the braces.
- "<key>":<value>, where:
  - <key> is one or more characters other than a double quote,
  - <value> is one or more digits, and
  - you could have whitespaces at both sides of the :.
A good data structure for storing your unigramCount entries could be a std::map.
#include <iostream> // cout
#include <map>
#include <regex> // regex_match, regex_search, sregex_iterator
#include <string> // stoi
int main()
{
    std::string line{};
    std::map<std::string, int> unigram_counts{};
    while (std::getline(std::cin, line))
    {
        // The whole line must be a "unigramCount":{...} element.
        const std::regex unigram_count_pattern{R"(^\s*\"unigramCount\"\s*:\s*\{.*\}\s*$)"};
        if (std::regex_match(line, unigram_count_pattern))
        {
            // Each entry is "<key>":<value>, with <value> made of digits only.
            const std::regex entry_pattern{R"(\"([^\"]+)\"\s*:\s*([0-9]+))"};
            for (auto entry_it{std::sregex_iterator(line.cbegin(), line.cend(), entry_pattern)};
                 entry_it != std::sregex_iterator{};
                 ++entry_it)
            {
                auto matches{*entry_it};
                auto& key{matches[1]};
                auto& value{matches[2]};
                unigram_counts[key] = std::stoi(value);
            }
        }
    }
    for (auto& [key, value] : unigram_counts)
    {
        std::cout << "'" << key << "' : " << value << "\n";
    }
}
// Outputs:
//
// 'IEEE' : 1
// 'air' : 1
// 'conventional' : 1
// 'networks,' : 1
// 'others' : 1
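Since the question asks for the counts of each document and for fstream, here is a rough variation on the same idea that keeps one std::map per document in a std::vector and can read from a std::ifstream. It is only a sketch under some assumptions: each document's "unigramCount":{...} object sits entirely on one line, no word contains a '}' or an escaped double quote, and every count fits in an int. The names UnigramCounts, read_unigram_counts, read_unigram_counts_from_string and read_unigram_counts_from_file are made up for this example.

```cpp
#include <fstream>
#include <istream>
#include <map>
#include <regex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias: the word -> count table of a single document.
using UnigramCounts = std::map<std::string, int>;

// Read a stream line by line and return one UnigramCounts per line that
// contains a "unigramCount":{...} object. Lines without one are skipped.
std::vector<UnigramCounts> read_unigram_counts(std::istream& in)
{
    // Custom raw-string delimiter "re" so the patterns may contain )" freely.
    static const std::regex block_pattern{R"re("unigramCount"\s*:\s*\{([^}]*)\})re"};
    static const std::regex entry_pattern{R"re("([^"]+)"\s*:\s*([0-9]+))re"};

    std::vector<UnigramCounts> documents;
    std::string line;
    while (std::getline(in, line)) {
        std::smatch block;
        if (!std::regex_search(line, block, block_pattern)) continue;

        // Only search for entries inside the braces of unigramCount, so that
        // other numeric fields such as "pageCount":7 are not picked up.
        const std::string body = block[1].str();
        UnigramCounts counts;
        for (std::sregex_iterator it{body.cbegin(), body.cend(), entry_pattern}, end;
             it != end; ++it) {
            counts[(*it)[1].str()] = std::stoi((*it)[2].str());
        }
        documents.push_back(std::move(counts));
    }
    return documents;
}

// Convenience wrapper for trying the parser on an in-memory string.
std::vector<UnigramCounts> read_unigram_counts_from_string(const std::string& text)
{
    std::istringstream in{text};
    return read_unigram_counts(in);
}

// Convenience wrapper for a file; returns an empty vector if it cannot be opened.
std::vector<UnigramCounts> read_unigram_counts_from_file(const std::string& path)
{
    std::ifstream file{path};
    return read_unigram_counts(file);
}
```

With this, read_unigram_counts_from_file("ArticlesDataset.txt") would give a vector whose element i holds the counts of the i-th document that has a unigramCount object. If the file turns out to break any of the assumptions above, a real JSON library is the safer choice.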
