Data wrangling/cleaning in python

Time:01-18

I am a beginner/intermediate with Python 3 and I want to make a project that will require a large data set, but I do not know the right terminology to ask the question.

The project will require me to obtain a list of ALL words (using the Roman alphabet, preferably with no diacritics) that an English speaker would use, rank them by popularity (how often they are used), and sort them into some table/db.

My initial thought was to use Google Ngram for the popularity ranking, but then how would I obtain the word list? I do not want to download a dictionary, because it wouldn't include words like "yeeted" (or any word that would get the forbidden red squiggly line under it). Maybe take a dictionary, scrape Urban Dictionary as well, then nuke the duplicates?

Another roadblock is how I would store this data. Would I use an XML file so I can append more data to each word as I need it, or a dictionary/table in another Python file? Timely processing of the data is an important factor in my program, which will make anywhere from 4 to 16 queries of the data set within a 3-5 minute timeframe. This is an area I am completely clueless in.

Any feedback on any questions would be immensely helpful. I am a broke 20-something so paying $400 for a refined dataset is not an option, but cheaper solutions would work.

CodePudding user response:

I posted this in a Discord chat and received the following responses, included here for reference.

Nightly Lights — Today at 1:09 PM not sure where you'd get the data from but storing it in... a database makes sense, since you'll probably be dealing with tens (hundreds?) of thousands of words

Dot — Today at 1:10 PM Yeah, would that be in SQL, or is there a way to store it in a file in the local directory the program runs in? I'd imagine SQL would complicate things.

Nightly Lights — Today at 1:11 PM Not entirely sure I understand the question, but SQLite is pretty simple. If you need more performance from the database, you'll have to go standalone with MySQL or Postgres. SQLite stores the db in a file adjacent to your program (well, you choose the location, but it's usually next to your Python stuff). Storing the word list in the Python file itself is a non-starter imo.

Dot — Today at 1:12 PM For sure. Do you think JSON or XML could be a usable option?

Nightly Lights — Today at 1:14 PM In theory, yes, but a database would be able to handle arbitrary queries and updates better

Dot — Today at 1:14 PM I see

Nightly Lights — Today at 1:14 PM it would probably be way faster though (the database)
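The SQLite suggestion above can be sketched with Python's built-in sqlite3 module. This is only a rough illustration, not anything posted in the chat: the table name `words`, its two columns, the helper names, and the sample frequencies are all made up for the example.

```python
import sqlite3


def open_word_db(path=":memory:"):
    """Open (or create) the database and make sure the words table exists.

    Pass a file name such as "words.db" (hypothetical) to keep the data on
    disk next to the program; ":memory:" is used here so the demo leaves
    no file behind.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        " word TEXT PRIMARY KEY,"
        " frequency INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def upsert_words(conn, rows):
    """Insert (word, frequency) pairs, replacing the row if the word exists."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO words (word, frequency) VALUES (?, ?)",
            rows,
        )


def top_words(conn, limit=10):
    """Return the most frequent words as (word, frequency) tuples."""
    cur = conn.execute(
        "SELECT word, frequency FROM words"
        " ORDER BY frequency DESC, word LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


conn = open_word_db()
# Placeholder numbers, not real usage counts.
upsert_words(conn, [("the", 1000), ("yeeted", 3), ("cat", 50)])
print(top_words(conn, 2))  # [('the', 1000), ('cat', 50)]
```

Because `word` is the primary key, lookups by word are indexed, and a handful of queries every few minutes is a very light load for SQLite. More columns can be added later with ALTER TABLE if extra data per word is needed.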

Dot — Today at 1:14 PM Okay, so SQLite it is. Any takers on how to obtain the list of words? I can tackle the ranking in another question.

zn — Today at 1:16 PM Your solution sounds fine: combine a reputable dictionary and an Urban Dictionary word list.

Dot — Today at 1:16 PM Okay Lmao

zn — Today at 1:16 PM https://github.com/mattbierner/urban-dictionary-word-list (script and sample dataset of all Urban Dictionary entry names, around 1.4 million total)

you can use this script that he wrote

Dot — Today at 1:17 PM ooOOoo Grassy ass I mean gracias

Nightly Lights — Today at 1:17 PM Lmao

Dot — Today at 1:17 PM 1.4 mil

zn — Today at 1:17 PM the current list in the repo is outdated, so you need to run the script urself to update it

Sunn — Today at 1:17 PM https://en.wikipedia.org/wiki/Lists_of_English_words Maybe scrape these pages.


Sunn — Today at 1:17 PM Also http://www.mieliestronk.com/corncob_lowercase.txt
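Combining the word lists mentioned above and dropping duplicates could look roughly like the sketch below. It assumes each source is a plain-text list with one entry per line, and it keeps only entries made entirely of the letters a-z, which matches the "no diacritics" requirement from the question. The function names and the sample lines are invented for illustration.

```python
import re

# Only plain lowercase a-z: no diacritics, digits, spaces, or punctuation.
ASCII_WORD = re.compile(r"[a-z]+")


def normalize(lines):
    """Lowercase each line and keep it only if it is a plain a-z word."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        if ASCII_WORD.fullmatch(word):
            words.add(word)
    return words


def merge_word_lists(*sources):
    """Union any number of word sources; the set removes duplicates."""
    merged = set()
    for source in sources:
        merged |= normalize(source)
    return sorted(merged)


# Stand-ins for open("dictionary.txt") and open("slang.txt");
# both file names are hypothetical.
dictionary_lines = ["Apple\n", "cat\n", "café\n"]
slang_lines = ["yeeted\n", "cat\n", "on fleek\n"]

print(merge_word_lists(dictionary_lines, slang_lines))
# ['apple', 'cat', 'yeeted']
```

Note that this filter also drops multi-word entries such as "on fleek" and anything containing an apostrophe or hyphen. Whether that is desirable depends on the project, so the pattern may need loosening.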

Nightly Lights — Today at 1:18 PM As for ranking, can't you just add up all the occurrences of each word and sort by the sum?
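The "add up and sort" idea can be sketched as follows, assuming the per-year figures for each word have already been collected into a dict. The function name and the sample numbers are made up. If the source reports relative frequencies rather than raw counts, the same sum still gives a rough ordering.

```python
def rank_by_total(counts_by_word):
    """Sum each word's per-year values and sort from most to least used.

    Ties are broken alphabetically so the order is deterministic.
    """
    totals = {word: sum(counts) for word, counts in counts_by_word.items()}
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Placeholder data: one list of yearly counts per word (not real figures).
sample = {
    "cat": [5, 7, 9],
    "the": [100, 120, 110],
    "yeeted": [0, 0, 2],
}

for word, total in rank_by_total(sample):
    print(word, total)
# the 330
# cat 21
# yeeted 2
```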

Sunn — Today at 1:18 PM Then you can query Ngram: https://books.google.com/ngrams/json?content=Churchill,Stalin&year_start=1800&year_end=2000&corpus=26&smoothing=3
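A query URL like the one above can be built for any batch of words with the standard library. The sketch below only constructs the URL; it does not send a request or parse a response, since the response format is not shown in the chat. The parameter names and default values are copied from the posted link, and the function name is hypothetical. This does not appear to be an officially documented API, so it may change or be rate-limited.

```python
from urllib.parse import urlencode

# Base address taken from the link posted in the chat.
NGRAM_URL = "https://books.google.com/ngrams/json"


def build_ngram_url(words, year_start=1800, year_end=2000,
                    corpus=26, smoothing=3):
    """Build a query URL for a batch of words.

    Defaults mirror the values in the posted link; what each corpus
    number means is not explained in the chat.
    """
    params = {
        "content": ",".join(words),  # comma-separated list of words
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,
    }
    # urlencode percent-escapes the commas, which is an equivalent encoding.
    return f"{NGRAM_URL}?{urlencode(params)}"


print(build_ngram_url(["Churchill", "Stalin"]))
```

Batching several words per request, as the posted link does with two names, would keep the number of requests down when working through a long word list.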
