How to remove dash/ hyphen from each line in .txt file

Time:02-08

I wrote a little program to turn pages from book scans into a .txt file. On some lines, a word is split with a hyphen and continues on the next line. Is there any way to remove the dashes and merge each split word with its remaining syllables on the line below?

E.g.:

effects on the skin is fully under-
stood one fights

to:

 effects on the skin is fully understood
 one fights

or:

effects on the skin is fully 
understood one fights

Or something like that, as long as the word ends up connected. Python is my third language and so far I can't think of anything, so maybe someone can give me a hint.

Edit: The point is that if the last character of a line is a dash, the dash should be removed and the word merged with the rest of the word on the line below.

CodePudding user response:

This is a generator which takes the input line-by-line. If it ends with a - it extracts the last word and holds it over for the next line. It then yields any held-over word from the previous line combined with the current line.

To combine the results back into a single block of text, you can join it against the line separator of your choice:

source = """effects on the skin is fully under-
stood one fights
check-out Daft Punk's new sin-
le "Get Lucky" if you hav-
e the chance. Sound of the sum-
mer."""

def reflow(text):
    holdover = ""
    for line in text.splitlines():
        if line.endswith("-"):
            lin, _, e = line.rpartition(" ")
        else:
            lin, e = line, ""
        yield f"{holdover}{lin}"
        holdover = e[:-1]

print("\n".join(reflow(source)))
""" which is:
effects on the skin is fully
understood one fights
check-out Daft Punk's new
single "Get Lucky" if you
have the chance. Sound of the
summer.
"""

To read one file line-by-line and write directly to a new file:

def reflow(infile, outfile):
    with open(infile) as source, open(outfile, "w") as dest:
        holdover = ""
        for line in source:
            line = line.rstrip("\n")
            if line.endswith("-"):
                lin, _, e = line.rpartition(" ")
            else:
                lin, e = line, ""
            dest.write(f"{holdover}{lin}\n")
            holdover = e[:-1]

if __name__ == "__main__":
    reflow("source.txt", "dest.txt")
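One case the loops above do not cover is input whose final line ends in a hyphen: the held-over fragment is never written out. A minimal sketch of one way to handle that, assuming you want the fragment kept on a line of its own (the name reflow_keep_tail is hypothetical, not part of the answer above):

```python
def reflow_keep_tail(text):
    """Like the generator above, but also emits a fragment left over at the end."""
    holdover = ""
    for line in text.splitlines():
        if line.endswith("-"):
            # split off the last word; it is carried over to the next line
            head, _, tail = line.rpartition(" ")
        else:
            head, tail = line, ""
        yield f"{holdover}{head}"
        holdover = tail[:-1]  # drop the trailing hyphen
    if holdover:
        # assumption: a dangling fragment is emitted as its own final line
        yield holdover


print("\n".join(reflow_keep_tail("one two thr-")))
```

Whether the dangling fragment should keep its hyphen or be dropped entirely depends on your scans; adjust the last two lines accordingly.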

CodePudding user response:

Here is one way to do it

with open('test.txt') as file:
    combined_strings = []
    merge_line = False
    for item in file:
        item = item.rstrip('\n')  # remove the newline character at the end of the line
        ends_with_dash = item.endswith('-')  # also safe for empty lines
        if ends_with_dash:
            item = item[:-1]  # drop the trailing dash
        if merge_line:
            # the previous line ended with a dash: join this line onto it
            combined_strings[-1] = combined_strings[-1] + item
        else:
            combined_strings.append(item)
        merge_line = ends_with_dash

CodePudding user response:

If you just parse the line as a string then you can utilize the .split() function to move around these kinds of items

words = "effects on the skin is fully under-\nstood one fights"
# split on the newlines
wordsSplit = words.split("\n")
# split each line on the spaces between words
for i in range(len(wordsSplit)):
    wordsSplit[i] = wordsSplit[i].split(" ")
# check the last word of every line (except the final line) for a trailing hyphen
for i in range(len(wordsSplit) - 1):
    if wordsSplit[i][-1].endswith("-"):
        # join the word with the first word of the next line and drop the hyphen
        wordsSplit[i][-1] = wordsSplit[i][-1][:-1] + wordsSplit[i + 1][0]
        wordsSplit[i + 1][0] = ""
# rebuild the string, skipping the emptied words
msg = ""
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        if wordsSplit[i][g] != "":
            msg += wordsSplit[i][g] + " "

This splits the text on the newlines, which is where the hyphens occur, and then splits each line into a smaller list of words. It then checks for a hyphen at the end of a line and, if it finds one, joins that word with the first word of the next line and sets that first word to an empty string. Finally, it rebuilds the string in a variable called msg, skipping any entry in the split lists that is an empty string.

CodePudding user response:

What about

import re

a = '''effects on the skin is fully under-
stood one fights'''

re.sub(r'-~([a-zA-Z0-9]*) ', r'\1\n', a.replace('\n', '~')).replace('~','\n')

Explanation

a.replace('\n', '~') joins the input string into one line, using ~ in place of \n. (Choose a different placeholder if your text can contain the ~ character.)

The regex r'-~([a-zA-Z0-9]*) ' (note the trailing space) then matches the hyphen, the placeholder, the rest of the split word, and the space after it. The parentheses form a capture group; '\1\n' in the replacement re-inserts the captured text, followed by a newline.

.replace('~','\n') finally replaces all remaining ~ characters with newlines.
