How to remove everything except a specific set of characters/words using regex in VS Code?

I have a huge dump that I downloaded from IMDb. Here's a tiny example from it:

    nm0000006   Ingrid Bergman  1915    1982    actress,soundtrack,producer   tt0036855,tt0077711,tt0038109,tt0034583
    nm0000007   Humphrey Bogart 1899    1957    actor,soundtrack,producer   tt0033870,tt0034583,tt0037382,tt0043265
    nm0000008   Marlon Brando   1924    2004    actor,soundtrack,director   tt0078788
    nm0000009   Richard Burton  1925    1984    actor,soundtrack,producer   tt0061184,tt0059749,tt0057877,tt0087803
    nm0000010   James Cagney    1899    1986    actor,soundtrack,director   tt0031867,tt0042041

Codes like "tt0029870" are the only things I need.

What regex should I use to remove everything except those tt0031867-type codes?

I need the dump to look like this: tt0036855tt0077711tt0038109tt0034583tt0036855tt0077711tt0038109tt0034583tt0036855tt0077711tt0038109tt0034583

I will use VS Code's find and replace with regex to do it.

CodePudding user response：

It obviously depends on the regex flavor you use.

Here is a solution.

The regex is .*?(tt\d+):

.* matches any number of any characters, but the ? modifier tells it to match as few as possible;
( and ) capture some matched text, the part we want to preserve in the substitution;
tt matches a literal tt;
\d matches a digit, and + tells it to match 1 or more (and, without ?, it matches as many as possible).

The modifiers applied to the regex are g, to repeat the matching over and over across the lines, and s, to make . match the newline character too. The replacement is the first capture group, which is written $1 in VS Code.
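To try the same pattern outside an editor, here is a minimal Python sketch. It assumes the dump is tab-separated and uses two rows from the question's sample; the variable names are only illustrative. re.sub replaces every match, which stands in for the g modifier, and re.DOTALL stands in for s.

```python
import re

# Two rows from the question's sample (assumed to be tab-separated).
sample = (
    "nm0000008\tMarlon Brando\t1924\t2004\tactor,soundtrack,director\ttt0078788\n"
    "nm0000010\tJames Cagney\t1899\t1986\tactor,soundtrack,director\ttt0031867,tt0042041\n"
)

# Lazily skip anything (including newlines, because of DOTALL) up to the
# next title ID, and capture the ID itself.
pattern = re.compile(r".*?(tt\d+)", flags=re.DOTALL)

# Replace each match with just the captured ID.
result = pattern.sub(r"\1", sample)

# Text after the last ID (here the final newline) is never matched,
# so it is stripped separately.
result = result.strip()

print(result)
```

Note that whatever follows the last ID is left untouched by the substitution, so a trailing newline may still need to be removed by hand.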

CodePudding user response：

/tt0029870/ will work in your case. In a database you can always use LIKE:

SELECT * FROM YOURTABLE WHERE code LIKE '%tt0029870%'
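As a rough illustration of the LIKE approach, here is a sketch using Python's built-in sqlite3 module. The table name `names` and its columns are hypothetical, and the rows are taken from the question's sample. LIKE selects the rows that contain a given ID; it does not extract the IDs themselves.

```python
import sqlite3

# In-memory database with a hypothetical table holding the ID list per person.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT, code TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [
        ("Humphrey Bogart", "tt0033870,tt0034583,tt0037382,tt0043265"),
        ("Marlon Brando", "tt0078788"),
    ],
)

# Find every row whose code column contains the given title ID.
rows = conn.execute(
    "SELECT name FROM names WHERE code LIKE ?", ("%tt0034583%",)
).fetchall()
print(rows)

conn.close()
```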