I have a very large dump which I downloaded from IMDb, and here's a tiny sample from it.
nm0000006 Ingrid Bergman 1915 1982 actress,soundtrack,producer tt0036855,tt0077711,tt0038109,tt0034583
nm0000007 Humphrey Bogart 1899 1957 actor,soundtrack,producer tt0033870,tt0034583,tt0037382,tt0043265
nm0000008 Marlon Brando 1924 2004 actor,soundtrack,director tt0078788
nm0000009 Richard Burton 1925 1984 actor,soundtrack,producer tt0061184,tt0059749,tt0057877,tt0087803
nm0000010 James Cagney 1899 1986 actor,soundtrack,director tt0031867,tt0042041
Those "tt0029870"-style codes are the only things I need.
How can I use a regex to remove everything except those tt0031867-type codes?
I need the dump to look like this: tt0036855tt0077711tt0038109tt0034583tt0036855tt0077711tt0038109tt0034583tt0036855tt0077711tt0038109tt0034583
I will use VS Code's find & replace with a regex to do it.
CodePudding user response:
It obviously depends on the regex flavor you use.
The regex is .*?(tt\d+):
.* will match any number of any characters, but the ? modifier tells it to match as few as possible; ( and ) capture some matched text, the part we want to preserve in the substitution; tt matches a literal tt; \d matches a digit, but + tells it to match 1 or more (and, without ?, it matches as many as possible).
The modifiers applied to the regex are g to repeat the matching over and over on the lines, and s to make . match the newline character too.
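As a rough illustration of the pattern above outside the editor, here is a minimal Python sketch. It assumes Python's re module, where re.DOTALL plays the role of the s modifier and re.sub already replaces every match (the g behaviour). The function names and the two-line SAMPLE are made up for this example; collect_ids is an alternative that simply gathers the codes instead of deleting around them.

```python
import re

# Two lines taken from the sample in the question.
SAMPLE = (
    "nm0000008 Marlon Brando 1924 2004 actor,soundtrack,director tt0078788\n"
    "nm0000010 James Cagney 1899 1986 actor,soundtrack,director tt0031867,tt0042041\n"
)


def strip_to_ids(text: str) -> str:
    """Replace each 'anything up to and including a code' with just the code.

    re.DOTALL lets '.' cross newlines, like the s modifier. Text after the
    last code (here the final newline) is not matched, so it is left in place.
    """
    return re.sub(r".*?(tt\d+)", r"\1", text, flags=re.DOTALL)


def collect_ids(text: str) -> str:
    """Alternative: find every tt code and concatenate them."""
    return "".join(re.findall(r"tt\d+", text))


if __name__ == "__main__":
    print(repr(strip_to_ids(SAMPLE)))
    print(collect_ids(SAMPLE))
```

Both functions give the same run of codes; the substitution version keeps whatever follows the final code, so the trailing newline survives there.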
CodePudding user response:
/tt0029870/ will work in your case. In a database you can always use LIKE:
SELECT * FROM YOURTABLE WHERE code LIKE '%tt0029870%'
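To make the LIKE suggestion concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table name names and the columns nconst and known_for are hypothetical, chosen only for this example; the rows come from the sample in the question. Note that this filters rows that mention a given code, it does not strip the other text out of the dump.

```python
import sqlite3

# In-memory database with a hypothetical table layout (not IMDb's schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (nconst TEXT, known_for TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [
        ("nm0000006", "tt0036855,tt0077711,tt0038109,tt0034583"),
        ("nm0000007", "tt0033870,tt0034583,tt0037382,tt0043265"),
        ("nm0000008", "tt0078788"),
    ],
)


def rows_mentioning(title_id: str) -> list:
    """Return the nconst of every row whose known_for contains title_id."""
    cur = conn.execute(
        "SELECT nconst FROM names WHERE known_for LIKE ? ORDER BY nconst",
        (f"%{title_id}%",),  # bound parameter instead of string concatenation
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    print(rows_mentioning("tt0034583"))
```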
