Multiline string substitution on bash stream

Time:01-19

I have a large (>500GB) Postgres dump in GCS that I would like to strip COPY commands from. Given the size of the dump I would like to perform the substitution on a gsutil cat stream rather than storing locally:

gsutil cat gs://mybucket.mydomain.com/path/to/mydump.sql | some_command > mydump-commands.sql

The COPY command can span multiple lines but always ends with \. on its own line.

I have tried with perl:

# some_command = 
perl -pe 'BEGIN{undef $/;} s/COPY.*\\.//smg'

This works for a small local sample file (below) but does not seem to work when streaming from stdin.

--
-- PostgreSQL database dump
--

-- Dumped from database version 14.1
-- Dumped by pg_dump version 14.1

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;

SET default_tablespace = '';

SET default_table_access_method = heap;

--
-- Name: mytable; Type: TABLE; Schema: dummy; Owner: bchrobot
--

CREATE TABLE dummy.mytable (
    id integer,
    title text
);


ALTER TABLE dummy.mytable OWNER TO bchrobot;

--
-- Data for Name: mytable; Type: TABLE DATA; Schema: dummy; Owner: bchrobot
--

COPY dummy.mytable (id, title) FROM stdin;
1   my first title
2   my second title
\.


--
-- PostgreSQL database dump complete
--

Any suggestions for this, specifically for working with streams?

CodePudding user response:

This sed command, as the replacement of some_command, will delete all lines between a line beginning with COPY and a line consisting of \., including those two lines.

sed '/^COPY/,/^\\\.$/d'
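As a quick sanity check, here is a minimal sketch that runs the same sed range delete over a tiny inline sample instead of the real dump. The sample text and the variable names (sample, filtered) are made up for illustration; in the actual pipeline the sed command would simply take the place of some_command.

```shell
# Hypothetical miniature dump; the real input would come from `gsutil cat ...`.
sample='CREATE TABLE dummy.mytable (id integer, title text);
COPY dummy.mytable (id, title) FROM stdin;
1 my first title
2 my second title
\.
-- PostgreSQL database dump complete'

# Delete every line from one starting with COPY through the line that is exactly "\."
filtered=$(printf '%s\n' "$sample" | sed '/^COPY/,/^\\\.$/d')
printf '%s\n' "$filtered"
```

Because sed works through its input one line at a time, this should behave the same way on a pipe as on a file.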

CodePudding user response:

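First off, you need to escape the period as well (\.), so that together with the escaped backslash the pattern becomes \\\. and matches a literal backslash followed by a literal period. And since the COPY line starts the block and the terminator sits on a line of its own, you should add the beginning-of-line ^ and end-of-line $ anchors as well.

Since you are streaming the input, you can't undef $/: that puts perl in slurp mode, so it will try to read all 500GB into memory before doing any substitution, and your program will probably just hang or run out of memory. You have to read in line-by-line mode.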

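You might try a flip-flop operator (shown here against a local test file, psql.txt; leave the filename off to read from stdin in your pipeline):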

perl -ne'print unless /^COPY/ ... /^\\\.$/' psql.txt

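The range operator ... (or ..) becomes true when the LHS pattern matches, stays true for every line after that up to and including the line where the RHS pattern matches, and is false otherwise. The two forms differ only in that .. also tests the RHS on the same line that matched the LHS, while ... waits until the next line.

So we print only the lines that fall outside such a range.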
