Home > Blockchain >  Load Redshift Spectrum external table with CSVs with differing column order
Load Redshift Spectrum external table with CSVs with differing column order

Time:01-20

This got no answer and I have a similar question, though I'll expand it.

Suppose I have 3 CSV files in s3://test_path/. I want to create an external table and populate it with the data in these CSVs. However, not only does column order differ across CSVs, but some columns may be missing from some CSVs.

Is Redshift Spectrum capable of doing what I want?

a.csv:

id,name,type
a1,apple,1
a2,banana,2

b.csv:

type,id,name
1,b1,orange
2,b2,lemon

c.csv:

name,id
kiwi,c1

I create the external database/schema and table by running this in Redshift query editor v2 on my Redshift cluster:

CREATE EXTERNAL SCHEMA test_schema
FROM DATA CATALOG
DATABASE 'test_db'
REGION 'region'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS
;
CREATE EXTERNAL TABLE test_schema.test_table (
  "id" VARCHAR,
  "name" VARCHAR,
  "type" SMALLINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://test_path/'
TABLE PROPERTIES ('skip.header.line.count'='1')
;

I expect SELECT * FROM test_schema.test_table to yield:

id name type
a1 apple 1
a2 banana 2
b1 orange 1
b2 lemon 2
c1 kiwi NULL

Instead I get:

id name type
a1 apple 1
a2 banana 2
1 b1 NULL
2 b2 NULL
kiwi c1 NULL

It seems Redshift Spectrum cannot match columns by name across files the way pandas.concat() can with data frames with differing column order.

CodePudding user response:

No, your data needs to be transformed to align the data between files. Your Spectrum DDL specifies that the first row of the CSVs is ignored so the information you need isn't even being read.

If you want to have these files usable as one Spectrum table you will need to transform them to align columns and store new files to S3. You can do this with Redshift if you also have a support piece of code reading the column order from each file. You could write a Lambda to do this easily or if your CSV files are fairly simple then a Glue crawler will work. Almost any ETL tool can do this as well. Lots of choices but these files are not Spectrum ready as is.

  •  Tags:  
  • Related