
Code conversion from Greenplum to Amazon Redshift: Dealing with arrays, dates, and common expressions


Amazon Redshift is a fully managed service for data lakes, data analytics, and data warehouses for startups, medium-sized enterprises, and large enterprises. Amazon Redshift is used by tens of thousands of customers around the globe to modernize their data analytics platforms.

Greenplum is an open-source, massively parallel database used for analytics, mostly on on-premises infrastructure. Greenplum is based on the PostgreSQL database engine.

Many customers have found migration from Greenplum to Amazon Redshift an attractive alternative to managing their on-premises Greenplum infrastructure.

Although both Greenplum and Amazon Redshift use the open-source PostgreSQL database engine, migration still requires a lot of planning and manual intervention. This post covers the key aspects and considerations of performing code conversion from Greenplum to Amazon Redshift. It's focused on the migration of procedures, functions, and views.

Solution overview

AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT) can migrate most of the objects in a heterogeneous database migration from Greenplum to Amazon Redshift. However, there are some situations where code conversion teams encounter errors and warnings for views, procedures, and functions while creating them in Amazon Redshift. To address this type of scenario, manual conversion of the code is required.

This post focuses on how to handle the following while migrating from Greenplum to Amazon Redshift:

  • Arrays
  • Dates and timestamps
  • Regular expressions (regex)

Note that for this post, we use Greenplum 4.3 and Amazon Redshift PostgreSQL 8.2.

Working with array functions

The AWS SCT doesn't convert array functions while migrating from Greenplum or PostgreSQL to Amazon Redshift. Developers have to extensively convert these functions manually. This post outlines the most common array functions:

  • ARRAY_UPPER
  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT and JSON_ARRAY_LENGTH
  • UNNEST()
  • STRING_AGG()
  • ANY ARRAY()

ARRAY_UPPER()

This function returns the upper bound of an array. It can be used to extract the nth element from an array in PostgreSQL or Greenplum.

The Greenplum code is as follows:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"','NULL'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['222-333-4444','201-301-4001','AAA-BBB-CCCC'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['333-444-5555','301-401-3001','DDD-EEE-FFFF'] as PhoneNumbers
)
Select Firstname, PhoneNumbers[ARRAY_UPPER(PhoneNumbers,1)]
from temp1

There is no function to extract an element from an array in Amazon Redshift; however, there are two JSON functions that can be used for this purpose:

  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT() – Returns a JSON array element in the outermost array of a JSON string
  • JSON_ARRAY_LENGTH() – Returns the number of elements in the outer array of a JSON string

See the following code:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['"222-333-4444"','"201-301-4001"','"AAA-BBB-CCCC"'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['"333-444-5555"','"301-401-3001"','"DDD-EEE-FFFF"'] as PhoneNumbers
)

Select
FirstName
,('['+array_to_string(phoneNumbers,',')+']') as JSONConvertedField
,JSON_EXTRACT_ARRAY_ELEMENT_TEXT
(
'['+array_to_string(phoneNumbers,',')+']'
,JSON_ARRAY_LENGTH('['+array_to_string(phoneNumbers,',')+']')-1
) as LastElementFromArray
from temp1
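For the first row, for example, JSON_ARRAY_LENGTH returns 3, so the query extracts the element at index 2 and returns XXX-YYY-ZZZZ as the last element.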

UNNEST()

UNNEST() is PostgreSQL's system function for semi-structured data, expanding an array, or a combination of arrays, to a set of rows. It was introduced to improve database performance when inserting, updating, or deleting thousands of records.

You can use UNNEST() with a basic array, multiple arrays, and multiple arrays of different lengths.

Some of the Amazon Redshift functions used to unnest arrays are split_part, json_extract_path_text, json_array_length, and json_extract_array_element_text.

In Greenplum, the UNNEST function is used to expand an array to a set of rows:

Select 'A', unnest(array[1,2])

Output
A 1
A 2

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
)

select
FirstName
,LastName
,unnest(array['Mobile'::text,'HomePhone'::text]) as PhoneType
,unnest(array[MobilePhone::text,HomePhone::text]) as PhoneNumber
from
temp1
order by 1,2,3

Amazon Redshift doesn't support the UNNEST function; you can use the following workaround:

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
),
ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select
FirstName
,LastName
,split_part('Mobile,Home',',',ns.n::int) as PhoneType
,split_part(MobilePhone|| '&&' || HomePhone, '&&', ns.n::int) as PhoneNumber
from
temp1, ns
where
ns.n<=regexp_count('Mobile,Home',',')+1
order by 1,2,3
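In this workaround, the ns CTE uses row_number() over the pg_tables catalog view purely as a convenient source of sequential integers; any table with at least as many rows as there are array elements to unnest would work.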

When the elements of the array are themselves arrays, use the JSON_EXTRACT_ARRAY_ELEMENT_TEXT() function and JSON_ARRAY_LENGTH:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["arrayelement1","arrayelement2"]',ns.n-1)
from ns
where
ns.n<=JSON_ARRAY_LENGTH('["arrayelement1","arrayelement2"]')

STRING_AGG()

The STRING_AGG() function is an aggregate function that concatenates a list of strings and places a separator between them. The function doesn't add the separator at the end of the string. See the following code:

STRING_AGG ( expression, separator [order_by_clause] )

The Greenplum code is as follows:

with temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
)

Select dept,STRING_AGG(FirstName||' '||LastName,' ; ') as Employees from temp1 group by dept order by 1

The Amazon Redshift equivalent of the STRING_AGG() function is LISTAGG(). This aggregate function orders the rows in each group according to the ORDER BY expression, then concatenates the values into a single string:

LISTAGG(expression [, separator]) [ WITHIN GROUP (ORDER BY order_list) ]

See the following code:

Create temporary table temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
);

Select dept,LISTAGG(FirstName||' '||LastName,' ; ') as Employees from temp1
group by dept
order by 1

ANY ARRAY()

The PostgreSQL ANY ARRAY() expression evaluates and compares the left-hand expression to each element in the array:

Select * from temp1 where DeptName = ANY (ARRAY['10-F','20-F','30-F'])

In Amazon Redshift, the same evaluation can be achieved with the IN operator:

Select * from temp1 where DeptName IN ('10-F','20-F','30-F')
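As a quick self-contained check (the department values here are made up for illustration), the following runs as-is on Amazon Redshift and returns only the 10-F and 20-F rows:

with temp1 as
(
Select '10-F'::text as DeptName
union all
Select '20-F'::text
union all
Select '40-F'::text
)
Select * from temp1 where DeptName IN ('10-F','20-F','30-F')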

Working with date functions

In this section, we discuss calculating date differences with date_part in Greenplum and datediff in Amazon Redshift.

When an application needs to calculate the difference between the subfields of dates in Greenplum, it uses the function date_part, which allows you to retrieve subfields such as year, month, week, and day. In the following example queries, we calculate the number of completion_days as the difference between originated_date and eco_date.
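The source table isn't reproduced in this post; the following is a minimal sketch that assumes a hypothetical projects table with originated_date and eco_date timestamp columns:

SELECT project_id,
date_part('day', eco_date::timestamp - originated_date::timestamp) as completion_days
FROM projects;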

To calculate the difference between date subfields, Amazon Redshift has the function datediff. The following queries show an example of how to calculate completion_days as the difference between eco_date and originated_date. DATEDIFF determines the number of date part boundaries that are crossed between the two expressions.
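Assuming the same hypothetical projects table, the Amazon Redshift version becomes:

SELECT project_id,
datediff(day, originated_date, eco_date) as completion_days
FROM projects;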

We compare the Greenplum and Amazon Redshift queries as follows.

The following Greenplum query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT date_part('year', TIMESTAMP '2009-01-01') - date_part('year', TIMESTAMP '2008-12-31') as year;

The following Amazon Redshift query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT datediff(year, '2008-12-31', '2009-01-01') as year;

The following Greenplum query returns 1 month between 2008-12-31 and 2009-01-01:

SELECT (date_part('year', '2009-01-01'::date) - date_part('year', '2008-12-31'::date)) * 12 +
(date_part('month', '2009-01-01'::date) - date_part('month', '2008-12-31'::date)) as month;

The following Amazon Redshift query returns 1 month between 2008-12-31 and 2009-01-01:

SELECT datediff(month, '2008-12-31', '2009-01-01') as month;

The following Greenplum query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT date_part('week', timestamp '2009-01-01') - date_part('week', timestamp '2008-12-31') as week;

The following Amazon Redshift query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT datediff(week, '2008-12-31', '2009-01-01') as week;

The following Greenplum query returns 1 day:

SELECT date_part('day', '2009-01-01 24:00:00'::timestamp - '2008-12-31 24:00:00'::timestamp) as day;

The following Amazon Redshift query returns 1 day:

SELECT datediff(day, '2008-12-31', '2009-01-01') as day;

The following Greenplum query returns 1 hour:

SELECT date_part('hour', '2009-01-01 22:56:10'::timestamp - '2008-12-31 21:54:55'::timestamp) as hour;

The following Amazon Redshift query returns 1 hour:

SELECT datediff(hour, '2009-01-01 21:56:10', '2009-01-01 22:56:10') as hour;

The following Greenplum query returns 3 minutes:

SELECT date_part('minute', '2009-01-01 22:56:10'::timestamp - '2009-01-01 21:53:10'::timestamp) as minutes;

The following Amazon Redshift query returns 1 minute:

SELECT datediff(minute, '2009-01-01 21:56:10', '2009-01-01 21:57:55') as minute;

The following Greenplum query returns 40 seconds:

SELECT date_part('second', '2009-01-01 22:56:50'::timestamp - '2009-01-01 21:53:10'::timestamp) as seconds;

The following Amazon Redshift query returns 45 seconds:

SELECT datediff(second, '2009-01-01 21:56:10', '2009-01-01 21:56:55') as seconds;

Now let's look at how to use Amazon Redshift to calculate days and weeks from a difference in seconds.

The following Amazon Redshift query displays 2 days (dividing the difference in seconds by 60*60*24, the number of seconds in a day):

SELECT datediff(second, '2008-12-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24) as days;

The following Amazon Redshift query displays 9 weeks:

SELECT datediff(second, '2008-10-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24*7) as weeks;

For Greenplum, the date subfields need to be in single quotes, whereas for Amazon Redshift, we can use date subfields such as year, month, week, day, minute, and second without quotes. For Greenplum, we have to subtract one date expression from another, whereas for Amazon Redshift we simply pass the two dates as comma-separated arguments.

Extract ISOYEAR from date

ISO 8601 defines a week-numbering year. It begins with the Monday of the week containing January 4th. So for dates in early January or late December, the ISO year may differ from the Gregorian year. An ISO year has 52 or 53 full weeks (364 or 371 days). The extra week is called a leap week; a year with such a week is called a leap year.

The following Greenplum query displays the ISOYEAR 2020:

SELECT extract(ISOYEAR from '2019-12-30'::date) as ISOYEARS;

The following Amazon Redshift query displays the ISOYEAR 2020:

SELECT to_char('2019-12-30'::date, 'IYYY') as ISOYEARS;

The generate_series() function

Greenplum has adopted the PostgreSQL function generate_series(). But the generate_series function works differently in Amazon Redshift when retrieving records from a table, because it's a leader node-only function.

To display a series of numbers in Amazon Redshift, run the following query on the leader node. In this example, it displays 10 rows, numbered 1–10:

SELECT generate_series(1,10);

To display a series of days for a given date, use the following query. It extracts the day from the given date and subtracts 1, to display a series of numbers from 0–6:

SELECT generate_series(0, extract(day from date '2009-01-07')::int - 1);

But for queries that fetch records from a table, join with another table's rows, and process data on the compute nodes, it doesn't work, and generates an Invalid Operation error message. The following code is an example of a SQL statement that works in Greenplum but fails in Amazon Redshift:

SELECT column_1
FROM table_1 t1
JOIN table_2 t2
ON t2.code = t1.code
CROSS JOIN generate_series(1,12) gen(fiscal_month)
WHERE condition_1

For Amazon Redshift, the solution is to create a table to store the series data, and rewrite the code as follows:

SELECT column1
FROM table_t1 t1
JOIN table_t2 t2
ON t2.code = t1.code
CROSS JOIN (select "number" as fiscal_month FROM table_t3 WHERE "number" <= 12) gen
WHERE condition_1
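The table_t3 numbers table referenced above is assumed to already exist. The following is a minimal sketch of one way to create and populate it, again using pg_tables only as a convenient source of sequential integers:

CREATE TABLE table_t3 ("number" int);

INSERT INTO table_t3
SELECT n FROM (SELECT row_number() over(order by 1) as n FROM pg_tables) seq
WHERE n <= 12;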

Working with regular expressions (regex functions)

Amazon Redshift and Greenplum both support three conditions for pattern matching:

  • LIKE
  • SIMILAR TO
  • POSIX operators

In this post, we don't discuss all of these pattern matching methods in detail. Instead, we discuss a few regex functions and regex escape characters that aren't supported by Amazon Redshift.

Regexp_split_to_table function

The regexp_split_to_table function splits a string using a POSIX regular expression pattern as the delimiter.

This function has the following syntax:

Regexp_split_to_table(string, pattern [, flags])

For Greenplum, we use the following query:

select regexp_split_to_table('bat,cat,hat', ',') as regexp_split_table_GP

For Amazon Redshift, the regexp_split_to_table function can be converted using the Amazon Redshift split_part function, following the same number-series pattern used earlier for UNNEST:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select split_part('bat,cat,hat', ',', ns.n::int) as regexp_split_table_RS
from ns
where
ns.n <= regexp_count('bat,cat,hat', ',') + 1

Another way to convert regexp_split_to_table is with the JSON array functions shown earlier:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["bat","cat","hat"]', ns.n::int - 1) as regexp_split_table_RS
from ns
where
ns.n <= JSON_ARRAY_LENGTH('["bat","cat","hat"]')

Substring from regex expressions

Substring (the string from the regex pattern) extracts the substring or value matching the pattern that's passed in. If there is no match, null is returned. For more information, refer to Pattern Matching.

We use the following code in Greenplum:

create temp table data1 ( col1 varchar );
insert into data1 values ('hellohowareyou 12687687abcd');
select substring( col1 from '[A-Za-z]+$') from data1;

We can use the regexp_substr function to convert this code to Amazon Redshift. It returns the characters extracted from a string by searching for a regular expression pattern. The syntax is as follows:

REGEXP_SUBSTR ( source_string, pattern [, position [, occurrence [, parameters ] ] ] )

select regexp_substr( col1, '[A-Za-z]+$') as substring_from_rs from data1
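Both the Greenplum and Amazon Redshift versions return abcd, the trailing alphabetic run in the sample string.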

Key points while converting regular expression escapes

The PostgreSQL escape prefix E doesn't work in Amazon Redshift. Additionally, the following Greenplum regular expression constraint escapes aren't supported in Amazon Redshift:

  • \m – Matches only at the beginning of a word
  • \y – Matches only at the beginning or end of a word

For Amazon Redshift, use \< and \>, or [[:<:]] and [[:>:]], instead.

Use the following code for Greenplum:

select col1,
case
when (col1) ~ E'\\m[0-9]{2}[A-Z]{1}[0-9]{1}' then
regexp_replace(col1, E'([0-9]{2})([A-Z]{1})([0-9]{1})', E'\\2')
else 'nothing'
end as regex_test
from temp1123

Use the following code for Amazon Redshift:

select col1,
case
when (col1) ~ '\\<[0-9]{2}[A-Z]{1}[0-9]{1}\\>' then
regexp_replace(col1, '([0-9]{2})([A-Z]{1})([0-9]{1})', '\\2')
else 'nothing'
end as regex_test
from temp1123

OR

select col1,
case
when (col1) ~ '[[:<:]][0-9]{2}[A-Z]{1}[0-9]{1}[[:>:]]' then
regexp_replace(col1, '([0-9]{2})([A-Z]{1})([0-9]{1})(.*)', '\\2')
else 'nothing'
end as regex_test
from temp1123

Conclusion

For a heterogeneous database migration from Greenplum to Amazon Redshift, you can use AWS DMS and the AWS SCT to migrate most of the database objects, such as tables, views, stored procedures, and functions.

There are some situations in which one function is used in the source environment and the target environment doesn't support the same function. In such cases, manual conversion is required to produce the same result set and complete the database migration.

In some cases, using a new window function supported by the target environment proves more efficient for analytical queries that process petabytes of data.

This post covered several situations where manual code conversion is required, which also improves code efficiency and makes queries more efficient.

If you have any questions or suggestions, please share your feedback.


About the Authors

Jagrit Shrestha is a Database Consultant at Amazon Web Services (AWS). He works as a database specialist, helping customers migrate their on-premises database workloads to AWS and providing technical guidance.

Ishwar Adhikary is a Database Consultant at Amazon Web Services (AWS). He works closely with customers to modernize their database and application infrastructures. His focus area is the migration of relational databases from on-premises data centers to the AWS Cloud.

Shrenik Parekh works as a Database Consultant at Amazon Web Services (AWS). His expertise includes database migration assessment and modernizing database environments with purpose-built databases using AWS cloud database services. He is also focused on AWS web services for data analytics. In his spare time, he enjoys hiking, yoga, and other outdoor activities.

Santhosh Meenhallimath is a Data Architect at AWS. He works on building analytical solutions, building data lakes, and migrating databases into AWS.
