Amazon Redshift is a fully managed service for data lakes, data analytics, and data warehouses for startups, medium enterprises, and large enterprises. Amazon Redshift is used by tens of thousands of businesses around the globe to modernize their data analytics platforms.
Greenplum is an open-source, massively parallel database used for analytics, mostly for on-premises infrastructure. Greenplum is based on the PostgreSQL database engine.
Many customers have found migrating from Greenplum to Amazon Redshift an attractive option instead of managing their on-premises Greenplum deployments.
Although both Greenplum and Amazon Redshift use the open-source PostgreSQL database engine, migration still requires a lot of planning and manual intervention. This post covers the key aspects and considerations while performing code conversion from Greenplum to Amazon Redshift. It is focused on the migration of procedures, functions, and views.
Solution overview
AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT) can migrate most of the objects in a heterogeneous database migration from Greenplum to Amazon Redshift. But there are some situations where code conversion teams encounter errors and warnings for views, procedures, and functions while creating them in Amazon Redshift. To address this type of situation, manual conversion of the code is required.
This post focuses on how to handle the following while migrating from Greenplum to Amazon Redshift:
- Arrays
- Dates and timestamps
- Regular expressions (regex)
Note that for this post, we use Greenplum 4.3 and Amazon Redshift PostgreSQL 8.2.
Working with array functions
AWS SCT doesn't convert array functions while migrating from Greenplum or PostgreSQL to Amazon Redshift. Developers have to convert these functions manually. This post outlines the most common array functions:
- ARRAY_UPPER
- JSON_EXTRACT_ARRAY_ELEMENT_TEXT and JSON_ARRAY_LENGTH
- UNNEST()
- STRING_AGG()
- ANY ARRAY()
ARRAY_UPPER()
This function returns the upper bound of an array. It can be used to extract the nth element from an array in PostgreSQL or Greenplum.
The Greenplum code is as follows:
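The original query isn't reproduced here; the following sketch (the string value and alias are assumptions) shows how ARRAY_UPPER can be combined with STRING_TO_ARRAY to pull the last element out of a delimited string:

```sql
-- Split a comma-separated string into an array and use ARRAY_UPPER to
-- locate and extract its last element
SELECT (string_to_array('10,20,30', ','))[
           array_upper(string_to_array('10,20,30', ','), 1)
       ] AS last_element;
```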
There is no function to extract an element from an array in Amazon Redshift; however, there are two JSON functions that can be used for this purpose:
- JSON_EXTRACT_ARRAY_ELEMENT_TEXT() – Returns a JSON array element in the outermost array of a JSON string
- JSON_ARRAY_LENGTH() – Returns the number of elements in the outer array of a JSON string
See the following code:
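The original code isn't shown here; a minimal Amazon Redshift sketch (the JSON literal is an assumption) that extracts the last element of an array looks like this:

```sql
-- JSON_ARRAY_LENGTH returns 3; subtracting 1 gives the zero-based index of
-- the last element, which JSON_EXTRACT_ARRAY_ELEMENT_TEXT then returns ('30')
SELECT json_extract_array_element_text(
           '["10","20","30"]',
           json_array_length('["10","20","30"]') - 1
       ) AS last_element;
```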
UNNEST()
UNNEST() is PostgreSQL's system function for semi-structured data, expanding an array, or a combination of arrays, to a set of rows. It was introduced to improve database performance when inserting, updating, or deleting thousands of records.
You can use UNNEST() for a basic array, multiple arrays, and multiple arrays with different lengths.
Some of the Amazon Redshift functions used to unnest arrays are split_part, json_extract_path_text, json_array_length, and json_extract_array_element_text.
In Greenplum, the UNNEST function is used to expand an array to a set of rows:
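The original statement isn't included here; a minimal sketch (the column names are assumptions) that produces the output shown below is:

```sql
-- Expand the array {1,2} into two rows, repeating the literal 'A' on each row
SELECT 'A' AS col1, unnest(ARRAY[1, 2]) AS col2;
```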
Output
A 1
A 2
Amazon Redshift doesn't support the UNNEST function; you can use the following workaround:
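The post's original workaround isn't reproduced here; one possible approach (the inline index values and column names are assumptions) is to pair a small index set with the JSON array functions:

```sql
-- Build a short list of zero-based index values, then pull one JSON array
-- element per index; the WHERE clause stops at the array length
WITH seq AS (
    SELECT 0 AS idx UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
)
SELECT 'A' AS col1,
       json_extract_array_element_text('[1,2]', seq.idx) AS col2
FROM seq
WHERE seq.idx < json_array_length('[1,2]');
```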
STRING_AGG()
The STRING_AGG() function is an aggregate function that concatenates a list of strings and places a separator between them. The function doesn't add the separator at the end of the string. See the following code:
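The original snippet isn't shown here; as a minimal illustration of the behavior (the inline values are assumptions), note that the separator appears only between elements:

```sql
-- Result: 'a,b,c' -- the separator appears between elements, not after the last one
SELECT string_agg(x, ',') AS result
FROM (VALUES ('a'), ('b'), ('c')) AS t(x);
```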
The Greenplum code is as follows:
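A sketch of what the Greenplum query might look like (the emp table and its columns are assumptions):

```sql
-- Concatenate the employee names in each department into one comma-separated string
SELECT dept_id,
       string_agg(emp_name, ',') AS employees
FROM emp
GROUP BY dept_id;
```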
The Amazon Redshift equivalent of the STRING_AGG() function is LISTAGG(). This aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates the values into a single string.
See the following code:
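A possible conversion of the sketch above to Amazon Redshift (same assumed table and columns):

```sql
-- LISTAGG concatenates emp_name values per department, ordered alphabetically
SELECT dept_id,
       listagg(emp_name, ',') WITHIN GROUP (ORDER BY emp_name) AS employees
FROM emp
GROUP BY dept_id;
```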
ANY ARRAY()
The PostgreSQL ANY ARRAY() function evaluates and compares the left-hand expression against each element in the array:
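The original query isn't shown; an illustrative Greenplum example (the table and column names are assumptions) is:

```sql
-- Return rows whose emp_id matches any element of the array
SELECT *
FROM emp
WHERE emp_id = ANY (ARRAY[100, 200, 300]);
```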
In Amazon Redshift, the evaluation can be achieved with an IN operator:
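Using the same assumed table, the equivalent check in Amazon Redshift could be:

```sql
-- The IN list replaces the ANY (ARRAY[...]) comparison
SELECT *
FROM emp
WHERE emp_id IN (100, 200, 300);
```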
Working with date functions
In this section, we discuss calculating the difference between dates using date_part for Greenplum and datediff for Amazon Redshift.
When the application needs to calculate the difference between the subfields of dates for Greenplum, it uses the function date_part, which allows you to retrieve subfields such as year, month, week, and day. In the following example queries, we calculate the number of completion_days by calculating the difference between originated_date and eco_date.
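The original query isn't included; a sketch of what it might look like (the table name is an assumption; the column names come from the text):

```sql
-- Subtract the two timestamps and pull the day subfield out of the resulting interval
SELECT date_part('day', eco_date::timestamp - originated_date::timestamp) AS completion_days
FROM orders;
```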
To calculate the difference between the subfields of the date, Amazon Redshift has the function datediff. The following queries show an example of how to calculate the completion_days as the difference between eco_date and originated_date. DATEDIFF determines the number of date part boundaries that are crossed between the two expressions.
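A possible Amazon Redshift equivalent of the sketch above (same assumed table name):

```sql
-- Count the day boundaries crossed between the two dates
SELECT datediff(day, originated_date, eco_date) AS completion_days
FROM orders;
```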
We compare the Greenplum and Amazon Redshift queries as follows (a sketch of the general pattern appears after this list):
The following Greenplum query returns 1 year between 2009-01-01 and 2009-12-31:
The following Amazon Redshift query returns 1 year between 2009-01-01 and 2009-12-31:
The following Greenplum query returns 1 month between 2009-01-01 and 2008-12-31:
The following Amazon Redshift query returns 1 month between 2009-01-01 and 2008-12-31:
The following Greenplum query returns 0 weeks between 2009-01-01 and 2009-12-31:
The following Amazon Redshift query returns 0 weeks between 2009-01-01 and 2009-12-31:
The following Greenplum query returns 1 day:
The following Amazon Redshift query returns 1 day:
The following Greenplum query returns 1 hour:
The following Amazon Redshift query returns 1 hour:
The following Greenplum query returns 3 minutes:
The following Amazon Redshift query returns 1 minute:
The following Greenplum query returns 40 seconds:
The following Amazon Redshift query returns 45 seconds:
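The original queries aren't reproduced here; as a minimal sketch of the pattern only (the dates are assumptions, shown for the day unit), the two styles look like this:

```sql
-- Greenplum: subtract the timestamps and extract the desired subfield with date_part
SELECT date_part('day', TIMESTAMP '2009-01-02' - TIMESTAMP '2009-01-01') AS days;  -- 1

-- Amazon Redshift: pass the unquoted date part plus the two dates to datediff
SELECT datediff(day, DATE '2009-01-01', DATE '2009-01-02') AS days;                -- 1
```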
Now let's look at how we use Amazon Redshift to calculate days and weeks in seconds.
The following Amazon Redshift query displays 2 days:
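The original statement isn't shown; a sketch that produces the same result (the dates are assumptions):

```sql
-- 172,800 seconds between the two dates divided by 86,400 seconds per day = 2
SELECT datediff(second, DATE '2009-01-01', DATE '2009-01-03') / 86400 AS days;
```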
The following Amazon Redshift query displays 9 weeks:
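Similarly, a sketch that returns 9 (again with assumed dates):

```sql
-- 63 days' worth of seconds divided by 604,800 seconds per week = 9
SELECT datediff(second, DATE '2009-01-01', DATE '2009-03-05') / (86400 * 7) AS weeks;
```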
For Greenplum, the date subfields need to be in single quotes, whereas for Amazon Redshift, we can use date subfields such as year, month, week, day, minute, and second without quotes. For Greenplum, we have to subtract one part from the other part, whereas for Amazon Redshift we can simply separate the two dates with commas.
Extract ISOYEAR from date
ISOYEAR 8601 is a week-numbering year. It begins with the Monday of the week containing the 4th of January. So for dates in early January or late December, the ISO year may be different from the Gregorian year. An ISO year has 52 or 53 full weeks (364 or 371 days). The extra week is called a leap week; a year with such a week is called a leap year.
The following Greenplum query displays the ISOYEAR 2020:
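The original query isn't shown; a sketch using EXTRACT with the isoyear field (assuming it's available in your Greenplum version; the date value is also an assumption, chosen so the ISO year differs from the Gregorian year):

```sql
-- 2019-12-30 falls in ISO week 1 of 2020, so the ISO year is 2020
SELECT extract(isoyear FROM DATE '2019-12-30') AS isoyear;
```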
The following Amazon Redshift query displays the ISOYEAR 2020:
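A possible Amazon Redshift equivalent using TO_CHAR with the IYYY (ISO year) format string (same assumed date):

```sql
-- TO_CHAR with 'IYYY' formats the ISO week-numbering year, returning '2020'
SELECT to_char(DATE '2019-12-30', 'IYYY') AS isoyear;
```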
Function to generate_series()
Greenplum has adopted the PostgreSQL function generate_series(). But the generate_series function works differently with Amazon Redshift while retrieving records from a table, because it's a leader node-only function.
To display a series of numbers in Amazon Redshift, run the following query on the leader node. In this example, it displays 10 rows, numbered 1–10:
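The query isn't reproduced here; a minimal version would be:

```sql
-- Runs only on the leader node and returns the numbers 1 through 10
SELECT generate_series(1, 10);
```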
To display a series of days for a given date, use the following query. It extracts the day from the given date and subtracts 1, to display a series of numbers from 0–6:
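The original query isn't shown; a sketch under the stated logic (the date literal is an assumption whose day component is 7):

```sql
-- extract(day ...) returns 7; subtracting 1 gives 6, so the series is 0 through 6
SELECT generate_series(0, extract(day FROM DATE '2022-01-07')::int - 1);
```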
But for queries that fetch records from a table, join with another table's rows, and process data on the compute nodes, it doesn't work, and generates an Invalid Operation error message. The following code is an example of a SQL statement that works for Greenplum but fails for Amazon Redshift:
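The original failing statement isn't included; an illustrative example (the orders table and its columns are assumptions) would be any query that joins generate_series output with a user table:

```sql
-- Fails in Amazon Redshift: generate_series is leader node-only, but joining it
-- to a user table requires the compute nodes
SELECT o.order_id, s.n
FROM orders o
JOIN (SELECT generate_series(1, 10) AS n) s
  ON s.n <= o.total_items;
```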
For Amazon Redshift, the solution is to create a table to store the series data, and rewrite the code as follows:
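The rewritten code isn't reproduced here; one way to apply the idea (the table and column names are assumptions) is to materialize the series first and then join against it:

```sql
-- Store the series in a regular table so the compute nodes can use it
CREATE TABLE series_numbers (n INT);

INSERT INTO series_numbers
VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);

-- The join now involves only ordinary tables, so it runs on the compute nodes
SELECT o.order_id, s.n
FROM orders o
JOIN series_numbers s ON s.n <= o.total_items;
```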
Working with regular expressions (regex functions)
Amazon Redshift and Greenplum both support three conditions for pattern matching:
- LIKE
- SIMILAR TO
- POSIX operators
In this post, we don't discuss all of these pattern matching methods in detail. Instead, we discuss a few regex functions and regex escape characters that aren't supported by Amazon Redshift.
Regexp_split_to_table function
The regexp_split_to_table function splits a string using a POSIX regular expression pattern as the delimiter.
This function has the following syntax:
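A sketch of the syntax, with a small runnable example (the sample string is an assumption):

```sql
-- Syntax: regexp_split_to_table(string, pattern [, flags])
-- Example: split on one or more whitespace characters
SELECT regexp_split_to_table('hello   world', E'\\s+');
```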
For Greenplum, we use the following query:
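The original query isn't shown; an illustrative version (the string literal and alias are assumptions) could be:

```sql
-- Split a comma-separated list into one row per value
SELECT regexp_split_to_table('red,orange,yellow', ',') AS color;
```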
For Amazon Redshift, the regexp_split_to_table function needs to be converted using the Amazon Redshift split_part function:
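A possible conversion (reusing the assumed series_numbers helper table from the generate_series section; the string literal is also an assumption):

```sql
-- split_part returns the nth comma-separated field; regexp_count + 1 gives the
-- number of fields, so the join produces one row per value
SELECT split_part('red,orange,yellow', ',', s.n) AS color
FROM series_numbers s
WHERE s.n <= regexp_count('red,orange,yellow', ',') + 1;
```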
Another way to convert regexp_split_to_table is as follows:
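The post's second approach isn't reproduced; one alternative sketch (same assumptions as above) is to turn the delimited string into a JSON array and index into it:

```sql
-- Wrap the values in a JSON array string, then extract one element per index
SELECT json_extract_array_element_text(
           '["' || replace('red,orange,yellow', ',', '","') || '"]',
           s.n - 1
       ) AS color
FROM series_numbers s
WHERE s.n <= regexp_count('red,orange,yellow', ',') + 1;
```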
Substring from regex expressions
Substring (the string from the regex pattern) extracts the substring or value matching the pattern that's passed in. If there is no match, null is returned. For more information, refer to Pattern Matching.
We use the following code in Greenplum:
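The original code isn't shown; an illustrative Greenplum query (the input string and pattern are assumptions) that pulls the digits after 'id:' out of a string:

```sql
-- Returns '12345', the first parenthesized subexpression of the match
SELECT substring('order id:12345 shipped' FROM 'id:([0-9]+)') AS order_id;
```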
We can use the regexp_substr function to convert this code to Amazon Redshift. It returns the characters extracted from a string by searching for a regular expression pattern. The syntax is as follows:
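A sketch of the syntax plus a possible equivalent of the Greenplum query above (same assumed string and pattern):

```sql
-- Syntax: REGEXP_SUBSTR(source_string, pattern [, position [, occurrence [, parameters]]])
-- The 'e' parameter returns the subexpression (capture group) rather than the full match
SELECT regexp_substr('order id:12345 shipped', 'id:([0-9]+)', 1, 1, 'e') AS order_id;
```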
Key points while converting regular expression escapes
The PostgreSQL escape character E doesn't work in Amazon Redshift. Additionally, the following Greenplum regular expression constraints aren't supported in Amazon Redshift:
- \m – Matches only at the beginning of a word
- \y – Matches only at the beginning or end of a word
For Amazon Redshift, use \< and \>, or [[:<:]] and [[:>:]], instead.
Use the following code for Greenplum:
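The original snippet isn't shown; an illustrative Greenplum query (the table, column, and search word are assumptions) that matches 'cat' only as a whole word:

```sql
-- The E'...' escape-string syntax keeps the \y word-boundary escapes intact
SELECT *
FROM notes
WHERE note_text ~ E'\\ycat\\y';
```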
Use the following code for Amazon Redshift:
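A possible Amazon Redshift equivalent using the [[:<:]] and [[:>:]] word-boundary expressions (same assumed table and column):

```sql
-- [[:<:]] and [[:>:]] anchor the match to the start and end of a word
SELECT *
FROM notes
WHERE note_text ~ '[[:<:]]cat[[:>:]]';
```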
Or use the following:
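A sketch of the \< and \> alternative (same assumptions; depending on how the cluster treats backslashes in string literals, the backslashes may need to be doubled):

```sql
-- \< and \> also act as word-boundary anchors
SELECT *
FROM notes
WHERE note_text ~ '\<cat\>';
```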
Conclusion
For a heterogeneous database migration from Greenplum to Amazon Redshift, you can use AWS DMS and the AWS SCT to migrate most of the database objects, such as tables, views, stored procedures, and functions.
There are some situations in which one function is used in the source environment and the target environment doesn't support the same function. In this case, manual conversion is required to produce the same result set and complete the database migration.
In some cases, use of a new window function supported by the target environment proves more efficient for analytical queries that process petabytes of data.
This post covered several situations where manual code conversion is required, which can also improve code efficiency and make queries more efficient.
If you have any questions or suggestions, please share your feedback.
About the Authors
Ishwar Adhikary is a Database Consultant at Amazon Web Services (AWS). He works closely with customers to modernize their database and application infrastructures. His focus area is the migration of relational databases from on-premises data centers to the AWS Cloud.
Shrenik Parekh works as a Database Consultant at Amazon Web Services (AWS). He is experienced in database migration assessment and database migration, and in modernizing database environments with purpose-built databases using AWS cloud database services. He is also focused on AWS services for data analytics. In his spare time, he loves hiking, yoga, and other outdoor activities.
Santhosh Meenhallimath is a Data Architect at AWS. He works on building analytical solutions, building data lakes, and migrating databases into AWS.