2 years ago
#68723
vdmx
Understanding Segments in AWS Glue and its usage on Presto/Trino
I read the docs for AWS Glue GetPartitionsAPI and noticed that it has "Segment" parameter:
"Segment": {
"SegmentNumber": number,
"TotalSegments": number
},
I checked the explanations for Segment here and it says:
Defines a non-overlapping region of a table's partitions, allowing multiple requests to be run in parallel.
- SegmentNumber - The zero-based index number of the segment. For example, if the total number of segments is 4, SegmentNumber values range from 0 through 3.
- TotalSegments - The total number of segments. Minimum value of 1. Maximum value of 10.
It looks strikingly similar to Segments in DynamoDB, which is used by DynamoDB's parallel scan feature (As discussed here: Scan vs Parallel Scan in AWS DynamoDB?). However, the difference I noticed was how the segments in Glue is limited to only 10, while it seems unlimited on DynamoDB. Therefore:
1. I would like to know what Glue Segments are, if it's really the same thing and use cases for parallel scans in DynamoDB apply here too (e.g. over 20GB table size, read throughput not fully utilized)
Also, Presto/Trino has a Glue-segment-related parameter on config.properties
as seen on Trino AWS Glue Configuration Properties:
hive.metastore.glue.partitions-segments
- Number of segments for partitioned Glue tables, defaults to 5.
which indicates that segments can be utilized by Presto/Trino. However, when I tried setting the parameter to different numbers, queries keeps using the same number of splits and the speed remains the same (I used same number of nodes every time). Therefore:
2. How Glue Segments fit into Presto/Trino's conceptual model of nodes, tasks and splits?
And in this article about How to use AWS Glue Data Catalog as Metastore for Hive on AWS EMR There is this notice about throttling:
In EMR 5.20.0 or later, parallel partition pruning is enabled automatically for Spark and Hive when is used as the metastore. This change significantly reduces query planning time by executing multiple requests in parallel to retrieve partitions. The total number of segments that can be executed concurrently range between 1 and 10. The default value is 5, which is a recommended setting. You can change it by specifying the property
aws.glue.partition.num.segments
in hive-site configuration classification. If throttling occurs, you can turn off the feature by changing the value to 1.
3. How Glue Segments may cause throttling and in what kind of scenario? (I think this one might become clear as my understanding about Glue Segment begins to form)
amazon-web-services
aws-glue
0 Answers
Your Answer