Apache Iceberg Integration in AWS Glue Data Catalog
Data lakes were first adopted for big data and analytics use cases because they let organizations store large volumes of unstructured, semi-structured, or raw data inexpensively.
Organizations have since realized that data lakes are useful for more than reporting, which has steadily expanded the number of use cases they implement on them.
As a result, data lakes now hold ever-increasing amounts of vital corporate data, which frequently needs to be updated or deleted.
This scenario calls for transactional capabilities on your data lake, so that concurrent writes and reads can proceed without compromising data integrity.
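Table formats such as Apache Iceberg typically provide these transactional guarantees through optimistic concurrency: each writer prepares a complete new snapshot of the table and then atomically swaps the table's current-snapshot pointer, retrying if another writer committed first. The sketch below illustrates that idea only; the `Table` class and its methods are hypothetical and not an actual Iceberg or AWS Glue API.

```python
import threading

class Table:
    """Toy table whose state is an immutable snapshot plus a version number.
    Illustrative only -- not the real Iceberg commit protocol."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the catalog's atomic pointer swap
        self.version = 0
        self.snapshot = ()  # immutable tuple of rows; readers always see a full snapshot

    def commit(self, base_version, new_snapshot):
        """Atomically install new_snapshot, but only if the table is
        still at base_version (i.e. nobody else committed meanwhile)."""
        with self._lock:
            if self.version != base_version:
                return False  # another writer won the race; caller must retry
            self.snapshot = new_snapshot
            self.version += 1
            return True

def append_rows(table, rows):
    """Optimistic write: read the current state, build a new snapshot,
    and retry from scratch if the commit is rejected."""
    while True:
        base = table.version
        new_snapshot = table.snapshot + tuple(rows)
        if table.commit(base, new_snapshot):
            return

table = Table()
append_rows(table, ["a"])
append_rows(table, ["b"])
print(table.version, table.snapshot)
```

Because readers only ever dereference a committed snapshot, they never observe a half-finished write, which is the integrity property the paragraph above describes.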
Open table formats (OTFs) such as Apache Iceberg provide those capabilities, but they come with their own set of challenges, such as managing the large number of small files on Amazon Simple Storage Service (Amazon S3) created as each transaction writes a new file, or managing object and metadata versioning at scale.
To overcome these issues, organizations usually build and operate their own data pipelines, taking on additional undifferentiated infrastructure work.
Conversations with AWS customers have shown us that the hardest part is compacting the many small files created by every transactional write into a limited number of larger files.
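At its core, that compaction step is a bin-packing problem: group many small data files into as few output files as possible, each close to a target size. The following simplified sketch shows only the grouping step; the 128 MiB target and the greedy strategy are illustrative assumptions, not Glue or Iceberg settings, and real compactors also account for partitions, sort order, and delete files.

```python
TARGET_SIZE = 128 * 1024 * 1024  # illustrative target output file size (128 MiB)

def plan_compaction(file_sizes, target=TARGET_SIZE):
    """Greedily bin small files (sizes in bytes) into groups whose total
    size approaches `target`; each group would be rewritten as one larger
    file. Simplified sketch of the idea, not a production algorithm."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):  # place big files first
        if current and current_size + size > target:
            groups.append(current)           # group is full: start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 1,000 one-mebibyte files collapse into a handful of ~128 MiB rewrite groups
plan = plan_compaction([1024 * 1024] * 1000)
print(len(plan))  # 8 groups
```

Turning a thousand S3 objects into eight drastically cuts the per-file open and list overhead that slows down queries, which is why managed compaction is valuable.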