<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Dozer | Start building real-time data apps in minutes Blog</title>
        <link>https://getdozer.io/blog/</link>
        <description>Dozer | Start building real-time data apps in minutes Blog</description>
        <lastBuildDate>Thu, 16 Nov 2023 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[Retrieval Augmented Generation (RAG) workflow with Dozer]]></title>
            <link>https://getdozer.io/blog/RAG%20with%20dozer</link>
            <guid>https://getdozer.io/blog/RAG%20with%20dozer</guid>
            <pubDate>Thu, 16 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog explores how the challenges of siloed and real-time data, once formidable barriers, yield to RAG's ingenuity. Through Dozer, RAG unlocks the secrets of siloed data, while event stream processing and real-time data pipelines ensure that the LLM remains abreast of the ever-changing world.
]]></description>
            <content:encoded><![CDATA[<p>Large Language Models (LLMs) are a type of AI that is trained on a massive amount of text data. This allows LLMs to generate text, translate languages, write content and answer questions in an informative way. However, LLMs are not perfect. They often make mistakes and produce text that is not coherent or relevant to the topic at hand. LLMs can sometimes generate inaccurate or misleading information, even if it sounds plausible. This is because they learn from statistical patterns in the data, which may not always correspond to reality. This issue can be particularly problematic in applications where factual accuracy is crucial.</p><p>One particular challenge lies in the issue of hallucinations, where LLMs produce outputs that are factually inaccurate or misleading. This phenomenon, often stemming from outdated training data, can have significant implications for the reliability and trustworthiness of LLMs.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="issues-with-llms">Issues with LLMs<a href="#issues-with-llms" class="hash-link" aria-label="Direct link to Issues with LLMs" title="Direct link to Issues with LLMs">​</a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="hallucinations-problem">Hallucinations Problem:<a href="#hallucinations-problem" class="hash-link" aria-label="Direct link to Hallucinations Problem:" title="Direct link to Hallucinations Problem:">​</a></h3><p>Hallucinations in LLMs occur when the model's predictions deviate from reality, generating text that is inconsistent with the input or the broader context. This can manifest in various forms, such as fabricating facts, expressing outdated opinions, or drawing erroneous conclusions from data. The underlying cause of these hallucinations can be traced back to the training data upon which LLMs are built.</p><p>LLMs are trained on massive amounts of text and code, encompassing a vast repository of human knowledge. 
However, this data is not without its biases and imperfections. It may reflect societal prejudices, contain outdated information, or simply lack the nuance and context required for accurate understanding. When LLMs are trained on such flawed data, they may inherit these biases and imperfections, leading to hallucinations in their outputs.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="out-of-date-training-data">Out-of-Date Training Data:<a href="#out-of-date-training-data" class="hash-link" aria-label="Direct link to Out-of-Date Training Data:" title="Direct link to Out-of-Date Training Data:">​</a></h3><p>The presence of outdated data in training sets further exacerbates the hallucination problem. As technology and society evolve, information becomes obsolete, and LLMs trained on such data may struggle to keep pace with the changing world. This can lead to the generation of factually incorrect information or outdated opinions, undermining the credibility of LLMs and limiting their usefulness in real-world applications.</p><p>Retrieval Augmented Generation (RAG) is a promising approach to address the issues of hallucination and out-of-date training data in large language models (LLMs). RAG combines the strengths of LLMs with those of retrieval-based systems to generate more accurate and reliable text.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-rag-works">How RAG Works<a href="#how-rag-works" class="hash-link" aria-label="Direct link to How RAG Works" title="Direct link to How RAG Works">​</a></h2><p>RAG works by first retrieving relevant passages from an external knowledge source, such as a search engine or a document database. These passages are then used to provide context and anchor the LLM's generation process. 
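</p><p>To make the retrieve-then-generate loop concrete, here is a minimal Python sketch. The passages, the bag-of-words "embedding", and the prompt template are invented stand-ins; a real system would use a learned embedding model and an actual LLM client, but the overall shape follows the description above.</p>

```python
import math
import re
from collections import Counter

# Toy "knowledge source" standing in for a search engine or document
# database; the passages are invented for illustration.
PASSAGES = [
    "Dozer ingests data from databases and streams in real time.",
    "RAG retrieves relevant passages and adds them to the LLM prompt.",
    "Large language models are trained on a static snapshot of text.",
]

def embed(text: str) -> Counter:
    # Stand-in embedding: a bag-of-words vector. A production system
    # would use a learned embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank passages by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(PASSAGES, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    # The retrieved passages anchor the model's generation.
    context = "\n".join(f"- {p}" for p in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How does RAG reduce hallucinations in an LLM?"))
```

<p>Swapping <code>embed</code> for a real embedding model and the final <code>print</code> for an LLM call turns this into the full RAG loop.</p><p>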
This helps to ensure that the generated text is consistent with the retrieved information and less likely to contain hallucinations.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="librarian-and-the-writer-analogy">Librarian and the Writer Analogy<a href="#librarian-and-the-writer-analogy" class="hash-link" aria-label="Direct link to Librarian and the Writer Analogy" title="Direct link to Librarian and the Writer Analogy">​</a></h3><p>Imagine a large language model (LLM) as a highly skilled writer, but with a limited knowledge base. While this writer can craft compelling narratives and compose insightful essays, they lack access to the vast expanse of information available in the world. That's where RAG comes in, acting as the writer's resourceful assistant.</p><p>RAG functions like a diligent librarian, scouring external data stores to gather relevant information tailored to the user's request. This context-rich information, whether it's real-time updates, user-specific details, or even factual data that hasn't made it into the LLM's training set, is then seamlessly integrated into the writer's prompt.</p><p>With this enhanced prompt, the LLM is empowered to produce even more informative and personalized responses, akin to a writer armed with a wealth of background research. 
RAG, in essence, bridges the gap between the LLM's knowledge and the vast sea of information, elevating its capabilities to new heights.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="comparing-rag-with-fine-tuning-llms">Comparing RAG with Fine-tuning LLMs<a href="#comparing-rag-with-fine-tuning-llms" class="hash-link" aria-label="Direct link to Comparing RAG with Fine-tuning LLMs" title="Direct link to Comparing RAG with Fine-tuning LLMs">​</a></h3><p>Fine-tuning takes a pre-trained LLM and continues training it on a smaller, task-specific dataset that was not part of the original training data, improving performance on the relevant task.</p><p><img loading="lazy" alt="finetunevsrag" src="/blog/assets/images/finetunevrag-985d64e8c1166efebd4170920e7b3170.png" width="821" height="515" class="img_ev3q">  </p><p>RAG is particularly well-suited for scenarios where you can enrich your LLM prompt with information that was not available during its training phase. This includes real-time data, personal or user-specific data, and contextual information relevant to the prompt. By incorporating such external knowledge, RAG enables LLMs to generate more accurate, relevant, and personalized responses.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="challenges-with-rag">Challenges with RAG<a href="#challenges-with-rag" class="hash-link" aria-label="Direct link to Challenges with RAG" title="Direct link to Challenges with RAG">​</a></h3><p>When working with data that is siloed or in real-time, implementing RAG can present significant challenges. Siloed data refers to information that is isolated or segregated within specific systems or databases, making it difficult to access and integrate with other data sources. 
Real-time data, on the other hand, is constantly changing and requires immediate processing to maintain relevance.</p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="siloed-data">Siloed Data<a href="#siloed-data" class="hash-link" aria-label="Direct link to Siloed Data" title="Direct link to Siloed Data">​</a></h4><p>Retrieval-Augmented Generation (RAG) relies on accessing and retrieving relevant information from external data sources to enhance the capabilities of large language models (LLMs). However, when the data is siloed or isolated within specific systems or databases, it becomes difficult for RAG to effectively utilize this information. This poses significant challenges for the implementation of RAG, as it hinders the LLM's ability to generate comprehensive and informative responses.</p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="real-time-data">Real-time Data:<a href="#real-time-data" class="hash-link" aria-label="Direct link to Real-time Data:" title="Direct link to Real-time Data:">​</a></h4><p>Real-time data, which is constantly changing and requires immediate processing to maintain relevance, presents another set of challenges for RAG. The LLM needs to be able to access and process real-time data streams with minimal latency to ensure that the generated text is always relevant and up-to-date. This can be challenging due to the high volume and dynamic nature of real-time data.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-dozer-can-help-with-both-the-challenges">How Dozer can help with both the Challenges<a href="#how-dozer-can-help-with-both-the-challenges" class="hash-link" aria-label="Direct link to How Dozer can help with both the Challenges" title="Direct link to How Dozer can help with both the Challenges">​</a></h2><p>Dozer is a powerful Data Access backend that simplifies the process of building and deploying data-driven applications. 
It provides a unified interface to access and process data from multiple sources, including databases, APIs, and streaming platforms. This enables developers to build applications that leverage real-time data without having to worry about the underlying infrastructure.</p><p>This means that you can use Dozer to ingest data from any source, such as a database in real-time, which allows you to mitigate the issues related to siloed data and real-time data.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="here-are-few-ways-dozer-helps-with-rag">Here are a few ways Dozer helps with RAG<a href="#here-are-few-ways-dozer-helps-with-rag" class="hash-link" aria-label="Direct link to Here are a few ways Dozer helps with RAG" title="Direct link to Here are a few ways Dozer helps with RAG">​</a></h3><p><strong>Real-time data ingestion:</strong> Dozer can be used to ingest real-time data from a variety of sources, such as social media feeds, customer interactions, and sensor data. This data can then be used to provide RAG models with the most up-to-date information.</p><p><strong>Data transformation:</strong> Dozer's streaming SQL engine can be used to transform and process data in real time. This can be used to clean and prepare data for use by RAG models, as well as to extract features that are relevant to the task at hand.</p><p><strong>Contextual information:</strong> Dozer can be used to store and manage contextual information, such as user profiles and knowledge graphs. This information can then be used to provide RAG models with a richer understanding of the context of the task at hand.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="the-resourceful-assistant">The resourceful assistant<a href="#the-resourceful-assistant" class="hash-link" aria-label="Direct link to The resourceful assistant" title="Direct link to The resourceful assistant">​</a></h3><p>So continuing the story of the writer and the librarian, Dozer is the resourceful assistant. 
It steps into the scene, armed with its vast knowledge of data sources and its ability to seamlessly integrate external information. It acts as a bridge between the writer (LLM) and the vast library of information (external data sources), enabling the writer to access and utilize a broader range of knowledge.</p><p>Just as the librarian guides the writer to relevant books and articles, Dozer guides the LLM to the most pertinent data sources, providing it with the context and information needed to craft even more informative and personalized responses.</p><p>Dozer's role extends beyond mere retrieval; it also helps the writer process and transform the retrieved information, ensuring that it is in a format that can be readily incorporated into the text generation process. This collaboration between the writer, the librarian, and the assistant elevates the quality of the generated text, making it more comprehensive, accurate, and tailored to the user's needs.</p><p>With Dozer on board, the writer can confidently venture into unexplored territories of knowledge, knowing that its resourceful assistant will always be there to provide the necessary support and guidance. Together, they form an unstoppable team, capable of producing text that is not only informative but also insightful, engaging, and truly remarkable.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion">​</a></h2><p>In conclusion, the combination of large language models (LLMs) and retrieval-augmented generation (RAG) has the potential to revolutionize the way we interact with computers. By providing LLMs with access to real-time data, personal data, and contextual information, RAG enables LLMs to generate more accurate, relevant, and personalized responses. Dozer is a data infrastructure platform that can be used to build and deploy RAG applications. 
It provides a number of features that can be helpful for RAG development, such as a streaming SQL engine for real-time data transformation, support for a variety of data sources, a distributed architecture that can scale to handle large data volumes, and a variety of security and compliance features.</p><p>The future of LLM-powered applications is bright, and RAG is playing a key role in this evolution. With Dozer, developers can easily build and deploy RAG applications that can take advantage of the latest advances in LLM technology. </p><p>In the upcoming articles, we will explore how to build RAG applications using Dozer and OpenAI's assistant. Stay tuned!</p><p>For more information and examples, check out the <a href="https://github.com/getdozer/dozer" target="_blank" rel="noopener noreferrer">Dozer GitHub repository</a>.</p><p>Stay tuned for more updates and exciting use cases of Dozer and OpenAI assistant.</p>]]></content:encoded>
            <category>llm</category>
            <category>gpt</category>
            <category>assistants</category>
            <category>real-time</category>
            <category>rag</category>
            <category>openai</category>
        </item>
        <item>
            <title><![CDATA[Hyper-personalized chatbots using LLMs, Dozer and Vector Databases]]></title>
            <link>https://getdozer.io/blog/llm-chatbot</link>
            <guid>https://getdozer.io/blog/llm-chatbot</guid>
            <pubDate>Tue, 13 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore how Dozer boosts LLM chatbots by enabling dynamic interactions and personalized experiences in the banking sector. Learn how text embeddings and customer profiles enrich the chatbot's knowledge, overcoming limitations and delivering tailored responses to users.]]></description>
            <content:encoded><![CDATA[<p>In the realm of large language model (LLM) applications, leveraging the power of artificial intelligence and natural language processing has become increasingly prevalent. One such application is the creation of chatbots that conversationally interact with users. In this article, we explore how Dozer can significantly enhance the capabilities of LLM-based applications. Using a chatbot for a bank as our illustrative use case, we delve into the challenges faced in contextualizing the chatbot's knowledge and how Dozer can provide a solution by enriching the customer profile.</p><p><img loading="lazy" alt="Dozer and LLM" src="/blog/assets/images/llm-7a8b4c7dd35d9fd60a564a331384072d.svg" width="1345" height="432" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="requirements">Requirements<a href="#requirements" class="hash-link" aria-label="Direct link to Requirements" title="Direct link to Requirements">​</a></h2><p>To deliver a personalized and effective user experience, our LLM chatbot needs to possess knowledge in two key areas:</p><ol><li>Understanding the bank's products and services</li><li>Having a comprehensive customer profile</li></ol><p>Traditionally, the primary method of providing information to an LLM is through the use of context. However, there are inherent limitations to this approach, particularly concerning the limited space available for context. LLM models typically have constraints on context size, often limited to a few thousand tokens, although newer models like Anthropic's Claude have expanded this to support contexts of up to 100,000 tokens. Despite these advancements, incorporating extensive knowledge, such as comprehensive bank product information or detailed customer profiles, within the context remains challenging.</p><p>To overcome this limitation, a dynamic approach to context population is necessary. 
Rather than relying solely on a fixed context, the context can be dynamically populated based on the user prompt. This allows for the inclusion of specific and relevant information related to the user's query, enabling the LLM chatbot to provide more accurate and tailored responses. By dynamically adjusting the context, the chatbot can access the necessary knowledge and adapt its understanding to better address the user's needs.</p><p>A practical approach is to leverage text embeddings and a vector database. This approach involves the creation of a vector database that stores the text embeddings of the bank's product and service information. Text embeddings represent the semantic meaning of the text and can capture the relationships and similarities between different pieces of information.</p><p>In the context of our bank's chatbot, this approach can for example be applied when a user inquires about credit card products. Here's how it works:</p><ol><li><strong>Vector database population:</strong> The bank's credit card information, including details such as card types, benefits, requirements, and features, is transformed into text embeddings. These embeddings capture the essential characteristics of the credit card descriptions and specifications, representing them as numerical vectors.</li><li><strong>User query processing:</strong> When a user interacts with the chatbot and asks a question about credit card options, the chatbot processes the query and extracts relevant keywords, such as "credit card," "options," or specific card types.</li><li><strong>Similarity search:</strong> The extracted keywords and contextual information are used to perform a similarity search within the vector database. The search aims to find the text embeddings that are most similar to the user's query, focusing on credit card-related information. 
By measuring the similarity between the user query and the stored text embeddings, the chatbot identifies the most relevant credit card details.</li><li><strong>Context population:</strong> The retrieved credit card knowledge, aligned with the user's query, is dynamically populated into the chatbot's context. This means that the relevant credit card information becomes part of the context considered by the chatbot when generating responses.</li></ol><p>By leveraging text embeddings and a vector database, our bank's chatbot can efficiently retrieve and utilize the most relevant credit card information based on the user's query.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="but-there-is-a-problem"><strong>But, there is a problem!</strong><a href="#but-there-is-a-problem" class="hash-link" aria-label="Direct link to but-there-is-a-problem" title="Direct link to but-there-is-a-problem">​</a></h2><p>However, even with the integration of the vector database to enhance the LLM chatbot's knowledge of the bank's products and services, there is an essential missing component - the customer profile. The chatbot lacks knowledge about individual customers and their specific details, which is crucial for providing relevant and personalized responses. For instance, when discussing credit card eligibility criteria, factors such as the customer's annual income play a significant role. Without access to detailed customer information, the chatbot may struggle to provide accurate and contextually appropriate responses. To truly create personalized experiences, it is essential to incorporate the customer's profile, including their financial history, subscribed products, investments, and other relevant data, into the chatbot's context. This way, the chatbot can deliver tailored information and meet the specific needs of each customer, enhancing their overall experience.</p><p>This is where Dozer plays a vital role. 
With its data integration capabilities, Dozer can seamlessly gather and consolidate customer data from various sources, such as core banking systems, CRM platforms, and transaction databases. By connecting to these sources and capturing real-time data updates, Dozer ensures that the customer profile remains accurate and up to date.</p><p>By leveraging a comprehensive customer profile, the LLM chatbot can access the necessary information to tailor its responses to the specific customer. Whether the customer is inquiring about credit card options, loan eligibility, or account details, the chatbot can draw from the enriched customer profile to provide relevant and personalized answers.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="putting-it-all-together">Putting it all together<a href="#putting-it-all-together" class="hash-link" aria-label="Direct link to Putting it all together" title="Direct link to Putting it all together">​</a></h2><p>The diagram above represents the architecture of an intelligent chatbot system that leverages Dozer, a vector database for storing the bank's knowledge base, and an LLM app powered by langchain for conversational interactions with users. At the core of the architecture is Dozer, which aggregates customer data from multiple source systems, ensuring a comprehensive and up-to-date customer profile. The vector database serves as the repository for the bank's general knowledge, encompassing products, services, policies, and more. The LLM app, integrated with langchain, acts as the intelligent conversational interface, leveraging the enriched customer profile from Dozer and the knowledge base from the vector database. 
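</p><p>As a rough illustration of how these pieces could meet in code (the names and figures below are hypothetical; a production build would use langchain, a vector store, and Dozer's APIs rather than in-memory dictionaries):</p>

```python
# Hypothetical customer profile, as Dozer might serve it after
# consolidating the source systems.
CUSTOMER = {"name": "John Smith", "annual_income": 55_000}

# Hypothetical product knowledge base, as the vector database might
# return it for a credit-card query.
CARDS = [
    {"name": "EveryDay VISA Card", "min_income": 30_000, "annual_fee": 35},
    {"name": "Vantage VISA Card", "min_income": 60_000, "annual_fee": 215},
    {"name": "Super VISA Card", "min_income": 50_000, "annual_fee": 70},
]

def eligible_cards(profile: dict) -> list[dict]:
    # The income check the chatbot is expected to apply implicitly.
    return [c for c in CARDS if profile["annual_income"] >= c["min_income"]]

def build_context(profile: dict) -> str:
    # Context block handed to the LLM alongside the user's question.
    lines = [f"Customer: {profile['name']}, annual income ${profile['annual_income']:,}"]
    lines += [f"- {c['name']} (annual fee ${c['annual_fee']})"
              for c in eligible_cards(profile)]
    return "\n".join(lines)

print(build_context(CUSTOMER))
```

<p>Here the profile is hard-coded; in the architecture described, Dozer would keep it fresh from the source systems in real time.</p><p>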
Together, these components enable the chatbot to deliver personalized and accurate responses, providing users with a seamless and engaging banking experience.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="lets-give-it-a-try-with-chatgpt">Let's give it a try with ChatGPT<a href="#lets-give-it-a-try-with-chatgpt" class="hash-link" aria-label="Direct link to Let's give it a try with ChatGPT" title="Direct link to Let's give it a try with ChatGPT">​</a></h2><p>To validate our assumptions, we have provided ChatGPT a list of credit card options and a comprehensive customer profile:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">- EveryDay VISA Card: </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: 5% cashback on grocery shopping, 8% cashback on Esso, Shell, Chevron, 0.3% cashback on all other expenses</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $30,000 minimum annual income</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Annual fee: 35$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- LiveFresh VISA Card: </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: Up to 5% cashback on Online &amp; Visa contactless spend, Additional 5% Green Cashback on selected Eateries, Retailers and Transport Services, 0.3% Cashback on All Other Spend</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $30,000 minimum annual income</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    Annual fee: 35$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Miles&amp;More VISA Card: </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: 10 miles per dollar on hotel transactions at Kaligo, 6 miles on flight, hotel and travel packages at Expedia, 3 miles on online flights &amp; hotel transactions (capped at S$5,000 per month), 2 miles per $ on overseas spend, 1.2 miles per $ on local spend, Receive 10,000 bonus miles when you pay for your annual fee</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $30,000 minimum annual income</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Annual fee: 215$ + free second card</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Vantage VISA Card: </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: Earn 1.5 miles per $ in local spend, Earn 2.2 miles per $ in foreign spend, Earn up to 6 miles per S$1 on Expedia bookings,  Up to 19% off on fuel spending at Esso</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $60,000 minimum annual income</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Annual fee: 215$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Super VISA Card:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: 10X Points (4 miles per $1) on online purchases, 3X Points (1.2 miles per $1) on overseas purchases, 1X  Point (0.4 miles per $1) on other purchases</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $50,000 minimum annual 
income</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Annual fee: 70$</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">First name: John</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Last name: Smith</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">annual income: 55,000$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">address: 33 Tampines Street 86, Singapore 528573</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Phone: 7763 6678</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Owned products:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> - checking account - balance: $32,122</span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain"> - debit card - outstanding balance: $1,100</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">last month spending pattern:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  Local spend:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - fuel: 3,122$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - groceries: $7,233</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - travel: $12,122</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - food &amp; beverages: $1455</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - transportation: $345</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  Foreign spend:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - accommodation: $3433</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - food and beverages: $1344</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" 
d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>We then asked ChatGPT to start acting like a virtual bank teller. We will not provide a full transcript of the conversation, but only a few interesting parts.</p><p>In the conversation below, the chatbot has effectively utilized the customer's profile and annual income to recommend suitable financial products. It analyzed the user's annual income, which is $55,000, and made recommendations based on this information. Notably, the chatbot did not propose credit cards that required a higher annual income, such as the Vantage VISA Card, which has an income requirement of $60,000.</p><p><img loading="lazy" alt="Dozer and LLM" src="/blog/assets/images/chatgpt1-fe0dd3b558109108e07aa23ad71e6223.png" width="1400" height="1634" class="img_ev3q"></p><p>Also, when asked to estimate potential rewards, the chatbot used the customer's specific spending habits, applying different mile earning rates per spending category, to provide a personalized and detailed estimate of the possible miles to be earned with the Miles&amp;More VISA Card.</p><p><img loading="lazy" alt="Dozer and LLM" src="/blog/assets/images/chatgpt2-83d42155cf88122a64e407c0e1d12d18.png" width="1400" height="1162" class="img_ev3q"></p><p>In the forthcoming post, we will provide a comprehensive example demonstrating how to implement such a system utilizing Langchain, Dozer, and a vector database.</p>]]></content:encoded>
            <category>LLM</category>
            <category>chatbot</category>
            <category>banking</category>
        </item>
        <item>
            <title><![CDATA[Shapes and Forms of Structured Data: SCD Types, Master Full, Master Incremental, Unitemporal, and Bitemporal]]></title>
            <link>https://getdozer.io/blog/data-shapes</link>
            <guid>https://getdozer.io/blog/data-shapes</guid>
            <pubDate>Tue, 30 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Structured data forms the foundation of many data-driven systems and is crucial for effective data analysis and decision-making. Within the realm of structured data, there are different shapes and forms that enable organizations to manage and utilize data in diverse ways. In this blog post, we will explore several important concepts related to structured data, including SCD (Slowly Changing Dimensions) types, Master Full, Master Incremental, Unitemporal, and Bitemporal data.
]]></description>
            <content:encoded><![CDATA[<p>Structured data forms the foundation of many data-driven systems and is crucial for effective data analysis and decision-making. Within the realm of structured data, there are different shapes and forms that enable organizations to manage and utilize data in diverse ways. In this blog post, we will explore several important concepts related to structured data, including SCD (Slowly Changing Dimensions) types, Master Full, Master Incremental, Unitemporal, and Bitemporal data.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="slowly-changing-dimensions-scd">Slowly Changing Dimensions (SCD)<a href="#slowly-changing-dimensions-scd" class="hash-link" aria-label="Direct link to Slowly Changing Dimensions (SCD)" title="Direct link to Slowly Changing Dimensions (SCD)">​</a></h2><p>Slowly Changing Dimensions refer to the nature of data that evolves over time in a data warehousing context. SCDs capture changes in dimensional attributes, such as customer addresses, product specifications, or employee roles, while maintaining historical records. There are different SCD types to manage these changes:</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="scd-type-1">SCD Type 1<a href="#scd-type-1" class="hash-link" aria-label="Direct link to SCD Type 1" title="Direct link to SCD Type 1">​</a></h3><p>This type overwrites existing data with updated information, effectively losing historical details. It is suitable when historical data is not critical, such as in cases where only the most recent values are needed.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="scd-type-2">SCD Type 2<a href="#scd-type-2" class="hash-link" aria-label="Direct link to SCD Type 2" title="Direct link to SCD Type 2">​</a></h3><p>Type 2 preserves historical data by creating new records for each change, typically through the addition of an effective date and expiration date. 
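</p><p>As a minimal sketch of the mechanics (using Python's built-in sqlite3 module; the table and column names are illustrative, not taken from any particular warehouse), a Type 2 change first closes out the current row and then inserts a new current row:</p>

```python
import sqlite3

# Illustrative customer dimension; a NULL effective_end marks the current row.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id     INTEGER,
        name            TEXT,
        address         TEXT,
        effective_start TEXT,
        effective_end   TEXT
    )""")
conn.execute("INSERT INTO dim_customer VALUES "
             "(1, 'John Smith', '123 Main Street', '2021-01-01', NULL)")

def scd2_change_address(conn, customer_id, new_address, change_date):
    # Step 1: close the current record instead of overwriting it.
    conn.execute(
        "UPDATE dim_customer SET effective_end = ? "
        "WHERE customer_id = ? AND effective_end IS NULL",
        (change_date, customer_id))
    # Step 2: insert a fresh current record carrying the new attribute value,
    # copying the unchanged attributes from the most recent row.
    conn.execute(
        "INSERT INTO dim_customer "
        "SELECT customer_id, name, ?, ?, NULL FROM dim_customer "
        "WHERE customer_id = ? ORDER BY effective_start DESC LIMIT 1",
        (new_address, change_date, customer_id))

scd2_change_address(conn, 1, '456 Oak Avenue', '2022-03-01')
history = conn.execute(
    "SELECT address, effective_start, effective_end "
    "FROM dim_customer ORDER BY effective_start").fetchall()
print(history)  # both the historical and the current row survive
```

<p>Whether the closed-out record ends on the change date or the day before is a per-system convention; the essential point is that both rows survive, so the full history remains queryable.</p><p>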
This allows for accurate tracking of historical changes and provides a complete audit trail.</p><p>Let's consider an example of Slowly Changing Dimensions (SCD) Type 2 in the context of a customer database. Suppose we have a table that stores customer information, including their name, address, and membership status. Initially, a customer named John Smith signs up with the address "123 Main Street" and is assigned a membership status of "Standard."</p><p>Later, John moves to a new address, "456 Oak Avenue." Instead of updating the existing record, SCD Type 2 creates a new record with the updated address and assigns it a new effective date, indicating when the change took place. The previous record for John Smith remains in the database, capturing the historical address and its associated time frame. The updated record reflects John's current address and effective date.</p><p>By utilizing SCD Type 2, the customer database can maintain a complete audit trail of changes. It allows for tracking John's address history, which can be useful for analyzing customer behavior, understanding migration patterns, or generating accurate reports based on specific time periods.</p><table><thead><tr><th>Customer ID</th><th>Name</th><th>Address</th><th>Membership Status</th><th>Effective Start Date</th><th>Effective End Date</th></tr></thead><tbody><tr><td>1</td><td>John Smith</td><td>123 Main Street</td><td>Standard</td><td>2021-01-01</td><td>2022-02-28</td></tr><tr><td>1</td><td>John Smith</td><td>456 Oak Avenue</td><td>Standard</td><td>2022-03-01</td><td>(Current Record)</td></tr></tbody></table><h3 class="anchor anchorWithStickyNavbar_LWe7" id="scd-type-3">SCD Type 3<a href="#scd-type-3" class="hash-link" aria-label="Direct link to SCD Type 3" title="Direct link to SCD Type 3">​</a></h3><p>Type 3 retains limited historical information by adding additional columns to capture a limited number of changes. 
It sacrifices full historical tracking but can be useful in cases where only recent changes need to be analyzed.</p><p>Let's continue with the same customer database example, but this time, let's explore how Slowly Changing Dimensions (SCD) Type 3 would handle changes in customer information.</p><p>Initially, John Smith signs up with the address "123 Main Street" and is assigned a membership status of "Standard." Instead of creating a new record for each change, SCD Type 3 adds additional columns to capture limited historical information.</p><p>When John moves to a new address, "456 Oak Avenue," the existing record is updated with the new address, but the previous address is retained in a separate column designated for the previous value. In addition, a separate column captures the effective date of the change.</p><p>So, in the SCD Type 3 example, the customer record for John Smith would contain the following information: his current address, the previous address, and the date of the address change.</p><p>By utilizing SCD Type 3, the customer database retains some historical information while still allowing for efficient storage and retrieval. 
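</p><p>As a hedged sketch (again with Python's sqlite3 and illustrative names), a Type 3 change is a plain in-place UPDATE that shifts the current value into the dedicated "previous" column:</p>

```python
import sqlite3

# Illustrative Type 3 layout: one row per customer, with dedicated columns
# for the single previous value and the date it was superseded.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id         INTEGER PRIMARY KEY,
        name                TEXT,
        current_address     TEXT,
        previous_address    TEXT,
        address_change_date TEXT
    )""")
conn.execute("INSERT INTO dim_customer VALUES "
             "(1, 'John Smith', '123 Main Street', NULL, NULL)")

# The right-hand side of each SET clause sees the pre-update values, so
# previous_address captures the old current_address before it is replaced.
conn.execute("""
    UPDATE dim_customer
    SET previous_address    = current_address,
        current_address     = '456 Oak Avenue',
        address_change_date = '2022-03-01'
    WHERE customer_id = 1""")

row = conn.execute(
    "SELECT current_address, previous_address, address_change_date "
    "FROM dim_customer WHERE customer_id = 1").fetchone()
print(row)  # ('456 Oak Avenue', '123 Main Street', '2022-03-01')
```

<p>Note that a second move would overwrite "123 Main Street" entirely: Type 3 keeps exactly one prior value per tracked column.</p><p>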
This approach is useful when limited historical tracking is required, and only a few key changes need to be captured and analyzed.</p><table><thead><tr><th>Customer ID</th><th>Name</th><th>Current Address</th><th>Previous Address</th><th>Address Change Date</th><th>Membership Status</th></tr></thead><tbody><tr><td>1</td><td>John Smith</td><td>456 Oak Avenue</td><td>123 Main Street</td><td>2022-03-01</td><td>Standard</td></tr></tbody></table><h2 class="anchor anchorWithStickyNavbar_LWe7" id="master-full-and-master-incremental">Master Full and Master Incremental<a href="#master-full-and-master-incremental" class="hash-link" aria-label="Direct link to Master Full and Master Incremental" title="Direct link to Master Full and Master Incremental">​</a></h2><p>Master Full is a form of structured data management where a centralized "master" dataset contains all relevant information about a specific entity. In this approach, updates to the master dataset are performed in bulk, typically by replacing the entire dataset with a fresh copy. Master Full is suitable when the dataset is relatively small or when updates occur infrequently. It ensures consistency across systems that rely on the master data but may not be ideal for real-time or frequent updates.</p><p>Master Incremental is another form of structured data management where changes to the master dataset are made incrementally, without replacing the entire dataset. Instead, only the modified or new records are updated. This approach is efficient when dealing with large datasets or when frequent updates occur. 
Master Incremental allows for faster processing times and minimizes the need to process unchanged data, but it requires careful tracking of changes and synchronization between systems.</p><p>Let's illustrate the concepts of Master Full and Master Incremental using an example of a product catalog management system.</p><p>In the Master Full approach, the entire product catalog is replaced with a fresh copy when updates are made. The "Status" column represents the active state of each product. When a product is no longer available or needs to be deleted, the entire record is removed from the dataset. However, this approach does not retain any historical information about deleted products.</p><table><thead><tr><th>Product ID</th><th>Name</th><th>Description</th><th>Price</th><th>Effective Date</th><th>Status</th></tr></thead><tbody><tr><td>1</td><td>Product A</td><td>Description of Product A</td><td>$10.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>2</td><td>Product B</td><td>Description of Product B</td><td>$15.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>3</td><td>Product C</td><td>Description of Product C</td><td>$8.99</td><td>2021-01-01</td><td>Active</td></tr></tbody></table><p>In the Master Incremental approach, updates are made by modifying only the relevant records. New products or modified details are added as new records with their respective effective dates. For representing deletes, a new record is added with an updated status indicating that the product is now inactive or deleted. In the example below, Product B's status is changed to "Inactive" with an effective date of June 15, 2022, indicating that it is no longer available.</p><p>By implementing the Master Incremental approach, the product catalog management system maintains a history of changes, including deletes, by adding new records rather than removing existing ones. 
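</p><p>The replay logic for such a change log can be sketched in a few lines of Python (the record layout mirrors the product catalog example in this post; the field names are illustrative):</p>

```python
# Append-only change log: each change, including the "delete" of Product B,
# is a new record rather than an in-place mutation.
changes = [
    {"product_id": 1, "name": "Product A", "price": 10.99, "effective": "2021-01-01", "status": "Active"},
    {"product_id": 2, "name": "Product B", "price": 15.99, "effective": "2021-01-01", "status": "Active"},
    {"product_id": 3, "name": "Product C", "price": 8.99,  "effective": "2021-01-01", "status": "Active"},
    {"product_id": 4, "name": "Product D", "price": 12.99, "effective": "2022-05-10", "status": "Active"},
    {"product_id": 2, "effective": "2022-06-15", "status": "Inactive"},  # delete marker
]

def current_product_ids(changes):
    """Replay the log in effective-date order, keep the latest record per
    product, and drop products whose latest status is not Active."""
    latest = {}
    for rec in sorted(changes, key=lambda r: r["effective"]):
        latest[rec["product_id"]] = rec
    return sorted(pid for pid, rec in latest.items() if rec["status"] == "Active")

print(current_product_ids(changes))  # Product B no longer appears: [1, 3, 4]
```

<p>The same replay, run with an earlier cut-off date, reconstructs the catalog as it looked at any point in the past, which a full-replace Master Full feed cannot do.</p><p>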
This allows for accurate tracking and analysis of product data over time while efficiently managing updates and deletions.</p><table><thead><tr><th>Product ID</th><th>Name</th><th>Description</th><th>Price</th><th>Effective Date</th><th>Status</th></tr></thead><tbody><tr><td>1</td><td>Product A</td><td>Description of Product A</td><td>$10.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>2</td><td>Product B</td><td>Description of Product B</td><td>$15.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>3</td><td>Product C</td><td>Description of Product C</td><td>$8.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>4</td><td>Product D</td><td>Description of Product D</td><td>$12.99</td><td>2022-05-10</td><td>Active</td></tr><tr><td>2</td><td></td><td></td><td></td><td>2022-06-15</td><td>Inactive</td></tr></tbody></table><h2 class="anchor anchorWithStickyNavbar_LWe7" id="unitemporal-and-bitemporal">Unitemporal and Bitemporal<a href="#unitemporal-and-bitemporal" class="hash-link" aria-label="Direct link to Unitemporal and Bitemporal" title="Direct link to Unitemporal and Bitemporal">​</a></h2><p>Unitemporal data refers to structured data that incorporates a single valid time dimension. It captures not only the current state of the data but also the validity period during which each record was considered accurate. Unitemporal data is valuable in scenarios where the historical context of changes is crucial for analysis, compliance, or auditing purposes. It enables tracking changes and conducting retrospective analysis based on different time periods.</p><p>Bitemporal data combines two time dimensions: valid time and transaction time. Valid time represents the period during which a record is considered accurate, while transaction time captures when a change was made to the record. Bitemporal data is commonly used in scenarios where analyzing the temporal aspects of data changes is critical, such as in financial systems or legal applications. 
It allows for precise tracking of changes and their timing, providing a comprehensive historical view.</p><p>Let's consider a real estate property database that tracks the historical ownership and value of properties. We want to capture the current and past ownership details along with their validity periods.</p><table><thead><tr><th>Property ID</th><th>Owner</th><th>Start Date</th><th>End Date</th></tr></thead><tbody><tr><td>1</td><td>John Smith</td><td>2010-01-01</td><td>2015-06-30</td></tr><tr><td>1</td><td>Sarah Johnson</td><td>2015-07-01</td><td>(Current)</td></tr><tr><td>2</td><td>Alex Williams</td><td>2012-03-15</td><td>(Current)</td></tr><tr><td>3</td><td>Emma Thompson</td><td>2014-05-10</td><td>2022-08-31</td></tr></tbody></table><p>In this Unitemporal example, the data captures the current and historical ownership information for properties. Each property has a unique Property ID. For instance, Property ID 1 was owned by John Smith from January 1, 2010, to June 30, 2015. Then, the ownership was transferred to Sarah Johnson from July 1, 2015, until the current date, which is represented as "Current." 
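</p><p>The payoff of the single valid-time dimension is that point-in-time questions become simple range predicates. Here is a sketch with Python's sqlite3, loading the ownership table from this example (a NULL end date stands in for "Current"):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ownership "
             "(property_id INTEGER, owner TEXT, start_date TEXT, end_date TEXT)")
conn.executemany("INSERT INTO ownership VALUES (?, ?, ?, ?)", [
    (1, "John Smith",    "2010-01-01", "2015-06-30"),
    (1, "Sarah Johnson", "2015-07-01", None),   # NULL = current record
    (2, "Alex Williams", "2012-03-15", None),
    (3, "Emma Thompson", "2014-05-10", "2022-08-31"),
])

def owner_as_of(conn, property_id, date):
    # "Who owned this property on the given date?" -- the validity period
    # must contain the date; an open-ended (NULL) period is still valid.
    row = conn.execute(
        "SELECT owner FROM ownership "
        "WHERE property_id = ? AND start_date <= ? "
        "AND (end_date IS NULL OR end_date >= ?)",
        (property_id, date, date)).fetchone()
    return row[0] if row else None

print(owner_as_of(conn, 1, "2014-01-01"))  # John Smith
print(owner_as_of(conn, 1, "2020-01-01"))  # Sarah Johnson
```

<p>Because validity periods for a given property do not overlap, each lookup returns at most one owner.</p><p>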
Property ID 2 is currently owned by Alex Williams, and Property ID 3 was owned by Emma Thompson until August 31, 2022.</p><p>By utilizing Unitemporal data, the property database can maintain a complete history of ownership records, enabling analysis based on specific timeframes and generating accurate reports reflecting changes in property ownership over time.</p><p>Let's continue with the real estate property database example, but this time, we will illustrate the Bitemporal approach to track the historical ownership and value of properties, capturing both valid time and transaction time.</p><table><thead><tr><th>Property ID</th><th>Owner</th><th>Start Date</th><th>End Date</th><th>Transaction Date</th><th>Property Value</th></tr></thead><tbody><tr><td>1</td><td>John Smith</td><td>2010-01-01</td><td>2015-06-30</td><td>2010-01-05</td><td>$100,000</td></tr><tr><td>1</td><td>Sarah Johnson</td><td>2015-07-01</td><td>(Current)</td><td>2015-07-10</td><td>$150,000</td></tr><tr><td>2</td><td>Alex Williams</td><td>2012-03-15</td><td>(Current)</td><td>2012-03-20</td><td>$200,000</td></tr><tr><td>3</td><td>Emma Thompson</td><td>2014-05-10</td><td>2022-08-31</td><td>2014-05-15</td><td>$300,000</td></tr></tbody></table><p>In this Bitemporal example, the data includes both the valid time and transaction time. Each property has a unique Property ID, and ownership records are associated with specific owners. Additionally, the transaction date represents when the ownership change or value update occurred.</p><p>For instance, Property ID 1 was initially owned by John Smith from January 1, 2010, to June 30, 2015, with a transaction date of January 5, 2010, and a property value of $100,000. 
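</p><p>With the second (transaction-time) dimension, we can also ask what the database <em>knew</em> at a given moment. Here is a sketch with Python's sqlite3, loading just the two records for Property ID 1 from the table in this example:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ownership (
    property_id INTEGER, owner TEXT,
    start_date TEXT, end_date TEXT,   -- valid time
    transaction_date TEXT,            -- transaction time
    value INTEGER
)""")
conn.executemany("INSERT INTO ownership VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "John Smith",    "2010-01-01", "2015-06-30", "2010-01-05", 100000),
    (1, "Sarah Johnson", "2015-07-01", None,         "2015-07-10", 150000),
])

def owner_as_known(conn, property_id, valid_date, known_at):
    """Bitemporal lookup: who owned the property on valid_date, according
    to what had been recorded in the database as of known_at."""
    return conn.execute(
        "SELECT owner, value FROM ownership "
        "WHERE property_id = ? "
        "AND start_date <= ? AND (end_date IS NULL OR end_date >= ?) "
        "AND transaction_date <= ? "
        "ORDER BY transaction_date DESC LIMIT 1",
        (property_id, valid_date, valid_date, known_at)).fetchone()

# The transfer to Sarah took effect on 2015-07-01 but was only recorded on
# 2015-07-10, so for a few days the database knew of no owner for that date.
print(owner_as_known(conn, 1, "2015-07-02", "2015-07-05"))  # None
print(owner_as_known(conn, 1, "2015-07-02", "2015-07-20"))  # ('Sarah Johnson', 150000)
```

<p>The ORDER BY on transaction_date ensures that, if a record is later corrected, the most recently recorded version wins for queries made after the correction.</p><p>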
The ownership was then transferred to Sarah Johnson, effective from July 1, 2015, until the current date, with a transaction date of July 10, 2015, and an updated property value of $150,000.</p><p>Similarly, Property ID 2 is currently owned by Alex Williams, with a transaction date of March 20, 2012, and a property value of $200,000. Property ID 3 was owned by Emma Thompson until August 31, 2022, with a transaction date of May 15, 2014, and a property value of $300,000.</p><p>By utilizing Bitemporal data, the property database can accurately track the historical ownership records and property values, considering both the valid time and transaction time. This enables precise analysis of property ownership and value changes over specific time periods, facilitating comprehensive historical reporting and audit trails.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion">​</a></h3><p>Understanding the various shapes and forms of structured data is essential for effective data management and analysis. From managing changes in Slowly Changing Dimensions (SCDs) to adopting Master Full or Master Incremental approaches, and from leveraging the temporal aspects of Unitemporal and Bitemporal data, each concept offers unique benefits</p>]]></content:encoded>
            <category>data</category>
            <category>database</category>
        </item>
        <item>
            <title><![CDATA[Dozer Goes Open Source: Empowering the Community to Build Real-time Data Apps]]></title>
            <link>https://getdozer.io/blog/dozer-goes-open-source</link>
            <guid>https://getdozer.io/blog/dozer-goes-open-source</guid>
            <pubDate>Fri, 21 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Dozer Goes Open Source: Empowering the Community to Build Real-time Data Apps]]></description>
            <content:encoded><![CDATA[<p><img loading="lazy" src="/blog/assets/images/cover-open-source.png" alt="Dozer Goes Open Source: Empowering the Community to Build Real-time Data Apps" class="img_ev3q"></p><p>We are excited to announce that Dozer is now open source under <a href="https://www.apache.org/licenses/LICENSE-2.0.html" target="_blank" rel="noopener noreferrer">the Apache 2.0 license</a> 🎉. With this move, we aim to empower the community to build and scale <a href="https://www.splunk.com/en_us/data-insider/what-is-real-time-data.html" target="_blank" rel="noopener noreferrer">real-time data</a> applications more effectively.</p><p>Dozer simplifies the process of connecting applications to various data sources, such as PostgreSQL, Kafka, or other databases &amp; sources, enabling developers to easily unify data across different sources. By making Dozer open source, we are inviting developers to contribute to its growth and help shape the future of real-time data applications.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-is-dozer">What is Dozer?<a href="#what-is-dozer" class="hash-link" aria-label="Direct link to What is Dozer?" title="Direct link to What is Dozer?">​</a></h2><p>Dozer is a Real-Time Analytical Layer for your LLMs and Data Products. It enables easy development, deployment and maintenance of real-time data products. The aim of the product is to enable developers to build customer-facing analytical products without having to worry about building infrastructure. Dozer is built in Rust and utilises ClickHouse for serving low-latency analytics.
With a few lines of SQL and a simple YAML configuration, you can build, deploy and maintain full data backends. Dozer is designed to be easy to use, scalable, and flexible, making it an ideal platform for building real-time data applications.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-open-source">Why Open Source?<a href="#why-open-source" class="hash-link" aria-label="Direct link to Why Open Source?" title="Direct link to Why Open Source?">​</a></h2><p>There are several reasons why we decided to make Dozer open source:</p><ul><li><p><strong>Community-driven innovation:</strong> We want developers from all over the world to contribute their ideas, and provide valuable insights, improvements, and fixes, leading to a more innovative and robust platform, and faster development cycles.</p></li><li><p><strong>Transparency and trust:</strong> We want to enable users to view and understand the underlying code, fostering trust in the platform and ensuring that it meets their needs and expectations, especially regarding how data is stored and processed.</p></li><li><p><strong>Collaboration and learning:</strong> We encourage developers to collaborate, share ideas, and learn from one another. We want to foster a strong community that helps developers grow their skills and expertise in the real-time data space.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="getting-started-with-dozer">Getting Started with Dozer<a href="#getting-started-with-dozer" class="hash-link" aria-label="Direct link to Getting Started with Dozer" title="Direct link to Getting Started with Dozer">​</a></h2><p>To help you get started with Dozer, we've created <a href="https://getdozer.io/docs/dozer/" target="_blank" rel="noopener noreferrer">documentation</a> and <a href="https://getdozer.io/blogs/" target="_blank" rel="noopener noreferrer">tutorial blogs</a> that guide you through the process of setting up, configuring, and using Dozer for your data applications. 
You can find these resources on our official documentation site and can also follow <a href="https://dev.to/getdozer" target="_blank" rel="noopener noreferrer">Dozer on dev.to</a> for additional resources on getting to know &amp; use Dozer better!</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-to-contribute">How to Contribute?<a href="#how-to-contribute" class="hash-link" aria-label="Direct link to How to Contribute?" title="Direct link to How to Contribute?">​</a></h2><p>We welcome contributions from the community! If you're interested in contributing to Dozer, please check out our GitHub <a href="https://github.com/getdozer/dozer" target="_blank" rel="noopener noreferrer">repository</a> for the guidelines. Whether you want to submit a bug report, suggest a new feature, or contribute code, we appreciate your help in making Dozer even better.</p><p>We also encourage you to check out the following resources to learn about other ways of contributing:</p><ul><li><strong>Show and tell GitHub discussion forum</strong>: We have set up a <a href="https://github.com/getdozer/dozer/discussions/categories/show-and-tell" target="_blank" rel="noopener noreferrer">GitHub discussion forum</a>, where you can share your Dozer projects and experience, ask questions, make a feature request, and connect with fellow Dozer community members. Feel free to join the conversation and share your own projects &amp; ideas!
Additionally, we have also created a Discord channel to give developers an opportunity to showcase and talk about their projects. Join our <a href="https://discord.gg/64rQR4d3Z8" target="_blank" rel="noopener noreferrer">Discord channel</a>. If you’ve built something using Dozer, we’d love to see it!</li></ul><p>You can also join our community <a href="https://discord.com/invite/3eWXBgJaEQ" target="_blank" rel="noopener noreferrer">Discord</a> to see what we are cooking at Dozer 👩‍🍳 👨‍🍳</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion">​</a></h2><p>We believe that by making Dozer open source, we are empowering the developer community to build amazing real-time data applications. We look forward to seeing the projects and ideas that arise from this collaborative effort, and we're excited to work together to shape the future of real-time data!
Together, we can contribute to the ongoing success of open source software. 🚀</p><p>Happy coding! 🚀👩‍💻👨‍💻</p>]]></content:encoded>
            <category>dozer</category>
            <category>apache</category>
            <category>open source</category>
            <category>community</category>
            <category>company</category>
        </item>
        <item>
            <title><![CDATA[Why you might not even need a data platform]]></title>
            <link>https://getdozer.io/blog/why-you-might-not-need-a-data-platform</link>
            <guid>https://getdozer.io/blog/why-you-might-not-need-a-data-platform</guid>
            <pubDate>Thu, 16 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Do you really need a data platform? New leaner architectures that could save you a lot of money are emerging!]]></description>
            <content:encoded><![CDATA[<p>Every company I meet today has a data platform. And if they don’t have one, they want one. The problem is that building and maintaining a data platform is not trivial. First, multiple tools need to be integrated together: Airflow, Spark, Presto, Kafka, Flink, Snowflake, and potentially many more, but, more importantly, a dedicated engineering team must be set up to maintain it and make sure everything runs smoothly. And, what usually happens is that, after data has been accumulated for months and months, the cost of running such infrastructure is higher than the benefit.</p><p><em>So the question is: do you really need a data platform?</em></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="get-back-to-the-basics">Get back to the basics<a href="#get-back-to-the-basics" class="hash-link" aria-label="Direct link to Get back to the basics" title="Direct link to Get back to the basics">​</a></h2><p>Let’s take the example of a mid-size company embarking on the adventure of building a data platform. Generally, they do it for two purposes:</p><ul><li><strong>Data analytics</strong>: Being able to generate analytical dashboards from historical data</li><li><strong>AI and advanced use cases</strong> such as real-time user personalisation</li></ul><p>Typically, for the first use case you’d set up a Snowflake or Databricks instance and dump all your data there. But wait! Do you really need it? Very likely you will not have petabytes of data to manage. How about something leaner?</p><p>If you are familiar with the data space, you’d have probably recently heard about tools like Pola.rs, Datafusion or DuckDB! If you have not heard about them, they are small and highly efficient OLAP query engines that can achieve impressive performance. The reason they are so efficient is that their authors have made the decision to go back to the basics. 
Forget about distributed data processing frameworks like Apache Spark with inefficient network shuffling! Forget about 20-year-old languages like Java or Scala (with all the GC problems they bring along)! Embrace simplicity using lower-level languages like C/C++ or, even better, Rust, and squeeze every CPU cycle to get as much performance as possible.</p><p>So, it’s pretty trivial to dump all your data from your OLTP databases into an S3 bucket, bring up multiple ad-hoc instances of DuckDB, Pola.rs or DataFusion, run all your OLAP queries, and shut everything down. All for a negligible TCO. Multiple companies have realised the potential of such an approach and are building what I call “poor man’s data platforms” around these tools. <a href="https://motherduck.com/" target="_blank" rel="noopener noreferrer">MotherDuck</a> is doing this with DuckDB, for example.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-about-real-time-">How about real-time?<a href="#how-about-real-time-" class="hash-link" aria-label="Direct link to How about real-time ?" title="Direct link to How about real-time ?">​</a></h2><p>While this lean approach is very easy to achieve for batch workloads, it is not that trivial when we start addressing more complex use cases like AI or real-time personalisation. Real-time is a lot harder and, in many scenarios, the goal of real-time use cases is not just producing analytical dashboards, but the full integration of the data with customer-facing applications, enabling another level of interactivity. The simplest example is probably user personalisation. For such a use case, data from multiple sources needs to be combined, an ML model might be applied, and, in some scenarios, data should be updated based on user behaviour. All this in real time!</p><p>Achieving this today is not trivial. Some companies have given up on handling all this in real time, because it’s simply too complex and expensive. 
Think for instance of how reverse ETL and personalisation APIs are really implemented today in most cases: everything is still batch! Data is pulled from your sources using tools like AirByte or Fivetran and loaded into your Snowflake or Databricks. Then, every day or hour, you run your dbt jobs, which extract the data you need, run your ML models, and load the results into some cache or low-latency database for serving. Companies are trying to come up with solutions to simplify the process, but everything is still: batch!</p><p>If you want something more than this, it is definitely possible! But it is complex! You need an entire infrastructure that is capable of handling real-time data (e.g. Kafka), a stream processing engine (e.g. Spark Streaming, Flink, Kafka Streams), one or more low-latency data stores depending on the query patterns of your application (e.g. Redis, Aerospike, Elasticsearch), an API layer and, most importantly, a data engineering team capable of putting all these pieces together!</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="enter-the-data-apps-world">Enter the Data Apps world<a href="#enter-the-data-apps-world" class="hash-link" aria-label="Direct link to Enter the Data Apps world" title="Direct link to Enter the Data Apps world">​</a></h2><p>So, is there a way to achieve the same simplicity of DuckDB or Pola.rs for something like this? Probably yes, and the answer is Data Apps. What are data apps? There is really no proper definition for them, but the way I like to describe a data app is:</p><p><em>A self-contained monolithic application that is capable of efficiently serving data and, at the same time, reacting to data changes in real time and performing complex operations such as joins, aggregations, ML predictions, notifications, and more.</em></p><p>The definition is generic on purpose. 
But, fundamentally, I see Data Apps as the bridge between source systems and user-facing applications, enabling a high level of data interactivity and actionability.</p><p>Forget about streams, caches, pipelines, etc.! Just put a data app backend between the source systems and the user application and magic can happen!</p><p>Some of these ideas have been pioneered by a very successful tool called <a href="https://streamlit.io/" target="_blank" rel="noopener noreferrer">Streamlit</a>: a Python framework allowing data scientists to quickly prototype data apps using Python. While Streamlit is a beautiful and powerful tool, it has not yet unlocked the full potential of data apps, especially when an entire ecosystem on the backend side has to be connected.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-software-engineers-perspective">The software engineer’s perspective<a href="#the-software-engineers-perspective" class="hash-link" aria-label="Direct link to The software engineer’s perspective" title="Direct link to The software engineer’s perspective">​</a></h2><p>The initial idea of Streamlit was primarily to let data scientists with no UI development experience showcase their work and let users interact with their ML models. Now, let’s think of data apps from a full-stack or frontend engineer’s perspective. What I’d want is a quick way of pulling production data from multiple sources, processing it in real time using familiar tools like SQL, JavaScript and Python, and having ready-made APIs allowing me to interact with the data. 
I want to query it and I want to trigger events that might propagate back to the source system and, again, in real time, see how my changes affect the system.</p><p>As a full-stack engineer, I want the superpowers of a full data engineering team!</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-bottom-line">The bottom line<a href="#the-bottom-line" class="hash-link" aria-label="Direct link to The bottom line" title="Direct link to The bottom line">​</a></h2><p>If all this is possible, it means all the complexity needed for a typical data platform with a lambda or kappa architecture is gone. Batch workloads can be easily handled using tools like DuckDB, and real-time workflows can be easily handled by a bunch of real-time data apps distributed across the organization, sitting between the source systems and the users.</p><p><em>The philosophy behind all this is what led us to create</em> <a href="https://github.com/getdozer/dozer" target="_blank" rel="noopener noreferrer"><em>Dozer</em></a><em>: a real-time data app backend specifically targeted at full-stack and frontend engineers. Our mission is to give data superpowers to the full-stack developer!</em></p>]]></content:encoded>
            <category>data</category>
            <category>api</category>
            <category>rust</category>
            <category>dataplatform</category>
        </item>
        <item>
            <title><![CDATA[Two things that Rust does better than C++]]></title>
            <link>https://getdozer.io/blog/rust-cpp-move-and-dispatch</link>
            <guid>https://getdozer.io/blog/rust-cpp-move-and-dispatch</guid>
            <pubDate>Mon, 13 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This post discusses two language features that Rust handles better than C++: its ownership model and trait object system. These compare favorably to C++'s move semantics and virtual functions, respectively, and help explain why Rust has gained popularity among many developers.
]]></description>
            <content:encoded><![CDATA[<p>At <a href="https://github.com/getdozer/dozer" target="_blank" rel="noopener noreferrer">Dozer</a>, we have adopted Rust as our main programming language, despite many of our team members having a strong background in C++. This is because Rust offers a combination of expressiveness, safety and ergonomics through its language constructs, which we find appealing.</p><p>In this post, we will discuss two language features that we believe Rust handles better than C++, namely its ownership model and trait object system. These compare favorably to C++'s move semantics and virtual functions, respectively, and provide insights into why Rust has gained popularity among many developers.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="ownership-vs-move-semantics">Ownership vs Move Semantics<a href="#ownership-vs-move-semantics" class="hash-link" aria-label="Direct link to Ownership vs Move Semantics" title="Direct link to Ownership vs Move Semantics">​</a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="phenomena">Phenomena<a href="#phenomena" class="hash-link" aria-label="Direct link to Phenomena" title="Direct link to Phenomena">​</a></h3><p>Consider the following Rust code (<a href="https://play.rust-lang.org/?version=stable&amp;mode=debug&amp;edition=2021&amp;gist=1517db73ea6b2c94cfa4c779b9471199" target="_blank" rel="noopener noreferrer">playground</a>):</p><div class="language-rust codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-rust codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">struct Struct;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain">impl Drop for Struct {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    fn drop(&amp;mut self) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        println!("dropped");</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">fn main() {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    let a: Struct = Struct;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    let _b: Struct = a;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>If you run it, there's a single output line:</p><div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" 
style="color:#393A34"><span class="token plain">dropped</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The C++ code that behaves most similarly (<a href="https://www.sololearn.com/compiler-playground/cNLPphJeqrGl" target="_blank" rel="noopener noreferrer">playground</a>):</p><div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token macro property directive-hash" style="color:#36acaa">#</span><span class="token macro property directive keyword" style="color:#00009f">include</span><span class="token macro property" style="color:#36acaa"> </span><span class="token macro property string" style="color:#e3116c">&lt;iostream&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">struct</span><span class="token plain"> </span><span class="token class-name">Struct</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token function" style="color:#d73a49">Struct</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">default</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token function" style="color:#d73a49">Struct</span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> Struct </span><span class="token operator" style="color:#393A34">&amp;</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">delete</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token function" style="color:#d73a49">Struct</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Struct </span><span class="token operator" style="color:#393A34">&amp;&amp;</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">default</span><span class="token punctuation" style="color:#393A34">;</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Struct </span><span class="token operator" style="color:#393A34">&amp;</span><span class="token keyword" style="color:#00009f">operator</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> Struct</span><span class="token operator" style="color:#393A34">&amp;</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">delete</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token operator" style="color:#393A34">~</span><span class="token function" style="color:#d73a49">Struct</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">cout </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"destructed"</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">endl</span><span 
class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">int</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Struct a</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Struct b </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token function" style="color:#d73a49">move</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">a</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>It outputs two lines:</p><div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">destructed</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">destructed</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" 
class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>We can see that the C++ destructor is executed twice.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="analysis">Analysis<a href="#analysis" class="hash-link" aria-label="Direct link to Analysis" title="Direct link to Analysis">​</a></h3><p>The root of the problem is that C++ only provides rvalue references as a special type at the language level; move semantics are implemented by the user according to convention. From the compiler's perspective, an object that has been moved from is still an intact object. This not only means destructors are executed multiple times (which by itself adds runtime overhead), but also imposes two burdens on class authors in C++:</p><ul><li>The destructor must correctly handle objects that have been moved from.</li><li>All public interfaces must correctly handle objects that have been moved from, or transfer this burden to the class user.</li></ul><p>The first is obvious. As for the second, since correctly handling moved-from objects in every public interface usually incurs runtime overhead, the responsibility of not using moved-from objects falls on almost all C++ users, while class authors typically only provide an interface for querying whether an object has been moved from.</p><p>A typical example of the second is <code>std::unique_ptr</code>: after a move, the source pointer is null, so any user of <code>std::unique_ptr</code> must check for null.</p><p>C++'s move semantics greatly reduce the usability of RAII. When the user gets an object, they always need to consider whether the resource it manages has been moved. 
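To make the contrast concrete, here is a minimal standalone sketch (not from the original post): because the Rust compiler tracks ownership, a moved-from value is statically dead, its destructor can only ever run once, and any later use is rejected at compile time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times `Resource::drop` runs.
static DROP_COUNT: AtomicUsize = AtomicUsize::new(0);

struct Resource;

impl Drop for Resource {
    fn drop(&mut self) {
        DROP_COUNT.fetch_add(1, Ordering::SeqCst);
    }
}

// Moves a `Resource` around and returns how many drops that caused.
fn drops_after_move() -> usize {
    let before = DROP_COUNT.load(Ordering::SeqCst);
    let a = Resource;
    let b = a; // ownership moves; `a` is statically dead from here on
    // let _ = &a; // would not compile: use of moved value `a`
    drop(b); // the single destructor call happens here
    DROP_COUNT.load(Ordering::SeqCst) - before
}

fn main() {
    println!("drops: {}", drops_after_move());
}
```

In the C++ version above, by contrast, nothing stops you from touching `a` after the move, and the destructor runs for both the moved-from and the moved-to object.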
This increases the mental burden on the programmer and is a breeding ground for bugs.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="trait-object-vs-virtual-function">Trait Object vs Virtual Function<a href="#trait-object-vs-virtual-function" class="hash-link" aria-label="Direct link to Trait Object vs Virtual Function" title="Direct link to Trait Object vs Virtual Function">​</a></h2><p>Consider the following Rust code (<a href="https://play.rust-lang.org/?version=stable&amp;mode=debug&amp;edition=2021&amp;gist=84b56815fd358bb63e1601354a907bc9" target="_blank" rel="noopener noreferrer">playground</a>):</p><div class="language-rust codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-rust codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">trait Trait {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    fn f(&amp;self);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">struct Impl;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">impl Trait for Impl {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    fn f(&amp;self) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        println!("f from Impl");</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    }</span><br></span><span
class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">fn main() {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    let a: Impl = Impl;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    let b: &amp;dyn Trait = &amp;a;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    b.f();</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    println!("Size of Impl is {}", std::mem::size_of::&lt;Impl&gt;());</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The output is:</p><div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">f from Impl</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Size of Impl is 
0</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The fact that the size of the <code>Impl</code> struct is <code>0</code> means that whether or not runtime polymorphism is used has no impact on the memory layout of the struct itself.</p><p>The C++ code that behaves most similarly (<a href="https://www.sololearn.com/compiler-playground/cFgXc9YWBOe2" target="_blank" rel="noopener noreferrer">playground</a>):</p><div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token macro property directive-hash" style="color:#36acaa">#</span><span class="token macro property directive keyword" style="color:#00009f">include</span><span class="token macro property" style="color:#36acaa"> </span><span class="token macro property string" style="color:#e3116c">&lt;iostream&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token 
class-name">Trait</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">public</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">virtual</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">void</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">f</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">Impl</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token base-clause keyword" 
style="color:#00009f">public</span><span class="token base-clause"> </span><span class="token base-clause class-name">Trait</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">public</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">void</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">f</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">override</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">cout </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"f from Impl"</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">endl</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">int</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Impl a</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Trait </span><span class="token operator" style="color:#393A34">&amp;</span><span class="token plain">b </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> a</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    b</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">f</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">cout </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Size of Impl is "</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">sizeof</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">a</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">endl</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 
2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The output is:</p><div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">f from Impl</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Size of Impl is 8</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>This is the output on a 64-bit system: because runtime polymorphism is used, each <code>Impl</code> object holds an 8-byte virtual table pointer.</p><p>Compared to Rust's trait objects, C++ runtime polymorphism is not a zero-overhead abstraction. The additional 8 bytes of storage per object can be significant, and the fact that the virtual table pointer changes the object's memory layout greatly limits where it can be applied.</p>]]></content:encoded>
            <category>rust</category>
        </item>
        <item>
            <title><![CDATA[Data as a product - The role of APIs]]></title>
            <link>https://getdozer.io/blog/data-apis-role</link>
            <guid>https://getdozer.io/blog/data-apis-role</guid>
            <pubDate>Tue, 28 Jun 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[What are the challenges of implementing efficient and scalable data APIs?]]></description>
            <content:encoded><![CDATA[<p><em>What are the challenges of implementing efficient and scalable data APIs? </em></p><p><img loading="lazy" alt="Rust Programming" src="/blog/assets/images/crab_sea-a907ed7c0a1087a583d9f2b33e229a82.webp" width="1040" height="585" class="img_ev3q"></p><p>Thanks to the adoption of cloud data warehouse platforms like Snowflake or Databricks, every organization is producing more and more data. This data is later processed by data analysts to extract insights or by data scientists to build predictive models that support business decisions. Data analysts generally use tools like DBT to write SQL transformations, while Data Scientists generally prefer using a Python stack or AutoML tools like DataRobot or H2O. In both cases, all results are written back to the same data warehouse for easier accessibility. </p><p>To consume this data, companies have started building analytical dashboards, which play an important role in monitoring the health of the business and help drive strategic decisions.
More recently, companies started to realize the value of this data in other contexts. Reverse ETL tools like Hightouch or Census, for instance, unlock its value in operational use cases by making insights or predictions available in cloud SaaS applications. This is very useful, for example, to improve the efficiency of an e-mail marketing campaign.</p><p>Use cases, however, are not just limited to internal consumption. In multiple scenarios, it's extremely useful to expose this data directly to the end-user as part of the product experience. Think of the fintech industry, for example, where companies need to make this data readily available from the user's mobile app in order to improve their product's UX.</p><p>This seems like a very easy task to achieve, but in reality it can require a lot of work from a diverse group of people. Let's understand why!
Data Warehouses like Snowflake or Databricks are specifically designed for analytical purposes. This means they are not suited for low-latency querying and point lookups. However, these are the typical requirements in a microservice serving customer applications or a mobile app; fast response time is a prerequisite to implementing a good UX for the user. For this reason, data sitting in the data warehouse needs to be moved to a different type of storage that is capable of offering these capabilities. During this process, data must be properly prepared and indexed, and an API layer must be created in front of it, so that product engineers can build their applications on top. This whole process is quite challenging and requires a lot of data engineering work. </p><p>Let's look at some challenges in detail:</p><ul><li><strong>Data Models</strong>: microservices and front-end developers are used to working with hierarchical data models (like JSON or Protobuf), while Data Analysts and Data Scientists are more comfortable with tabular data. In order to better fit API use cases, it is ideal to put in place mechanisms to automatically denormalize and transform data from tabular to hierarchical representations. </li><li><strong>Data Integrity</strong>: In some situations incremental movement of data is okay, but in other scenarios a dataset must be replaced completely with a new version of the data. 
In these cases, it is important to ensure that an "all-or-nothing" pattern is applied, preventing the mix-up of old and new data during deployment.</li><li><strong>Seamless to Consumer</strong>: Once a new version of the data is deployed, it is important that the consumer starts using the new version of the data in an automated fashion.</li><li><strong>Easy Rollbacks</strong>: In cases where wrong data gets deployed, it must be possible to roll back to an older version with minimal effort in order to avoid any disruption in user functionality.</li><li><strong>Fine-Grained Observability and RCA</strong>: It is possible that, for any reason, some wrong data is served to the user. In those situations, it is essential to have a proper observability tool that is capable of tracking each API user request and tracing it back to the source data.</li><li><strong>Low-latency</strong>: The way data is represented and indexed depends very much on the consumption pattern. Sometimes it is necessary to look up data by a primary key, other times by multiple secondary keys, and other times by a geographic location, and so on. A storage layer that sits in front of APIs must be able to satisfy these kinds of lookups very efficiently and at extremely low latency.</li><li><strong>Auto-scaling</strong>: APIs need to handle spikes of traffic efficiently. This is generally achieved with auto-scaling. This is an easy task when a stateless API server needs to be auto-scaled, but it is much harder when APIs and storage need to be scaled together. </li></ul><p>All the challenges I described above are what we are solving with Dozer. We are aiming to automate the data extraction and preparation process to make it efficiently serviceable through APIs. Stay tuned for more!</p>]]></content:encoded>
            <category>api</category>
            <category>data</category>
        </item>
        <item>
            <title><![CDATA[Improve your Postgres query performance through a CDC pipeline - Part 1]]></title>
            <link>https://getdozer.io/blog/postgres-cdc-query-performance</link>
            <guid>https://getdozer.io/blog/postgres-cdc-query-performance</guid>
            <pubDate>Sun, 19 Jun 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Ever come across occasions when you run into query performance issues for important queries that run on your database? This is when most companies will look at introducing a caching layer to improve the speed of queries.]]></description>
            <content:encoded><![CDATA[<p>Ever come across occasions when you run into query performance issues for important queries that run on your database? This is when most companies will look at introducing a caching layer to improve the speed of queries. </p><p><img loading="lazy" alt="Dozer" src="/blog/assets/images/dozer-000ce18e1a59acadc6f4723674af4052.png" width="1020" height="414" class="img_ev3q"></p><p>In many scenarios you can probably fix your performance issues by introducing the right set of indexes, or maybe by denormalizing some fields to reduce the join overhead. These come with their own set of challenges, such as having to write to two places, and may even require code changes. You might be working with a legacy platform where changing code is not straightforward. </p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="what-are-your-options-in-implementing-a-caching-layer-">What are your options in implementing a caching layer ?<a href="#what-are-your-options-in-implementing-a-caching-layer-" class="hash-link" aria-label="Direct link to What are your options in implementing a caching layer ?" title="Direct link to What are your options in implementing a caching layer ?">​</a></h4><ul><li>Application layer cache using Redis, Memcached, DynamoDB, etc.</li><li>Replicate data using CDC to an alternative DB/store optimised for your queries.</li></ul><h4 class="anchor anchorWithStickyNavbar_LWe7" id="1-application-layer-cache-using-redis--memcached--dynamodb">1) Application Layer Cache using Redis / Memcached / Dynamodb<a href="#1-application-layer-cache-using-redis--memcached--dynamodb" class="hash-link" aria-label="Direct link to 1) Application Layer Cache using Redis / Memcached / Dynamodb" title="Direct link to 1) Application Layer Cache using Redis / Memcached / Dynamodb">​</a></h4><p>This is a widely used approach where you would implement a caching layer by adopting one of these caching strategies: </p><ul><li>Cache Aside: the application maintains data in both the cache and the primary DB</li><li>Read Through: typically implemented using a library / framework that talks to the DB if there is a cache miss</li><li>Write Through: writes pass through the cache to the primary DB</li><li>Write Back: writes go to the cache first and are flushed to the primary DB later</li></ul><p>Each of these strategies comes with its own set of pros and cons, but the main differentiation is that application code has to deal with the complexity of the caching logic. </p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="2-replicate-data-through-cdc-to-a-secondary-database">2) Replicate data through CDC to a secondary database.<a href="#2-replicate-data-through-cdc-to-a-secondary-database" class="hash-link" aria-label="Direct link to 2) Replicate data through CDC to a secondary database." title="Direct link to 2) Replicate data through CDC to a secondary database.">​</a></h4><p>This approach has been gaining traction for read-heavy operations. With tools such as Debezium and AWS Database Migration Service, companies are building pipelines that move data using a replication approach. The diagram below illustrates the typical components involved. 
</p><p><img loading="lazy" alt="components" src="/blog/assets/images/components-0850894066d688c5027bc9293531e3eb.png" width="1002" height="518" class="img_ev3q"></p><p>This can be implemented without modifying the original implementation.
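</p><p>As a concrete starting point for such a pipeline on Postgres, logical decoding has to be enabled on the primary so that a tool like Debezium can read the change stream. A minimal sketch follows; the database, publication and slot names are illustrative, not part of any real deployment:</p>

```shell
# Enable logical decoding on the primary (requires a Postgres restart).
psql -d appdb -c "ALTER SYSTEM SET wal_level = logical;"

# Publish the tables whose changes should flow into the pipeline.
psql -d appdb -c "CREATE PUBLICATION orders_pub FOR TABLE orders;"

# Create a replication slot so WAL is retained until the downstream
# consumer (e.g. Debezium with the pgoutput plugin) has read it.
psql -d appdb -c "SELECT pg_create_logical_replication_slot('orders_slot', 'pgoutput');"
```

<p>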
There are some considerations to take note of:</p><ul><li>Does the data need to be real-time?</li><li>What types of indexes suit your querying needs?</li><li>How do you guarantee availability?</li><li>What happens if the schema in the primary database changes?</li><li>What are the costs involved?</li></ul><p>This approach requires data engineering effort and skilled engineers to build and maintain. </p><p>In the next article (Part 2) we will be publishing a sample repository that demonstrates some of this in code.
At Dozer, we are very excited to be building an end-to-end system that takes care of this exact problem. We are currently in the build phase and will publish our repository soon for developers to try. Please sign up on the Dozer website to get early access.</p>]]></content:encoded>
            <category>cdc</category>
            <category>postgres</category>
            <category>mysql</category>
            <category>debezium</category>
        </item>
        <item>
            <title><![CDATA[Are Rust, C++ and WASM the new tools for Data Engineering?]]></title>
            <link>https://getdozer.io/blog/dozer-rust-wasm-c++</link>
            <guid>https://getdozer.io/blog/dozer-rust-wasm-c++</guid>
            <pubDate>Sat, 18 Jun 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Traditional tools for data engineering are struggling with performance and scalability. JVM-based tools are becoming outdated, while new languages are becoming increasingly popular. Will Rust and WASM replace the current data engineering JVM-based stack?]]></description>
            <content:encoded><![CDATA[<p><em>Traditional tools for data engineering are struggling with performance and scalability. JVM-based tools are becoming outdated, while new languages are becoming increasingly popular. Will Rust and WASM replace the current data engineering JVM-based stack?</em></p><p><img loading="lazy" alt="Rust Programming" src="/blog/assets/images/wasm-and-rust-c2b9e9dd8d01499c881d4ff217947891.webp" width="730" height="487" class="img_ev3q"></p><p>I started my career as a C/C++ developer 20 years ago working on network protocols and embedded systems. Over time, I moved more and more into the data space and my level of abstraction started to move up the stack, with obviously less control of what is going on under the hood. When you go from C/C++ to Java, everything seems rosy in the beginning but, soon, when you start struggling with memory allocation, garbage collection and similar things, you realise that you are losing the power you had in your hands during your old C/C++ days. The advantage of the JVM, though, is pluggability. </p><p>If you design your software well, you can pretty much allow any customisation to be plugged in at a binary level, just by adding a new JAR to your classpath. Where things get trickier, however, is scriptability. In many situations you want your software to be scriptable using languages like JavaScript. It is possible, but the level of integration between scripting languages and the JVM is not that great. And many times, performance is poor. Think, for example, of the Spark and Python integration. That required a bridge like Py4J to make it work, but at a huge performance cost. Things have got better now with support for new formats like Arrow, but I remember the first version of PySpark was pretty crappy and almost unusable. </p><p>However, I have a feeling things are starting to change. 
People are realising that maybe the JVM is not really the best option for building data-intensive applications. But what's the alternative? Recently, Rust has become very popular, thanks also to the support of the blockchain community, and developers have started to realise that it can be used to build large and scalable systems. And...where do we need scalability today? Data! We have to handle more and more data and, clearly, the current tooling is not scaling up. It is telling that Databricks went through a complete rewrite of Apache Spark in C++, with huge benefits in terms of performance and scalability. At the same time you see several startups taking a similar direction. Look at RedPanda, which is implementing a much leaner version of Kafka entirely in C++. Many companies are following and will follow this trend.</p><p>But how do we allow pluggability in these systems? Meet WASM, the new kid on the block. WASM is fundamentally a machine-level language that can integrate seamlessly with C++ and Rust. The beauty of it is that WASM can be generated from multiple languages like C, C++, AssemblyScript (a variation of TypeScript), Rust, Kotlin and others. You can even compile a full Python interpreter to WASM and host the execution of a Python script! As more and more languages support compilation to WASM or LLVM, the possibilities are endless.
Now I think you see where I'm going! By bridging high-performance languages like C++ or Rust with WASM, we get the best of both worlds: performance, scalability and pluggability.
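</p><p>As a tiny illustration of that pattern (the function and its signature are hypothetical, not taken from any real engine), a plugin can be written in ordinary Rust and compiled to a .wasm module that a C++ or Rust host loads and calls by name:</p>

```rust
// A minimal, hypothetical WASM plugin: a pure transform exported with a
// C ABI so a host engine can look it up by name in the compiled module.
// Build as a plugin with: cargo build --target wasm32-unknown-unknown
#[no_mangle]
pub extern "C" fn scale(value: i64, factor: i64) -> i64 {
    // The business logic lives in the plugin; the host stays generic.
    value * factor
}

fn main() {
    // Compiled natively, the same code can be exercised directly.
    println!("{}", scale(6, 7)); // prints 42
}
```

<p>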
I truly believe in this new pattern, and that is why, at Dozer, we are building the next-generation Data APIs stack entirely in Rust and WASM. Stay tuned!</p>]]></content:encoded>
            <category>wasm</category>
            <category>rust</category>
        </item>
    </channel>
</rss>