<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Dozer | Start building real-time data apps in minutes Blog</title>
        <link>https://getdozer.io/blog/</link>
        <description>Dozer | Start building real-time data apps in minutes Blog</description>
        <lastBuildDate>Thu, 16 Nov 2023 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[Retrieval Augmented Generation (RAG) workflow with Dozer]]></title>
            <link>https://getdozer.io/blog/RAG%20with%20dozer</link>
            <guid>https://getdozer.io/blog/RAG%20with%20dozer</guid>
            <pubDate>Thu, 16 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog explores how the challenges of siloed and real-time data, once formidable barriers, yield to RAG's ingenuity. Through Dozer, RAG unlocks the secrets of siloed data, while event stream processing and real-time data pipelines ensure that the LLM remains abreast of the ever-changing world.
]]></description>
            <content:encoded><![CDATA[<p>Large Language Models (LLMs) are a type of AI that is trained on a massive amount of text data. This allows LLMs to generate text, translate languages, write content and answer questions in an informative way. However, LLMs are not perfect. They often make mistakes and produce text that is not coherent or relevant to the topic at hand. LLMs can sometimes generate inaccurate or misleading information, even if it sounds plausible. This is because they learn from statistical patterns in the data, which may not always correspond to reality. This issue can be particularly problematic in applications where factual accuracy is crucial.</p><p>One particular challenge lies in the issue of hallucinations, where LLMs produce outputs that are factually inaccurate or misleading. This phenomenon, often stemming from outdated training data, can have significant implications for the reliability and trustworthiness of LLMs.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="issues-with-llms">Issues with LLMs<a href="#issues-with-llms" class="hash-link" aria-label="Direct link to Issues with LLMs" title="Direct link to Issues with LLMs">​</a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="hallucinations-problem">Hallucinations Problem:<a href="#hallucinations-problem" class="hash-link" aria-label="Direct link to Hallucinations Problem:" title="Direct link to Hallucinations Problem:">​</a></h3><p>Hallucinations in LLMs occur when the model's predictions deviate from reality, generating text that is inconsistent with the input or the broader context. This can manifest in various forms, such as fabricating facts, expressing outdated opinions, or drawing erroneous conclusions from data. The underlying cause of these hallucinations can be traced back to the training data upon which LLMs are built.</p><p>LLMs are trained on massive amounts of text and code, encompassing a vast repository of human knowledge. 
However, this data is not without its biases and imperfections. It may reflect societal prejudices, contain outdated information, or simply lack the nuance and context required for accurate understanding. When LLMs are trained on such flawed data, they may inherit these biases and imperfections, leading to hallucinations in their outputs.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="out-of-date-training-data">Out-of-Date Training Data:<a href="#out-of-date-training-data" class="hash-link" aria-label="Direct link to Out-of-Date Training Data:" title="Direct link to Out-of-Date Training Data:">​</a></h3><p>The presence of outdated data in training sets further exacerbates the hallucination problem. As technology and society evolve, information becomes obsolete, and LLMs trained on such data may struggle to keep pace with the changing world. This can lead to the generation of factually incorrect information or outdated opinions, undermining the credibility of LLMs and limiting their usefulness in real-world applications.</p><p>Retrieval Augmented Generation (RAG) is a promising approach to address the issues of hallucination and out-of-date training data in large language models (LLMs). RAG combines the strengths of LLMs with those of retrieval-based systems to generate more accurate and reliable text.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-rag-works">How RAG Works<a href="#how-rag-works" class="hash-link" aria-label="Direct link to How RAG Works" title="Direct link to How RAG Works">​</a></h2><p>RAG works by first retrieving relevant passages from an external knowledge source, such as a search engine or a document database. These passages are then used to provide context and anchor the LLM's generation process. 
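</p><p>To make the retrieve-then-generate loop concrete, here is a minimal Python sketch. The passages, the bag-of-words "embedding", and the prompt template are invented stand-ins; a real system would use a learned embedding model and an actual LLM client, but the overall shape follows the description above.</p>

```python
import math
import re
from collections import Counter

# Toy "knowledge source" standing in for a search engine or document
# database; the passages are invented for illustration.
PASSAGES = [
    "Dozer ingests data from databases and streams in real time.",
    "RAG retrieves relevant passages and adds them to the LLM prompt.",
    "Large language models are trained on a static snapshot of text.",
]

def embed(text: str) -> Counter:
    # Stand-in embedding: a bag-of-words vector. A production system
    # would use a learned embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank passages by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(PASSAGES, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    # The retrieved passages anchor the model's generation.
    context = "\n".join(f"- {p}" for p in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How does RAG reduce hallucinations in an LLM?"))
```

<p>Swapping <code>embed</code> for a real embedding model and the final <code>print</code> for an LLM call turns this into the full RAG loop.</p><p>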
This helps to ensure that the generated text is consistent with the retrieved information and less likely to contain hallucinations.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="librarian-and-the-writer-analogy">Librarian and the Writer Analogy<a href="#librarian-and-the-writer-analogy" class="hash-link" aria-label="Direct link to Librarian and the Writer Analogy" title="Direct link to Librarian and the Writer Analogy">​</a></h3><p>Imagine a large language model (LLM) as a highly skilled writer, but with a limited knowledge base. While this writer can craft compelling narratives and compose insightful essays, they lack access to the vast expanse of information available in the world. That's where RAG comes in, acting as the writer's resourceful assistant.</p><p>RAG functions like a diligent librarian, scouring external data stores to gather relevant information tailored to the user's request. This context-rich information, whether it's real-time updates, user-specific details, or even factual data that hasn't made it into the LLM's training set, is then seamlessly integrated into the writer's prompt.</p><p>With this enhanced prompt, the LLM is empowered to produce even more informative and personalized responses, akin to a writer armed with a wealth of background research. 
RAG, in essence, bridges the gap between the LLM's knowledge and the vast sea of information, elevating its capabilities to new heights.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="comparing-rag-with-fine-tuning-llms">Comparing RAG with Fine-tuning LLMs<a href="#comparing-rag-with-fine-tuning-llms" class="hash-link" aria-label="Direct link to Comparing RAG with Fine-tuning LLMs" title="Direct link to Comparing RAG with Fine-tuning LLMs">​</a></h3><p>Fine-tuning takes a pre-trained LLM and continues training it on a smaller, task-specific dataset that was not part of the original training data, improving performance on the relevant task.</p><p><img loading="lazy" alt="finetunevsrag" src="/blog/assets/images/finetunevrag-985d64e8c1166efebd4170920e7b3170.png" width="821" height="515" class="img_ev3q">  </p><p>RAG is particularly well-suited for scenarios where you can enrich your LLM prompt with information that was not available during its training phase. This includes real-time data, personal or user-specific data, and contextual information relevant to the prompt. By incorporating such external knowledge, RAG enables LLMs to generate more accurate, relevant, and personalized responses.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="challenges-with-rag">Challenges with RAG<a href="#challenges-with-rag" class="hash-link" aria-label="Direct link to Challenges with RAG" title="Direct link to Challenges with RAG">​</a></h3><p>When working with data that is siloed or in real-time, implementing RAG can present significant challenges. Siloed data refers to information that is isolated or segregated within specific systems or databases, making it difficult to access and integrate with other data sources. 
Real-time data, on the other hand, is constantly changing and requires immediate processing to maintain relevance.</p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="siloed-data">Siloed Data<a href="#siloed-data" class="hash-link" aria-label="Direct link to Siloed Data" title="Direct link to Siloed Data">​</a></h4><p>Retrieval-Augmented Generation (RAG) relies on accessing and retrieving relevant information from external data sources to enhance the capabilities of large language models (LLMs). However, when the data is siloed or isolated within specific systems or databases, it becomes difficult for RAG to effectively utilize this information. This poses significant challenges for the implementation of RAG, as it hinders the LLM's ability to generate comprehensive and informative responses.</p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="real-time-data">Real-time Data:<a href="#real-time-data" class="hash-link" aria-label="Direct link to Real-time Data:" title="Direct link to Real-time Data:">​</a></h4><p>Real-time data, which is constantly changing and requires immediate processing to maintain relevance, presents another set of challenges for RAG. The LLM needs to be able to access and process real-time data streams with minimal latency to ensure that the generated text is always relevant and up-to-date. This can be challenging due to the high volume and dynamic nature of real-time data.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-dozer-can-help-with-both-the-challenges">How Dozer can help with both the Challenges<a href="#how-dozer-can-help-with-both-the-challenges" class="hash-link" aria-label="Direct link to How Dozer can help with both the Challenges" title="Direct link to How Dozer can help with both the Challenges">​</a></h2><p>Dozer is a powerful Data Access backend that simplifies the process of building and deploying data-driven applications. 
It provides a unified interface to access and process data from multiple sources, including databases, APIs, and streaming platforms. This enables developers to build applications that leverage real-time data without having to worry about the underlying infrastructure.</p><p>This means that you can use Dozer to ingest data from any source, such as a database in real-time, which allows you to mitigate the issues related to siloed data and real-time data.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="here-are-few-ways-dozer-helps-with-rag">Here are a few ways Dozer helps with RAG<a href="#here-are-few-ways-dozer-helps-with-rag" class="hash-link" aria-label="Direct link to Here are a few ways Dozer helps with RAG" title="Direct link to Here are a few ways Dozer helps with RAG">​</a></h3><p><strong>Real-time data ingestion:</strong> Dozer can be used to ingest real-time data from a variety of sources, such as social media feeds, customer interactions, and sensor data. This data can then be used to provide RAG models with the most up-to-date information.</p><p><strong>Data transformation:</strong> Dozer's streaming SQL engine can be used to transform and process data in real time. This can be used to clean and prepare data for use by RAG models, as well as to extract features that are relevant to the task at hand.</p><p><strong>Contextual information:</strong> Dozer can be used to store and manage contextual information, such as user profiles and knowledge graphs. This information can then be used to provide RAG models with a richer understanding of the context of the task at hand.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="the-resourceful-assistant">The resourceful assistant<a href="#the-resourceful-assistant" class="hash-link" aria-label="Direct link to The resourceful assistant" title="Direct link to The resourceful assistant">​</a></h3><p>So continuing the story of the writer and the librarian, Dozer is the resourceful assistant. 
It steps into the scene, armed with its vast knowledge of data sources and its ability to seamlessly integrate external information. It acts as a bridge between the writer (LLM) and the vast library of information (external data sources), enabling the writer to access and utilize a broader range of knowledge.</p><p>Just as the librarian guides the writer to relevant books and articles, Dozer guides the LLM to the most pertinent data sources, providing it with the context and information needed to craft even more informative and personalized responses.</p><p>Dozer's role extends beyond mere retrieval; it also helps the writer process and transform the retrieved information, ensuring that it is in a format that can be readily incorporated into the text generation process. This collaboration between the writer, the librarian, and the assistant elevates the quality of the generated text, making it more comprehensive, accurate, and tailored to the user's needs.</p><p>With Dozer on board, the writer can confidently venture into unexplored territories of knowledge, knowing that its resourceful assistant will always be there to provide the necessary support and guidance. Together, they form an unstoppable team, capable of producing text that is not only informative but also insightful, engaging, and truly remarkable.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion">​</a></h2><p>In conclusion, the combination of large language models (LLMs) and retrieval-augmented generation (RAG) has the potential to revolutionize the way we interact with computers. By providing LLMs with access to real-time data, personal data, and contextual information, RAG enables LLMs to generate more accurate, relevant, and personalized responses. Dozer is a data infrastructure platform that can be used to build and deploy RAG applications. 
It provides a number of features that can be helpful for RAG development, such as a streaming SQL engine for real-time data transformation, support for a variety of data sources, a distributed architecture that can scale to handle large data volumes, and a variety of security and compliance features.</p><p>The future of LLM-powered applications is bright, and RAG is playing a key role in this evolution. With Dozer, developers can easily build and deploy RAG applications that can take advantage of the latest advances in LLM technology. </p><p>In the upcoming articles, we will explore how to build RAG applications using Dozer and OpenAI's assistant. Stay tuned!</p><p>For more information and examples, check out the <a href="https://github.com/getdozer/dozer" target="_blank" rel="noopener noreferrer">Dozer GitHub repository</a>.</p><p>Stay tuned for more updates and exciting use cases of Dozer and OpenAI assistant.</p>]]></content:encoded>
            <category>llm</category>
            <category>gpt</category>
            <category>assistants</category>
            <category>real-time</category>
            <category>rag</category>
            <category>openai</category>
        </item>
        <item>
            <title><![CDATA[Hyper-personalized chatbots using LLMs, Dozer and Vector Databases]]></title>
            <link>https://getdozer.io/blog/llm-chatbot</link>
            <guid>https://getdozer.io/blog/llm-chatbot</guid>
            <pubDate>Tue, 13 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore how Dozer boosts LLM chatbots by enabling dynamic interactions and personalized experiences in the banking sector. Learn how text embeddings and customer profiles enrich the chatbot's knowledge, overcoming limitations and delivering tailored responses to users.]]></description>
            <content:encoded><![CDATA[<p>In the realm of large language model (LLM) applications, leveraging the power of artificial intelligence and natural language processing has become increasingly prevalent. One such application is the creation of chatbots that conversationally interact with users. In this article, we explore how Dozer can significantly enhance the capabilities of LLM-based applications. Using a chatbot for a bank as our illustrative use case, we delve into the challenges faced in contextualizing the chatbot's knowledge and how Dozer can provide a solution by enriching the customer profile.</p><p><img loading="lazy" alt="Dozer and LLM" src="/blog/assets/images/llm-7a8b4c7dd35d9fd60a564a331384072d.svg" width="1345" height="432" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="requirements">Requirements<a href="#requirements" class="hash-link" aria-label="Direct link to Requirements" title="Direct link to Requirements">​</a></h2><p>To deliver a personalized and effective user experience, our LLM chatbot needs to possess knowledge in two key areas:</p><ol><li>Understanding the bank's products and services</li><li>Having a comprehensive customer profile</li></ol><p>Traditionally, the primary method of providing information to an LLM is through the use of context. However, there are inherent limitations to this approach, particularly concerning the limited space available for context. LLM models typically have constraints on context size, often limited to a few thousand tokens, although newer models like Anthropic's Claude have expanded this to support contexts of up to 100,000 tokens. Despite these advancements, incorporating extensive knowledge, such as comprehensive bank product information or detailed customer profiles, within the context remains challenging.</p><p>To overcome this limitation, a dynamic approach to context population is necessary. 
Rather than relying solely on a fixed context, the context can be dynamically populated based on the user prompt. This allows for the inclusion of specific and relevant information related to the user's query, enabling the LLM chatbot to provide more accurate and tailored responses. By dynamically adjusting the context, the chatbot can access the necessary knowledge and adapt its understanding to better address the user's needs.</p><p>A practical approach is to leverage text embeddings and a vector database. This approach involves the creation of a vector database that stores the text embeddings of the bank's product and service information. Text embeddings represent the semantic meaning of the text and can capture the relationships and similarities between different pieces of information.</p><p>In the context of our bank's chatbot, this approach can for example be applied when a user inquires about credit card products. Here's how it works:</p><ol><li><strong>Vector database population:</strong> The bank's credit card information, including details such as card types, benefits, requirements, and features, is transformed into text embeddings. These embeddings capture the essential characteristics of the credit card descriptions and specifications, representing them as numerical vectors.</li><li><strong>User query processing:</strong> When a user interacts with the chatbot and asks a question about credit card options, the chatbot processes the query and extracts relevant keywords, such as "credit card," "options," or specific card types.</li><li><strong>Similarity search:</strong> The extracted keywords and contextual information are used to perform a similarity search within the vector database. The search aims to find the text embeddings that are most similar to the user's query, focusing on credit card-related information. 
By measuring the similarity between the user query and the stored text embeddings, the chatbot identifies the most relevant credit card details.</li><li><strong>Context population:</strong> The retrieved credit card knowledge, aligned with the user's query, is dynamically populated into the chatbot's context. This means that the relevant credit card information becomes part of the context considered by the chatbot when generating responses.</li></ol><p>By leveraging text embeddings and a vector database, our bank's chatbot can efficiently retrieve and utilize the most relevant credit card information based on the user's query.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="but-there-is-a-problem"><strong>But, there is a problem!</strong><a href="#but-there-is-a-problem" class="hash-link" aria-label="Direct link to but-there-is-a-problem" title="Direct link to but-there-is-a-problem">​</a></h2><p>However, even with the integration of the vector database to enhance the LLM chatbot's knowledge of the bank's products and services, there is an essential missing component - the customer profile. The chatbot lacks knowledge about individual customers and their specific details, which is crucial for providing relevant and personalized responses. For instance, when discussing credit card eligibility criteria, factors such as the customer's annual income play a significant role. Without access to detailed customer information, the chatbot may struggle to provide accurate and contextually appropriate responses. To truly create personalized experiences, it is essential to incorporate the customer's profile, including their financial history, subscribed products, investments, and other relevant data, into the chatbot's context. This way, the chatbot can deliver tailored information and meet the specific needs of each customer, enhancing their overall experience.</p><p>This is where Dozer plays a vital role. 
With its data integration capabilities, Dozer can seamlessly gather and consolidate customer data from various sources, such as core banking systems, CRM platforms, and transaction databases. By connecting to these sources and capturing real-time data updates, Dozer ensures that the customer profile remains accurate and up to date.</p><p>By leveraging a comprehensive customer profile, the LLM chatbot can access the necessary information to tailor its responses to the specific customer. Whether the customer is inquiring about credit card options, loan eligibility, or account details, the chatbot can draw from the enriched customer profile to provide relevant and personalized answers.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="putting-it-all-together">Putting it all together<a href="#putting-it-all-together" class="hash-link" aria-label="Direct link to Putting it all together" title="Direct link to Putting it all together">​</a></h2><p>The diagram above represents the architecture of an intelligent chatbot system that leverages Dozer, a vector database for storing the bank's knowledge base, and an LLM app powered by langchain for conversational interactions with users. At the core of the architecture is Dozer, which aggregates customer data from multiple source systems, ensuring a comprehensive and up-to-date customer profile. The vector database serves as the repository for the bank's general knowledge, encompassing products, services, policies, and more. The LLM app, integrated with langchain, acts as the intelligent conversational interface, leveraging the enriched customer profile from Dozer and the knowledge base from the vector database. 
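</p><p>As a rough illustration of how these pieces could meet in code (the names and figures below are hypothetical; a production build would use langchain, a vector store, and Dozer's APIs rather than in-memory dictionaries):</p>

```python
# Hypothetical customer profile, as Dozer might serve it after
# consolidating the source systems.
CUSTOMER = {"name": "John Smith", "annual_income": 55_000}

# Hypothetical product knowledge base, as the vector database might
# return it for a credit-card query.
CARDS = [
    {"name": "EveryDay VISA Card", "min_income": 30_000, "annual_fee": 35},
    {"name": "Vantage VISA Card", "min_income": 60_000, "annual_fee": 215},
    {"name": "Super VISA Card", "min_income": 50_000, "annual_fee": 70},
]

def eligible_cards(profile: dict) -> list[dict]:
    # The income check the chatbot is expected to apply implicitly.
    return [c for c in CARDS if profile["annual_income"] >= c["min_income"]]

def build_context(profile: dict) -> str:
    # Context block handed to the LLM alongside the user's question.
    lines = [f"Customer: {profile['name']}, annual income ${profile['annual_income']:,}"]
    lines += [f"- {c['name']} (annual fee ${c['annual_fee']})"
              for c in eligible_cards(profile)]
    return "\n".join(lines)

print(build_context(CUSTOMER))
```

<p>Here the profile is hard-coded; in the architecture described, Dozer would keep it fresh from the source systems in real time.</p><p>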
Together, these components enable the chatbot to deliver personalized and accurate responses, providing users with a seamless and engaging banking experience.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="lets-give-it-a-try-with-chatgpt">Let's give it a try with ChatGPT<a href="#lets-give-it-a-try-with-chatgpt" class="hash-link" aria-label="Direct link to Let's give it a try with ChatGPT" title="Direct link to Let's give it a try with ChatGPT">​</a></h2><p>To validate our assumptions, we have provided ChatGPT a list of credit card options and a comprehensive customer profile:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">- EveryDay VISA Card: </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: 5% cashback on grocery shopping, 8% cashback on Esso, Shell, Chevron, 0.3% cashback on all other expenses</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $30,000 minimum annual income</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Annual fee: 35$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- LiveFresh VISA Card: </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: Up to 5% cashback on Online &amp; Visa contactless spend, Additional 5% Green Cashback on selected Eateries, Retailers and Transport Services, 0.3% Cashback on All Other Spend</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $30,000 minimum annual income</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    Annual fee: 35$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Miles&amp;More VISA Card: </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: 10 miles per dollar on hotel transactions at Kaligo, 6 miles on flight, hotel and travel packages at Expedia, 3 miles on online flights &amp; hotel transactions (capped at S$5,000 per month), 2 miles per $ on overseas spend, 1.2 miles per $ on local spend, Receive 10,000 bonus miles when you pay for your annual fee</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $30,000 minimum annual income</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Annual fee: 215$ + free second card</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Vantage VISA Card: </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: Earn 1.5 miles per $ in local spend, Earn 2.2 miles per $ in foreign spend, Earn up to 6 miles per S$1 on Expedia bookings,  Up to 19% off on fuel spending at Esso</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $60,000 minimum annual income</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Annual fee: 215$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Super VISA Card:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Benefits: 10X Points (4 miles per $1) on online purchases, 3X Points (1.2 miles per $1) on overseas purchases, 1X  Point (0.4 miles per $1) on other purchases</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Requirements: $50,000 minimum annual 
income</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Annual fee: 70$</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">First name: John</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Last name: Smith</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">annual income: 55,000$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">address: 33 Tampines Street 86, Singapore 528573</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Phone: 7763 6678</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Owned products:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> - checking account - balance: $32,122</span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain"> - debit card - outstanding balance: $1,100</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">last month spending pattern:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  Local spend:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - fuel: 3,122$</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - groceries: $7,233</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - travel: $12,122</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - food &amp; beverages: $1455</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - transportation: $345</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  Foreign spend:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - accommodation: $3433</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - food and beverages: $1344</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" 
d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>We then asked ChatGPT to start acting like a virtual bank teller. We will not provide a full transcript of the conversation, but only a few interesting parts.</p><p>In the conversation below, the chatbot has effectively utilized the customer's profile and annual income to recommend suitable financial products. It analyzed the user's annual income, which is $55,000, and made recommendations based on this information. Notably, the chatbot did not propose credit cards that required a higher annual income, such as the Vantage VISA Card, which has an income requirement of $60,000.</p><p><img loading="lazy" alt="Dozer and LLM" src="/blog/assets/images/chatgpt1-fe0dd3b558109108e07aa23ad71e6223.png" width="1400" height="1634" class="img_ev3q"></p><p>Also, when asked to estimate potential rewards, the chatbot used the customer's specific spending habits, applying different mile earning rates per spending category, to provide a personalized and detailed estimate of the possible miles to be earned with the Miles&amp;More VISA Card.</p><p><img loading="lazy" alt="Dozer and LLM" src="/blog/assets/images/chatgpt2-83d42155cf88122a64e407c0e1d12d18.png" width="1400" height="1162" class="img_ev3q"></p><p>In the forthcoming post, we will provide a comprehensive example demonstrating how to implement such a system utilizing Langchain, Dozer, and a vector database.</p>]]></content:encoded>
            <category>LLM</category>
            <category>chatbot</category>
            <category>banking</category>
        </item>
        <item>
            <title><![CDATA[Shapes and Forms of Structured Data: SCD Types, Master Full, Master Incremental, Unitemporal, and Bitemporal]]></title>
            <link>https://getdozer.io/blog/data-shapes</link>
            <guid>https://getdozer.io/blog/data-shapes</guid>
            <pubDate>Tue, 30 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Structured data forms the foundation of many data-driven systems and is crucial for effective data analysis and decision-making. Within the realm of structured data, there are different shapes and forms that enable organizations to manage and utilize data in diverse ways. In this blog post, we will explore several important concepts related to structured data, including SCD (Slowly Changing Dimensions) types, Master Full, Master Incremental, Unitemporal, and Bitemporal data.
]]></description>
            <content:encoded><![CDATA[<p>Structured data forms the foundation of many data-driven systems and is crucial for effective data analysis and decision-making. Within the realm of structured data, there are different shapes and forms that enable organizations to manage and utilize data in diverse ways. In this blog post, we will explore several important concepts related to structured data, including SCD (Slowly Changing Dimensions) types, Master Full, Master Incremental, Unitemporal, and Bitemporal data.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="slowly-changing-dimensions-scd">Slowly Changing Dimensions (SCD)<a href="#slowly-changing-dimensions-scd" class="hash-link" aria-label="Direct link to Slowly Changing Dimensions (SCD)" title="Direct link to Slowly Changing Dimensions (SCD)">​</a></h2><p>Slowly Changing Dimensions refer to the nature of data that evolves over time in a data warehousing context. SCDs capture changes in dimensional attributes, such as customer addresses, product specifications, or employee roles, while maintaining historical records. There are different SCD types to manage these changes:</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="scd-type-1">SCD Type 1<a href="#scd-type-1" class="hash-link" aria-label="Direct link to SCD Type 1" title="Direct link to SCD Type 1">​</a></h3><p>This type overwrites existing data with updated information, effectively losing historical details. It is suitable when historical data is not critical, such as in cases where only the most recent values are needed.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="scd-type-2">SCD Type 2<a href="#scd-type-2" class="hash-link" aria-label="Direct link to SCD Type 2" title="Direct link to SCD Type 2">​</a></h3><p>Type 2 preserves historical data by creating new records for each change, typically through the addition of an effective date and expiration date. 
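</p><p>As a minimal sketch of the mechanics (using Python's built-in sqlite3 module; the table and column names are illustrative, not taken from any particular warehouse), a Type 2 change first closes out the current row and then inserts a new current row:</p>

```python
import sqlite3

# Illustrative customer dimension; a NULL effective_end marks the current row.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id     INTEGER,
        name            TEXT,
        address         TEXT,
        effective_start TEXT,
        effective_end   TEXT
    )""")
conn.execute("INSERT INTO dim_customer VALUES "
             "(1, 'John Smith', '123 Main Street', '2021-01-01', NULL)")

def scd2_change_address(conn, customer_id, new_address, change_date):
    # Step 1: close the current record instead of overwriting it.
    conn.execute(
        "UPDATE dim_customer SET effective_end = ? "
        "WHERE customer_id = ? AND effective_end IS NULL",
        (change_date, customer_id))
    # Step 2: insert a fresh current record carrying the new attribute value,
    # copying the unchanged attributes from the most recent row.
    conn.execute(
        "INSERT INTO dim_customer "
        "SELECT customer_id, name, ?, ?, NULL FROM dim_customer "
        "WHERE customer_id = ? ORDER BY effective_start DESC LIMIT 1",
        (new_address, change_date, customer_id))

scd2_change_address(conn, 1, '456 Oak Avenue', '2022-03-01')
history = conn.execute(
    "SELECT address, effective_start, effective_end "
    "FROM dim_customer ORDER BY effective_start").fetchall()
print(history)  # both the historical and the current row survive
```

<p>Whether the closed-out record ends on the change date or the day before is a per-system convention; the essential point is that both rows survive, so the full history remains queryable.</p><p>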
This allows for accurate tracking of historical changes and provides a complete audit trail.</p><p>Let's consider an example of Slowly Changing Dimensions (SCD) Type 2 in the context of a customer database. Suppose we have a table that stores customer information, including their name, address, and membership status. Initially, a customer named John Smith signs up with the address "123 Main Street" and is assigned a membership status of "Standard."</p><p>Later, John moves to a new address, "456 Oak Avenue." Instead of updating the existing record, SCD Type 2 creates a new record with the updated address and assigns it a new effective date, indicating when the change took place. The previous record for John Smith remains in the database, capturing the historical address and its associated time frame. The updated record reflects John's current address and effective date.</p><p>By utilizing SCD Type 2, the customer database can maintain a complete audit trail of changes. It allows for tracking John's address history, which can be useful for analyzing customer behavior, understanding migration patterns, or generating accurate reports based on specific time periods.</p><table><thead><tr><th>Customer ID</th><th>Name</th><th>Address</th><th>Membership Status</th><th>Effective Start Date</th><th>Effective End Date</th></tr></thead><tbody><tr><td>1</td><td>John Smith</td><td>123 Main Street</td><td>Standard</td><td>2021-01-01</td><td>2022-02-28</td></tr><tr><td>1</td><td>John Smith</td><td>456 Oak Avenue</td><td>Standard</td><td>2022-03-01</td><td>(Current Record)</td></tr></tbody></table><h3 class="anchor anchorWithStickyNavbar_LWe7" id="scd-type-3">SCD Type 3<a href="#scd-type-3" class="hash-link" aria-label="Direct link to SCD Type 3" title="Direct link to SCD Type 3">​</a></h3><p>Type 3 retains limited historical information by adding additional columns to capture a limited number of changes. 
It sacrifices full historical tracking but can be useful in cases where only recent changes need to be analyzed.</p><p>Let's continue with the same customer database example, but this time, let's explore how Slowly Changing Dimensions (SCD) Type 3 would handle changes in customer information.</p><p>Initially, John Smith signs up with the address "123 Main Street" and is assigned a membership status of "Standard." Instead of creating a new record for each change, SCD Type 3 adds additional columns to capture limited historical information.</p><p>When John moves to a new address, "456 Oak Avenue," the existing record is updated with the new address, but the previous address is retained in a separate column designated for the previous value. In addition, a separate column captures the effective date of the change.</p><p>So, in the SCD Type 3 example, the customer record for John Smith would contain the following information: his current address, the previous address, and the date of the address change.</p><p>By utilizing SCD Type 3, the customer database retains some historical information while still allowing for efficient storage and retrieval. 
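</p><p>As a hedged sketch (again with Python's sqlite3 and illustrative names), a Type 3 change is a plain in-place UPDATE that shifts the current value into the dedicated "previous" column:</p>

```python
import sqlite3

# Illustrative Type 3 layout: one row per customer, with dedicated columns
# for the single previous value and the date it was superseded.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id         INTEGER PRIMARY KEY,
        name                TEXT,
        current_address     TEXT,
        previous_address    TEXT,
        address_change_date TEXT
    )""")
conn.execute("INSERT INTO dim_customer VALUES "
             "(1, 'John Smith', '123 Main Street', NULL, NULL)")

# The right-hand side of each SET clause sees the pre-update values, so
# previous_address captures the old current_address before it is replaced.
conn.execute("""
    UPDATE dim_customer
    SET previous_address    = current_address,
        current_address     = '456 Oak Avenue',
        address_change_date = '2022-03-01'
    WHERE customer_id = 1""")

row = conn.execute(
    "SELECT current_address, previous_address, address_change_date "
    "FROM dim_customer WHERE customer_id = 1").fetchone()
print(row)  # ('456 Oak Avenue', '123 Main Street', '2022-03-01')
```

<p>Note that a second move would overwrite "123 Main Street" entirely: Type 3 keeps exactly one prior value per tracked column.</p><p>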
This approach is useful when limited historical tracking is required, and only a few key changes need to be captured and analyzed.</p><table><thead><tr><th>Customer ID</th><th>Name</th><th>Current Address</th><th>Previous Address</th><th>Address Change Date</th><th>Membership Status</th></tr></thead><tbody><tr><td>1</td><td>John Smith</td><td>456 Oak Avenue</td><td>123 Main Street</td><td>2022-03-01</td><td>Standard</td></tr></tbody></table><h2 class="anchor anchorWithStickyNavbar_LWe7" id="master-full-and-master-incremental">Master Full and Master Incremental<a href="#master-full-and-master-incremental" class="hash-link" aria-label="Direct link to Master Full and Master Incremental" title="Direct link to Master Full and Master Incremental">​</a></h2><p>Master Full is a form of structured data management where a centralized "master" dataset contains all relevant information about a specific entity. In this approach, updates to the master dataset are performed in bulk, typically by replacing the entire dataset with a fresh copy. Master Full is suitable when the dataset is relatively small or when updates occur infrequently. It ensures consistency across systems that rely on the master data but may not be ideal for real-time or frequent updates.</p><p>Master Incremental is another form of structured data management where changes to the master dataset are made incrementally, without replacing the entire dataset. Instead, only the modified or new records are updated. This approach is efficient when dealing with large datasets or when frequent updates occur. 
Master Incremental allows for faster processing times and minimizes the need to process unchanged data, but it requires careful tracking of changes and synchronization between systems.</p><p>Let's illustrate the concepts of Master Full and Master Incremental using an example of a product catalog management system.</p><p>In the Master Full approach, the entire product catalog is replaced with a fresh copy when updates are made. The "Status" column represents the active state of each product. When a product is no longer available or needs to be deleted, the entire record is removed from the dataset. However, this approach does not retain any historical information about deleted products.</p><table><thead><tr><th>Product ID</th><th>Name</th><th>Description</th><th>Price</th><th>Effective Date</th><th>Status</th></tr></thead><tbody><tr><td>1</td><td>Product A</td><td>Description of Product A</td><td>$10.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>2</td><td>Product B</td><td>Description of Product B</td><td>$15.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>3</td><td>Product C</td><td>Description of Product C</td><td>$8.99</td><td>2021-01-01</td><td>Active</td></tr></tbody></table><p>In the Master Incremental approach, updates are made by modifying only the relevant records. New products or modified details are added as new records with their respective effective dates. For representing deletes, a new record is added with an updated status indicating that the product is now inactive or deleted. In the example below, Product B's status is changed to "Inactive" with an effective date of June 15, 2022, indicating that it is no longer available.</p><p>By implementing the Master Incremental approach, the product catalog management system maintains a history of changes, including deletes, by adding new records rather than removing existing ones. 
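</p><p>The replay logic for such a change log can be sketched in a few lines of Python (the record layout mirrors the product catalog example in this post; the field names are illustrative):</p>

```python
# Append-only change log: each change, including the "delete" of Product B,
# is a new record rather than an in-place mutation.
changes = [
    {"product_id": 1, "name": "Product A", "price": 10.99, "effective": "2021-01-01", "status": "Active"},
    {"product_id": 2, "name": "Product B", "price": 15.99, "effective": "2021-01-01", "status": "Active"},
    {"product_id": 3, "name": "Product C", "price": 8.99,  "effective": "2021-01-01", "status": "Active"},
    {"product_id": 4, "name": "Product D", "price": 12.99, "effective": "2022-05-10", "status": "Active"},
    {"product_id": 2, "effective": "2022-06-15", "status": "Inactive"},  # delete marker
]

def current_product_ids(changes):
    """Replay the log in effective-date order, keep the latest record per
    product, and drop products whose latest status is not Active."""
    latest = {}
    for rec in sorted(changes, key=lambda r: r["effective"]):
        latest[rec["product_id"]] = rec
    return sorted(pid for pid, rec in latest.items() if rec["status"] == "Active")

print(current_product_ids(changes))  # Product B no longer appears: [1, 3, 4]
```

<p>The same replay, run with an earlier cut-off date, reconstructs the catalog as it looked at any point in the past, which a full-replace Master Full feed cannot do.</p><p>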
This allows for accurate tracking and analysis of product data over time while efficiently managing updates and deletions.</p><table><thead><tr><th>Product ID</th><th>Name</th><th>Description</th><th>Price</th><th>Effective Date</th><th>Status</th></tr></thead><tbody><tr><td>1</td><td>Product A</td><td>Description of Product A</td><td>$10.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>2</td><td>Product B</td><td>Description of Product B</td><td>$15.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>3</td><td>Product C</td><td>Description of Product C</td><td>$8.99</td><td>2021-01-01</td><td>Active</td></tr><tr><td>4</td><td>Product D</td><td>Description of Product D</td><td>$12.99</td><td>2022-05-10</td><td>Active</td></tr><tr><td>2</td><td></td><td></td><td></td><td>2022-06-15</td><td>Inactive</td></tr></tbody></table><h2 class="anchor anchorWithStickyNavbar_LWe7" id="unitemporal-and-bitemporal">Unitemporal and Bitemporal<a href="#unitemporal-and-bitemporal" class="hash-link" aria-label="Direct link to Unitemporal and Bitemporal" title="Direct link to Unitemporal and Bitemporal">​</a></h2><p>Unitemporal data refers to structured data that incorporates a single valid time dimension. It captures not only the current state of the data but also the validity period during which each record was considered accurate. Unitemporal data is valuable in scenarios where the historical context of changes is crucial for analysis, compliance, or auditing purposes. It enables tracking changes and conducting retrospective analysis based on different time periods.</p><p>Bitemporal data combines two time dimensions: valid time and transaction time. Valid time represents the period during which a record is considered accurate, while transaction time captures when a change was made to the record. Bitemporal data is commonly used in scenarios where analyzing the temporal aspects of data changes is critical, such as in financial systems or legal applications. 
It allows for precise tracking of changes and their timing, providing a comprehensive historical view.</p><p>Let's consider a real estate property database that tracks the historical ownership and value of properties. We want to capture the current and past ownership details along with their validity periods.</p><table><thead><tr><th>Property ID</th><th>Owner</th><th>Start Date</th><th>End Date</th></tr></thead><tbody><tr><td>1</td><td>John Smith</td><td>2010-01-01</td><td>2015-06-30</td></tr><tr><td>1</td><td>Sarah Johnson</td><td>2015-07-01</td><td>(Current)</td></tr><tr><td>2</td><td>Alex Williams</td><td>2012-03-15</td><td>(Current)</td></tr><tr><td>3</td><td>Emma Thompson</td><td>2014-05-10</td><td>2022-08-31</td></tr></tbody></table><p>In this Unitemporal example, the data captures the current and historical ownership information for properties. Each property has a unique Property ID. For instance, Property ID 1 was owned by John Smith from January 1, 2010, to June 30, 2015. Then, the ownership was transferred to Sarah Johnson from July 1, 2015, until the current date, which is represented as "Current." 
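</p><p>The payoff of the single valid-time dimension is that point-in-time questions become simple range predicates. Here is a sketch with Python's sqlite3, loading the ownership table from this example (a NULL end date stands in for "Current"):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ownership "
             "(property_id INTEGER, owner TEXT, start_date TEXT, end_date TEXT)")
conn.executemany("INSERT INTO ownership VALUES (?, ?, ?, ?)", [
    (1, "John Smith",    "2010-01-01", "2015-06-30"),
    (1, "Sarah Johnson", "2015-07-01", None),   # NULL = current record
    (2, "Alex Williams", "2012-03-15", None),
    (3, "Emma Thompson", "2014-05-10", "2022-08-31"),
])

def owner_as_of(conn, property_id, date):
    # "Who owned this property on the given date?" -- the validity period
    # must contain the date; an open-ended (NULL) period is still valid.
    row = conn.execute(
        "SELECT owner FROM ownership "
        "WHERE property_id = ? AND start_date <= ? "
        "AND (end_date IS NULL OR end_date >= ?)",
        (property_id, date, date)).fetchone()
    return row[0] if row else None

print(owner_as_of(conn, 1, "2014-01-01"))  # John Smith
print(owner_as_of(conn, 1, "2020-01-01"))  # Sarah Johnson
```

<p>Because validity periods for a given property do not overlap, each lookup returns at most one owner.</p><p>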
Property ID 2 is currently owned by Alex Williams, and Property ID 3 was owned by Emma Thompson until August 31, 2022.</p><p>By utilizing Unitemporal data, the property database can maintain a complete history of ownership records, enabling analysis based on specific timeframes and generating accurate reports reflecting changes in property ownership over time.</p><p>Let's continue with the real estate property database example, but this time, we will illustrate the Bitemporal approach to track the historical ownership and value of properties, capturing both valid time and transaction time.</p><table><thead><tr><th>Property ID</th><th>Owner</th><th>Start Date</th><th>End Date</th><th>Transaction Date</th><th>Property Value</th></tr></thead><tbody><tr><td>1</td><td>John Smith</td><td>2010-01-01</td><td>2015-06-30</td><td>2010-01-05</td><td>$100,000</td></tr><tr><td>1</td><td>Sarah Johnson</td><td>2015-07-01</td><td>(Current)</td><td>2015-07-10</td><td>$150,000</td></tr><tr><td>2</td><td>Alex Williams</td><td>2012-03-15</td><td>(Current)</td><td>2012-03-20</td><td>$200,000</td></tr><tr><td>3</td><td>Emma Thompson</td><td>2014-05-10</td><td>2022-08-31</td><td>2014-05-15</td><td>$300,000</td></tr></tbody></table><p>In this Bitemporal example, the data includes both the valid time and transaction time. Each property has a unique Property ID, and ownership records are associated with specific owners. Additionally, the transaction date represents when the ownership change or value update occurred.</p><p>For instance, Property ID 1 was initially owned by John Smith from January 1, 2010, to June 30, 2015, with a transaction date of January 5, 2010, and a property value of $100,000. 
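</p><p>With the second (transaction-time) dimension, we can also ask what the database <em>knew</em> at a given moment. Here is a sketch with Python's sqlite3, loading just the two records for Property ID 1 from the table in this example:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ownership (
    property_id INTEGER, owner TEXT,
    start_date TEXT, end_date TEXT,   -- valid time
    transaction_date TEXT,            -- transaction time
    value INTEGER
)""")
conn.executemany("INSERT INTO ownership VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "John Smith",    "2010-01-01", "2015-06-30", "2010-01-05", 100000),
    (1, "Sarah Johnson", "2015-07-01", None,         "2015-07-10", 150000),
])

def owner_as_known(conn, property_id, valid_date, known_at):
    """Bitemporal lookup: who owned the property on valid_date, according
    to what had been recorded in the database as of known_at."""
    return conn.execute(
        "SELECT owner, value FROM ownership "
        "WHERE property_id = ? "
        "AND start_date <= ? AND (end_date IS NULL OR end_date >= ?) "
        "AND transaction_date <= ? "
        "ORDER BY transaction_date DESC LIMIT 1",
        (property_id, valid_date, valid_date, known_at)).fetchone()

# The transfer to Sarah took effect on 2015-07-01 but was only recorded on
# 2015-07-10, so for a few days the database knew of no owner for that date.
print(owner_as_known(conn, 1, "2015-07-02", "2015-07-05"))  # None
print(owner_as_known(conn, 1, "2015-07-02", "2015-07-20"))  # ('Sarah Johnson', 150000)
```

<p>The ORDER BY on transaction_date ensures that, if a record is later corrected, the most recently recorded version wins for queries made after the correction.</p><p>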
The ownership was then transferred to Sarah Johnson, effective from July 1, 2015, until the current date, with a transaction date of July 10, 2015, and an updated property value of $150,000.</p><p>Similarly, Property ID 2 is currently owned by Alex Williams, with a transaction date of March 20, 2012, and a property value of $200,000. Property ID 3 was owned by Emma Thompson until August 31, 2022, with a transaction date of May 15, 2014, and a property value of $300,000.</p><p>By utilizing Bitemporal data, the property database can accurately track the historical ownership records and property values, considering both the valid time and transaction time. This enables precise analysis of property ownership and value changes over specific time periods, facilitating comprehensive historical reporting and audit trails.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion">​</a></h3><p>Understanding the various shapes and forms of structured data is essential for effective data management and analysis. From managing changes in Slowly Changing Dimensions (SCDs) to adopting Master Full or Master Incremental approaches, and from leveraging the temporal aspects of Unitemporal and Bitemporal data, each concept offers unique benefits</p>]]></content:encoded>
            <category>data</category>
            <category>database</category>
        </item>
        <item>
            <title><![CDATA[Dozer Goes Open Source: Empowering the Community to Build Real-time Data Apps]]></title>
            <link>https://getdozer.io/blog/dozer-goes-open-source</link>
            <guid>https://getdozer.io/blog/dozer-goes-open-source</guid>
            <pubDate>Fri, 21 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Dozer Goes Open Source: Empowering the Community to Build Real-time Data Apps]]></description>
            <content:encoded><![CDATA[<p><img loading="lazy" src="/blog/assets/images/cover-open-source.png" alt="Dozer Goes Open Source: Empowering the Community to Build Real-time Data Apps" class="img_ev3q"></p><p>We are excited to announce that Dozer is now open source under <a href="https://www.apache.org/licenses/LICENSE-2.0.html" target="_blank" rel="noopener noreferrer">the Apache 2.0 license</a> 🎉. With this move, we aim to empower the community to build and scale <a href="https://www.splunk.com/en_us/data-insider/what-is-real-time-data.html" target="_blank" rel="noopener noreferrer">real-time data</a> applications more effectively.</p><p>Dozer simplifies the process of connecting applications to various data sources, such as PostgreSQL, Kafka, or other databases &amp; sources, enabling developers to easily unify data across different sources. By making Dozer open source, we are inviting developers to contribute to its growth and help shape the future of real-time data applications.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-is-dozer">What is Dozer?<a href="#what-is-dozer" class="hash-link" aria-label="Direct link to What is Dozer?" title="Direct link to What is Dozer?">​</a></h2><p>Dozer is a Real-Time Analytical Layer for your LLMs and Data Products. It enables easy development, deployment and maintenance of real-time data products. The aim of the product is to enable developers to build customer-facing analytical products without having to worry about building infrastructure. Dozer is built in Rust and utilises ClickHouse for serving low-latency analytics.
With a few lines of SQL and a simple YAML configuration, you can build, deploy and maintain full data backends. Dozer is designed to be easy to use, scalable, and flexible, making it an ideal platform for building real-time data applications.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-open-source">Why Open Source?<a href="#why-open-source" class="hash-link" aria-label="Direct link to Why Open Source?" title="Direct link to Why Open Source?">​</a></h2><p>There are several reasons why we decided to make Dozer open source:</p><ul><li><p><strong>Community-driven innovation:</strong> We want developers from all over the world to contribute their ideas, and provide valuable insights, improvements, and fixes, leading to a more innovative and robust platform, and faster development cycles.</p></li><li><p><strong>Transparency and trust:</strong> We want to enable users to view and understand the underlying code, fostering trust in the platform and ensuring that it meets their needs and expectations, especially regarding how data is stored and processed.</p></li><li><p><strong>Collaboration and learning:</strong> We encourage developers to collaborate, share ideas, and learn from one another. We want to foster a strong community that helps developers grow their skills and expertise in the real-time data space.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="getting-started-with-dozer">Getting Started with Dozer<a href="#getting-started-with-dozer" class="hash-link" aria-label="Direct link to Getting Started with Dozer" title="Direct link to Getting Started with Dozer">​</a></h2><p>To help you get started with Dozer, we've created <a href="https://getdozer.io/docs/dozer/" target="_blank" rel="noopener noreferrer">documentation</a> and <a href="https://getdozer.io/blogs/" target="_blank" rel="noopener noreferrer">tutorial blogs</a> that guide you through the process of setting up, configuring, and using Dozer for your data applications. 
You can find these resources on our official documentation site and can also follow <a href="https://dev.to/getdozer" target="_blank" rel="noopener noreferrer">Dozer on dev.to</a> for additional resources on getting to know &amp; use Dozer better!</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-to-contribute">How to Contribute?<a href="#how-to-contribute" class="hash-link" aria-label="Direct link to How to Contribute?" title="Direct link to How to Contribute?">​</a></h2><p>We welcome contributions from the community! If you're interested in contributing to Dozer, please check out our GitHub <a href="https://github.com/getdozer/dozer" target="_blank" rel="noopener noreferrer">repository</a> for the guidelines. Whether you want to submit a bug report, suggest a new feature, or contribute code, we appreciate your help in making Dozer even better.</p><p>We also encourage you to check out the following resources to learn about other ways of contributing:</p><ul><li><strong>Show and tell GitHub discussion forum</strong>: We have set up a <a href="https://github.com/getdozer/dozer/discussions/categories/show-and-tell" target="_blank" rel="noopener noreferrer">GitHub discussion forum</a>, where you can share your Dozer projects and experience, ask questions, make a feature request, and connect with fellow Dozer community members. Feel free to join the conversation and share your own projects &amp; ideas!
Additionally, we have also created a Discord channel to give developers an opportunity to showcase and talk about their projects. Join our <a href="https://discord.gg/64rQR4d3Z8" target="_blank" rel="noopener noreferrer">Discord channel</a>. If you’ve built something using Dozer, we’d love to see it!</li></ul><p>You can also join our community <a href="https://discord.com/invite/3eWXBgJaEQ" target="_blank" rel="noopener noreferrer">Discord</a> to see what we are cooking at Dozer 👩‍🍳 👨‍🍳</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion">​</a></h2><p>We believe that by making Dozer open source, we are empowering the developer community to build amazing real-time data applications. We look forward to seeing the projects and ideas that arise from this collaborative effort, and we're excited to work together to shape the future of real-time data!
Together, we can contribute to the ongoing success of open source software. 🚀</p><p>Happy coding! 🚀👩‍💻👨‍💻</p>]]></content:encoded>
            <category>dozer</category>
            <category>apache</category>
            <category>open source</category>
            <category>community</category>
            <category>company</category>
        </item>
        <item>
            <title><![CDATA[Why you might not even need a data platform]]></title>
            <link>https://getdozer.io/blog/why-you-might-not-need-a-data-platform</link>
            <guid>https://getdozer.io/blog/why-you-might-not-need-a-data-platform</guid>
            <pubDate>Thu, 16 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Do you really need a data platform? New leaner architectures that could save you a lot of money are emerging!]]></description>
            <content:encoded><![CDATA[<p>Every company I meet today has a data platform. And if they don’t have one, they want one. The problem is that building and maintaining a data platform is not trivial. First, multiple tools need to be integrated together: Airflow, Spark, Presto, Kafka, Flink, Snowflake, and potentially many more, but, more importantly, a dedicated engineering team must be set up to maintain it and make sure everything runs smoothly. And, what usually happens is that, after data has been accumulated for months and months, the cost of running such infrastructure is higher than the benefit.</p><p><em>So the question is: do you really need a data platform?</em></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="get-back-to-the-basics">Get back to the basics<a href="#get-back-to-the-basics" class="hash-link" aria-label="Direct link to Get back to the basics" title="Direct link to Get back to the basics">​</a></h2><p>Let’s take the example of a mid-size company embarking on the adventure of building a data platform. Generally, they do it for two purposes:</p><ul><li><strong>Data analytics</strong>: Being able to generate analytical dashboards from historical data</li><li><strong>AI and advanced use cases</strong> such as real-time user personalisation</li></ul><p>Typically, for the first use case you’d set up a Snowflake or Databricks instance and dump all your data there. But wait! Do you really need it? Very likely you will not have petabytes of data to manage. How about something leaner?</p><p>If you are familiar with the data space, you’d have probably recently heard about tools like Pola.rs, Datafusion or DuckDB! If you have not heard about them, they are small and highly efficient OLAP query engines that can achieve impressive performance. The reason they are so efficient is that their authors have made the decision to go back to the basics. 
Forget about distributed data processing frameworks like Apache Spark with inefficient network shuffling! Forget about 20-year-old languages like Java or Scala (with all the GC problems they bring along)! Embrace simplicity using lower-level languages like C/C++ or, even better, Rust, and squeeze every CPU cycle to get as much performance as possible.</p><p>So, it’s pretty trivial to dump all your data from your OLTP databases into an S3 bucket, bring up multiple ad-hoc instances of DuckDB, Pola.rs or DataFusion, run all your OLAP queries, and shut everything down. All for a negligible TCO. Multiple companies have realised the potential of such an approach and are building what I call “poor man’s data platforms” around these tools. <a href="https://motherduck.com/" target="_blank" rel="noopener noreferrer">MotherDuck</a> is doing this with DuckDB, for example.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-about-real-time-">How about real-time?<a href="#how-about-real-time-" class="hash-link" aria-label="Direct link to How about real-time ?" title="Direct link to How about real-time ?">​</a></h2><p>While this lean approach is very easy to achieve for batch workloads, it is not that trivial when we start addressing more complex use cases like AI or real-time personalisation. Real-time is a lot harder and, in many scenarios, the goal of real-time use cases is not just producing analytical dashboards, but the full integration of the data with customer-facing applications, enabling another level of interactivity. The simplest example is probably user personalisation. For such a use case, data from multiple sources needs to be combined, an ML model might be applied, and, in some scenarios, data should be updated based on user behaviour. All this in real time!</p><p>Achieving this today is not trivial. Some companies have given up on handling all this in real time, because it’s simply too complex and expensive. 
Think for instance of how reverse ETL and personalisation APIs are really implemented today in most cases: everything is still batch! Data is pulled from your sources using tools like AirByte or Fivetran and loaded into your Snowflake or Databricks. Then, every day or hour, you run your dbt jobs, which extract the data you need, run your ML models, and load the results into some cache or low-latency database for serving. Companies are trying to come up with solutions to simplify the process, but everything is still: batch!</p><p>If you want something more than this, it is definitely possible! But it is complex! You need an entire infrastructure that is capable of handling real-time data (e.g. Kafka), a stream processing engine (e.g. Spark Streaming, Flink, Kafka Streams), one or more low-latency data stores depending on the query patterns of your application (e.g. Redis, Aerospike, Elasticsearch), an API layer and, most importantly, a data engineering team capable of putting all these pieces together!</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="enter-the-data-apps-world">Enter the Data Apps world<a href="#enter-the-data-apps-world" class="hash-link" aria-label="Direct link to Enter the Data Apps world" title="Direct link to Enter the Data Apps world">​</a></h2><p>So, is there a way to achieve the same simplicity of DuckDB or Pola.rs for something like this? Probably yes, and the answer is Data Apps. What are data apps? There is really no proper definition for them, but the way I like to describe a data app is:</p><p><em>A self-contained monolithic application that is capable of efficiently serving data and, at the same time, reacting to data changes in real time and performing complex operations such as joins, aggregations, ML predictions, notifications, and more.</em></p><p>The definition is generic on purpose. 
But, fundamentally, I see Data Apps as the bridge between source systems and user-facing applications, enabling a high level of data interactivity and actionability.</p><p>Forget about streams, caches, pipelines, etc.! Just put a data app backend between the source systems and the user application and magic can happen!</p><p>Some of these ideas have been pioneered by a very successful tool called <a href="https://streamlit.io/" target="_blank" rel="noopener noreferrer">Streamlit</a>: a Python framework allowing data scientists to quickly prototype data apps using Python. While Streamlit is a beautiful and powerful tool, it has not yet unlocked the full potential of data apps, especially when an entire ecosystem on the backend side has to be connected.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-software-engineers-perspective">The software engineer’s perspective<a href="#the-software-engineers-perspective" class="hash-link" aria-label="Direct link to The software engineer’s perspective" title="Direct link to The software engineer’s perspective">​</a></h2><p>The initial idea of Streamlit was primarily to let data scientists with no UI development experience showcase their work and let users interact with their ML models. Now, let’s think of data apps from a full-stack or frontend engineer’s perspective. What I’d want is a quick way of pulling production data from multiple sources, processing it in real time using familiar tools like SQL, JavaScript and Python, and having ready-made APIs allowing me to interact with the data. 
I want to query it and I want to trigger events that might propagate back to the source system and, again, in real time, see how my changes affect the system.</p><p>As a full-stack engineer, I want the superpowers of a full data engineering team!</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-bottom-line">The bottom line<a href="#the-bottom-line" class="hash-link" aria-label="Direct link to The bottom line" title="Direct link to The bottom line">​</a></h2><p>If all this is possible, it means all the complexity needed for a typical data platform with a lambda or kappa architecture is gone. Batch workloads can be easily handled using tools like DuckDB, and real-time workflows can be easily handled by a bunch of real-time data apps distributed across the organization, sitting between the source systems and the users.</p><p><em>The philosophy behind all this is what led us to create</em> <a href="https://github.com/getdozer/dozer" target="_blank" rel="noopener noreferrer"><em>Dozer</em></a><em>: a real-time data app backend specifically targeted at full-stack and frontend engineers. Our mission is to give data superpowers to the full-stack developer!</em></p>]]></content:encoded>
            <category>data</category>
            <category>api</category>
            <category>rust</category>
            <category>dataplatform</category>
        </item>
        <item>
            <title><![CDATA[Two things that Rust does better than C++]]></title>
            <link>https://getdozer.io/blog/rust-cpp-move-and-dispatch</link>
            <guid>https://getdozer.io/blog/rust-cpp-move-and-dispatch</guid>
            <pubDate>Mon, 13 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This post discusses two language features that Rust handles better than C++: its ownership model and trait object system. These compare favorably to C++'s move semantics and virtual functions, respectively, and help explain why Rust has gained popularity among many developers.
]]></description>
            <content:encoded><![CDATA[<p>At <a href="https://github.com/getdozer/dozer" target="_blank" rel="noopener noreferrer">Dozer</a>, we have adopted Rust as our main programming language, despite many of our team members having a strong background in C++. This is because Rust offers a combination of expressiveness, safety and ergonomics through its language constructs, which we find appealing.</p><p>In this post, we will discuss two language features that we believe Rust handles better than C++, namely its ownership model and trait object system. These compare favorably to C++'s move semantics and virtual functions, respectively, and provide insights into why Rust has gained popularity among many developers.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="ownership-vs-move-semantics">Ownership vs Move Semantics<a href="#ownership-vs-move-semantics" class="hash-link" aria-label="Direct link to Ownership vs Move Semantics" title="Direct link to Ownership vs Move Semantics">​</a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="phenomena">Phenomena<a href="#phenomena" class="hash-link" aria-label="Direct link to Phenomena" title="Direct link to Phenomena">​</a></h3><p>Consider the following Rust code (<a href="https://play.rust-lang.org/?version=stable&amp;mode=debug&amp;edition=2021&amp;gist=1517db73ea6b2c94cfa4c779b9471199" target="_blank" rel="noopener noreferrer">playground</a>):</p><div class="language-rust codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-rust codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">struct Struct;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain">impl Drop for Struct {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    fn drop(&amp;mut self) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        println!("dropped");</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">fn main() {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    let a: Struct = Struct;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    let _b: Struct = a;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>If you run it, there's a single output line:</p><div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" 
style="color:#393A34"><span class="token plain">dropped</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The C++ code that behaves most similarly (<a href="https://www.sololearn.com/compiler-playground/cNLPphJeqrGl" target="_blank" rel="noopener noreferrer">playground</a>):</p><div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token macro property directive-hash" style="color:#36acaa">#</span><span class="token macro property directive keyword" style="color:#00009f">include</span><span class="token macro property" style="color:#36acaa"> </span><span class="token macro property string" style="color:#e3116c">&lt;iostream&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">struct</span><span class="token plain"> </span><span class="token class-name">Struct</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token function" style="color:#d73a49">Struct</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">default</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token function" style="color:#d73a49">Struct</span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> Struct </span><span class="token operator" style="color:#393A34">&amp;</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">delete</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token function" style="color:#d73a49">Struct</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Struct </span><span class="token operator" style="color:#393A34">&amp;&amp;</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">default</span><span class="token punctuation" style="color:#393A34">;</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Struct </span><span class="token operator" style="color:#393A34">&amp;</span><span class="token keyword" style="color:#00009f">operator</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> Struct</span><span class="token operator" style="color:#393A34">&amp;</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">delete</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token operator" style="color:#393A34">~</span><span class="token function" style="color:#d73a49">Struct</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">cout </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"destructed"</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">endl</span><span 
class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">int</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Struct a</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Struct b </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token function" style="color:#d73a49">move</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">a</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>It outputs two lines:</p><div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">destructed</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">destructed</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" 
class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>We can see that the C++ destructor is executed twice.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="analysis">Analysis<a href="#analysis" class="hash-link" aria-label="Direct link to Analysis" title="Direct link to Analysis">​</a></h3><p>The root of the problem is that C++ only provides rvalue references as a special type at the language level; move semantics are implemented by the user according to convention. From the compiler's perspective, an object that has been moved from is still an intact object. This not only means destructors are executed multiple times (which by itself adds runtime overhead), but also imposes two burdens on class authors in C++:</p><ul><li>The destructor must correctly handle objects that have been moved from.</li><li>All public interfaces must correctly handle objects that have been moved from, or transfer this burden to the class user.</li></ul><p>The first is obvious. As for the second, since correctly handling moved-from objects in every public interface usually incurs runtime overhead, the responsibility of not using moved-from objects falls on almost all C++ users, while class authors typically only provide an interface for querying whether an object has been moved from.</p><p>A typical example of the second is <code>std::unique_ptr</code>: after a move, the source pointer is null, so any user of <code>std::unique_ptr</code> must check for null.</p><p>C++'s move semantics greatly reduce the usability of RAII. When the user gets an object, they always need to consider whether the resource it manages has been moved. 
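To make the contrast concrete, here is a minimal standalone sketch (not from the original post): because the Rust compiler tracks ownership, a moved-from value is statically dead, its destructor can only ever run once, and any later use is rejected at compile time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times `Resource::drop` runs.
static DROP_COUNT: AtomicUsize = AtomicUsize::new(0);

struct Resource;

impl Drop for Resource {
    fn drop(&mut self) {
        DROP_COUNT.fetch_add(1, Ordering::SeqCst);
    }
}

// Moves a `Resource` around and returns how many drops that caused.
fn drops_after_move() -> usize {
    let before = DROP_COUNT.load(Ordering::SeqCst);
    let a = Resource;
    let b = a; // ownership moves; `a` is statically dead from here on
    // let _ = &a; // would not compile: use of moved value `a`
    drop(b); // the single destructor call happens here
    DROP_COUNT.load(Ordering::SeqCst) - before
}

fn main() {
    println!("drops: {}", drops_after_move());
}
```

In the C++ version above, by contrast, nothing stops you from touching `a` after the move, and the destructor runs for both the moved-from and the moved-to object.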
This increases the mental burden on the programmer and is a breeding ground for bugs.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="trait-object-vs-virtual-function">Trait Object vs Virtual Function<a href="#trait-object-vs-virtual-function" class="hash-link" aria-label="Direct link to Trait Object vs Virtual Function" title="Direct link to Trait Object vs Virtual Function">​</a></h2><p>Consider the following Rust code (<a href="https://play.rust-lang.org/?version=stable&amp;mode=debug&amp;edition=2021&amp;gist=84b56815fd358bb63e1601354a907bc9" target="_blank" rel="noopener noreferrer">playground</a>):</p><div class="language-rust codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-rust codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">trait Trait {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    fn f(&amp;self);</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">struct Impl;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">impl Trait for Impl {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    fn f(&amp;self) {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        println!("f from Impl");</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    }</span><br></span><span
class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">fn main() {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    let a: Impl = Impl;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    let b: &amp;dyn Trait = &amp;a;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    b.f();</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    println!("Size of Impl is {}", std::mem::size_of::&lt;Impl&gt;());</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The output is:</p><div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">f from Impl</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Size of Impl is 
0</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The fact that the size of the <code>Impl</code> struct is <code>0</code> means that whether or not runtime polymorphism is used has no impact on the memory layout of the struct itself.</p><p>The C++ code that behaves most similarly (<a href="https://www.sololearn.com/compiler-playground/cFgXc9YWBOe2" target="_blank" rel="noopener noreferrer">playground</a>):</p><div class="language-cpp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-cpp codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token macro property directive-hash" style="color:#36acaa">#</span><span class="token macro property directive keyword" style="color:#00009f">include</span><span class="token macro property" style="color:#36acaa"> </span><span class="token macro property string" style="color:#e3116c">&lt;iostream&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token 
class-name">Trait</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">public</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">virtual</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">void</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">f</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">Impl</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token base-clause keyword" 
style="color:#00009f">public</span><span class="token base-clause"> </span><span class="token base-clause class-name">Trait</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">public</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">void</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">f</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">override</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">cout </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"f from Impl"</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">endl</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">int</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Impl a</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Trait </span><span class="token operator" style="color:#393A34">&amp;</span><span class="token plain">b </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> a</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    b</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">f</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">cout </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Size of Impl is "</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">sizeof</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">a</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&lt;&lt;</span><span class="token plain"> std</span><span class="token double-colon punctuation" style="color:#393A34">::</span><span class="token plain">endl</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 
2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The output is:</p><div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">f from Impl</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Size of Impl is 8</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>This is the output on a 64-bit system: because runtime polymorphism is used, each <code>Impl</code> object holds an 8-byte virtual table pointer.</p><p>Compared to Rust's trait objects, C++ runtime polymorphism is not a zero-overhead abstraction. The additional 8 bytes of storage per object can be significant, and the fact that the virtual table pointer changes the object's memory layout greatly limits where it can be applied.</p>]]></content:encoded>
            <category>rust</category>
        </item>
        <item>
            <title><![CDATA[Data as a product - The role of APIs]]></title>
            <link>https://getdozer.io/blog/data-apis-role</link>
            <guid>https://getdozer.io/blog/data-apis-role</guid>
            <pubDate>Tue, 28 Jun 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[What are the challenges of implementing efficient and scalable data APIs?]]></description>
            <content:encoded><![CDATA[<p><em>What are the challenges of implementing efficient and scalable data APIs? </em></p><p><img loading="lazy" alt="Rust Programming" src="/blog/assets/images/crab_sea-a907ed7c0a1087a583d9f2b33e229a82.webp" width="1040" height="585" class="img_ev3q"></p><p>Thanks to the adoption of cloud data warehouse platforms like Snowflake or Databricks, every organization is producing more and more data. This data is later processed by data analysts to extract insights or by data scientists to build predictive models that support business decisions. Data analysts generally use tools like DBT to write SQL transformations, while Data Scientists generally prefer using a Python stack or AutoML tools like DataRobot or H2O. In both cases, all results are written back to the same data warehouse for easier accessibility. </p><p>To consume this data, companies have started building analytical dashboards, which play an important role in monitoring the health of the business and help drive strategic decisions.
More recently, companies started to realize the value of this data in other contexts. Reverse ETL tools like Hightouch or Census, for instance, unlock its value in operational use cases by making insights or predictions available in cloud SaaS applications. This is very useful, for example, to improve the efficiency of an e-mail marketing campaign.</p><p>Use cases, however, are not just limited to internal consumption. In multiple scenarios, it's extremely useful to expose this data directly to the end-user as part of the product experience. Think of the fintech industry, for example, where companies need to make this data readily available from the user's mobile app in order to improve their product's UX.</p><p>This seems like a very easy task to achieve, but in reality it can require a lot of work from a diverse group of people. Let's understand why!
Data Warehouses like Snowflake or Databricks are specifically designed for analytical purposes. This means they are not suited for low-latency querying and point lookups. However, these are the typical requirements in a microservice serving customer applications or a mobile app; fast response time is a prerequisite to implementing a good UX for the user. For this reason, data sitting in the data warehouse needs to be moved to a different type of storage that is capable of offering these capabilities. During this process, data must be properly prepared and indexed, and an API layer must be created in front of it, so that product engineers can build their applications on top. This whole process is quite challenging and requires a lot of data engineering work. </p><p>Let's look at some challenges in detail:</p><ul><li><strong>Data Models</strong>: microservices and front-end developers are used to working with hierarchical data models (like JSON or Protobuf), while Data Analysts and Data Scientists are more comfortable with tabular data. In order to better fit API use cases, it is ideal to put in place mechanisms to automatically denormalize and transform data from tabular to hierarchical representations. </li><li><strong>Data Integrity</strong>: In some situations incremental movement of data is okay, but in other scenarios a dataset must be replaced completely with a new version of the data. 
In these cases, it is important to ensure that an "all-or-nothing" pattern is applied, preventing the mix-up of old and new data during deployment.</li><li><strong>Seamless to Consumer</strong>: Once a new version of the data is deployed, it is important that the consumer starts using the new version of the data in an automated fashion.</li><li><strong>Easy Rollbacks</strong>: In cases where wrong data gets deployed, it must be possible to roll back to an older version with minimal effort in order to avoid any disruption in user functionality.</li><li><strong>Fine-Grained Observability and RCA</strong>: It is possible that, for any reason, some wrong data is served to the user. In those situations, it is essential to have a proper observability tool that is capable of tracking each API user request and tracing it back to the source data.</li><li><strong>Low-latency</strong>: The way data is represented and indexed depends very much on the consumption pattern. Sometimes it is necessary to look up data by a primary key, other times by multiple secondary keys, and other times by a geographic location, and so on. A storage layer that sits in front of APIs must be able to satisfy these kinds of lookups very efficiently and at extremely low latency.</li><li><strong>Auto-scaling</strong>: APIs need to handle spikes of traffic efficiently. This is generally achieved with auto-scaling. This is an easy task when a stateless API server needs to be auto-scaled, but it is much harder when APIs and storage need to be scaled together. </li></ul><p>All the challenges I described above are what we are solving with Dozer. We are aiming to automate the data extraction and preparation process to make it efficiently serviceable through APIs. Stay tuned for more!</p>]]></content:encoded>
            <category>api</category>
            <category>data</category>
        </item>
        <item>
            <title><![CDATA[Improve your Postgres query performance through a CDC pipeline - Part 1]]></title>
            <link>https://getdozer.io/blog/postgres-cdc-query-performance</link>
            <guid>https://getdozer.io/blog/postgres-cdc-query-performance</guid>
            <pubDate>Sun, 19 Jun 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Ever come across occasions when you run into query performance issues for important queries that run on your database? This is when most companies will look at introducing a caching layer to improve the speed of queries.]]></description>
            <content:encoded><![CDATA[<p>Ever come across occasions when you run into query performance issues for important queries that run on your database? This is when most companies will look at introducing a caching layer to improve the speed of queries. </p><p><img loading="lazy" alt="Dozer" src="/blog/assets/images/dozer-000ce18e1a59acadc6f4723674af4052.png" width="1020" height="414" class="img_ev3q"></p><p>In many scenarios you can probably fix your performance issues by introducing the right set of indexes, or maybe by denormalizing some fields to reduce the join overhead. These come with their own set of challenges, such as having to write to two places, and may even require code changes. You might be working with a legacy platform where changing code is not straightforward. </p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="what-are-your-options-in-implementing-a-caching-layer-">What are your options in implementing a caching layer ?<a href="#what-are-your-options-in-implementing-a-caching-layer-" class="hash-link" aria-label="Direct link to What are your options in implementing a caching layer ?" title="Direct link to What are your options in implementing a caching layer ?">​</a></h4><ul><li>Application layer cache using Redis, Memcached, DynamoDB, etc.</li><li>Replicate data using CDC to an alternative DB/store optimised for your queries.</li></ul><h4 class="anchor anchorWithStickyNavbar_LWe7" id="1-application-layer-cache-using-redis--memcached--dynamodb">1) Application Layer Cache using Redis / Memcached / Dynamodb<a href="#1-application-layer-cache-using-redis--memcached--dynamodb" class="hash-link" aria-label="Direct link to 1) Application Layer Cache using Redis / Memcached / Dynamodb" title="Direct link to 1) Application Layer Cache using Redis / Memcached / Dynamodb">​</a></h4><p>This is a widely used approach where you would implement a caching layer by adopting one of these caching strategies: </p><ul><li>Cache Aside: the application maintains data in both the cache and the primary DB</li><li>Read Through: typically implemented using a library / framework that talks to the DB if there is a cache miss</li><li>Write Through: writes pass through the cache to the primary DB</li><li>Write Back: writes go to the cache first and are flushed to the primary DB later</li></ul><p>Each of these strategies comes with its own set of pros and cons, but the main differentiation is that application code has to deal with the complexity of the caching logic. </p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="2-replicate-data-through-cdc-to-a-secondary-database">2) Replicate data through CDC to a secondary database.<a href="#2-replicate-data-through-cdc-to-a-secondary-database" class="hash-link" aria-label="Direct link to 2) Replicate data through CDC to a secondary database." title="Direct link to 2) Replicate data through CDC to a secondary database.">​</a></h4><p>This approach has been gaining traction for read-heavy operations. With tools such as Debezium and AWS Database Migration Service, companies are building pipelines that move data using a replication approach. The diagram below illustrates the typical components involved. 
</p><p><img loading="lazy" alt="components" src="/blog/assets/images/components-0850894066d688c5027bc9293531e3eb.png" width="1002" height="518" class="img_ev3q"></p><p>This can be implemented without modifying the original implementation.
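</p><p>As a concrete starting point for such a pipeline on Postgres, logical decoding has to be enabled on the primary so that a tool like Debezium can read the change stream. A minimal sketch follows; the database, publication and slot names are illustrative, not part of any real deployment:</p>

```shell
# Enable logical decoding on the primary (requires a Postgres restart).
psql -d appdb -c "ALTER SYSTEM SET wal_level = logical;"

# Publish the tables whose changes should flow into the pipeline.
psql -d appdb -c "CREATE PUBLICATION orders_pub FOR TABLE orders;"

# Create a replication slot so WAL is retained until the downstream
# consumer (e.g. Debezium with the pgoutput plugin) has read it.
psql -d appdb -c "SELECT pg_create_logical_replication_slot('orders_slot', 'pgoutput');"
```

<p>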
There are some considerations to take note of:</p><ul><li>Does the data need to be real-time?</li><li>What types of indexes suit your querying needs?</li><li>How do you guarantee availability?</li><li>What happens if the schema in the primary database changes?</li><li>What are the costs involved?</li></ul><p>This approach requires data engineering effort and skilled engineers to build and maintain. </p><p>In the next article (Part 2) we will be publishing a sample repository that demonstrates some of this in code.
At Dozer, we are very excited to be building an end-to-end system that takes care of this exact problem. We are currently in the build phase and will publish our repository soon for developers to try. Please sign up on the Dozer website to get early access.</p>]]></content:encoded>
            <category>cdc</category>
            <category>postgres</category>
            <category>mysql</category>
            <category>debezium</category>
        </item>
        <item>
            <title><![CDATA[Are Rust, C++ and WASM the new tools for Data Engineering?]]></title>
            <link>https://getdozer.io/blog/dozer-rust-wasm-c++</link>
            <guid>https://getdozer.io/blog/dozer-rust-wasm-c++</guid>
            <pubDate>Sat, 18 Jun 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Traditional tools for data engineering are struggling with performance and scalability. JVM-based tools are becoming outdated, while new languages are becoming increasingly popular. Will Rust and WASM replace the current data engineering JVM-based stack?]]></description>
            <content:encoded><![CDATA[<p><em>Traditional tools for data engineering are struggling with performance and scalability. JVM-based tools are becoming outdated, while new languages are becoming increasingly popular. Will Rust and WASM replace the current data engineering JVM-based stack?</em></p><p><img loading="lazy" alt="Rust Programming" src="/blog/assets/images/wasm-and-rust-c2b9e9dd8d01499c881d4ff217947891.webp" width="730" height="487" class="img_ev3q"></p><p>I started my career as a C/C++ developer 20 years ago working on network protocols and embedded systems. Over time, I moved more and more into the data space and my level of abstraction started to move up the stack, with obviously less control of what is going on under the hood. When you go from C/C++ to Java, everything seems rosy in the beginning but, soon, when you start struggling with memory allocation, garbage collection and similar things, you realise that you are losing the power you had in your hands during your old C/C++ days. The advantage of the JVM, though, is pluggability. </p><p>If you design your software well, you can pretty much allow any customisation to be plugged in at a binary level, just by adding a new JAR to your classpath. Where things get trickier, however, is scriptability. In many situations you want your software to be scriptable using languages like JavaScript. It is possible, but the level of integration between scripting languages and the JVM is not that great. And many times, performance is poor. Think, for example, of the Spark and Python integration. That required a bridge like Py4J to make it work, but at a huge performance cost. Things have got better now with support for new formats like Arrow, but I remember the first version of PySpark was pretty crappy and almost unusable. </p><p>However, I have a feeling things are starting to change. 
People are realising that maybe the JVM is not really the best option for building data-intensive applications. But what's the alternative? Recently, Rust has become very popular, thanks also to the support of the blockchain community, and developers have started to realise that it can be used to build large and scalable systems. And...where do we need scalability today? Data! We have to handle more and more data and, clearly, the current tooling is not scaling up. It is telling that Databricks went through a complete rewrite of Apache Spark in C++, with huge benefits in terms of performance and scalability. At the same time you see several startups taking a similar direction. Look at RedPanda, which is implementing a much leaner version of Kafka entirely in C++. Many companies are following and will follow this trend.</p><p>But how do we allow pluggability in these systems? Meet WASM, the new kid on the block. WASM is fundamentally a machine-level language that can integrate seamlessly with C++ and Rust. The beauty of it is that WASM can be generated from multiple languages like C, C++, AssemblyScript (a variation of TypeScript), Rust, Kotlin and others. You can even compile a full Python interpreter to WASM and host the execution of a Python script! As more and more languages support compilation to WASM or LLVM, the possibilities are endless.
Now I think you see where I'm going! By bridging high-performance languages like C++ or Rust with WASM, we get the best of both worlds: performance, scalability and pluggability.
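</p><p>As a tiny illustration of that pattern (the function and its signature are hypothetical, not taken from any real engine), a plugin can be written in ordinary Rust and compiled to a .wasm module that a C++ or Rust host loads and calls by name:</p>

```rust
// A minimal, hypothetical WASM plugin: a pure transform exported with a
// C ABI so a host engine can look it up by name in the compiled module.
// Build as a plugin with: cargo build --target wasm32-unknown-unknown
#[no_mangle]
pub extern "C" fn scale(value: i64, factor: i64) -> i64 {
    // The business logic lives in the plugin; the host stays generic.
    value * factor
}

fn main() {
    // Compiled natively, the same code can be exercised directly.
    println!("{}", scale(6, 7)); // prints 42
}
```

<p>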
I truly believe in this new pattern, and that is why, at Dozer, we are building the next-generation Data APIs stack entirely in Rust and WASM. Stay tuned!</p>]]></content:encoded>
            <category>wasm</category>
            <category>rust</category>
        </item>
    </channel>
</rss>