ISSN 2710-1673 Artificial Intelligence 2024 № 2
UDC: 004.93 https://doi.org/10.15407/jai2024.02.107
M. Shash
State University of Information and Communication Technology, Ukraine
7, Solomyanska Str., Kyiv, 03110
max.shash@gmail.com
https://orcid.org/0009-0009-3274-5318
LEVERAGING LANGCHAIN AGENTS TO AUTOMATE
DATA ANALYSIS FOR SAAS
Abstract. This paper investigates the implementation of LangChain, a language model-powered framework, in automating data analysis within the SaaS sector. The approach included setting up LangChain agents for exploratory, univariate, and bivariate analyses, as well as hypothesis testing, transforming extensive data into human-language text answers. Experiments confirmed the effectiveness of the proposed method using GPT-3.5 LLM agents, tested on the Amazon AWS SaaS Sales Dataset. Identified deficiencies need to be addressed for complex queries and comprehensive reports. Future research prospects include improving the method for complex queries, providing more detailed information about companies and business models, creating report templates, and training the model to solve complex questions. To automate data analysis, a method based on LangChain agents was proposed. A software implementation was developed, and data analysis indicators were studied using SaaS sales data as a case study. The study demonstrated LangChain agents' capability to automate data analysis processes in the SaaS industry. Future research will aim to expand the application across more complex data, a larger number of data questions, and pre-trained LLMs.
Keywords: LangChain, LangChain agent, AI, LLM, artificial intelligence, data analysis, data analysis
automation.
Abbreviations
Agent is a LangChain agent;
AI is an Artificial Intelligence;
ChatGPT is an instance of the GPT model by OpenAI;
Dataframe is a data structure in Python;
EDA is an Exploratory Data Analysis;
GPT is a Generative Pre-trained Transformer;
gpt-3.5-turbo-0125 is a version of the GPT-3.5 model;
LangChain is a framework for LLMs;
LLM is a Large Language Model;
OpenAI is an AI research and deployment company;
Pandas is a library in Python;
RAG is a Retrieval-Augmented Generation;
SaaS is a Software as a Service;
Temperature is a hyperparameter for LLM.

Nomenclature
Aj is a vector created by the embedding mechanism;
Bj is a vector created by the embedding mechanism;
θj is an angle between vectors Aj and Bj.

Introduction
In the rapidly changing Software as a Service (SaaS) industry, effectively using advanced data analytics is crucial for guiding business strategies and maintaining competitiveness [1]. Traditional data processing in this sector depends heavily on the expertise of data analysts. This reliance can strain budgets and lead to operational slowdowns and scaling difficulties, issues that are intensified by the rapid increase in data volume and complexity. The introduction of new technologies like LangChain and Retrieval-Augmented Generation (RAG) heralds a significant transition towards more sustainable and streamlined approaches to data management using LLMs.
Object of Study: The object of this study is the automation of data analysis processes within the SaaS sector, aimed at automating traditional, manual methods of data examination and interpretation by data analysis with AI chatbots using LangChain agents.
Subject of Study: The subjects of this study include AI-driven technologies, specifically LangChain and RAG, that facilitate the automation of data analytics. These technologies enable the execution of complex tasks such as exploratory data analysis, univariate and bivariate analysis, and hypothesis testing through AI chatbots, reducing the need for extensive human intervention.
Purpose of the Work
The primary aim of this paper is to
explore the potential of LangChain agents and
other related AI technologies to automate data
analysis in the SaaS industry.
Fig. 1. The common process of data analysis interactions with humans and data

Problem statement
The conventional approach to data analysis in the SaaS sector is deeply dependent on the expertise of data analysts.
This dependence introduces substantial
obstacles: it leads to bottlenecks in the
operational workflow, increases operational
expenditures, and constrains SaaS companies'
ability to scale their data analytics functions
effectively. Such challenges are intensifying
as the volume and complexity of data expand
exponentially.
The advent of AI and associated
technologies like Retrieval-Augmented
Generation (RAG) and LangChain offers a
transformative solution to these bottlenecks.
By integrating these technologies, businesses
can leverage Al-driven chatbots to undertake
a broad spectrum of data analyses, ranging
from basic exploratory data analysis (EDA) to
more sophisticated univariate and bivariate
analyses, as well as hypothesis testing. The
implementation of these tools challenges
traditional methodologies by enabling
automated, efficient data processing that does
not compromise on depth or accuracy.
Review of the literature
LangChain is a versatile and powerful
framework that facilitates the development of
applications powered by large language
models (LLMs), such as ChatGPT, especially
within the context of automating data analysis
for SaaS or other sectors [9,10]. Various
resources highlight how LangChain can
significantly enhance data analysis processes
through its ability to integrate LLMs for
natural language understanding and
generation, offering a new level of interaction
and automation in data-related tasks.
LLMs can effectively perform market
sentiment analysis on Reddit posts, with
potential for competitive performance against
existing supervised models [5].
The new method using large language
models (LLMs) improves qualitative data
analysis by automating keypoints extraction
and relevance evaluation, achieving higher
accuracy and reducing time and effort in
various application settings [6].
Mergen, an R package, leverages Large
Language Models (LLMs) for data analysis
code generation and execution, enabling
humans to conduct data analysis by
describing objectives and desired analyses
through clear text [7].
For instance, LangChain simplifies the
creation of applications that leverage LLMs
for tasks like text summarization, question
answering, and chatbot development. It
allows these applications to interact with
various data sources, enabling more context-
aware, responsive, and intelligent systems.
Developers can define use cases, build logic
with flexible prompts and chains, and set
context to guide application behavior, which
enhances performance and user experience
[3].
LangChain can be used to analyze
various data input sources such as PDF
documents including text, images, tables, and
other embedded elements [2] as well as
summarize documents and provide answers
based on the document content [8].
In terms of automation, LangChain's
applicability extends to specific data analysis
tasks such as Exploratory Data Analysis
(EDA), hypothesis testing, and bivariate
analysis using agents created to interface with
data frames directly, simplifying the
workflow and improving the efficiency of
data-driven decisions [4].
These capabilities demonstrate
LangChain's potential to revolutionize data
analysis, providing a robust tool that can
accommodate a wide array of analytical tasks,
fostering innovation, and enabling automated
data analysis.
Materials and methods
In this article, we suggest replacing the data analyst with an AI chatbot based on a LangChain agent. The agent receives data-related questions that require data to be analyzed in order to be answered.
The suggestion is to use a LangChain agent that leverages an LLM in tandem with a series of predefined actions. The agent employs a reasoning engine to select the most appropriate actions to answer the questions. The agent can manage multiple tasks such as receiving data from the database, running such tasks in loops, and using additional context to provide a contextually aware answer.
Firstly, based on the description of the tools, the agent decides which tool should be used to get relevant information.
Fig. 2. Replacing the data analysts with AI ChatBot
Secondly, the agent performs actions,
such as generating Python code to extract
required data from a Pandas dataframe, and
considers the context of the obtained results.
It also seeks additional information from
other sources, like conducting a Google
search or referencing descriptions of SaaS
domains.
Lastly, the agent checks the results and repeats the process until it obtains the desired information, answering the question in human language so that it can be understood by the person who asked it.
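This three-step loop can be summarized with a minimal conceptual sketch (this is not the paper's implementation; the callables llm_decide and llm_answer and the tools mapping are hypothetical stand-ins for the LLM-driven reasoning engine and the available tools):

def run_agent(question, tools, llm_decide, llm_answer, max_steps=5):
    """Iteratively choose a tool, execute it, and collect observations until the question can be answered."""
    observations = []
    for _ in range(max_steps):
        # Step 1: based on the tool descriptions, decide which tool to use (None means "enough information")
        tool_name, tool_input = llm_decide(question, tools, observations)
        if tool_name is None:
            break
        # Step 2: perform the action, e.g. run generated Python code against the Pandas dataframe
        result = tools[tool_name](tool_input)
        observations.append((tool_name, tool_input, result))
    # Step 3: compose a human-language answer from the accumulated observations
    return llm_answer(question, observations)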
Recognizing that domain knowledge and company-specific data are likely absent from the LLM, the decision was made to create a second agent enhanced with such information.
Fig. 3. Providing additional domain knowledge to AI agent
The domain knowledge data is split into discrete chunks, each representing a distinct unit of information. Subsequently, an embedding algorithm processes these segmented data pieces, generating embeddings that are then inserted into a vector database. Each vector is a point in an n-dimensional space and is represented by numerical values.
The user's question is processed by the embeddings mechanism and its embedding is computed. Then the cosine similarity between the user's embedding vector and every database vector is calculated using the following formula [4]:

cosine similarity = cos(θ) = (A · B) / (‖A‖ · ‖B‖) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))
A few examples of vector similarity are shown in Figure 4 below: the most similar vectors have cos(θ) = 1, somewhat similar vectors have, for example, cos(θ) = 0.98481, and vectors that are not similar have cos(θ) = 0.

Fig. 4. An illustration of vector similarity for two vectors A and B with an angle θ
The most similar vectors are returned as
the ones that contain the most relevant data.
These vectors form the context for the
subsequent input to the LangChain agent.
The LangChain agent takes this information into account when answering the question.
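A minimal sketch of this retrieval step (a simplified, in-memory stand-in for the vector database; the embed callable is assumed to wrap the chosen embedding model, and the chunk texts are illustrative):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_context(question, chunks, embed, top_k=3):
    """Return the top_k chunks whose embeddings are most similar to the question embedding."""
    question_vec = embed(question)
    scored = [(cosine_similarity(question_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

The retrieved chunks are then concatenated and passed to the LangChain agent as additional context for answering the question.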
Experiments
Two LangChain agents were developed:
the first one had access to data in a Pandas
dataframe, while the second one had access
not only to the dataframe but also to
contextual text describing the dataset,
company, and domain knowledge. The
contextual information was provided as text
input.
The dataset comprised SaaS sales data
sourced from AWS, consisting of a single
table with 18 columns and 9994 rows. This
data was imported from a CSV file into a
Pandas dataframe.
Both LangChain agents were evaluated
in addressing various data analysis tasks,
including exploratory data analysis (EDA),
univariate and bivariate analysis, complex
inquiries, hypothesis generation based on the
data, and hypothesis testing using the t-test.
Twenty analytical questions covering
diverse analysis types were formulated and
presented to both LangChain agents in natural
language. The responses were recorded,
compared against each other, and analyzed.
The OpenAI gpt-3.5-turbo-0125 model served as the LLM with the specified parameters: temperature=0 and max_tokens=1000.
The langchain_experimental.agents.agent_toolkits.pandas.base.create_pandas_dataframe_agent API was employed to create the dataframe agent.
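A minimal sketch of how the first agent might be assembled under these settings (assuming the langchain-openai and langchain-experimental packages; the CSV file name and the sample question are illustrative, and exact arguments may differ between library versions):

import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

# Load the SaaS sales data (a single table with 18 columns and 9994 rows) into a Pandas dataframe
df = pd.read_csv("saas_sales.csv")  # illustrative file name

# LLM configured as in the experiments: gpt-3.5-turbo-0125, temperature=0, max_tokens=1000
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0, max_tokens=1000)

# Agent 1: has access only to the dataframe
agent = create_pandas_dataframe_agent(
    llm,
    df,
    verbose=True,
    allow_dangerous_code=True,  # recent langchain_experimental versions require opting in to code execution
)

# Pose an analytical question in natural language (illustrative question)
result = agent.invoke("How many unique industries are present in the dataset?")
print(result["output"])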
To enhance the second LangChain
agent with context, hardcoded text input was
provided. Subsequently, the 20 questions
were posed to each LangChain agent, and the
resulting responses were evaluated across
various metrics by comparing them to a set of
pre-validated answers.
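One way the hardcoded context could be supplied to the second agent is through the prefix argument of create_pandas_dataframe_agent, which prepends text to the agent's prompt (whether this argument is available depends on the library version, and the context text below is purely illustrative rather than the description actually used in the experiments):

# Agent 2: the same dataframe agent, enhanced with domain and company context
domain_context = (
    "The dataframe contains SaaS sales transactions of the company. "
    "Each row is an order line with customer, industry, product, quantity, and sales fields."
)

agent_with_context = create_pandas_dataframe_agent(
    llm,
    df,
    prefix=domain_context,  # hardcoded text describing the dataset, company, and domain
    verbose=True,
    allow_dangerous_code=True,
)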
These metrics were utilized for the
analysis and evaluation of the LangChain
agents’ performance:
Correctness — Assess whether the agent delivered a correct response or solution.
Comprehensiveness — Evaluate whether the agent's responses cover all the details of the question.
Clarity — Assess whether the response is
clear. Responses that are overly verbose or
unclear may hinder user understanding.
Novelty — Measure the degree of
novelty in each response compared to existing
information. This metric is particularly
relevant for hypothesis generation and
explanations.
Relevance — Rate the relevance of the
answer to the dataset and inquiries.
For each question, results are
categorized as follows: 2 — fully corresponds,
1 — partially corresponds, 0 — does not
correspond, N/A — result not provided due to
LangChain agent issue.
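A small sketch of how such per-metric averages might be computed (how N/A results are handled in the averages is not specified here; this sketch simply excludes them, which is an assumption, and the example scores are hypothetical):

def average_score(scores):
    """Average per-question scores (2, 1, 0), skipping N/A results represented as None."""
    valid = [s for s in scores if s is not None]
    return round(sum(valid) / len(valid), 2) if valid else None

# Hypothetical Correctness scores of one agent on five EDA questions
print(average_score([2, 2, 1, 2, 2]))  # 1.8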
Results
The results of the conducted experiments are presented in Table 1. Here we use the following notation: Agent 1 is a LangChain agent without additional context, Agent 2 is a LangChain agent with additional context. The metrics (Correctness, Relevance, Comprehensiveness, Clarity, Novelty) were calculated as average values.
Table 1 shows that for several metrics (Correctness and Relevance) Agent 2 showed higher results, while for the other metrics the results are similar.
Table 1. The experiment results on agent performance by analysis types
Analysis type | Correctness, avg (Agent 1 / Agent 2) | Relevance, avg (Agent 1 / Agent 2) | Comprehensiveness, avg (Agent 1 / Agent 2) | Clarity, avg (Agent 1 / Agent 2) | Novelty, avg (Agent 1 / Agent 2)
Bivariate analysis | ? / 2.00 | 1.00 / 1.00 | 1.00 / 1.00 | 1.00 / 1.00 | 1.00 / 1.00
Complex analysis | ? / ? | 0.67 / 0.67 | 0.00 / 0.00 | 0.33 / 0.33 | 0.00 / 0.00
Exploratory Data Analysis | 1.60 / 1.80 | 1.20 / 1.60 | 1.00 / 1.00 | 1.60 / 1.60 | 0.80 / 0.80
Hypothesis testing | ? / ? | 1.00 / 1.00 | 1.00 / 1.00 | 1.00 / 1.00 | 1.00 / 1.00
Univariate analysis | ? / ? | 1.00 / ? | 1.00 / 1.00 | 1.00 / 1.00 | 1.00 / 1.00
Average | 1.59 / 1.63 | 0.97 / 1.10 | 0.80 / 0.80 | 0.99 / 0.99 | 0.76 / 0.76
(Values marked "?" are not legible in the source.)
# Question 14 - Hypotheses. How is the distribution of Quantity per Order across various industries?
agent.invoke("""As a helpful data analyst, Plot the distribution of Quantity per Order for the Transportation
industries using kde plot. Add the distribution of Quantity per Order for the Finance industry on the same plot.
Also provide a legend and label the x and y axises.""")

The agent generated and executed the plotting code (emitting seaborn warnings that 'shade' is deprecated in favor of 'fill'), produced a KDE plot titled "Distribution of Quantity per Order for Transportation and Finance Industries" (x-axis: Quantity per Order, y-axis: Density), and returned the following output:

"Here is the KDE plot showing the distribution of Quantity per Order for the Transportation and Finance industries. The plot displays the density of quantity values per order for each industry.
- The blue curve represents the Transportation industry.
- The orange curve represents the Finance industry.
The x-axis is labeled as 'Quantity per Order' and the y-axis is labeled as 'Density'. The legend indicates which curve corresponds to each industry."

Fig. 5. An example of interaction with a LangChain agent that receives human-text input and produces a plain-text answer as well as a visual plot to support it
Figure 5 graphically illustrates an example of the output generated by a LangChain agent instance. It shows that LangChain agents are capable of receiving a user question, analyzing data, and providing text output as well as visual charts using Python packages such as Seaborn.
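For reference, a plot of the kind shown in Fig. 5 can be reproduced directly with Seaborn along the following lines (assuming the dataframe has Industry and Quantity columns and using fill=True, which replaces the deprecated shade argument mentioned in the agent's warnings):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("saas_sales.csv")  # illustrative file name

# One KDE curve per industry, filled under the curve
for industry in ["Transportation", "Finance"]:
    sns.kdeplot(df.loc[df["Industry"] == industry, "Quantity"], fill=True, label=industry)

plt.title("Distribution of Quantity per Order for Transportation and Finance Industries")
plt.xlabel("Quantity per Order")
plt.ylabel("Density")
plt.legend()
plt.show()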
Discussion
As it is evident from the Figure 1, in