# LogLLM: Log-based Anomaly Detection Using Large Language Models
**Authors**: Wei Guan, Jian Cao (corresponding author), Shiyou Qian, Jianqi Gao, Chun Ouyang
> Department of Computer Science and Engineering, SJTU, Shanghai, China; School of Information Systems, QUT, Brisbane, Australia. {guan-wei, cao-jian, qshiyou, 193139}@sjtu.edu.cn, c.ouyang@qut.edu.au
Abstract
Software systems often record important runtime information in logs to help with troubleshooting. Log-based anomaly detection has become a key research area that aims to identify system issues through log data, ultimately enhancing the reliability of software systems. Traditional deep learning methods often struggle to capture the semantic information embedded in log data, which is typically organized in natural language. In this paper, we propose LogLLM, a log-based anomaly detection framework that leverages large language models (LLMs). LogLLM employs BERT for extracting semantic vectors from log messages, while utilizing Llama, a transformer decoder-based model, for classifying log sequences. Additionally, we introduce a projector to align the vector representation spaces of BERT and Llama, ensuring a cohesive understanding of log semantics. Unlike conventional methods that require log parsers to extract templates, LogLLM preprocesses log messages with regular expressions, streamlining the entire process. Our framework is trained through a novel three-stage procedure designed to enhance performance and adaptability. Experimental results across four public datasets demonstrate that LogLLM outperforms state-of-the-art methods. Even when handling unstable logs, it effectively captures the semantic meaning of log messages and detects anomalies accurately.
Index Terms: System log, anomaly detection, large language model, deep learning, log analysis
I Introduction
Ensuring high availability and reliability is crucial for large-scale software-intensive systems [1, 2]. As these systems become more complex and expansive, the occurrence of anomalies becomes unavoidable [3, 4]. Even a minor issue can lead to performance degradation, data integrity problems, and substantial losses of both customers and revenue. Therefore, anomaly detection is vital for maintaining the health and stability of complex software-intensive systems [5].
Software-intensive systems typically produce console logs that record system states and critical runtime events [6]. Engineers can utilize this log data to evaluate system health, identify anomalies, and trace the root causes of issues. However, due to the potentially vast volume of logs, manually analyzing them for anomalies can be both labor-intensive and prone to mistakes [7]. Consequently, log-based anomaly detection has emerged as a key area in automated log analysis, focusing on the automatic identification of system anomalies through log data.
Numerous deep learning-based methods [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] for log-based anomaly detection have been proposed. These methods typically employ sequential deep learning models such as LSTM [23] and transformers [24]. These methods can be further divided into reconstruction-based methods [8, 9, 10, 11, 12, 13, 14, 15] and binary classification-based methods [16, 17, 18, 19, 20, 21, 22]. Reconstruction-based methods involve designing and training a deep neural network to reconstruct input log sequences, with anomalies detected based on reconstruction errors. The underlying principle is that anomalous samples cannot be accurately reconstructed. Binary classification-based methods, on the other hand, involve designing a binary classifier to classify samples as either normal or anomalous. These methods often require labeled anomalies for training purposes. It is recognized that system logs are documented in natural language and contain a significant amount of semantic information. Nevertheless, traditional deep learning-based methods struggle to effectively capture this information.
In recent years, significant advancements have been achieved in LLMs, such as GPT-4 [25], Llama 3 [26], and ChatGLM [27]. These models are characterized by their vast parameter sizes and are pretrained on substantially larger datasets, ranging from several gigabytes to terabytes in size. This extensive pretraining equips them with remarkable language comprehension abilities, enabling superior performance in tasks such as summarization, paraphrasing, and instruction following even in zero-shot scenarios [28]. Existing methods that utilize LLMs for log-based anomaly detection can be categorized into prompt engineering-based [29, 7, 30, 31] and fine-tuning-based [32, 33, 34, 35, 3, 36, 37, 38, 39, 40] approaches. Prompt engineering-based methods leverage the zero/few-shot capabilities of LLMs to detect anomalies based solely on the models' internal knowledge. However, these methods often struggle to customize solutions for specific datasets, leading to suboptimal detection performance. Fine-tuning-based methods integrate LLMs into deep neural networks and tailor them to user-specific datasets. Nevertheless, these methods encounter challenges such as limited semantic understanding, suboptimal LLM utilization (relying solely on LLMs for semantic information extraction), and insufficient consideration of input data format, which can lead to memory overflow.
To tackle the aforementioned challenges, we propose LogLLM, a novel log-based anomaly detection framework that harnesses LLMs. Unlike traditional methods that rely on log parsers for template extraction, LogLLM preprocesses log messages using regular expressions, thereby streamlining the entire process. LogLLM, a fine-tuning-based method, utilizes BERT, a transformer encoder-based model, to extract semantic vectors from log messages. Additionally, it employs Llama, a transformer decoder-based model, to classify log sequences. To ensure coherence in log semantics, we introduce a projector that aligns the vector representation spaces of BERT and Llama. Our framework is trained using a novel three-stage procedure designed to enhance both performance and adaptability.
As illustrated in Section V-G, LLMs frequently face out-of-memory challenges due to their extensive parameter sizes [41]. Directly inputting the entire log sequence (by concatenating log messages into a long string) into Llama can lead to out-of-memory issues and potentially confuse the LLM, making it difficult to focus on key points for distinguishing anomalies. By adopting BERT to summarize each log message, LogLLM effectively mitigates these problems. We conduct experiments across four public datasets, and the results demonstrate that LogLLM outperforms state-of-the-art methods. Even when handling unstable logs, where new log templates frequently emerge due to software evolution, it effectively captures the semantic meaning of log messages and detects anomalies accurately. The ablation study confirms the effectiveness of the three-stage training procedure.
The main contributions of our work are as follows:
- We introduce LogLLM, a novel log-based anomaly detection framework leveraging LLMs. This study marks the first attempt to simultaneously employ transformer encoder-based and decoder-based LLMs, specifically BERT and Llama, for log-based anomaly detection.
- We propose a novel three-stage procedure to optimize the training and coordination of different components within the deep model, enhancing both performance and adaptability.
- We conduct extensive experiments on four publicly available real-world datasets, demonstrating that LogLLM achieves exceptional performance.
II Related Work
In this section, we explore related work in the field of log-based anomaly detection, with a particular focus on deep learning-based methods. We give special attention to approaches that utilize pretrained LLMs.
II-A Traditional Deep Learning for Log-based Anomaly Detection
Many traditional deep learning-based methods for log-based anomaly detection have been proposed. These works can be grouped into two types based on the training paradigm: reconstruction-based methods and binary classification-based methods.
Reconstruction-based methods [8, 9, 10, 11, 12, 13, 14, 15] involve designing and training a deep neural network to reconstruct input log sequences. Anomalies are detected based on reconstruction errors. Normal log sequences can be reconstructed with minimal errors, while anomalous log sequences cannot be effectively reconstructed, resulting in significantly higher reconstruction errors. These methods consistently train the deep model on normal data that is free of anomalies, which means they are semi-supervised.
DeepLog [8] adopts LSTM to predict the next log template ID based on past log sequences. Similarly, LogAnomaly [9] predicts the next log template ID based on both sequential and quantitative patterns. Autoencoders (AEs) [10, 11, 12, 13] and generative adversarial networks (GANs) [14, 15] are widely used in reconstruction-based methods. For example, LogAttn [10] adopts an AE that incorporates a temporal convolutional network (TCN) to capture temporal semantic correlations and a deep neural network (DNN) to capture statistical correlations. Duan et al. [14] use a GAN, where an encoder-decoder framework based on LSTM serves as the generator. Convolutional neural networks (CNNs) are used as the discriminator. The reconstruction error is calculated based on the difference between the input and the output from the generator.
Binary classification-based methods [16, 17, 18, 19, 20, 21, 22] often employ deep neural networks that output either one or two values. Typically, a single value represents the probability that a sample belongs to the anomalous class, and anomalies are detected by applying a threshold to convert this probability into a binary classification. When two values are output, they represent the probabilities of the sample belonging to the normal and anomalous classes, respectively.
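The single-output thresholding scheme described above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited methods; the probabilities are made up.

```python
import numpy as np

# Hypothetical anomaly probabilities output by a binary classifier
probs = np.array([0.02, 0.91, 0.47, 0.63])

# Applying a threshold converts probabilities into binary decisions;
# 0.5 is a common default, but in practice it is tuned on validation data.
threshold = 0.5
labels = (probs > threshold).astype(int)  # 1 = anomalous, 0 = normal
```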
Most methods [16, 17, 18, 19, 20] typically train deep models in a supervised manner. For example, Zhang et al. [16] propose LayerLog, which integrates word, log, and logseq layers to extract semantic features from log sequences. CNNs are utilized in [17, 18] to develop a binary classifier. LogRobust [19] integrates a pre-trained Word2Vec model, specifically FastText [42], and combines it with TF-IDF weights to learn representation vectors of log templates. These vectors are then fed into an attention-based Bi-LSTM model for anomaly detection. LogGD [20] transforms log sequences into graphs and utilizes a graph transformer neural network that combines graph structure and node semantics for log-based anomaly detection.
Some work [21, 22] involves training binary classifiers in a semi-supervised manner. For example, Trine [21] uses a transformer encoder [24] to encode normal log sequences into vector representations and a generator to produce random fake vector representations. The discriminator, composed of a transformer and a multi-layer perceptron (MLP), is trained to distinguish whether given vector representations correspond to normal log sequences, and is subsequently used to detect anomalies. PLELog [22] tackles the challenge of insufficient labeling by employing probabilistic label estimation and develops an attention-based GRU neural network for anomaly detection.
It is acknowledged that system logs are recorded in natural language and contain a substantial amount of semantic information. However, traditional deep learning-based methods face challenges in capturing this information.
II-B LLMs for Log-based Anomaly Detection
Existing LLMs can be categorized into transformer encoder-based models, such as BERT [43], RoBERTa [44], and SpanBERT [45], and transformer decoder-based models, including GPT-4 [25], Llama 3 [26], and ChatGLM [27]. Two prevalent strategies for utilizing LLMs are prompt engineering and fine-tuning.
Prompt engineering-based methods [29, 7, 30, 31] detect anomalies solely by relying on the internal knowledge of LLMs. These methods typically employ transformer decoder-based models. For instance, Qi et al. [7] employ ChatGPT for zero-shot and few-shot log-based anomaly detection, utilizing prompt templates that integrate the log sequence directly. However, this approach becomes impractical when using a large window size for grouping log messages. Egersdoerfer et al. [30] address this issue by maintaining a summary-based memory, which summarizes the previous log messages, eliminating the need to input the entire log sequence for anomaly detection. RAGLog [31] uses a retrieval-augmented generation (RAG) framework [46] to analyze log entries by querying its store of samples of normal log entries. They design prompt templates for LLMs to determine whether a queried log entry is normal or abnormal. Prompt engineering-based methods often struggle to customize solutions for specific datasets, which can lead to suboptimal detection performance on particular datasets.
*(Figure content: three raw BGL log messages, each comprising a header (timestamp, date, node identifier such as `R02-M1-N0-C:J12-U11`, and `RAS KERNEL INFO`) followed by content such as `instruction cache parity error corrected`, `141 double-hummer alignment exceptions`, and `CE sym 2, at 0x0b85eee0, mask 0x05`.)*
Figure 1: An example of a system log.
Fine-tuning-based methods [32, 33, 34, 35, 3, 36, 37, 38, 39, 40] incorporate LLMs into deep neural networks and customize them to the user's own dataset. Some methods [32, 33, 34, 35], although adopting transformer encoder-based LLMs for anomaly detection, do not capture the semantic information within log sequences. For example, LogBERT [32] and LAnoBERT [33] utilize BERT to reconstruct the input sequence of log template IDs (IDs of log string templates) and detect anomalies based on reconstruction errors, disregarding the semantic information. Other methods [3, 36, 37, 38, 39] use transformer encoder-based LLMs solely for extracting semantic information from log messages and then employ either smaller models [3, 36, 37, 38] or distance-based comparison [39] for classification. For instance, NeuralLog [3] leverages BERT to extract semantic vectors from raw log messages, which are subsequently used to detect anomalies via a transformer-based classification model. Similarly, RAPID [39] utilizes transformer encoder-based models to extract semantic vectors and performs anomaly detection by comparing each query log sequence with its nearest document log sequence. Hadadi et al. [40] directly input template sequences parsed from log sequences into GPT models and fine-tune them to predict sequence labels. However, this approach faces two key challenges. First, the boundaries between templates within the sequences are unclear, making it difficult for the model to learn the sequential dependencies. Second, each template may be tokenized into multiple tokens by the LLM's tokenizer, and a single sequence can contain numerous log templates. As a result, an excessive number of tokens may be generated, often exceeding the token (memory) limits of LLMs [41], thereby restricting the length of sequences that can be processed. These two challenges are further demonstrated in Section V-G.
LogLLM is a fine-tuning-based method that utilizes BERT for extracting semantic vectors from log messages and Llama, a transformer decoder-based model, for log sequence classification. This method aligns the vector representation spaces of BERT and Llama using a projector. The use of BERT ensures clear boundaries between log messages, as each message is represented by a distinct embedding vector, thereby enhancing classification performance. Moreover, when memory and parameter size of Llama are held constant, this approach can handle longer sequences compared to directly tokenizing the entire log sequence using Llamaâs tokenizer.
III Preliminaries
To establish the groundwork for subsequent sections, we introduce the system log, which records the system's events and internal states during runtime. A system log contains a list of log messages in chronological order.
Fig. 1 presents a snippet of a raw system log generated by the BGL (the BlueGene/L supercomputer system), with each log message ordered according to the recorded time. These raw log messages are semi-structured texts consisting of a header and content. The header, determined by the logging framework, includes information such as timestamp, verbosity level (e.g., WARN/INFO), and component [47]. The log content comprises a constant part (keywords that reveal the log template) and a variable part (parameters that carry dynamic runtime information). In this paper, we focus solely on the content of each log message.
*(Figure content: an HDFS system log in which log messages sharing the same block id, e.g., `blk_-5632101276183739500`, are grouped into one log sequence per session.)*
(a) Session window
*(Figure content: a BGL system log of five messages partitioned by a sliding window of size 2 and step size 2; e.g., the first log sequence contains `iar 00106ba8 dear 0246dd1c` and `5052567 floating point alignment exceptions`.)*
(b) Sliding window
Figure 2: Illustrative examples of log message partitioning.
The log messages can be grouped into log sequences (i.e., series of log messages that record specific execution flows) based on session or fixed/sliding windows [48]. Session window partitioning groups log messages according to their session IDs, thereby generating sequences that include the log messages within each session. For example, Fig. 2a illustrates the HDFS [49] logs undergoing the session window grouping process, where the block_id serves as the session ID. In contrast, fixed/sliding window partitioning groups log messages based on a fixed size (window size), which can be defined by either the time span or the number of log messages. This method creates sequences that capture snapshots of system log messages over time. For example, Fig. 2b illustrates the BGL [50] logs undergoing the sliding window grouping process, with a window size of 2 messages and a step size of 2 messages.
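The two grouping strategies above can be sketched as follows. This is an illustrative sketch rather than the paper's implementation; the session-ID regex (an HDFS-style block id) and the function names are assumptions for demonstration.

```python
import re
from collections import defaultdict

def session_windows(messages, session_pattern=r"blk_-?\d+"):
    """Group log messages by a session ID (e.g., an HDFS block id)
    found inside each message; one log sequence per session."""
    groups = defaultdict(list)
    for msg in messages:
        m = re.search(session_pattern, msg)
        if m:
            groups[m.group(0)].append(msg)
    return dict(groups)

def sliding_windows(messages, window_size=2, step_size=2):
    """Split messages into windows of `window_size` entries,
    advancing by `step_size` entries each time."""
    return [messages[i:i + window_size]
            for i in range(0, len(messages), step_size)]
```

With `step_size` equal to `window_size`, the windows are non-overlapping, matching the BGL example in Fig. 2b; a smaller step would produce overlapping sequences.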
The objective of log-based anomaly detection is to identify anomalous log sequences, facilitating the recognition of potential issues within the system's operational behavior.
IV Methodology
In this section, we present our innovative anomaly detection framework, LogLLM. As illustrated in Fig. 3, the log sequence undergoes preprocessing using regular expressions before being fed into a deep neural network that integrates BERT [43], a projector, and Llama [26] for log sequence classification. In the following sections, we will provide detailed insights into log sequence preprocessing, the architecture of the deep model, and the model training procedure.
*(Figure content: preprocessed log messages, e.g., `CE sym <*>, at <*>, mask <*>`, are encoded by the BERT tokenizer and BERT, mapped by the projector into Llama-compatible embedding vectors, and concatenated with the prompt-text embeddings produced by the Llama tokenizer and Llama embedding layer; Llama then answers the question "Is this sequence normal or anomalous?".)*
Figure 3: The framework of LogLLM. Notably, the model includes a single instance of BERT and the projector.
IV-A Preprocessing
The content of a log message includes variable parameters that carry dynamic runtime information. These parameters are irrelevant to the anomalies and complicate deep model training, as demonstrated in Section V-F, so a technique is needed to identify them and replace them with a constant token. Log parsers, such as Drain [51] and Spell [52], are widely adopted in log-based anomaly detection methods and appear to be a suitable technique. However, as noted by Le et al. [3], existing log parsers do not always perform correctly on all log datasets and struggle to handle out-of-vocabulary (OOV) words in new log messages, resulting in a loss of semantic information. When logs are unstable, these parsers become increasingly ineffective over time, making it difficult to support subsequent anomaly detection.
Thanks to the structured log generation process, the textual format of parameters representing specific objects can be easily identified using regular expressions [53]. Consequently, we replace each variable parameter, such as account, directory path, and IP address, with "<*>". Despite its simplicity, this technique offers significant performance advantages. Compared with log parsers, this preprocessing technique is more effective and does not require training.
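The regex-based replacement can be sketched as below. The particular patterns (IP address, hexadecimal value, path, decimal number) are illustrative assumptions; the actual pattern set is dataset-specific and the paper does not enumerate it.

```python
import re

# Illustrative parameter patterns; order matters (more specific first,
# so "0x05" is caught by the hex pattern before the decimal one).
PATTERNS = [
    r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b",  # IP address, optionally with port
    r"\b0x[0-9a-fA-F]+\b",                    # hexadecimal value / address
    r"(?<=\s)/[^\s]+",                        # absolute path
    r"\b\d+\b",                               # decimal number
]

def preprocess(content):
    """Replace each variable parameter in the log content with '<*>'."""
    for pat in PATTERNS:
        content = re.sub(pat, "<*>", content)
    return content
```

For example, `preprocess` maps the BGL message `CE sym 2, at 0x0b85eee0, mask 0x05` to the constant form `CE sym <*>, at <*>, mask <*>` shown in Fig. 3.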
IV-B Model Architecture
As shown in Fig. 3, our deep model consists of three main components: BERT, a projector, and Llama. Both BERT and Llama are pretrained LLMs. BERT is utilized to extract vector representations of log messages, while Llama is employed to classify the log sequences. The projector serves as a bridge, aligning the vector representation spaces of BERT and Llama. It is important to note that our model incorporates only one instance of BERT and one projector.
IV-B 1 BERT
Each preprocessed log message is encoded into a semantic vector using the BERT tokenizer and the BERT model: BERT passes the hidden state of the classification token ([CLS]) through a linear layer followed by a tanh activation function to produce the semantic vector. For a preprocessed log sequence, the output of BERT is a sequence of semantic vectors $C=(c_{1},c_{2},\ldots,c_{N})\in\mathbb{R}^{N\times d_{BERT}}$, where $N$ is the length of the log sequence (i.e., the number of log messages) and $d_{BERT}$ is the dimension of each semantic vector (i.e., the hidden size).
IV-B 2 Projector
The projector is a linear layer that maps the semantic vectors $C\in\mathbb{R}^{N\times d_{BERT}}$ to token embedding vectors accepted by Llama, represented as $E=(e_{1},e_{2},\ldots,e_{N})\in\mathbb{R}^{N\times d_{Llama}}$, where $d_{Llama}$ is the hidden size of Llama. The projector is designed to align the vector representation spaces of BERT and Llama.
IV-B 3 Llama
To conduct prompt tuning on Llama, a transformer decoder-based LLM, we generate corresponding textual queries based on embedded log sequences. Specifically, each query consists of three components.
The first component introduces the log sequence, e.g., "Below is a sequence of system log messages:". The second component comprises the token embeddings $E$ output by the projector. The third component asks whether the sequence is anomalous, e.g., ". Is this sequence normal or anomalous?". The first and third components are fed through the Llama tokenizer and the Llama embedding layer, producing $E_{1}\in\mathbb{R}^{A\times d_{Llama}}$ and $E_{3}\in\mathbb{R}^{Q\times d_{Llama}}$, where $A$ and $Q$ are the numbers of tokens obtained by tokenizing the first and third components, respectively. The token embeddings of the three components are then concatenated as $[E_{1}||E||E_{3}]\in\mathbb{R}^{(A+N+Q)\times d_{Llama}}$ and fed into Llama.
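As a shape-level illustration, the query assembly can be sketched with plain Python lists standing in for embedding matrices; `embed_text` is a toy stand-in for the Llama tokenizer and embedding layer, not the real API:

```python
# Toy sketch of the query assembly: lists of vectors stand in for
# embedding matrices; embed_text is NOT the real tokenizer/embedding API.
d_llama = 4  # hidden size of Llama (toy value)

def embed_text(text: str) -> list[list[float]]:
    """Stand-in for Llama tokenizer + embedding layer: one vector per token."""
    tokens = text.split()
    return [[float(len(tok))] * d_llama for tok in tokens]

# E1: embeddings of the first component (introduction of the sequence).
E1 = embed_text("Below is a sequence of system log messages:")
# E: projected semantic vectors of the N log messages (here N = 3).
E = [[0.0] * d_llama for _ in range(3)]
# E3: embeddings of the third component (the question).
E3 = embed_text(". Is this sequence normal or anomalous?")

# Concatenate along the token axis: [E1 || E || E3] has A + N + Q rows.
inputs_embeds = E1 + E + E3
assert len(inputs_embeds) == len(E1) + len(E) + len(E3)
```

In the actual model the concatenated matrix is passed to Llama as input embeddings rather than token IDs, since the middle block $E$ does not correspond to any tokens in Llama's vocabulary.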
IV-C Training
IV-C 1 Minority Class Oversampling
LogLLM is a supervised anomaly detection method, which means it needs labeled normal and anomalous samples for training. However, supervised anomaly detection methods often face the challenge of data imbalance, which can lead to biased model training. In an anomaly detection task, there are only two classes: normal and anomalous, and the number of instances in each class is uncertain. To cope with data imbalance, we oversample the class with fewer samples, ensuring that the proportion of the minority class is no less than $\beta$ . Formally, let the proportion of the minority class be $\alpha$ and $\alpha<\beta$ , and the total number of samples be $Sample\_num$ . To achieve a proportion of $\beta$ for the minority class, it will be oversampled to the following quantity:
$$
\frac{\beta(1-\alpha)}{1-\beta}\times Sample\_num \tag{1}
$$
This adjustment will make the proportion of the minority class equal to $\beta$ .
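Eq. (1) can be verified numerically; the sketch below computes the oversampled minority count and checks that the resulting proportion equals $\beta$:

```python
def oversampled_minority_count(alpha: float, beta: float, sample_num: int) -> float:
    """Quantity to which the minority class is oversampled (Eq. 1)."""
    assert alpha < beta, "oversampling only applies when the minority share is below beta"
    return beta * (1 - alpha) / (1 - beta) * sample_num

# Example: 10,000 samples, 5% anomalies, target proportion beta = 30%.
alpha, beta, n = 0.05, 0.30, 10_000
m_new = oversampled_minority_count(alpha, beta, n)  # new minority count
total = n - alpha * n + m_new                       # majority + oversampled minority
print(round(m_new), round(m_new / total, 2))
```

The new total is $n(1-\alpha) + m_{new} = n(1-\alpha)/(1-\beta)$, so the minority share works out to exactly $\beta$.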
IV-C 2 Training Objective
Our objective is to train the deep model to predict whether a given log sequence is normal or anomalous. We fine-tune the model to respond appropriately: if the sequence is anomalous, it outputs "The sequence is anomalous."; if normal, it outputs "The sequence is normal.". We utilize cross-entropy loss [54] as our loss function.
IV-C 3 Training Procedure
To train our deep model, we follow three main stages.
Stage 1. Fine-tuning Llama to capture the answer template: The first stage involves fine-tuning Llama to capture the answer template. Specifically, we train Llama to respond to the prompt "Is this sequence normal or anomalous?" with "The sequence is anomalous/normal.". This stage requires only a few data samples.
Stage 2. Training the embedder of log messages: The second stage involves training the embedder of log messages, specifically BERT and the projector. This stage aims to project each log message to the embedding of the most suitable token in Llama, enabling Llama to discern whether the given log sequence is normal or anomalous.
Stage 3. Fine-tuning the entire model: Finally, we fine-tune the entire model to ensure cohesive and accurate performance across all components.
IV-C 4 Efficient Fine-Tuning on LLMs
To reduce the costs involved in fine-tuning LLMs (BERT and Llama) with a substantial number of parameters, we utilize QLoRA [55] to minimize memory usage. QLoRA accomplishes this by backpropagating gradients into a frozen 4-bit quantized model, while maintaining the performance levels achieved during the full 16-bit fine-tuning process.
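As a rough configuration sketch (not the authors' exact settings; the LoRA rank, alpha, and target modules here are assumptions), a QLoRA setup with the Hugging Face `transformers`, `peft`, and `bitsandbytes` stack typically looks like:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization: gradients are backpropagated through a frozen
# quantized base model, and only the LoRA adapter weights are trained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config
)
lora_config = LoraConfig(  # rank/alpha/target modules are illustrative choices
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Because the base model stays frozen in 4-bit precision and only the small low-rank adapters receive gradients, peak GPU memory drops substantially relative to full 16-bit fine-tuning.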
V Experiments
In this section, we conduct extensive experiments on four real-life logs to investigate the following research questions (RQs):
- RQ1: How effective is LogLLM in log-based anomaly detection?
- RQ2: How do different preprocessing techniques impact the performance of LogLLM?
- RQ3: How effective is the embedder for Llama?
- RQ4: How does the size of the Llama model affect the performance of LogLLM?
- RQ5: How does each stage of the three-stage training process influence the performance of LogLLM?
- RQ6: How do different levels of minority class oversampling, determined by the hyperparameter $\beta$ , affect the performance of LogLLM?
LogLLM is coded in Python, and the source code is available at https://github.com/guanwei49/LogLLM.
V-A Benchmark Methods
To verify the superiority of the proposed method, we compare LogLLM with five state-of-the-art semi-supervised methods: DeepLog [8], LogAnomaly [9], PLELog [22], FastLogAD [34], and LogBERT [32]. We also compare it with three supervised methods: LogRobust [19], CNN [18] and NeuralLog [3], and one method that does not require training a deep model but needs some normal samples for retrieval: RAPID [39].
Notably, FastLogAD, LogBERT, NeuralLog, and RAPID adopt LLMs for anomaly detection.
V-B Experimental Settings
We conduct all experiments on a server equipped with an Intel Xeon Gold 6330 CPU (38 cores), 256GB of memory, and an NVIDIA A40 GPU with 48 GB of memory.
In our experiment, we utilize the BERT-base model (https://huggingface.co/google-bert/bert-base-uncased) and the Llama-3-8B model (https://huggingface.co/meta-llama/Meta-Llama-3-8B) as backbones. The hyperparameter $\beta$, described in Section IV-C 1, is set to 30%. We use the AdamW optimizer [56] to train the model with a mini-batch size of 16. Unless otherwise specified, the training procedure is configured as follows: the first stage involves only 1,000 samples with a learning rate of 5e-4; the second and third stages each consist of two epochs with a learning rate of 5e-5.
For a fair comparison, we configure the hyperparameters for all compared methods according to the values provided in their original articles.
V-C Metrics
We evaluate the performance of these methods using the widely adopted $Precision$ , $Recall$ and $F_{1}\!-\!score$ . These metrics are calculated as follows:
$$
Precision=\frac{TP}{TP+FP} \tag{2}
$$
$$
Recall=\frac{TP}{TP+FN} \tag{3}
$$
$$
F_{1}\!-\!score=\frac{2*Precision*Recall}{Precision+Recall} \tag{4}
$$
where $TP$, $FN$, and $FP$ denote true positives, false negatives, and false positives, respectively.
Precision refers to the percentage of correctly detected anomalies among all anomalies identified by the model, while recall represents the percentage of real anomalies that are correctly identified. The $F_{1}$-score combines these two metrics into a single measure, providing a balanced assessment of the model's performance in detecting anomalies.
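Eqs. (2)-(4) translate directly into code:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1-score from confusion counts (Eqs. 2-4)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 90 anomalies detected correctly, 10 false alarms, 30 missed.
p, r, f = prf1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f, 3))
```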
V-D Dataset
TABLE I: The statistics of datasets used in the experiments.
| Dataset | # Log messages | # Log sequences | Training: # Log sequences | Training: # Anomalies | Training: Anomaly ratio | Testing: # Log sequences | Testing: # Anomalies | Testing: Anomaly ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HDFS | 11,175,629 | 575,061 | 460,048 | 13,497 | 2.93% | 115,013 | 3,341 | 2.90% |
| BGL | 4,747,963 | 47,135 | 37,708 | 4,009 | 10.63% | 9,427 | 817 | 8.67% |
| Liberty | 5,000,000 | 50,000 | 40,000 | 34,144 | 85.36% | 10,000 | 651 | 6.51% |
| Thunderbird | 10,000,000 | 99,997 | 79,997 | 837 | 1.05% | 20,000 | 29 | 0.15% |
To evaluate our method for log-based anomaly detection, we selected four public datasets [57]: HDFS, BGL, Liberty, and Thunderbird. The details for each dataset are provided below:
HDFS (Hadoop Distributed File System) dataset [49] is generated by running Hadoop-based MapReduce jobs on over 200 Amazon EC2 nodes and contains a total of 11,175,629 log messages. These log messages are grouped into different log windows based on their block_id, which reflects program executions in the HDFS, resulting in 575,061 blocks. Among these, 16,838 blocks (2.93%) indicate system anomalies.
BGL (Blue Gene/L) dataset [50] is a supercomputing system log dataset collected from a BlueGene/L supercomputer at Lawrence Livermore National Laboratory (LLNL). The dataset contains 4,747,963 log messages, each of which has been manually labeled as either normal or anomalous. There are 348,460 log messages (7.34%) labeled as anomalous.
Thunderbird dataset [50] is a publicly accessible collection of log data sourced from the Thunderbird supercomputer at Sandia National Laboratories (SNL). This dataset consists of both normal and anomalous messages, each of which has been manually categorized. Although the dataset contains over 200 million log messages, we focus on a subset of 10 million continuous log messages for computational efficiency. This subset includes 4,937 anomalous log messages, representing approximately 0.049% of the total.
Liberty dataset [50] comprises system logs from the Liberty supercomputer at Sandia National Laboratories (SNL) in Albuquerque. This supercomputer features 512 processors and 944 GB of memory, and the dataset contains over 200 million log messages. For computational efficiency, we sample 5 million consecutive log messages, among which 1,600,525 are identified as anomalous, constituting approximately 32.01% of the sampled messages.
In the context of HDFS, we adopt a session window strategy, which involves grouping log messages into sequences based on the block_id present in each log message. Each session is labeled using ground truth. For other datasets, including BGL, Thunderbird, and Liberty, we utilize a sliding window strategy to group log messages, with a window size of 100 messages and a step size of 100 messages. A log sequence is deemed anomalous if it contains at least one anomalous log message according to the ground truth.
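The sliding-window grouping and its labeling rule can be sketched as follows (with the window size equal to the step size, the windows are non-overlapping):

```python
def group_into_windows(messages, labels, window_size=100, step=100):
    """Group log messages into fixed-size windows; a window is labeled
    anomalous (1) if it contains at least one anomalous message."""
    sequences = []
    for start in range(0, len(messages), step):
        window = messages[start:start + window_size]
        window_labels = labels[start:start + window_size]
        sequences.append((window, int(any(window_labels))))
    return sequences

# Toy example: 250 messages with a single anomaly at index 120 yield
# 3 windows, of which only the second is anomalous.
msgs = [f"msg {i}" for i in range(250)]
lbls = [0] * 250
lbls[120] = 1
seqs = group_into_windows(msgs, lbls)
print(len(seqs), [label for _, label in seqs])
```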
Similar to existing work [19, 8, 9, 22, 39, 34], we split each dataset into a training set and a testing set with a ratio of 8:2 to evaluate the performance of a log-based anomaly detection approach. For the HDFS dataset, we randomly split the log sequences into training and testing data. In contrast, for the BGL, Thunderbird, and Liberty datasets, we adhere to a chronological split [6]. This strategy ensures that all log sequences in the training set precede those in the testing set, reflecting real-world conditions and mitigating potential data leakage from unstable log data.
Table I summarizes the statistics of the datasets used in the experiments.
V-E Performance Evaluation (RQ1)
TABLE II: Experimental results on HDFS, BGL, Liberty, and Thunderbird datasets. The best results are highlighted in bold.
| Method | Parser | HDFS Prec. | HDFS Rec. | HDFS F1 | BGL Prec. | BGL Rec. | BGL F1 | Liberty Prec. | Liberty Rec. | Liberty F1 | Thunderbird Prec. | Thunderbird Rec. | Thunderbird F1 | Avg. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepLog | ✔ | 0.835 | 0.994 | 0.908 | 0.166 | 0.988 | 0.285 | 0.751 | 0.855 | 0.800 | 0.017 | 0.966 | 0.033 | 0.506 |
| LogAnomaly | ✔ | 0.886 | 0.893 | 0.966 | 0.176 | 0.985 | 0.299 | 0.684 | 0.876 | 0.768 | 0.025 | 0.966 | 0.050 | 0.521 |
| PLELog | ✔ | 0.893 | 0.979 | 0.934 | 0.595 | 0.880 | 0.710 | 0.795 | 0.874 | 0.832 | 0.808 | 0.724 | 0.764 | 0.810 |
| FastLogAD | ✔ | 0.721 | 0.893 | 0.798 | 0.167 | 1.000 | 0.287 | 0.151 | 0.999 | 0.263 | 0.008 | 0.931 | 0.017 | 0.341 |
| LogBERT | ✔ | 0.989 | 0.614 | 0.758 | 0.165 | 0.989 | 0.283 | 0.902 | 0.633 | 0.744 | 0.022 | 0.172 | 0.039 | 0.456 |
| LogRobust | ✔ | 0.961 | 1.000 | 0.980 | 0.696 | 0.968 | 0.810 | 0.695 | 0.979 | 0.813 | 0.318 | 1.000 | 0.482 | 0.771 |
| CNN | ✔ | 0.966 | 1.000 | 0.982 | 0.698 | 0.965 | 0.810 | 0.580 | 0.914 | 0.709 | 0.870 | 0.690 | 0.769 | 0.818 |
| NeuralLog | ✘ | 0.971 | 0.988 | 0.979 | 0.792 | 0.884 | 0.835 | 0.875 | 0.926 | 0.900 | 0.794 | 0.931 | 0.857 | 0.893 |
| RAPID | ✘ | 1.000 | 0.859 | 0.924 | 0.874 | 0.399 | 0.548 | 0.911 | 0.611 | 0.732 | 0.200 | 0.207 | 0.203 | 0.602 |
| LogLLM | ✘ | 0.994 | 1.000 | **0.997** | 0.861 | 0.979 | **0.916** | 0.992 | 0.926 | **0.958** | 0.966 | 0.966 | **0.966** | **0.959** |
Table II presents the experimental results of various log-based anomaly detection methods on the HDFS, BGL, Liberty, and Thunderbird datasets. The best results are highlighted in bold. We have the following observations:
The proposed LogLLM achieves the highest $F_{1}$-score across all datasets. On average, LogLLM's $F_{1}$-scores are 6.6% better than those of the best existing method, NeuralLog, demonstrating its effectiveness in log-based anomaly detection. Despite the adoption of LLMs in FastLogAD, LogBERT, NeuralLog, and RAPID, their performance remains unsatisfactory. FastLogAD and LogBERT utilize BERT, a transformer encoder-based model, to detect anomalies based on log sequence reconstruction errors. Their inputs consist of sequences of log template IDs (IDs of log string templates) extracted from log messages via log parsers, which lack semantic information. In contrast, NeuralLog and RAPID utilize transformer encoder-based models to extract semantic vectors from log messages. However, NeuralLog utilizes smaller models, whereas RAPID relies on distance-based comparison for anomaly sequence classification. LogLLM, on the other hand, leverages both BERT for extracting semantic vectors and Llama, a transformer decoder-based LLM, for anomaly detection. The representation spaces of BERT and Llama are aligned via a projector, fully harnessing the potential of LLMs for log-based anomaly detection.
Moreover, LogLLM achieves a balance between precision and recall, indicating that it maintains low false alarm rates while minimizing missed reports. In contrast, methods like FastLogAD are excessively sensitive to anomalies, often producing numerous false alarms. For example, on the BGL dataset, FastLogAD attains a recall of 1.000 but a precision of only 0.167, making it impractical for real-world use. DeepLog, LogAnomaly, and LogBERT exhibit similar issues. On the other hand, RAPID is not sensitive enough to anomalies, leading to many undetected anomalies: on the BGL dataset, it achieves a precision of 0.874 but a recall of only 0.399.
Effect of labeled anomalies: As illustrated in Table II, in contrast to methods such as DeepLog, LogAnomaly, FastLogAD, LogBERT, and RAPID, which require clean datasets devoid of anomalies to build anomaly detection models, methods like PLELog, LogRobust, CNN, NeuralLog, and LogLLM demonstrate superior performance. These models are trained using not only normal samples but also labeled anomalies. The latter five methods achieve an average $F_{1}$-score above 0.771 across the four datasets, whereas the methods that do not utilize labeled anomalies perform poorly, with average $F_{1}$-scores below 0.602. This demonstrates that incorporating labeled anomalies provides a significant advantage to anomaly detection methods.
TABLE III: Computational cost.
| | Training time (Minutes) | Testing time (Minutes) |
| --- | --- | --- |
| DeepLog | 72.17 | 3.42 |
| LogAnomaly | 156.16 | 7.25 |
| PLELog | 315.47 | 33.59 |
| LogRobust | 108.42 | 2.48 |
| CNN | 98.16 | 2.16 |
| FastLogAD | 254.17 | 0.29 |
| LogBERT | 429.04 | 43.77 |
| NeuralLog | 267.46 | 21.44 |
| RAPID | 63.98 | 38.43 |
| LogLLM | 1,065.15 | 64.48 |
Computational cost: The time consumption of each method is presented in Table III. These results have been averaged across all the datasets.
Although RAPID does not require training a deep model, the extraction and retrieval of vector representations remain time-consuming. In comparison to other methods, FastLogAD requires relatively high training time, but it has the shortest testing time because it uses only the discriminator of the model during testing. As anticipated, while our proposed LogLLM demonstrates the best performance, it also incurs the highest computational cost due to its large number of parameters. However, the testing time of LogLLM remains acceptable when compared to other methods that utilize LLMs, such as LogBERT, NeuralLog, and RAPID.
V-F Different Preprocessing Techniques (RQ2)
TABLE IV: Effects of different preprocessing techniques on HDFS, BGL, Liberty, and Thunderbird datasets. The best results are highlighted in bold.
| Preprocessing | HDFS Prec. | HDFS Rec. | HDFS F1 | BGL Prec. | BGL Rec. | BGL F1 | Liberty Prec. | Liberty Rec. | Liberty F1 | Thunderbird Prec. | Thunderbird Rec. | Thunderbird F1 | Avg. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Raw | 0.994 | 0.991 | 0.993 | 0.943 | 0.767 | 0.846 | 0.911 | 0.908 | 0.909 | 0.806 | 0.862 | 0.833 | 0.895 |
| Template ID | 0.995 | 0.945 | 0.969 | 0.775 | 0.286 | 0.418 | 0.994 | 0.270 | 0.425 | 1.000 | 0.379 | 0.550 | 0.591 |
| Template | 0.991 | 1.000 | 0.995 | 0.861 | 0.919 | 0.889 | 0.968 | 0.931 | 0.949 | 0.950 | 0.655 | 0.776 | 0.902 |
| RE (LogLLM) | 0.994 | 1.000 | 0.997 | 0.861 | 0.979 | 0.916 | 0.992 | 0.926 | 0.958 | 0.966 | 0.966 | 0.966 | 0.959 |
TABLE V: Effects of the embedder (BERT & projector) and the Llama model size, where "Mem." indicates GPU memory usage (GB) and "Tim." indicates training time (minutes); "-" indicates an out-of-memory (OOM) error.
| Model | HDFS Prec. | HDFS Rec. | HDFS F1 | HDFS Mem. | HDFS Tim. | BGL Prec. | BGL Rec. | BGL F1 | BGL Mem. | BGL Tim. | Liberty Prec. | Liberty Rec. | Liberty F1 | Liberty Mem. | Liberty Tim. | Thunderbird Prec. | Thunderbird Rec. | Thunderbird F1 | Thunderbird Mem. | Thunderbird Tim. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L.-1B | 0.986 | 0.995 | 0.991 | 16.5 | 1022.1 | - | - | - | - | - | 0.960 | 0.699 | 0.809 | 42.6 | 443.2 | 1.000 | 0.724 | 0.840 | 44.5 | 1732.1 |
| Emb. & L.-1B | 0.996 | 0.996 | 0.996 | 8.0 | 1412.2 | 0.734 | 0.944 | 0.825 | 32.4 | 187.1 | 0.950 | 0.905 | 0.927 | 29.3 | 173.2 | 0.875 | 0.966 | 0.918 | 32.4 | 715.1 |
| L.-8B | 0.988 | 0.997 | 0.992 | 43.0 | 4712.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Emb. & L.-8B | 0.994 | 1.000 | 0.997 | 16.6 | 2168.2 | 0.861 | 0.979 | 0.916 | 38.0 | 396.2 | 0.992 | 0.926 | 0.958 | 36.1 | 412.1 | 0.966 | 0.966 | 0.966 | 38.2 | 1284.2 |
We evaluate the effectiveness of different preprocessing techniques. The results are shown in Table IV. In this table, "Raw" indicates that the content of log messages is not preprocessed and is directly input into the proposed deep model. "Template" indicates that sequences of log templates produced by Drain [51], a log parser, are used as input for the proposed deep model. "Template ID" signifies that the IDs of log templates, obtained by Drain, are simply encoded into numeric vectors using an embedding layer instead of BERT; this renders the model unable to capture the semantic information within log messages. Notably, the parser Drain is applied to the entire dataset, rather than only the training dataset, to avoid performance degradation due to the OOV problem. "RE" indicates that regular expressions, as introduced in Section IV-A, are used for preprocessing log messages.
As anticipated, the preprocessing technique "RE" yields the highest $F_{1}$-score across all datasets. Conversely, "Template ID" consistently results in the lowest $F_{1}$-score, averaging 36.8% lower than that of "RE". This can be attributed to the fact that "Template ID" hinders the model's ability to capture the semantic information within log messages, thereby impairing its capability to detect anomalies from a natural language perspective. The techniques "Raw" and "Template" achieve relatively good performance, but their $F_{1}$-scores are still 6.4% and 5.7% lower than that of "RE", respectively. With "Raw", the variable parts (parameters carrying dynamic runtime information) within each log message have little bearing on anomaly detection, yet their high randomness can confuse the model, making it difficult to discern anomalies. With "Template", the parser is not always reliable, sometimes incorrectly removing constant parts or retaining variable parts, which leads to information loss or confusion for the model.
V-G Effect of the Embedder (RQ3)
We investigate whether the embedder (BERT and the projector) is necessary for LogLLM. The results are presented in Table V. "L.-1B" refers to directly inputting the log sequence (formed by concatenating log messages, separated by semicolons (;), into a long string) into the Llama-3.2-1B model (https://huggingface.co/meta-llama/Llama-3.2-1B). "Emb. & L.-1B" represents LogLLM built on Llama-3.2-1B.
As expected, with the assistance of the embedder, the model requires less GPU memory, thereby avoiding out-of-memory (OOM) errors. Additionally, it enhances model performance by clarifying the boundaries between messages within a sequence. This improved representation enables the LLM to capture sequential dependencies better.
V-H Effect of the Llama Model Size (RQ4)
As shown in Table V, larger Llama model sizes lead to better performance, at the cost of increased GPU memory usage and longer training times.
On average, compared to using Llama-3.2-1B, adopting Llama-3-8B improves the $F_{1}$-score by 4.3%, but increases GPU memory usage by 7.7 GB and extends training time by 443.2 minutes.
V-I Ablation Study of the Training Procedure (RQ5)
TABLE VI: Ablation study of the training procedure on HDFS, BGL, Liberty, and Thunderbird datasets. The best results are highlighted in bold.
| Variant | HDFS Prec. | HDFS Rec. | HDFS F1 | BGL Prec. | BGL Rec. | BGL F1 | Liberty Prec. | Liberty Rec. | Liberty F1 | Thunderbird Prec. | Thunderbird Rec. | Thunderbird F1 | Avg. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| W/O Stage 1 | 0.991 | 1.000 | 0.995 | 0.578 | 0.971 | 0.725 | 0.685 | 0.290 | 0.408 | 0.381 | 0.828 | 0.522 | 0.662 |
| W/O Stage 2 | 0.994 | 1.000 | 0.997 | 0.858 | 0.920 | 0.888 | 0.995 | 0.906 | 0.949 | 0.848 | 0.966 | 0.903 | 0.934 |
| W/O Stage 1&2 | 0.992 | 1.000 | 0.996 | 0.853 | 0.882 | 0.868 | 0.995 | 0.906 | 0.949 | 0.897 | 0.897 | 0.897 | 0.927 |
| W/O Stage 3 | 0.993 | 0.999 | 0.996 | 0.704 | 0.776 | 0.738 | 1.000 | 0.684 | 0.812 | 0.958 | 0.793 | 0.868 | 0.854 |
| LogLLM | 0.994 | 1.000 | 0.997 | 0.861 | 0.979 | 0.916 | 0.992 | 0.926 | 0.958 | 0.966 | 0.966 | 0.966 | 0.959 |
We investigate the effect of each training stage through an ablation study. The results are presented in Table VI, where "W/O" denotes "without". We have the following observations:
Skipping any training stage results in a decrease in the $F_{1}$-score across all datasets, demonstrating the effectiveness of our three-stage training procedure. Notably, training without Stage 1 leads to the worst performance, with the $F_{1}$-score averaged across all datasets decreasing by as much as 29.7%, whereas training without Stages 1&2 (i.e., only adopting Stage 3, fine-tuning the entire model) yields acceptable performance, with only a 3.2% decrease in the average $F_{1}$-score. This demonstrates that fine-tuning Llama to capture the answer template (Stage 1) is essential before training the embedder (BERT and the projector) of log messages (Stage 2): without Stage 1 (i.e., directly training the embedder), the embedder may be misdirected, resulting in incorrect semantic capture of log messages and model failure. Training without Stage 3 yields relatively poor performance, with an average $F_{1}$-score decrease of 10.5%, indicating that sequentially fine-tuning Llama and training the embedder alone is insufficient for the model to capture anomalous patterns; cohesive fine-tuning of the entire model is essential. Training without Stage 2, or without Stages 1&2, also degrades performance, with average $F_{1}$-score reductions of 2.5% and 3.2%, respectively. This demonstrates that individually training the embedder before fine-tuning the entire model further enhances performance, as this stage allows the embedder to generate better semantic vectors of log messages for Llama to discern anomalies.
In summary, our proposed three-stage training procedure is well-suited for our deep model in log-based anomaly detection.
<details>
<summary>2411.08561v5/x5.png Details</summary>

### Visual Description
A line chart of Precision vs. β (%), with β ranging from 0 to 80 and precision from 0.00 to 1.00; the legend sits in the bottom-right corner (HDFS: black squares, BGL: red circles, Liberty: blue triangles, Thunderbird: green triangles).
- **HDFS**: flat at 1.00 across all β values.
- **Liberty**: stable around 0.98-0.99 with only minor fluctuations.
- **BGL**: about 0.85 at low β, peaking at β = 40% (≈0.88), then declining to ≈0.78 at β = 80%.
- **Thunderbird**: 0.00 at β = 0%, jumping to ≈0.96 at β = 10%, holding near 0.96 through β = 50%, then gradually declining to ≈0.85 at β = 80%.
</details>
(a) Precision
<details>
<summary>2411.08561v5/x6.png Details</summary>

### Visual Description
A line chart of Recall vs. β (%), with β ranging from 0 to 80 and recall from 0.00 to 1.00; the legend sits in the bottom-right corner.
- **HDFS** (gray squares): constant at 1.00 for all β values.
- **BGL** (red circles): ≈0.90 at β = 0%, rising sharply to ≈0.98 at β = 30% and plateauing thereafter.
- **Liberty** (blue triangles): gradual decline from ≈0.92 at β = 0% to ≈0.90 at β = 80%.
- **Thunderbird** (green triangles): 0.00 at β = 0%, rising rapidly to ≈0.97 by β = 30% and then stabilizing.
</details>
(b) Recall
<details>
<summary>2411.08561v5/x7.png Details</summary>

### Visual Description
A line chart of F1-score vs. β (%), with β ranging from 0 to 80 and F1-score from 0.00 to 1.00; the legend sits in the bottom-right corner.
- **HDFS** (black squares): flat at 1.00 across all β values.
- **BGL** (red circles): ≈0.85 at low β, peaking at β = 30% (≈0.92), then declining back to ≈0.85 at β = 80%.
- **Liberty** (blue triangles): consistently ≈0.95 across all β values.
- **Thunderbird** (green triangles): 0.00 at β = 0%, rising sharply to ≈0.95 at β = 10%, stable through β = 50%, then declining to ≈0.85 at β = 80%.
</details>
(c) F1-score
<details>
<summary>2411.08561v5/x8.png Details</summary>
[Line chart: training time (0–2500 minutes) vs. β (%) (0–80) for HDFS, BGL, Liberty, and Thunderbird; legend in the top-right corner. Training time rises with β for all datasets; the increase is steepest for HDFS (≈400 → ≈2200 min) and Thunderbird (≈200 → ≈1200 min), and mildest for BGL.]
</details>
(d) Training time
Figure 4: Impact of minority class oversampling.
V-J Impact of Minority Class Oversampling (RQ6)
Note that normal and anomalous samples in the training dataset are imbalanced, as shown in Table I. For the HDFS, BGL, and Thunderbird datasets, normal samples outnumber anomalous ones; conversely, in the Liberty dataset, anomalous samples outnumber normal ones. As described in Section IV-C1, the hyper-parameter $\beta$ controls the proportion of the minority class through oversampling, addressing the data-imbalance problem. In this section, we investigate the impact of $\beta$ by varying its value. Fig. 4 illustrates the performance of LogLLM on the four datasets under different values of $\beta$. When $\beta=0$, the samples are not oversampled; instead, the original datasets are used directly for training.
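The oversampling step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes minority-class samples are duplicated uniformly at random until they constitute a fraction $\beta$ of the combined training set, and the function and variable names (`oversample_minority`, etc.) are hypothetical.

```python
import random

def oversample_minority(majority, minority, beta, seed=0):
    """Duplicate minority-class samples until they make up a fraction
    `beta` of the combined training set (illustrative sketch; assumes
    0 <= beta < 1 and a non-empty minority list)."""
    rng = random.Random(seed)
    n_major = len(majority)
    # Solve n_target / (n_major + n_target) = beta for the minority count.
    n_target = int(round(beta * n_major / (1.0 - beta)))
    n_target = max(n_target, len(minority))  # never discard original samples
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return majority + minority + extra

# With beta = 0.3, 1000 majority samples are paired with 429 minority
# samples (2 originals plus 427 random duplicates).
train = oversample_minority(list(range(1000)), ["anom1", "anom2"], beta=0.3)
```

With $\beta = 0$ the helper returns the original dataset unchanged, matching the no-oversampling baseline above.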
As illustrated in Fig. 4b, recall increases with $\beta$ for the HDFS, BGL, and Thunderbird datasets, while it decreases for the Liberty dataset. This is because, for the HDFS, BGL, and Thunderbird datasets, increasing $\beta$ oversamples the anomalies, making the model more prone to labeling samples as anomalous. In contrast, for the Liberty dataset, increasing $\beta$ oversamples the normal samples, making the model more prone to labeling samples as normal.
As illustrated in Fig. 4c, the trend of the F1-score is broadly similar across all datasets: it first increases and then decreases as $\beta$ grows. However, LogLLM does not appear sensitive to $\beta$; when $\beta$ is between 10% and 80%, the F1-score varies by no more than 0.07. Thanks to the substantial semantic knowledge embedded in LLMs, a trained model can effectively learn anomalous patterns and detect anomalies even when the minority class constitutes only 10% of the dataset. However, LogLLM appears unable to handle extremely imbalanced scenarios. For instance, in the Thunderbird dataset, anomalies constitute only 1.05% of the samples, causing the trained model to be biased and classify all samples as normal. As a result, precision, recall, and F1-score are all equal to 0.
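The collapse to zero under extreme imbalance follows directly from the metric definitions: a model that labels every sample normal produces no true positives. A small sketch, not from the paper, where `prf1` is a hypothetical helper and undefined 0/0 ratios are reported as 0 by the usual convention:

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for the anomaly (positive) class.
    Undefined ratios (0/0) are reported as 0, the usual convention."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A biased model that labels every sample normal (False) on a dataset
# with 1.05% anomalies scores zero on all three metrics.
y_true = [True] * 105 + [False] * 9895
y_pred = [False] * 10000
print(prf1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```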
Compared to the BGL and Thunderbird datasets, the precision, recall, and F1-score for the HDFS and Liberty datasets exhibit minimal variation with respect to $\beta$. This consistency arises from the more distinct patterns separating abnormal from normal samples in the HDFS and Liberty datasets, which allows LogLLM to differentiate them easily regardless of the ratio of normal to abnormal samples.
As anticipated, as $\beta$ increases, the training time also increases, as shown in Fig. 4d. This relationship arises because a higher $\beta$ leads to more oversampled data samples, as indicated by equation (1), thereby enlarging the training dataset.
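Under an oversampling scheme that inflates the minority class to a fraction $\beta$ of the total, the post-oversampling dataset size grows roughly as $n_{\text{majority}}/(1-\beta)$, which is why training time climbs with $\beta$. A quick illustrative sketch of this assumed relationship (consistent with, but not quoted from, equation (1)):

```python
def training_set_size(n_majority, beta):
    """Approximate post-oversampling dataset size when the minority
    class is inflated to a fraction `beta` of the total (illustrative
    assumption, not the paper's exact formula)."""
    return round(n_majority / (1.0 - beta))

# The training set doubles between beta = 0 and beta = 0.5 and grows
# five-fold at beta = 0.8, mirroring the rising training-time curves.
for beta in (0.0, 0.3, 0.5, 0.8):
    print(beta, training_set_size(100_000, beta))
```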
To summarize, minority class oversampling is essential; however, the exact value of the hyper-parameter $\beta$ does not significantly impact the performance of LogLLM, so careful tuning is unnecessary. Moreover, excessively large values of $\beta$ are undesirable, as they prolong training. Values between 30% and 50% are acceptable.
VI Conclusion
In this paper, we propose LogLLM, a novel log-based anomaly detection framework that leverages LLMs. LogLLM employs both transformer encoder-based and decoder-based LLMs, specifically BERT and Llama, for log-based anomaly detection. BERT is utilized to extract semantic vectors from log messages, while Llama is used to classify log sequences. To ensure coherence in log semantics, we introduce a projector that aligns the vector representation spaces of BERT and Llama. LogLLM is trained using an innovative three-stage procedure designed to enhance both performance and adaptability. Extensive experiments conducted on four public real-world datasets demonstrate that LogLLM achieves remarkable performance. Subsequent ablation studies further confirm the effectiveness of our three-stage training procedure.
References
- [1] R. S. Kazemzadeh and H.-A. Jacobsen, "Reliable and highly available distributed publish/subscribe service," in 2009 28th IEEE International Symposium on Reliable Distributed Systems. IEEE, 2009, pp. 41–50.
- [2] E. Bauer and R. Adams, Reliability and availability of cloud computing. John Wiley & Sons, 2012.
- [3] V.-H. Le and H. Zhang, "Log-based anomaly detection without log parsing," in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 492–504.
- [4] W. Guan, J. Cao, H. Zhao, Y. Gu, and S. Qian, "Survey and benchmark of anomaly detection in business processes," IEEE Transactions on Knowledge and Data Engineering, pp. 1–23, 2024.
- [5] S. Zhang, Y. Ji, J. Luan, X. Nie, Z. Chen, M. Ma, Y. Sun, and D. Pei, "End-to-end automl for unsupervised log anomaly detection," Automated Software Engineering (ASE'24), 2024.
- [6] V.-H. Le and H. Zhang, "Log-based anomaly detection with deep learning: How far are we?" in Proceedings of the 44th international conference on software engineering, 2022, pp. 1356–1367.
- [7] J. Qi, S. Huang, Z. Luan, S. Yang, C. Fung, H. Yang, D. Qian, J. Shang, Z. Xiao, and Z. Wu, "Loggpt: Exploring chatgpt for log-based anomaly detection," in 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys). IEEE, 2023, pp. 273–280.
- [8] M. Du, F. Li, G. Zheng, and V. Srikumar, "Deeplog: Anomaly detection and diagnosis from system logs through deep learning," in Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, 2017, pp. 1285–1298.
- [9] W. Meng, Y. Liu, Y. Zhu, S. Zhang, D. Pei, Y. Liu, Y. Chen, R. Zhang, S. Tao, P. Sun et al., "Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs," in IJCAI, vol. 19, no. 7, 2019, pp. 4739–4745.
- [10] L. Zhang, W. Li, Z. Zhang, Q. Lu, C. Hou, P. Hu, T. Gui, and S. Lu, "Logattn: Unsupervised log anomaly detection with an autoencoder based attention mechanism," in International conference on knowledge science, engineering and management. Springer, 2021, pp. 222–235.
- [11] M. Catillo, A. Pecchia, and U. Villano, "Autolog: Anomaly detection by deep autoencoding of system logs," Expert Systems with Applications, vol. 191, p. 116263, 2022.
- [12] Y. Xie and K. Yang, "Log anomaly detection by adversarial autoencoders with graph feature fusion," IEEE Transactions on Reliability, 2023.
- [13] X. Zhang, X. Chai, M. Yu, and D. Qiu, "Anomaly detection model for log based on lstm network and variational autoencoder," in 2023 4th International Conference on Information Science, Parallel and Distributed Systems (ISPDS). IEEE, 2023, pp. 239–244.
- [14] X. Duan, S. Ying, W. Yuan, H. Cheng, and X. Yin, "A generative adversarial networks for log anomaly detection," Comput. Syst. Sci. Eng., vol. 37, no. 1, pp. 135–148, 2021.
- [15] Z. He, Y. Tang, K. Zhao, J. Liu, and W. Chen, "Graph-based log anomaly detection via adversarial training," in International Symposium on Dependable Software Engineering: Theories, Tools, and Applications. Springer, 2023, pp. 55–71.
- [16] C. Zhang, X. Wang, H. Zhang, J. Zhang, H. Zhang, C. Liu, and P. Han, "Layerlog: Log sequence anomaly detection based on hierarchical semantics," Applied Soft Computing, vol. 132, p. 109860, 2023.
- [17] S. Hashemi and M. Mäntylä, "Onelog: towards end-to-end software log anomaly detection," Automated Software Engineering, vol. 31, no. 2, p. 37, 2024.
- [18] S. Lu, X. Wei, Y. Li, and L. Wang, "Detecting anomaly in big data system logs using convolutional neural network," in 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE, 2018, pp. 151–158.
- [19] X. Zhang, Y. Xu, Q. Lin, B. Qiao, H. Zhang, Y. Dang, C. Xie, X. Yang, Q. Cheng, Z. Li et al., "Robust log-based anomaly detection on unstable log data," in Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, 2019, pp. 807–817.
- [20] Y. Xie, H. Zhang, and M. A. Babar, "Loggd: Detecting anomalies from system logs with graph neural networks," in 2022 IEEE 22nd International conference on software quality, reliability and security (QRS). IEEE, 2022, pp. 299–310.
- [21] Z. Zhao, W. Niu, X. Zhang, R. Zhang, Z. Yu, and C. Huang, "Trine: Syslog anomaly detection with three transformer encoders in one generative adversarial network," Applied Intelligence, vol. 52, no. 8, pp. 8810–8819, 2022.
- [22] L. Yang, J. Chen, Z. Wang, W. Wang, J. Jiang, X. Dong, and W. Zhang, "Semi-supervised log-based anomaly detection via probabilistic label estimation," in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 1448–1460.
- [23] S. Hochreiter, "Long short-term memory," Neural Computation MIT-Press, 1997.
- [24] A. Vaswani, "Attention is all you need," Advances in Neural Information Processing Systems, 2017.
- [25] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
- [26] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.
- [27] T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Rojas, G. Feng, H. Zhao, H. Lai et al., "Chatglm: A family of large language models from glm-130b to glm-4 all tools," arXiv preprint arXiv:2406.12793, 2024.
- [28] W. Guan, J. Cao, J. Gao, H. Zhao, and S. Qian, "Dabl: Detecting semantic anomalies in business processes using large language models," arXiv preprint arXiv:2406.15781, 2024.
- [29] Y. Liu, S. Tao, W. Meng, F. Yao, X. Zhao, and H. Yang, "Logprompt: Prompt engineering towards zero-shot and interpretable log analysis," in Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 2024, pp. 364–365.
- [30] C. Egersdoerfer, D. Zhang, and D. Dai, "Early exploration of using chatgpt for log-based anomaly detection on parallel file systems logs," in Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 2023, pp. 315–316.
- [31] J. Pan, W. S. Liang, and Y. Yidi, "Raglog: Log anomaly detection using retrieval augmented generation," in 2024 IEEE World Forum on Public Safety Technology (WFPST). IEEE, 2024, pp. 169–174.
- [32] H. Guo, S. Yuan, and X. Wu, "Logbert: Log anomaly detection via bert," in 2021 international joint conference on neural networks (IJCNN). IEEE, 2021, pp. 1–8.
- [33] Y. Lee, J. Kim, and P. Kang, "Lanobert: System log anomaly detection based on bert masked language model," Applied Soft Computing, vol. 146, p. 110689, 2023.
- [34] Y. Lin, H. Deng, and X. Li, "Fastlogad: Log anomaly detection with mask-guided pseudo anomaly generation and discrimination," arXiv preprint arXiv:2404.08750, 2024.
- [35] C. Almodovar, F. Sabrina, S. Karimi, and S. Azad, "Logfit: Log anomaly detection using fine-tuned language models," IEEE Transactions on Network and Service Management, 2024.
- [36] S. Chen and H. Liao, "Bert-log: Anomaly detection for system logs based on pre-trained language model," Applied Artificial Intelligence, vol. 36, no. 1, p. 2145642, 2022.
- [37] J. L. Adeba, D.-H. Kim, and J. Kwak, "Sarlog: Semantic-aware robust log anomaly detection via bert-augmented contrastive learning," IEEE Internet of Things Journal, 2024.
- [38] Y. Fu, K. Liang, and J. Xu, "Mlog: Mogrifier lstm-based log anomaly detection approach using semantic representation," IEEE Transactions on Services Computing, vol. 16, no. 5, pp. 3537–3549, 2023.
- [39] G. No, Y. Lee, H. Kang, and P. Kang, "Training-free retrieval-based log anomaly detection with pre-trained language model considering token-level information," Engineering Applications of Artificial Intelligence, vol. 133, p. 108613, 2024.
- [40] F. Hadadi, Q. Xu, D. Bianculli, and L. Briand, "Anomaly detection on unstable logs with gpt models," arXiv preprint arXiv:2406.07467, 2024.
- [41] M. Burtsev, M. Reeves, and A. Job, "The working limitations of large language models," MIT Sloan Management Review, vol. 65, no. 2, pp. 8–10, 2024.
- [42] A. Joulin, "Fasttext.zip: Compressing text classification models," arXiv preprint arXiv:1612.03651, 2016.
- [43] J. Devlin, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
- [44] Y. Liu, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
- [45] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, "Spanbert: Improving pre-training by representing and predicting spans," Transactions of the association for computational linguistics, vol. 8, pp. 64–77, 2020.
- [46] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive nlp tasks," Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
- [47] J. Zhu, S. He, J. Liu, P. He, Q. Xie, Z. Zheng, and M. R. Lyu, "Tools and benchmarks for automated log parsing," in 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 2019, pp. 121–130.
- [48] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, "An evaluation study on log parsing and its use in log mining," in 2016 46th annual IEEE/IFIP international conference on dependable systems and networks (DSN). IEEE, 2016, pp. 654–661.
- [49] W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan, "Online system problem detection by mining patterns of console logs," in 2009 ninth IEEE international conference on data mining. IEEE, 2009, pp. 588–597.
- [50] A. Oliner and J. Stearley, "What supercomputers say: A study of five system logs," in 37th annual IEEE/IFIP international conference on dependable systems and networks (DSN'07). IEEE, 2007, pp. 575–584.
- [51] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, "Drain: An online log parsing approach with fixed depth tree," in 2017 IEEE international conference on web services (ICWS). IEEE, 2017, pp. 33–40.
- [52] M. Du and F. Li, "Spell: Streaming parsing of system event logs," in 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016, pp. 859–864.
- [53] V.-H. Le and H. Zhang, "Log parsing with prompt-based few-shot learning," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 2438–2449.
- [54] J. S. Bridle, "Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition," in Neurocomputing: Algorithms, architectures and applications. Springer, 1990, pp. 227–236.
- [55] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "Qlora: Efficient finetuning of quantized llms," Advances in Neural Information Processing Systems, vol. 36, 2024.
- [56] I. Loshchilov, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
- [57] J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, "Loghub: A large collection of system log datasets for ai-driven log analytics," in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2023, pp. 355–366.