# Apache-BigData-SIEM 🛡️🏛️

License: Apache 2.0 · PRs Welcome

Apache-BigData-SIEM is a high-performance, scalable security analytics platform designed to overcome the volume and cost limitations of traditional SIEM solutions. By leveraging a Data Lakehouse architecture, it provides both real-time stream processing and deep historical forensic capabilities.

Please check our Contributing Guidelines and Code of Conduct if you wish to help improve this project!

## 🚀 Architectural Overview

The project implements a modern Lakehouse pattern to ensure data is processed instantly and stored in an optimized format for long-term security analysis.

- Ingestion: Apache Kafka – distributed message buffer handling high Events Per Second (EPS) bursts.
- Processing: Apache Spark Streaming – real-time log parsing (regex + JSON), normalization, and correlation.
- Cataloging: Apache Hive – structured SQL interface over unstructured log data in HDFS.
- Storage: Apache Hadoop (HDFS) – Data Lake backbone, storing logs in Apache Parquet format.
- Visualization: Apache Superset – SOC dashboards, SQL Lab, and scheduled alerts.
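
The parse-and-normalize step that the Spark layer applies to each record can be sketched in plain Python. The combined-style access-log regex and field names below are illustrative assumptions, not the project's actual schema; the real logic lives in `jobs/etl_process.py`:

```python
import re

# Illustrative regex for an Apache combined-style access-log line.
# The real ETL job's patterns and target schema may differ.
WEB_LOG = re.compile(
    r'(?P<src_ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def normalize(raw):
    """Parse one raw log line into a flat, typed record (None if unparseable)."""
    m = WEB_LOG.match(raw)
    if m is None:
        return None  # a real pipeline might route these to a dead-letter topic
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec

line = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /login HTTP/1.1" 401 217'
print(normalize(line)["status"])  # 401
```

In the platform itself this logic runs distributed (for example as a PySpark function mapped over the Kafka stream) before the normalized frame is written to HDFS as Parquet.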

### Platform Architecture Diagram

```mermaid
graph TD
    subgraph FLOG["📡 Log Generators (Flog)"]
        FW["Flog Web<br/><i>web-logs</i>"]
        FS["Flog Syslog<br/><i>syslogs</i>"]
        FA["Flog App<br/><i>app-logs</i>"]
        FE["Flog WinEvent<br/><i>win-event-logs</i>"]
    end

    subgraph KAFKA["📨 Messaging Layer"]
        KB["Kafka Broker<br/><b>KRaft Mode</b><br/>:9092"]
    end

    subgraph SPARK["⚙️ Processing Layer (Spark)"]
        SM["Spark Master<br/>:8080 · :7077 · :10000"]
        SW1["Spark Worker 1<br/>2G RAM · 2 Cores<br/>:8081"]
        SW2["Spark Worker 2<br/>2G RAM · 2 Cores<br/>:8082"]
    end

    subgraph HIVE["🗂️ Data Cataloging (Hive)"]
        HM["Hive Metastore<br/>:9083"]
        HS2["Hive Server2<br/>:10001 · :10002"]
    end

    subgraph HDFS["💾 Distributed Storage (HDFS)"]
        NN["Namenode<br/>:9870 · :8020"]
        DN1["Datanode 1"]
        DN2["Datanode 2"]
    end

    subgraph DB["🗄️ Unified Database"]
        PG[("Postgres")]
    end

    subgraph VIZ["📊 Analytics & Visualization"]
        SRED["Superset Redis"]
        SUP["Superset<br/>:8088"]
    end

    FW -->|"web-logs"| KB
    FS -->|"syslogs"| KB
    FA -->|"app-logs"| KB
    FE -->|"win-event-logs"| KB
    KB -->|"Stream Consume"| SM
    SM --- SW1
    SM --- SW2
    SM -->|"Parquet Write"| NN
    NN --- DN1
    NN --- DN2
    HM -->|"Metadata DB"| PG
    HM -.->|"Warehouse location"| NN
    SUP -->|"Superset DB"| PG
    SUP ==>|"SQL Query"| SM
    SM ==>|"Data Read (Parquet)"| NN
    SM <-->|"Hive Metadata"| HM
    SUP --- SRED

    style FLOG fill:#f8fafc,stroke:#00b894,color:#0f172a
    style KAFKA fill:#f8fafc,stroke:#e17055,color:#0f172a
    style SPARK fill:#f8fafc,stroke:#fdcb6e,color:#0f172a
    style HIVE fill:#f8fafc,stroke:#74b9ff,color:#0f172a
    style HDFS fill:#f8fafc,stroke:#a29bfe,color:#0f172a
    style DB fill:#f8fafc,stroke:#636e72,color:#0f172a
    style VIZ fill:#f8fafc,stroke:#fd79a8,color:#0f172a
```

## 💡 Key Features

- Real-time Threat Detection: immediate anomaly detection and alerting using Spark's windowing functions.
- Advanced Threat Hunting: high-speed SQL queries over billions of rows using Spark SQL and Hive.
- Lakehouse Efficiency: combines the flexibility of a Data Lake with the structural performance of a Data Warehouse.
- Cost-Effective Scalability: built entirely on the open-source Apache ecosystem, eliminating expensive per-terabyte licensing.
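
The windowed-detection idea behind the first feature can be illustrated with a toy, single-process sketch; Spark Structured Streaming performs the equivalent as windowed `groupBy` aggregations over the live stream. The 5-minute window, the 401 status code, and the threshold of 3 are made-up parameters, not the project's tuned rules:

```python
from collections import deque
from datetime import datetime, timedelta

def failed_login_bursts(events, window=timedelta(minutes=5), threshold=3):
    """Yield (src_ip, ts) each time an IP reaches `threshold` failed
    logins inside a sliding `window` - a toy windowed count."""
    recent = {}  # src_ip -> timestamps of recent failures, oldest first
    for ts, src_ip, status in sorted(events):
        if status != 401:          # only count failed logins
            continue
        q = recent.setdefault(src_ip, deque())
        q.append(ts)
        while ts - q[0] > window:  # evict events older than the window
            q.popleft()
        if len(q) >= threshold:
            yield src_ip, ts

t0 = datetime(2024, 10, 10, 13, 55)
events = [(t0 + timedelta(minutes=i), "198.51.100.9", 401) for i in range(3)]
print(list(failed_login_bursts(events)))  # one alert for 198.51.100.9
```

At scale the same count runs continuously and in parallel on the Spark workers, with the window boundaries managed by the streaming engine rather than by hand.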

## 🛠️ Technology Stack

- Messaging: Apache Kafka 3.7.x (KRaft mode – no ZooKeeper)
- Processing Engine: Apache Spark 3.5.x (PySpark)
- Data Warehouse: Apache Hive 4.0.x
- Distributed Storage: Apache Hadoop 3.2.x (HDFS)
- Visualization: Apache Superset 4.1.x
- Environment: Docker & Docker Compose

## 📂 Quick Start

We provide a simple `Makefile` wrapper for all Docker and Spark commands. If you do not have `make` installed, open the `Makefile` and run the underlying `docker compose` and `docker exec` commands directly.

### 1) Start the Platform

```bash
make up
# or: docker compose up -d
```

This will deploy:

- Kafka (KRaft mode – no ZooKeeper dependency)
- Hadoop HDFS (1 NameNode + 2 DataNodes)
- Hive Metastore + HiveServer2 + PostgreSQL metastore DB
- Spark (1 Master + 2 Workers + Thrift Server on :10000)
- Superset + Redis (SOC dashboards at :8088)
- 4 distributed flog producers (web-logs, syslogs, app-logs, win-event-logs)

Web UIs after startup:

| Service | URL |
| --- | --- |
| Spark Master | http://localhost:8080 |
| HDFS NameNode | http://localhost:9870 |
| Superset | http://localhost:8088 |
| Spark Worker 1 | http://localhost:8081 |
| Spark Worker 2 | http://localhost:8082 |

### 2) Run the ETL Job

```bash
make run-job
# or: docker exec -it spark-master spark-submit ...
```

### 3) Validate Distributed Health

Use the operational runbooks in `docs/` to verify:

- `docs/verification-guide.md` – HDFS, Hive, Kafka, Spark health checks
- `docs/superset-guide.md` – Superset connection, query examples, and dashboard setup
- `docs/EXAMPLE_QUERIES.md` – raw SQL threat-hunting queries
- `docs/CAPACITY.md` – sizing recommendations
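
To give a flavor of the kind of query `docs/EXAMPLE_QUERIES.md` collects, a brute-force sweep over the normalized web logs might look like the sketch below. The table and column names are illustrative assumptions, not the project's actual schema:

```sql
-- Illustrative only: table and column names depend on the ETL schema.
-- Top sources of failed logins over the last day.
SELECT src_ip,
       COUNT(*) AS failed_logins
FROM web_logs
WHERE status = 401
  AND event_date >= date_sub(current_date(), 1)
GROUP BY src_ip
HAVING COUNT(*) > 100
ORDER BY failed_logins DESC
LIMIT 20;
```

Because the logs sit in Parquet behind the Hive metastore, the same statement can be run from Superset's SQL Lab or through the Spark Thrift Server.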

πŸ“ Project Structure

- `docker-compose.yml` – Full platform stack definition
- `Makefile` – Convenience wrapper for common commands
- `config/hadoop/core-site.xml` – HDFS client configuration
- `config/hadoop/hdfs-site.xml` – HDFS replication settings
- `config/hive/hive-site.xml` – Hive Metastore connection
- `config/spark/spark-defaults.conf` – Spark tuning
- `flog/Dockerfile` – Log generator image
- `flog/publish_flog.sh` – Flog → Kafka producer script
- `jobs/etl_process.py` – Kafka → Hive Parquet ETL job
- `jobs/detection_rules.py` – Spark SQL-based SIEM detection engine
- `docs/verification-guide.md` – Operational health runbook
- `docs/superset-guide.md` – Superset connection & SOC dashboard guide
- `docs/EXAMPLE_QUERIES.md` – SQL threat-hunting query examples
- `docs/CAPACITY.md` – Storage and memory sizing guide
- `research/` – Technical deep-dive documents for each component
- `showcase/` – Interactive HTML presentation of the platform