Apache-BigData-SIEM is a high-performance, scalable security analytics platform designed to overcome the volume and cost limitations of traditional SIEM solutions. By leveraging a Data Lakehouse architecture, it provides both real-time stream processing and deep historical forensic capabilities.
Please check our Contributing Guidelines and Code of Conduct if you wish to help improve this project!
The project implements a modern Lakehouse pattern: logs are processed in near real time as they arrive and persisted in a columnar format optimized for long-term security analysis.
- Ingestion: Apache Kafka – distributed message buffer handling high Events-Per-Second (EPS) bursts.
- Processing: Apache Spark Streaming – real-time log parsing (regex + JSON), normalization, and correlation.
- Cataloging: Apache Hive – structured SQL interface over unstructured log data in HDFS.
- Storage: Apache Hadoop (HDFS) – Data Lake backbone, storing logs in Apache Parquet format.
- Visualization: Apache Superset – SOC dashboards, SQL Lab, and scheduled alerts.
```mermaid
graph TD
    subgraph FLOG["Log Generators (Flog)"]
        FW["Flog Web<br/><i>web-logs</i>"]
        FS["Flog Syslog<br/><i>syslogs</i>"]
        FA["Flog App<br/><i>app-logs</i>"]
        FE["Flog WinEvent<br/><i>win-event-logs</i>"]
    end
    subgraph KAFKA["Messaging Layer"]
        KB["Kafka Broker<br/><b>KRaft Mode</b><br/>:9092"]
    end
    subgraph SPARK["Processing Layer (Spark)"]
        SM["Spark Master<br/>:8080 · :7077 · :10000"]
        SW1["Spark Worker 1<br/>2G RAM · 2 Cores<br/>:8081"]
        SW2["Spark Worker 2<br/>2G RAM · 2 Cores<br/>:8082"]
    end
    subgraph HIVE["Data Cataloging (Hive)"]
        HM["Hive Metastore<br/>:9083"]
        HS2["Hive Server2<br/>:10001 · :10002"]
    end
    subgraph HDFS["Distributed Storage (HDFS)"]
        NN["Namenode<br/>:9870 · :8020"]
        DN1["Datanode 1"]
        DN2["Datanode 2"]
    end
    subgraph DB["Unified Database"]
        PG[("Postgres")]
    end
    subgraph VIZ["Analytics & Visualization"]
        SRED["Superset Redis"]
        SUP["Superset<br/>:8088"]
    end
    FW -->|"web-logs"| KB
    FS -->|"syslogs"| KB
    FA -->|"app-logs"| KB
    FE -->|"win-event-logs"| KB
    KB -->|"Stream Consume"| SM
    SM --- SW1
    SM --- SW2
    SM -->|"Parquet Write"| NN
    NN --- DN1
    NN --- DN2
    HM -->|"Metadata DB"| PG
    HM -.->|"Warehouse location"| NN
    SUP -->|"Superset DB"| PG
    SUP ==>|"SQL Query"| SM
    SM ==>|"Data Read (Parquet)"| NN
    SM <-->|"Hive Metadata"| HM
    SUP --- SRED
    style FLOG fill:#f8fafc,stroke:#00b894,color:#0f172a
    style KAFKA fill:#f8fafc,stroke:#e17055,color:#0f172a
    style SPARK fill:#f8fafc,stroke:#fdcb6e,color:#0f172a
    style HIVE fill:#f8fafc,stroke:#74b9ff,color:#0f172a
    style HDFS fill:#f8fafc,stroke:#a29bfe,color:#0f172a
    style DB fill:#f8fafc,stroke:#636e72,color:#0f172a
    style VIZ fill:#f8fafc,stroke:#fd79a8,color:#0f172a
```
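To make the parsing stage in the Processing layer concrete, the sketch below normalizes a combined-format web log line (regex) and a JSON app log into one event shape. This is a minimal stand-in for the real logic in `jobs/etl_process.py`; the regex, field names, and `source` tag here are illustrative assumptions, not the project's actual schema.

```python
# Illustrative sketch only -- the production parsing lives in jobs/etl_process.py.
# The regex and normalized field names are assumptions for demonstration.
import json
import re

# Apache/nginx "combined"-style access-log pattern (what a web-logs line looks like).
WEB_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def normalize(raw: str):
    """Parse one raw log line (JSON app log or combined web log)
    into a common event dict, or return None if it matches neither."""
    raw = raw.strip()
    if raw.startswith("{"):               # JSON-formatted app log
        event = json.loads(raw)
        return {"source": "app", **event}
    m = WEB_LOG.match(raw)                # plain-text web access log
    if m:
        event = m.groupdict()
        event["status"] = int(event["status"])
        event["bytes"] = int(event["bytes"])
        return {"source": "web", **event}
    return None                           # unparseable -> dead-letter queue

print(normalize('10.0.0.5 - - [01/Jan/2025:00:00:01 +0000] "GET /login HTTP/1.1" 401 512'))
```

In the real pipeline this logic runs per Kafka record inside Spark Streaming before the Parquet write; unparseable lines should be routed aside rather than dropped silently.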
- Real-time Threat Detection: Immediate anomaly detection and alerting using Spark's windowing functions.
- Advanced Threat Hunting: High-speed SQL queries over billions of rows using Spark SQL and Hive.
- Lakehouse Efficiency: Combines the flexibility of a Data Lake with the structural performance of a Data Warehouse.
- Cost-Effective Scalability: Built entirely on the open-source Apache ecosystem, eliminating expensive per-terabyte licensing.
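To illustrate the windowed detection idea, here is the brute-force-login rule expressed in plain Python. The Spark engine in `jobs/detection_rules.py` would state the same logic declaratively with `window()`/`groupBy` over event time; the 60-second window and 5-failure threshold below are assumptions, not the project's tuned values.

```python
# Plain-Python illustration of a sliding-window brute-force rule.
# In production this is expressed with Spark windowing; the threshold
# and window size here are illustrative assumptions.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 5          # failed logins per source IP per window

def detect_bruteforce(events):
    """events: iterable of (epoch_seconds, ip, http_status) tuples in time order.
    Yields (ts, ip) whenever an IP reaches THRESHOLD failures within the window."""
    recent = defaultdict(deque)               # ip -> timestamps of recent failures
    for ts, ip, status in events:
        if status != 401:                     # only failed logins count
            continue
        q = recent[ip]
        q.append(ts)
        while q and ts - q[0] > WINDOW_SECONDS:   # slide the window forward
            q.popleft()
        if len(q) >= THRESHOLD:
            yield ts, ip

events = [(t, "10.0.0.5", 401) for t in range(0, 50, 10)]  # 5 failures in 50 s
print(list(detect_bruteforce(events)))  # -> [(40, '10.0.0.5')]
```

The same shape covers other rules (port scans, impossible travel) by swapping the grouping key and predicate.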
- Messaging: Apache Kafka 3.7.x (KRaft mode – no ZooKeeper)
- Processing Engine: Apache Spark 3.5.x (PySpark)
- Data Warehouse: Apache Hive 4.0.x
- Distributed Storage: Apache Hadoop 3.2.x (HDFS)
- Visualization: Apache Superset 4.1.x
- Environment: Docker & Docker Compose
We provide a simple `Makefile` wrapper for all Docker and Spark commands. If you do not have `make` installed, consult the `Makefile` and run the raw `docker compose` and `docker exec` commands directly.
```bash
make up
# or: docker compose up -d
```

This will deploy:

- Kafka (KRaft mode – no ZooKeeper dependency)
- Hadoop HDFS (1 NameNode + 2 DataNodes)
- Hive Metastore + HiveServer2 + PostgreSQL Metastore DB
- Spark (1 Master + 2 Workers + Thrift Server on :10000)
- Superset + Redis (SOC dashboards at :8088)
- 4 distributed flog producers (`web-logs`, `syslogs`, `app-logs`, `win-event-logs`)
Web UIs after startup:
| Service | URL |
|---|---|
| Spark Master | http://localhost:8080 |
| HDFS NameNode | http://localhost:9870 |
| Superset | http://localhost:8088 |
| Spark Worker 1 | http://localhost:8081 |
| Spark Worker 2 | http://localhost:8082 |
```bash
make run-job
# or: docker exec -it spark-master spark-submit ...
```

Use the operational runbooks in `docs/` to verify:

- `docs/verification-guide.md` – HDFS, Hive, Kafka, and Spark health checks
- `docs/superset-guide.md` – Superset connection, query examples, and dashboard setup
- `docs/EXAMPLE_QUERIES.md` – Raw SQL threat-hunting queries
- `docs/CAPACITY.md` – Sizing recommendations
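As a quick first pass before working through the runbooks, you can probe the service ports from the host. The stdlib sketch below assumes the default compose setup (services published on `localhost` at the ports shown earlier); it only checks TCP reachability, not application health.

```python
# Convenience TCP probe for the default compose port layout -- an assumption,
# not a substitute for the health checks in docs/verification-guide.md.
import socket

SERVICES = {
    "Spark Master UI":  ("localhost", 8080),
    "HDFS NameNode UI": ("localhost", 9870),
    "Superset":         ("localhost", 8088),
    "Spark Thrift":     ("localhost", 10000),
    "Kafka":            ("localhost", 9092),
}

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        state = "up" if port_open(host, port) else "DOWN"
        print(f"{name:18s} {host}:{port}  {state}")
```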
- `docker-compose.yml` – Full platform stack definition
- `Makefile` – Convenience wrapper for common commands
- `config/hadoop/core-site.xml` – HDFS client configuration
- `config/hadoop/hdfs-site.xml` – HDFS replication settings
- `config/hive/hive-site.xml` – Hive Metastore connection
- `config/spark/spark-defaults.conf` – Spark tuning
- `flog/Dockerfile` – Log generator image
- `flog/publish_flog.sh` – Flog → Kafka producer script
- `jobs/etl_process.py` – Kafka → Hive Parquet ETL job
- `jobs/detection_rules.py` – Spark SQL based SIEM detection engine
- `docs/verification-guide.md` – Operational health runbook
- `docs/superset-guide.md` – Superset connection & SOC dashboard guide
- `docs/EXAMPLE_QUERIES.md` – SQL threat-hunting query examples
- `docs/CAPACITY.md` – Storage and memory sizing guide
- `research/` – Technical deep-dive documents for each component
- `showcase/` – Interactive HTML presentation of the platform