LANDER:DISCERN-20251019

README version: 15775, last modified: 2026-04-22.

This file describes the trace dataset "DISCERN-20251019" provided by the LANDER project.

Contents

  • 1 LANDER Metadata
  • 2 Dataset Creation
  • 3 Dataset Contents
  • 4 Synthetic Data
  • 5 Data Schema
  • 6 Citation
  • 7 Results Using This Dataset
  • 8 User Annotations

LANDER Metadata

┌───────────────────────────┬──────────────────────────────────────────────────────────────────┐
│ dataSetName               │ DISCERN-20251019                                                 │
│ status                    │ usc-web-and-predict                                              │
│ shortDesc                 │ DISCERN instrumentation data from SPHERE                         │
│ longDesc                  │ This dataset contains 2 types of data, synthetic and real.       │
│                           │ Synthetic data records the legitimate and malicious workflows    │
│                           │ performed on the SPHERE testbed. The real data records realistic │
│                           │ users' traces on the SPHERE testbed. Each set of data is a       │
│                           │ collection of CPU load, file system events, network pcap         │
│                           │ headers, interfaces log, top CPU usage processes, top Memory     │
│                           │ usage processes, and new processes data within the experiment    │
│                           │ realization in the SPHERE testbed. The real data is anonymized   │
│                           │ by hashing sections of Device ID, also known as the experiment   │
│                           │ FQDN: [device name].exp(hashed).[realization](hashed)            │
│                           │ .[experiment](hashed).[project](hashed). Pcap contents and       │
│                           │ process names are also removed. The collection is converted to   │
│                           │ CSV, grouped into subdirectories based on hashed realization,    │
│                           │ and pruned by removing data sections shorter than 30 minutes.    │
│ datasetClass              │ Unclassified                                                     │
│ commercialAllowed         │ true                                                             │
│ requestReviewRequired     │ true                                                             │
│ productReviewRequired     │ false                                                            │
│ ongoingMeasurement        │ false                                                            │
│ submissionMethod          │ Upload                                                           │
│ collectionStartDate       │ 2025-10-19                                                       │
│ collectionStartTime       │ 00:00:00                                                         │
│ collectionEndDate         │ 2025-12-12                                                       │
│ collectionEndTime         │ 00:00:00                                                         │
│ availabilityStartDate     │                                                                  │
│ availabilityStartTime     │ 00:00:00                                                         │
│ availabilityEndDate       │ 2030-01-01                                                       │
│ availabilityEndTime       │ 00:00:00                                                         │
│ anonymization             │ cryptopan/full                                                   │
│ archivingAllowed          │ false                                                            │
│ keywords                  │ category:generic-network/behavior-data, testbed, instrumentation │
│ format                    │ csv                                                              │
│ access                    │ https                                                            │
│ hostName                  │ USC-LANDER                                                       │
│ providerName              │ USC                                                              │
│ groupingId                │                                                                  │
│ groupingSummaryFlag       │ false                                                            │
│ retrievalInstructions     │ download                                                         │
│ byteSize                  │ 21982347264                                                      │
│ expirationDays            │ 14                                                               │
│ uncompressedSize          │                                                                  │
│ impactDoi                 │                                                                  │
│ useAgreement              │ dua-ni-160816                                                    │
│ irbRequired               │ false                                                            │
│ privateAccessInstructions │ See https://ant.isi.edu/datasets/#getting-datasets for           │
│                           │ information on obtaining this dataset.                           │
│                           │ See                                                              │
└───────────────────────────┴──────────────────────────────────────────────────────────────────┘

Dataset Creation

This data was captured at the SPHERE testbed. SPHERE is a research infrastructure that runs on the Merge testbed platform. Merge is a testbed platform composed of a centralized portal that acts as an experimentation hub for a distributed system of testbed facilities [1]. More information about the SPHERE testbed can be found in [2]. The data collection system used is [3].

The data were collected from every node that consenting users created on the testbed, across different realizations and materializations, from materialization to termination, during the data collection period (2025-10-19 to 2025-12-12), apart from some intermittent system downtime. Afterwards, the data were pruned to remove errors:

 1. We took the intersection (AND) of all heartbeat data (cpu-load, proc-cpu, and proc-mem) within the same realization.
 2. Heartbeats are considered continuous when consecutive heartbeats are less than 15 seconds apart.
 3. Only continuous data segments longer than 30 minutes are kept.
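
The pruning steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pruning code: it assumes each heartbeat source is available as a list of Unix-second timestamps, it interprets the intersection (AND) as intersecting the continuous time segments of the sources, and all function and constant names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Segment = Tuple[int, int]  # (start, end) in Unix seconds, both inclusive

MAX_GAP_S = 15            # heartbeats closer than this count as continuous
MIN_SEGMENT_S = 30 * 60   # only segments longer than this are kept


def continuous_segments(timestamps: Iterable[int],
                        max_gap: int = MAX_GAP_S) -> List[Segment]:
    """Split heartbeat timestamps into runs whose consecutive gaps are < max_gap."""
    segments: List[Segment] = []
    start: Optional[int] = None
    prev: Optional[int] = None
    for ts in sorted(set(timestamps)):
        if start is None:
            start = prev = ts
        elif ts - prev < max_gap:
            prev = ts
        else:
            segments.append((start, prev))
            start = prev = ts
    if start is not None:
        segments.append((start, prev))
    return segments


def intersect(a: List[Segment], b: List[Segment]) -> List[Segment]:
    """Intersect two sorted, non-overlapping segment lists."""
    out: List[Segment] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever segment ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def kept_segments(sources: List[Iterable[int]]) -> List[Segment]:
    """AND the continuous segments of every source, then drop the short ones."""
    if not sources:
        return []
    common = continuous_segments(sources[0])
    for timestamps in sources[1:]:
        common = intersect(common, continuous_segments(timestamps))
    return [(s, e) for s, e in common if e - s > MIN_SEGMENT_S]
```

Calling `kept_segments([cpu_load_ts, proc_cpu_ts, proc_mem_ts])` with the timestamp columns of one realization would return the time windows in which all three heartbeat sources were continuously present for more than 30 minutes.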
Dataset Contents

The dataset represents traffic and activities of the materializations in the SPHERE testbed. The data include CPU load, file system events, network pcap headers, interfaces log, top CPU usage processes, top Memory usage processes, and new processes, and are separated into directories per experimental node per realization. The primary aim of the dataset is to support the training of Machine Learning models targeting malicious activity identification and classification in academic cyberinfrastructures, such as testbeds.

This part of the data contains realistic traces of user activities and can be combined with the synthetic DISCERN data [4], which is also included in this dataset. We also provide 3 merged samples. Each merged sample is created by overlapping the timestamps of a synthetic malicious dataset with those of a realistic dataset and merging the Device IDs of the nodes. Nodes are merged pairwise, so users can pick the combination they prefer. Users can develop their own tool or use ours to merge the DISCERN real and synthetic data. For details on how we merge data, please refer to utils/merge_tool.md and the script utils/csv-merge.py.

Since the data is realistic, the topologies and nodes were defined by the consenting users of the SPHERE testbed during the collection period. The dataset includes a list of topologies and their corresponding realizations. Here is a preview of the list (the full list can be found at info/topology.txt):

    TOPOLOGY GROUP #1
    Nodes: [attacker, client, router, server]
    Realizations:
      - bnsdq_ozxfs_rzona_vnoxt,
      - bnsdq_ozxfs_ybwfd_spbau,
      - bnsdq_ozxfs_zccnd_spbau,
      - bnsdq_ozxfs_gnnph_nmtny,
      ...
    ------------------------------------------------------------
    TOPOLOGY GROUP #2
    Nodes: [attacker, client, router]
    Realizations:
      - bnsdq_ozxfs_gnnph_cjnos
    ------------------------------------------------------------
    ...
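
To select realizations by topology, the list previewed above can be read with a small parser. The sketch below assumes the line layout suggested by the preview (a "TOPOLOGY GROUP #n" header line, a "Nodes: [...]" line, and one "- realization" entry per line); the actual layout of info/topology.txt may differ, and `parse_topology` is a hypothetical helper, not part of the dataset's tools.

```python
import re
from typing import Dict, List

GROUP_RE = re.compile(r"^TOPOLOGY GROUP #(\d+)\s*$", re.MULTILINE)
NODES_RE = re.compile(r"^Nodes:\s*\[(.*?)\]", re.MULTILINE)
# One realization per line: "- name" with an optional trailing comma.
REALIZATION_RE = re.compile(r"^\s*-\s+([A-Za-z0-9_]+),?\s*$", re.MULTILINE)


def parse_topology(text: str) -> List[Dict[str, object]]:
    """Return one dict per topology group with its nodes and realizations."""
    groups: List[Dict[str, object]] = []
    headers = list(GROUP_RE.finditer(text))
    for idx, header in enumerate(headers):
        # A group's body runs until the next group header (or end of text).
        end = headers[idx + 1].start() if idx + 1 < len(headers) else len(text)
        body = text[header.end():end]
        nodes_match = NODES_RE.search(body)
        nodes = []
        if nodes_match:
            nodes = [n.strip() for n in nodes_match.group(1).split(",") if n.strip()]
        groups.append({
            "group": int(header.group(1)),
            "nodes": nodes,
            "realizations": REALIZATION_RE.findall(body),
        })
    return groups
```

For example, `[g for g in parse_topology(text) if "attacker" in g["nodes"]]` would keep only the groups whose topology contains a node named attacker.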
Following is the dataset's hierarchical structure:

    DISCERN-20251019.README.txt     copy of this README
    real/
        bnsdq_[realization]/     Directory for a specific experiment realization (e.g., bnsdq_aaglg_wujkt_wvfoy)
            [node]/     Directory for a specific node (e.g., passnode-data)
                cpu-load.csv     the cpu-load data of the node
                file.csv     the file change data of the node
                interface.csv     the interface setting data of the node
                network.csv     the network data of the node
                proc-cpu.csv     the top cpu processes data of the node
                proc-mem.csv     the top memory processes data of the node
                proc-new.csv     the new processes data of the node
            ...
        ...     other realization directories
    info/
        topology.txt     this file indicates the node combination of each realization
        synthetic-info/     this folder contains the steps to generate synthetic data
            cryptominer.md
            dnsmitm.md
            exfiltrate.md
            internetscanner.md
            llm.md
            ransomware.md
            spread.md
            svm.md
            synflood.md
    utils/
        csv-merge.py     merge script for merging real data and synthetic malicious data
        merge_tool.md     this file details the purpose and usage of the csv-merge.py tool
    merge/     this folder contains 3 example merged realizations
        bnsdq_ozxfs_rzona_uxkia-cryptominer_0/     the pairwise merge of bnsdq_ozxfs_rzona_uxkia and cryptominer_0
            [node]_[node]/     Directory for a specific pair of nodes (e.g., attacker-data_compromised-data)
                cpu-load.csv     the cpu-load data of the node
                file.csv     the file change data of the node
                interface.csv     the interface setting data of the node
                network.csv     the network data of the node
                proc-cpu.csv     the top cpu processes data of the node
                proc-mem.csv     the top memory processes data of the node
                proc-new.csv     the new processes data of the node
            ...
        bnsdq_ozxfs_rzona_nmtny-exfiltrate_1/     the pairwise merge of bnsdq_ozxfs_rzona_nmtny and exfiltrate_1
            ...
        bnsdq_ozxfs_rzona_lnllj-ransomware_2/     the pairwise merge of bnsdq_ozxfs_rzona_lnllj and ransomware_2
            ...
    synthetic/
        legitimate/     Directory for legitimate synthetic experiments (each scenario is run 4 times)
            dnsmitm/
                0/
                    [node]-data/
                        cpu-load-res.csv
                        file-res.csv
                        interfaces-res.csv
                        network-res.csv
                        proc-cpu-res.csv
                        proc-mem-res.csv
                        proc-new-res.csv
                1/...
                2/...
                3/...
            llm/...
            svm/...
            synflood/...
        malicious/     Directory for malicious synthetic experiments (each scenario is run 4 times)
            cryptominer/...
            exfiltrate/...
            internetscanner/...
            ransomware/...
            spread/...
        merged/     Directory for merged data of the legitimate and malicious synthetic experiments (synthetic legitimate and malicious experiments are merged pairwise, when malicious timeframe < legitimate timeframe)
            dnsmitm_cryptominer/...
            dnsmitm_ransomware/...
            llm_cryptominer/...
            svm_cryptominer/...
            synflood_exfiltrate/...
            synflood_spread/...
            dnsmitm_exfiltrate/...
            dnsmitm_spread/...
            llm_spread/...
            svm_exfiltrate/...
            synflood_cryptominer/...
            synflood_ransomware/...
    .sha1sum     SHA-1 checksum

The node, realization, experiment, and project name resolution follows this rule: [node-name].exp.[realization].[experiment].[project], where exp.[realization].[experiment].[project] are all hashed and node names are kept.

The file ".sha1sum" contains SHA-1 checksums of the individual compressed files. The integrity of the distribution can thus be checked by independently calculating the SHA-1 sums of the files and comparing them with those listed in the file. If you have the sha1sum utility installed on your system, you can do that by executing:

    sha1sum --check .sha1sum

This has to be done before the files are uncompressed.

Synthetic Data

These datasets were created by members of the DISCERN project on the [SPHERE research infrastructure](https://sphere-testbed.net). They are contained in the [synthetic](synthetic) folder.

- legitimate folder contains several legitimate use cases:
  - dnsmitm - DNS MITM attack reproduction in an experiment - allowed use case for a security testbed
  - synflood - TCP SYN flood attack reproduction in an experiment - allowed use case for a security testbed
  - llm - Running an LLM in generative mode, e.g., as a security-focused chatbot
  - svm - Running an ML model (SVM) on a classification task, e.g., to classify legitimate from attack traffic
- malicious folder contains several malicious use cases:
  - cryptominer - a user is using cryptomining software on an experimental node
  - internetscanner - an experimental node starts scanning many Internet hosts
  - ransomware - an experimental node is encrypted by ransomware
  - spread - an experimental node starts sending email to the outside (e.g., for purposes of spam, phishing or malware spread)
  - exfiltrate - attacker exfiltrates a file from an experimental node
- merged folder contains interleaved/merged data from legitimate and malicious use cases

Data Schema

The data are all in CSV form. For each node, there are at most 7 files: cpu-load.csv, file.csv, interfaces.csv, network.csv, proc-cpu.csv, proc-mem.csv, and proc-new.csv. A file that would contain 0 records is not created. Each CSV file carries its schema in its first line.
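
Because every CSV names its columns in the first line and empty files are simply absent, a node directory can be loaded without hard-coding any schema. The following is a minimal sketch using only the Python standard library; `load_node` and `NODE_FILES` are hypothetical names, and the file names are the seven listed in this section.

```python
import csv
from pathlib import Path
from typing import Dict, List

# The seven per-node file names listed in the Data Schema section.
NODE_FILES = [
    "cpu-load.csv", "file.csv", "interfaces.csv", "network.csv",
    "proc-cpu.csv", "proc-mem.csv", "proc-new.csv",
]


def load_node(node_dir: Path) -> Dict[str, List[Dict[str, str]]]:
    """Read every CSV present in a node directory, keyed by file name.

    Each row becomes a dict keyed by the column names from the file's
    first line; all values are left as strings.
    """
    tables: Dict[str, List[Dict[str, str]]] = {}
    for name in NODE_FILES:
        path = node_dir / name
        if not path.is_file():
            continue  # a file with 0 records is never created
        with path.open(newline="") as handle:
            tables[name] = list(csv.DictReader(handle))
    return tables
```

Values stay as strings so that the caller decides how to convert them, for example `int(row["timestamp"])` for the Unix epoch column.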
Here is the full list of schemas:

cpu-load.csv

Schema                Description                                                 Example
timestamp             Unix epoch time (in seconds)                                1763583159
device_id             FQDN of the node                                            attacker.bnsdq.ozxfs.gnnph.nmtny
load_core_0           CPU load percentage (float); extra columns may be present   16.00000000021828
                      if the node is set up with more cores

file.csv

Schema                Description                                                 Example
timestamp             Unix epoch time (in seconds)                                1763583159
device_id             FQDN of the node                                            attacker.bnsdq.ozxfs.gnnph.nmtny
location              location of the changed file                                /etc/systemd/network/eth1.network
size                  size of the file (in bytes)                                 114
hash                  MD5 hash of the file                                        02a106d89ed8b83b7b451f4a92c70897
owner                 owner of the file                                           root
group                 group of the file                                           root

interfaces.csv

Schema                Description                                                 Example
timestamp             Unix epoch time (in seconds)                                1763693413
device_id             FQDN of the node                                            attacker.bnsdq.ozxfs.gnnph.nmtny
interface_name        Name of the network interface                               eth0
action                The event type or state change occurring on the interface   Changed
hardware_addr         The physical MAC (Media Access Control) address of the      de:43:15:d3:4e:86
                      interface
ips                   IP address(es) assigned to the interface (often blank if    192.168.1.204
                      unassigned or spanning)

network.csv (fields that do not apply may be empty or marked as N/A)

Schema                Description                                                 Example
timestamp             Unix epoch time (in seconds) when the packet was captured   1763583548
device                The network interface that captured the packet              eth0
length                The size of the captured packet in bytes                    373
link_protocol         The Data Link layer protocol                                Ethernet
network_protocol      The Network layer protocol                                  IPv4
transport_protocol    The Transport layer protocol                                UDP
application_protocol  The Application layer protocol                              N/A
ip_version            The version of the Internet Protocol being used (v4 or v6)  v4
src_ip                The source IP address originating the traffic               172.30.0.163
dst_ip                The destination IP address receiving the traffic            fe80::8089:a3ff:fe29:136
src_port              The source port number                                      68
dst_port              The destination port number                                 67
arp_operation         Indicates if the ARP packet is a request (1) or a reply (2) 1
arp_protocol          Protocol type being resolved (e.g., 2048 = 0x0800 for IPv4) 2048
arp_src_proto         The sender protocol (IP) address in an ARP message          172.30.0.163
arp_dst_proto         The target protocol (IP) address in an ARP message          172.30.0.122
eth_src_mac           The source MAC address in the Ethernet header
eth_dst_mac           The destination MAC address in the Ethernet header

proc-*.csv

Schema                Description                                                 Example
timestamp             Unix epoch time (in seconds)                                1763590830
pid                   Process ID (unique identifier for the active process)       1338
ppid                  Parent process ID (ID of the process that started this one) 1
real_uid              real User ID of the user who launched the process           0
effective_uid         effective User ID used for privilege checks                 0
saved_uid             saved set-user-ID                                           0
filesystem_uid        User ID used specifically for filesystem access checks      0
real_gid              real Group ID                                               0
effective_gid         effective Group ID                                          0
saved_gid             saved set-group-ID                                          0
filesystem_gid        Group ID used specifically for filesystem access checks     0
vm_peak               peak virtual memory size used by the process                1331765248
vm_size               current size of the program in virtual memory               1331765248
vm_hwm                "High water mark": peak resident set size (physical memory) 73248768
vm_rss                current resident set size (actual physical RAM used)        0
rss_shmem             amount of resident set size that is shared memory           0
vm_stk                virtual memory stack size used by the process               135168
vm_data               virtual memory data segment size (heap and other data)      108810240
threads               number of threads currently executing inside the process    4
name                  name of the process (not collected in the real data)        N/A
state                 current execution state of the process                      S (sleeping)
device_id             FQDN of the node                                            sqli.bnsdq.ozxfs.nncmq.fnzoo
cpu                   The CPU utilization percentage or time slice for this       69.362517
                      specific process (this could exceed 100 if multithreaded)

Citation

If you use this trace to conduct additional research, please cite it as:

DISCERN real 2025 Dataset, PREDICT ID: USC-LANDER/DISCERN-20251019.
Provided by the USC/LANDER project http://www.isi.edu/ant/lander.

Results Using This Dataset

No results yet.

User Annotations

Currently no annotations.