


Running DeepDFA (END), Final - Part 1

agencies 2024. 10. 14. 16:37

 

 

Attachment: DeepDFA.gz (8.72 MB)

 

 


 

The procedure for running DeepDFA is as follows.

Environment: Colab

Dataset to obtain in advance: https://figshare.com/articles/dataset/Dataflow_Analysis-Inspired_Deep_Learning_for_Efficient_Vulnerability_Detection/21225413

 

Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection

Data package for "Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection", published in ICSE 2024, with updates from Artifact Evaluation. Paper link: https://www.computer.org/csdl/proceedings-article/icse/2024/021700a166/1RLIWqviwEMS

figshare.com

 

 

First, start up Colab. (A few things to know in advance:)

- the path setup

- adding a few data files (xlsx/csv)

- the try ~ except blocks

- pip install ~ [installing packages]

These are the only parts you need to pay attention to (the path setup is sketched just below).
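
For reference, the path setup mentioned above boils down to the two snippets used later in this post. A minimal sketch, assuming DeepDFA is cloned to /content/DeepDFA and Joern is unzipped to /content/joern-cli as in the setup cell below:

# Minimal path-setup sketch (paths assume the clone/unzip locations used in this post)
import os
import sys

# make the DDFA package (sastvd and friends) importable
sys.path.append("/content/DeepDFA/DDFA")

# make the joern binaries reachable from subprocess calls
os.environ["PATH"] += os.pathsep + "/content/joern-cli/joern-cli"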

 

※ Environment setup

!git clone https://github.com/ISU-PAAL/DeepDFA.git
!wget https://github.com/joernio/joern/releases/download/v1.1.1072/joern-cli.zip
!unzip joern-cli.zip -d joern-cli
# Note: each `!` line runs in its own shell, so this export does not persist across cells;
# later cells extend the PATH from Python via os.environ instead.
!export PATH=$PATH:/content/joern-cli/joern-cli
!pip install pip==23.2.1
!pip install tqdm numpy pandas torch==1.12 "torchmetrics<0.10.0" torchsampler silence-tensorflow tensorflow scipy captum deepspeed scikit-learn tokenizers transformers tree-sitter unidiff jsonlines networkx pexpect jsonargparse fastparquet gdown nni
!pip install -f https://data.dgl.ai/wheels/cu117/repo.html "dgl<1.1.3"
!pip install pytorch-lightning==1.7.7
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!pip install virtualenv

!apt-get update
!apt-get install cuda-11-7

 

 

Running this produces some errors, as shown above... but we'll ignore them for now.

Note that this step takes a while (estimated time: about 13 minutes).

 

 

 

 

# Run preprocess.sh from DDFA/scripts

# bash scripts/run_prepare.sh $@
# bash scripts/run_getgraphs.sh $@ # Make sure Joern is installed!
# bash scripts/run_dbize.sh $@
# bash scripts/run_abstract_dataflow.sh $@
# bash scripts/run_absdf.sh $@

 

Steps 1 through 5 above are what we need to work through (they must be run in order).

* If dgl cannot be found later on, run pip install dgl.

 

 

※ Before that: of the files downloaded during those 13 minutes, we'll delete a few to reduce the clutter.

 

To keep things tidy, we'll keep only joern-cli and DeepDFA/DDFA (a small cleanup sketch follows below).
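
If you'd rather script that cleanup than click through the Colab file browser, here is a hedged sketch. The keep-list is an assumption based on the two directories named above, and the sketch is destructive, so double-check it before running:

# Hedged cleanup sketch: remove everything under /content except the items we still need.
import os
import shutil

keep = {"DeepDFA", "joern-cli"}  # assumption: only these two need to survive
for name in os.listdir("/content"):
    if name in keep:
        continue
    path = os.path.join("/content", name)
    if os.path.isdir(path):
        shutil.rmtree(path)
    else:
        os.remove(path)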

 

 

 

 

If you look at the contents of run_prepare.sh, they are as follows.

 

It tells us to run prepare.py.

For the --dataset argument, pass bigvul.

Before actually running it, set up the path.

As always, the part that needs attention is adding the path.

 

 

[ prepare.py ]


import sys
sys.path.append("/content/DeepDFA/DDFA")

import argparse
import sastvd as svd
import sastvd.helpers.datasets as svdd
import sastvd.helpers.evaluate as ivde





def bigvul():
    """Run preperation scripts for BigVul dataset."""
    print(svdd.bigvul(sample=args.sample))
    ivde.get_dep_add_lines_bigvul("bigvul", sample=args.sample)
    # svdglove.generate_glove("bigvul", sample=args.sample)
    # svdd2v.generate_d2v("bigvul", sample=args.sample)
    print("success")


def devign():
    raise NotImplementedError
    print(svdd.devign(sample=args.sample))
    ivde.get_dep_add_lines("devign", sample=args.sample)
    svdglove.generate_glove("devign", sample=args.sample)
    svdd2v.generate_d2v("devign", sample=args.sample)
    print("success")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Prepare master dataframe")
    parser.add_argument("--sample", action="store_true", help="Extract a sample only")
    parser.add_argument("--global_workers", type=int, help="Number of workers to use")
    parser.add_argument("--dataset")
    args = parser.parse_args()

    if args.global_workers is not None:
        svd.DFMP_WORKERS = args.global_workers

    if args.dataset == "bigvul":
        bigvul()
    if args.dataset == "devign":
        devign()

Run prepare.py with: python prepare.py --dataset bigvul (a Python-cell variant is sketched below).
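
For reference, the same invocation from a Python cell. This is a hedged sketch: the working directory is an assumption, so adjust it to wherever your copy of prepare.py actually sits.

# Hedged sketch: run prepare.py for BigVul via subprocess instead of a `!` line.
# Per the argparse block above, --sample and --global_workers are also available;
# with --sample the loader looks for MSR_data_cleaned_SAMPLE.csv instead.
import subprocess

subprocess.run(
    ["python", "prepare.py", "--dataset", "bigvul"],
    cwd="/content/DeepDFA/DDFA",  # assumption: prepare.py is run from the DDFA root
    check=True,
)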

 

 

 

The error above occurs because the MSR_data_cleaned.csv file is missing.

We therefore need to supply the roughly 10 GB CSV file,

but since that takes a long time, we'll cut out just a small portion of it and use that instead.

 

import pandas as pd

# Display settings: show all columns, with no limit on column width
pd.set_option('display.max_columns', None)   # print every column
pd.set_option('display.max_colwidth', None)  # no limit on column content length

# Read the file in chunks of a fixed size (here, 1000 rows at a time)
chunksize = 1000  # number of rows to read at once
filename = 'MSR_data_cleaned.csv'

# Build a chunked reader; only the first chunk will be used
chunk_iter = pd.read_csv(filename, chunksize=chunksize, low_memory=False)

# Read the first chunk, preview it, and save a 3-row slice as a new CSV
for chunk in chunk_iter:
    print(chunk.tail(3))  # preview the last 3 rows of the chunk
    chunk.head(3).to_csv("output_chunk.csv", index=False)  # save only the first 3 rows
    break  # stop after the first chunk

Running this script gives us the truncated CSV file (a quick sanity check is sketched below).
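
As a quick sanity check on the truncated file, a hedged sketch; the expected column names come from the dtype table in the datasets.py listing further below:

# Hedged sanity check: the truncated CSV should hold only 3 data rows
# and the columns that datasets.py expects (e.g. "CVE ID", "func_before", "vul").
import pandas as pd

small = pd.read_csv("output_chunk.csv")
print(small.shape)                                             # expect (3, N)
print({"CVE ID", "func_before", "vul"} <= set(small.columns))  # expect True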

 

 

Out of the hundreds of thousands of rows, we'll proceed with just 3 as a test.

Rename the file to MSR_data_cleaned.csv.

Attachment: MSR_data_cleaned.csv (0.02 MB)

 

 

Place the file above under DeepDFA > DDFA > storage > external (a small copy sketch follows below).
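
To put the file in place from code rather than through the file browser, a minimal sketch; the directory comes from the storage/external path used later in this post, and the mkdir call is an assumption in case the folder does not exist yet:

# Hedged sketch: copy the truncated CSV into DDFA's external storage directory
# under the filename that datasets.py expects.
import pathlib
import shutil

external = pathlib.Path("/content/DeepDFA/DDFA/storage/external")
external.mkdir(parents=True, exist_ok=True)
shutil.copy("output_chunk.csv", external / "MSR_data_cleaned.csv")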

If we run it again, the following error appears...

 

 

Modify sastvd > helpers > datasets.py (around line 241) as shown below.


 

 

Running it again now gives the following error.

 

 

Find the code shown below and comment it out.

Then run it again!

 

 

 

[ datasets.py ]

import functools
import os
import re

import numpy as np
import pandas as pd
import sastvd as svd
from glob import glob
from pathlib import Path
import json
import traceback
import sastvd.helpers.git as svdg
import sastvd.helpers.joern as svdj
import logging

logger = logging.getLogger(__name__)


def remove_comments(text):
    """Delete comments from code."""

    def replacer(match):
        s = match.group(0)
        if s.startswith("/"):
            return " "  # note: a space and not an empty string
        else:
            return s

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE,
    )
    return re.sub(pattern, replacer, text)


def devign(cache=True, sample=False):
    """
    Read devign dataset from JSON
    """

    savefile = (
        svd.get_dir(svd.cache_dir() / "minimal_datasets")
        / f"minimal_devign{'_sample' if sample else ''}.pq"
    )
    if cache:
        try:
            df = pd.read_parquet(savefile, engine="fastparquet").dropna()

            return df
        except FileNotFoundError:
            logger.info(f"file {savefile} not found, loading from source")
        except Exception:
            logger.exception("devign exception, loading from source")

    filename = "function.json"
    df = pd.read_json(svd.external_dir() / filename,)
    df = df.rename_axis("id").reset_index()
    df["dataset"] = "devign"

    # Remove comments
    df["before"] = svd.dfmp(df, remove_comments, "func", cs=500)
    df["before"] = df["before"].apply(lambda c: c.replace("\n\n", "\n"))

    # Remove functions with abnormal ending (no } or ;)
    df = df[
        ~df.apply(
            lambda x: x.before.strip()[-1] != "}"
            and x.before.strip()[-1] != ";",
            axis=1,
        )
    ]
    # Remove functions with abnormal ending (ending with ");")
    df = df[~df.before.apply(lambda x: x[-2:] == ");")]
    df["vul"] = df["target"]

    # # Remove samples with mod_prop > 0.5
    # dfv["mod_prop"] = dfv.apply(
    #     lambda x: len(x.added + x.removed) / len(x["diff"].splitlines()), axis=1
    # )
    # dfv = dfv.sort_values("mod_prop", ascending=0)
    # dfv = dfv[dfv.mod_prop < 0.7]
    # # Remove functions that are too short
    # dfv = dfv[dfv.apply(lambda x: len(x.before.splitlines()) > 5, axis=1)]

    if sample:
        df = df.head(50)

    minimal_cols = [
        "id",
        "dataset",
        "before",
        "target",
        "vul",
    ]
    df[minimal_cols].to_parquet(
        savefile,
        object_encoding="json",
        index=0,
        compression="gzip",
        engine="fastparquet",
    )
    return df


def mutated(subdataset, cache=True, sample=False):
    """
    Read mutated dataset from JSON
    """

    df = bigvul(cache=cache, sample=sample)
    df = df.drop(columns=["dataset", "before"])
    fp = svd.external_dir() / "mutated" / f"c_{subdataset.replace('_flip', '')}.jsonl"
    # print("loading", fp)
    mutated = pd.read_json(fp, lines=True)
    if "flip" in subdataset:
        mutated = mutated.rename(columns={"source": "before"}).drop(columns=["target"])
    else:
        mutated = mutated.rename(columns={"target": "before"}).drop(columns=["source"])
    df = pd.merge(df, mutated, left_on="id", right_on="idx",
        # how="left", #
        how="inner", # include only examples with mutated code
    )
    df["dataset"] = f"mutated_{subdataset}"
    df = df.drop(columns=["after", "added", "removed", "diff"])

    return df


def ds(dsname, cache=True, sample=False):
    if dsname == "bigvul":
        return bigvul(cache=cache, sample=sample)
    elif dsname == "devign":
        return devign(cache=cache, sample=sample)
    elif "mutated" in dsname:
        subdataset = dsname.split("_", maxsplit=1)[1]
        return mutated(subdataset, cache=cache, sample=sample)


def bigvul(cache=True, sample=False):
    """
    Read BigVul dataset from CSV
    """

    savefile = (
        svd.get_dir(svd.cache_dir() / "minimal_datasets")
        / f"minimal_bigvul{'_sample' if sample else ''}.pq"
    )
    if cache:
        try:
            df = pd.read_parquet(savefile, engine="fastparquet").dropna()

            return df
        except FileNotFoundError:
            logger.info(f"file {savefile} not found, loading from source")
        except Exception:
            logger.exception("bigvul exception, loading from source")

    filename = "MSR_data_cleaned_SAMPLE.csv" if sample else "MSR_data_cleaned.csv"
   
    df = pd.read_csv(
        svd.external_dir() / filename,
        parse_dates=["Publish Date", "Update Date"],
        dtype={
            "commit_id": str,
            "del_lines": int,
            "file_name": str,
            "lang": str,
            "lines_after": str,
            "lines_before": str,
            "Unnamed: 0": int,
            "Access Gained": str,
            "Attack Origin": str,
            "Authentication Required": str,
            "Availability": str,
            "CVE ID": str,
            "CVE Page": str,
            "CWE ID": str,
            "Complexity": str,
            "Confidentiality": str,
            "Integrity": str,
            "Known Exploits": str,
            "Score": float,
            "Summary": str,
            "Vulnerability Classification": str,
            "add_lines": int,
            "codeLink": str,
            "commit_message": str,
            "files_changed": str,
            "func_after": str,
            "func_before": str,
            "parentID": str,
            "patch": str,
            "project": str,
            "project_after": str,
            "project_before": str,
            "vul": int,
            "vul_func_with_fix": str,
        },
    )
   

    df = df.rename(columns={"Unnamed: 0": "id"})
    df["dataset"] = "bigvul"

    # Remove comments
    df["func_before"] = svd.dfmp(df, remove_comments, "func_before", cs=500)
    df["func_after"] = svd.dfmp(df, remove_comments, "func_after", cs=500)

    # Save codediffs
    svd.dfmp(
        df,
        svdg._c2dhelper,
        columns=["func_before", "func_after", "id", "dataset"],
        ordr=False,
        cs=300,
    )

    # Assign info and save
    df["info"] = svd.dfmp(df, svdg.allfunc, cs=500)
    df = pd.concat([df, pd.json_normalize(df["info"])], axis=1)

    # POST PROCESSING
    dfv = df[df.vul == 1]
    # No added or removed but vulnerable
    dfv = dfv[~dfv.apply(lambda x: len(x.added) == 0 and len(x.removed) == 0, axis=1)]
    # Remove functions with abnormal ending (no } or ;)
    dfv = dfv[
        ~dfv.apply(
            lambda x: x.func_before.strip()[-1] != "}"
            and x.func_before.strip()[-1] != ";",
            axis=1,
        )
    ]
    dfv = dfv[
        ~dfv.apply(
            lambda x: x.func_after.strip()[-1] != "}" and x.after.strip()[-1:] != ";",
            axis=1,
        )
    ]
    # Commented out below: this filter would otherwise remove every row from dfv
    # Remove functions with abnormal ending (ending with ");")
    # dfv = dfv[~dfv.before.apply(lambda x: x[-2:] == ");")]

    try:
        # Remove samples with mod_prop > 0.5
        dfv["mod_prop"] = dfv.apply(
            lambda x: len(x.added + x.removed) / len(x["diff"].splitlines()), axis=1
        )
        dfv = dfv.sort_values("mod_prop", ascending=0)
        dfv = dfv[dfv.mod_prop < 0.7]
    except Exception:
        # the mod_prop filter can fail on the tiny sample, so just skip it
        print("except")


    # Remove functions that are too short
    dfv = dfv[dfv.apply(lambda x: len(x.before.splitlines()) > 5, axis=1)]
    # Filter by post-processing filtering
    keep_vuln = set(dfv["id"].tolist())
    df = df[(df.vul == 0) | (df["id"].isin(keep_vuln))].copy()

    minimal_cols = [
        "id",
        "before",
        "after",
        "removed",
        "added",
        "diff",
        "vul",
        "dataset",
    ]
    df[minimal_cols].to_parquet(
        savefile,
        object_encoding="json",
        index=0,
        compression="gzip",
        engine="fastparquet",
    )
    df[
        [
            "id",
            "commit_id",
            "vul",
            "codeLink",
            "commit_id",
            "parentID",
            "CVE ID",
            "CVE Page",
            "CWE ID",
            "Publish Date",
            "Update Date",
            "file_name",
            "files_changed",
            "lang",
            "project",
            "project_after",
            "project_before",
            "add_lines",
            "del_lines",
        ]
    ].to_csv(svd.cache_dir() / "bigvul/bigvul_metadata.csv", index=0)
    return df


def check_validity(_id, dsname, assert_no_exception=True, assert_line_number=False, assert_reaching_def=False):
    """Check whether sample with id=_id can be loaded and has node/edges.

    Example:
    _id = 1320
    with open(str(svd.processed_dir() / f"bigvul/before/{_id}.c") + ".nodes.json", "r") as f:
        nodes = json.load(f)
    """

    try:
        svdj.get_node_edges(itempath(_id, dsname))
        # check nodes
        with open(str(itempath(_id, dsname)) + ".nodes.json", "r") as f:
            nodes = json.load(f)
        nodes_valid = False
        for n in nodes:
            if "lineNumber" in n.keys():
                nodes_valid = True
                break
        if not nodes_valid:
            logger.warn("valid (%s): no line number", itempath(_id, dsname))
            if assert_line_number:
                return False
        # check edges
        with open(str(itempath(_id, dsname)) + ".edges.json", "r") as f:
            edges = json.load(f)
        edge_set = set([i[2] for i in edges])
        if "REACHING_DEF" not in edge_set and "CDG" not in edge_set:
            logger.warn("valid (%s): no dataflow", itempath(_id, dsname))
            if assert_reaching_def:
                return False
    except Exception as E:
        logger.warn("valid (%s): exception\n%s", itempath(_id, dsname), traceback.format_exc())
        if assert_no_exception:
            return False
    return True


def itempath(_id, dsname="bigvul"):
    """Get itempath path from item id. TODO: somehow give itempath of before and after."""
    return svd.processed_dir() / f"{dsname}/before/{_id}.c"


def check_valid_dataflow(_id):
    try:
        d = get_dataflow_output(_id)
        return len(d) > 0
    except Exception:
        traceback.print_exc()
        return False

def bigvul_check_valid_dataflow(df):
    valid = svd.dfmp(df, check_valid_dataflow, "id")
    df = df[valid]
    return df


def ds_filter(
    df,
    dsname,
    check_file=False,
    check_valid=False,
    vulonly=False,
    load_code=False,
    sample=-1,
    sample_mode=False,
    seed=0,
):
    """Filter dataset based on various considerations for training"""

    # Small sample (for debugging):
    if sample > 0:
        df = df.sample(sample, random_state=seed)
    assert len(df) > 0

    # Filter only vulnerable
    if vulonly:
        df = df[df.vul == 1]
    assert len(df) > 0

    # Filter out samples with no parsed file
    if check_file:
        finished = [
            int(Path(i).name.split(".")[0])
            for i in glob(str(svd.processed_dir() / dsname / "before/*nodes*"))
            if not os.path.basename(i).startswith("~")
        ]
        df = df[df.id.isin(finished)]
        logger.debug("check_file %d", len(df))
    assert len(df) > 0

    # Filter out samples with no lineNumber from Joern output
    if check_valid:
        valid_cache = svd.cache_dir() / f"{dsname}_valid_{sample_mode}.csv"
        if valid_cache.exists():
            valid_cache_df = pd.read_csv(valid_cache, index_col=0)
        else:
            valid = svd.dfmp(
                df, functools.partial(check_validity, dsname=dsname), "id", desc="Validate Samples: ", workers=6
            )
            df_id = df.id
            valid_cache_df = pd.DataFrame({"id": df_id, "valid": valid}, index=df.index)
            valid_cache_df.to_csv(valid_cache)
        df = df[df.id.isin(valid_cache_df[valid_cache_df["valid"]].id)]
        logger.debug("check_valid %d", len(df))
    assert len(df) > 0

    # NOTE: drop several columns to save memory
    if not load_code:
        df = df.drop(columns=["before", "after", "removed", "added", "diff"], errors="ignore")
    return df


def bigvul_filter(
    df,
    check_file=False,
    check_valid=False,
    vulonly=False,
    load_code=False,
    sample=-1,
    sample_mode=False,
    seed=0,
):
    return ds_filter(
        df,
        check_file=check_file,
        check_valid=check_valid,
        vulonly=vulonly,
        load_code=load_code,
        sample=sample,
        sample_mode=sample_mode,
        seed=seed,
        dsname="bigvul",
    )


def get_splits_map(dsname):
    logger.debug("loading fixed splits")
    if dsname == "bigvul" or "mutated" in dsname:
        splits = get_linevul_splits()
    if dsname == "devign":
        splits = get_codexglue_splits()
    logger.debug("splits value counts:\n%s", splits.value_counts())
    return splits.to_dict()


def get_linevd_splits_map():
    logger.debug("loading linevd splits")
    splits = pd.read_csv(svd.external_dir() / "bigvul_rand_splits.csv")
    splits = splits.set_index("id")
    logger.debug("splits value counts:\n%s", splits.value_counts())
    return splits.to_dict()


def get_linevul_splits():
    logger.debug("loading linevul splits")
    splits_df = pd.read_csv(svd.external_dir() / "linevul_splits.csv", index_col=0)
    splits = splits_df["split"]
    splits = splits.replace("valid", "val")
    return splits


def get_codexglue_splits():
    splits_df = pd.read_csv(svd.external_dir() / "codexglue_splits.csv")
    splits_df = splits_df.set_index("example_index")
    splits_df["split"] = splits_df["split"].replace("valid", "val")
    splits = splits_df["split"]
    return splits


def get_named_splits_map(split):
    logger.debug("loading %s splits", split)
    splits_df = pd.read_csv(svd.external_dir() / "splits" / f"{split}.csv", index_col=0)
    splits_df = splits_df.set_index("example_index")
    splits = splits_df["split"]
    splits = splits.replace("valid", "val")
    splits = splits.replace("holdout", "test")
    logger.debug("splits value counts:\n%s", splits.value_counts())
    return splits.to_dict()

def ds_partition(
    df, partition, dsname, split="fixed", seed=0,
):
    """Filter to one partition of bigvul and rebalance function-wise"""
    logger.debug(f"ds_partition %d %s %s %d", len(df), dsname, partition, seed)

    if split == "random":
        logger.debug("generating random splits with seed %d", seed)
        splits_map = get_splits_map(dsname)
        df_fixed_splits = df.id.map(splits_map)
        logger.debug("valid splits value counts:\n%s", df_fixed_splits.value_counts())
        df = df[df_fixed_splits != "test"].copy()
        logger.debug("holdout %d test examples from fixed dataset split. dataset len: %d", np.sum(df_fixed_splits == "test"), len(df))

        def get_label(i):
            if i < int(len(df) * 0.1):
                return "val"
            elif i < int(len(df) * 0.2):
                return "test"
            else:
                return "train"

        df["label"] = pd.Series(
            list(map(get_label, range(len(df)))),
            index=np.random.RandomState(seed=seed).permutation(df.index),
        )
        # NOTE: I verified that this always gives the same output for all runs!
        # as long as the input df is the same (should be filtered first e.g. datamodule vs. abs_df)
    elif split == "fixed":
        splits_map = get_splits_map(dsname)
        df["label"] = df.id.map(splits_map)
    elif split == "linevul":
        assert dsname == "bigvul", dsname
        splits_map = get_linevul_splits_map()
        df["label"] = df.id.map(splits_map)
    else:
        assert dsname == "bigvul", dsname
        splits_map = get_named_splits_map(split)
        df["label"] = df.id.map(splits_map)
    logger.debug("dataset value counts\n%s\ndatasethead\n%s", df.value_counts("label"), df.groupby("label").head(5))

    if partition != "all":
        df = df[df.label == partition]
        logger.info(f"partitioned {len(df)}")

    return df

def bigvul_partition(df, partition, split="fixed", seed=0,):
    return ds_partition(df, partition, "bigvul", split, seed)

def test_random():
    df = bigvul()
    df = bigvul_partition(df, seed=42, partition="all", split="random")
    print("TEST 1")
    print(df.value_counts("label"))
    for label, group in df.groupby("label"):
        print(label)
        print(group)

    sdf = bigvul()
    sdf = bigvul_partition(df, seed=42, partition="all", split="random")
    print("TEST 2")
    assert sdf["label"].to_list() == df["label"].to_list()

    odf = bigvul()
    odf = bigvul_partition(odf, seed=53, partition="all", split="random")
    print("TEST 3")
    print(odf.value_counts("label"))
    for label, group in odf.groupby("label"):
        print(label)
        print(group)
        assert len(group) == len(df[df["label"] == label])
        assert group["id"].to_list() != (df[df["label"] == label]["id"]).to_list()
    assert odf["label"].to_list() != df["label"].to_list()


single = {
    "api": False,
    "datatype": True,
    "literal": False,
    "operator": False,
}
all_subkeys = ["api", "datatype", "literal", "operator"]


def parse_limits(feat):
    if "limitsubkeys" in feat:
        start_idx = feat.find("limitsubkeys")+len("limitsubkeys")+1
        end_idx = feat.find("_",start_idx)
        if end_idx == -1:
            end_idx = len(feat)
        limit_subkeys = feat[start_idx:end_idx]
        if limit_subkeys == "None":
            limit_subkeys = None
        else:
            limit_subkeys = int(limit_subkeys)
    else:
        limit_subkeys = 1000
    if "limitall" in feat:
        start_idx = feat.find("limitall")+len("limitall")+1
        end_idx = feat.find("_",start_idx)
        if end_idx == -1:
            end_idx = len(feat)
        limit_all = feat[start_idx:end_idx]
        if limit_all == "None":
            limit_all = None
        else:
            limit_all = int(limit_all)
    else:
        limit_all = 1000
    return limit_subkeys, limit_all

def abs_dataflow(feat, dsname="bigvul", sample=False, split="fixed", seed=0):
    """Load abstract dataflow information"""

    limit_subkeys, limit_all = parse_limits(feat)

    df = ds(dsname, sample=sample)
    df = ds_filter(
        df,
        dsname,
        check_file=True,
        check_valid=True,
        vulonly=False,
        load_code=False,
        sample_mode=sample,
        seed=seed,
    )
    source_df = ds_partition(df, "train", dsname, split=split, seed=seed)

    abs_df_file = (
        svd.processed_dir()
        / dsname / f"abstract_dataflow_hash_api_datatype_literal_operator{'_sample' if sample else ''}.csv"
    )
    if abs_df_file.exists():
        abs_df = pd.read_csv(abs_df_file)
        abs_df_hashes = {}
        abs_df["hash"] = abs_df["hash"].apply(json.loads)
        logger.debug(abs_df)
        # compute concatenated embedding
        for subkey in all_subkeys:
            if subkey in feat:
                logger.debug(f"getting hashes {subkey}")
                hash_name = f"hash.{subkey}"
                abs_df[hash_name] = abs_df["hash"].apply(lambda d: d[subkey])
                if single[subkey]:
                    abs_df[hash_name] = abs_df[hash_name].apply(lambda d: d[0])
                    my_abs_df = abs_df
                else:
                    abs_df[hash_name] = abs_df[hash_name].apply(
                        lambda d: sorted(set(d))
                    )
                    my_abs_df = abs_df.explode(hash_name)
                my_abs_df = my_abs_df[["graph_id", "node_id", "hash", hash_name]]

                hashes = pd.merge(source_df, my_abs_df, left_on="id", right_on="graph_id")[hash_name].dropna()
                # most frequent
                logger.debug(f"min {hashes.value_counts().head(limit_subkeys).min()} {hashes.value_counts().head(limit_subkeys).idxmin()}")
                hashes = (
                    hashes.value_counts()
                    .head(limit_subkeys)
                    .index#.sort_values()
                    .unique()
                    .tolist()
                )
                hashes.insert(0, None)
                # with open("hashes5000", "w") as f:
                #     f.write("\n".join(map(str, hashes)))

                abs_df_hashes[subkey] = {h: i for i, h in enumerate(hashes)}

                logger.debug(f"trained hashes {subkey} {len(abs_df_hashes[subkey])}")

        if "all" in feat:
            def get_all_hash(row):
                h = {}
                for subkey in all_subkeys:
                    if subkey in feat:
                        hash_name = f"hash.{subkey}"
                        hashes = abs_df_hashes[subkey]
                        hash_values = row[hash_name]
                        if "includeunknown" in feat:
                            if single[subkey]:
                                hash_idx = [hash_values]
                            else:
                                hash_idx = hash_values
                        else:
                            if single[subkey]:
                                hash_idx = [
                                    hash_values if hash_values in hashes else "UNKNOWN"
                                ]
                            else:
                                hash_idx = [
                                    hh if hh in hashes else "UNKNOWN"
                                    for hh in hash_values
                                ]
                        h[subkey] = list(sorted(set(hash_idx)))
                return h

            source_df_hashes = pd.merge(source_df, abs_df, left_on="id", right_on="graph_id")
            abs_df["hash.all"] = source_df_hashes.apply(get_all_hash, axis=1).apply(json.dumps)
            hashes = abs_df["hash.all"]
            all_hashes = (
                abs_df["hash.all"]
                .value_counts()
                .head(limit_all)
                .index#.sort_values()
                .unique()
                .tolist()
            )
            all_hashes.insert(0, None)
            # with open("all_hashes5000", "w") as f:
            #     f.write("\n".join(map(str, all_hashes)))
            abs_df_hashes["all"] = {h: i for i, h in enumerate(all_hashes)}

        return abs_df, abs_df_hashes
    else:
        logger.warning("YOU SHOULD RUN `python sastvd/scripts/abstract_dataflow_full.py --stage 2`")


def test_abs():
    abs_df, abs_df_hashes = abs_dataflow(
        feat="_ABS_DATAFLOW_api_datatype_literal_operator", sample=False,
    )
    assert all(not all(abs_df[f"hash.{subkey}"].isna()) for subkey in all_subkeys)
    assert len([c for c in abs_df.columns if "hash." in c]) == len(all_subkeys)
    assert len(abs_df_hashes) == len(all_subkeys)


def test_abs_all():
    for featname in (
        "datatype",
        "literal_operator",
        "api_literal_operator",
        "api_datatype_literal_operator_all",
    ):
        print(featname)
        abs_df, abs_df_hashes = abs_dataflow(
            feat=f"_ABS_DATAFLOW_{featname}_all", sample=False
        )
        vc = abs_df.value_counts("hash.all")
        print(vc)
        print(len(vc.loc[vc > 1].index), "more than 1")
        print(len(vc.loc[vc > 5].index), "more than 5")
        print(len(vc.loc[vc > 100].index), "more than 100")
        print(len(vc.loc[vc > 1000].index), "more than 1000")
        print("min", vc.head(1000).min(), vc.head(1000).idxmin())


def test_abs_all_unk():
    for featname in (
        "datatype",
        "literal_operator",
        "api_literal_operator",
        "api_datatype_literal_operator_all",
    ):
        print(featname)
        abs_df, abs_df_hashes = abs_dataflow(
            feat=f"_ABS_DATAFLOW_{featname}_all_includeunknown", sample=False
        )
        vc = abs_df.value_counts("hash.all")
        print(vc)
        print(len(vc.loc[vc > 1].index), "more than 1")
        print(len(vc.loc[vc > 5].index), "more than 5")
        print(len(vc.loc[vc > 100].index), "more than 100")
        print(len(vc.loc[vc > 1000].index), "more than 1000")
        print("min", vc.head(1000).min(), vc.head(1000).idxmin())


def dataflow_1g(sample=False):
    """Load 1st generation dataflow information"""

    cache_file = svd.processed_dir() / f"bigvul/1g_dataflow_hash_all_{sample}.csv"
    if cache_file.exists():
        df = pd.read_csv(
            cache_file,
            converters={
                "graph_id": int,
                "node_id": int,
                "func": str,
                "gen": str,
                "kill": str,
            },
        )
        df["gen"] = df["gen"].apply(json.loads)
        df["kill"] = df["kill"].apply(json.loads)
        return df
    else:
        logger.warning("YOU SHOULD RUN dataflow_1g.py")


def test_1g():
    print(dataflow_1g(sample=True))


def test_generate_random():
    df = bigvul()
    df = bigvul_filter(
        df, check_file=True, check_valid=True, vulonly=False, load_code=False,
    )
    for split in ["random", "fixed"]:
        df = bigvul_partition(df, partition="all", split=split)
        print(split)
        print(df.value_counts("label", normalize=True))

def get_dataflow_output(_id):
    idpath = itempath(_id)
    dataflow_file = idpath.parent / (idpath.name + ".dataflow.json")
    with open(dataflow_file) as f:
        dataflow_data = json.load(f)
    updated_in = {}
    updated_out = {}
    for _, data in dataflow_data.items():
        data_out = data["solution.out"]
        assert len(set(updated_out.keys()) & set(data_out.keys())) == 0, "should be no overlap"
        updated_out.update(data_out)
        data_in = data["solution.in"]
        assert len(set(updated_in.keys()) & set(data_in.keys())) == 0, "should be no overlap"
        updated_in.update(data_in)
    updated_in = {int(k): v for k, v in updated_in.items()}
    updated_out = {int(k): v for k, v in updated_out.items()}
    return updated_in, updated_out

def test_debug():
    df = ds("mutated_var_rename")
    print(df)

 

Now we move on to step 2.

This is the part where Joern is used to generate the graphs, but the Python script does not run smoothly, so we have to run things directly ourselves.

 

# This is the bash scripts/run_getgraphs.sh step, so let's look at that file.

 

python -u sastvd/scripts/getgraphs.py bigvul --sess $jan --num_jobs 100 --overwrite

This part does not run properly because of the $jan variable.

 

So we'll fall back on an earlier post: https://agencies.tistory.com/245

 

Joern (DeepDFA: nodes, edges, CPG) generation (complete)

※ I finally found it. There is actually a Joern script inside the DeepDFA package that generates the nodes, edges, and CPG. The environment was Colab. 1. Install Joern version 1.1.1072 2. Install DeepDFA from git: !git clone https://

agencies.tistory.com

We'll reuse the Joern generation approach from that earlier post.

 

 

First, set up the path in the same way as before.

import sys
sys.path.append("/content/DeepDFA/DDFA")

 

Also fix the path used for the Joern execution log.

And extend the PATH so that the joern executable can be found:

import os
os.environ['PATH'] += os.pathsep + "/content/joern-cli/joern-cli"

 

 

[getgraphs.py]

import sys
sys.path.append("/content/DeepDFA/DDFA")

import functools
from multiprocessing import Pool
import os
import traceback

os.environ['PATH'] += os.pathsep + "/content/joern-cli/joern-cli"

import numpy as np
import tqdm
import sastvd as svd
import sastvd.helpers.datasets as svdd
import sastvd.helpers.joern as svdj
import sastvd.helpers.joern_session as svdjs


def write_file(row):
    # Write C Files
    savedir_before = svd.get_dir(svd.processed_dir() / row["dataset"] / "before")
    fpath1 = savedir_before / f"{row['id']}.c"
    with open(fpath1, "w") as f:
        f.write(row["before"])

    if row["dataset"] == "bigvul":
        savedir_after = svd.get_dir(svd.processed_dir() / row["dataset"] / "after")
        fpath2 = savedir_after / f"{row['id']}.c"
        if len(row["diff"]) > 0:
            with open(fpath2, "w") as f:
                f.write(row["after"])
    else:
        fpath2 = None

    return fpath1, fpath2


def preprocess(row, fn):
    """Parallelise svdj functions.

    Example:
    df = svdd.bigvul()
    row = df.iloc[180189]  # PAPER EXAMPLE
    row = df.iloc[177860]  # EDGE CASE 1
    preprocess(row)
    """
    try:
        # if row["dataset"] == "bigvul":
        fpath1, fpath2 = write_file(row)

        # Run Joern on "before" code
        if args.overwrite or not os.path.exists(f"{fpath1}.edges.json"):
            fn(filepath=fpath1, verbose=args.verbose)
        elif args.verbose > 0:
            print("skipping", fpath1)

        # Run Joern on "after" code
        if args.overwrite or (row["dataset"] == "bigvul" and len(row["diff"]) > 0 and not os.path.exists(f"{fpath2}.edges.json")):
            fn(filepath=fpath2, verbose=args.verbose)
        elif args.verbose > 0:
            print("skipping", fpath2)
    except Exception:
        with open("failed_joern.txt", "a") as f:
            print(f"ERROR {row['id']}: {traceback.format_exc()}\ndata={row}", file=f)


def test_preprocess():
    """
    test that preprocessing progresses alright
    """
    row = {}
    result = preprocess(row)
    print(f"{result}")


def preprocess_whole_df_split(t):
    """
    preprocess one split of the dataframe
    """
    i, split = t
    with open(f"/content/DeepDFA/DDFA/hpc/logs/getgraphs_output_{i}.joernlog", "wb") as lf:
        sess = svdjs.JoernSession(f"getgraphs/{i}", logfile=lf, clean=True)
        sess.import_script("get_func_graph")
        try:
            fn = functools.partial(
                svdj.run_joern_sess,
                sess=sess,
                verbose=args.verbose,
                export_json=True,
                export_cpg=True,
                export_dataflow=True,
            )
            items = split.to_dict("records")
            position = 0 if not isinstance(i, int) else int(i)
            for row in tqdm.tqdm(items, desc=f"(worker {i})", position=position):
                preprocess(row, fn)
        finally:
            sess.close()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("dataset",
        # choices=["bigvul", "devign", "sard"]
    )
    parser.add_argument("--job_array_number", type=int)
    parser.add_argument("--num_jobs", default=100, type=int)
    parser.add_argument("--partition")
    parser.add_argument("--workers", default=1, type=int)
    parser.add_argument("--run_sast", action="store_true")
    parser.add_argument("--sample", action="store_true")
    parser.add_argument("--sess", action="store_true")
    parser.add_argument("--verbose", type=int, default=0)
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--file_only", action="store_true")
    args = parser.parse_args()

    df = svdd.ds(args.dataset, sample=args.sample)
    if args.partition is not None:
        df = svdd.ds_partition(df, args.partition, args.dataset)

    if args.file_only:

        def write_file_pair(t):
            # df.iterrows() yields (index, row) tuples; unpack and write only the row
            i, row = t
            write_file(row)

        with Pool(args.workers) as pool:
            for _ in tqdm.tqdm(
                pool.imap_unordered(write_file_pair, df.iterrows()), total=len(df)
            ):
                pass

    # Read Data
    if args.sample:
        args.verbose = 4

    if args.job_array_number is None:
        if args.workers == 1:
            preprocess_whole_df_split(("all", df))
        else:
            splits = np.array_split(df, args.workers)
            svd.dfmp(enumerate(splits), preprocess_whole_df_split, ordr=False, workers=args.workers, cs=1)

    elif args.sess:
        splits = np.array_split(df, args.num_jobs)
        my_split = splits[args.job_array_number]
        print("processing", my_split)
        preprocess_whole_df_split((args.job_array_number, my_split))
    else:
        splits = np.array_split(df, args.num_jobs)
        split_number = args.job_array_number
        df = splits[split_number]
        svd.dfmp(
            df,
            functools.partial(preprocess, fn=svdj.run_joern),
            ordr=False,
            workers=args.workers,
        )

 

Now run getgraphs.py.

 

When you run it...

 

...this part gets generated!

Now run Joern to generate the nodes, edges, and dataflow.

/content/joern-cli/joern-cli/joern --script /content/DeepDFA/DDFA/storage/external/get_func_graph.sc --params filename=/content/DeepDFA/DDFA/storage/processed/bigvul/before/0.c

 

You can proceed as shown above,

but since this is only a test, we will eventually have a script do this work.

 

 

Run the same command for 1.c and 2.c as well; a loop version is sketched below.
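
Rather than typing the command once per file, a hedged sketch that loops the same invocation over the first few generated .c files:

# Hedged sketch: run get_func_graph.sc over the first few .c files under
# storage/processed/bigvul/before/, exactly as in the manual command above.
import glob
import subprocess

before_dir = "/content/DeepDFA/DDFA/storage/processed/bigvul/before"
script = "/content/DeepDFA/DDFA/storage/external/get_func_graph.sc"

for path in sorted(glob.glob(f"{before_dir}/*.c"))[:3]:
    subprocess.run(
        ["/content/joern-cli/joern-cli/joern",
         "--script", script,
         "--params", f"filename={path}"],
        check=True,
    )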

 

 

Files like these get generated.

The contents of dataflow.json look like the following.
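
Its shape can also be inspected programmatically. A hedged sketch, mirroring get_dataflow_output() in the datasets.py listing above (each entry carries "solution.in" / "solution.out" maps keyed by node id):

# Hedged sketch: peek at one dataflow.json produced by the Joern script.
import json

path = "/content/DeepDFA/DDFA/storage/processed/bigvul/before/0.c.dataflow.json"
with open(path) as f:
    dataflow = json.load(f)

for key, data in dataflow.items():
    print(key)
    print("  in :", dict(list(data["solution.in"].items())[:2]))
    print("  out:", dict(list(data["solution.out"].items())[:2]))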

 

 

 

 

※ To be continued in the next post~~~