diff options
author | scratko <m@scratko.xyz> | 2025-08-03 02:28:24 +0300 |
---|---|---|
committer | scratko <m@scratko.xyz> | 2025-08-12 03:37:52 +0300 |
commit | 07b3368ea184b2bd37a4cee2ab869c4fd3673f45 (patch) | |
tree | 8be8642cb29d94d6217c3807172e3f3347d4b991 /README.md | |
download | artifical-text-detection-07b3368ea184b2bd37a4cee2ab869c4fd3673f45.tar.gz artifical-text-detection-07b3368ea184b2bd37a4cee2ab869c4fd3673f45.tar.bz2 artifical-text-detection-07b3368ea184b2bd37a4cee2ab869c4fd3673f45.zip |
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 215 |
1 files changed, 215 insertions, 0 deletions
diff --git a/README.md b/README.md new file mode 100644 index 0000000..1dea4ea --- /dev/null +++ b/README.md @@ -0,0 +1,215 @@ +# Artificial Text Detection + +This project is based on my Master's thesis: +> **"Алгоритм выявления искусственно созданных текстов"** +> Nizhny Novgorod State Technical University, 2021 +> Author: Andrey Kuznetsov + +It is a C++/Qt-based application designed to detect artificially generated +scientific texts (e.g., SCIgen outputs). The detection method is based on +analyzing the internal stylistic consistency of the document using unsupervised +clustering and rank correlation metrics. + +<img src="screen_1.png" /> + +<img src="screen_2.png" /> + +<img src="screen_3.png" /> + + +## Project Overview + +The program processes input text, splits it into fragments, builds a vector +space using N-gram features, computes a pairwise distance matrix using +Spearman rank correlation, and applies clustering to detect stylistic +discontinuities. Such discontinuities are often present in machine-generated +texts. + +## Technologies Used + +- **C++17 (STL)** +- **Qt 5 Widgets & Charts** +- **Boost**: + - `boost::python` and `boost::python::numpy` — Used to exchange Numpy arrays + between C++ and Python during clustering. +- **Python 3** (invoked from C++): + - `numpy` + - `scikit-learn` + - `scikit-learn-extra` + +## Features + +### Flexible Text Input +- Manual text input via the built-in editor +- File selection via file dialog +- **Drag and drop** support for `.txt` files + +### Text Preprocessing +- Removes stop words, non-letter symbols, and repeated spaces +- Converts text to lowercase +- Implemented in a dedicated `prepare()` function for consistent cleaning + +### N-Gram Extraction and Dictionary Building +- Extracts N-grams with: + - Minimum N = 2 + - Maximum N set via UI parameters +- Builds a global dictionary from all fragments +- Performs feature selection by filtering top N-grams based on 90% frequency threshold + +### Vector Space Model Construction +- Uses the selected dictionary to vectorize document fragments +- Produces a nested structure: `vector<vector<vector<int>>>` + where: + - Outer vector = documents + - Middle vector = fragments + - Inner vector = N-gram frequencies + +### Distance Calculation Using Rank Correlation +- Calculates pairwise distances between fragments using **Spearman’s rank correlation coefficient** +- Computes **average rank dependence (ZVT)** with a sliding window approach +- Handles inter-document and intra-document fragment comparisons + +### Matrix Normalization +- Normalizes the pairwise distance matrix shape by padding shorter rows with zeros +- Ensures consistent dimensions for clustering algorithms + +### Unsupervised Clustering with Python +- Clustering algorithms supported: + - **K-Medoids** + - **K-Means** + - **Agglomerative Clustering** +- Implemented in Python using: + - `scikit-learn` + - `scikit-learn-extra` +- Distance matrix passed from C++ to Python via: + - `Boost.Python` + - `Boost.NumPy` + +### C++ Python Interoperability +To perform clustering with Python libraries, the project uses `Boost.Python` and `Boost.NumPy`: + +- Converts a `std::vector<std::vector<double>>` (distance matrix) to a NumPy array. +- Initializes the embedded Python interpreter. +- Imports the custom Python module `mymodule.py`. +- Calls one of the clustering functions: `kmedoids`, `kmeans`, or `agglClus`. +- Extracts prediction results and returns them back to C++. + +This approach allows combining the performance and UI capabilities of C++/Qt with the ML power of Python. + +### Visualization +- Displays predicted vs. real document labels in two separate charts. +- Cluster assignments are color-coded. +- Charts are rendered interactively via Qt. + +### Multithreaded Execution with Progress Bar +- Runs the detection algorithm in a separate thread using `QThread` +- Keeps the GUI responsive during processing +- Displays progress via `QProgressDialog` + + +## Algorithm Pipeline (Detailed) + +1. **Text Input** + - User provides text via input box, file dialog, or drag-and-drop. + - Text is stored in `target_doc`. + +2. **Preprocessing** (`prepare()`) + - Removes stop words (prepositions, conjunctions, etc.) + - Removes all characters except letters and spaces + - Removes repeated characters (e.g., spaces) + - Converts text to lowercase + +3. **Fragmentation** + - Text is split into equal-length chunks based on UI parameters + +4. **N-Gram Extraction** + - Combines all documents into a single corpus + - Calculates N-grams for N from 2 up to max N (from UI) + - Uses a sliding window algorithm + - Aggregates results into a dictionary + +5. **N-Gram Filtering** + - Selects N-grams above 90th percentile of frequency + - Saves filtered list to a text file + +6. **Vector Space Modeling** + - For each document, for each fragment: + - Computes vector of N-gram frequencies via `freq_in_chunk()` + - Produces nested structure of frequency vectors + +7. **Rank Correlation (Spearman's rho)** + - Calculates rank correlation distance between fragment vectors: + ρ = 1 - (6 * ∑ d_i^2) / (n(n^2 - 1)) + + - Implemented via `zv_calc()` and `correlation()` + +8. **Average Rank Dependence** + - Computes `ZVT` values for each fragment (based on 10 previous) + - Combines `ZVT` into full pairwise distance matrix + +9. **Matrix Padding** + - Matrix may have uneven rows; padded with zeros to square shape + +10. **C++ to Python Transfer** + - Converts distance matrix to NumPy arrays using `Boost.NumPy` + - Initializes embedded Python interpreter + - Calls `mymodule.py::{kmedoids, kmeans, agglClus}` + - Extracts prediction results + +11. **Post-Processing Results** + - Splits results into clusters for artificial vs. input document + - Compares distributions: + - If both match → **Artificial text** + - If different → **Human-written text** + +12. **Visualization & UI** + - Result shown in message box + - Prediction charts drawn with `QChartView` + +## Build Instructions + +### Prerequisites +Before building, make sure the following are installed: + +- **Qt 5.x** (QtWidgets, QtCharts, QtCore, QtGui) +- **Boost** (with `Boost.Python` and `Boost.NumPy` modules built) +- **Python 3.x** (tested with Python 3.8) +- Python packages: +```bash +pip install numpy scikit-learn scikit-learn-extra +``` +### Boost Notes + +- You must build Boost with Python support (`b2 --with-python`). +- Ensure that Boost is compiled with the same Python version you plan to use. +- On Windows, you may need to specify Boost and Python library paths in the `.pro` file, for example: + +```qmake +LIBS += \ + -L"C:\Program Files\boost\boost_1_76_0\stage\x86\lib" \ + -LC:/Users/<USERNAME>/AppData/Local/Programs/Python/Python38-32/libs +``` + +### Building + +- Clone the repository: + +``` +git clone https://git.scratko.xyz/artifical-text-detection + +cd artifical-text-detection +``` +- Open the `.pro` file in Qt Creator. +- Adjust library paths for Boost and Python if necessary. +- Build and run the project. + +### Running + +- Ensure that the Python interpreter can import mymodule.py. +- Make sure required Python packages are installed in the environment used by the application. + +## Notes + +- Designed to detect machine-generated documents, especially SCIgen-based fakes. +- Easily extendable to detect LLM-generated texts with modified features. +- No pretrained models required — fully unsupervised method. |