Initial commit: added source code, resources and README

author: scratko <m@scratko.xyz> 2025-08-03 02:28:24 +0300
committer: scratko <m@scratko.xyz> 2025-08-12 03:37:52 +0300
commit: 07b3368ea184b2bd37a4cee2ab869c4fd3673f45 (patch)
tree: 8be8642cb29d94d6217c3807172e3f3347d4b991 /README.md
download: artifical-text-detection-07b3368ea184b2bd37a4cee2ab869c4fd3673f45.tar.gz
artifical-text-detection-07b3368ea184b2bd37a4cee2ab869c4fd3673f45.tar.bz2
artifical-text-detection-07b3368ea184b2bd37a4cee2ab869c4fd3673f45.zip
1 files changed, 215 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1dea4ea
--- /dev/null
+++ b/README.md
@@ -0,0 +1,215 @@
+# Artificial Text Detection
+
+This project is based on my Master's thesis:  
+> **"Algorithm for Detecting Artificially Generated Texts"** (original title: "Алгоритм выявления искусственно созданных текстов")  
+> Nizhny Novgorod State Technical University, 2021  
+> Author: Andrey Kuznetsov
+
+It is a C++/Qt-based application designed to detect artificially generated
+scientific texts (e.g., SCIgen outputs). The detection method is based on
+analyzing the internal stylistic consistency of the document using unsupervised
+clustering and rank correlation metrics.
+
+<img src="screen_1.png" />
+
+<img src="screen_2.png" />
+
+<img src="screen_3.png" />
+
+
+## Project Overview
+
+The program processes input text, splits it into fragments, builds a vector
+space using N-gram features, computes a pairwise distance matrix using
+Spearman rank correlation, and applies clustering to detect stylistic
+discontinuities. Such discontinuities are often present in machine-generated
+texts.
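As a rough illustration of the fragmentation step mentioned above, the sketch below splits a cleaned text into equal-length chunks of words. The name `split_into_fragments`, the word-based fragment size, and the choice to drop a short trailing remainder are assumptions made for this example. The application itself takes the fragment size from the UI and may measure or handle the tail differently.

```python
def split_into_fragments(text, fragment_size):
    """Split text into consecutive chunks of `fragment_size` words.

    A trailing remainder shorter than `fragment_size` is dropped so that
    all fragments have equal length (an assumption of this sketch).
    """
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    words = text.split()
    full_chunks = len(words) // fragment_size
    return [
        " ".join(words[i * fragment_size:(i + 1) * fragment_size])
        for i in range(full_chunks)
    ]
```

For example, `split_into_fragments("one two three four five six seven", 3)` yields two fragments and discards the trailing word.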
+
+## Technologies Used
+
+- **C++17 (STL)**
+- **Qt 5 Widgets & Charts**
+- **Boost**:
+    - `boost::python` and `boost::python::numpy`: used to exchange NumPy arrays
+    between C++ and Python during clustering.
+- **Python 3** (invoked from C++):
+    - `numpy`
+    - `scikit-learn`
+    - `scikit-learn-extra`
+
+## Features
+
+### Flexible Text Input
+- Manual text input via the built-in editor
+- File selection via file dialog
+- **Drag and drop** support for `.txt` files
+
+### Text Preprocessing
+- Removes stop words, non-letter symbols, and repeated spaces
+- Converts text to lowercase
+- Implemented in a dedicated `prepare()` function for consistent cleaning
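The cleaning steps listed above can be sketched in a few lines of Python. This is not the project's C++ `prepare()` function, only a minimal stand-in with the same name. The stop-word list is a tiny hypothetical sample, and lowercasing is done first here so that stop-word matching is case-insensitive.

```python
# Tiny illustrative stop-word list (hypothetical); the list used by the
# application is not shown in this README.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "or", "the", "to"}


def prepare(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep only letters and spaces, drop stop words."""
    text = text.lower()
    # Replace every character that is neither a letter nor whitespace with
    # a space. str.isalpha() also accepts non-ASCII letters such as Cyrillic.
    text = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text)
    # split() collapses runs of whitespace, so repeated spaces disappear.
    words = [word for word in text.split() if word not in stop_words]
    return " ".join(words)
```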
+
+### N-Gram Extraction and Dictionary Building
+- Extracts N-grams with:
+  - Minimum N = 2
+  - Maximum N set via UI parameters
+- Builds a global dictionary from all fragments
+- Performs feature selection by keeping only the N-grams whose frequency is above the 90th percentile
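A minimal sketch of sliding-window N-gram counting and percentile-based selection follows. Several details are assumptions because the README does not specify them: the function names, the use of a token list (which can hold words or single characters), and the nearest-rank definition of the percentile.

```python
import math
from collections import Counter


def extract_ngrams(tokens, max_n, min_n=2):
    """Count every N-gram of length min_n..max_n using a sliding window."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


def select_frequent(counts, percentile=90):
    """Return N-grams whose count is strictly above the given percentile.

    The percentile is computed with the nearest-rank method over all counts
    (an assumption of this sketch).
    """
    if not counts:
        return []
    ordered = sorted(counts.values())
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    threshold = ordered[rank - 1]
    return sorted(gram for gram, count in counts.items() if count > threshold)
```

Because the comparison is strict, the selection can come back empty when many N-grams tie for the highest count. Whether the application treats such ties differently is not stated in the README.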
+
+### Vector Space Model Construction
+- Uses the selected dictionary to vectorize document fragments
+- Produces a nested structure: `vector<vector<vector<int>>>`
+    where:
+    - Outer vector = documents
+    - Middle vector = fragments
+    - Inner vector = N-gram frequencies
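The nested documents → fragments → frequencies layout described above can be mirrored with nested Python lists. In this sketch `count_occurrences` is a hypothetical stand-in for the project's `freq_in_chunk()`, and fragments are assumed to be lists of tokens.

```python
def count_occurrences(tokens, ngram):
    """Count how many times `ngram` (a tuple) occurs in the token list."""
    n = len(ngram)
    return sum(
        1
        for start in range(len(tokens) - n + 1)
        if tuple(tokens[start:start + n]) == ngram
    )


def vectorize(documents, dictionary):
    """Map documents -> fragments -> tokens to documents -> fragments -> counts.

    Each innermost vector has one entry per dictionary N-gram, in order.
    """
    return [
        [[count_occurrences(fragment, gram) for gram in dictionary]
         for fragment in document]
        for document in documents
    ]
```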
+
+### Distance Calculation Using Rank Correlation
+- Calculates pairwise distances between fragments using **Spearman’s rank correlation coefficient**
+- Computes **average rank dependence (ZVT)** with a sliding window approach
+- Handles inter-document and intra-document fragment comparisons
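To make the rank-correlation step concrete, here is a small sketch of Spearman's coefficient using the closed form 1 - 6·Σd² / (n(n² - 1)). The function names are hypothetical and this is not the project's `correlation()` implementation. Tied values receive averaged ranks, in which case the closed form is only an approximation.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho via the closed form (exact when there are no ties)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

The README does not say how the coefficient is turned into a distance. One common convention, mentioned here only as an assumption, is `1 - spearman_rho(x, y)`.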
+
+### Matrix Normalization
+- Normalizes the pairwise distance matrix shape by padding shorter rows with zeros
+- Ensures consistent dimensions for clustering algorithms
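The padding step can be sketched as follows. The name `pad_rows` is hypothetical, and the sketch only equalizes row lengths using the zero fill described above.

```python
def pad_rows(matrix, fill=0.0):
    """Pad shorter rows with `fill` so that all rows have the same length."""
    width = max((len(row) for row in matrix), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in matrix]
```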
+
+### Unsupervised Clustering with Python
+- Clustering algorithms supported:
+    - **K-Medoids**
+    - **K-Means**
+    - **Agglomerative Clustering**
+- Implemented in Python using:
+    - `scikit-learn`
+    - `scikit-learn-extra`
+- Distance matrix passed from C++ to Python via:
+    - `Boost.Python`
+    - `Boost.NumPy`
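The project delegates clustering to `scikit-learn` and `scikit-learn-extra`, so the following is not its code. It is a dependency-free sketch of what K-Medoids does with a precomputed distance matrix. The name `k_medoids`, the simple alternating update, and the deterministic "first k points" initialization are all simplifications chosen for illustration.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster points given a square distance matrix `dist`.

    Returns (labels, medoids), where medoids are row indices into `dist`.
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    def assign(medoids):
        # Each point joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = list(range(k))  # simplistic deterministic initialization
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep an empty cluster's medoid
                continue
            # The new medoid minimizes total distance to its cluster members.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids
```

Library implementations typically offer better initialization strategies. The sketch is only meant to show why a full pairwise distance matrix is enough input for this family of algorithms.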
+
+### C++ Python Interoperability
+To perform clustering with Python libraries, the project uses `Boost.Python` and `Boost.NumPy`:
+
+- Converts a `std::vector<std::vector<double>>` (distance matrix) to a NumPy array.
+- Initializes the embedded Python interpreter.
+- Imports the custom Python module `mymodule.py`.
+- Calls one of the clustering functions: `kmedoids`, `kmeans`, or `agglClus`.
+- Extracts prediction results and returns them back to C++.
+
+This approach combines the performance and UI capabilities of C++/Qt with Python's machine-learning libraries.
+
+### Visualization
+- Displays predicted vs. real document labels in two separate charts.
+- Cluster assignments are color-coded.
+- Charts are rendered interactively via Qt.
+
+### Multithreaded Execution with Progress Bar
+- Runs the detection algorithm in a separate thread using `QThread`
+- Keeps the GUI responsive during processing
+- Displays progress via `QProgressDialog`
+
+
+## Algorithm Pipeline (Detailed)
+
+1. **Text Input**
+    - User provides text via input box, file dialog, or drag-and-drop.
+    - Text is stored in `target_doc`.
+
+2. **Preprocessing** (`prepare()`)
+    - Removes stop words (prepositions, conjunctions, etc.)
+    - Removes all characters except letters and spaces
+    - Collapses repeated spaces
+    - Converts text to lowercase
+
+3. **Fragmentation**
+    - Text is split into equal-length chunks based on UI parameters
+
+4. **N-Gram Extraction**
+    - Combines all documents into a single corpus
+    - Calculates N-grams for N from 2 up to max N (from UI)
+    - Uses a sliding window algorithm
+    - Aggregates results into a dictionary
+
+5. **N-Gram Filtering**
+    - Selects N-grams above the 90th percentile of frequency
+    - Saves filtered list to a text file
+
+6. **Vector Space Modeling**
+    - For each fragment of each document, computes a vector of N-gram
+      frequencies via `freq_in_chunk()`
+    - Produces a nested structure of frequency vectors
+
+7. **Rank Correlation (Spearman's rho)**
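+    - Calculates Spearman's rank correlation between fragment vectors:  
+        ρ = 1 - (6 * ∑ d_i^2) / (n(n^2 - 1))  
+      where d_i is the difference between the ranks of the i-th N-gram in the
+      two fragments and n is the number of N-grams compared. This closed form
+      is exact only when there are no tied ranks.
+
+    - Implemented via `zv_calc()` and `correlation()`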
+
+8. **Average Rank Dependence**
+    - Computes `ZVT` values for each fragment (based on 10 previous)
+    - Combines `ZVT` into full pairwise distance matrix
+
+9. **Matrix Padding**
+    - The matrix may have rows of unequal length; shorter rows are padded with zeros to a square shape
+
+10. **C++ to Python Transfer**
+    - Converts distance matrix to NumPy arrays using `Boost.NumPy`
+    - Initializes embedded Python interpreter
+    - Calls `mymodule.py::{kmedoids, kmeans, agglClus}`
+    - Extracts prediction results
+
+11. **Post-Processing Results**
+    - Splits results into clusters for artificial vs. input document
+    - Compares distributions:
+        - If both match → **Artificial text**
+        - If different → **Human-written text**
+
+12. **Visualization & UI**
+    - Result shown in message box
+    - Prediction charts drawn with `QChartView`
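Step 8 above is described only briefly, so the sketch below is one possible reading of it rather than the project's `zv_calc()`. For each fragment it averages a correlation function over up to `window` preceding fragments. The name `average_rank_dependence`, the `corr` callback, and returning `None` for a fragment with no predecessors are assumptions.

```python
def average_rank_dependence(vectors, corr, window=10):
    """Average corr(current, previous) over up to `window` preceding fragments.

    `corr` is any function taking two fragment vectors and returning a number,
    for example a Spearman rank correlation. The first fragment has no
    predecessors and therefore gets None.
    """
    result = []
    for i, current in enumerate(vectors):
        previous = vectors[max(0, i - window):i]
        if not previous:
            result.append(None)
            continue
        result.append(sum(corr(current, p) for p in previous) / len(previous))
    return result
```

Passing the correlation as a parameter keeps the windowing logic separate from the choice of coefficient. How the application handles fragments with fewer than ten predecessors is not stated in the README.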
+
+## Build Instructions
+
+### Prerequisites
+Before building, make sure the following are installed:
+
+- **Qt 5.x** (QtWidgets, QtCharts, QtCore, QtGui)
+- **Boost** (with `Boost.Python` and `Boost.NumPy` modules built)
+- **Python 3.x** (tested with Python 3.8)
+- Python packages:
+
+```bash
+pip install numpy scikit-learn scikit-learn-extra
+```
+
+### Boost Notes
+
+- You must build Boost with Python support (`b2 --with-python`).
+- Ensure that Boost is compiled with the same Python version you plan to use.
+- On Windows, you may need to specify Boost and Python library paths in the `.pro` file, for example:
+
+```qmake
+LIBS += \
+    -L"C:\Program Files\boost\boost_1_76_0\stage\x86\lib" \
+    -LC:/Users/<USERNAME>/AppData/Local/Programs/Python/Python38-32/libs
+```
+
+### Building
+
+- Clone the repository:
+
+```bash
+git clone https://git.scratko.xyz/artifical-text-detection
+cd artifical-text-detection
+```
+- Open the `.pro` file in Qt Creator.
+- Adjust library paths for Boost and Python if necessary.
+- Build and run the project.
+
+### Running
+
+- Ensure that the Python interpreter can import `mymodule.py`.
+- Make sure required Python packages are installed in the environment used by the application.
+
+## Notes
+
+- Designed to detect machine-generated documents, especially SCIgen-based fakes.
+- Can be extended to detect LLM-generated texts by modifying the feature set.
+- No pretrained models required — fully unsupervised method.