From ef8a3f6c3e20178ee520f1e6bedbc866e3c9b490 Mon Sep 17 00:00:00 2001
From: scratko
Date: Sun, 3 Aug 2025 02:28:24 +0300
Subject: Initial commit: added source code, resources and README

---
 README.md | 141 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..6fc782a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,141 @@

# Artificial Text Detection

This project is based on my Master's thesis:
> **"Алгоритм выявления искусственно созданных текстов"**
> ("An Algorithm for Detecting Artificially Created Texts")
> Nizhny Novgorod State Technical University, 2021
> Author: Andrey Kuznetsov

It is a C++/Qt application designed to detect artificially generated
scientific texts (e.g., SCIgen output). The detection method analyzes the
internal stylistic consistency of a document using unsupervised clustering
and rank correlation metrics.

## Project Overview

The program takes an input text, splits it into fragments, builds a vector
space from N-gram features, computes a pairwise distance matrix using
Spearman rank correlation, and applies clustering to detect stylistic
discontinuities. Such discontinuities are often present in machine-generated
texts.

## Technologies Used

- **C++17**
- **Qt 5** (Widgets, Charts, QThread)
- **Boost libraries**:
  - `boost::python` and `boost::python::numpy` for the C++/Python bridge
- **Python 3**
  - `numpy`, `scikit-learn`, `scikit-learn-extra`

## Features

- **Flexible Text Input**
  - Manual text input via the editor
  - File selection via a file dialog
  - **Drag and drop** support for `.txt` files

- **Text Preprocessing**
  - Removes stop words, non-letter symbols, and repeated spaces
  - Converts text to lowercase

- **N-Gram Extraction and Dictionary Building**
  - Extracts N-grams (min N = 2; max N set via the UI)
  - Builds a global dictionary from all fragments
  - Performs percentile-based feature selection

- **Vector Space Model Construction**
  - Vectorizes fragments based on N-gram frequencies
  - Produces a document → fragment → feature structure (nested `vector`s of frequency values)

- **Distance Calculation Using Rank Correlation**
  - Calculates a pairwise distance matrix between fragments using **Spearman correlation** (a sketch of this distance follows this list)
  - Computes an average rank dependence (`ZVT`) using windowed comparisons

- **Matrix Normalization**
  - Normalizes the matrix shape by padding with zeros before clustering

- **Unsupervised Clustering with Python**
  - Supports **K-Medoids**, **K-Means**, and **Agglomerative Clustering**
  - Clustering is performed in Python (`scikit-learn`, `scikit-learn-extra`)
  - Data exchange is handled via `Boost.Python` + `Boost.NumPy`

- **C++ ⇄ Python Integration**
  - Converts the C++ matrix to a Python `numpy.ndarray`
  - Calls clustering methods from the Python module `mymodule.py`
  - Extracts prediction labels back into C++

- **Visual Output**
  - Scatter charts for predicted vs. real document labels
  - Color-coded clusters in the UI

- **Multithreaded Execution with Progress Bar**
  - The GUI remains responsive via `QThread`
  - Progress is shown with `QProgressDialog`
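The rank-correlation distance above is the core of the method. The following is a minimal, illustrative C++ sketch of a Spearman comparison between two fragment frequency vectors; the names `ranks` and `spearman_rho` are placeholders, and the project itself implements this logic in `zv_calc()` and `correlation()` (see step 7 of the pipeline below). Ties are ranked by position here, whereas a full implementation would typically assign average ranks.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Assign ordinal ranks (1..n) to a frequency vector.
// Ties are broken by position for simplicity.
std::vector<double> ranks(const std::vector<double>& values) {
    std::vector<std::size_t> order(values.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });
    std::vector<double> r(values.size());
    for (std::size_t i = 0; i < order.size(); ++i)
        r[order[i]] = static_cast<double>(i + 1);
    return r;
}

// Spearman's rho between two equal-length vectors (n > 1):
// rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
double spearman_rho(const std::vector<double>& x, const std::vector<double>& y) {
    const std::vector<double> rx = ranks(x);
    const std::vector<double> ry = ranks(y);
    double sum_d2 = 0.0;
    for (std::size_t i = 0; i < rx.size(); ++i) {
        const double d = rx[i] - ry[i];
        sum_d2 += d * d;
    }
    const double n = static_cast<double>(x.size());
    return 1.0 - 6.0 * sum_d2 / (n * (n * n - 1.0));
}
```

How ρ is turned into a matrix entry is up to `zv_calc()`; a common convention would be to use `1 - spearman_rho(x, y)` as the distance, but that choice is not spelled out in this README.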
## Algorithm Pipeline (Detailed)

1. **Text Input**
   - The user provides text via the input box, a file dialog, or drag-and-drop.
   - The text is stored in `target_doc`.

2. **Preprocessing** (`prepare()`)
   - Removes stop words (prepositions, conjunctions, etc.)
   - Removes all characters except letters and spaces
   - Collapses repeated characters (e.g., spaces)
   - Converts the text to lowercase
   - (A preprocessing sketch appears at the end of this README.)

3. **Fragmentation**
   - The text is split into equal-length chunks based on UI parameters.

4. **N-Gram Extraction**
   - Combines all documents into a single corpus
   - Calculates N-grams for N from 2 up to the maximum N (from the UI)
   - Uses a sliding-window algorithm (see the N-gram sketch at the end of this README)
   - Aggregates the results into a dictionary

5. **N-Gram Filtering**
   - Keeps N-grams above the 90th percentile of frequency
   - Saves the filtered list to a text file

6. **Vector Space Modeling**
   - For each document and each fragment, computes a vector of N-gram
     frequencies via `freq_in_chunk()`
   - Produces a nested structure of frequency vectors

7. **Rank Correlation (Spearman's rho)**
   - Calculates the rank correlation distance between fragment vectors:

     ρ = 1 - (6 * Σ d_i^2) / (n * (n^2 - 1))

   - Implemented in `zv_calc()` and `correlation()`

8. **Average Rank Dependence**
   - Computes a `ZVT` value for each fragment (based on the 10 preceding fragments)
   - Combines the `ZVT` values into a full pairwise distance matrix

9. **Matrix Padding**
   - The matrix may have rows of uneven length; it is padded with zeros to a
     square shape.

10. **C++ to Python Transfer**
    - Converts the distance matrix to a NumPy array using `Boost.NumPy`
    - Initializes the embedded Python interpreter
    - Calls `mymodule.py::{kmedoids, kmeans, agglClus}` (see the Boost.Python
      sketch at the end of this README)
    - Extracts the prediction results

11. **Post-Processing of Results**
    - Splits the results into clusters for the artificial reference document
      and the input document
    - Compares the two distributions:
      - If they match → **artificial text**
      - If they differ → **human-written text**

12. **Visualization & UI**
    - The verdict is shown in a message box.
    - Prediction charts are drawn with `QChartView`.
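## Appendix: Illustrative Sketches

The sketches below approximate individual pipeline steps. They are simplified,
self-contained C++ examples, not the project's actual code, and any identifier
that does not appear earlier in this README is an assumption.

First, a sketch of the preprocessing in step 2 (`prepare()`): keep only letters
and spaces, lowercase the text, collapse repeated whitespace, and drop stop
words. It assumes ASCII input and a caller-supplied stop-word list; the real
application works with Qt strings and also has to handle Cyrillic text.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// Minimal stand-in for prepare(): cleans a raw document before fragmentation.
std::string preprocess(const std::string& text,
                       const std::set<std::string>& stop_words) {
    // Keep letters and spaces only, and lowercase everything kept.
    std::string cleaned;
    for (unsigned char c : text)
        cleaned += std::isalpha(c) ? static_cast<char>(std::tolower(c)) : ' ';

    // Re-tokenizing on whitespace collapses repeated spaces; stop words are dropped.
    std::istringstream in(cleaned);
    std::string word, result;
    while (in >> word) {
        if (stop_words.count(word))
            continue;
        if (!result.empty())
            result += ' ';
        result += word;
    }
    return result;
}
```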
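Next, a sketch of the sliding-window N-gram extraction and fragment
vectorization from steps 4 to 6. Word-level N-grams are assumed (the README
does not state whether word or character N-grams are used), and
`freq_vector()` below plays the role that `freq_in_chunk()` plays in the real
code.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split a preprocessed fragment into words.
std::vector<std::string> tokenize(const std::string& fragment) {
    std::istringstream in(fragment);
    std::vector<std::string> words;
    for (std::string w; in >> w;)
        words.push_back(w);
    return words;
}

// Count every N-gram for N in [2, max_n] with a sliding window.
std::map<std::string, int> ngram_counts(const std::string& fragment, int max_n) {
    const std::vector<std::string> words = tokenize(fragment);
    std::map<std::string, int> counts;
    for (int n = 2; n <= max_n; ++n)
        for (std::size_t i = 0; i + static_cast<std::size_t>(n) <= words.size(); ++i) {
            std::string gram = words[i];
            for (int k = 1; k < n; ++k)
                gram += ' ' + words[i + k];
            ++counts[gram];
        }
    return counts;
}

// Vectorize one fragment against the filtered global dictionary.
std::vector<double> freq_vector(const std::string& fragment,
                                const std::vector<std::string>& dictionary,
                                int max_n) {
    const std::map<std::string, int> counts = ngram_counts(fragment, max_n);
    std::vector<double> v;
    v.reserve(dictionary.size());
    for (const std::string& gram : dictionary) {
        const auto it = counts.find(gram);
        v.push_back(it == counts.end() ? 0.0 : static_cast<double>(it->second));
    }
    return v;
}
```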
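Finally, a sketch of the C++ → Python hand-off in step 10 using `Boost.Python`
and `Boost.NumPy`. The call signature `kmedoids(matrix, n_clusters)`, the
integer dtype of the returned labels, and the flattened row-major input buffer
are assumptions; error handling and interpreter shutdown are omitted.

```cpp
#include <boost/python.hpp>
#include <boost/python/numpy.hpp>
#include <vector>

namespace bp = boost::python;
namespace np = boost::python::numpy;

// Hand a rows x cols distance matrix (row-major, contiguous) to mymodule.py
// and read the predicted cluster labels back.
std::vector<long> cluster_with_python(const std::vector<double>& flat_matrix,
                                      int rows, int cols, int n_clusters) {
    Py_Initialize();   // embedded interpreter
    np::initialize();  // Boost.NumPy runtime

    // Wrap the C++ buffer as a rows x cols float64 ndarray (no copy).
    np::ndarray matrix = np::from_data(
        flat_matrix.data(), np::dtype::get_builtin<double>(),
        bp::make_tuple(rows, cols),
        bp::make_tuple(cols * sizeof(double), sizeof(double)),
        bp::object());

    // Call the clustering function exposed by mymodule.py.
    bp::object mymodule = bp::import("mymodule");
    bp::object result = mymodule.attr("kmedoids")(matrix, n_clusters);

    // Convert the returned labels to a known dtype and copy them out.
    np::ndarray labels = np::array(result).astype(np::dtype::get_builtin<long>());
    const long* data = reinterpret_cast<const long*>(labels.get_data());
    return std::vector<long>(data, data + rows);
}
```

In the real application this call runs on a worker `QThread`, which is how the
GUI stays responsive while clustering is in progress.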