Fgselectivearabicbin
Arabic characters can have multiple shapes depending on position (initial, medial, final, isolated). A purely binary filter cannot handle shape-dependent replacements unless it implements a full Arabic shaping engine, which increases complexity.
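To make the shape problem concrete, here is a minimal sketch using Python's standard `unicodedata` module. It lists the four Unicode presentation forms of the letter BEH (U+0628) and shows that each has a different byte encoding from the base letter. The helper name `beh_presentation_forms` is hypothetical and is not part of FGSelectiveArabicBin.

```python
import unicodedata

def beh_presentation_forms():
    """Return (code point, name, UTF-8 bytes, NFKC base) for each form of BEH."""
    rows = []
    # U+FE8F..U+FE92 are the isolated, final, initial and medial forms of BEH.
    for cp in range(0xFE8F, 0xFE93):
        ch = chr(cp)
        base = unicodedata.normalize("NFKC", ch)  # folds the form back to U+0628
        rows.append((cp, unicodedata.name(ch), ch.encode("utf-8"), base))
    return rows

for cp, name, raw, base in beh_presentation_forms():
    print(f"U+{cp:04X} {name}: bytes={raw.hex()} base=U+{ord(base):04X}")
```

The base letter encodes as two bytes in UTF-8, while each presentation form encodes as three. A byte-level rule written for the base letter alone will therefore not match text stored in presentation forms.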
Using SIMD instructions (e.g., AVX-512 on x86, NEON on ARM), a modern FGSelectiveArabicBin implementation can scan 16 to 64 bytes per instruction (16 with NEON, 64 with AVX-512), testing each byte against the UTF-8 byte ranges that encode the Arabic Unicode block. This can yield speeds over 2 GB/s on a single core.
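The range test itself is simple. The sketch below is a scalar Python reference, not SIMD code and not the tool's actual implementation. It checks one byte pair at a time, whereas a SIMD version would apply the same comparisons to a whole register of bytes at once. It assumes UTF-8 input and covers only the basic Arabic block (U+0600 to U+06FF), whose characters encode as a lead byte in 0xD8..0xDB followed by one continuation byte. The function name `find_arabic_offsets` is hypothetical.

```python
from typing import List

def find_arabic_offsets(data: bytes) -> List[int]:
    """Return byte offsets of UTF-8 sequences encoding U+0600..U+06FF."""
    offsets = []
    i = 0
    n = len(data)
    while i + 1 < n:
        lead, cont = data[i], data[i + 1]
        # Lead byte 0xD8..0xDB plus a continuation byte 0x80..0xBF.
        if 0xD8 <= lead <= 0xDB and 0x80 <= cont <= 0xBF:
            offsets.append(i)
            i += 2  # skip the whole two-byte character
        else:
            i += 1  # not Arabic: advance one byte, no decoding needed
    return offsets

# Mixed binary sample: null byte, invalid byte, ASCII, then four Arabic letters.
sample = b"\x00\xffID:" + "سلام".encode("utf-8") + b"\x00\x01"
print(find_arabic_offsets(sample))  # → [5, 7, 9, 11]
```

Because the scan never decodes the buffer, null bytes and invalid sequences are simply skipped. The trade-off is that arbitrary binary data can contain the same byte pairs by coincidence, so a real filter would likely need additional context checks.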
| Feature | grep + iconv | Python re on decoded text | FGSelectiveArabicBin |
|---------|----------------|----------------------------|--------------------------|
| Works on raw binary with null bytes | No | No (unless binary mode, but then regex fails on UTF-8) | ✅ Yes |
| Preserves original non-Arabic binary | Yes (but cannot modify) | No (decoding loses original offsets) | ✅ Can modify selectively |
| Speed on 1 GB mixed binary data | ~8 seconds | ~45 seconds (decoding overhead) | ~1.5 seconds (SIMD) |
| Handles invalid UTF-8 sequences | No (decoder error) | No (UnicodeDecodeError) | ✅ Yes (skips/replaces) |
| Arabic-specific ligature control | No | Via external libraries (e.g., CamelTools) | ✅ Built-in |
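The difference between decoding and byte-level processing can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the tool's actual behavior: the hypothetical `mask_arabic` function overwrites each two-byte Arabic sequence (U+0600 to U+06FF in UTF-8) with a two-byte filler, so every other byte keeps its original offset, even when the buffer is not valid UTF-8.

```python
def mask_arabic(data: bytes, fill: bytes = b"__") -> bytes:
    """Overwrite UTF-8 sequences for U+0600..U+06FF with a 2-byte filler."""
    if len(fill) != 2:
        raise ValueError("fill must be exactly 2 bytes to preserve offsets")
    out = bytearray(data)
    i = 0
    while i + 1 < len(out):
        if 0xD8 <= out[i] <= 0xDB and 0x80 <= out[i + 1] <= 0xBF:
            out[i:i + 2] = fill  # same length in, same length out
            i += 2
        else:
            i += 1  # leave every non-Arabic byte untouched
    return bytes(out)

# Arabic text surrounded by bytes that are not valid UTF-8.
blob = b"\x00\xfe" + "مرحبا".encode("utf-8") + b"\xff\x00"

try:
    blob.decode("utf-8")  # the decode-first approach fails here
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

masked = mask_arabic(blob)
print(masked, len(masked) == len(blob))
```

Requiring the filler to match the length of the sequence it replaces is a deliberate choice in this sketch: it keeps offsets stable for any structure in the file that refers to byte positions.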
This is the core "selective" component. It applies rules such as: