Today I learned: command > script

Had to compare two files at work today. Actually, I had to compare one file to a series of files to see what data exists in both of them. This technically comes down to a LEFT JOIN where we only want left column data when it exists in the right column.

So, in writing a script in PHP it comes down to:

<?php
ini_set('MEMORY_LIMIT', '256M');
if (!file_exists($argv[1])) { die('file ' . $argv[1] . ' not found'); }
if (!file_exists($argv[2])) { die('file ' . $argv[2] . ' not found'); }
$fp = fopen($argv[1], 'rt');
$lines = [];
do {
  $line = trim(fgets($fp));
  if (strlen($line) > 0) {
    $lines[] = $line;
  }
} while (!feof($fp));
fclose($fp);
$fp = fopen($argv[2], 'rt');
do {
  $line = trim(fgets($fp));
  if (strlen($line) > 0) {
    if (in_array($line, $lines)) {
      echo "$line\n";
    }
  }
} while (!feof($fp));
fclose($fp);

This script, albeit working like a charm, takes a while with large amounts of records.

After some googling this script isn’t really necessary if you use grep correctly. You also gain the speed of an executable in one fell swoop.

$ grep -Fxf [file1] [file2]

Output is exactly the same.

Today I learned: regex > loop

In writing “quad-quad”, which is a set of four 4-letter speak-able words that can be used as a user-friendly “bookmark” into easily finding a record, I was writing a “quick” program to extract the contents of wikidatawiki-20220820-pages-articles-multistream.xml (a wikipedia dump) and came into this large delay in the following loop:

$alphas = 'qwertyuiopasdfghjklzxcvbnm ';
$newline = '';
for ($x = 0; $x < strlen($line); $x++) {
    $c = substr($line, $x, 1);
    if (strpos($alphas, $c) !== false) {
        $newline = $newline . $c;
    else {
        $newline = $newline . ' ';
    }
}

The loops main purpose is to sanitize any non-letter data by replacing unknown characters with a space for later processing. The end result would be words that I could filter down to 4-character words and tally them up.

When the program read a line around 1mb in length it would “hang” for a bit as it chewed through the data. In a nutshell 25,100,655 bytes of data would take 24m36s. It was time to optimize.

Replacing the previous with the following regex performance was increased immensely.

$newline = preg_replace('/[^a-z]/', ' ', $line);

The same amount of data took 1.892s.

Lesson: If you don’t know regexes, learn regexes.