Finding parent child word relationships

I just updated my project showing how NumPy’s arrays and other python objects can be used to Extract, Transform, and Load (ETL) parent/child word-relationship data into a SQLite database, A parent/child word relationship is one where some or all of the letters of a focal word can be used to spell another word: an anagram!

See this page for an interactive graphic showing some of the parent/child word relationships.

Version 1.0 can be found here.

The difference between this version and version 1.0 is an expanded set of parent/child word determination techniques and more explicit delineation between techniques. There were four techniques in v1.0 and six techniques in v2.0. Each of the six techniques produce the same output: the identification and storage of approximately 73M parent-child word pairs. Initially, this project was motivated by my desire to learn more about NumPy. Over time it morphed into how quickly and efficiently I could identify the parent/child word pairs. Briefly, each of the six techniques involves subdivision of a \([216K, 26]\) matrix into \(1\) or more sub-matrices of size \([M, 26]\).

The six techniques are listed below along with a brief description and how long it takes to identify the parent/child word pairs. In general, the techniques progress from naive matrix sub-division to more efficient sub-division of the \([216K, 26]\) matrix.

Matrix Extraction Option Subdivision Description Total Time Approx. Speed-Up
1 No subdivision 1 h 33 m 50 s 1x
2 Word length 49 m 33 s 2x
3 First letter 35 m 27 s 3x
4 Single least common letter 15 m 29 s 6x
5 Three least common letters 3 m 2 s 31x
6 Three least common letters and word length 3 m 39 s 25x

Yes, techniques 5 and 6 really are 31 and 25 times faster than option 1! As for specific reasons why? Well, I’ll recommend the project’s thorough README.md for information that answers that question. There are lots of graphics describing the differences in the timing and the size of the sub-matrices generated by each matrix extraction technique.

If you have any ideas for an additional sub-division technique, please let me know. I have no doubt there are additional possibilities out there. Is a sub-minute determination possible? Perhaps a wholly different technique or a different hardware implementation such as using a GPU-accelerated version of NumPY could be used to crack the sub-minute determination.