Finding parent child word relationships

I just updated my project showing how NumPy’s arrays and other python objects can be used to Extract, Transform, and Load (ETL) data into a SQLite database. Specifically, the identification of parent/child word relationships. A parent/child word relationship is one where some or all of the letters of a focal word can be used to spell another word: an anagram!

See this page for a graphic showing some of the parent/child word relationships.

V1.0 can be found here.

The difference between this version and pervious versions is an expanded set of parent/child word determination techniques and better delineated techniques. There were four techniques in v1.0 and six techniques in v2.0. Each of the six techniques produce the same output: the identification and storage of approximately 73M parent-child word pairs. This project was motivated by my desire to learn more about NumPy and over time morphed into how quickly and efficiently I could identify the parent/child word pairs. Briefly, each of the six techniques involves subdivision of a $[216K, 26]$ matrix into $1$ or more sub-matrices of size $[M, 26]$

The six techniques are listed below along with a brief description and how long it takes to identify the parent/child word pairs. In general, the techniques progress from naive matrix sub-division to more efficient sub-division of the $[216K, 26]$ matrix.

Matrix Extraction Option Subdivision Description Total Time
1 No subdivision 1 h 33 m 50 s
2 Word length 49 m 33 s
3 First letter 35 m 27 s
4 Single least common letter 15 m 29 s
5 Three least common letters 3 m 2 s
6 Three least common letters and word length 3 m 39 s

Yes, techniques 5 and 6 really are 30 times faster than option 1! As for specific reasons why? Well, I’ll recommend the project’s thorough (and verbose) README.md for information that answers that question. There are lots of graphics describing the differences in the timing and the size of the sub-matrices generated by each matrix extraction technique.

If you have any ideas for an additional sub-division technique, please let me know. I have no doubt there are additional possibilities out there. Is a sub-minute determination possible? Perhaps a wholly different technique or a different hardware implementation such as using a GPU-accelerated version of NumPY could be used to crack the sub-minute determination.