Once an idea starts taking shape, the first tendency is to blow the trumpets of imagination about what it could possibly become, instead of hunkering down on the task at hand.
After endless thoughts and debates about the disruptive possibilities of text production marrying voice distribution, I have decided to focus on implementing deepfake tooling instead of reading research papers or chasing angel investment.
Overdub by Descript needs an upfront signup, so I would rather spend its 7-day free trial judiciously. Resemble AI sounds more promising in the longer term since it offers API integration, whereas Overdub has to be used on the local machine. But the voice training did not produce very encouraging output, which makes me doubt whether anything off the shelf is powerful enough to voice over long-form content. A tactical workaround would be to focus on bite-sized text/audio à la Naval’s tweetstorms. However, that makes the product closer to Rupi Kaur than Pablo Neruda.
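Since the longer-term bet is on API integration rather than a local tool, the workflow I have in mind is roughly: push a short text snippet to the cloning service, get audio back, queue it for distribution. Below is a minimal sketch of that loop; the endpoint URL, payload fields, and response shape are placeholders I have not verified against Resemble AI’s actual API, so treat every name here as an assumption.

```python
# Hypothetical sketch of an API-driven voice-over step.
# The endpoint, auth scheme, and payload fields are placeholders, not
# Resemble AI's documented API; check the real docs before wiring this up.
import requests

API_KEY = "YOUR_API_KEY"        # assumed: simple bearer-token auth
VOICE_ID = "my-cloned-voice"    # assumed: an identifier for a trained voice


def synthesize(text: str, out_path: str = "clip.wav") -> str:
    """Send a short text snippet to a (hypothetical) TTS endpoint and save the audio."""
    resp = requests.post(
        "https://api.example-voice-clone.com/v1/synthesize",  # placeholder URL
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"voice_id": VOICE_ID, "text": text},
        timeout=60,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)   # assumes the API returns raw audio bytes
    return out_path


if __name__ == "__main__":
    # Bite-sized input, in line with the tweetstorm-length workaround above.
    synthesize("A short aphorism rather than a full essay.")
```

Even as a sketch, it makes the contrast concrete: this kind of loop can run on a server and feed a distribution pipeline, which a local-only tool like Overdub cannot.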
The next step is to quickly (very quickly) implement a clean deepfake voiceover for the next edition and set up Anchor/Soundtrap for distribution. Most of the conversation online about voice cloning has tended to revolve around the tools’ efficacy rather than their use in art production. But the arbitrage in this case lies in figuring out the distribution. The million-dollar question is not how well deepfakes can replicate Jay-Z or Aaron Sorkin. We know the answer to that already. The non-trivial question is how information will be consumed in the future. Will AR kill text? Is the audio resurgence in the YouTube era a pattern we will see often? Or will Neuralink disrupt sensory perception completely?
I’ll let the market educate me on that one.