Multi-Modal Methods: Recent Intersections Between Computer Vision and Natural Language Processing

“Multi-Modal Methods” is the second publication series from The M Tank — a collaborative research group specialising in the creation of unique resources for education in artificial intelligence.

Their inaugural publication series, “A Year in Computer Vision,” achieved viral status, drawing over 100,000 hits, trending on Reddit ML and Hacker News, and attracting thousands of readers from the global AI community.

Multi-Modal Methods focuses on Lip-Reading and Image Captioning

This second series focuses on some of the recent intersections between Computer Vision and Natural Language Processing (NLP), two seemingly distinct branches of modern AI research. In each of its publication series to date, the research group has curated an impressive collection of open-source research and multimedia resources; the project is clearly focused on comprehensiveness and understanding.

The series is slated for release in three parts covering the fields of Lip-Reading and Image Captioning. The former is exactly what it sounds like: a deep dive into the history and recent breakthroughs enabling machines to read our lips with disturbing accuracy. Image Captioning, meanwhile, shepherds the reader through the history of the field, along with the technical and philosophical considerations involved in teaching machines to describe images with language.

The publications are written to inform the reader of the traditional hurdles within each field and how newer deep learning techniques have dissolved them. In this respect, the authors attempt to contextualise these advancements to give a clearer picture of the current state of research.

Discussing high-level takeaways from the latest series, The M Tank team describes their motivations:

“The topics were chosen to reflect some key themes that are ever-present in AI research. For instance, the topic of lip reading is interesting because, seemingly out of nowhere, breakthroughs in deep learning quickly toppled traditional human dominance in this area. With image captioning, we get to glimpse both the combinatorial nature of modern deep learning techniques, but also some of the limitations.

“With lip reading, there’s a clear moment where you realise we’ve lost primacy forever, whereas with image captioning, we get to see the value of context and where the [current] line in the sand is drawn.”

The first two parts of “Multi-Modal Methods” are available now for free via their dedicated site.

It is also available on Medium and on Twitter @TheMTank.

AI Awards 2018: Applications Now Open
Connecting Ireland’s AI ecosystem for a higher purpose

Applications are now open for you to take part in the inaugural AI Awards and showcase your skills and expertise to the AI community on the island of Ireland.

The AI Awards is a not-for-profit organisation dedicated to promoting, advancing and rewarding corporate and academic excellence in Artificial Intelligence (AI) innovation, research and development, and products.

It is our mission to support the AI community by recognising the hard work and dedication of those working in the field and to help promote and market Ireland globally as the first-choice home for AI investment and job creation in research and product development.

For more information about the awards, head over to

By AI Ireland

AI Ireland's mission is to increase the use of AI for the benefit of our society, our competitiveness, and for everyone living in Ireland.
