A colleague showed me Google’s NotebookLM (NLM from now on) service, so I tried it out, testing it with the meanest, gnarliest test I could come up with. Here’s my review:
What is it?
Basically it’s a service that lets you upload a set of documents and use the uploaded content as a knowledge base to ask questions against. Much like ChatGPT, but with your own dataset. This is similar to services Google already offers in GCP, but with a friendly, easy-to-use consumer interface. File upload accepts multiple formats, and you can also select documents from Google Drive (Docs and Slides).
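NotebookLM’s internals aren’t public, but conceptually this kind of service resembles retrieval-augmented generation: find the passages most relevant to your question and hand them to a language model together with the question. Below is a minimal Python sketch of that general pattern, with a deliberately naive word-overlap scorer. It is an illustration of the concept only, not a description of how NLM actually works; as the results below show, NLM is clearly doing something smarter than word matching.

```python
# Minimal sketch of document-grounded Q&A (retrieval-augmented generation).
# This is NOT NotebookLM's implementation - just the general shape of the idea.

def chunk(text: str, size: int = 500) -> list[str]:
    """Split a document into roughly size-character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def score(question: str, passage: str) -> int:
    """Naive relevance score: number of words shared with the question."""
    q = set(question.lower().split())
    return len(q & set(passage.lower().split()))

def build_prompt(question: str, documents: list[str], top_k: int = 3) -> str:
    """Pick the most relevant passages and wrap them in a prompt for a model."""
    passages = [c for doc in documents for c in chunk(doc)]
    best = sorted(passages, key=lambda p: score(question, p), reverse=True)[:top_k]
    context = "\n---\n".join(best)
    # The model is asked to answer from the supplied context only.
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_prompt("Who are X's sisters?", ["...full text of book one...", "...full text of book two..."]))
```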
The test
I uploaded five books of around 50K words each. That means the total material is similar in length to two average murder mystery books or one thick fantasy novel. I chose to upload my own books specifically because they are not known to AI services like ChatGPT, which lets me test the learning and reasoning skills of the service on material that I (together with a few fans) am a top authority on. I know which things are only vaguely described and which developments unfold over multiple books.
Amazon link to the test material; the target audience is casual fantasy readers. No spoilers below: character names are replaced with aliases, and the same aliases are reused in multiple questions for different characters.
The results
For basic factual questions NLM is surprisingly good, and it answers even tricky questions that require understanding a full paragraph to get right. The references it provides to where it found the information are spot on, which, if nothing else, saves you time finding the relevant sections in the material. However, it frequently gets small details wrong.
Example: Where was a certain sword found?
Answer: A, B and C went to place X and person E handed them her sword.
Correct answer: A, B, C and D went to place X and person E gave them the sword that belonged to place X.
NLM can map facts and draw simple conclusions in multiple directions.
Example: Who are X’s sisters?
Answer: A, B, C are X’s sisters.
This is correct even though the facts are scattered: Y is mentioned as the mother of A, B and C in one section, and as the mother of X in a different one. In one separate paragraph C is explicitly called the sister of X, and NLM specifically points to that as well.
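To see why this counts as inference rather than lookup, here is a minimal Python sketch of the deduction involved, using the review’s aliases as stand-ins: sisterhood follows from combining mother facts stated in separate sections.

```python
# Sibling inference from scattered parentage facts, as in the example above.
# The aliases match the review; all the children here are female, as in the books.

mother_of = {"A": "Y", "B": "Y", "C": "Y", "X": "Y"}  # facts gathered from separate passages

def sisters(person: str) -> set[str]:
    """Everyone who shares this person's mother, excluding the person themselves."""
    mom = mother_of[person]
    return {child for child, m in mother_of.items() if m == mom and child != person}

print(sisters("X"))  # {'A', 'B', 'C'} (order may vary)
```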
For developments spanning multiple chapters or books, NLM easily finds even vague hints or scattered evidence to answer a question. It is clearly not just searching for the words or terms you ask about; it’s much better than that.
Example: Who is X attracted to?
Answer:
All of these were spot-on answers, with correct references throughout multiple books.
NLM is fairly good at treating longer passages as a single piece of information. Facts or developments described over a paragraph up to a page or two are easily understood as a complete text. As the example of X’s sisters above shows, it can also combine facts from different places as long as they are simple enough. However, when an answer requires understanding multiple pages or a complete chapter, it fails.
Example: What’s X’s real name?
Answer: Even though we get insight into the history of X, her real name is never mentioned.
The name is mentioned in the first sentence of the first chapter, which then goes on to describe an event that ends with her taking the new name X. Ironically, when later asked who X’s mother is, NLM replies that X is described as changing her identity, referring to the correct section.
NLM can reason and calculate using facts, but with low precision.
Example: Who’s the oldest elf and how old is she?
Answer: The oldest elf is X and she’s 2700 years old. Elves are immortal and can live for thousands of years. Elves are not given exact ages but are mentioned in relation to events, e.g.: X and Y lived before the war, which took place more than two thousand years ago, and they were around 500 years old at the time.
Correct answer: K is the oldest elf mentioned; he is over six thousand years old. The oldest female elf is G, the grandmother of X, although her age is never given.
Interestingly, if asked who X’s mother is, NLM correctly replies A. If asked about A’s age, NLM reasons that since her daughter X is 2700 years old, A must have been born before the world war, at least 500 years earlier just like X, and must therefore be at least 2500 years old. That is a lot of needless reasoning that ends in the wrong answer: as X’s mother, A must obviously be older than X’s 2700 years. When asked about Z’s age, it suggests he is 4407 years old based on completely correct facts (except for the “7”, which it just randomly adds), calculated in the wrong way. It fails to keep track of what happens before and after historic events, and it fails to factor in clear references to when an event is said to occur. It can also reason that N, as the grandfather of Z, must be at least 11,000 years old; that is not wrong, although a correct calculation would conclude at least 17,000 years (if he is still alive, which is not clear).
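To make the slip about A’s age concrete, here is a toy Python reconstruction of the bound arithmetic. The variable names and the reasoning chain are mine; only the figures come from the facts quoted above.

```python
# Toy reconstruction of the age arithmetic from the example above.
# Only the numbers are from the review; the chain of reasoning is reconstructed.

years_since_war = 2000   # the war took place "more than two thousand years ago"
age_at_war = 500         # X was "around 500 years old at the time"
x_age = 2700             # the age NLM itself states for X

x_lower_bound = years_since_war + age_at_war  # 2500: a valid lower bound for X herself
a_lower_bound = x_age                         # A is X's mother, so she must be older than X

print(x_lower_bound)  # 2500 - the bound NLM wrongly reused for A
print(a_lower_bound)  # 2700 - the weakest defensible bound for A
```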
NLM can detect patterns that occur frequently, although it has trouble figuring out information that is easily derived from context. Although never spelled out in the books, all elven names follow a strict <First name> ai/ei <First name of a parent> pattern. NLM concludes this easily; when asked “What’s the structure of elven first and last names?”, it describes the pattern above in a wordy way. It correctly detects that “ai” and “ei” are used to bind the two parts together, but fails to figure out that “ai” means “daughter of” and “ei” means “son of”, which is easily deduced when reading the names in the context of the story (a sketch of the pattern follows below).
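For the curious, here is a small Python sketch of the naming pattern as described above. The example names are invented placeholders, not names from the books.

```python
import re

# The pattern described in the review: <first name> ai|ei <first name of a parent>,
# where "ai" marks a daughter and "ei" a son. Example names below are made up.

NAME = re.compile(r"^(?P<child>\w+) (?P<link>ai|ei) (?P<parent>\w+)$")

def parse_elven_name(full_name: str) -> str:
    """Decompose an elven name into child, relation, and parent."""
    m = NAME.match(full_name)
    if not m:
        raise ValueError(f"not an elven name: {full_name!r}")
    relation = "daughter" if m["link"] == "ai" else "son"
    return f"{m['child']}, {relation} of {m['parent']}"

print(parse_elven_name("Mira ai Selene"))   # Mira, daughter of Selene
print(parse_elven_name("Toren ei Kaleth"))  # Toren, son of Kaleth
```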
Conclusion
When it comes to finding basic facts in a dataset of this size, NLM performs very well. When it comes to reasoning about the material, it can identify basic relationships of various kinds. There are gaps in how it connects facts, but depending on how questions are asked it can puzzle together sections from different books to arrive at correct answers. However, even basic arithmetic is a hit-and-miss experience at best. Applying mathematical operations based on context seems to be limited to, or at least to default to, addition.