While working with customers building unstructured document processing models (formerly known as document understanding models) in Microsoft Syntex, I picked up a great deal of practical knowledge along the way. To give full credit, I received assistance from Microsoft on many of these answers. I’ve added some context and further explanation around some of the answers and am sharing them in case you have the same questions!
An unstructured document processing model uses AI to process documents. The documents must have text that can be identified based on phrases or patterns. The text you identify designates both the TYPE of file (the model name, which manifests as a content type classification) and what you’d like to extract from it (extractors, each of which manifests as a document property).
- If 5 positive/1 negative training files are all that is required to build a model, is MORE than that better?
- Can you use an existing content type for your model instead of creating a new one?
- How do you manage false model matches in the library where the model is published?
- Can you programmatically retrieve the confidence score on the library?
- Can you programmatically run a Syntex model on a file?
- Are explanations OR’d or AND’d together?
- What are tokens?
- Can you migrate a Syntex model from one tenant to another?
- Can I have more than 1 model published to the same library?
- Can I publish a model to the OneDrive library?
If 5 positive/1 negative examples are all that is required to build a model, is MORE than that better?
Answer: usually, yes! The goal in picking your training files is to be representative of the variation in the end data set (your document library). If you have a homogeneous data set, the minimums are fine. In the real world, focus less on the number of files and more on capturing the representative breadth of the content that may be added to the library. A model’s quality is a product of two elements: the labeling and the explanations. Variety in the training files gives the algorithm more to work with in building the AI model.
Can you use an existing content type for your model instead of creating a new one?
Answer: yes. During model creation, you can associate it with an existing site content type under Advanced settings. The site content type can be created directly on the site or in the tenant-level Content Type Gallery.
How do you manage false matches in the library where the model is published?
Answer: There are a couple of ways you can manage these. In all cases, if you’ve decided to use Syntex models in your tenant to classify content, you must also recognize that building a model is NOT a once-and-done activity. It will require constant care and feeding by the users who understand the content (such as knowledge managers or content owners). Over time, if the model becomes less “confident” in determining a match, that’s likely a good indication that you need to:
- provide more training files to the model that are representative of the content being added to the library that you consider a positive match and then republish the model to the library
- add more explanations in your model and then republish the model to the library
Some ideas for managing false matches in the library where the model is published:
- If the default content type is Document, create a separate view filtered on that content type and ensure the content in that view is monitored regularly.
- Make a new content type called Unclassified and set it as the default. When new documents are added to the library, they will automatically be set to Unclassified. When the model runs, it will change the content type to the Syntex model’s content type only if it is confident it has a match; otherwise, the document is left as Unclassified. You can then take appropriate action on documents that still carry the Unclassified content type.
Remember… an unstructured document processing model is an iterative journey of identifying false matches and feeding them back into the training model for refinement.
How do you manage low confidence scores for documents in the published library?
Answer: You can use the confidence score to inform your training process. You should decide what threshold will trigger a knowledge manager to look at the model to determine if it needs more refinement. This continuous feedback process supports the idea that an unstructured document processing model is NOT a once-and-done activity.
A couple of ideas to help:
- Use List view/Column formatting to highlight the row/document if it falls below your confidence score threshold
- Use the new Automate > Rules > Create a rule feature on the document library toolbar to automatically send an email to the knowledge manager when the confidence score falls below a certain threshold
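As a sketch of the first idea, SharePoint list formatting can conditionally style the confidence score column with JSON like the following. The 0.7 threshold and the assumption that this format is applied directly to the confidence score column are mine; adapt both to your library.

```json
{
  "$schema": "https://developer.microsoft.com/json-schemas/sp/v2/column-formatting.schema.json",
  "elmType": "div",
  "txtContent": "@currentField",
  "attributes": {
    "class": "=if(@currentField < 0.7, 'sp-field-severity--warning', 'sp-field-severity--good')"
  }
}
```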
Can you programmatically retrieve the confidence score on the library?
Answer: Yes, it’s stored as a library column, so Power Automate and API options will have access to its value.
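As a rough sketch of the API route, the score can be read like any other list column through the SharePoint REST API. The internal field name `ConfidenceScore` below is an assumption — check your library’s actual internal column name before relying on it.

```python
# Sketch: querying a Syntex confidence score column via the SharePoint REST API.
# "ConfidenceScore" is an ASSUMED internal field name; verify it in your library.

def build_items_query(site_url: str, library_title: str,
                      field: str = "ConfidenceScore") -> str:
    """Build an OData query URL selecting the confidence score for each item."""
    return (f"{site_url}/_api/web/lists/getbytitle('{library_title}')/items"
            f"?$select=Id,FileLeafRef,{field}")

def low_confidence_items(response_json: dict, threshold: float = 0.7,
                         field: str = "ConfidenceScore") -> list:
    """From a parsed REST response body, return items scoring below threshold."""
    items = response_json.get("value", [])
    return [i for i in items
            if i.get(field) is not None and float(i[field]) < threshold]
```

You would issue the GET with your preferred auth flow and feed the parsed JSON body to `low_confidence_items` to drive whatever follow-up you’ve designed.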
Can you programmatically run a Syntex model on a file?
Answer: Yes. You can apply a model using PowerShell or API.
Why would you want to do this? A model automatically runs only on the upload event, which means that if you change something in the document’s content and a different value should be extracted as metadata because of the change, you will need to rerun the model. In this case, a Power Automate flow triggered on the item-changed event could be used to retrigger the classification process.
- PowerShell: Request processing by a custom model
- API: Create file classification request
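For the API option, my understanding from the Microsoft content-understanding API documentation is that a classification request is a POST to the `machinelearning/workitems` endpoint with a work-item body identifying the target file; verify the endpoint and entity type against the current docs before use. A sketch of building that request body:

```python
# Sketch of the "create file classification request" body. The endpoint path and
# entity type string are taken from my reading of the Microsoft docs -- verify
# them against the current documentation before use.

WORKITEMS_ENDPOINT = "/_api/machinelearning/workitems"  # POST the body here

def build_classification_request(site_id: str, web_id: str,
                                 file_unique_id: str) -> dict:
    """Build the JSON body asking Syntex to (re)classify a single file."""
    return {
        "__metadata": {
            "type": "Microsoft.Office.Server.ContentCenter."
                    "SPMachineLearningWorkItemEntityData"
        },
        "TargetSiteId": site_id,        # GUID of the site collection
        "TargetWebId": web_id,          # GUID of the web
        "TargetUniqueId": file_unique_id,  # the file's UniqueId GUID
    }
```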
Are explanations OR’d or AND’d together?
Answer: Phrase list explanations are OR’d together and interpreted, along with the labels (positive/negative), by an algorithm that generates the AI model and ultimately determines whether a document is an overall match. It is not necessary for a document to satisfy every explanation; each is simply a “potential” indicator.
It is nearly impossible to predict a linear impact from any one label or explanation on the outcome. Resist the temptation to “reverse engineer” the algorithm to determine a positive or negative outcome. Your best offense is to provide as many explanations as possible, since more explanations with positive matches will generate a higher confidence score.
This again reiterates the importance of having enough training files to cover the “representative breadth” of content that will enter the library, which in turn helps inform the number of explanations you must create.
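As a purely conceptual sketch (the real Syntex model combines labels and explanations through an opaque ML algorithm), the OR behavior of a phrase list can be illustrated as: an explanation “fires” if any of its phrases appears, and each explanation contributes only a signal, not a verdict.

```python
# Conceptual sketch only -- function names are illustrative, not the Syntex API.

def phrase_list_matches(text: str, phrases: list) -> bool:
    """A phrase-list explanation 'fires' if ANY of its phrases appears (OR)."""
    lowered = text.lower()
    return any(p.lower() in lowered for p in phrases)

def match_signals(text: str, explanations: dict) -> dict:
    """Each explanation contributes one signal; none is individually decisive."""
    return {name: phrase_list_matches(text, phrases)
            for name, phrases in explanations.items()}
```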
What are tokens?
Answer: Tokens are used to parse the content and are made up of runs of letters/digits, special characters, and spaces. When you are training the model, each white line you see indicates a token separator. You can look for an extractor within a token range.
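To make the rule concrete, here is a rough approximation of that tokenization in Python. The real tokenizer is internal to Syntex and may differ; this just illustrates how runs of letters/digits, single special characters, and spaces each become tokens.

```python
import re

def tokenize(text: str) -> list:
    """Approximate Syntex-style tokenization (illustrative only):
    a run of letters/digits, a single whitespace character, or a single
    special character each form one token."""
    return re.findall(r"\w+|\s|[^\w\s]", text)
```

For example, `tokenize("No. 123")` yields the four tokens `"No"`, `"."`, `" "`, and `"123"`.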
Can you migrate a Syntex model from one tenant to another?
Answer: Yes. You can use either PnP PowerShell or the PnP Core SDK to migrate models across content centers or tenants. You can also use PowerShell to move custom explanation templates.
- PnP PowerShell:
- PnP Core SDK: Working with Microsoft Syntex
Can I have more than 1 model published to the same library?
Answer: Yes. All models will run against any file uploaded to the library, and the one with the highest confidence score wins! It is a common scenario to have multiple models published to the same library, as business processes often store different types of content in the same library, each relating to a different model.
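The “highest confidence wins” rule can be sketched as follows (the function and score names are illustrative, not part of any Syntex API):

```python
# Illustrative sketch of "highest confidence score wins" among published models.
from typing import Optional

def winning_model(scores: dict) -> Optional[str]:
    """Return the model name with the highest confidence score, if any ran."""
    return max(scores, key=scores.get) if scores else None
```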
Can I publish a model to the OneDrive library?
Answer: No. I was asked this in a recent session I gave but didn’t ask about their use case. I’d love to know the use case if you have one! 🙂
I hope you found these FAQs helpful as you’re working through your own models. I may add to this list over time as I learn new things.
Thanks for reading.