Power Apps
Answered

Copilot Studio public website knowledge source returning "No information was found"

Posted by CT-20042235-0 (16)
Hi all,
 
I'm building a Copilot Studio agent for our company's support site that retrieves product documentation PDFs from their public DAM. I've set up a public website knowledge source pointing to: https://www.solidigm.com/content/dam/solidigm/en/site/products/documents/
 
When users ask for a document through a Search and Summarize node, the agent consistently returns "No information was found."
 
The files I'm trying to retrieve are publicly accessible PDFs sitting directly under that path. The knowledge source status shows as "Ready."
 
Has anyone successfully used a public website knowledge source to retrieve PDFs from a DAM-style path like this? Any advice on configuration, crawl behavior, or troubleshooting would be appreciated.
 
  • Vish WR (1,208):
    Are those PDFs linked from pages on the website, or do they exist only in the website content folder?
  • Verified answer
    Sunil Kumar Pashikanti (1,870), Moderator:
    If you have added a public website URL (like /content/dam/.../documents/) as a Knowledge Source, and it shows "Ready" but returns "No information found," you are likely hitting a Crawl Discovery Gap.
     
    The Root Cause: Crawlers are Link-Followers, not File-Explorers
    Copilot Studio’s public website crawler is designed to mimic a human browsing a site. It follows HTML links (<a> tags) to find content.
    • HTML Pages: Easily discoverable via navigation.
    • DAM/Binary Folders: These are "Asset Stores." They usually lack an HTML interface.
    • The Result: The crawler hits your folder URL, sees a blank response (because directory browsing is disabled on the server), and assumes there is nothing to index. It cannot "guess" the filenames of your PDFs.
    How to Fix It (Proven Options)
    Option 1: The "Index Page" (Fastest Low-Code Fix)
    Create a simple HTML landing page (e.g., yoursite.com/support/docs) that contains direct links to every PDF you want indexed.
    Why it works: When the crawler hits this page, it sees the links, follows them, and begins indexing the PDF content.
    Tip: Ensure the links are standard <a href="..."> tags and not hidden behind JavaScript buttons.
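    As a minimal sketch, such an index page only needs plain anchors in the server-rendered HTML. The filenames below are placeholders, not the real DAM contents, and the path is abbreviated as in the original post:

```html
<!-- Minimal crawlable index page (placeholder filenames). -->
<!-- Each PDF is exposed as a static <a href> link, so a -->
<!-- non-JavaScript crawler can discover and follow it. -->
<!DOCTYPE html>
<html lang="en">
<head><title>Product Documentation Index</title></head>
<body>
  <h1>Product Documentation</h1>
  <ul>
    <li><a href="/content/dam/.../documents/example-product-brief.pdf">Product Brief (PDF)</a></li>
    <li><a href="/content/dam/.../documents/example-datasheet.pdf">Datasheet (PDF)</a></li>
  </ul>
</body>
</html>
```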
     
    Option 2: Upload Files Directly
    If your document set is under 500 files and individual files are smaller than 20MB:
    Go to: Knowledge > Add Knowledge > Files.
    Why it works: This bypasses the crawler entirely. Copilot Studio will immediately chunk and index the full text of the PDFs.
     
    Option 3: SharePoint Integration
    If your PDFs are internal or sensitive, move them to a SharePoint Document Library.
    Why it works: Copilot Studio uses the Microsoft Graph API for SharePoint, which performs a direct "file crawl" rather than a "web crawl." It is significantly more reliable for deep directory structures.
     
    Option 4: The XML Sitemap (Advanced)
    If you cannot create a public HTML page, add the direct URLs of every PDF to your site’s sitemap.xml.
    Why it works: The Copilot crawler checks the sitemap to find "deep links" it might have missed during the standard crawl.
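    A sketch of what such sitemap entries could look like, following the sitemaps.org 0.9 schema; the PDF filenames are illustrative, only the base DAM path comes from the original post:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap.xml listing each PDF explicitly (illustrative filenames). -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.solidigm.com/content/dam/solidigm/en/site/products/documents/example-datasheet.pdf</loc>
  </url>
  <url>
    <loc>https://www.solidigm.com/content/dam/solidigm/en/site/products/documents/example-product-brief.pdf</loc>
  </url>
</urlset>
```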
     
    What will NOT work:
    Waiting longer: If it hasn't indexed in 24 hours, it never will because it can't find the path.
    Changing the Prompt: This is a data-source issue, not a language-model issue.
    Adding more sub-folders: More folders only make it harder for a crawler to guess the path.
     
    Bottom Line: A web crawler needs a map (HTML links). If you point it at a "closed" folder, it will report as "Ready" (because the URL works) but index zero documents.
     
    ✅ If this answer helped resolve your issue, please mark it as Accepted so it can help others with the same problem.
    👍 Feel free to Like the post if you found it useful.

    Sunil Kumar Pashikanti, Moderator
    Blog: https://sunilpashikanti.com/posts/
  • CT-20042235-0 (16):
    Yes, the links to the PDFs can be found on our Document Management System page. 

    https://www.solidigm.com/products/document-management-system.html
  • CT-20042235-0 (16):
    Thank you for the detailed response.
     
    I've since confirmed that the PDFs are linked from two places on our site:
    1. The Document Management System page at: https://www.solidigm.com/products/document-management-system.html
    2. Individual product pages across the site:
      1. Example 1: https://www.solidigm.com/products/data-center/d7/ps1010.html
      2. Example 2: https://www.solidigm.com/products/data-center/d7/p5810.html
    I have my agent's knowledge source pointed at www.solidigm.com, but it is still struggling to find these documents.
    To try to improve retrieval, I attempted to narrow the knowledge source specifically to the DMS page in a topic dedicated to document retrieval. However I ran into a couple of issues:
    1. The knowledge source URL field doesn't appear to accept a .html file extension, so I'm unable to point it directly at https://www.solidigm.com/products/document-management-system.html. Our web team is working on setting up a redirect from the extensionless URL to the .html version. Would a redirect work for the crawler, or does it need to hit the final destination URL directly?
    2. Looking at the page source for the DMS page, the document table appears to be powered by the DataTables library. Could this cause an issue where the crawler sees an empty table because the data is loaded dynamically via JavaScript after page load, rather than being server-rendered in the HTML?
     
    When testing the agent with this configuration, the Search and Summarize node returns "No information was found that could help answer this", suggesting the knowledge source is not returning any content despite the knowledge source status showing as "Ready."
     
    For reference I've attached a simplified version of the topic YAML showing the Search and Summarize node pointed at the DMS knowledge source.

    Any guidance would be appreciated.
  • Verified answer
    Sunil Kumar Pashikanti (1,870), Moderator:
    Thanks for the detailed follow‑up; your analysis is correct, and you've essentially identified the root cause already.
    To address your questions point‑by‑point:
    1. Redirects vs .html URLs
    A redirect from the extensionless URL to the .html version should work from a crawler perspective. However, in this case the redirect itself is not the blocker. Even when the crawler reaches the final page successfully, it can only index server‑rendered HTML.

    2. JavaScript / DataTables rendering
    Yes, this is the primary issue.
    Because the document table on the DMS page is populated via the DataTables library after page load, the Copilot Studio website crawler sees only an empty table at crawl time. JavaScript is not executed, so none of the PDF links or metadata are discovered.
    The same constraint applies if those links are injected dynamically on individual product pages.
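    To illustrate the difference, here is a small sketch (standard-library Python only; the HTML snippets and paths are made up) of what a non-JavaScript crawler extracts from server-rendered markup versus a script-populated table:

```python
# Sketch: what a non-JavaScript crawler "sees" on a page.
# It parses only the raw server-rendered HTML, so links injected
# later by scripts such as DataTables never appear to it.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from static <a> tags, as a crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def crawl(html):
    """Return the anchor hrefs present in the raw HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Server-rendered page: the PDF link is present in the raw HTML.
static_page = '<table><tr><td><a href="/docs/spec.pdf">Spec</a></td></tr></table>'

# JS-rendered page: the table is empty until DataTables runs,
# so the raw HTML contains no links at all.
js_page = '<table id="docs"></table><script>/* DataTables fills the table here */</script>'

print(crawl(static_page))  # ['/docs/spec.pdf']
print(crawl(js_page))      # []
```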

    3. Why the knowledge source shows “Ready” but returns no results
    “Ready” indicates that the crawl completed successfully, not that meaningful content was extracted. If the crawler doesn’t see static links or readable text during the initial HTML load, the Search and Summarize node correctly returns “No information was found.”
     
    Recommended fix (confirmed)
    The Index Page approach we outlined is the correct and proven solution:
    1. Create a simple, server‑rendered HTML page that contains direct <a href="..."> links to each PDF you want indexed
    2. Ensure the links are standard <a> elements and not populated via JavaScript
    3. Point the website knowledge source at that page (or allow it to be discovered during the site crawl)
    This works because the crawler can discover the static links, follow them, and then index the PDF content itself.

    Alternative (often more reliable)
    If feasible, adding the PDFs directly as a file‑based knowledge source (or via SharePoint/OneDrive) avoids all crawler visibility issues and generally produces the most consistent results.

    In short: your diagnosis is accurate, and creating a static index page is the right fix for this scenario.
  • CT-20042235-0 (16):
    Thanks again; this thread has been very helpful. I've liked both of your responses and marked them as accepted so far.
     
    I have two related questions to sanity‑check my approach as I move forward:
     
    1) Deprioritizing specific pages within a knowledge source
    My current working assumption is that the right workaround is to dedicate separate website knowledge sources to specific use cases, and then have document‑retrieval topics reference only those sources (for example, product pages + DMS pages, excluding support KAs).
     
    In our case, that pattern depends on being able to point a knowledge source at static entry pages such as:
    https://www.solidigm.com/products/document-management-system (which we expect to redirect to the .html page), so that those sources are discoverable and can be cleanly scoped to document retrieval.
     
    However, assuming that approach isn’t feasible or doesn’t behave as expected, is there any supported way, or recommended best practice, to effectively deprioritize or exclude specific pages (such as explanatory KAs) within a single website knowledge source? Or is knowledge‑source separation the only reliable mechanism today?
     
    2) System instructions vs. Generative Answer instructions
    When a topic uses a Generative Answer / Search and Summarize action with additional instructions, are the agent’s system instructions still fully applied in that context? I want to confirm that system‑level rules (source restrictions, document handling, link requirements, etc.) remain authoritative, and that generative instructions simply layer on top rather than replace them.
     
    Appreciate any guidance. Thank you again for sharing the detailed explanations.
  • Sunil Kumar Pashikanti (1,870), Moderator:
    Great questions, and your assumptions are mostly right.
     
    1) In Copilot Studio today, the only reliable way to control what content is used from a website is by separating it into different knowledge sources. Once a website source is added, there isn’t a supported way to selectively exclude or deprioritize specific URLs within that source.
    Example scenario
    Your website has:
    • Product documentation pages
    • Marketing pages
    • Support FAQs and troubleshooting articles
    You want the copilot to answer product questions only from official docs, not from FAQs or marketing copy.
    What you do
    1. Create Knowledge Source A: docs.company.com/products
    2. Create Knowledge Source B: www.company.com/marketing
    3. Create Knowledge Source C: support.company.com/faqs
    How it helps
    • A "Document Retrieval" topic references only Knowledge Source A
    • Support or FAQ content is never considered, even though it's on the same domain
    • There's no need to fight relevance or worry about the copilot pulling a nearby but wrong page
     
    Why this matters
    There’s no way to say “ignore /faqs” or “deprioritize /support” inside a single website source. Splitting sources is what gives you control.

    2) System instructions should be treated as hard guardrails. Topic‑level instructions and Generative Answer guidance can refine how responses are generated, but they don't override agent‑level rules. If there's ever a conflict, the system instructions take precedence; the conflict is not resolved based on which instruction was defined most recently.
     
    Example 1: System instruction wins
    System instruction
         Only answer using content from approved knowledge sources.
          Do not include external links.
    Topic / Generative Answer instruction
          Summarize the answer and include a link to the full article.
    What happens
    • The answer may be summarized
    • No external link is included, even if the topic asks for one
    • The system rule is enforced
    Example 2: Topic instructions refine behavior (but don’t override)
    System instruction
          Respond in a professional tone and only reference internal documentation.
    Topic instruction
          Explain the steps in simple language for non‑technical users.
    What happens
    • The response stays professional
    • It uses simpler wording
    • It still only cites internal documentation
    This works because the topic instruction stays within the system guardrails.
     
    Example 3: Conflicting instructions don’t “negotiate”
    System instruction
          Never provide pricing information.
    Topic instruction
          Answer the user’s question and include pricing details if available.
    What happens
    • Pricing is omitted or the answer is partially withheld
    • The system rule is not overridden
    • The conflict does not resolve based on which instruction was added later
     
    Simple takeaway
    • Knowledge sources control what content is even eligible
    • System instructions control hard rules
    • Topic and Generative Answer instructions shape how answers are written, not what rules are broken
