Power Apps
Answered

Copilot Studio public website knowledge source returning "No information was found"

Posted by CT-20042235-0 (16)
Hi all,
 
I'm building a Copilot Studio agent for our company's support site that retrieves product documentation PDFs from their public DAM. I've set up a public website knowledge source pointing to: https://www.solidigm.com/content/dam/solidigm/en/site/products/documents/
 
When users ask for a document through a Search and Summarize node, the agent consistently returns "No information was found."
 
The files I'm trying to retrieve are publicly accessible PDFs sitting directly under that path. The knowledge source status shows as "Ready."
 
Has anyone successfully used a public website knowledge source to retrieve PDFs from a DAM-style path like this? Any advice on configuration, crawl behavior, or troubleshooting would be appreciated.
 
  • Vish WR (1,208):
    Are those PDFs linked from pages on the website, or do they exist only in the website content folder?
  • Verified answer
    Sunil Kumar Pashikanti (1,870), Moderator:
    If you have added a public website URL (like /content/dam/.../documents/) as a Knowledge Source, and it shows "Ready" but returns "No information found," you are likely hitting a Crawl Discovery Gap.
     
    The Root Cause: Crawlers are Link-Followers, not File-Explorers
    Copilot Studio’s public website crawler is designed to mimic a human browsing a site. It follows HTML links (<a> tags) to find content.
    • HTML Pages: Easily discoverable via navigation.
    • DAM/Binary Folders: These are "Asset Stores." They usually lack an HTML interface.
    • The Result: The crawler hits your folder URL, sees a blank response (because directory browsing is disabled on the server), and assumes there is nothing to index. It cannot "guess" the filenames of your PDFs.
    How to Fix It (Proven Options)
    Option 1: The "Index Page" (Fastest Low-Code Fix)
    Create a simple HTML landing page (e.g., yoursite.com/support/docs) that contains direct links to every PDF you want indexed.
    Why it works: When the crawler hits this page, it sees the links, follows them, and begins indexing the PDF content.
    Tip: Ensure the links are standard <a href="..."> tags and not hidden behind JavaScript buttons.
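    As a minimal sketch, such an index page only needs plain anchors in the server-rendered HTML. The filenames below are placeholders, not the real DAM contents, and the path is abbreviated as in the original post:

```html
<!-- Minimal crawlable index page (placeholder filenames). -->
<!-- Each PDF is exposed as a static <a href> link, so a -->
<!-- non-JavaScript crawler can discover and follow it. -->
<!DOCTYPE html>
<html lang="en">
<head><title>Product Documentation Index</title></head>
<body>
  <h1>Product Documentation</h1>
  <ul>
    <li><a href="/content/dam/.../documents/example-product-brief.pdf">Product Brief (PDF)</a></li>
    <li><a href="/content/dam/.../documents/example-datasheet.pdf">Datasheet (PDF)</a></li>
  </ul>
</body>
</html>
```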
     
    Option 2: Upload Files Directly
    If your document set is under 500 files and individual files are smaller than 20MB:
    Go to: Knowledge > Add Knowledge > Files.
    Why it works: This bypasses the crawler entirely. Copilot Studio will immediately chunk and index the full text of the PDFs.
     
    Option 3: SharePoint Integration
    If your PDFs are internal or sensitive, move them to a SharePoint Document Library.
    Why it works: Copilot Studio uses the Microsoft Graph API for SharePoint, which performs a direct "file crawl" rather than a "web crawl." It is significantly more reliable for deep directory structures.
     
    Option 4: The XML Sitemap (Advanced)
    If you cannot create a public HTML page, add the direct URLs of every PDF to your site’s sitemap.xml.
    Why it works: The Copilot crawler checks the sitemap to find "deep links" it might have missed during the standard crawl.
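    A sketch of what such sitemap entries could look like, following the sitemaps.org 0.9 schema; the PDF filenames are illustrative, only the base DAM path comes from the original post:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap.xml listing each PDF explicitly (illustrative filenames). -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.solidigm.com/content/dam/solidigm/en/site/products/documents/example-datasheet.pdf</loc>
  </url>
  <url>
    <loc>https://www.solidigm.com/content/dam/solidigm/en/site/products/documents/example-product-brief.pdf</loc>
  </url>
</urlset>
```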
     
    What will NOT work:
    Waiting longer: If it hasn't indexed in 24 hours, it never will because it can't find the path.
    Changing the Prompt: This is a data-source issue, not a language-model issue.
    Adding more sub-folders: More folders only make it harder for a crawler to guess the path.
     
    Bottom Line: A web crawler needs a map (HTML links). If you point it at a "closed" folder, it will report as "Ready" (because the URL works) but index zero documents.
     
    ✅ If this answer helped resolve your issue, please mark it as Accepted so it can help others with the same problem.
    👍 Feel free to Like the post if you found it useful.

    Sunil Kumar Pashikanti, Moderator
    Blog: https://sunilpashikanti.com/posts/
  • CT-20042235-0 (16):
    Yes, the links to the PDFs can be found on our Document Management System page. 

    https://www.solidigm.com/products/document-management-system.html
  • CT-20042235-0 (16):
    Thank you for the detailed response.
     
    I've since confirmed that the PDFs are linked from two places on our site:
    1. The Document Management System page at: https://www.solidigm.com/products/document-management-system.html
    2. Individual product pages across the site:
      1. Example 1: https://www.solidigm.com/products/data-center/d7/ps1010.html
      2. Example 2: https://www.solidigm.com/products/data-center/d7/p5810.html
    I have my agent's knowledge source pointed at www.solidigm.com, but it is still struggling to find these documents.
    To try to improve retrieval, I attempted to narrow the knowledge source specifically to the DMS page in a topic dedicated to document retrieval. However I ran into a couple of issues:
    1. The knowledge source URL field doesn't appear to accept a .html file extension, so I'm unable to point it directly at https://www.solidigm.com/products/document-management-system.html. Our web team is working on setting up a redirect from the extensionless URL to the .html version. Would a redirect work for the crawler, or does it need to hit the final destination URL directly?
    2. Looking at the page source for the DMS page, the document table appears to be powered by the DataTables library. Could this cause an issue where the crawler sees an empty table because the data is loaded dynamically via JavaScript after page load, rather than being server-rendered in the HTML?
     
    When testing the agent with this configuration, the Search and Summarize node returns "No information was found that could help answer this", suggesting the knowledge source is not returning any content despite the knowledge source status showing as "Ready."
     
    For reference I've attached a simplified version of the topic YAML showing the Search and Summarize node pointed at the DMS knowledge source.

    Any guidance would be appreciated.
  • Verified answer
    Sunil Kumar Pashikanti (1,870), Moderator:
    Thanks for the detailed follow‑up; your analysis is correct, and you've essentially identified the root cause already.
    To address your questions point‑by‑point:
    1. Redirects vs .html URLs
    A redirect from the extensionless URL to the .html version should work from a crawler perspective. However, in this case the redirect itself is not the blocker. Even when the crawler reaches the final page successfully, it can only index server‑rendered HTML.

    2. JavaScript / DataTables rendering
    Yes, this is the primary issue.
    Because the document table on the DMS page is populated via the DataTables library after page load, the Copilot Studio website crawler sees only an empty table at crawl time. JavaScript is not executed, so none of the PDF links or metadata are discovered.
    The same constraint applies if those links are injected dynamically on individual product pages.
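    To illustrate the difference, here is a small sketch (standard-library Python only; the HTML snippets and paths are made up) of what a non-JavaScript crawler extracts from server-rendered markup versus a script-populated table:

```python
# Sketch: what a non-JavaScript crawler "sees" on a page.
# It parses only the raw server-rendered HTML, so links injected
# later by scripts such as DataTables never appear to it.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from static <a> tags, as a crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def crawl(html):
    """Return the anchor hrefs present in the raw HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Server-rendered page: the PDF link is present in the raw HTML.
static_page = '<table><tr><td><a href="/docs/spec.pdf">Spec</a></td></tr></table>'

# JS-rendered page: the table is empty until DataTables runs,
# so the raw HTML contains no links at all.
js_page = '<table id="docs"></table><script>/* DataTables fills the table here */</script>'

print(crawl(static_page))  # ['/docs/spec.pdf']
print(crawl(js_page))      # []
```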

    3. Why the knowledge source shows “Ready” but returns no results
    “Ready” indicates that the crawl completed successfully, not that meaningful content was extracted. If the crawler doesn’t see static links or readable text during the initial HTML load, the Search and Summarize node correctly returns “No information was found.”
     
    Recommended fix (confirmed)
    The Index Page approach we outlined is the correct and proven solution:
    1. Create a simple, server‑rendered HTML page that contains direct <a href="..."> links to each PDF you want indexed
    2. Ensure the links are standard <a> elements and not populated via JavaScript
    3. Point the website knowledge source at that page (or allow it to be discovered during the site crawl)
    This works because the crawler can discover the static links, follow them, and then index the PDF content itself.

    Alternative (often more reliable)
    If feasible, adding the PDFs directly as a file‑based knowledge source (or via SharePoint/OneDrive) avoids all crawler visibility issues and generally produces the most consistent results.

    In short: your diagnosis is accurate, and creating a static index page is the right fix for this scenario.
  • CT-20042235-0 (16):
    Thanks again; this thread has been very helpful. I've liked both of your responses and marked them as accepted so far.
     
    I have two related questions to sanity‑check my approach as I move forward:
     
    1) Deprioritizing specific pages within a knowledge source
    My current working assumption is that the right workaround is to dedicate separate website knowledge sources to specific use cases, and then have document‑retrieval topics reference only those sources (for example, product pages + DMS pages, excluding support KAs).
     
    In our case, that pattern depends on being able to point a knowledge source at static entry pages such as:
    https://www.solidigm.com/products/document-management-system (which we expect to redirect to the .html page), so that those sources are discoverable and can be cleanly scoped to document retrieval.
     
    However, assuming that approach isn’t feasible or doesn’t behave as expected, is there any supported way, or recommended best practice, to effectively deprioritize or exclude specific pages (such as explanatory KAs) within a single website knowledge source? Or is knowledge‑source separation the only reliable mechanism today?
     
    2) System instructions vs. Generative Answer instructions
    When a topic uses a Generative Answer / Search and Summarize action with additional instructions, are the agent’s system instructions still fully applied in that context? I want to confirm that system‑level rules (source restrictions, document handling, link requirements, etc.) remain authoritative, and that generative instructions simply layer on top rather than replace them.
     
    Appreciate any guidance. Thank you again for sharing the detailed explanations.
  • Sunil Kumar Pashikanti (1,870), Moderator:
    Great questions, and your assumptions are mostly right.
     
    1) In Copilot Studio today, the only reliable way to control what content is used from a website is by separating it into different knowledge sources. Once a website source is added, there isn’t a supported way to selectively exclude or deprioritize specific URLs within that source.
    Example scenario
    Your website has:
    • Product documentation pages
    • Marketing pages
    • Support FAQs and troubleshooting articles
    You want the copilot to answer product questions only from official docs, not from FAQs or marketing copy.
    What you do
    1. Create Knowledge Source A: docs.company.com/products
    2. Create Knowledge Source B: www.company.com/marketing
    3. Create Knowledge Source C: support.company.com/faqs
    How it helps
    • A "Document Retrieval" topic references only Knowledge Source A
    • Support or FAQ content is never considered, even though it's on the same domain
    • There's no need to fight relevance or worry about the copilot pulling a nearby but wrong page
     
    Why this matters
    There’s no way to say “ignore /faqs” or “deprioritize /support” inside a single website source. Splitting sources is what gives you control.

    2) System instructions should be treated as hard guardrails. Topic‑level instructions and Generative Answer guidance can refine how responses are generated, but they don't override agent‑level rules. If there's ever a conflict, the system instructions take precedence; the conflict is not resolved based on which instruction was defined most recently.
     
    Example 1: System instruction wins
    System instruction
         Only answer using content from approved knowledge sources.
          Do not include external links.
    Topic / Generative Answer instruction
          Summarize the answer and include a link to the full article.
    What happens
    • The answer may be summarized
    • No external link is included, even if the topic asks for one
    • The system rule is enforced
    Example 2: Topic instructions refine behavior (but don’t override)
    System instruction
          Respond in a professional tone and only reference internal documentation.
    Topic instruction
          Explain the steps in simple language for non‑technical users.
    What happens
    • The response stays professional
    • It uses simpler wording
    • It still only cites internal documentation
    This works because the topic instruction stays within the system guardrails.
     
    Example 3: Conflicting instructions don’t “negotiate”
    System instruction
          Never provide pricing information.
    Topic instruction
          Answer the user’s question and include pricing details if available.
    What happens
    • Pricing is omitted or the answer is partially withheld
    • The system rule is not overridden
    • The conflict does not resolve based on which instruction was added later
     
    Simple takeaway
    • Knowledge sources control what content is even eligible
    • System instructions control hard rules
    • Topic and Generative Answer instructions shape how answers are written, not what rules are broken
