published at 2025/10/18

If asked what prompt engineering is, I can readily answer: it's the process of guiding LLMs through natural language to understand your objectives and ultimately generate the outputs you want. I can also quickly mention concepts like few-shot learning and chain-of-thought (CoT). But if you ask about the underlying mechanisms or evaluation methods, I currently have no answers. It simply works when it works, especially after diving deep into research projects and repeatedly designing tests with different prompts. While I've discovered techniques that enhance model performance, I still struggle to grasp the deeper logic behind them.

Shall we think about this in a different way? Prompt engineering itself, like LLMs, is just a tool to use. We can adopt a pragmatic, result-driven approach: these tools are quite abstract, and unlike hammers and axes their usage isn't immediately intuitive, so just use whatever brings better outputs. Still, I wish to build a paradigm of my own that can be reproduced and easily applied to different topics. I don't have the final answer yet, but I want to note down something helpful from the projects I have worked on.

Prompt as the basis

First of all, prompting is very effective and probably the simplest thing anyone can do to get better responses from LLMs. A few sentences are already enough to make a huge difference. A quick and lightweight framework I use looks like this:

Role-play:
<Assign the LLM to adopt a specific persona>

Goal:
<Describe the task or goal you need help with>

Response requirement:
<List the desired output from the LLM, e.g. format, tone, or style>

It works nicely for relatively simple tasks involving information lookup that doesn't rely on up-to-date knowledge. For example, I have the following setting for learning English:

[image: prompt setup for learning English]
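
The actual setting is in the screenshot above; a sketch along the same lines, with wording that is my own reconstruction rather than the exact prompt, would be:

Role-play:
You are an experienced English teacher helping a non-native speaker build vocabulary.

Goal:
Explain any English word or phrase I send you.

Response requirement:
Give the meaning in plain English, two example sentences, and common collocations. Keep it short.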

When I come across a word I don't know, I can quickly get its meaning and see how to use it, with examples.

[image: example response for a word lookup]

One thing we shouldn't ignore is how reliable the response is. Tasks such as English learning are trustworthy because the relevant knowledge is both readily accessible and openly shared, and it is precisely this knowledge that forms the foundation for training LLMs. Imagine what you would do to learn a new word without an LLM: look it up in a dictionary (or several), or just type it into Google Translate. LLMs are good at information retrieval, and they do it far quicker than you and me.

When it comes to complex tasks, more effort is required to tailor the prompt and more things need to be added to it. Few-shot learning and CoT are definitely effective. My understanding is that the logic behind a task might not be obvious to the model, but if you spell it out in natural language, the LLM can follow the path you've laid out to achieve your goal.

Recently I developed a feature that uses an LLM to group and classify OCR text from receipt images into product groups for accounting. Significant improvement came from adding few-shot examples with CoT, plus one more thing I will talk about later. Formulating the CoT is really time-consuming, because I first needed to figure out the logic hidden in the hard-to-read OCR text, which I did by prompting the LLM over several iterations. I learned that keyword matching and numeric consistency checks (verifying the line items add up) were the key steps, then wrote the CoT part myself based on that understanding. Those examples indeed helped a lot, although without proper testing (I lacked evaluation samples) I cannot say performance improved from x to y. I tried it out through A/B testing and, with the help of another tester, confirmed the improvement.
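
To make this concrete, here is a minimal sketch of what a few-shot plus CoT prompt for this kind of grouping could look like. The categories, the example receipt lines, and the model name are placeholders of my own, not the actual prompt or setup from the project:

```python
# Minimal sketch: few-shot + CoT prompt for grouping OCR receipt lines.
# Uses the OpenAI Python SDK; the model name, categories, and example
# below are illustrative placeholders, not the project's real prompt.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You group noisy OCR lines from a receipt into product groups "
    "(e.g. GROCERY, HOUSEHOLD, OTHER) for accounting. "
    "Reason step by step: match keywords first, then check that the item "
    "prices sum to the receipt total, then output one JSON object per line."
)

# One few-shot example with the reasoning (CoT) written out explicitly.
EXAMPLE = """OCR lines:
MILK 1L        1.29
DISH SOAP      2.49
TOTAL          3.78

Reasoning:
- "MILK" matches a grocery keyword -> GROCERY.
- "DISH SOAP" matches a cleaning keyword -> HOUSEHOLD.
- Consistency check: 1.29 + 2.49 = 3.78, which equals TOTAL, so no line was missed.

Answer:
[{"line": "MILK 1L", "group": "GROCERY", "price": 1.29},
 {"line": "DISH SOAP", "group": "HOUSEHOLD", "price": 2.49}]
"""


def classify(ocr_text: str) -> str:
    """Send the few-shot example plus the new receipt and return the model's answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": EXAMPLE},
            {"role": "user", "content": f"OCR lines:\n{ocr_text}\n\nReasoning:"},
        ],
    )
    return resp.choices[0].message.content


print(classify("BREAD          2.10\nTOTAL          2.10"))
```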

One more thing that makes a huge impact is a workflow: the steps to follow to solve the problem, kind of like a high-level CoT where a reasoning framework is provided for executing the CoT. I was really surprised after switching to Beast Mode in GitHub Copilot, and the mind-blowing part is the workflow. Since then this weapon has been in my arsenal and it has never let me down. It is the next level of CoT prompting and the improvement is obvious.
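
To give a feel for what I mean by a workflow, here is a rough sketch of the kind of structure such a prompt adds. These steps are my own summary of the idea, not the actual Beast Mode prompt:

Workflow:
1. Restate the task and list what is still unknown.
2. Gather context first: read the relevant files and fetch up-to-date docs via web search if needed.
3. Write a step-by-step plan before touching any code.
4. Execute the plan one step at a time, verifying each step before moving on.
5. Review the result against the original task and summarize what changed and what is left open.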

From prompt to context

After reading this article and the paper it mentions, I realized how quickly the application side of LLMs has evolved. The term context engineering perfectly articulates something I could sense but couldn't put into words. In the end an LLM is "guessing" things, and we can improve those guesses by providing more relevant and precise information. Looking back at the workflow from Beast Mode, we can see that it pushes the LLM to use the fetch tool to do web searches, from which more up-to-date information can be retrieved, instead of relying only on internal world knowledge from outdated training data.

Then the problem switches to how to provide context effectively. Up-to-date information is very important; an LLM without web search functionality is pretty much useless for current topics. Another aspect is relevance. When experimenting with AGENTS.md for my projects, the more details I add, such as which dependency manager or which React component library to use (I sketch an example below), the better the output I get later. Sections such as development guidelines can also make a big impact.

Tool calling and MCP servers are good ways to provide context dynamically, and luckily more and more official MCP servers are being released. Going further in this direction, if we could build up a knowledge base with useful content and clear categories, we could expect better output from LLMs (there are already products on the market doing this). Therefore, RAG-related systems can be really helpful if token consumption is not a big deal. To digress a bit, token consumption is something I've started thinking about recently, especially after one Codex run burned through 800k+ tokens within a few minutes while producing garbage. I will write more about that later.
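
Back to AGENTS.md: as an illustration of the level of detail I mean, a fragment could look like the following. The tool choices here are just examples, not recommendations:

Project setup:
- Use pnpm as the dependency manager; do not use npm or yarn.
- Build UI with the existing React component library (MUI in this example); do not hand-roll new primitives.

Development guidelines:
- Follow the existing folder structure under src/.
- Run the linter and the test suite before marking a task as done.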

One more thing is memory management. At least for now, when running AI tools for code generation, I haven't relied on this much. I always open a new tab for each task and prefer that the previous ones be forgotten. If I do see valuable stuff, I update the docs or instructions instead. For collaboration tools, though, this functionality should be useful, since nobody appreciates the hassle of keeping docs updated by hand.

What’s next