-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Implement two qubit unitary to operations algorithm from https://arxiv.org/abs/quant-ph/0406176 #6777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hi @NoureldinYosri, can I take this up too since I am trying to find a solution for #6770? |
Cirq Cynq: This will improve the performance by reducing the number of rotations and depth of the circuit. |
@RahilJain1366, thanks for offering, let's discuss after #6770 is complete |
Hi, I’m also interested in working on this if @RahilJain1366 hasn’t taken it up yet @senecameeks . |
I implemented the A.2 case. However, when comparing unoptimized circuit depth for the current Cirq package with my change, I didn't observe any improvement:
To obtain these numbers I generated a random unitary for the given number of qubits and then ran Shannon decomposition. Note that these numbers are the same as those here. The reason for this is that the 3-qubit case already uses the A.2 optimization. The crux of A.2 was already implemented by @balopat in this method. Removing the 3-qubit base case and using A.2 for However, when running the optimizer:
which is better than what was reported here. Furthermore, I discovered that quantum Shannon decomposition is very slow. I traced this issue back to
You can see my code here. To summarize:
@NoureldinYosri @ACE07-Sev can you please have a look and let me know your thoughts? |
I implemented A.2 a few months ago for I'll have a look soon but in the meantime kindly check yours with mine. |
@ACE07-Sev as I explain in the comment A.2 is already implemented here, but it's not obvious, because it's part of the If you have a colab or script to share, I'm happy to run it. |
@codrut3 I'm comparing with mine now. Can you run yours with transpile afterwards please? I want to compare the depths with mine. Here are my results:
This is the depth of a circuit with U3 and CX gates as done before (I don't have my old script anymore, but that one was the same). |
Are these Qiskit numbers? The Cirq numbers are:
Would you consider this to be close enough? |
Yes, those are the qiskit numbers (after transpile). Did you run transpile on yours as well? |
Yes, this is after the transpiler (I call it optimizer :)). Basically, I think the transpiler does the heavy work and so further improvements would be realized by improving the transpiler. |
I see. I mean, for small number of qubits sure, but when you go above 10 it's gonna make a big difference. Especially with practical use-cases given the deeper the circuit is the noisier it's going to get. With qubits as large as 20-30, it'll make a big difference down the line. QSD is already exponential, so we really can use any trimming we can get. |
I'll put this on my TODO list to properly go through the codebase ASAP. Right now my laptop is kaput and freezes every few minutes so can't properly go through it. |
Maybe. But A.2 doesn't help here, because it's already part of the 3-qubit base case. I did a sanity check that the number of CZ gates decreases in my implementation, so I did manage to apply A.2 lol. It's just that the big difference is made by the transpiler. I'm not sure what to do there. |
I'll run it without transpile right now. You do the same please. We can then compare. If they match then LGTM, but otherwise we could really use that small improvement. P.S : I get you on the laptop being slow hehe. God knows I have had code that ran for a week on my laptop. I would definitely try either using multi-threading to make the loops for each qubit to run in parallel (you'd have to use 3.14 with nogil to do this properly) or to get a CPU with really fast single-core speed. That works wonders. I'm not sure if you're a community member like moi or you work at cirq, but if the latter, definitely try to run the code on sth like i7 14700k or above (the office should have a CPU like that). It can cut the runtime in 4 usually and make it easier to work on. |
By the way, before you start optimizing, please run a profiler to see what is taking the most amount of time. cProfiler is great for that (let me know if you'd like some assistance with this, I love profiling). You can even make a call graph png with gprof2dot to visualize it and make it easier to read (it also makes it easier to catch redundancies based on number of calls to a function/method). Also, keep in mind python is not the best language for runtime speed. Hence why some packages have been porting the heavy-duty stuff to rust and only use python as the interface. |
My numbers are:
Thank you! Using multi-threading: doesn't this require writing the shannon decomposition to be multi-threaded?
I work in a different part of Google. I contribute to cirq in my free time on my personal laptop. For now I'll try to stick to this :)
I used a profiler and it did help to narrow down but I'm still not sure of the root cause. I wrote an optimization for |
Should be fine then. I checked the older messages and I had said 70k there without transpile so I guess it could be transpile bringing it to 60k. I'll have to properly check this to make sure. I think one way for you to be sure (I tried, I get frozen so can't run it myself, looking to get a new hardware soon to not be so useless like now) is to convert the cirq circuit you're getting before optimization to qiskit via qasm or via my package if you prefer that and then call transpile on both and then compare. That way if they don't match then you know for a fact that the decomposition are not the same. That should remove the effect of transpile from the comparison. |
No no what I meant by multiprocessing is to just use it for when you were generating the circuit depths for a list of qubit sizes. Basically since each iteration is independent then you can run them in parallel to save time. Like instead of generating circuit for 1 qubit then 2 qubit then 3 qubit and so on you would instead generate all of them in parallel since they are independent tasks. That way it would take as long as the longest iteration. It would save considerable time. I wouldn't recommend making cirq itself use multiprocessing given it's a very user-specific thing (some have very few cores or very slow clock speed so for them it would be slower than no multiprocessing due to the overhead of the communication between the threads). |
Oh yeah gprof2dot is amazing. It color codes the call graph from green to red where red is sth that takes the most time. It shows exactly how many times sth is being called and how expensive it is. It's very useful for packages this big since you can visually just see what's red and then trail it from there and see what is in your control (I.e. you can't optimize performance of your dependencies) and optimize that. To save time, just make a constant function and put the synthesis or whatever you want to profile there. Then call Then you can use gprof2dot to generate a png from this .prof file. I closed the laptop and went to bed, I am really sorry I didn't write proper snippets and exact command. But that's the overall steps. |
thanks @codrut3 and @ACE07-Sev for the analysis and discussion ... looking forward to your final verdict @ACE07-Sev your optimized @ACE07-Sev and @codrut3 I don't know if you are aware but we have a biweekly cirq-sync meeting, if you want to attend you can join the cirq-dev group https://groups.google.com/g/cirq-dev which will send you an invite to the meeting. |
Thanks @NoureldinYosri ! I'll come to the next Cirq Sync to discuss this. @ACE07-Sev would you be able to attend too? It's on Wednesday 28th May. |
So so sorry for the delay. I had some court issues and forgot about this. Sure, I'll try to be there. |
@codrut3 Did you get a chance to run the experiment I asked? That's the main thing I would like to see to finally provide a meaningful comparison. Would be great to have that before the biweekly meeting on 28th. |
@NoureldinYosri You are too kind. I have been a bit useless with my laptop being glitchy. I have put this issue on my TODO if it's still open by the time I get a new hardware. |
Thank you for the suggestion! I did this. More precisely, I ran the Cirq Shannon decomposition, converted to qasm and then ran Qiskit transpiler with basis gates u3 and cz. Then I ran Cirq transpiler with the same target gateset. Both transpilers give the same depth! For the current Cirq package:
With my change that implements A.2:
Here there is a One thing to note is that the Cirq transpiler creates more u3 gates (but this doesn't change the depth). The cz gate numbers after the transpiler match. Do I interpret this correctly that it's not the transpiler, but the Cirq Shannon decomposition that still has some inefficiency that causes higher depth? But it can't be A.2, because I implemented it and it doesn't change things. |
I found the problem :) I had to implement A.1 too. After doing this, the circuit depth after transpiler matches that of Qiskit exactly. @ACE07-Sev thank you for pushing to figure this out! I'll send a PR soon. |
The paper https://arxiv.org/abs/quant-ph/0406176 introduces an algorithm for performing quantum shannon decomposition. We have this algorithm implemented in Cirq in
Cirq/cirq-core/cirq/transformers/analytical_decompositions/quantum_shannon_decomposition.py
Line 44 in 351a08e
However, one of the base cases of the algorithms is not implemented. Namely A.2 the base case for a two qubit unitary, instead cirq uses another decomposition
Cirq/cirq-core/cirq/transformers/analytical_decompositions/quantum_shannon_decomposition.py
Lines 101 to 113 in 351a08e
Which is not as efficient as the decompositin described in A.2. It would be nice to implement A.2 and uses it in QSD.
What is the urgency from your perspective for this issue? Is it blocking important work?
P2 - we should do it in the next couple of quarters
The text was updated successfully, but these errors were encountered: